Document Extraction Test

This test procedure validates the DocumentManager's ability to extract content from files using AI-powered analysis.

Files Created

test_document_extraction.py - Main test script
test_sample_document.txt - Sample document for testing
run_document_test.ps1 - PowerShell wrapper script
test_document_extraction.log - Generated log file (cleared on each run)

Usage

Method 1: Using PowerShell Script (Recommended)

# Test with default sample file
.\run_document_test.ps1

# Test with custom file
.\run_document_test.ps1 "path\to\your\document.pdf"

Method 2: Direct Python Execution

# Test with the bundled sample file
python test_document_extraction.py test_sample_document.txt

# Test with custom file
python test_document_extraction.py "path/to/your/document.docx"

Test Features

File Validation: Checks if the specified file exists
MIME Type Detection: Automatically detects file type based on extension
Content Extraction: Uses the DocumentManager to extract content
AI Processing: Applies the prompt "summarize the content and give list of the major topics"
Comprehensive Logging: Logs all steps and results to test_document_extraction.log
Log Cleanup: Clears the log file on each test run

Supported File Types

Text files (.txt, .md)
CSV files (.csv)
JSON files (.json)
XML files (.xml)
HTML files (.html, .htm)
Images (.jpg, .jpeg, .png, .gif, .svg)
PDF files (.pdf)
Office documents (.docx, .xlsx, .pptx)
Other file types (processed as binary data as a fallback)
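
Extension-based detection with a binary fallback can be sketched with the standard library alone. This is an illustration, not the test script's actual code: the function name detect_mime_type and the fallback value application/octet-stream are assumptions.

```python
import mimetypes

# Assumed fallback for unrecognised extensions (generic binary data).
FALLBACK_MIME_TYPE = "application/octet-stream"


def detect_mime_type(file_path):
    """Guess a MIME type from the file extension only (hypothetical helper)."""
    mime_type, _encoding = mimetypes.guess_type(file_path)
    # guess_type returns None when the extension is unknown.
    return mime_type or FALLBACK_MIME_TYPE


if __name__ == "__main__":
    for name in ("test_sample_document.txt", "report.pdf", "blob.unknownext"):
        print(name, "->", detect_mime_type(name))
```

Because the guess relies on the extension alone, a misnamed file is reported under the wrong type; the file contents are never inspected here.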

Test Output

The test generates detailed logs including:

File information (path, size, MIME type)
Extraction process details
Extracted content summary
AI-processed results
Error details if any issues occur

Example Output

=== STARTING DOCUMENT EXTRACTION TEST ===
File information: {
  "file_path": "test_sample_document.txt",
  "filename": "test_sample_document.txt",
  "mime_type": "text/plain",
  "file_size_bytes": 2048,
  "file_size_mb": 0.0
}
Document extraction completed successfully: {
  "extracted_content_id": "test-doc-1234567890",
  "content_items_count": 1,
  "object_type": "ExtractedContent"
}
COMPLETE EXTRACTED CONTENT: {
  "total_length": 1500,
  "content": "PowerOn System Architecture Overview... [AI processed summary]"
}
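
The file information block above could be assembled roughly as follows. This is a sketch under assumptions: the helper name build_file_info is hypothetical, and it assumes 1 MB = 1024 * 1024 bytes with rounding to two decimals, which would explain why a 2048-byte file reports a file_size_mb of 0.0.

```python
import json
import mimetypes
from pathlib import Path


def build_file_info(file_path):
    """Collect the fields shown in the example log (hypothetical helper)."""
    path = Path(file_path)
    size_bytes = path.stat().st_size
    mime_type, _encoding = mimetypes.guess_type(path.name)
    return {
        "file_path": str(file_path),
        "filename": path.name,
        # Assumed fallback for unknown extensions.
        "mime_type": mime_type or "application/octet-stream",
        "file_size_bytes": size_bytes,
        # Assumption: binary megabytes, rounded to two decimals.
        "file_size_mb": round(size_bytes / (1024 * 1024), 2),
    }


if __name__ == "__main__":
    print(json.dumps(build_file_info(__file__), indent=2))
```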

Error Handling

The test includes comprehensive error handling for:

File not found errors
File reading errors
Document processing errors
AI processing errors
Import errors

All errors are logged with detailed information for debugging.
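
One way to structure this kind of staged error handling is sketched below. It is not the test script's actual code: run_extraction_test is a hypothetical name, and the extract argument is a placeholder callable standing in for the DocumentManager call, whose real signature is not documented here.

```python
import logging
from pathlib import Path

logger = logging.getLogger("test_document_extraction")


def run_extraction_test(file_path, extract):
    """Return True on success, False after logging any handled error.

    `extract` is a hypothetical callable taking (data, filename).
    """
    path = Path(file_path)
    if not path.is_file():
        logger.error("File not found: %s", path)
        return False
    try:
        data = path.read_bytes()
    except OSError:
        # logger.exception records the traceback alongside the message.
        logger.exception("Failed to read file: %s", path)
        return False
    try:
        result = extract(data, path.name)
    except Exception:
        logger.exception("Document processing failed for %s", path)
        return False
    logger.info("Extraction succeeded: %r", result)
    return True
```

Keeping each stage in its own try block means the log states which step failed, rather than reporting one generic error for the whole run.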

Configuration

The test uses the same configuration as other tests:

Environment variable: POWERON_CONFIG_FILE = 'test_config.ini'
Log file: test_document_extraction.log
Log level: DEBUG
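
The settings above could be applied in Python roughly like this. It is a minimal sketch, assuming the environment variable is set before the PowerOn modules are imported and that the log is cleared by opening it in write mode; the function name configure_logging and the log format are illustrative choices.

```python
import logging
import os

# Assumption: this must run before the PowerOn modules read their config.
os.environ["POWERON_CONFIG_FILE"] = "test_config.ini"


def configure_logging(log_path="test_document_extraction.log"):
    """Return a DEBUG-level logger writing to a freshly truncated file."""
    logger = logging.getLogger("test_document_extraction")
    logger.setLevel(logging.DEBUG)
    # Remove handlers left over from a previous call to avoid duplicate lines.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    # mode="w" truncates the file, so each run starts with an empty log.
    file_handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(file_handler)
    return logger
```

Calling configure_logging() once at the start of the test is enough; later log calls append to the same file for the rest of that run.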

Dependencies

The test requires the same dependencies as the main PowerOn system:

Python 3.8+
Required Python packages (see requirements.txt)
Access to AI services (if AI processing is enabled)
Proper configuration in test_config.ini
