Document Extraction Test
This test procedure validates the DocumentManager's ability to extract content from files using AI-powered analysis.
Files Created
- test_document_extraction.py: Main test script
- test_sample_document.txt: Sample document for testing
- run_document_test.ps1: PowerShell wrapper script
- test_document_extraction.log: Generated log file (cleared on each run)
Usage
Method 1: Using PowerShell Script (Recommended)
# Test with default sample file
.\run_document_test.ps1
# Test with custom file
.\run_document_test.ps1 "path\to\your\document.pdf"
Method 2: Direct Python Execution
# Test with default sample file
python test_document_extraction.py test_sample_document.txt
# Test with custom file
python test_document_extraction.py "path/to/your/document.docx"
Test Features
- File Validation: Checks if the specified file exists
- MIME Type Detection: Automatically detects file type based on extension
- Content Extraction: Uses the DocumentManager to extract content
- AI Processing: Applies the prompt "summarize the content and give list of the major topics"
- Comprehensive Logging: Logs all steps and results to test_document_extraction.log
- Log Cleanup: Clears the log file on each test run
Supported File Types
- Text files (.txt, .md)
- CSV files (.csv)
- JSON files (.json)
- XML files (.xml)
- HTML files (.html, .htm)
- Images (.jpg, .jpeg, .png, .gif, .svg)
- PDF files (.pdf)
- Office documents (.docx, .xlsx, .pptx)
- Any other file type (falls back to binary processing)
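The extension-based detection and binary fallback listed above could be sketched as follows. This is a minimal illustration using Python's standard mimetypes module, not the test script's actual code; the function name detect_mime_type and the choice of application/octet-stream as the fallback type are assumptions.

```python
import mimetypes
from pathlib import Path

# Assumed fallback for extensions the standard library does not recognize.
FALLBACK_MIME_TYPE = "application/octet-stream"


def detect_mime_type(file_path: str) -> str:
    """Guess a MIME type from the file extension (hypothetical helper).

    Only the file name is inspected; the file content is never read.
    """
    mime_type, _encoding = mimetypes.guess_type(Path(file_path).name)
    # guess_type returns None for unknown extensions, so fall back to binary.
    return mime_type or FALLBACK_MIME_TYPE
```

Because detection relies on the extension alone, a renamed file (for example a PDF saved as .txt) would be reported with the wrong type.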
Test Output
The test generates detailed logs including:
- File information (path, size, MIME type)
- Extraction process details
- Extracted content summary
- AI-processed results
- Error details if any issues occur
Example Output
=== STARTING DOCUMENT EXTRACTION TEST ===
File information: {
"file_path": "test_sample_document.txt",
"filename": "test_sample_document.txt",
"mime_type": "text/plain",
"file_size_bytes": 2048,
"file_size_mb": 0.0
}
Document extraction completed successfully: {
"extracted_content_id": "test-doc-1234567890",
"content_items_count": 1,
"object_type": "ExtractedContent"
}
COMPLETE EXTRACTED CONTENT: {
"total_length": 1500,
"content": "PowerOn System Architecture Overview... [AI processed summary]"
}
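For readers who want to reproduce the "File information" record shown above, a record with the same keys can be assembled from the standard library alone. The sketch below is an assumption about how such a record might be built, not a copy of the test script; build_file_info is a hypothetical name, and rounding the megabyte value to two decimals is inferred from the 0.0 shown for a 2048-byte file.

```python
import mimetypes
import os


def build_file_info(file_path: str) -> dict:
    """Collect path, name, MIME type, and size for a file (hypothetical helper)."""
    size_bytes = os.path.getsize(file_path)  # raises OSError if the file is missing
    mime_type, _encoding = mimetypes.guess_type(file_path)
    return {
        "file_path": file_path,
        "filename": os.path.basename(file_path),
        "mime_type": mime_type or "application/octet-stream",
        "file_size_bytes": size_bytes,
        # 2048 bytes is about 0.002 MB, which rounds to 0.0 at two decimals.
        "file_size_mb": round(size_bytes / (1024 * 1024), 2),
    }
```

The resulting dictionary can be passed to json.dumps before logging to get a JSON record with the same keys.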
Error Handling
The test includes comprehensive error handling for:
- File not found errors
- File reading errors
- Document processing errors
- AI processing errors
- Import errors
All errors are logged with detailed information for debugging.
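One way to structure the first three error categories is sketched below. The DocumentManager interface is not described in this README, so the extraction step is passed in as a caller-supplied function; run_extraction, the extract parameter, and the logger name are all hypothetical.

```python
import logging
from pathlib import Path
from typing import Callable

# Hypothetical logger name; the real script may use a different one.
logger = logging.getLogger("test_document_extraction")


def run_extraction(file_path: str, extract: Callable[[bytes], str]) -> bool:
    """Run one extraction and log any failure; return True on success."""
    path = Path(file_path)
    if not path.is_file():
        logger.error("File not found: %s", path)
        return False
    try:
        data = path.read_bytes()
    except OSError:
        # logger.exception records the traceback along with the message.
        logger.exception("Could not read file: %s", path)
        return False
    try:
        result = extract(data)  # stand-in for the DocumentManager call
    except Exception:
        logger.exception("Document processing failed for %s", path)
        return False
    logger.info("Extraction succeeded (%d characters)", len(result))
    return True
```

Returning a boolean instead of re-raising lets a wrapper script decide on the exit code while the log keeps the full traceback.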
Configuration
The test uses the same configuration as other tests:
- Environment variable: POWERON_CONFIG_FILE = 'test_config.ini'
- Log file: test_document_extraction.log
- Log level: DEBUG
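The settings above could be applied in Python roughly as follows. This is a sketch under assumptions: the README does not say where the environment variable is set or how the log is cleared, so setting it through os.environ, opening the log in write mode, and the names configure_logging and the logger name are illustrative only.

```python
import logging
import os

# Assumption: the variable is set in-process before the PowerOn modules load.
os.environ["POWERON_CONFIG_FILE"] = "test_config.ini"


def configure_logging(log_path: str = "test_document_extraction.log") -> logging.Logger:
    """Return a DEBUG-level logger whose file is truncated on each call."""
    logger = logging.getLogger("test_document_extraction")  # hypothetical name
    logger.setLevel(logging.DEBUG)
    # Drop handlers left over from an earlier call so lines are not duplicated.
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
        old_handler.close()
    # mode="w" truncates the file, which clears the log on every run.
    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Opening the handler with mode="w" is what gives the "cleared on each run" behavior; the default mode of FileHandler is append.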
Dependencies
The test requires the same dependencies as the main PowerOn system:
- Python 3.8+
- Required Python packages (see requirements.txt)
- Access to AI services (if AI processing is enabled)
- Proper configuration in test_config.ini