# Document Extraction Test

This test procedure validates the DocumentManager's ability to extract content from files using AI-powered analysis.

## Files Created

- `test_document_extraction.py` - Main test script
- `test_sample_document.txt` - Sample document for testing
- `run_document_test.ps1` - PowerShell wrapper script
- `test_document_extraction.log` - Generated log file (cleared on each run)

## Usage

### Method 1: Using PowerShell Script (Recommended)

```powershell
# Test with default sample file
.\run_document_test.ps1

# Test with custom file
.\run_document_test.ps1 "path\to\your\document.pdf"
```

### Method 2: Direct Python Execution

```bash
# Test with default sample file
python test_document_extraction.py test_sample_document.txt

# Test with custom file
python test_document_extraction.py "path/to/your/document.docx"
```

## Test Features

1. **File Validation**: Checks if the specified file exists
2. **MIME Type Detection**: Automatically detects file type based on extension
3. **Content Extraction**: Uses the DocumentManager to extract content
4. **AI Processing**: Applies the prompt "summarize the content and give list of the major topics"
5. **Comprehensive Logging**: Logs all steps and results to `test_document_extraction.log`
6. **Log Cleanup**: Clears the log file on each test run

## Supported File Types

- Text files (.txt, .md)
- CSV files (.csv)
- JSON files (.json)
- XML files (.xml)
- HTML files (.html, .htm)
- Images (.jpg, .jpeg, .png, .gif, .svg)
- PDF files (.pdf)
- Office documents (.docx, .xlsx, .pptx)
- And more (fallback to binary processing)

## Test Output

The test generates detailed logs including:

- File information (path, size, MIME type)
- Extraction process details
- Extracted content summary
- AI-processed results
- Error details if any issues occur

## Example Output

```
=== STARTING DOCUMENT EXTRACTION TEST ===
File information:
{
  "file_path": "test_sample_document.txt",
  "filename": "test_sample_document.txt",
  "mime_type": "text/plain",
  "file_size_bytes": 2048,
  "file_size_mb": 0.0
}
Document extraction completed successfully:
{
  "extracted_content_id": "test-doc-1234567890",
  "content_items_count": 1,
  "object_type": "ExtractedContent"
}
COMPLETE EXTRACTED CONTENT:
{
  "total_length": 1500,
  "content": "PowerOn System Architecture Overview... [AI processed summary]"
}
```

## Error Handling

The test includes comprehensive error handling for:

- File not found errors
- File reading errors
- Document processing errors
- AI processing errors
- Import errors

All errors are logged with detailed information for debugging.

## Configuration

The test uses the same configuration as other tests:

- Environment variable: `POWERON_CONFIG_FILE = 'test_config.ini'`
- Log file: `test_document_extraction.log`
- Log level: DEBUG

## Dependencies

The test requires the same dependencies as the main PowerOn system:

- Python 3.8+
- Required Python packages (see requirements.txt)
- Access to AI services (if AI processing is enabled)
- Proper configuration in test_config.ini
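
As an illustration of the file validation, MIME type detection, and log cleanup steps listed under Test Features, the following is a minimal sketch and not the actual code in `test_document_extraction.py`. The helper names `describe_file` and `configure_logging`, the logger name, and the `application/octet-stream` fallback for unknown extensions are assumptions made for this example. It uses only the Python standard library.

```python
import logging
import mimetypes
from pathlib import Path

# Log file name taken from this README; everything else here is illustrative.
LOG_FILE = "test_document_extraction.log"


def describe_file(path):
    """Hypothetical helper: validate a file and collect basic file information."""
    p = Path(path)
    # File validation: fail early if the path does not point to an existing file.
    if not p.is_file():
        raise FileNotFoundError(f"File not found: {p}")

    # MIME type detection based on the file extension only.
    mime_type, _encoding = mimetypes.guess_type(p.name)
    size = p.stat().st_size
    return {
        "file_path": str(p),
        "filename": p.name,
        # Assumed fallback for extensions the mimetypes module does not know.
        "mime_type": mime_type or "application/octet-stream",
        "file_size_bytes": size,
        "file_size_mb": round(size / (1024 * 1024), 2),
    }


def configure_logging(log_file=LOG_FILE):
    """Hypothetical helper: DEBUG-level file logging, truncated on each run."""
    logger = logging.getLogger("test_document_extraction")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    # mode="w" truncates the log file, which clears it at the start of a run.
    handler = logging.FileHandler(log_file, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

A test script could call `configure_logging()` once at startup and then log the dictionary returned by `describe_file(...)` before handing the file to the DocumentManager. Opening the handler in write mode is one simple way to get the "cleared on each run" behavior without deleting the file explicitly.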