gateway/README_document_test.md
2025-07-10 16:13:05 +02:00

# Document Extraction Test
This test validates that the `DocumentManager` can extract content from a file and run the extracted content through AI-powered analysis.
## Files Created
- `test_document_extraction.py` - Main test script
- `test_sample_document.txt` - Sample document for testing
- `run_document_test.ps1` - PowerShell wrapper script
- `test_document_extraction.log` - Generated log file (cleared on each run)
## Usage
### Method 1: Using PowerShell Script (Recommended)
```powershell
# Test with default sample file
.\run_document_test.ps1
# Test with custom file
.\run_document_test.ps1 "path\to\your\document.pdf"
```
### Method 2: Direct Python Execution
```bash
# Test with default sample file
python test_document_extraction.py test_sample_document.txt
# Test with custom file
python test_document_extraction.py "path/to/your/document.docx"
```
## Test Features
1. **File Validation**: Checks if the specified file exists
2. **MIME Type Detection**: Automatically detects file type based on extension
3. **Content Extraction**: Uses the DocumentManager to extract content
4. **AI Processing**: Applies the prompt "summarize the content and give list of the major topics"
5. **Comprehensive Logging**: Logs all steps and results to `test_document_extraction.log`
6. **Log Cleanup**: Clears the log file on each test run
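The overall flow of these steps can be sketched as follows. This is a minimal illustration, not the actual test script: the `DocumentManager` API is not shown in this README, so the sketch takes a hypothetical `extract` callable that stands in for it. The function name `run_extraction_test` is likewise an assumption.

```python
import os
from typing import Callable

# Prompt quoted from the feature list above.
PROMPT = "summarize the content and give list of the major topics"


def run_extraction_test(file_path: str, extract: Callable[[bytes, str], str]) -> str:
    """Validate the file, read it, and hand it to an extraction callable.

    `extract` is a hypothetical stand-in for the DocumentManager call;
    it receives the raw file bytes and the prompt and returns text.
    """
    # Step 1: file validation
    if not os.path.isfile(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")

    # Read the file as bytes so that binary formats (PDF, images) work too
    with open(file_path, "rb") as fh:
        data = fh.read()

    # Steps 3 and 4: content extraction and AI processing
    return extract(data, PROMPT)
```

Passing the extraction step in as a callable keeps the sketch runnable without access to AI services, for example `run_extraction_test("test_sample_document.txt", lambda data, prompt: data.decode("utf-8"))`.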
## Supported File Types
- Text files (.txt, .md)
- CSV files (.csv)
- JSON files (.json)
- XML files (.xml)
- HTML files (.html, .htm)
- Images (.jpg, .jpeg, .png, .gif, .svg)
- PDF files (.pdf)
- Office documents (.docx, .xlsx, .pptx)
- Other file types (processed as binary data as a fallback)
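Extension-based detection with a binary fallback can be sketched with Python's standard `mimetypes` module. Whether the test script uses `mimetypes` or its own extension table is not stated here, so treat the helper name `detect_mime_type` and the `application/octet-stream` fallback as assumptions.

```python
import mimetypes


def detect_mime_type(file_path: str) -> str:
    """Guess the MIME type from the file extension (hypothetical helper)."""
    mime_type, _encoding = mimetypes.guess_type(file_path)
    # Unknown extensions yield None; fall back to a generic binary type
    return mime_type or "application/octet-stream"
```

Because the guess relies on the extension alone, a renamed file (for example a PDF saved as `.txt`) would be reported with the wrong type.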
## Test Output
The test generates detailed logs including:
- File information (path, size, MIME type)
- Extraction process details
- Extracted content summary
- AI-processed results
- Error details if any issues occur
## Example Output
```
=== STARTING DOCUMENT EXTRACTION TEST ===
File information: {
  "file_path": "test_sample_document.txt",
  "filename": "test_sample_document.txt",
  "mime_type": "text/plain",
  "file_size_bytes": 2048,
  "file_size_mb": 0.0
}
Document extraction completed successfully: {
  "extracted_content_id": "test-doc-1234567890",
  "content_items_count": 1,
  "object_type": "ExtractedContent"
}
COMPLETE EXTRACTED CONTENT: {
  "total_length": 1500,
  "content": "PowerOn System Architecture Overview... [AI processed summary]"
}
```
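The `File information` entry in the example output could be assembled roughly as follows. The field names are taken from the example; the helper name `build_file_info` and the rounding to two decimals are assumptions, chosen because they reproduce the `0.0` shown for a 2048-byte file.

```python
import os


def build_file_info(file_path: str, mime_type: str) -> dict:
    """Collect the fields shown under 'File information' (hypothetical helper)."""
    size_bytes = os.path.getsize(file_path)
    return {
        "file_path": file_path,
        "filename": os.path.basename(file_path),
        "mime_type": mime_type,
        "file_size_bytes": size_bytes,
        # Assumed rounding: 2048 bytes is about 0.002 MB, which rounds to 0.0
        "file_size_mb": round(size_bytes / (1024 * 1024), 2),
    }
```

A dictionary like this can be written to the log with `json.dumps(info, indent=2)` to get the multi-line layout seen in the example.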
## Error Handling
The test includes comprehensive error handling for:
- File not found errors
- File reading errors
- Document processing errors
- AI processing errors
- Import errors
All errors are logged with detailed information for debugging.
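One way to log failures per step instead of letting the script crash is sketched below. The logger name and the `run_step` helper are hypothetical, and the exception types the real script catches may differ.

```python
import logging

logger = logging.getLogger("test_document_extraction")


def run_step(name, func, *args):
    """Run one test step and return (succeeded, result).

    Errors are logged rather than propagated, so later steps
    can still report what happened.
    """
    try:
        return True, func(*args)
    except FileNotFoundError as exc:
        logger.error("%s failed, file not found: %s", name, exc)
    except OSError as exc:
        logger.error("%s failed while reading the file: %s", name, exc)
    except Exception:
        # logger.exception also records the traceback for debugging
        logger.exception("%s failed with an unexpected error", name)
    return False, None
```

For example, `run_step("read file", open, "missing.txt")` returns `(False, None)` and leaves an error entry in the log.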
## Configuration
The test uses the same configuration as other tests:
- Environment variable: `POWERON_CONFIG_FILE` set to `'test_config.ini'`
- Log file: `test_document_extraction.log`
- Log level: DEBUG
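These settings could be applied at the top of a test script along the following lines. This is a sketch under assumptions: the logger name, the log format, and the `configure_logging` helper are illustrative, and the environment variable is assumed to need setting before any PowerOn module is imported.

```python
import logging
import os

# Assumption: set this before importing PowerOn modules that read it
os.environ["POWERON_CONFIG_FILE"] = "test_config.ini"


def configure_logging(log_path: str = "test_document_extraction.log") -> logging.Logger:
    """Return a DEBUG-level logger whose file is truncated on each run."""
    logger = logging.getLogger("test_document_extraction")
    logger.setLevel(logging.DEBUG)

    # Drop handlers left over from an earlier call to avoid duplicate lines
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
        old_handler.close()

    # mode="w" truncates the file, which clears the log on each test run
    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Opening the handler in write mode is what gives the "cleared on each run" behavior without a separate delete step.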
## Dependencies
The test requires the same dependencies as the main PowerOn system:
- Python 3.8+
- Required Python packages (see `requirements.txt`)
- Access to AI services (if AI processing is enabled)
- Proper configuration in `test_config.ini`