# Document Extraction Test

This test procedure validates the DocumentManager's ability to extract content from files using AI-powered analysis.

## Files Created

- `test_document_extraction.py` - Main test script
- `test_sample_document.txt` - Sample document for testing
- `run_document_test.ps1` - PowerShell wrapper script
- `test_document_extraction.log` - Generated log file (cleared on each run)

## Usage

### Method 1: Using PowerShell Script (Recommended)

```powershell
# Test with default sample file
.\run_document_test.ps1

# Test with custom file
.\run_document_test.ps1 "path\to\your\document.pdf"
```

### Method 2: Direct Python Execution

```bash
# Test with default sample file
python test_document_extraction.py test_sample_document.txt

# Test with custom file
python test_document_extraction.py "path/to/your/document.docx"
```

## Test Features

1. **File Validation**: Checks that the specified file exists
2. **MIME Type Detection**: Detects the MIME type from the file extension
3. **Content Extraction**: Uses the DocumentManager to extract content
4. **AI Processing**: Applies the prompt "summarize the content and give list of the major topics"
5. **Comprehensive Logging**: Logs all steps and results to `test_document_extraction.log`
6. **Log Cleanup**: Clears the log file at the start of each test run

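The steps above can be sketched as one small function. This is a minimal illustration, not the actual contents of `test_document_extraction.py`: the function name `run_extraction_test` and the `extract` callable are hypothetical stand-ins, since the DocumentManager API is not shown in this document. Only the prompt text is taken from the list above.

```python
import logging
import mimetypes
import os
from typing import Callable, Optional

# Prompt text as quoted in the feature list.
PROMPT = "summarize the content and give list of the major topics"


def run_extraction_test(
    path: str,
    extract: Callable[[bytes, str, str], str],  # hypothetical stand-in for the DocumentManager call
    logger: logging.Logger,
) -> Optional[str]:
    """Validate the file, detect its MIME type, run extraction, and log each step."""
    # Step 1: file validation.
    if not os.path.isfile(path):
        logger.error("File not found: %s", path)
        return None

    # Step 2: MIME type detection from the extension, with a binary fallback.
    mime_type = mimetypes.guess_type(path)[0] or "application/octet-stream"

    with open(path, "rb") as handle:
        data = handle.read()
    logger.info("Read %d bytes from %s (%s)", len(data), path, mime_type)

    # Steps 3 and 4: extraction plus AI processing, delegated to the callable.
    try:
        result = extract(data, mime_type, PROMPT)
    except Exception:
        # Log the full traceback so processing failures can be debugged.
        logger.exception("Document processing failed for %s", path)
        return None

    # Step 5: log the outcome.
    logger.info("Extraction completed: %d characters", len(result))
    return result
```

Passing the extraction step in as a callable keeps the sketch runnable without the PowerOn code base; in the real script that argument would presumably wrap the DocumentManager.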
## Supported File Types

- Text files (.txt, .md)
- CSV files (.csv)
- JSON files (.json)
- XML files (.xml)
- HTML files (.html, .htm)
- Images (.jpg, .jpeg, .png, .gif, .svg)
- PDF files (.pdf)
- Office documents (.docx, .xlsx, .pptx)
- Other file types (fall back to binary processing)

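Extension-based detection with a binary fallback might look like the sketch below. It assumes the standard-library `mimetypes` module, which this document does not confirm the test script uses. The helper name `detect_mime_type` and the explicit Office table are illustrative additions, included because the system MIME database does not always know the Office formats.

```python
import mimetypes
import os

# Fallback for extensions that are not recognized ("binary processing").
FALLBACK_MIME = "application/octet-stream"

# Explicit entries for Office formats so the result does not depend on the host system.
OFFICE_MIME_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def detect_mime_type(path: str) -> str:
    """Return a MIME type based only on the file extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension in OFFICE_MIME_TYPES:
        return OFFICE_MIME_TYPES[extension]
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or FALLBACK_MIME
```

Because only the extension is inspected, a renamed file is reported under its new extension regardless of its actual contents.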
## Test Output

The test generates detailed logs including:

- File information (path, size, MIME type)
- Extraction process details
- Extracted content summary
- AI-processed results
- Error details if any issues occur

## Example Output

```
=== STARTING DOCUMENT EXTRACTION TEST ===
File information: {
  "file_path": "test_sample_document.txt",
  "filename": "test_sample_document.txt",
  "mime_type": "text/plain",
  "file_size_bytes": 2048,
  "file_size_mb": 0.0
}
Document extraction completed successfully: {
  "extracted_content_id": "test-doc-1234567890",
  "content_items_count": 1,
  "object_type": "ExtractedContent"
}
COMPLETE EXTRACTED CONTENT: {
  "total_length": 1500,
  "content": "PowerOn System Architecture Overview... [AI processed summary]"
}
```

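The `File information` entry in the sample log can be reproduced with a few standard-library calls. The sketch below is an assumption about how such a record could be built, not the script's actual code. The helper name `build_file_info` is hypothetical, and rounding the size to two decimals is inferred from the sample, where 2048 bytes is reported as `0.0` MB.

```python
import json
import os


def build_file_info(path: str, mime_type: str) -> dict:
    """Collect the fields shown in the 'File information' log entry."""
    size_bytes = os.path.getsize(path)
    return {
        "file_path": path,
        "filename": os.path.basename(path),
        "mime_type": mime_type,
        "file_size_bytes": size_bytes,
        # Assumed rounding: two decimals, so files under about 5 KiB show as 0.0 MB.
        "file_size_mb": round(size_bytes / (1024 * 1024), 2),
    }


def format_file_info(info: dict) -> str:
    """Render the record the way it appears in the log: a label plus indented JSON."""
    return "File information: " + json.dumps(info, indent=2)
```

For small files, `file_size_bytes` is therefore the more useful value to check when reading the log.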
## Error Handling

The test includes comprehensive error handling for:

- File not found errors
- File reading errors
- Document processing errors
- AI processing errors
- Import errors

All errors are logged with detailed information for debugging.

## Configuration

The test uses the same configuration as other tests:

- Environment variable: `POWERON_CONFIG_FILE = 'test_config.ini'`
- Log file: `test_document_extraction.log`
- Log level: DEBUG

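In Python, these three settings could be applied as sketched below. This is an illustration under stated assumptions: the function name `configure_logging` and the log format are hypothetical, and setting the environment variable before any PowerOn module is imported is assumed, not confirmed by this document.

```python
import logging
import os

# Assumption: PowerOn reads this variable at import time, so it is set first.
os.environ["POWERON_CONFIG_FILE"] = "test_config.ini"


def configure_logging(log_file: str = "test_document_extraction.log") -> None:
    """Send DEBUG-level logs to the test log file, clearing it on every call."""
    logging.basicConfig(
        filename=log_file,
        filemode="w",          # truncate the file so each run starts with an empty log
        level=logging.DEBUG,
        format="%(asctime)s %(levelname)s %(message)s",  # hypothetical format
        force=True,            # replace handlers from earlier runs (Python 3.8+)
    )
```

Opening the file with `filemode="w"` is one straightforward way to get the "cleared on each run" behavior; deleting the file before logging starts would work as well.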
## Dependencies

The test requires the same dependencies as the main PowerOn system:

- Python 3.8+
- Required Python packages (see `requirements.txt`)
- Access to AI services (if AI processing is enabled)
- Proper configuration in `test_config.ini`