gateway/README_document_test.md
2025-07-10 16:13:05 +02:00

Document Extraction Test

This test procedure validates the DocumentManager's ability to extract content from files using AI-powered analysis.

Files Created

  • test_document_extraction.py - Main test script
  • test_sample_document.txt - Sample document for testing
  • run_document_test.ps1 - PowerShell wrapper script
  • test_document_extraction.log - Generated log file (cleared on each run)

Usage

Method 1: PowerShell Wrapper Script

# Test with default sample file
.\run_document_test.ps1

# Test with custom file
.\run_document_test.ps1 "path\to\your\document.pdf"

Method 2: Direct Python Execution

# Test with default sample file
python test_document_extraction.py test_sample_document.txt

# Test with custom file
python test_document_extraction.py "path/to/your/document.docx"

Test Features

  1. File Validation: Checks if the specified file exists
  2. MIME Type Detection: Detects the MIME type from the file extension
  3. Content Extraction: Uses the DocumentManager to extract content
  4. AI Processing: Applies the prompt "summarize the content and give list of the major topics"
  5. Comprehensive Logging: Logs all steps and results to test_document_extraction.log
  6. Log Cleanup: Clears the log file on each test run
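
The first two steps above (file validation and MIME type detection) can be sketched with the standard library alone. This is a minimal illustration, not the code in test_document_extraction.py: the function name `describe_file` and the `application/octet-stream` fallback are assumptions, and the returned keys simply mirror the "File information" fields shown in the example output.

```python
import mimetypes
from pathlib import Path


def describe_file(file_path: str) -> dict:
    """Check that the file exists and collect basic metadata about it."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {file_path}")

    # The guess is based on the extension only; file content is not inspected.
    mime_type, _encoding = mimetypes.guess_type(path.name)
    if mime_type is None:
        # Assumed fallback for unknown extensions (generic binary data).
        mime_type = "application/octet-stream"

    size_bytes = path.stat().st_size
    return {
        "file_path": str(path),
        "filename": path.name,
        "mime_type": mime_type,
        "file_size_bytes": size_bytes,
        "file_size_mb": round(size_bytes / (1024 * 1024), 2),
    }
```

Because detection relies on the extension, a renamed file is reported with the MIME type of its new extension, whatever it actually contains.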

Supported File Types

  • Text files (.txt, .md)
  • CSV files (.csv)
  • JSON files (.json)
  • XML files (.xml)
  • HTML files (.html, .htm)
  • Images (.jpg, .jpeg, .png, .gif, .svg)
  • PDF files (.pdf)
  • Office documents (.docx, .xlsx, .pptx)
  • Other file types (fall back to binary processing)
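
The list above amounts to a lookup from file extension to a processing category, with a binary fallback. The sketch below shows one way to express that; the category names and the helper `classify_extension` are hypothetical and are not taken from the DocumentManager.

```python
from pathlib import Path

# Extensions come from the list above; the category labels are illustrative.
EXTENSION_CATEGORIES = {
    ".txt": "text", ".md": "text",
    ".csv": "csv",
    ".json": "json",
    ".xml": "xml",
    ".html": "html", ".htm": "html",
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".gif": "image", ".svg": "image",
    ".pdf": "pdf",
    ".docx": "office", ".xlsx": "office", ".pptx": "office",
}


def classify_extension(filename: str) -> str:
    """Return the category for a filename, or "binary" if it is not listed."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    return EXTENSION_CATEGORIES.get(suffix, "binary")
```

Lower-casing the suffix means that `Report.PDF` and `report.pdf` are treated the same way.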

Test Output

The test generates detailed logs including:

  • File information (path, size, MIME type)
  • Extraction process details
  • Extracted content summary
  • AI-processed results
  • Error details if any issues occur

Example Output

=== STARTING DOCUMENT EXTRACTION TEST ===
File information: {
  "file_path": "test_sample_document.txt",
  "filename": "test_sample_document.txt",
  "mime_type": "text/plain",
  "file_size_bytes": 2048,
  "file_size_mb": 0.0
}
Document extraction completed successfully: {
  "extracted_content_id": "test-doc-1234567890",
  "content_items_count": 1,
  "object_type": "ExtractedContent"
}
COMPLETE EXTRACTED CONTENT: {
  "total_length": 1500,
  "content": "PowerOn System Architecture Overview... [AI processed summary]"
}
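
Log entries of this shape, a label followed by an indented JSON object, can be produced with `json.dumps`. The helper name `format_log_entry` is hypothetical; the actual test script may build its messages differently.

```python
import json


def format_log_entry(label: str, data: dict) -> str:
    """Render a label followed by the data as JSON indented by two spaces."""
    return f"{label}: {json.dumps(data, indent=2)}"


entry = format_log_entry(
    "File information",
    {"file_path": "test_sample_document.txt", "mime_type": "text/plain"},
)
print(entry)
```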

Error Handling

The test includes comprehensive error handling for:

  • File not found errors
  • File reading errors
  • Document processing errors
  • AI processing errors
  • Import errors

All errors are logged with detailed information for debugging.
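
One common way to get per-category error logging is to wrap each test step in a small helper that catches the relevant exception types and logs the traceback. The sketch below assumes that approach; `run_step` and the logger name are hypothetical, and the real script may structure its handlers differently.

```python
import logging

logger = logging.getLogger("test_document_extraction")


def run_step(name, func, *args):
    """Run one test step; on failure, log the traceback and return None."""
    try:
        return func(*args)
    except FileNotFoundError:
        # Must come before OSError, which is its parent class.
        logger.exception("File not found during step '%s'", name)
    except OSError:
        logger.exception("File reading error during step '%s'", name)
    except ImportError:
        logger.exception("Import error during step '%s'", name)
    except Exception:
        # Catch-all for document or AI processing failures.
        logger.exception("Processing error during step '%s'", name)
    return None
```

`logger.exception` records the message at ERROR level together with the full traceback, which is what makes the log useful for debugging.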

Configuration

The test uses the same configuration as other tests:

  • Environment variable: POWERON_CONFIG_FILE = 'test_config.ini'
  • Log file: test_document_extraction.log
  • Log level: DEBUG
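
These three settings could be applied in Python roughly as follows. This is a sketch under assumptions: the function name `configure_logging`, the logger name, and the message format are illustrative, and only the environment variable, log file name, and log level come from the list above.

```python
import logging
import os

# Values taken from the configuration list above.
os.environ["POWERON_CONFIG_FILE"] = "test_config.ini"
LOG_FILE = "test_document_extraction.log"


def configure_logging(log_file: str = LOG_FILE) -> logging.Logger:
    """Return a DEBUG-level logger that writes to a freshly cleared file."""
    logger = logging.getLogger("test_document_extraction")
    logger.setLevel(logging.DEBUG)

    # Remove handlers from an earlier call so messages are not duplicated.
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
        old_handler.close()

    # mode="w" truncates the file, so each run starts with an empty log.
    handler = logging.FileHandler(log_file, mode="w", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Opening the file in write mode is what gives the "cleared on each run" behavior; append mode would keep entries from earlier runs.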

Dependencies

The test requires the same dependencies as the main PowerOn system:

  • Python 3.8+
  • Required Python packages (see requirements.txt)
  • Access to AI services (if AI processing is enabled)
  • Proper configuration in test_config.ini
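
A quick pre-flight check for the first and last items could look like the sketch below. The helper `check_prerequisites` is hypothetical and not part of the test script; it only verifies the Python version and the presence of the configuration file, not the installed packages or access to AI services.

```python
import sys
from pathlib import Path


def check_prerequisites(config_file: str = "test_config.ini") -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if sys.version_info < (3, 8):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python 3.8+ is required, found {found}")
    if not Path(config_file).is_file():
        problems.append(f"Configuration file not found: {config_file}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```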