# Document Extraction Test

This test validates that `DocumentManager` can extract content from files using AI-powered analysis.

## Files Created

- `test_document_extraction.py` - Main test script
- `test_sample_document.txt` - Sample document for testing
- `run_document_test.ps1` - PowerShell wrapper script
- `test_document_extraction.log` - Generated log file (cleared on each run)

## Usage

### Method 1: Using PowerShell Script (Recommended)

```powershell
# Test with default sample file
.\run_document_test.ps1

# Test with custom file
.\run_document_test.ps1 "path\to\your\document.pdf"
```

### Method 2: Direct Python Execution

```bash
# Test with the bundled sample file (passed explicitly)
python test_document_extraction.py test_sample_document.txt

# Test with custom file
python test_document_extraction.py "path/to/your/document.docx"
```

## Test Features

1. **File Validation**: Checks that the specified file exists
2. **MIME Type Detection**: Detects the MIME type from the file extension
3. **Content Extraction**: Uses `DocumentManager` to extract the content
4. **AI Processing**: Applies the prompt "summarize the content and give list of the major topics"
5. **Comprehensive Logging**: Logs every step and result to `test_document_extraction.log`
6. **Log Cleanup**: Clears the log file at the start of each test run

## Supported File Types

- Text files (.txt, .md)
- CSV files (.csv)
- JSON files (.json)
- XML files (.xml)
- HTML files (.html, .htm)
- Images (.jpg, .jpeg, .png, .gif, .svg)
- PDF files (.pdf)
- Office documents (.docx, .xlsx, .pptx)
- Other file types (processed as binary data as a fallback)
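
Extension-based detection with a binary fallback can be sketched with Python's standard `mimetypes` module. This is a minimal illustration, not the test script's actual code: the function name `detect_mime_type` and the choice of `application/octet-stream` as the fallback type are assumptions.

```python
import mimetypes
from pathlib import Path

# Assumption: the binary fallback is represented by the generic octet-stream type.
FALLBACK_MIME_TYPE = "application/octet-stream"


def detect_mime_type(file_path: str) -> str:
    """Guess a MIME type from the file extension only; the file is not opened."""
    mime_type, _encoding = mimetypes.guess_type(Path(file_path).name)
    # guess_type returns None for unknown extensions, so fall back to binary.
    return mime_type or FALLBACK_MIME_TYPE


if __name__ == "__main__":
    for name in ("notes.txt", "report.pdf", "archive.unknownext"):
        print(name, "->", detect_mime_type(name))
```

Because only the extension is inspected, a misnamed file (for example a PDF saved as `.txt`) would be reported with the wrong type.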

## Test Output

The test generates detailed logs including:

- File information (path, size, MIME type)
- Extraction process details
- Extracted content summary
- AI-processed results
- Error details if any issues occur

## Example Output

```
=== STARTING DOCUMENT EXTRACTION TEST ===
File information: {
  "file_path": "test_sample_document.txt",
  "filename": "test_sample_document.txt",
  "mime_type": "text/plain",
  "file_size_bytes": 2048,
  "file_size_mb": 0.0
}
Document extraction completed successfully: {
  "extracted_content_id": "test-doc-1234567890",
  "content_items_count": 1,
  "object_type": "ExtractedContent"
}
COMPLETE EXTRACTED CONTENT: {
  "total_length": 1500,
  "content": "PowerOn System Architecture Overview... [AI processed summary]"
}
```
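
The `File information` record in the output above could be assembled roughly as follows. This is a sketch under assumptions: the helper name `collect_file_info` is hypothetical, and rounding the megabyte value to two decimals is a guess that would explain why a 2048-byte file is reported as `0.0`.

```python
import json
import mimetypes
from pathlib import Path


def collect_file_info(file_path: str) -> dict:
    """Build a file-information record with the same keys as the logged example."""
    path = Path(file_path)
    size_bytes = path.stat().st_size  # raises FileNotFoundError if the file is missing
    mime_type, _encoding = mimetypes.guess_type(path.name)
    return {
        "file_path": str(file_path),
        "filename": path.name,
        "mime_type": mime_type or "application/octet-stream",
        "file_size_bytes": size_bytes,
        # Assumption: megabytes rounded to two decimals, so small files show 0.0.
        "file_size_mb": round(size_bytes / (1024 * 1024), 2),
    }


if __name__ == "__main__":
    import sys

    print("File information:", json.dumps(collect_file_info(sys.argv[1]), indent=2))
```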

## Error Handling

The test handles the following error cases:

- File not found errors
- File reading errors
- Document processing errors
- AI processing errors
- Import errors

All errors are logged with detailed information for debugging.
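
One way to structure this kind of error handling is shown below. It is a hedged sketch rather than the script's real implementation: `run_extraction_safely` and the `extract` callable are hypothetical stand-ins, since the `DocumentManager` API is not described here. Reading and processing are wrapped separately so the log shows which stage failed.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("test_document_extraction")


def run_extraction_safely(
    file_path: str, extract: Callable[[bytes], object]
) -> Optional[object]:
    """Read the file and pass its bytes to `extract`, logging any failure."""
    try:
        with open(file_path, "rb") as handle:
            data = handle.read()
    except FileNotFoundError:
        logger.error("File not found: %s", file_path)
        return None
    except OSError:
        # Covers permission problems and other file reading errors.
        logger.exception("Could not read file: %s", file_path)
        return None

    try:
        # `extract` is a hypothetical stand-in for the document and AI processing step.
        return extract(data)
    except Exception:
        # logger.exception records the full traceback for debugging.
        logger.exception("Document processing failed for: %s", file_path)
        return None
```

Returning `None` on failure keeps the sketch short; a real test script might instead exit with a non-zero status so that a wrapper script can detect the failure.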

## Configuration

The test uses the same configuration as other tests:

- Environment variable: `POWERON_CONFIG_FILE = 'test_config.ini'`
- Log file: `test_document_extraction.log`
- Log level: DEBUG
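
A setup matching these settings might look like the sketch below. The function name `configure_test_logging`, the logger name, and the message format are assumptions; the environment variable, log file name, and log level are taken from the list above. Opening the file handler with `mode="w"` truncates the file, which is one simple way to clear the log on each run.

```python
import logging
import os

LOG_FILE = "test_document_extraction.log"


def configure_test_logging(log_file: str = LOG_FILE) -> logging.Logger:
    """Point PowerOn at the test config and log at DEBUG level to a fresh file."""
    # Assumption: this must be set before the PowerOn modules are imported.
    os.environ["POWERON_CONFIG_FILE"] = "test_config.ini"

    logger = logging.getLogger("test_document_extraction")
    logger.setLevel(logging.DEBUG)

    # Remove handlers left over from earlier calls so messages are not duplicated.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    # mode="w" truncates the file, clearing the previous run's log.
    file_handler = logging.FileHandler(log_file, mode="w", encoding="utf-8")
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(file_handler)
    return logger


if __name__ == "__main__":
    log = configure_test_logging()
    log.info("=== STARTING DOCUMENT EXTRACTION TEST ===")
```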

## Dependencies

The test requires the same dependencies as the main PowerOn system:

- Python 3.8+
- Required Python packages (see `requirements.txt`)
- Access to AI services (if AI processing is enabled)
- A valid configuration in `test_config.ini`