# Content Validator - Deep Analysis & Target Design

## CURRENT STATE ANALYSIS

### How Validator Currently Works

#### 1. **Document Input Flow**
```
ActionResult.documents (List[ActionDocument])
  → modeReact.py extracts "structured content" with hardcoded checks
  → Creates SimpleNamespace objects with wrapped documentData
  → Passes to ContentValidator.validateContent()
```

#### 2. **Current Problems in modeReact.py (Lines 99-136)**
- ❌ **Hardcoded document name checks**: `docName == "structured_content.json"`
- ❌ **Hardcoded mimeType checks**: `mimeType == "application/json"`
- ❌ **Hardcoded structure checks**: `'content' in docData or 'documents' in docData or 'sections' in docData`
- ❌ **Single document selection**: `break` after first match - ignores other documents
- ❌ **Non-generic logic**: Specific to certain document structures
- ❌ **Workaround approach**: Trying to find structured content in various ways

#### 3. **Current Problems in contentValidator.py**

**`_extractContent()` method (Lines 21-41)**:
- ❌ **Inconsistent handling**: Checks for `dict with 'content'` but then also handles raw `data`
- ❌ **Silent failures**: Returns empty string on any exception
- ❌ **Size limit hardcoded**: 10KB threshold is arbitrary
- ❌ **No format awareness**: Doesn't check if document is binary/base64 before extracting
- ❌ **No document type detection**: Doesn't distinguish text vs binary vs structured data

**`_validateWithAI()` method (Lines 60-200)**:
- ❌ **Forces all content to string**: `content[:2000]` truncation assumes text
- ❌ **No document metadata passed**: Only name and content; no size, format, or mimeType info
- ❌ **No binary/base64 handling**: Will fail or show garbage for binary documents
- ❌ **Multiple JSON extraction strategies**: Indicates unreliable AI response parsing
- ❌ **Size limits inconsistent**: 10KB in extraction vs 2KB (`content[:2000]`) in the prompt, with no stated rationale for the difference

#### 4. **Missing Capabilities**
- ❌ No document size reporting to validator
- ❌ No format validation (txt vs md vs pdf vs docx)
- ❌ No binary data handling (images, PDFs, etc.)
- ❌ No document count/summary statistics
- ❌ No distinction between document types for validation

---

## TARGET DESIGN

### Core Principles
1. **GENERIC**: No hardcoded document names, types, or structures
2. **DOCUMENT-AWARE**: Handle all document types (text, binary, base64, structured)
3. **SIZE-CONSCIOUS**: Never pass full large documents to AI
4. **METADATA-RICH**: Pass document metadata (name, size, format, mimeType) to validator
5. **FORMAT-FLEXIBLE**: Allow format flexibility (md ≈ text, but pdf ≠ docx)

### Target Architecture

```
Documents Input (List[ActionDocument])
    ↓
Document Analyzer (generic)
    - Extract metadata (name, size, mimeType, format)
    - Determine content type (text/binary/base64/structured)
    - Create preview/summary for large documents
    ↓
Document Summary (for AI validation)
    - Metadata only for binary/base64
    - Preview/sample for large text documents
    - Full content for small text/structured documents
    ↓
Validation Prompt Builder (generic)
    - Include document summaries (not full content)
    - Include document metadata
    - Include format validation rules (generic)
    ↓
AI Validator
    - Validates against task objective (generic)
    - Validates format compliance (flexible)
    - Validates document count/size
```

---

## REQUIRED CHANGES

### 1. **Remove All Hardcoded Checks from modeReact.py**
- ❌ Remove document name checks
- ❌ Remove mimeType-specific logic
- ❌ Remove structure-specific checks
- ✅ Pass ALL documents to validator (let validator decide what to validate)
- ✅ Keep it simple: `validationDocs = result.documents`

### 2. **Redesign contentValidator.py - New Structure**

#### New Method: `_analyzeDocuments(documents)`
```python
from typing import Any, Dict, List

def _analyzeDocuments(self, documents: List[Any]) -> List[Dict[str, Any]]:
    """
    Generic document analysis - extract metadata and create summaries.
    Returns list of document summaries ready for validation prompt.
    """
    summaries = []
    for doc in documents:
        summary = {
            "name": getattr(doc, 'documentName', 'Unknown'),
            "mimeType": getattr(doc, 'mimeType', 'unknown'),
            "format": self._detectFormat(doc),
            "size": self._calculateSize(doc),
            "type": self._detectContentType(doc),  # text/binary/base64/structured
            "preview": self._createPreview(doc),  # None for binary, sample for large text
            "isAccessible": self._isContentAccessible(doc)  # Can we read content?
        }
        summaries.append(summary)
    return summaries
```

#### New Method: `_detectFormat(doc)`
- Extract from filename extension or mimeType
- Generic mapping: `text/plain` → `txt`, `text/markdown` → `md`, etc.
- Return format string (txt, md, pdf, docx, json, etc.)

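
The lookup described above could be sketched as a standalone function. This is a minimal sketch, not the final method: `detectFormat`, `MIME_TO_FORMAT`, and the choice to prefer the filename extension over the mimeType are assumptions made for illustration.

```python
import mimetypes
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping for illustration; extend as needed.
MIME_TO_FORMAT = {
    "text/plain": "txt",
    "text/markdown": "md",
    "text/html": "html",
    "application/json": "json",
    "application/pdf": "pdf",
}


def detectFormat(documentName: Optional[str], mimeType: Optional[str]) -> str:
    """Return a short format string such as 'txt', 'md' or 'pdf'."""
    # 1. Assumption: a filename extension, when present, wins.
    if documentName:
        suffix = PurePosixPath(documentName).suffix.lower().lstrip(".")
        if suffix:
            return suffix

    # 2. Otherwise map the mimeType, ignoring parameters like "; charset=utf-8".
    if mimeType:
        normalized = mimeType.split(";")[0].strip().lower()
        if normalized in MIME_TO_FORMAT:
            return MIME_TO_FORMAT[normalized]
        # 3. Last resort: ask the standard library for a known extension.
        guessed = mimetypes.guess_extension(normalized)
        if guessed:
            return guessed.lstrip(".")

    return "unknown"
```

Returning `"unknown"` instead of raising keeps the analyzer generic: a document with no recognizable format is still summarized and reported to the validator.
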
#### New Method: `_calculateSize(doc)`
- Calculate document size in bytes
- Handle string, dict, list, bytes, base64
- Return: `{"bytes": int, "readable": "1.5 MB"}`

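
One possible shape for the size calculation is shown below. It is a sketch under assumptions: `calculateSize` and `humanReadable` are illustrative names, the function takes the raw document data rather than the document object, and the caller is assumed to already know whether a string is base64 (passed as the hypothetical `isBase64` flag).

```python
import base64
import json
from typing import Any, Dict


def humanReadable(numBytes: int) -> str:
    """Format a byte count as e.g. '512 B', '1.5 KB' or '1.5 MB'."""
    if numBytes < 1024:
        return f"{numBytes} B"
    size = float(numBytes)
    for unit in ("KB", "MB", "GB"):
        size /= 1024
        if size < 1024:
            return f"{size:.1f} {unit}"
    return f"{size / 1024:.1f} TB"


def calculateSize(data: Any, isBase64: bool = False) -> Dict[str, Any]:
    """Return {'bytes': int, 'readable': str} for common payload types."""
    if data is None:
        numBytes = 0
    elif isinstance(data, (bytes, bytearray)):
        numBytes = len(data)
    elif isinstance(data, str):
        if isBase64:
            # Report the decoded payload size, not the encoded string length.
            numBytes = len(base64.b64decode(data))
        else:
            numBytes = len(data.encode("utf-8"))
    else:
        # dict/list and other values: size of their JSON serialization.
        numBytes = len(json.dumps(data, default=str).encode("utf-8"))
    return {"bytes": numBytes, "readable": humanReadable(numBytes)}
```

For structured data the size is that of the JSON serialization, which is what the validator prompt would actually carry; the in-memory size of a dict is not meaningful here.
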
#### New Method: `_detectContentType(doc)`
- `text`: Readable text content
- `structured`: JSON/dict/list structures
- `binary`: Binary data (PDF, images, etc.)
- `base64`: Base64-encoded data
- Return content type string

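
A rough sketch of the four-way classification follows. The function names and the `minLength` cutoff are assumptions, and the base64 check is only a heuristic: a long string without spaces or punctuation could be misclassified, so a real implementation may prefer an explicit encoding flag on the document when one is available.

```python
import base64
import binascii
from typing import Any


def looksLikeBase64(text: str, minLength: int = 64) -> bool:
    """Heuristic: long, padded correctly, and strictly within the base64 alphabet."""
    # Base64 payloads are often line-wrapped, so drop line breaks only.
    candidate = "".join(text.split("\n")).replace("\r", "").strip()
    if len(candidate) < minLength or len(candidate) % 4 != 0:
        return False
    try:
        # validate=True rejects any character outside the base64 alphabet,
        # including the spaces found in ordinary prose.
        base64.b64decode(candidate, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


def detectContentType(data: Any) -> str:
    """Classify raw document data as text, structured, binary or base64."""
    if isinstance(data, (bytes, bytearray)):
        return "binary"
    if isinstance(data, (dict, list)):
        return "structured"
    if isinstance(data, str) and looksLikeBase64(data):
        return "base64"
    # Assumption: plain strings and any other scalar are treated as text.
    return "text"
```
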
#### New Method: `_createPreview(doc)`
- **Binary/Base64**: Return `None` (metadata only)
- **Large text (>50KB)**: Return first 1KB + size indicator
- **Small text (≤50KB)**: Return full content
- **Structured data**: Return JSON string (truncated if large)

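
The preview rules above might translate into something like the following sketch. The constant names and the function signature are assumptions; the 50KB and 1KB values come from the list above, and the 1KB cut is approximated as the first 1,024 characters rather than exact bytes.

```python
import json
from typing import Any, Optional

FULL_CONTENT_LIMIT = 50 * 1024  # 50KB threshold for passing full content
PREVIEW_LENGTH = 1024           # approx. 1KB, counted in characters


def createPreview(data: Any, contentType: str) -> Optional[str]:
    """Return None, the full content, or a truncated preview with a size note."""
    # Binary and base64 documents are validated by metadata only.
    if contentType in ("binary", "base64"):
        return None

    if contentType == "structured":
        text = json.dumps(data, indent=2, default=str, ensure_ascii=False)
    else:
        text = data if isinstance(data, str) else str(data)

    totalBytes = len(text.encode("utf-8"))
    if totalBytes <= FULL_CONTENT_LIMIT:
        return text

    # Large content: head of the document plus a size indicator.
    return f"{text[:PREVIEW_LENGTH]}\n... [truncated, {totalBytes} bytes total]"
```

Structured data goes through the same size check after serialization, so a large JSON document is truncated the same way as large text.
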
#### New Method: `_isContentAccessible(doc)`
- Check if document content can be extracted for validation
- Binary/base64 documents: `False` (validate by metadata only)
- Text/structured documents: `True`

### 3. **Redesign Validation Prompt (Generic)**

```python
import json

# intent, successCriteria, criteriaCount and documentSummaries are assumed
# to be in scope (documentSummaries comes from _analyzeDocuments()).
validationPrompt = f"""TASK VALIDATION

USER REQUEST: '{intent.get('primaryGoal', 'Unknown')}'
EXPECTED DATA TYPE: {intent.get('dataType', 'unknown')}
EXPECTED FORMAT: {intent.get('expectedFormat', 'unknown')}
SUCCESS CRITERIA ({criteriaCount} items): {successCriteria}

DELIVERED DOCUMENTS ({len(documentSummaries)} items):
{json.dumps(documentSummaries, indent=2)}

VALIDATION RULES:
1. Check if delivered documents match expected data type
2. Check if delivered formats are compatible with expected format
   (Note: text formats like txt/md are compatible; pdf ≠ docx but both are documents)
3. Verify each success criterion is met based on document content/metadata
4. Check document sizes are reasonable for the task
5. Rate overall quality (0.0-1.0)
6. Identify specific gaps
7. Suggest next steps

OUTPUT FORMAT - JSON ONLY (no prose):
{{
    "overallSuccess": false,
    "qualityScore": 0.0,
    "dataTypeMatch": false,
    "formatMatch": false,
    "documentCount": {len(documentSummaries)},
    "successCriteriaMet": {json.dumps([False] * criteriaCount)},
    "gapAnalysis": "Specific gaps found",
    "improvementSuggestions": ["NEXT STEP: Action 1"],
    "validationDetails": [
        {{
            "documentName": "document.ext",
            "issues": ["Issue 1"],
            "suggestions": ["NEXT STEP: Fix 1"]
        }}
    ]
}}
"""
```

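
Because the prompt asks for JSON only, the reply can be parsed in a single strict step instead of the multiple extraction strategies criticized earlier. The sketch below is one way to do that; `parseValidationResponse` and `REQUIRED_FIELDS` are hypothetical names, and the set of fields treated as mandatory is an assumption based on the OUTPUT FORMAT above.

```python
import json
from typing import Any, Dict

# Field names follow the OUTPUT FORMAT section of the prompt.
REQUIRED_FIELDS = {
    "overallSuccess": bool,
    "qualityScore": (int, float),
    "dataTypeMatch": bool,
    "formatMatch": bool,
    "successCriteriaMet": list,
}


def parseValidationResponse(raw: str, criteriaCount: int) -> Dict[str, Any]:
    """Parse the validator reply strictly; raise ValueError instead of guessing."""
    try:
        result = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {exc}") from exc
    if not isinstance(result, dict):
        raise ValueError("Reply must be a JSON object")

    for field, expectedType in REQUIRED_FIELDS.items():
        if field not in result:
            raise ValueError(f"Missing field: {field}")
        if not isinstance(result[field], expectedType):
            raise ValueError(f"Wrong type for field: {field}")

    score = result["qualityScore"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not 0.0 <= score <= 1.0:
        raise ValueError("qualityScore must be a number in [0.0, 1.0]")
    if len(result["successCriteriaMet"]) != criteriaCount:
        raise ValueError("successCriteriaMet length does not match criteriaCount")
    return result
```

Raising on a malformed reply makes parsing failures visible to the caller, which can then retry or mark the validation as inconclusive rather than silently accepting a partial result.
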
### 4. **Format Validation Logic (Generic & Flexible)**

```python
def _isFormatCompatible(self, deliveredFormat: str, expectedFormat: str) -> bool:
    """
    Generic format compatibility check.
    - txt/md/html are text formats (compatible with each other)
    - pdf/docx/xlsx are document formats (not compatible with each other)
    - json/xml are structured formats (compatible with each other)
    - image formats fall through to the exact-match check
    """
    # Normalize once so every comparison is case-insensitive
    delivered = deliveredFormat.lower()
    expected = expectedFormat.lower()

    # Text formats are interchangeable
    textFormats = {'txt', 'md', 'html', 'text', 'plain'}
    if delivered in textFormats and expected in textFormats:
        return True

    # Exact match (the only way pdf, docx, xlsx and image formats pass)
    if delivered == expected:
        return True

    # Structured formats
    structuredFormats = {'json', 'xml'}
    if delivered in structuredFormats and expected in structuredFormats:
        return True  # Could be made more flexible

    return False
```

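
The same rules can also be expressed as data, which keeps the method short as more format families are added. This is an illustrative variant of `_isFormatCompatible`, not a replacement decided by this design: `FORMAT_GROUPS` and the standalone `isFormatCompatible` are hypothetical names.

```python
from typing import Dict, Set

# Formats within one group are treated as interchangeable.
# Strict formats (pdf, docx, xlsx, images) are deliberately absent,
# so they only pass via the exact-match check.
FORMAT_GROUPS: Dict[str, Set[str]] = {
    "text": {"txt", "md", "html", "text", "plain"},
    "structured": {"json", "xml"},
}


def isFormatCompatible(deliveredFormat: str, expectedFormat: str) -> bool:
    """Table-driven compatibility check with the same outcomes as the method above."""
    delivered = deliveredFormat.strip().lower()
    expected = expectedFormat.strip().lower()
    if delivered == expected:
        return True
    return any(
        delivered in group and expected in group
        for group in FORMAT_GROUPS.values()
    )


# Usage: text formats are flexible, document formats are strict.
print(isFormatCompatible("md", "txt"))    # text family
print(isFormatCompatible("pdf", "docx"))  # different document formats
```
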
---

## IMPLEMENTATION PLAN

### Phase 1: Clean Up modeReact.py
- Remove all hardcoded checks
- Simply pass `result.documents` to validator

### Phase 2: Redesign Document Analysis
- Implement `_analyzeDocuments()`
- Implement helper methods: `_detectFormat()`, `_calculateSize()`, `_detectContentType()`, `_createPreview()`, `_isContentAccessible()`

### Phase 3: Redesign Validation Prompt
- Generic prompt with document summaries
- Include metadata, not full content
- Size-aware handling

### Phase 4: Implement Format Validation
- Generic format compatibility logic
- Flexible matching (text formats, document formats, etc.)

### Phase 5: Testing
- Test with text documents (small & large)
- Test with binary documents (PDF, images)
- Test with base64 documents
- Test with structured data (JSON)

---

## KEY DESIGN DECISIONS

1. **Pass ALL documents**: Validator decides what to validate, not the caller
2. **Metadata over content**: For binary/base64 documents, pass metadata only
3. **Preview samples**: For large text documents, pass preview + size info
4. **Generic prompts**: No task-specific or format-specific logic
5. **Flexible format matching**: Text formats compatible, document formats strict
6. **Size limits**: 50KB threshold for full content (configurable)
7. **Content type detection**: Explicit type detection (text/binary/base64/structured)

---

## BENEFITS OF TARGET DESIGN

- ✅ **Generic**: Works with any document type without hardcoding
- ✅ **Scalable**: Handles large documents by sending previews and metadata instead of full content
- ✅ **Flexible**: Format validation is flexible where appropriate
- ✅ **Maintainable**: Clear separation of concerns
- ✅ **Robust**: Handles edge cases (binary, base64, large files)
- ✅ **Testable**: Each component can be tested independently