Content Validator - Deep Analysis & Target Design
CURRENT STATE ANALYSIS
How Validator Currently Works
1. Document Input Flow
ActionResult.documents (List[ActionDocument])
→ modeReact.py extracts "structured content" with hardcoded checks
→ Creates SimpleNamespace objects with wrapped documentData
→ Passes to ContentValidator.validateContent()
2. Current Problems in modeReact.py (Lines 99-136)
- ❌ Hardcoded document name checks: `docName == "structured_content.json"`
- ❌ Hardcoded mimeType checks: `mimeType == "application/json"`
- ❌ Hardcoded structure checks: `'content' in docData or 'documents' in docData or 'sections' in docData`
- ❌ Single document selection: `break` after first match - ignores other documents
- ❌ Non-generic logic: Specific to certain document structures
- ❌ Workaround approach: Trying to find structured content in various ways
3. Current Problems in contentValidator.py
_extractContent() method (Lines 21-41):
- ❌ Inconsistent handling: Checks for dict with `'content'` but then also handles raw data
- ❌ Silent failures: Returns empty string on any exception
- ❌ Size limit hardcoded: 10KB threshold is arbitrary
- ❌ No format awareness: Doesn't check if document is binary/base64 before extracting
- ❌ No document type detection: Doesn't distinguish text vs binary vs structured data
_validateWithAI() method (Lines 60-200):
- ❌ Forces all content to string: `content[:2000]` truncation assumes text
- ❌ No document metadata passed: Only name and content; no size, format, or mimeType info
- ❌ No binary/base64 handling: Will fail or show garbage for binary documents
- ❌ Multiple JSON extraction strategies: Indicates unreliable AI response parsing
- ❌ Size limits inconsistent: 10KB in extraction, 2KB in prompt - why different?
4. Missing Capabilities
- ❌ No document size reporting to validator
- ❌ No format validation (txt vs md vs pdf vs docx)
- ❌ No binary data handling (images, PDFs, etc.)
- ❌ No document count/summary statistics
- ❌ No distinction between document types for validation
TARGET DESIGN
Core Principles
- GENERIC: No hardcoded document names, types, or structures
- DOCUMENT-AWARE: Handle all document types (text, binary, base64, structured)
- SIZE-CONSCIOUS: Never pass full large documents to AI
- METADATA-RICH: Pass document metadata (name, size, format, mimeType) to validator
- FORMAT-FLEXIBLE: Allow format flexibility (md ≈ text, but pdf ≠ docx)
Target Architecture
Documents Input (List[ActionDocument])
↓
Document Analyzer (generic)
- Extract metadata (name, size, mimeType, format)
- Determine content type (text/binary/base64/structured)
- Create preview/summary for large documents
↓
Document Summary (for AI validation)
- Metadata only for binary/base64
- Preview/sample for large text documents
- Full content for small text/structured documents
↓
Validation Prompt Builder (generic)
- Include document summaries (not full content)
- Include document metadata
- Include format validation rules (generic)
↓
AI Validator
- Validates against task objective (generic)
- Validates format compliance (flexible)
- Validates document count/size
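The flow above can be sketched end to end. This is a minimal illustration, not the implementation: `buildValidationSection` and the `analyze` callback are hypothetical names standing in for the Document Analyzer and prompt-builder stages, and the point is that only summaries, never raw content, reach the prompt.

```python
import json
from typing import Any, Callable, Dict, List

def buildValidationSection(documents: List[Any],
                           analyze: Callable[[Any], Dict[str, Any]]) -> str:
    """Run every document through the analyzer, then embed only the
    resulting summaries in the prompt section - full document content
    never reaches the AI validator."""
    summaries = [analyze(doc) for doc in documents]
    header = f"DELIVERED DOCUMENTS ({len(summaries)} items):"
    return header + "\n" + json.dumps(summaries, indent=2)
```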
REQUIRED CHANGES
1. Remove All Hardcoded Checks from modeReact.py
- ❌ Remove document name checks
- ❌ Remove mimeType-specific logic
- ❌ Remove structure-specific checks
- ✅ Pass ALL documents to validator (let validator decide what to validate)
- ✅ Keep it simple: `validationDocs = result.documents`
2. Redesign contentValidator.py - New Structure
New Method: _analyzeDocuments(documents)
def _analyzeDocuments(self, documents: List[Any]) -> List[Dict[str, Any]]:
"""
Generic document analysis - extract metadata and create summaries.
Returns list of document summaries ready for validation prompt.
"""
summaries = []
for doc in documents:
summary = {
"name": getattr(doc, 'documentName', 'Unknown'),
"mimeType": getattr(doc, 'mimeType', 'unknown'),
"format": self._detectFormat(doc),
"size": self._calculateSize(doc),
"type": self._detectContentType(doc), # text/binary/base64/structured
"preview": self._createPreview(doc), # None for binary, sample for large text
"isAccessible": self._isContentAccessible(doc) # Can we read content?
}
summaries.append(summary)
return summaries
New Method: _detectFormat(doc)
- Extract from filename extension or mimeType
- Generic mapping: `text/plain` → txt, `text/markdown` → md, etc.
- Return format string (txt, md, pdf, docx, json, etc.)
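A minimal standalone sketch of this helper, assuming extension takes priority over mimeType; the mapping table is illustrative, not exhaustive:

```python
from typing import Dict

# Illustrative mimeType → format mapping; extend as real inputs demand.
_MIME_TO_FORMAT: Dict[str, str] = {
    "text/plain": "txt",
    "text/markdown": "md",
    "text/html": "html",
    "application/json": "json",
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
}

def detectFormat(documentName: str, mimeType: str = "") -> str:
    """Prefer the filename extension; fall back to the mimeType mapping."""
    if "." in documentName:
        ext = documentName.rsplit(".", 1)[-1].lower().strip()
        if ext:
            return ext
    return _MIME_TO_FORMAT.get(mimeType.lower(), "unknown")
```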
New Method: _calculateSize(doc)
- Calculate document size in bytes
- Handle string, dict, list, bytes, base64
- Return: `{"bytes": int, "readable": "1.5 MB"}`
New Method: _detectContentType(doc)
- text: Readable text content
- structured: JSON/dict/list structures
- binary: Binary data (PDF, images, etc.)
- base64: Base64-encoded data
- Return content type string
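One possible detection order, sketched below: Python types settle the structured and binary cases directly, while base64 needs a heuristic (long strings that round-trip through strict decoding), which can misclassify text that happens to be valid base64:

```python
import base64
import binascii
from typing import Any

def detectContentType(content: Any) -> str:
    """Classify content as structured / binary / base64 / text."""
    if isinstance(content, (dict, list)):
        return "structured"
    if isinstance(content, bytes):
        return "binary"
    if isinstance(content, str):
        stripped = content.strip()
        # Heuristic: long strings with base64-valid length and alphabet.
        if len(stripped) > 64 and len(stripped) % 4 == 0:
            try:
                base64.b64decode(stripped, validate=True)
                return "base64"
            except (binascii.Error, ValueError):
                pass
        return "text"
    return "text"
```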
New Method: _createPreview(doc)
- Binary/Base64: Return `None` (metadata only)
- Large text (>50KB): Return first 1KB + size indicator
- Small text (≤50KB): Return full content
- Structured data: Return JSON string (truncated if large)
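The preview rules above can be sketched with the thresholds pulled out as module-level constants (making the 50KB limit configurable, per the design decisions below):

```python
import json
from typing import Any, Optional

PREVIEW_THRESHOLD = 50 * 1024  # full content allowed below this size
PREVIEW_SAMPLE = 1024          # sample size for oversized text

def createPreview(content: Any, contentType: str) -> Optional[str]:
    """Return a prompt-safe preview: None for binary/base64,
    a truncated sample for large text, full content otherwise."""
    if contentType in ("binary", "base64"):
        return None
    if contentType == "structured":
        text = json.dumps(content, default=str)
    else:
        text = str(content)
    if len(text) > PREVIEW_THRESHOLD:
        return text[:PREVIEW_SAMPLE] + f"... [truncated, {len(text)} chars total]"
    return text
```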
New Method: _isContentAccessible(doc)
- Check if document content can be extracted for validation
- Binary/base64 documents: `False` (validate by metadata only)
- Text/structured documents: `True`
3. Redesign Validation Prompt (Generic)
validationPrompt = f"""TASK VALIDATION
USER REQUEST: '{intent.get('primaryGoal', 'Unknown')}'
EXPECTED DATA TYPE: {intent.get('dataType', 'unknown')}
EXPECTED FORMAT: {intent.get('expectedFormat', 'unknown')}
SUCCESS CRITERIA ({criteriaCount} items): {successCriteria}
DELIVERED DOCUMENTS ({len(documentSummaries)} items):
{json.dumps(documentSummaries, indent=2)}
VALIDATION RULES:
1. Check if delivered documents match expected data type
2. Check if delivered formats are compatible with expected format
(Note: text formats like txt/md are compatible; pdf ≠ docx but both are documents)
3. Verify each success criterion is met based on document content/metadata
4. Check document sizes are reasonable for the task
5. Rate overall quality (0.0-1.0)
6. Identify specific gaps
7. Suggest next steps
OUTPUT FORMAT - JSON ONLY (no prose):
{{
"overallSuccess": false,
"qualityScore": 0.0,
"dataTypeMatch": false,
"formatMatch": false,
"documentCount": {len(documentSummaries)},
"successCriteriaMet": {[False] * criteriaCount},
"gapAnalysis": "Specific gaps found",
"improvementSuggestions": ["NEXT STEP: Action 1"],
"validationDetails": [
{{
"documentName": "document.ext",
"issues": ["Issue 1"],
"suggestions": ["NEXT STEP: Fix 1"]
}}
]
}}
"""
4. Format Validation Logic (Generic & Flexible)
def _isFormatCompatible(self, deliveredFormat: str, expectedFormat: str) -> bool:
"""
Generic format compatibility check.
- txt/md/html are text formats (compatible with each other)
- pdf/docx/xlsx are document formats (not compatible with each other)
- json/xml are structured formats
- images are image formats
"""
# Text formats are interchangeable
textFormats = ['txt', 'md', 'html', 'text', 'plain']
if deliveredFormat.lower() in textFormats and expectedFormat.lower() in textFormats:
return True
# Exact match
if deliveredFormat.lower() == expectedFormat.lower():
return True
# Structured formats
if deliveredFormat.lower() in ['json', 'xml'] and expectedFormat.lower() in ['json', 'xml']:
return True # Could be made more flexible
return False
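As a sanity check, the same rules can be exercised as a standalone function with set-based groups; the smoke tests mirror the Phase 5 cases below:

```python
# Standalone version of the compatibility rules from _isFormatCompatible,
# restated here only so the smoke tests are self-contained.
TEXT_FORMATS = {"txt", "md", "html", "text", "plain"}
STRUCTURED_FORMATS = {"json", "xml"}

def isFormatCompatible(delivered: str, expected: str) -> bool:
    d, e = delivered.lower(), expected.lower()
    if d == e:
        return True                                   # exact match
    if d in TEXT_FORMATS and e in TEXT_FORMATS:
        return True                                   # text formats interchange
    if d in STRUCTURED_FORMATS and e in STRUCTURED_FORMATS:
        return True                                   # structured formats grouped
    return False

assert isFormatCompatible("md", "txt")        # text formats interchangeable
assert isFormatCompatible("pdf", "pdf")       # exact match
assert not isFormatCompatible("pdf", "docx")  # document formats are strict
assert isFormatCompatible("json", "xml")      # structured formats grouped
```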
IMPLEMENTATION PLAN
Phase 1: Clean Up modeReact.py
- Remove all hardcoded checks
- Simply pass `result.documents` to the validator
Phase 2: Redesign Document Analysis
- Implement `_analyzeDocuments()`
- Implement helper methods: `_detectFormat()`, `_calculateSize()`, `_detectContentType()`, `_createPreview()`
Phase 3: Redesign Validation Prompt
- Generic prompt with document summaries
- Include metadata, not full content
- Size-aware handling
Phase 4: Implement Format Validation
- Generic format compatibility logic
- Flexible matching (text formats, document formats, etc.)
Phase 5: Testing
- Test with text documents (small & large)
- Test with binary documents (PDF, images)
- Test with base64 documents
- Test with structured data (JSON)
KEY DESIGN DECISIONS
- Pass ALL documents: Validator decides what to validate, not the caller
- Metadata over content: For large/binary documents, pass metadata only
- Preview samples: For large text documents, pass preview + size info
- Generic prompts: No task-specific or format-specific logic
- Flexible format matching: Text formats compatible, document formats strict
- Size limits: 50KB threshold for full content (configurable)
- Content type detection: Explicit type detection (text/binary/base64/structured)
BENEFITS OF TARGET DESIGN
✅ Generic: Works with any document type without hardcoding
✅ Scalable: Handles large documents without issues
✅ Flexible: Format validation is flexible where appropriate
✅ Maintainable: Clear separation of concerns
✅ Robust: Handles edge cases (binary, base64, large files)
✅ Testable: Each component can be tested independently