# Content Validator - Deep Analysis & Target Design

## CURRENT STATE ANALYSIS

### How Validator Currently Works

#### 1. **Document Input Flow**

```
ActionResult.documents (List[ActionDocument])
  → modeReact.py extracts "structured content" with hardcoded checks
  → Creates SimpleNamespace objects with wrapped documentData
  → Passes to ContentValidator.validateContent()
```

#### 2. **Current Problems in modeReact.py (Lines 99-136)**

- ❌ **Hardcoded document name checks**: `docName == "structured_content.json"`
- ❌ **Hardcoded mimeType checks**: `mimeType == "application/json"`
- ❌ **Hardcoded structure checks**: `'content' in docData or 'documents' in docData or 'sections' in docData`
- ❌ **Single document selection**: `break` after first match - ignores other documents
- ❌ **Non-generic logic**: Specific to certain document structures
- ❌ **Workaround approach**: Trying to find structured content in various ways

#### 3. **Current Problems in contentValidator.py**

**`_extractContent()` method (Lines 21-41)**:

- ❌ **Inconsistent handling**: Checks for `dict with 'content'` but then also handles raw `data`
- ❌ **Silent failures**: Returns empty string on any exception
- ❌ **Size limit hardcoded**: 10KB threshold is arbitrary
- ❌ **No format awareness**: Doesn't check if document is binary/base64 before extracting
- ❌ **No document type detection**: Doesn't distinguish text vs binary vs structured data

**`_validateWithAI()` method (Lines 60-200)**:

- ❌ **Forces all content to string**: `content[:2000]` truncation assumes text
- ❌ **No document metadata passed**: Only name and content - no size, format, or mimeType info
- ❌ **No binary/base64 handling**: Will fail or show garbage for binary documents
- ❌ **Multiple JSON extraction strategies**: Indicates unreliable AI response parsing
- ❌ **Size limits inconsistent**: 10KB in extraction, 2KB in prompt - why different?
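To make the problems above concrete, the hardcoded selection flow in modeReact.py can be reconstructed roughly as follows. This is a sketch assembled from the bullet points, not the actual source; the loop shape and the helper name `selectValidationDocs` are assumptions:

```python
def selectValidationDocs(documents):
    """Reconstruction of the hardcoded selection pattern (illustrative only)."""
    selected = []
    for doc in documents:
        docName = getattr(doc, 'documentName', '')
        mimeType = getattr(doc, 'mimeType', '')
        docData = getattr(doc, 'documentData', None)
        # Hardcoded name + mimeType + structure checks, all specific to one shape
        if (docName == "structured_content.json"
                and mimeType == "application/json"
                and isinstance(docData, dict)
                and ('content' in docData or 'documents' in docData
                     or 'sections' in docData)):
            selected.append(doc)
            break  # stops at the first match - all other documents are ignored
    return selected
```

Every condition in this chain is a special case, and the `break` means a second matching document is never seen - which is exactly what the target design removes.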
#### 4. **Missing Capabilities**

- ❌ No document size reporting to validator
- ❌ No format validation (txt vs md vs pdf vs docx)
- ❌ No binary data handling (images, PDFs, etc.)
- ❌ No document count/summary statistics
- ❌ No distinction between document types for validation

---

## TARGET DESIGN

### Core Principles

1. **GENERIC**: No hardcoded document names, types, or structures
2. **DOCUMENT-AWARE**: Handle all document types (text, binary, base64, structured)
3. **SIZE-CONSCIOUS**: Never pass full large documents to AI
4. **METADATA-RICH**: Pass document metadata (name, size, format, mimeType) to validator
5. **FORMAT-FLEXIBLE**: Allow format flexibility (md ≈ text, but pdf ≠ docx)

### Target Architecture

```
Documents Input (List[ActionDocument])
  ↓
Document Analyzer (generic)
  - Extract metadata (name, size, mimeType, format)
  - Determine content type (text/binary/base64/structured)
  - Create preview/summary for large documents
  ↓
Document Summary (for AI validation)
  - Metadata only for binary/base64
  - Preview/sample for large text documents
  - Full content for small text/structured documents
  ↓
Validation Prompt Builder (generic)
  - Include document summaries (not full content)
  - Include document metadata
  - Include format validation rules (generic)
  ↓
AI Validator
  - Validates against task objective (generic)
  - Validates format compliance (flexible)
  - Validates document count/size
```

---

## REQUIRED CHANGES

### 1. **Remove All Hardcoded Checks from modeReact.py**

- ❌ Remove document name checks
- ❌ Remove mimeType-specific logic
- ❌ Remove structure-specific checks
- ✅ Pass ALL documents to validator (let validator decide what to validate)
- ✅ Keep it simple: `validationDocs = result.documents`

### 2. **Redesign contentValidator.py - New Structure**

#### New Method: `_analyzeDocuments(documents)`

```python
def _analyzeDocuments(self, documents: List[Any]) -> List[Dict[str, Any]]:
    """
    Generic document analysis - extract metadata and create summaries.

    Returns list of document summaries ready for validation prompt.
    """
    summaries = []
    for doc in documents:
        summary = {
            "name": getattr(doc, 'documentName', 'Unknown'),
            "mimeType": getattr(doc, 'mimeType', 'unknown'),
            "format": self._detectFormat(doc),
            "size": self._calculateSize(doc),
            "type": self._detectContentType(doc),  # text/binary/base64/structured
            "preview": self._createPreview(doc),  # None for binary, sample for large text
            "isAccessible": self._isContentAccessible(doc)  # Can we read content?
        }
        summaries.append(summary)
    return summaries
```

#### New Method: `_detectFormat(doc)`

- Extract from filename extension or mimeType
- Generic mapping: `text/plain` → `txt`, `text/markdown` → `md`, etc.
- Return format string (txt, md, pdf, docx, json, etc.)

#### New Method: `_calculateSize(doc)`

- Calculate document size in bytes
- Handle string, dict, list, bytes, base64
- Return: `{"bytes": int, "readable": "1.5 MB"}`

#### New Method: `_detectContentType(doc)`

- `text`: Readable text content
- `structured`: JSON/dict/list structures
- `binary`: Binary data (PDF, images, etc.)
- `base64`: Base64-encoded data
- Return content type string

#### New Method: `_createPreview(doc)`

- **Binary/Base64**: Return `None` (metadata only)
- **Large text (>50KB)**: Return first 1KB + size indicator
- **Small text (≤50KB)**: Return full content
- **Structured data**: Return JSON string (truncated if large)

#### New Method: `_isContentAccessible(doc)`

- Check if document content can be extracted for validation
- Binary/base64 documents: `False` (validate by metadata only)
- Text/structured documents: `True`
### 3. **Redesign Validation Prompt (Generic)**

```python
validationPrompt = f"""TASK VALIDATION

USER REQUEST: '{intent.get('primaryGoal', 'Unknown')}'
EXPECTED DATA TYPE: {intent.get('dataType', 'unknown')}
EXPECTED FORMAT: {intent.get('expectedFormat', 'unknown')}

SUCCESS CRITERIA ({criteriaCount} items):
{successCriteria}

DELIVERED DOCUMENTS ({len(documentSummaries)} items):
{json.dumps(documentSummaries, indent=2)}

VALIDATION RULES:
1. Check if delivered documents match expected data type
2. Check if delivered formats are compatible with expected format
   (Note: text formats like txt/md are compatible; pdf ≠ docx but both are documents)
3. Verify each success criterion is met based on document content/metadata
4. Check document sizes are reasonable for the task
5. Rate overall quality (0.0-1.0)
6. Identify specific gaps
7. Suggest next steps

OUTPUT FORMAT - JSON ONLY (no prose):
{{
    "overallSuccess": false,
    "qualityScore": 0.0,
    "dataTypeMatch": false,
    "formatMatch": false,
    "documentCount": {len(documentSummaries)},
    "successCriteriaMet": {[False] * criteriaCount},
    "gapAnalysis": "Specific gaps found",
    "improvementSuggestions": ["NEXT STEP: Action 1"],
    "validationDetails": [
        {{
            "documentName": "document.ext",
            "issues": ["Issue 1"],
            "suggestions": ["NEXT STEP: Fix 1"]
        }}
    ]
}}
"""
```

### 4. **Format Validation Logic (Generic & Flexible)**

```python
def _isFormatCompatible(self, deliveredFormat: str, expectedFormat: str) -> bool:
    """
    Generic format compatibility check.

    - txt/md/html are text formats (compatible with each other)
    - pdf/docx/xlsx are document formats (not compatible with each other)
    - json/xml are structured formats
    - images are image formats
    """
    # Text formats are interchangeable
    textFormats = ['txt', 'md', 'html', 'text', 'plain']
    if deliveredFormat.lower() in textFormats and expectedFormat.lower() in textFormats:
        return True

    # Exact match
    if deliveredFormat.lower() == expectedFormat.lower():
        return True

    # Structured formats
    if deliveredFormat.lower() in ['json', 'xml'] and expectedFormat.lower() in ['json', 'xml']:
        return True  # Could be made more flexible

    return False
```

---

## IMPLEMENTATION PLAN

### Phase 1: Clean Up modeReact.py
- Remove all hardcoded checks
- Simply pass `result.documents` to validator

### Phase 2: Redesign Document Analysis
- Implement `_analyzeDocuments()`
- Implement helper methods: `_detectFormat()`, `_calculateSize()`, `_detectContentType()`, `_createPreview()`

### Phase 3: Redesign Validation Prompt
- Generic prompt with document summaries
- Include metadata, not full content
- Size-aware handling

### Phase 4: Implement Format Validation
- Generic format compatibility logic
- Flexible matching (text formats, document formats, etc.)

### Phase 5: Testing
- Test with text documents (small & large)
- Test with binary documents (PDF, images)
- Test with base64 documents
- Test with structured data (JSON)

---

## KEY DESIGN DECISIONS

1. **Pass ALL documents**: Validator decides what to validate, not the caller
2. **Metadata over content**: For large/binary documents, pass metadata only
3. **Preview samples**: For large text documents, pass preview + size info
4. **Generic prompts**: No task-specific or format-specific logic
5. **Flexible format matching**: Text formats compatible, document formats strict
6. **Size limits**: 50KB threshold for full content (configurable)
7. **Content type detection**: Explicit type detection (text/binary/base64/structured)

---

## BENEFITS OF TARGET DESIGN

- ✅ **Generic**: Works with any document type without hardcoding
- ✅ **Scalable**: Handles large documents without issues
- ✅ **Flexible**: Format validation is flexible where appropriate
- ✅ **Maintainable**: Clear separation of concerns
- ✅ **Robust**: Handles edge cases (binary, base64, large files)
- ✅ **Testable**: Each component can be tested independently
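As a closing sanity check on the "Testable" claim, the format-compatibility rule from section 4 can be exercised standalone (copied here as a free function without the `self` parameter - the document-format strictness, e.g. pdf vs docx, falls out of the exact-match rule):

```python
def isFormatCompatible(deliveredFormat: str, expectedFormat: str) -> bool:
    """Standalone copy of _isFormatCompatible for quick testing."""
    # Text formats are interchangeable
    textFormats = ['txt', 'md', 'html', 'text', 'plain']
    if deliveredFormat.lower() in textFormats and expectedFormat.lower() in textFormats:
        return True
    # Exact match (covers pdf/pdf, docx/docx, image formats, etc.)
    if deliveredFormat.lower() == expectedFormat.lower():
        return True
    # Structured formats
    if deliveredFormat.lower() in ['json', 'xml'] and expectedFormat.lower() in ['json', 'xml']:
        return True
    return False

# Text formats are interchangeable, document formats are strict:
# isFormatCompatible('md', 'txt')   → True
# isFormatCompatible('pdf', 'docx') → False
# isFormatCompatible('PDF', 'pdf')  → True (case-insensitive exact match)
```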