gateway/modules/workflows/processing/adaptive/VALIDATOR_ANALYSIS.md


Content Validator - Deep Analysis & Target Design

CURRENT STATE ANALYSIS

How Validator Currently Works

1. Document Input Flow

ActionResult.documents (List[ActionDocument]) 
  → modeReact.py extracts "structured content" with hardcoded checks
  → Creates SimpleNamespace objects with wrapped documentData
  → Passes to ContentValidator.validateContent()

2. Current Problems in modeReact.py (Lines 99-136)

  • Hardcoded document name checks: docName == "structured_content.json"
  • Hardcoded mimeType checks: mimeType == "application/json"
  • Hardcoded structure checks: 'content' in docData or 'documents' in docData or 'sections' in docData
  • Single document selection: breaks after the first match, ignoring all other documents
  • Non-generic logic: Specific to certain document structures
  • Workaround approach: relies on ad-hoc heuristics to locate structured content rather than a single, principled extraction path

3. Current Problems in contentValidator.py

_extractContent() method (Lines 21-41):

  • Inconsistent handling: Checks for dict with 'content' but then also handles raw data
  • Silent failures: Returns empty string on any exception
  • Size limit hardcoded: 10KB threshold is arbitrary
  • No format awareness: Doesn't check if document is binary/base64 before extracting
  • No document type detection: Doesn't distinguish text vs binary vs structured data

_validateWithAI() method (Lines 60-200):

  • Forces all content to string: content[:2000] truncation assumes text
  • No document metadata passed: Only name and content, no size, format, mimeType info
  • No binary/base64 handling: Will fail or show garbage for binary documents
  • Multiple JSON extraction strategies: Indicates unreliable AI response parsing
  • Size limits inconsistent: 10KB in extraction, 2KB in prompt - why different?

4. Missing Capabilities

  • No document size reporting to validator
  • No format validation (txt vs md vs pdf vs docx)
  • No binary data handling (images, PDFs, etc.)
  • No document count/summary statistics
  • No distinction between document types for validation

TARGET DESIGN

Core Principles

  1. GENERIC: No hardcoded document names, types, or structures
  2. DOCUMENT-AWARE: Handle all document types (text, binary, base64, structured)
  3. SIZE-CONSCIOUS: Never pass full large documents to AI
  4. METADATA-RICH: Pass document metadata (name, size, format, mimeType) to validator
  5. FORMAT-FLEXIBLE: Treat compatible formats as interchangeable (md ≈ txt, but pdf ≠ docx)

Target Architecture

Documents Input (List[ActionDocument])
  ↓
Document Analyzer (generic)
  - Extract metadata (name, size, mimeType, format)
  - Determine content type (text/binary/base64/structured)
  - Create preview/summary for large documents
  ↓
Document Summary (for AI validation)
  - Metadata only for binary/base64
  - Preview/sample for large text documents
  - Full content for small text/structured documents
  ↓
Validation Prompt Builder (generic)
  - Include document summaries (not full content)
  - Include document metadata
  - Include format validation rules (generic)
  ↓
AI Validator
  - Validates against task objective (generic)
  - Validates format compliance (flexible)
  - Validates document count/size

REQUIRED CHANGES

1. Remove All Hardcoded Checks from modeReact.py

  • Remove document name checks
  • Remove mimeType-specific logic
  • Remove structure-specific checks
  • Pass ALL documents to validator (let validator decide what to validate)
  • Keep it simple: validationDocs = result.documents

2. Redesign contentValidator.py - New Structure

New Method: _analyzeDocuments(documents)

# requires: from typing import Any, Dict, List
def _analyzeDocuments(self, documents: List[Any]) -> List[Dict[str, Any]]:
    """
    Generic document analysis - extract metadata and create summaries.
    Returns list of document summaries ready for validation prompt.
    """
    summaries = []
    for doc in documents:
        summary = {
            "name": getattr(doc, 'documentName', 'Unknown'),
            "mimeType": getattr(doc, 'mimeType', 'unknown'),
            "format": self._detectFormat(doc),
            "size": self._calculateSize(doc),
            "type": self._detectContentType(doc),  # text/binary/base64/structured
            "preview": self._createPreview(doc),  # None for binary, sample for large text
            "isAccessible": self._isContentAccessible(doc)  # Can we read content?
        }
        summaries.append(summary)
    return summaries

New Method: _detectFormat(doc)

  • Extract from filename extension or mimeType
  • Generic mapping: text/plain → txt, text/markdown → md, etc.
  • Return format string (txt, md, pdf, docx, json, etc.)
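One possible sketch of `_detectFormat` as a standalone helper; the attribute handling and the MIME table are assumptions for illustration, not the final mapping:

```python
import os

# Hypothetical MIME → format table; extend as needed.
MIME_TO_FORMAT = {
    "text/plain": "txt",
    "text/markdown": "md",
    "text/html": "html",
    "application/json": "json",
    "application/pdf": "pdf",
}

def detectFormat(documentName: str, mimeType: str) -> str:
    """Derive a short format string from the filename extension,
    falling back to the mimeType table, then to 'unknown'."""
    ext = os.path.splitext(documentName or "")[1].lstrip(".").lower()
    if ext:
        return ext
    return MIME_TO_FORMAT.get((mimeType or "").lower(), "unknown")
```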

New Method: _calculateSize(doc)

  • Calculate document size in bytes
  • Handle string, dict, list, bytes, base64
  • Return: {"bytes": int, "readable": "1.5 MB"}
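A sketch of `_calculateSize` under the assumption that structured data is measured by its JSON serialization (base64 strings are simply measured as text here):

```python
import json

def calculateSize(data) -> dict:
    """Byte size for str/bytes/dict/list payloads, plus a human-readable string."""
    if isinstance(data, bytes):
        size = len(data)
    elif isinstance(data, str):
        size = len(data.encode("utf-8"))
    elif isinstance(data, (dict, list)):
        size = len(json.dumps(data).encode("utf-8"))
    else:
        size = len(str(data).encode("utf-8"))
    # Scale into the largest unit under 1024.
    value, unit = float(size), "B"
    for nextUnit in ("KB", "MB", "GB"):
        if value < 1024:
            break
        value /= 1024.0
        unit = nextUnit
    readable = f"{size} B" if unit == "B" else f"{value:.1f} {unit}"
    return {"bytes": size, "readable": readable}
```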

New Method: _detectContentType(doc)

  • text: Readable text content
  • structured: JSON/dict/list structures
  • binary: Binary data (PDF, images, etc.)
  • base64: Base64-encoded data
  • Return content type string
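A possible heuristic for `_detectContentType`; the base64 check (length, alphabet, round-trip decode) is an assumption and will occasionally misclassify short text that happens to be valid base64:

```python
import base64
import binascii

def detectContentType(data) -> str:
    """Classify a payload as structured / binary / base64 / text (sketch)."""
    if isinstance(data, (dict, list)):
        return "structured"
    if isinstance(data, bytes):
        return "binary"
    if isinstance(data, str):
        stripped = data.strip()
        # Heuristic: base64 strings are reasonably long, a multiple of 4
        # characters, and decode cleanly under strict validation.
        if len(stripped) > 16 and len(stripped) % 4 == 0:
            try:
                base64.b64decode(stripped, validate=True)
                return "base64"
            except (binascii.Error, ValueError):
                pass
        return "text"
    return "text"
```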

New Method: _createPreview(doc)

  • Binary/Base64: Return None (metadata only)
  • Large text (>50KB): Return first 1KB + size indicator
  • Small text (≤50KB): Return full content
  • Structured data: Return JSON string (truncated if large)
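The preview policy above can be sketched as follows (thresholds mirror the 50KB/1KB figures in this document and would be configurable):

```python
import json

PREVIEW_LIMIT = 1024            # sample size for large text
FULL_CONTENT_LIMIT = 50 * 1024  # 50KB threshold from the design above

def createPreview(data, contentType: str):
    """None for binary/base64; full content for small text/structured
    payloads; a 1KB sample plus a size indicator for anything larger."""
    if contentType in ("binary", "base64"):
        return None
    if contentType == "structured":
        text = json.dumps(data, indent=2)
    else:
        text = str(data)
    if len(text) <= FULL_CONTENT_LIMIT:
        return text
    return f"{text[:PREVIEW_LIMIT]}... [truncated, total {len(text)} chars]"
```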

New Method: _isContentAccessible(doc)

  • Check if document content can be extracted for validation
  • Binary/base64 documents: False (validate by metadata only)
  • Text/structured documents: True

3. Redesign Validation Prompt (Generic)

validationPrompt = f"""TASK VALIDATION

USER REQUEST: '{intent.get('primaryGoal', 'Unknown')}'
EXPECTED DATA TYPE: {intent.get('dataType', 'unknown')}
EXPECTED FORMAT: {intent.get('expectedFormat', 'unknown')}
SUCCESS CRITERIA ({criteriaCount} items): {successCriteria}

DELIVERED DOCUMENTS ({len(documentSummaries)} items):
{json.dumps(documentSummaries, indent=2)}

VALIDATION RULES:
1. Check if delivered documents match expected data type
2. Check if delivered formats are compatible with expected format
   (Note: text formats like txt/md are compatible; pdf ≠ docx but both are documents)
3. Verify each success criterion is met based on document content/metadata
4. Check document sizes are reasonable for the task
5. Rate overall quality (0.0-1.0)
6. Identify specific gaps
7. Suggest next steps

OUTPUT FORMAT - JSON ONLY (no prose):
{{
  "overallSuccess": false,
  "qualityScore": 0.0,
  "dataTypeMatch": false,
  "formatMatch": false,
  "documentCount": {len(documentSummaries)},
  "successCriteriaMet": {[False] * criteriaCount},
  "gapAnalysis": "Specific gaps found",
  "improvementSuggestions": ["NEXT STEP: Action 1"],
  "validationDetails": [
    {{
      "documentName": "document.ext",
      "issues": ["Issue 1"],
      "suggestions": ["NEXT STEP: Fix 1"]
    }}
  ]
}}
"""

4. Format Validation Logic (Generic & Flexible)

def _isFormatCompatible(self, deliveredFormat: str, expectedFormat: str) -> bool:
    """
    Generic format compatibility check.
    - txt/md/html are text formats (compatible with each other)
    - pdf/docx/xlsx are document formats (not compatible with each other)
    - json/xml are structured formats
    - images are image formats
    """
    # Text formats are interchangeable
    textFormats = ['txt', 'md', 'html', 'text', 'plain']
    if deliveredFormat.lower() in textFormats and expectedFormat.lower() in textFormats:
        return True
    
    # Exact match
    if deliveredFormat.lower() == expectedFormat.lower():
        return True
    
    # Structured formats
    if deliveredFormat.lower() in ['json', 'xml'] and expectedFormat.lower() in ['json', 'xml']:
        return True  # Could be made more flexible
    
    return False
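A quick check of the compatibility rules (standalone copy of the method above, detached from the class for demonstration):

```python
def isFormatCompatible(deliveredFormat: str, expectedFormat: str) -> bool:
    """Standalone version of _isFormatCompatible for illustration."""
    textFormats = ['txt', 'md', 'html', 'text', 'plain']
    d, e = deliveredFormat.lower(), expectedFormat.lower()
    if d in textFormats and e in textFormats:
        return True   # text formats are interchangeable
    if d == e:
        return True   # exact match
    if d in ['json', 'xml'] and e in ['json', 'xml']:
        return True   # structured formats, could be made stricter
    return False

# md delivered when txt expected: compatible (both are text formats)
# pdf delivered when docx expected: incompatible (strict document formats)
```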

IMPLEMENTATION PLAN

Phase 1: Clean Up modeReact.py

  • Remove all hardcoded checks
  • Simply pass result.documents to validator

Phase 2: Redesign Document Analysis

  • Implement _analyzeDocuments()
  • Implement helper methods: _detectFormat(), _calculateSize(), _detectContentType(), _createPreview()

Phase 3: Redesign Validation Prompt

  • Generic prompt with document summaries
  • Include metadata, not full content
  • Size-aware handling

Phase 4: Implement Format Validation

  • Generic format compatibility logic
  • Flexible matching (text formats, document formats, etc.)

Phase 5: Testing

  • Test with text documents (small & large)
  • Test with binary documents (PDF, images)
  • Test with base64 documents
  • Test with structured data (JSON)

KEY DESIGN DECISIONS

  1. Pass ALL documents: Validator decides what to validate, not the caller
  2. Metadata over content: For large/binary documents, pass metadata only
  3. Preview samples: For large text documents, pass preview + size info
  4. Generic prompts: No task-specific or format-specific logic
  5. Flexible format matching: Text formats compatible, document formats strict
  6. Size limits: 50KB threshold for full content (configurable)
  7. Content type detection: Explicit type detection (text/binary/base64/structured)

BENEFITS OF TARGET DESIGN

  • Generic: Works with any document type without hardcoding
  • Scalable: Handles large documents without issues
  • Flexible: Format validation is flexible where appropriate
  • Maintainable: Clear separation of concerns
  • Robust: Handles edge cases (binary, base64, large files)
  • Testable: Each component can be tested independently