wiki/implementation/implementation_content_handling_done.md

86 KiB

Implementation Plan: Content Handling Architecture Migration

Overview

This document provides a detailed implementation plan for migrating to the target architecture for content extraction and document generation. The plan focuses on:

  • Documents and Content Handling: Intelligent merging of documentList and contentParts with deduplication
  • Output Document Formats: Per-document format determination (not global) - AI determines formats from user prompt, multiple documents can have different formats
  • Languages Handling: Per-document language determination (not global) - uses validated currentUserLanguage infrastructure
  • Clear Handover States: Defined validation at each phase boundary using existing infrastructure
  • Structure Filling: Two prompt types (with content vs. without content)

Verified Infrastructure (Ready to Use)

The following infrastructure already exists and can be reused:

  • Language Validation: currentUserLanguage is validated at workflowManager.py:695-727 - always valid 2-character ISO code (validates AI response, falls back to user language, then "en"). Safe to use via self.services.currentUserLanguage or _getUserLanguage() method.

  • Format Validation: Renderer registry exists at mainServiceGeneration.py:529 (_getFormatRenderer() uses getRenderer()). Can be imported: from modules.services.serviceGeneration.renderers.registry import getRenderer. Returns None if format invalid, falls back to text renderer.

  • Language Extraction: _getDocumentLanguage() works correctly at subStructureFilling.py:349 - extracts per-document language from structure. Used properly during section generation.

Context

This implementation plan is based on the analysis documented in:

  • gateway/modules/services/serviceAi/CONTENT_EXTRACTION_ANALYSIS.md (Section 9.3: Target State)

The target architecture addresses architectural issues identified in the current implementation:

  1. Single extraction path in AI service (no duplication in ai.process)
  2. Intelligent merging of contentParts and documentList with deduplication
  3. Clear separation of concerns: action layer delegates to service layer
  4. Consistent behavior across all code paths
  5. Per-document format/language determination (not global)

1. Overview: Major Phases and Handover States

Phase Flow Diagram

┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 1: Document Intent Clarification                            │
│ ────────────────────────────────────────────────────────────────── │
│ INPUT:                                                              │
│   - userPrompt: str (fenced)                                        │
│   - documentList: DocumentReferenceList (optional)                 │
│   - contentParts: List[ContentPart] (optional)                     │
│   - actionParameters: Dict (outputFormat, language, etc.)          │
│                                                                     │
│ THROUGHPUT:                                                         │
│   1. Resolve documents from documentList                           │
│   2. Identify pre-extracted JSON documents                         │
│      - Check if JSON contains ContentExtracted structure            │
│      - Map pre-extracted JSONs to original documents               │
│   3. Filter out original documents covered by pre-extracted        │
│   4. AI analyzes document purposes                                 │
│   5. Map intents back to JSON doc IDs (if applicable)              │
│                                                                     │
│ OUTPUT:                                                             │
│   - documentIntents: List[DocumentIntent]                           │
│     * documentId: str                                              │
│     * intents: List[str] (["extract", "render", "reference"])     │
│     * extractionPrompt: str (optional)                              │
│     * reasoning: str                                                │
│     Note: outputFormat and language are NOT determined here - │
│           they're determined in Phase 3 (Structure Generation)     │
│                                                                     │
│ HANDOVER STATE:                                                     │
│   - documentIntents: Complete intent analysis                      │
│   - documents: Resolved ChatDocuments                              │
│   - preExtractedMapping: Map[originalDocId, jsonDocId]             │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 2: Content Extraction and Preparation                         │
│ ────────────────────────────────────────────────────────────────── │
│ INPUT:                                                              │
│   - documents: List[ChatDocument]                                   │
│   - documentIntents: List[DocumentIntent]                          │
│   - contentParts: List[ContentPart] (optional, pre-extracted)      │
│   - preExtractedMapping: Map[originalDocId, jsonDocId]            │
│                                                                     │
│ THROUGHPUT:                                                         │
│   1. Process pre-extracted JSON documents → ContentParts           │
│      - Extract ContentParts from JSON (not treat as regular JSON)  │
│      - Apply intents (extract, render, reference)                   │
│      - Mark with isPreExtracted=True                                │
│   2. RAW extraction (NO AI) for regular documents                 │
│      - Extract content using extraction service                    │
│      - Create ContentParts with metadata                           │
│   3. Merge all ContentParts                                        │
│      - Pre-extracted parts (from JSON documents)                    │
│      - Extracted parts (from regular documents)                    │
│      - Provided parts (from contentParts parameter)                │
│   4. Apply intents to ContentParts (extract, render, reference)   │
│   5. Mark images for Vision AI extraction (deferred)              │
│                                                                     │
│ OUTPUT:                                                             │
│   - finalContentParts: List[ContentPart]                           │
│     * id: str                                                       │
│     * typeGroup: str                                                │
│     * mimeType: str                                                 │
│     * data: Union[str, bytes]                                       │
│     * metadata: Dict                                                │
│       - documentId: str                                             │
│       - contentFormat: str ("extracted", "object", "reference")   │
│       - intent: str                                                 │
│       - needsVisionExtraction: bool (for images)                   │
│       - extractionPrompt: str (for Vision AI)                       │
│       - originalFileName: str                                       │
│       - isPreExtracted: bool                                        │
│       Note: outputFormat and language are NOT propagated here - │
│             they're determined in Phase 3 (Structure Generation)   │
│                                                                     │
│ HANDOVER STATE:                                                     │
│   - finalContentParts: Complete merged list                        │
│     - All pre-extracted JSON documents processed → ContentParts   │
│     - All regular documents extracted → ContentParts               │
│     - All provided contentParts merged                             │
│   - All documents processed (extracted or pre-extracted)          │
│   - Vision AI extraction deferred to Phase 4                      │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 3: Structure Generation                                       │
│ ────────────────────────────────────────────────────────────────── │
│ INPUT:                                                              │
│   - userPrompt: str                                                 │
│   - finalContentParts: List[ContentPart]                           │
│   - outputFormat: Optional[str] (optional fallback, defaults to "txt") │
│   - currentUserLanguage: str (always valid, validated during user intention analysis) │
│     * From: self.services.currentUserLanguage (always valid, validated during user intention analysis) │
│                                                                     │
│ THROUGHPUT:                                                         │
│   1. Group ContentParts by documentId (for context)                  │
│   2. AI generates structure with documents and chapters             │
│   3. AI determines per-document outputFormat in structure JSON     │
│      from user prompt → else optional outputFormat fallback (or "txt") │
│   4. AI determines per-document language in structure JSON         │
│      from user prompt → else validated currentUserLanguage (always valid) │
│   5. Assign ContentParts to chapters                                │
│                                                                     │
│ OUTPUT:                                                             │
│   - chapterStructure: Dict                                          │
│     * documents: List[Dict]                                         │
│       - id: str                                                     │
│       - title: str                                                  │
│       - outputFormat: str (per-document) ← NEW                    │
│       - language: str (per-document) ← NEW                         │
│       - chapters: List[Dict]                                        │
│         * id: str                                                   │
│         * level: int                                                │
│         * title: str                                                │
│         * generationHint: str                                       │
│         * contentParts: List[str] (ContentPart IDs)                │
│                                                                     │
│ HANDOVER STATE:                                                     │
│   - chapterStructure: Complete structure with ContentPart          │
│     assignments                                                     │
│   - Per-document format/language determined                         │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 4: Structure Filling                                          │
│ ────────────────────────────────────────────────────────────────── │
│ INPUT:                                                              │
│   - chapterStructure: Dict (with per-document language from Phase 3)│
│   - finalContentParts: List[ContentPart]                            │
│   - userPrompt: str                                                 │
│                                                                     │
│ THROUGHPUT:                                                         │
│   For each document (with per-document language):                   │
│     For each chapter:                                              │
│       1. Generate sections structure (parallel)                     │
│       2. For each section:                                         │
│          a. Extract per-document language from structure            │
│          b. Check if ContentParts need Vision AI extraction         │
│          c. If yes: Call Vision AI (Phase 2 deferred extraction)   │
│          d. Determine prompt type:                                  │
│             - WITH CONTENT: If contentParts assigned                │
│               → Use aggregation prompt (isAggregation=True)         │
│               → ContentParts passed as parameters                   │
│               → Use per-document language for generation            │
│             - WITHOUT CONTENT: If no contentParts                   │
│               → Use generation prompt (isAggregation=False)         │
│               → Only generationHint in prompt                       │
│               → Use per-document language for generation            │
│          e. Generate section content with AI                        │
│                                                                     │
│ OUTPUT:                                                             │
│   - filledStructure: Dict                                           │
│     * documents: List[Dict]                                         │
│       - language: str (preserved from input structure, per-document)│
│       - chapters: List[Dict]                                         │
│         * sections: List[Dict]                                      │
│           - id: str                                                 │
│           - content_type: str                                       │
│           - elements: List[Dict]                                    │
│             * type: str                                             │
│             * content: str (or base64 for images)                  │
│                                                                     │
│ HANDOVER STATE:                                                     │
│   - filledStructure: Complete content, ready for rendering         │
│   - Per-document language preserved from structure                  │
│   - All Vision AI extractions completed                            │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 5: Document Rendering                                        │
│ ────────────────────────────────────────────────────────────────── │
│ INPUT:                                                              │
│   - filledStructure: Dict                                           │
│   - per-document outputFormat (from Phase 3, determined from prompt) │
│   - per-document language (from Phase 3, validated currentUserLanguage) │
│                                                                     │
│ THROUGHPUT:                                                         │
│   1. Group sections by document (from structure)                   │
│   2. For each document:                                            │
│      a. Use per-document outputFormat                              │
│      b. Use per-document language                                  │
│      c. Render document in specified format                        │
│                                                                     │
│ OUTPUT:                                                             │
│   - renderedDocuments: List[DocumentData]                          │
│     * documentName: str                                             │
│     * documentData: bytes                                           │
│     * mimeType: str                                                 │
│                                                                     │
│ HANDOVER STATE:                                                     │
│   - renderedDocuments: Final output ready for user                 │
└─────────────────────────────────────────────────────────────────────┘

2. Detailed Implementation Steps

Step 1: Update DocumentIntent Model

File: gateway/modules/datamodels/datamodelExtraction.py

Changes:

class DocumentIntent(BaseModel):
    documentId: str
    intents: List[str]  # ["extract", "render", "reference"]
    extractionPrompt: Optional[str] = None
    # Note: outputFormat and language are NOT here - determined during 
    #       structure generation (Phase 3) in the chapter structure JSON
    reasoning: str

Rationale:

  • Intent clarification focuses on document purpose (extract, render, reference)
  • Output format and language are determined later during structure generation (Phase 3)
  • Structure generation has full context (user prompt, ContentParts, chapters) to determine format/language

Step 2: Update Intent Analysis Prompt

File: gateway/modules/services/serviceAi/subDocumentIntents.py

Changes:

  1. Add fencing around userPrompt (Security Fix):
def _buildIntentAnalysisPrompt(
    self,
    userPrompt: str,
    documents: List[ChatDocument],
    actionParameters: Dict[str, Any]
) -> str:
    # FENCE user input to prevent prompt injection
    fencedUserPrompt = f"""```user_request
{userPrompt}
```"""
    
    prompt = f"""USER REQUEST:
{fencedUserPrompt}

DOCUMENTS TO ANALYZE:
{docListText}

TASK: For each document, determine:
1. Intents (can be multiple): "extract", "render", "reference"
Note: Output format and language are NOT determined here - they will be 
      determined during structure generation (Phase 3) in the chapter structure JSON

OUTPUT FORMAT: {outputFormat} (global fallback - for reference only)

RETURN JSON:
{{
  "intents": [
    {{
      "documentId": "doc_1",
      "intents": ["extract"],
      "extractionPrompt": "Extract all text content",
      // Note: outputFormat and language are NOT here - determined during 
      //       structure generation in the chapter structure JSON
      "reasoning": "..."
    }}
  ]
}}
"""
  1. Remove global outputFormat from prompt (keep as fallback):
    • Output format should be determined per document based on intent
    • Global format remains as fallback if not specified per document

Step 3: Update ContentPart Metadata Propagation

File: gateway/modules/services/serviceAi/subContentExtraction.py

Changes:

async def extractAndPrepareContent(
    self,
    documents: List[ChatDocument],
    documentIntents: List[DocumentIntent],
    parentOperationId: str,
    getIntentForDocument: callable
) -> List[ContentPart]:
    # ... existing extraction logic ...
    
    # Note: outputFormat and language are NOT propagated here - they're determined
    #       during structure generation (Phase 3) in the chapter structure JSON
    # ContentParts are created with intent information only

Rationale:

  • ContentParts carry intent and extraction information only
  • Output format and language are determined during structure generation (Phase 3)
  • Structure generation has full context to make format/language decisions

Step 4: Update Structure Generation

File: gateway/modules/services/serviceAi/subStructureGeneration.py

Global Format Source Chain

Note: outputFormat parameter is optional. If omitted, formats are determined from user prompt by AI.

If outputFormat provided:

  1. Action parameters: action_parameters.get("outputFormat") or action_parameters.get("resultType")
  2. Passed to callAiContent(outputFormat=...)generateStructure(outputFormat=...) as parameter
  3. Used as fallback in State 3 validation if AI doesn't return format per document
  4. Final fallback: "txt" if global format is also missing/invalid

If outputFormat omitted:

  1. AI determines formats per document from user prompt
  2. Validation fallback: "txt" (if AI doesn't return format per document)

Rationale: With per-document format determination, AI can determine different formats for different documents based on user prompt. The outputFormat parameter is primarily a fallback for validation, not a requirement.

Language Source Chain

Note: currentUserLanguage is always valid (validated during user intention analysis).

  1. AI determines per-document language in structure JSON response
  2. If AI doesn't return language: Use validated currentUserLanguage (always valid, validated during user intention analysis)
  3. currentUserLanguage validation ensures:
    • AI response detectedLanguage is validated (2-character ISO code)
    • If AI didn't return language or invalid → uses user language (self.services.user.language)
    • If user language not set → uses "en"
    • Always safe to use directly without fallback logic

Changes:

  1. Make outputFormat optional in generateStructure method signature:
async def generateStructure(
    self,
    userPrompt: str,
    contentParts: List[ContentPart],
    outputFormat: Optional[str] = None,  # ← Optional: if omitted, formats determined from prompt by AI
    parentOperationId: str
) -> Dict[str, Any]:
    """
    Generate document structure with per-document format determination.
    
    Multiple documents can be produced with different formats (e.g., one PDF, one HTML).
    AI determines formats per-document from user prompt. The outputFormat parameter is 
    only a validation fallback - used if AI doesn't return format per document.
    
    Args:
        outputFormat: Optional global format fallback. If omitted, formats are determined 
                     from user prompt by AI. Used as validation fallback if AI doesn't 
                     return format per document. Defaults to "txt" if not provided.
    """
    # If outputFormat not provided, use "txt" as fallback for validation
    # AI will determine formats per document from user prompt
    if not outputFormat:
        outputFormat = "txt"
        logger.debug("outputFormat not provided - using 'txt' as validation fallback, formats determined from prompt")
    
    # Group ContentParts by documentId (for context in prompt)
    partsByDocument = {}
    for part in contentParts:
        docId = part.metadata.get("documentId", "default")
        if docId not in partsByDocument:
            partsByDocument[docId] = []
        partsByDocument[docId].append(part)
    
    # AI determines per-document format and language in structure JSON response
    # Pass global fallback for AI to use if not specified per document
    prompt = self._buildChapterStructurePrompt(
        userPrompt=userPrompt,
        contentParts=contentParts,
        outputFormat=outputFormat  # Fallback for validation (AI determines formats from prompt)
    )

Note:

  • outputFormat is optional. If omitted, formats are determined from user prompt by AI.
  • Used as validation fallback if AI doesn't return format per document.
  • User prompt language comes from self.services.currentUserLanguage which is validated during user intention analysis (workflowManager._sendFirstMessage()). The validation ensures:
    • AI response detectedLanguage is validated (2-character ISO code)
    • If AI didn't return language or invalid → uses user language (self.services.user.language)
    • If user language not set → uses "en"
    • currentUserLanguage is always valid and safe to use directly without fallback logic
  1. Update prompt to clarify format determination from prompt:
def _buildChapterStructurePrompt(
    self,
    userPrompt: str,
    contentParts: List[ContentPart],
    outputFormat: str  # Global fallback (for validation only)
) -> str:
    # Get language from services (validated currentUserLanguage infrastructure)
    language = self._getUserLanguage()  # Uses self.services.currentUserLanguage (always valid)
    
    # ... existing prompt building ...
    
    prompt += f"""
## OUTPUT FORMAT (per document)
- Each document can have its own output format (pdf, docx, html, etc.)
- **Determine the format for each document from the USER REQUEST above**
- Multiple documents can have different formats (e.g., one PDF, one HTML)
- Analyze user prompt to identify format requirements:
  * Explicit format mentions (e.g., "as PDF", "in Excel", "HTML document")
  * Document purpose (e.g., "spreadsheet" → xlsx, "presentation" → pptx)
  * Content type requirements
- If format cannot be determined from prompt, use fallback: "{outputFormat}" (for validation only)
- Include "outputFormat" field in each document in the JSON structure
- **CRITICAL**: Formats are determined from user prompt, not from the fallback value

## DOCUMENT LANGUAGE (per document)
- Each document can have its own language (ISO 639-1 code: "de", "en", "fr", etc.)
- Determine the language for each document based on:
  * User prompt language/context
  * Document content context
  * User's explicit language requirements
- If not specified, use validated currentUserLanguage: "{language}" (always valid, validated during user intention analysis)
- Include "language" field in each document in the JSON structure

EXAMPLE JSON STRUCTURE:
{{
  "documents": [
    {{
      "id": "doc_1",
      "title": "Document Title",
      "outputFormat": "pdf",  // ← Determined by AI from user prompt
      "language": "de",        // ← Determined by AI from user prompt
      "chapters": [...]
    }},
    {{
      "id": "doc_2",
      "title": "Another Document",
      "outputFormat": "html", // ← Different format for different document
      "language": "en",        // ← Different language for different document
      "chapters": [...]
    }}
  ]
}}
"""

Step 5: Update Structure Filling - Two Prompt Types

File: gateway/modules/services/serviceAi/subStructureFilling.py

Changes:

  1. Ensure two prompt types are used (already implemented, verify):
async def _fillSingleSection(
    self,
    section: Dict[str, Any],
    contentParts: List[ContentPart],
    userPrompt: str,
    generationHint: str,
    document: Dict[str, Any],  # ← NEW: Need document to get per-document language
    # ... other params ...
) -> List[Dict[str, Any]]:
    # Extract per-document language from structure
    # Language MUST be defined in structure (validated in State 3)
    # If missing, this is an error - should not happen after State 3 validation
    if "language" not in document:
        raise ValueError(f"Document {document.get('id')} missing 'language' field - should have been set in Phase 3 validation")
    
    docLanguage = document["language"]
    
    # Validate language format (should be 2-character ISO code)
    if not isinstance(docLanguage, str) or len(docLanguage) != 2:
        raise ValueError(f"Document {document.get('id')} has invalid language format: {docLanguage} - should be 2-character ISO 639-1 code")
    
    contentPartIds = section.get("contentPartIds", [])
    hasContentParts = len(contentPartIds) > 0
    
    if hasContentParts:
        # PROMPT TYPE 1: WITH CONTENT (Aggregation)
        # ContentParts passed as parameters, not in prompt text
        isAggregation = True
        relevantParts = [p for p in contentParts if p.id in contentPartIds]
        
        generationPrompt = self._buildSectionGenerationPrompt(
            section=section,
            contentParts=relevantParts,  # Passed as parameters
            userPrompt=userPrompt,
            generationHint=generationHint,
            isAggregation=True,  # ← Key flag
            language=docLanguage  # ← Per-document language from structure
        )
    else:
        # PROMPT TYPE 2: WITHOUT CONTENT (Generation)
        # Only generationHint in prompt, no ContentParts
        isAggregation = False
        
        generationPrompt = self._buildSectionGenerationPrompt(
            section=section,
            contentParts=[],  # Empty
            userPrompt=userPrompt,
            generationHint=generationHint,
            isAggregation=False,  # ← Key flag
            language=docLanguage  # ← Per-document language from structure
        )

Note: Language comes from the document in the structure (per-document), not a global parameter. Each document can have its own language as determined in Phase 3. The language MUST be defined and validated in Phase 3 (State 3 validation) - if missing here, it's an error.

  1. Verify _buildSectionGenerationPrompt handles both cases:
def _buildSectionGenerationPrompt(
    self,
    section: Dict[str, Any],
    contentParts: List[ContentPart],
    userPrompt: str,
    generationHint: str,
    isAggregation: bool,  # ← Determines prompt type
    language: str
) -> str:
    if isAggregation:
        # TYPE 1: WITH CONTENT
        # ContentParts are passed as parameters to AI call
        # Don't include full content in prompt text (token efficiency)
        prompt = f"""Generate content for section based on provided ContentParts.

Section: {sectionTitle}
Generation Hint: {generationHint}
Language: {language}

ContentParts are provided as parameters (not shown in prompt for efficiency).
Use the ContentParts data to generate the section content.
"""
    else:
        # TYPE 2: WITHOUT CONTENT
        # Only generationHint, no ContentParts
        prompt = f"""Generate content for section based on generation hint.

Section: {sectionTitle}
Generation Hint: {generationHint}
Language: {language}

Generate content based on the generation hint without referencing external content.
"""

Rationale:

  • Type 1 (with content): Efficient for large content (ContentParts as parameters)
  • Type 2 (without content): Simple generation based on hint only
  • Already implemented via isAggregation flag, verify it's used correctly

Step 6: Update Document Rendering

File: gateway/modules/services/serviceAi/mainServiceAi.py (renderResult method) File: gateway/modules/services/serviceGeneration/mainServiceGeneration.py (renderReport method)

Current Implementation:

  • renderResult() calls generationService.renderReport()
  • renderReport() already processes each document separately (line 385)
  • Currently checks doc.get("format", outputFormat) (line 397) - but should check outputFormat field
  • Language is not handled per-document

Changes:

  1. Update renderResult to pass language (from structure, validated before rendering):
async def renderResult(
    self,
    filledStructure: Dict[str, Any],
    outputFormat: str,  # Global fallback
    language: str,      # ← NEW: Add language parameter (global fallback)
    title: str,
    userPrompt: str,
    parentOperationId: str
) -> List[RenderedDocument]:
    """
    Render filled structure to documents.
    
    Per-document format and language are extracted from structure (validated in State 3).
    The outputFormat and language parameters are only used as global fallbacks.
    Multiple documents can have different formats and languages.
    """
    # Language comes from structure (per-document), validated in State 3
    # This parameter is only used as global fallback if structure validation fails
    # Use validated currentUserLanguage as fallback (always valid)
    if not language:
        language = self._getUserLanguage()  # Uses validated currentUserLanguage infrastructure
    
    # ... existing code ...
    
    renderedDocuments = await generationService.renderReport(
        filledStructure,
        outputFormat,
        language,  # ← Pass language (global fallback, per-document extracted in renderReport)
        title,
        userPrompt,
        self,
        parentOperationId=renderOperationId
    )

Note:

  • Language comes from structure (per-document) as determined in Phase 3
  • The language parameter here is only used as a global fallback
  • Per-document language is validated in State 3 (Structure Generation) and extracted from structure in renderReport()
  • Uses validated currentUserLanguage infrastructure if fallback needed
  1. Update renderReport to handle per-document format and language:
async def renderReport(
    self, 
    extractedContent: Dict[str, Any], 
    outputFormat: str,  # Global fallback
    language: str,      # ← NEW: Add language parameter (global fallback)
    title: str, 
    userPrompt: str = None, 
    aiService=None, 
    parentOperationId: Optional[str] = None
) -> List[RenderedDocument]:
    # ... existing validation ...
    
    # Process EACH document separately
    for docIndex, doc in enumerate(documents):
        # ... existing validation ...
        
        # Determine format for this document
        # Check outputFormat field first (per-document), then format field (legacy), then global fallback
        docFormat = doc.get("outputFormat") or doc.get("format") or outputFormat
        
        # Determine language for this document
        # Extract per-document language from structure (validated in State 3), fallback to global
        docLanguage = doc.get("language") or language
        
        # Validate language format (should be 2-character ISO code, validated in State 3)
        if not isinstance(docLanguage, str) or len(docLanguage) != 2:
            logger.warning(f"Document {doc.get('id')} has invalid language format: {docLanguage}, using fallback")
            docLanguage = language  # Use global fallback
        
        # Get renderer for this document's format (uses existing renderer registry)
        renderer = self._getFormatRenderer(docFormat)
        if not renderer:
            logger.warning(f"Unsupported format '{docFormat}' for document {doc.get('id', docIndex)}, skipping")
            continue
        
        # Create JSON structure with single document (preserving metadata)
        singleDocContent = {
            "metadata": {**metadata, "language": docLanguage},  # ← Add per-document language to metadata
            "documents": [doc]
        }
        
        # Render this document (can return multiple files, e.g., HTML + images)
        renderedDocs = await renderer.render(singleDocContent, docTitle, userPrompt, aiService)
        allRenderedDocuments.extend(renderedDocs)

Note:

  • Per-document format and language are extracted from structure (validated in State 3)
  • Renderers (RendererPdf, RendererHtml, etc.) receive the structure with language in metadata
  • They can use it for language-specific formatting if needed
  • Multiple documents can have different formats and languages

Step 7: Update ai.process to Pass documentList and Make outputFormat Optional

File: gateway/modules/workflows/methods/methodAi/actions/process.py

Changes:

# Phase 7.3: Pass both documentList and contentParts to AI service
# (Remove extraction logic from here - handled by AI service)

# resultType is optional - if omitted, formats determined from prompt by AI
# Default "txt" is validation fallback only
resultType = parameters.get("resultType")  # Optional: if None, formats determined from prompt
if resultType:
    normalized_result_type = (str(resultType).strip().lstrip('.').lower() or "txt")
    output_format = output_extension.replace('.', '') or 'txt'
else:
    # No format specified - AI will determine formats from prompt
    output_format = None
    logger.debug("resultType not provided - formats will be determined from prompt by AI")

# Use unified callAiContent method with BOTH parameters
aiResponse = await self.services.ai.callAiContent(
    prompt=aiPrompt,
    options=options,
    documentList=documentList,  # ← PASS documentList (was missing)
    contentParts=contentParts,  # ← PASS contentParts
    outputFormat=output_format,  # ← Optional: if None, formats determined from prompt
    parentOperationId=operationId,
    generationIntent=generationIntent
)

Note:

  • resultType parameter is optional. If omitted, formats are determined from user prompt by AI.
  • Default "txt" (if provided) is used as validation fallback only.
  • Language detection from user prompt is already done and validated. self.services.currentUserLanguage is always valid (validated during user intention analysis in workflowManager._sendFirstMessage()).

3. Handover State Definitions and Validation

Purpose: These state definitions document the expected structure and validation rules at each phase boundary.

Implementation Approach:

  • Inline validation in each phase method
  • Auto-fix where possible (use defaults, skip invalid items)
  • Stop with error for critical structural issues
  • Log warnings for skipped items

See: Appendix "Validation Failure Handling Decisions" below for detailed Q&A on each validation

Summary of Validation Decisions:

  • State 1: Skip intents for unknown documents; documents without intents are OK
  • State 2: Skip ContentParts with missing/invalid metadata (with warnings)
  • State 3: Auto-fix format/language with fallbacks; error on missing structure fields
  • State 4: Auto-fix missing elements field; allow empty elements
  • State 5: Skip empty documents; infer mimeType from filename

State 1: After Intent Clarification

Location: gateway/modules/services/serviceAi/subDocumentIntents.py - After clarifyDocumentIntents() returns (line 115)

Expected State:

documentIntents: List[DocumentIntent]  # Complete intent analysis
documents: List[ChatDocument]  # Resolved documents
preExtractedMapping: Dict[str, str]  # Map[originalDocId, jsonDocId]

Implementation Code (add after line 115, before return):

# Validation and auto-fix
documentIds = {d.id for d in documents}
validatedIntents = []

for intent in documentIntents:
    # Validation 1.2: Skip intents for unknown documents
    if intent.documentId not in documentIds:
        logger.warning(f"Skipping intent for unknown document: {intent.documentId}")
        continue
    validatedIntents.append(intent)

# Validation 1.1: Documents without intents are OK (not needed)
# Intents for non-existing documents are already filtered above
documentIntents = validatedIntents

State 2: After Content Extraction

Location: gateway/modules/services/serviceAi/subContentExtraction.py - After extractAndPrepareContent() returns (at end of method, before return)

Expected State:

finalContentParts: List[ContentPart]  # All content parts ready

Implementation Code (add at end of method, before return):

# Validation and auto-fix
validatedParts = []
for part in finalContentParts:
    # Validation 2.1: Skip ContentParts without documentId
    if not part.metadata.get("documentId"):
        logger.warning(f"Skipping ContentPart {part.id} - missing documentId in metadata")
        continue
    
    # Validation 2.2: Skip ContentParts with invalid contentFormat
    contentFormat = part.metadata.get("contentFormat")
    if contentFormat not in ["extracted", "object", "reference"]:
        logger.warning(
            f"Skipping ContentPart {part.id} - invalid contentFormat: {contentFormat}"
        )
        continue
    
    validatedParts.append(part)

return validatedParts

State 3: After Structure Generation

Location: gateway/modules/services/serviceAi/subStructureGeneration.py - After generateStructure() returns (after parsing JSON, before return, around line 182)

Expected State:

chapterStructure: Dict[str, Any]  # Complete structure with documents, chapters, outputFormat, language

Implementation Code (add after structure JSON is parsed, before return):

# After structure JSON is parsed (around line 182)
# Validation and auto-fix

# Validation 3.1: Structure missing 'documents' field
if "documents" not in structure:
    raise ValueError("Structure missing 'documents' field - cannot auto-fix")

documents = structure["documents"]

# Validation 3.2: Structure has no documents
if not isinstance(documents, list) or len(documents) == 0:
    raise ValueError("Structure has no documents - cannot generate without documents")

# Import renderer registry for format validation (existing infrastructure)
from modules.services.serviceGeneration.renderers.registry import getRenderer

# Validate and fix each document
for doc in documents:
    # Validation 3.3 & 3.4: Document outputFormat
    # outputFormat parameter is optional - if omitted, formats determined from prompt by AI
    # Use as fallback only if AI doesn't return format per document
    # Multiple documents can have different formats (e.g., one PDF, one HTML)
    globalFormatFallback = outputFormat or "txt"  # Fallback for validation
    
    if "outputFormat" not in doc or not doc["outputFormat"]:
        # AI didn't return format or returned empty - use global fallback
        doc["outputFormat"] = globalFormatFallback
        logger.info(f"Document {doc.get('id')} missing outputFormat - using fallback: {doc['outputFormat']}")
    else:
        # AI returned format - validate using existing renderer registry
        formatName = str(doc["outputFormat"]).lower().strip()
        renderer = getRenderer(formatName)  # Uses existing infrastructure
        
        if not renderer:
            # Format doesn't match any renderer - use txt (simple approach)
            logger.warning(f"Document {doc.get('id')} has format without renderer: {formatName}, using 'txt'")
            doc["outputFormat"] = "txt"
        else:
            # Valid format with renderer - normalize and keep AI result
            doc["outputFormat"] = formatName
            logger.debug(f"Document {doc.get('id')} using AI-determined format: {formatName}")
    
    # Validation 3.5 & 3.6: Document language
    # Use validated currentUserLanguage (always valid, validated during user intention analysis)
    # Access via _getUserLanguage() which uses self.services.currentUserLanguage
    userPromptLanguage = self._getUserLanguage()  # Uses validated currentUserLanguage infrastructure
    
    if "language" not in doc or not isinstance(doc["language"], str) or len(doc["language"]) != 2:
        # AI didn't return language or invalid format - use validated currentUserLanguage
        doc["language"] = userPromptLanguage
        if "language" not in doc:
            logger.info(f"Document {doc.get('id')} missing language - using currentUserLanguage: {doc['language']}")
        else:
            logger.warning(f"Document {doc.get('id')} has invalid language format from AI: {doc['language']}, using currentUserLanguage")
    else:
        # AI returned valid language format - normalize
        doc["language"] = doc["language"].lower().strip()[:2]
        logger.debug(f"Document {doc.get('id')} using AI-determined language: {doc['language']}")
    
    # Validation 3.7: Document missing 'chapters' field
    if "chapters" not in doc:
        raise ValueError(f"Document {doc.get('id')} missing 'chapters' field - cannot auto-fix")
    
    # Validation 3.8: Chapter missing 'contentParts' field
    for chapter in doc["chapters"]:
        if "contentParts" not in chapter:
            raise ValueError(f"Chapter {chapter.get('id')} missing 'contentParts' field - cannot auto-fix")

return structure

State 4: After Structure Filling

Location: gateway/modules/services/serviceAi/subStructureFilling.py - After fillStructure() returns (at end of method, before return, around line 204)

Expected State:

filledStructure: Dict[str, Any]  # Complete content with elements

Implementation Code (add at end of method, before return):

# Validation and auto-fix

# Validation 4.1: Filled structure missing 'documents' field
if "documents" not in filledStructure:
    raise ValueError("Filled structure missing 'documents' field - cannot auto-fix")

for doc in filledStructure["documents"]:
    # Validation 4.4: Verify language is preserved from input structure
    # Language MUST be preserved from Phase 3 structure (validated in State 3)
    if "language" not in doc:
        raise ValueError(f"Document {doc.get('id')} missing language in filled structure - should have been preserved from Phase 3")
    
    # Validate language format
    if not isinstance(doc["language"], str) or len(doc["language"]) != 2:
        raise ValueError(f"Document {doc.get('id')} has invalid language format in filled structure: {doc['language']} - should be 2-character ISO 639-1 code")
    
    for chapter in doc.get("chapters", []):
        for section in chapter.get("sections", []):
            # Validation 4.2: Section missing 'elements' field
            if "elements" not in section:
                section["elements"] = []
                logger.info(f"Section {section.get('id')} missing 'elements' - created empty list")
            
            # Validation 4.3: Section has empty elements list - ALLOW (intentionally empty is OK)
            # No action needed - empty elements are allowed

return filledStructure

State 5: After Document Rendering

Location: gateway/modules/services/serviceGeneration/paths/documentPath.py - After renderResult() returns (line 151, after line 157, before building documentDataList)

Expected State:

renderedDocuments: List[RenderedDocument]  # Final output

Implementation Code (add after line 157, before building documentDataList):

# Validation 5.1: Already implemented at line 175-176
if not renderedDocuments:
    raise ValueError("No documents were rendered")

# Validation 5.2 & 5.3: Validate and filter rendered documents
validatedRenderedDocs = []
for doc in renderedDocuments:
    # Validation 5.2: Skip documents with empty documentData
    if not doc.documentData:
        logger.warning(f"Skipping rendered document {doc.filename} - empty documentData")
        continue
    
    # Validation 5.3: Infer mimeType from filename if missing
    if not doc.mimeType:
        from modules.services.serviceGeneration.subDocumentUtility import getMimeTypeFromExtension
        if doc.filename:
            inferredMimeType = getMimeTypeFromExtension(doc.filename)
            if inferredMimeType:
                doc.mimeType = inferredMimeType
                logger.info(f"Inferred mimeType '{inferredMimeType}' from filename '{doc.filename}'")
            else:
                logger.warning(f"Could not infer mimeType from filename '{doc.filename}' - keeping as None")
        else:
            logger.warning(f"Rendered document missing mimeType and filename - cannot infer")
    
    validatedRenderedDocs.append(doc)

# Use validated list
renderedDocuments = validatedRenderedDocs

# Re-check after filtering
if not renderedDocuments:
    raise ValueError("No valid documents after validation")

4. Migration Checklist

Phase 1: Model Updates

  • Verify DocumentIntent model does NOT include outputFormat or language
  • Intent clarification focuses only on document purpose (intents, extractionPrompt)
  • Note: outputFormat and language are determined during structure generation (Phase 3)

Phase 2: Intent Analysis Updates

  • CRITICAL: Add fencing around userPrompt in intent analysis prompt
    • Fence user input with code blocks: user_request\n{userPrompt}\n
    • Test with various user inputs (special chars, JSON, newlines, prompt injection attempts)
  • Update prompt to focus only on document intents (extract, render, reference)
  • Remove any outputFormat/language determination from intent analysis prompt
  • Keep global outputFormat/language as reference only (not for determination)
  • Verify intent mapping logic (already implemented in clarifyDocumentIntents):
    • Step 1: Map pre-extracted JSONs to original documents (lines 63-83)
    • Step 2: AI analyzes intents for original documents (line 86)
    • Step 3: Map intents back to JSON doc IDs (lines 96-104)
    • Test with pre-extracted JSONs to verify mapping works correctly

Phase 3: Content Extraction Updates

  • Verify ContentParts do NOT include outputFormat or language in metadata

  • ContentParts carry only intent and extraction information

  • Verify pre-extracted JSON handling preserves intent information

  • Add filtering to Data Extraction Path (_handleDataExtraction): Current State (BEFORE filtering):

    # Line 708: Get documents directly from documentList
    documents = self.services.chat.getChatDocumentsFromDocumentList(documentList)
    # Line 721: Call extractAndPrepareContent() with ALL documents
    preparedContentParts = await self.extractAndPrepareContent(documents, ...)
    

    Problem: If documentList contains both:

    • Original document: original_pdf_123.pdf
    • Pre-extracted JSON: pre_extracted_456.json (contains ContentParts from original_pdf_123.pdf) → Both are processed → DUPLICATE ContentParts created

    How Filtering Works (Reference: documentPath.py lines 62-87):

    Step 1: Identify Pre-Extracted JSONs and Map to Originals

    # Collect all original document IDs that are covered by pre-extracted JSONs
    originalDocIdsCoveredByPreExtracted = set()
    for doc in documents:
        preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc)
        if preExtracted:
            # Pre-extracted JSON found - get the original document ID it covers
            originalDocId = preExtracted["originalDocument"]["id"]
            originalDocIdsCoveredByPreExtracted.add(originalDocId)
    

    Result: originalDocIdsCoveredByPreExtracted = {"original_pdf_123"} (if pre-extracted JSON covers it)

    Step 2: Filter Documents List

    filteredDocuments = []
    for doc in documents:
        preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc)
        if preExtracted:
            # Pre-extracted JSON - KEEP IT (will be processed as ContentParts)
            filteredDocuments.append(doc)
        elif doc.id in originalDocIdsCoveredByPreExtracted:
            # Original document covered by pre-extracted JSON - REMOVE IT
            logger.info(f"Skipping original document {doc.id} - already covered")
            # Do NOT append - skip this document
        else:
            # Regular document (not pre-extracted, not covered) - KEEP IT
            filteredDocuments.append(doc)
    
    documents = filteredDocuments  # Use filtered list
    

    Result:

    • Pre-extracted JSON: pre_extracted_456.json → KEPT
    • Original document: original_pdf_123.pdf → REMOVED (covered by pre-extracted JSON)
    • Regular document: other_doc.pdf → KEPT (not covered)

    Step 3: Use Filtered Documents

    # Now call extractAndPrepareContent() with filtered documents only
    preparedContentParts = await self.extractAndPrepareContent(
        documents,  # Only pre-extracted JSONs + regular docs (no originals covered by JSONs)
        documentIntents or [],
        extractOperationId
    )
    

    Result: No duplicates - original documents already filtered out

    Implementation Steps:

    • Add filtering logic between line 708 (get documents) and line 710 (clarify intents)
    • Copy filtering code from documentPath.py lines 62-87
    • Adapt to use self.intentAnalyzer.resolvePreExtractedDocument() (same method)
    • Filtering Logic:
      # Step 1: Identify all original document IDs covered by pre-extracted JSONs
      originalDocIdsCoveredByPreExtracted = set()
      for doc in documents:
          preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc)
          if preExtracted:
              originalDocId = preExtracted["originalDocument"]["id"]
              originalDocIdsCoveredByPreExtracted.add(originalDocId)
              logger.debug(f"Found pre-extracted JSON {doc.id} covering original document {originalDocId}")
      
      # Step 2: Filter documents - remove originals covered by pre-extracted JSONs
      filteredDocuments = []
      for doc in documents:
          preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc)
          if preExtracted:
              filteredDocuments.append(doc)  # Keep pre-extracted JSON
          elif doc.id in originalDocIdsCoveredByPreExtracted:
              logger.info(f"Skipping original document {doc.id} ({doc.fileName}) - already covered by pre-extracted JSON")
          else:
              filteredDocuments.append(doc)  # Keep regular document
      
      documents = filteredDocuments  # Use filtered list
      
    • Test with scenario: original document + pre-extracted JSON → verify no duplicates
  • Remove redundant check from extractAndPrepareContent():

    • Remove pre-extracted JSON check (line 77 in subContentExtraction.py)
    • Trust that filtering is done upstream
    • Cleaner code, single responsibility
  • Test merging logic

  • Test that both document generation and data extraction paths handle pre-extracted JSONs correctly

  • Note: outputFormat and language are NOT propagated here - determined in structure generation

Phase 4: Structure Generation Updates

  • Make outputFormat optional in generateStructure() method signature:
    • Update subStructureGeneration.py method signature (line 47): outputFormat: Optional[str] = None
    • Update mainServiceAi.py wrapper method (line 444): Make outputFormat optional
    • If outputFormat not provided, use "txt" as validation fallback (AI determines formats from prompt)
    • Add logging: "outputFormat not provided - using 'txt' as validation fallback, formats determined from prompt"
    • Context: outputFormat is only a validation fallback - AI determines per-document formats from user prompt. Multiple documents can have different formats (e.g., one PDF, one HTML).
  • Note on language handling: Language is accessed via self.services.currentUserLanguage (always valid, validated during user intention analysis). No language parameter needed in generateStructure() method signature - language is accessed directly from services within the method.
    • Verify currentUserLanguage is used correctly in subStructureGeneration.py (via self.services.currentUserLanguage)
    • Verify currentUserLanguage is used correctly in prompt building (via self.services.currentUserLanguage)
    • Note: mainServiceGeneration.py uses different service - verify if update needed
  • Group ContentParts by documentId (for context in prompt)
  • Update _buildChapterStructurePrompt() to access language via self.services.currentUserLanguage (no parameter needed)
  • Update structure generation prompt to ask AI to determine per-document outputFormat
    • Explicitly require outputFormat field in each document JSON structure
    • Update example structure to show outputFormat field (not just filename)
    • Clarify that multiple documents can have different formats
  • Update structure generation prompt to ask AI to determine per-document language
    • Explicitly require language field in each document JSON structure
    • Clarify that multiple documents can have different languages
  • Provide global fallbacks (outputFormat, language) for AI to use if not specified
    • outputFormat fallback: from parameter or "txt"
    • language fallback: use self._getUserLanguage() (validated currentUserLanguage infrastructure)
  • Parse and validate format/language from AI response:
    • Extract outputFormat and language from each document in structure JSON
    • Format validation (use existing renderer registry infrastructure):
      • Import: from modules.services.serviceGeneration.renderers.registry import getRenderer
      • If outputFormat missing or empty → use global fallback (outputFormat or "txt")
      • If outputFormat exists → check if it has a renderer using getRenderer(formatName) (existing infrastructure)
      • Normalize format name: formatName.lower().strip()
      • If format doesn't match any renderer → use "txt" (simple approach, no global fallback attempt)
      • Log warnings for invalid formats
      • Note: Infrastructure exists at mainServiceGeneration.py:529 - reuse getRenderer() function
    • Language validation (use existing validated infrastructure):
      • Validate language (must be 2-character ISO 639-1 code)
      • If language missing: Set to self._getUserLanguage() which uses validated currentUserLanguage (always valid, validated during user intention analysis at workflowManager.py:695-727)
      • If language invalid format: Use self._getUserLanguage() (always valid)
      • Normalize language: language.lower().strip()[:2]
      • Log warnings for invalid/missing values
      • Note: currentUserLanguage is always valid - safe to use directly via _getUserLanguage() method
  • Error handling:
    • If structure JSON is malformed → raise error with details
    • If no documents in structure → raise error
    • If AI doesn't return format → use global outputFormat fallback (or "txt" if not provided), log warning
    • If AI doesn't return language → use validated currentUserLanguage (always valid), log warning
  • Verify structure output includes per-document format and language (from AI in JSON response)

Phase 5: Structure Filling Verification

  • Verify two prompt types are correctly used:
    • isAggregation=True: ContentParts as parameters
    • isAggregation=False: Only generationHint
  • Verify per-document language is extracted and used:
    • Language MUST be defined in structure (validated in State 3)
    • Language extracted from document in structure (per-document) - NO fallback to "en"
    • If language missing: Raise error (should not happen after State 3 validation)
    • If language invalid format: Raise error (should not happen after State 3 validation)
    • Language passed to _buildSectionGenerationPrompt() for each section
    • Language preserved in filled structure (State 4 validation)
  • Test both prompt types with various scenarios
  • Verify Vision AI extraction happens during filling phase
  • Test with multi-document scenarios (different languages per document)

Phase 6: Document Rendering Updates

  • Add language parameter to renderResult() method:
    • Update mainServiceAi.py renderResult() signature (line 460)
    • Pass language to generationService.renderReport() (as global fallback)
  • Update renderResult call site (documentPath.py line 151):
    • Language comes from structure (per-document), validated in State 3
    • Use validated currentUserLanguage as global fallback (always valid)
    • Per-document language will be extracted in renderReport() from filledStructure
    • Code example:
      # Language is already validated in structure (State 3) and preserved in filled structure (State 4)
      # Per-document language will be extracted in renderReport() from filledStructure
      # Use validated currentUserLanguage as global fallback (always valid infrastructure)
      language = self.services.currentUserLanguage or "en"  # Uses validated infrastructure
      
      renderedDocuments = await self.services.ai.renderResult(
          filledStructure,
          outputFormat,
          language,  # ← Global fallback (per-document language extracted from structure in renderReport)
          title or "Generated Document",
          userPrompt,
          docOperationId
      )
      
  • Update renderReport() to handle per-document format and language:
    • Add language parameter to method signature (line 349): language: str (global fallback)
    • Extract per-document format: docFormat = doc.get("outputFormat") or doc.get("format") or outputFormat (check outputFormat field first)
    • Extract per-document language: docLanguage = doc.get("language") or language (from structure, validated in State 3)
    • Validate language format (should be 2-character ISO code, validated in State 3)
    • Add language to metadata passed to renderers: metadata["language"] = docLanguage
    • Note: Per-document format and language are extracted from structure (validated in State 3). Multiple documents can have different formats and languages.
  • Error handling:
    • If no documents in structure → raise error
    • If filtering removes all documents → raise error
    • If format not supported → log warning, skip document
  • Test multi-document rendering with different formats/languages

Phase 7: ai.process Refactoring

  • Remove extraction logic from ai.process (lines 72-119)

  • Make resultType optional: IMPLEMENTED

    • Update ai.process: Make resultType optional (can be None) - COMPLETED
    • Update ai.generateDocument: Make resultType optional, removed auto-detection - COMPLETED
    • Update ai.generateCode: Make resultType optional, removed auto-detection - COMPLETED
    • If resultType omitted → pass None to callAiContent() (formats determined from prompt) - COMPLETED
    • Updated action parameter definitions in methodAi.py - COMPLETED

    Implementation Status:

    • ai.process: resultType optional, passes None if omitted
    • ai.generateDocument: resultType optional, passes None if omitted
    • ai.generateCode: resultType optional, passes None if omitted
    • callAiContent: Already supports optional outputFormat (defaults to "txt")
    • generateStructure: Make outputFormat optional (see Phase 4 checklist)
  • Add filtering to Data Extraction Path (_handleDataExtraction):

    • Location: mainServiceAi.py between line 708 (get documents) and line 721 (extract content)
    • Purpose: Prevent duplicate ContentParts when both original document and pre-extracted JSON are provided
    • Implementation: Copy filtering logic from documentPath.py:62-87
    • Filter out original documents covered by pre-extracted JSONs before calling extractAndPrepareContent()
    • See Phase 3 checklist for detailed filtering code
  • Pass documentList to callAiContent() (currently missing, line 155-162 in process.py)

    • documentList is available in process.py (lines 43-55) but not passed to callAiContent()
    • Add documentList=documentList parameter to callAiContent() call
  • Pass contentParts to callAiContent() (already done)

  • Error handling:

    • If no documents and no contentParts → raise error
    • If filtering removes all documents → raise error
  • Verify intelligent merging in AI service works correctly

Phase 8: Testing

  • Test with pre-extracted JSON documents
  • Test with mixed documentList + contentParts
  • Test per-document format/language determination
  • Test two prompt types in structure filling
  • Test multi-document output with different formats/languages
  • Test security: prompt injection attempts with fenced input
  • Test optional outputFormat handling:
    • Test with resultType provided → formats used as fallback
    • Test with resultType omitted → AI determines formats from prompt
    • Test format validation: invalid format → uses "txt"
    • Test format validation: format without renderer → uses "txt"

Phase 9: Documentation

  • Update API documentation
  • Update developer documentation
  • Update user documentation (if applicable)

Priority Order

High Priority (Security & Critical Path):

  1. Phase 2: Intent Analysis Updates - Security fix (fencing) is CRITICAL
  2. Phase 7: ai.process Refactoring - Add filtering to Data Extraction Path (prevents duplicate ContentParts)
  3. Phase 1: Model Updates - Foundation for all other changes

Medium Priority (Architectural Improvements): 4. Phase 4: Structure Generation Updates

  • Make outputFormat optional (AI determines per-document formats)
  • Implement State 3 validation (use existing renderer registry and language infrastructure)
  • Update prompt to require outputFormat field per document
  1. Phase 6: Document Rendering Updates
    • Extract per-document format/language from structure
    • Add language parameter to renderResult() and renderReport()
  2. Phase 3: Content Extraction Updates
    • Remove redundant pre-extracted check AFTER filtering added upstream

Low Priority (Verification & Polish): 7. Phase 5: Structure Filling Verification (already implemented, verify) 8. Phase 8: Testing 9. Phase 9: Documentation


Notes

  • The two prompt types in Phase 4 (Structure Filling) are already implemented via the isAggregation flag. This step focuses on verification and documentation.
  • Per-document format/language determination follows the same pattern as existing per-document language handling.
  • The security fix (fencing user input) should be implemented immediately as it addresses a potential prompt injection vulnerability.

Architectural Note: Filtering and Redundant Pre-Extracted JSON Checks

Problem Statement

When a user provides both an original document and a pre-extracted JSON containing ContentParts from that original document, we need to prevent duplicate ContentParts from being created.

Current State

The pre-extracted JSON check happens twice:

  1. Phase 1 (documentPath.py lines 67-87): Filters documents before intent clarification
  2. Phase 2 (subContentExtraction.py line 77): Checks again during extraction loop

Why Filtering is Necessary

The redundant check in extractAndPrepareContent() only identifies if a document IS a pre-extracted JSON. It does NOT identify if a document is an ORIGINAL covered by a pre-extracted JSON.

Example:

# In extractAndPrepareContent loop:
for document in [original_pdf_123, pre_extracted_456]:
    # Check document 1: original_pdf_123
    preExtracted = resolvePreExtractedDocument(original_pdf_123)
    # Returns: None (it's not a pre-extracted JSON)
    # → Processes original_pdf_123 → extracts ContentParts
    
    # Check document 2: pre_extracted_456
    preExtracted = resolvePreExtractedDocument(pre_extracted_456)
    # Returns: {originalDocument: {id: "original_pdf_123"}, ...}
    # → Processes pre_extracted_456 → extracts ContentParts
    
    # Result: BOTH processed → DUPLICATES

The redundant check doesn't help because:

  • It only looks at ONE document at a time
  • It doesn't know about OTHER documents in the list
  • It can't compare documents to find relationships

Why Filtering Works

Filtering happens BEFORE the extraction loop, so it can:

  1. Look at ALL documents at once
  2. Identify relationships between documents
  3. Remove originals BEFORE extraction starts

Code Path Analysis

Path 1: Document Generation Path (documentPath.py)

Location: Line 103 Filtering: YES (lines 62-87)

  • Identifies pre-extracted JSONs
  • Filters out original documents covered by pre-extracted JSONs
  • Only passes filtered documents to extractAndPrepareContent()

Result: NO DUPLICATES - Original document already filtered out

Path 2: Data Extraction Path (mainServiceAi.py _handleDataExtraction)

Location: Line 721 Filtering: NO

  • Gets documents directly from documentList (line 708)
  • Calls extractAndPrepareContent() without any filtering
  • Does NOT filter out original documents covered by pre-extracted JSONs

Result: DUPLICATES CREATED - Both documents processed, same content extracted twice

Visual Flow Comparison

Document Generation Path (WITH Filtering - CURRENT)

documentList: [original_pdf_123, pre_extracted_456]
    ↓
[FILTERING] Identify relationships, remove originals
    ↓
filteredDocuments: [pre_extracted_456]  ← original_pdf_123 removed
    ↓
extractAndPrepareContent([pre_extracted_456])
    ↓
ContentParts from pre_extracted_456 only
    ↓
✅ NO DUPLICATES

Data Extraction Path (WITHOUT Filtering - CURRENT)

documentList: [original_pdf_123, pre_extracted_456]
    ↓
[NO FILTERING] Pass all documents
    ↓
extractAndPrepareContent([original_pdf_123, pre_extracted_456])
    ↓
Process original_pdf_123 → ContentParts
Process pre_extracted_456 → ContentParts
    ↓
❌ DUPLICATES (same content twice)

Data Extraction Path (WITH Filtering - TARGET)

documentList: [original_pdf_123, pre_extracted_456]
    ↓
[FILTERING] Identify relationships, remove originals
    ↓
filteredDocuments: [pre_extracted_456]  ← original_pdf_123 removed
    ↓
extractAndPrepareContent([pre_extracted_456])
    ↓
ContentParts from pre_extracted_456 only
    ↓
✅ NO DUPLICATES

Solution

Target State: Add filtering to Data Extraction Path, then remove redundant check

Steps:

  1. Add filtering logic to _handleDataExtraction (between line 708 and line 721)
    • Copy filtering code from documentPath.py lines 62-87
    • Filter out original documents covered by pre-extracted JSONs
  2. Remove redundant check from extractAndPrepareContent() (line 77)
    • Trust that filtering is done upstream
    • Cleaner code, single responsibility

Risk Assessment:

  • If we remove redundant check WITHOUT adding filtering: ⚠️ Duplicates still occur (no change from current state)
  • If we add filtering THEN remove redundant check: No duplicates, cleaner code

Conclusion

  1. Filtering is necessary because it can look at ALL documents and identify relationships
  2. Redundant check is insufficient because it only looks at ONE document at a time
  3. Current state: Document Generation Path filters → safe. Data Extraction Path doesn't filter → duplicates possible
  4. Solution: Add filtering to Data Extraction Path, then remove redundant check (it's not needed if filtering is done)
  5. Risk of removing redundant check: None IF filtering is added first. High IF filtering is NOT added (but duplicates already exist anyway)

Appendix: Pre-Extracted JSON Document Check Locations

Where the Check is Done

1. Phase 1 (Before Intent Clarification):

  • File: gateway/modules/services/serviceGeneration/paths/documentPath.py
  • Lines: 67-87
  • Purpose: Filter documents before intent analysis
  • Method: self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc)
  • Action: Identifies pre-extracted JSONs and filters out original documents covered by them

2. Phase 2 (During Content Extraction):

  • File: gateway/modules/services/serviceAi/subContentExtraction.py
  • Line: 77
  • Purpose: Process each document during extraction loop
  • Method: self.intentAnalyzer.resolvePreExtractedDocument(document)
  • Action: Extracts ContentParts from pre-extracted JSON (not treat as regular JSON)
  • Note: ⚠️ REDUNDANT - This check happens again even though Phase 1 already filtered documents
  • Reason: extractAndPrepareContent() is called from multiple code paths:
    • Document generation path (documentPath.py) - filtering already done
    • Data extraction path (mainServiceAi.py) - filtering may not be done
    • The extraction service needs to handle pre-extracted JSONs defensively
  • Optimization Opportunity: Could pass filtered documents or a flag to skip redundant checks

3. Check Implementation:

  • File: gateway/modules/services/serviceAi/subDocumentIntents.py
  • Line: 122
  • Method: resolvePreExtractedDocument(document: ChatDocument)
  • Logic:
    • Checks if mimeType == "application/json"
    • Parses JSON and checks for validationMetadata.actionType == "context.extractContent"
    • Extracts ContentExtracted structure from documentData
    • Returns dict with originalDocument and contentExtracted info

Where Final Merged List is Available

After Phase 2 (Content Extraction):

  • File: gateway/modules/services/serviceGeneration/paths/documentPath.py
  • Line: 119
  • Code: contentParts = preparedContentParts
  • State:
    • All pre-extracted JSON documents processed → ContentParts
    • All regular documents extracted → ContentParts
    • All provided contentParts merged
    • Final clean merged list ready for Phase 3 (Structure Generation)

Before Phase 3 (Structure Generation):

  • File: gateway/modules/services/serviceGeneration/paths/documentPath.py
  • Line: 129
  • Usage: contentParts or [] passed to generateStructure()
  • Note: This is the clean merged list containing all ContentParts from all sources

Appendix: Intent Mapping Logic for Pre-Extracted JSONs

How Intent Mapping Works

Problem: When a pre-extracted JSON document is provided, we need to:

  1. Analyze intents for the original document (not the JSON file itself)
  2. Map the intents back to the JSON document ID (so they can be applied to the ContentParts extracted from the JSON)

Implementation Logic (Already in clarifyDocumentIntents)

Location: gateway/modules/services/serviceAi/subDocumentIntents.py lines 63-104

Step 1: Build Mapping (lines 63-83)

documentMapping = {}  # Maps original doc ID → JSON doc ID
resolvedDocuments = []

for doc in documents:
    preExtracted = self.resolvePreExtractedDocument(doc)
    if preExtracted:
        # This is a pre-extracted JSON
        originalDocId = preExtracted["originalDocument"]["id"]
        jsonDocId = doc.id  # Current document is the JSON
        
        # Map: original doc ID → JSON doc ID
        documentMapping[originalDocId] = jsonDocId
        
        # Create temporary ChatDocument for original document
        originalDoc = ChatDocument(
            id=originalDocId,
            fileName=preExtracted["originalDocument"]["fileName"],
            mimeType=preExtracted["originalDocument"]["mimeType"],
            # ... other fields from preExtracted["originalDocument"]
        )
        resolvedDocuments.append(originalDoc)  # Use original doc for intent analysis
    else:
        resolvedDocuments.append(doc)  # Regular document, use as-is

Result:

  • documentMapping = {"original_pdf_123": "pre_extracted_456"}
  • resolvedDocuments = [ChatDocument(id="original_pdf_123"), ChatDocument(id="other_doc")]

Step 2: AI Analyzes Intents (line 86)

# AI analyzes intents for resolvedDocuments (original documents, not JSONs)
intentPrompt = self._buildIntentAnalysisPrompt(userPrompt, resolvedDocuments, actionParameters)
aiResponse = await self.aiService.callAiPlanning(prompt=intentPrompt, ...)

AI Response:

{
  "intents": [
    {
      "documentId": "original_pdf_123",  // ← Original document ID
      "intents": ["extract"],
      "extractionPrompt": "Extract all text",
      "reasoning": "..."
    }
  ]
}

Step 3: Map Intents Back to JSON Doc IDs (lines 96-104)

intentsData = json.loads(self.services.utils.jsonExtractString(aiResponse))
documentIntents = []

for intent in intentsData.get("intents", []):
    docId = intent.get("documentId")  # "original_pdf_123"
    
    # If intent is for an original document covered by a pre-extracted JSON
    if docId in documentMapping:
        # Map back to JSON document ID
        intent["documentId"] = documentMapping[docId]  # "pre_extracted_456"
    
    documentIntents.append(DocumentIntent(**intent))

Result:

  • DocumentIntent(documentId="pre_extracted_456", intents=["extract"], ...)
  • Intent is now mapped to the JSON document ID, so it can be applied to ContentParts extracted from the JSON

Why This Works

  1. AI analyzes original documents: More meaningful context (file name, MIME type, etc.)
  2. Intents mapped to JSON IDs: ContentParts extracted from JSON can be tagged with correct intents
  3. Consistent with filtering: Original documents are filtered out, but their intents are preserved via mapping

Example Flow

Input:
- documentList: [original_pdf_123.pdf, pre_extracted_456.json]

Step 1: Filtering (Phase 1)
- Identify: pre_extracted_456.json covers original_pdf_123.pdf
- Filter: Remove original_pdf_123.pdf
- Result: documents = [pre_extracted_456.json]

Step 2: Intent Mapping (Phase 1)
- Build mapping: {"original_pdf_123": "pre_extracted_456"}
- Resolve: resolvedDocuments = [ChatDocument(id="original_pdf_123")]
- AI analyzes: intents for "original_pdf_123"
- Map back: intents for "pre_extracted_456"

Step 3: Content Extraction (Phase 2)
- Extract ContentParts from pre_extracted_456.json
- Apply intents (from Step 2) to ContentParts
- Result: ContentParts with correct intents

Implementation Notes

Infrastructure Available

The following infrastructure already exists and should be reused:

  • Language Validation: currentUserLanguage is validated at workflowManager.py:695-727 - always valid 2-character ISO code. Access via self.services.currentUserLanguage or _getUserLanguage() method.

  • Format Validation: Renderer registry exists at mainServiceGeneration.py:529 (_getFormatRenderer() uses getRenderer()). Import: from modules.services.serviceGeneration.renderers.registry import getRenderer. Returns None if format invalid, falls back to text renderer.

  • Language Extraction: _getDocumentLanguage() works correctly at subStructureFilling.py:349 - extracts per-document language from structure. Used properly during section generation.

Key Implementation Points

  1. Per-Document Format/Language: Multiple documents can have different formats and languages. AI determines these from user prompt. Parameters are only validation fallbacks.

  2. Filtering: Must filter pre-extracted JSONs before content extraction to prevent duplicate ContentParts. Filtering logic exists in documentPath.py:62-87 and should be copied to data extraction path.

  3. State 3 Validation: Use existing infrastructure (getRenderer(), _getUserLanguage()) for validation. Infrastructure exists, just needs to be called.

  4. Rendering: Extract per-document outputFormat and language from structure (validated in State 3). Check outputFormat field first, then format field (legacy), then global fallback.


Appendix: Validation Failure Handling Decisions

This appendix documents the decision-making process for how to handle each validation failure. The actual implementation code is integrated into Section 3 above.

Approach

  • Try to fix automatically (use defaults) when validation fails
  • All validations are critical (must not fail - fix or error)
  • Validation happens inline in each phase method

State 1: After Intent Clarification

Validation 1.1: Intent count mismatch

Check: len(documentIntents) != len(documents) Decision: Documents without intents are OK. Intents for non-existing documents should be skipped. Rationale: Not all documents need intents (some may be reference-only). Intents referencing unknown documents are invalid and should be removed.

Validation 1.2: Intent references unknown document

Check: intent.documentId not in documentIds Decision: Skip this intent (remove it) Rationale: Cannot map intent to non-existent document. Better to skip than fail.


State 2: After Content Extraction

Validation 2.1: ContentPart missing documentId

Check: not part.metadata.get("documentId") Decision: Skip this ContentPart (remove it) with warning in logger Rationale: ContentPart without documentId cannot be properly assigned. Skip with warning for debugging.

Validation 2.2: ContentPart has invalid contentFormat

Check: contentFormat not in ["extracted", "object", "reference"] Decision: Skip this ContentPart (remove it) with warning in logger Rationale: Invalid contentFormat indicates corrupted data. Skip with warning for debugging.


State 3: After Structure Generation

Validation 3.1: Structure missing 'documents' field

Check: "documents" not in chapterStructure Decision: Stop with error (cannot auto-fix - structure is invalid) Rationale: Structure without documents field is fundamentally broken. Cannot proceed.

Validation 3.2: Structure has no documents

Check: len(documents) == 0 Decision: Stop with error (cannot generate without documents) Rationale: Cannot generate output without documents. Must have at least one document.

Validation 3.3: Document missing 'outputFormat' field

Check: "outputFormat" not in doc Decision: Use global fallback format (from parameters), if not available use default "txt" Rationale: Format is required for rendering. Use fallback chain: per-document → global → default.

Validation 3.4: Document has invalid outputFormat

Check: outputFormat not in valid formats Decision: Use renderer registry to check if format has a renderer. If no renderer exists, try global fallback, then default "txt" Rationale: Use dynamic renderer registry (not hardcoded list) to check format validity. Fallback chain ensures we always have a valid format.

Validation 3.5: Document missing 'language' field

Check: "language" not in doc Decision: Use user prompt language (from self.services.currentUserLanguage via _getUserLanguage()), not "en" fallback Rationale: Language is required for content generation. Use user prompt language (detected from user intention analysis) as fallback, not hardcoded "en".

Validation 3.6: Document has invalid language

Check: len(doc["language"]) != 2 Decision: Use validated currentUserLanguage (always valid, validated during user intention analysis) Rationale: currentUserLanguage is validated during user intention analysis and is always a valid 2-character ISO 639-1 code. Safe to use directly.

Validation 3.7: Document missing 'chapters' field

Check: "chapters" not in doc Decision: Stop with error (cannot auto-fix - document structure invalid) Rationale: Document without chapters is structurally invalid. Cannot proceed.

Validation 3.8: Chapter missing 'contentParts' field

Check: "contentParts" not in chapter Decision: Stop with error (cannot auto-fix - chapter structure invalid) Rationale: Chapter without contentParts field is structurally invalid. Cannot proceed.


State 4: After Structure Filling

Validation 4.1: Filled structure missing 'documents' field

Check: "documents" not in filledStructure Decision: Stop with error (cannot auto-fix - structure is invalid) Rationale: Structure without documents field is fundamentally broken. Cannot proceed.

Validation 4.2: Section missing 'elements' field

Check: "elements" not in section Decision: Create empty elements list: section["elements"] = [] Rationale: Section can be intentionally empty. Create empty list to maintain structure.

Validation 4.3: Section has empty elements list

Check: not section["elements"] (empty list) Decision: Allow empty elements (section might be intentionally empty) Rationale: Empty sections are valid (e.g., placeholder sections). No action needed.

Validation 4.4: Document missing 'language' field in filled structure

Check: "language" not in doc (in filledStructure) Decision: Stop with error (language MUST be preserved from Phase 3) Rationale: Language is validated and set in Phase 3 (State 3). If missing in filled structure, it's a critical error - language must be preserved.

Validation 4.5: Document has invalid language format in filled structure

Check: not isinstance(doc["language"], str) or len(doc["language"]) != 2 Decision: Stop with error (language format MUST be valid) Rationale: Language format is validated in Phase 3 (State 3). If invalid in filled structure, it's a critical error.


State 5: After Document Rendering

Validation 5.1: No documents rendered

Check: len(renderedDocuments) == 0 Decision: Stop with error (already implemented in documentPath.py line 176) Rationale: Cannot return empty result. Error already implemented.

Validation 5.2: Rendered document has empty documentData

Check: not doc.documentData Decision: Skip this document (remove from list) Rationale: Empty document is not useful. Skip it rather than fail entire operation.

Validation 5.3: Rendered document missing mimeType

Check: not doc.mimeType Decision: Infer mimeType from filename extension Rationale: mimeType can be inferred from filename. Use utility function to detect.