# Implementation Plan: Content Handling Architecture Migration

## Overview

This document provides a detailed implementation plan for migrating to the target architecture for content extraction and document generation. The plan focuses on:

- **Documents and Content Handling**: Intelligent merging of `documentList` and `contentParts` with deduplication
- **Output Document Formats**: Per-document format determination (not global) - AI determines formats from user prompt, multiple documents can have different formats
- **Languages Handling**: Per-document language determination (not global) - uses validated `currentUserLanguage` infrastructure
- **Clear Handover States**: Defined validation at each phase boundary using existing infrastructure
- **Structure Filling**: Two prompt types (with content vs. without content)

## Verified Infrastructure (Ready to Use)

The following infrastructure already exists and can be reused:

- ✅ **Language Validation**: `currentUserLanguage` is validated at `workflowManager.py:695-727` - always a valid 2-character ISO code (validates AI response, falls back to user language, then "en"). Safe to use via `self.services.currentUserLanguage` or the `_getUserLanguage()` method.
- ✅ **Format Validation**: Renderer registry exists at `mainServiceGeneration.py:529` (`_getFormatRenderer()` uses `getRenderer()`). Can be imported: `from modules.services.serviceGeneration.renderers.registry import getRenderer`. Returns `None` if the format is invalid, in which case callers fall back to the text renderer.
- ✅ **Language Extraction**: `_getDocumentLanguage()` works correctly at `subStructureFilling.py:349` - extracts per-document language from structure. Used properly during section generation.

## Context

This implementation plan is based on the analysis documented in:

- `gateway/modules/services/serviceAi/CONTENT_EXTRACTION_ANALYSIS.md` (Section 9.3: Target State)

The target architecture addresses architectural issues identified in the current implementation:

1. **Single extraction path** in AI service (no duplication in `ai.process`)
2. **Intelligent merging** of `contentParts` and `documentList` with deduplication
3. **Clear separation** of concerns: action layer delegates to service layer
4. **Consistent behavior** across all code paths
5. **Per-document format/language** determination (not global)

---

## 1. Overview: Major Phases and Handover States

### Phase Flow Diagram

```
┌─────────────────────────────────────────────────────────────────────
│ PHASE 1: Document Intent Clarification
│ ──────────────────────────────────────────────────────────────────
│ INPUT:
│   - userPrompt: str (fenced)
│   - documentList: DocumentReferenceList (optional)
│   - contentParts: List[ContentPart] (optional)
│   - actionParameters: Dict (outputFormat, language, etc.)
│
│ THROUGHPUT:
│   1. Resolve documents from documentList
│   2. Identify pre-extracted JSON documents
│      - Check if JSON contains ContentExtracted structure
│      - Map pre-extracted JSONs to original documents
│   3. Filter out original documents covered by pre-extracted
│   4. AI analyzes document purposes
│   5. Map intents back to JSON doc IDs (if applicable)
│
│ OUTPUT:
│   - documentIntents: List[DocumentIntent]
│     * documentId: str
│     * intents: List[str] (["extract", "render", "reference"])
│     * extractionPrompt: str (optional)
│     * reasoning: str
│   Note: outputFormat and language are NOT determined here -
│   they're determined in Phase 3 (Structure Generation)
│
│ HANDOVER STATE:
│   - documentIntents: Complete intent analysis
│   - documents: Resolved ChatDocuments
│   - preExtractedMapping: Map[originalDocId, jsonDocId]
└─────────────────────────────────────────────────────────────────────
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────
│ PHASE 2: Content Extraction and Preparation
│ ──────────────────────────────────────────────────────────────────
│ INPUT:
│   - documents: List[ChatDocument]
│   - documentIntents: List[DocumentIntent]
│   - contentParts: List[ContentPart] (optional, pre-extracted)
│   - preExtractedMapping: Map[originalDocId, jsonDocId]
│
│ THROUGHPUT:
│   1. Process pre-extracted JSON documents → ContentParts
│      - Extract ContentParts from JSON (not treat as regular JSON)
│      - Apply intents (extract, render, reference)
│      - Mark with isPreExtracted=True
│   2. RAW extraction (NO AI) for regular documents
│      - Extract content using extraction service
│      - Create ContentParts with metadata
│   3. Merge all ContentParts
│      - Pre-extracted parts (from JSON documents)
│      - Extracted parts (from regular documents)
│      - Provided parts (from contentParts parameter)
│   4. Apply intents to ContentParts (extract, render, reference)
│   5. Mark images for Vision AI extraction (deferred)
│
│ OUTPUT:
│   - finalContentParts: List[ContentPart]
│     * id: str
│     * typeGroup: str
│     * mimeType: str
│     * data: Union[str, bytes]
│     * metadata: Dict
│       - documentId: str
│       - contentFormat: str ("extracted", "object", "reference")
│       - intent: str
│       - needsVisionExtraction: bool (for images)
│       - extractionPrompt: str (for Vision AI)
│       - originalFileName: str
│       - isPreExtracted: bool
│   Note: outputFormat and language are NOT propagated here -
│   they're determined in Phase 3 (Structure Generation)
│
│ HANDOVER STATE:
│   - finalContentParts: Complete merged list
│   - All pre-extracted JSON documents processed → ContentParts
│   - All regular documents extracted → ContentParts
│   - All provided contentParts merged
│   - All documents processed (extracted or pre-extracted)
│   - Vision AI extraction deferred to Phase 4
└─────────────────────────────────────────────────────────────────────
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────
│ PHASE 3: Structure Generation
│ ──────────────────────────────────────────────────────────────────
│ INPUT:
│   - userPrompt: str
│   - finalContentParts: List[ContentPart]
│   - outputFormat: Optional[str] (optional fallback, defaults to "txt")
│   - currentUserLanguage: str (always valid, validated during user intention analysis)
│     * From: self.services.currentUserLanguage
│
│ THROUGHPUT:
│   1. Group ContentParts by documentId (for context)
│   2. AI generates structure with documents and chapters
│   3. AI determines per-document outputFormat in structure JSON
│      from user prompt → else optional outputFormat fallback (or "txt")
│   4. AI determines per-document language in structure JSON
│      from user prompt → else validated currentUserLanguage (always valid)
│   5. Assign ContentParts to chapters
│
│ OUTPUT:
│   - chapterStructure: Dict
│     * documents: List[Dict]
│       - id: str
│       - title: str
│       - outputFormat: str (per-document) ← NEW
│       - language: str (per-document) ← NEW
│       - chapters: List[Dict]
│         * id: str
│         * level: int
│         * title: str
│         * generationHint: str
│         * contentParts: List[str] (ContentPart IDs)
│
│ HANDOVER STATE:
│   - chapterStructure: Complete structure with ContentPart
│     assignments
│   - Per-document format/language determined
└─────────────────────────────────────────────────────────────────────
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────
│ PHASE 4: Structure Filling
│ ──────────────────────────────────────────────────────────────────
│ INPUT:
│   - chapterStructure: Dict (with per-document language from Phase 3)
│   - finalContentParts: List[ContentPart]
│   - userPrompt: str
│
│ THROUGHPUT:
│   For each document (with per-document language):
│     For each chapter:
│       1. Generate sections structure (parallel)
│       2. For each section:
│          a. Extract per-document language from structure
│          b. Check if ContentParts need Vision AI extraction
│          c. If yes: Call Vision AI (Phase 2 deferred extraction)
│          d. Determine prompt type:
│             - WITH CONTENT: If contentParts assigned
│               → Use aggregation prompt (isAggregation=True)
│               → ContentParts passed as parameters
│               → Use per-document language for generation
│             - WITHOUT CONTENT: If no contentParts
│               → Use generation prompt (isAggregation=False)
│               → Only generationHint in prompt
│               → Use per-document language for generation
│          e. Generate section content with AI
│
│ OUTPUT:
│   - filledStructure: Dict
│     * documents: List[Dict]
│       - language: str (preserved from input structure, per-document)
│       - chapters: List[Dict]
│         * sections: List[Dict]
│           - id: str
│           - content_type: str
│           - elements: List[Dict]
│             * type: str
│             * content: str (or base64 for images)
│
│ HANDOVER STATE:
│   - filledStructure: Complete content, ready for rendering
│   - Per-document language preserved from structure
│   - All Vision AI extractions completed
└─────────────────────────────────────────────────────────────────────
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────
│ PHASE 5: Document Rendering
│ ──────────────────────────────────────────────────────────────────
│ INPUT:
│   - filledStructure: Dict
│   - per-document outputFormat (from Phase 3, determined from prompt)
│   - per-document language (from Phase 3, validated currentUserLanguage)
│
│ THROUGHPUT:
│   1. Group sections by document (from structure)
│   2. For each document:
│      a. Use per-document outputFormat
│      b. Use per-document language
│      c. Render document in specified format
│
│ OUTPUT:
│   - renderedDocuments: List[DocumentData]
│     * documentName: str
│     * documentData: bytes
│     * mimeType: str
│
│ HANDOVER STATE:
│   - renderedDocuments: Final output ready for user
└─────────────────────────────────────────────────────────────────────
```

---

## 2. Detailed Implementation Steps

### Step 1: Update DocumentIntent Model

**File**: `gateway/modules/datamodels/datamodelExtraction.py`

**Changes**:

```python
class DocumentIntent(BaseModel):
    documentId: str
    intents: List[str]  # ["extract", "render", "reference"]
    extractionPrompt: Optional[str] = None
    # Note: outputFormat and language are NOT here - determined during
    # structure generation (Phase 3) in the chapter structure JSON
    reasoning: str
```

**Rationale**:

- Intent clarification focuses on document purpose (extract, render, reference)
- Output format and language are determined later during structure generation (Phase 3)
- Structure generation has full context (user prompt, ContentParts, chapters) to determine format/language

---

### Step 2: Update Intent Analysis Prompt

**File**: `gateway/modules/services/serviceAi/subDocumentIntents.py`

**Changes**:

1. **Add fencing around userPrompt** (Security Fix):

```python
def _buildIntentAnalysisPrompt(
    self,
    userPrompt: str,
    documents: List[ChatDocument],
    actionParameters: Dict[str, Any]
) -> str:
    # FENCE user input to prevent prompt injection
    fencedUserPrompt = f"```user_request\n{userPrompt}\n```"

    # ... existing code builds docListText from documents ...

    # Global fallback only (same source chain as in Step 4); shown for reference
    outputFormat = actionParameters.get("outputFormat") or actionParameters.get("resultType") or "txt"

    prompt = f"""USER REQUEST:
{fencedUserPrompt}

DOCUMENTS TO ANALYZE:
{docListText}

TASK: For each document, determine:
1. Intents (can be multiple): "extract", "render", "reference"

Note: Output format and language are NOT determined here - they will be determined
during structure generation (Phase 3) in the chapter structure JSON

OUTPUT FORMAT: {outputFormat} (global fallback - for reference only)

RETURN JSON:
{{
  "intents": [
    {{
      "documentId": "doc_1",
      "intents": ["extract"],
      "extractionPrompt": "Extract all text content",
      // Note: outputFormat and language are NOT here - determined during
      // structure generation in the chapter structure JSON
      "reasoning": "..."
    }}
  ]
}}
"""
    return prompt
```

2. **Demote global outputFormat in the prompt** (keep as reference-only fallback):
   - Output format is determined per document during structure generation (Phase 3), not during intent analysis
   - Global format remains as fallback if not specified per document

---

### Step 3: Update ContentPart Metadata Propagation

**File**: `gateway/modules/services/serviceAi/subContentExtraction.py`

**Changes**:

```python
async def extractAndPrepareContent(
    self,
    documents: List[ChatDocument],
    documentIntents: List[DocumentIntent],
    parentOperationId: str,
    getIntentForDocument: callable
) -> List[ContentPart]:
    # ... existing extraction logic ...

    # Note: outputFormat and language are NOT propagated here - they're determined
    # during structure generation (Phase 3) in the chapter structure JSON
    # ContentParts are created with intent information only
```

**Rationale**:

- ContentParts carry intent and extraction information only
- Output format and language are determined during structure generation (Phase 3)
- Structure generation has full context to make format/language decisions

---

### Step 4: Update Structure Generation

**File**: `gateway/modules/services/serviceAi/subStructureGeneration.py`

#### Global Format Source Chain

**Note**: `outputFormat` parameter is **optional**. If omitted, formats are determined from user prompt by AI.

**If outputFormat provided**:

1. Action parameters: `action_parameters.get("outputFormat")` or `action_parameters.get("resultType")`
2. Passed to `callAiContent(outputFormat=...)` → `generateStructure(outputFormat=...)` as parameter
3. Used as fallback in State 3 validation if AI doesn't return format per document
4. Final fallback: "txt" if global format is also missing/invalid

**If outputFormat omitted**:

1. AI determines formats per document from user prompt
2. Validation fallback: "txt" (if AI doesn't return format per document)

**Rationale**: With per-document format determination, AI can determine different formats for different documents based on user prompt.
The `outputFormat` parameter is primarily a fallback for validation, not a requirement.

#### Language Source Chain

**Note**: `currentUserLanguage` is always valid (validated during user intention analysis).

1. AI determines per-document language in structure JSON response
2. If AI doesn't return language: Use validated `currentUserLanguage`
3. `currentUserLanguage` validation ensures:
   - AI response `detectedLanguage` is validated (2-character ISO code)
   - If AI didn't return language or invalid → uses user language (`self.services.user.language`)
   - If user language not set → uses "en"
   - Always safe to use directly without fallback logic

**Changes**:

1. **Make outputFormat optional in generateStructure method signature**:

```python
async def generateStructure(
    self,
    userPrompt: str,
    contentParts: List[ContentPart],
    parentOperationId: str,  # Required parameters must precede defaulted ones
    outputFormat: Optional[str] = None  # ← Optional: if omitted, formats determined from prompt by AI
) -> Dict[str, Any]:
    """
    Generate document structure with per-document format determination.

    Multiple documents can be produced with different formats (e.g., one PDF, one HTML).
    AI determines formats per-document from user prompt. The outputFormat parameter is
    only a validation fallback - used if AI doesn't return format per document.

    Args:
        outputFormat: Optional global format fallback. If omitted, formats are determined
            from user prompt by AI. Used as validation fallback if AI doesn't return
            format per document. Defaults to "txt" if not provided.
    """
    # If outputFormat not provided, use "txt" as fallback for validation
    # AI will determine formats per document from user prompt
    if not outputFormat:
        outputFormat = "txt"
        logger.debug("outputFormat not provided - using 'txt' as validation fallback, formats determined from prompt")

    # Group ContentParts by documentId (for context in prompt)
    partsByDocument = {}
    for part in contentParts:
        docId = part.metadata.get("documentId", "default")
        if docId not in partsByDocument:
            partsByDocument[docId] = []
        partsByDocument[docId].append(part)

    # AI determines per-document format and language in structure JSON response
    # Pass global fallback for AI to use if not specified per document
    prompt = self._buildChapterStructurePrompt(
        userPrompt=userPrompt,
        contentParts=contentParts,
        outputFormat=outputFormat  # Fallback for validation (AI determines formats from prompt)
    )
```

**Note**:

- `outputFormat` is **optional**. If omitted, formats are determined from user prompt by AI.
- Used as validation fallback if AI doesn't return format per document.
- User prompt language comes from `self.services.currentUserLanguage`, which is validated during user intention analysis (`workflowManager._sendFirstMessage()`) as described in the Language Source Chain above. It is always valid and safe to use directly without fallback logic.

2. **Update prompt to clarify format determination from prompt**:

```python
def _buildChapterStructurePrompt(
    self,
    userPrompt: str,
    contentParts: List[ContentPart],
    outputFormat: str  # Global fallback (for validation only)
) -> str:
    # Get language from services (validated currentUserLanguage infrastructure)
    language = self._getUserLanguage()  # Uses self.services.currentUserLanguage (always valid)

    # ... existing prompt building ...

    prompt += f"""
## OUTPUT FORMAT (per document)
- Each document can have its own output format (pdf, docx, html, etc.)
- **Determine the format for each document from the USER REQUEST above**
- Multiple documents can have different formats (e.g., one PDF, one HTML)
- Analyze user prompt to identify format requirements:
  * Explicit format mentions (e.g., "as PDF", "in Excel", "HTML document")
  * Document purpose (e.g., "spreadsheet" → xlsx, "presentation" → pptx)
  * Content type requirements
- If format cannot be determined from prompt, use fallback: "{outputFormat}" (for validation only)
- Include "outputFormat" field in each document in the JSON structure
- **CRITICAL**: Formats are determined from user prompt, not from the fallback value

## DOCUMENT LANGUAGE (per document)
- Each document can have its own language (ISO 639-1 code: "de", "en", "fr", etc.)
- Determine the language for each document based on:
  * User prompt language/context
  * Document content context
  * User's explicit language requirements
- If not specified, use validated currentUserLanguage: "{language}" (always valid, validated during user intention analysis)
- Include "language" field in each document in the JSON structure

EXAMPLE JSON STRUCTURE:
{{
  "documents": [
    {{
      "id": "doc_1",
      "title": "Document Title",
      "outputFormat": "pdf",  // ← Determined by AI from user prompt
      "language": "de",  // ← Determined by AI from user prompt
      "chapters": [...]
    }},
    {{
      "id": "doc_2",
      "title": "Another Document",
      "outputFormat": "html",  // ← Different format for different document
      "language": "en",  // ← Different language for different document
      "chapters": [...]
    }}
  ]
}}
"""
    return prompt
```

---

### Step 5: Update Structure Filling - Two Prompt Types

**File**: `gateway/modules/services/serviceAi/subStructureFilling.py`

**Changes**:

1. **Ensure two prompt types are used** (already implemented, verify):

```python
async def _fillSingleSection(
    self,
    section: Dict[str, Any],
    contentParts: List[ContentPart],
    userPrompt: str,
    generationHint: str,
    document: Dict[str, Any],  # ← NEW: Need document to get per-document language
    # ... other params ...
) -> List[Dict[str, Any]]:
    # Extract per-document language from structure
    # Language MUST be defined in structure (validated in State 3)
    # If missing, this is an error - should not happen after State 3 validation
    if "language" not in document:
        raise ValueError(f"Document {document.get('id')} missing 'language' field - should have been set in Phase 3 validation")

    docLanguage = document["language"]

    # Validate language format (should be 2-character ISO code)
    if not isinstance(docLanguage, str) or len(docLanguage) != 2:
        raise ValueError(f"Document {document.get('id')} has invalid language format: {docLanguage} - should be 2-character ISO 639-1 code")

    contentPartIds = section.get("contentPartIds", [])
    hasContentParts = len(contentPartIds) > 0

    if hasContentParts:
        # PROMPT TYPE 1: WITH CONTENT (Aggregation)
        # ContentParts passed as parameters, not in prompt text
        isAggregation = True
        relevantParts = [p for p in contentParts if p.id in contentPartIds]
        generationPrompt = self._buildSectionGenerationPrompt(
            section=section,
            contentParts=relevantParts,  # Passed as parameters
            userPrompt=userPrompt,
            generationHint=generationHint,
            isAggregation=isAggregation,  # ← Key flag
            language=docLanguage  # ← Per-document language from structure
        )
    else:
        # PROMPT TYPE 2: WITHOUT CONTENT (Generation)
        # Only generationHint in prompt, no ContentParts
        isAggregation = False
        generationPrompt = self._buildSectionGenerationPrompt(
            section=section,
            contentParts=[],  # Empty
            userPrompt=userPrompt,
            generationHint=generationHint,
            isAggregation=isAggregation,  # ← Key flag
            language=docLanguage  # ← Per-document language from structure
        )
```

**Note**: Language comes from the document in the structure (per-document), not a global parameter. Each document can have its own language as determined in Phase 3. The language MUST be defined and validated in Phase 3 (State 3 validation) - if missing here, it's an error.

2. **Verify `_buildSectionGenerationPrompt` handles both cases**:

```python
def _buildSectionGenerationPrompt(
    self,
    section: Dict[str, Any],
    contentParts: List[ContentPart],
    userPrompt: str,
    generationHint: str,
    isAggregation: bool,  # ← Determines prompt type
    language: str
) -> str:
    # Assumption: the section title is stored under "title"; falls back to the section id
    sectionTitle = section.get("title", section.get("id", ""))

    if isAggregation:
        # TYPE 1: WITH CONTENT
        # ContentParts are passed as parameters to AI call
        # Don't include full content in prompt text (token efficiency)
        prompt = f"""Generate content for section based on provided ContentParts.

Section: {sectionTitle}
Generation Hint: {generationHint}
Language: {language}

ContentParts are provided as parameters (not shown in prompt for efficiency).
Use the ContentParts data to generate the section content.
"""
    else:
        # TYPE 2: WITHOUT CONTENT
        # Only generationHint, no ContentParts
        prompt = f"""Generate content for section based on generation hint.

Section: {sectionTitle}
Generation Hint: {generationHint}
Language: {language}

Generate content based on the generation hint without referencing external content.
"""
    return prompt
```

**Rationale**:

- **Type 1 (with content)**: Efficient for large content (ContentParts as parameters)
- **Type 2 (without content)**: Simple generation based on hint only
- Already implemented via `isAggregation` flag, verify it's used correctly

---

### Step 6: Update Document Rendering

**File**: `gateway/modules/services/serviceAi/mainServiceAi.py` (renderResult method)

**File**: `gateway/modules/services/serviceGeneration/mainServiceGeneration.py` (renderReport method)

**Current Implementation**:

- `renderResult()` calls `generationService.renderReport()`
- `renderReport()` already processes each document separately (line 385)
- Currently checks `doc.get("format", outputFormat)` (line 397) - but should check the `outputFormat` field
- Language is not handled per-document

**Changes**:

1. **Update renderResult to pass language (from structure, validated before rendering)**:

```python
async def renderResult(
    self,
    filledStructure: Dict[str, Any],
    outputFormat: str,  # Global fallback
    language: str,  # ← NEW: Add language parameter (global fallback)
    title: str,
    userPrompt: str,
    parentOperationId: str
) -> List[RenderedDocument]:
    """
    Render filled structure to documents.

    Per-document format and language are extracted from structure (validated in State 3).
    The outputFormat and language parameters are only used as global fallbacks.
    Multiple documents can have different formats and languages.
    """
    # Language comes from structure (per-document), validated in State 3
    # This parameter is only used as global fallback if structure validation fails
    # Use validated currentUserLanguage as fallback (always valid)
    if not language:
        language = self._getUserLanguage()  # Uses validated currentUserLanguage infrastructure

    # ... existing code ...

    renderedDocuments = await generationService.renderReport(
        filledStructure,
        outputFormat,
        language,  # ← Pass language (global fallback, per-document extracted in renderReport)
        title,
        userPrompt,
        self,
        parentOperationId=renderOperationId
    )
```

**Note**:

- Language comes from structure (per-document) as determined in Phase 3
- The `language` parameter here is only used as a global fallback
- Per-document language is validated in State 3 (Structure Generation) and extracted from structure in `renderReport()`
- Uses validated `currentUserLanguage` infrastructure if fallback needed

2. **Update renderReport to handle per-document format and language**:

```python
async def renderReport(
    self,
    extractedContent: Dict[str, Any],
    outputFormat: str,  # Global fallback
    language: str,  # ← NEW: Add language parameter (global fallback)
    title: str,
    userPrompt: Optional[str] = None,
    aiService=None,
    parentOperationId: Optional[str] = None
) -> List[RenderedDocument]:
    # ... existing validation ...

    # Process EACH document separately
    for docIndex, doc in enumerate(documents):
        # ... existing validation ...

        # Determine format for this document
        # Check outputFormat field first (per-document), then format field (legacy), then global fallback
        docFormat = doc.get("outputFormat") or doc.get("format") or outputFormat

        # Determine language for this document
        # Extract per-document language from structure (validated in State 3), fallback to global
        docLanguage = doc.get("language") or language

        # Validate language format (should be 2-character ISO code, validated in State 3)
        if not isinstance(docLanguage, str) or len(docLanguage) != 2:
            logger.warning(f"Document {doc.get('id')} has invalid language format: {docLanguage}, using fallback")
            docLanguage = language  # Use global fallback

        # Get renderer for this document's format (uses existing renderer registry)
        renderer = self._getFormatRenderer(docFormat)
        if not renderer:
            logger.warning(f"Unsupported format '{docFormat}' for document {doc.get('id', docIndex)}, skipping")
            continue

        # Create JSON structure with single document (preserving metadata)
        singleDocContent = {
            "metadata": {**metadata, "language": docLanguage},  # ← Add per-document language to metadata
            "documents": [doc]
        }

        # Render this document (can return multiple files, e.g., HTML + images)
        renderedDocs = await renderer.render(singleDocContent, docTitle, userPrompt, aiService)
        allRenderedDocuments.extend(renderedDocs)
```

**Note**:

- Per-document format and language are extracted from structure (validated in State 3)
- Renderers (`RendererPdf`, `RendererHtml`, etc.) receive the structure with language in metadata
- They can use it for language-specific formatting if needed
- Multiple documents can have different formats and languages

---

### Step 7: Update ai.process to Pass documentList and Make outputFormat Optional

**File**: `gateway/modules/workflows/methods/methodAi/actions/process.py`

**Changes**:

```python
# Phase 7.3: Pass both documentList and contentParts to AI service
# (Remove extraction logic from here - handled by AI service)

# resultType is optional - if omitted, formats determined from prompt by AI
# Default "txt" is validation fallback only
resultType = parameters.get("resultType")  # Optional: if None, formats determined from prompt
if resultType:
    # Normalize e.g. ".PDF" to "pdf"; fall back to "txt" if nothing remains
    normalized_result_type = (str(resultType).strip().lstrip('.').lower() or "txt")
    output_format = normalized_result_type.replace('.', '') or 'txt'
else:
    # No format specified - AI will determine formats from prompt
    output_format = None
    logger.debug("resultType not provided - formats will be determined from prompt by AI")

# Use unified callAiContent method with BOTH parameters
aiResponse = await self.services.ai.callAiContent(
    prompt=aiPrompt,
    options=options,
    documentList=documentList,  # ← PASS documentList (was missing)
    contentParts=contentParts,  # ← PASS contentParts
    outputFormat=output_format,  # ← Optional: if None, formats determined from prompt
    parentOperationId=operationId,
    generationIntent=generationIntent
)
```

**Note**:

- `resultType` parameter is **optional**. If omitted, formats are determined from user prompt by AI.
- When `resultType` is omitted, "txt" serves only as the validation fallback downstream.
- Language detection from user prompt is already done and validated. `self.services.currentUserLanguage` is always valid (validated during user intention analysis in `workflowManager._sendFirstMessage()`).

---

## 3. Handover State Definitions and Validation

**Purpose**: These state definitions document the expected structure and validation rules at each phase boundary.
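
One rule that recurs across these states is the language fallback chain described earlier (AI-detected language, then the user's language, then "en"). The following is a minimal sketch of that chain, not the actual code in `workflowManager.py`; the helper names `isValidLanguageCode` and `resolveLanguage` are hypothetical, and the check covers only the 2-letter format, not membership in the ISO 639-1 list.

```python
from typing import Optional


def isValidLanguageCode(value: object) -> bool:
    """Format check only: a 2-letter ASCII alphabetic string such as "de" or "en"."""
    return (
        isinstance(value, str)
        and len(value) == 2
        and value.isascii()
        and value.isalpha()
    )


def resolveLanguage(
    detectedLanguage: Optional[str],
    userLanguage: Optional[str],
    default: str = "en",
) -> str:
    """Return the first candidate that passes the format check, else the default.

    Hypothetical helper: mirrors the documented order
    (AI-detected language -> user language -> "en").
    """
    for candidate in (detectedLanguage, userLanguage):
        if isinstance(candidate, str):
            normalized = candidate.strip().lower()
            if isValidLanguageCode(normalized):
                return normalized
    return default


print(resolveLanguage("DE", "fr"))
print(resolveLanguage("german", "fr"))
print(resolveLanguage(None, None))
```

Normalizing before validating means values such as `" EN "` are accepted rather than rejected for their length, which is one possible design choice; the plan itself does not state whether whitespace should be tolerated.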

**Implementation Approach**:

- **Inline validation** in each phase method
- **Auto-fix** where possible (use defaults, skip invalid items)
- **Stop with error** for critical structural issues
- **Log warnings** for skipped items

**See**: Appendix "Validation Failure Handling Decisions" below for detailed Q&A on each validation

**Summary of Validation Decisions**:

- **State 1**: Skip intents for unknown documents; documents without intents are OK
- **State 2**: Skip ContentParts with missing/invalid metadata (with warnings)
- **State 3**: Auto-fix format/language with fallbacks; error on missing structure fields
- **State 4**: Auto-fix missing elements field; allow empty elements
- **State 5**: Skip empty documents; infer mimeType from filename

### State 1: After Intent Clarification

**Location**: `gateway/modules/services/serviceAi/subDocumentIntents.py` - After `clarifyDocumentIntents()` returns (line 115)

**Expected State**:

```python
documentIntents: List[DocumentIntent]  # Complete intent analysis
documents: List[ChatDocument]  # Resolved documents
preExtractedMapping: Dict[str, str]  # Map[originalDocId, jsonDocId]
```

**Implementation Code** (add after line 115, before return):

```python
# Validation and auto-fix
documentIds = {d.id for d in documents}
validatedIntents = []

for intent in documentIntents:
    # Validation 1.2: Skip intents for unknown documents
    if intent.documentId not in documentIds:
        logger.warning(f"Skipping intent for unknown document: {intent.documentId}")
        continue
    validatedIntents.append(intent)

# Validation 1.1: Documents without intents are OK (not needed)
# Intents for non-existing documents are already filtered above
documentIntents = validatedIntents
```

### State 2: After Content Extraction

**Location**: `gateway/modules/services/serviceAi/subContentExtraction.py` - After `extractAndPrepareContent()` returns (at end of method, before return)

**Expected State**:

```python
finalContentParts: List[ContentPart]  # All content parts ready
```
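
The list in this state is the result of merging pre-extracted, extracted, and provided parts (Phase 2, step 3). The plan does not fix a deduplication key, so the following is only a sketch under stated assumptions: the `ContentPart` dataclass is a minimal stand-in for the real model, `mergeContentParts` is a hypothetical helper, parts are considered duplicates when they share `(documentId, id)`, and earlier sources win.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Set, Tuple


@dataclass
class ContentPart:
    """Minimal stand-in for the real ContentPart model (illustration only)."""
    id: str
    data: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def mergeContentParts(*sources: Iterable[ContentPart]) -> List[ContentPart]:
    """Merge part lists in priority order, keeping the first part seen per key.

    Assumption: the dedup key is (documentId, part id).
    """
    seen: Set[Tuple[Any, str]] = set()
    merged: List[ContentPart] = []
    for source in sources:
        for part in source:
            key = (part.metadata.get("documentId"), part.id)
            if key in seen:
                continue  # Duplicate of a part from a higher-priority source
            seen.add(key)
            merged.append(part)
    return merged


preExtracted = [ContentPart("p1", "from JSON", {"documentId": "doc_1", "isPreExtracted": True})]
extracted = [
    ContentPart("p1", "raw again", {"documentId": "doc_1"}),
    ContentPart("p2", "raw", {"documentId": "doc_2"}),
]
provided = [ContentPart("p3", "caller supplied", {"documentId": "doc_3"})]

finalContentParts = mergeContentParts(preExtracted, extracted, provided)
print([p.id for p in finalContentParts])  # ['p1', 'p2', 'p3']
```

Passing the pre-extracted list first means a pre-extracted part shadows a raw re-extraction of the same document, which matches the Phase 1 step of filtering out originals covered by pre-extracted JSON. A different key (for example a content hash) may suit the real model better.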
**Implementation Code** (add at end of method, before return):
```python
# Validation and auto-fix
validatedParts = []
for part in finalContentParts:
    # Validation 2.1: Skip ContentParts without documentId
    if not part.metadata.get("documentId"):
        logger.warning(f"Skipping ContentPart {part.id} - missing documentId in metadata")
        continue

    # Validation 2.2: Skip ContentParts with invalid contentFormat
    contentFormat = part.metadata.get("contentFormat")
    if contentFormat not in ["extracted", "object", "reference"]:
        logger.warning(
            f"Skipping ContentPart {part.id} - invalid contentFormat: {contentFormat}"
        )
        continue

    validatedParts.append(part)

return validatedParts
```

### State 3: After Structure Generation

**Location**: `gateway/modules/services/serviceAi/subStructureGeneration.py` - After `generateStructure()` returns (after parsing JSON, before return, around line 182)

**Expected State**:
```python
chapterStructure: Dict[str, Any]  # Complete structure with documents, chapters, outputFormat, language
```

**Implementation Code** (add after structure JSON is parsed, before return):
```python
# After structure JSON is parsed (around line 182)
# Validation and auto-fix

# Validation 3.1: Structure missing 'documents' field
if "documents" not in structure:
    raise ValueError("Structure missing 'documents' field - cannot auto-fix")

documents = structure["documents"]

# Validation 3.2: Structure has no documents
if not isinstance(documents, list) or len(documents) == 0:
    raise ValueError("Structure has no documents - cannot generate without documents")

# Import renderer registry for format validation (existing infrastructure)
from modules.services.serviceGeneration.renderers.registry import getRenderer

# Both fallbacks are identical for every document, so compute them once.
# outputFormat parameter is optional - if omitted, formats determined from prompt by AI
globalFormatFallback = outputFormat or "txt"  # Fallback for validation
# Validated currentUserLanguage (always valid, validated during user intention analysis)
# Access via _getUserLanguage() which uses self.services.currentUserLanguage
userPromptLanguage = self._getUserLanguage()

# Validate and fix each document
for doc in documents:
    # Validation 3.3 & 3.4: Document outputFormat
    # Use the global fallback only if AI doesn't return format per document
    # Multiple documents can have different formats (e.g., one PDF, one HTML)
    if "outputFormat" not in doc or not doc["outputFormat"]:
        # AI didn't return format or returned empty - use global fallback
        doc["outputFormat"] = globalFormatFallback
        logger.info(f"Document {doc.get('id')} missing outputFormat - using fallback: {doc['outputFormat']}")
    else:
        # AI returned format - validate using existing renderer registry
        formatName = str(doc["outputFormat"]).lower().strip()
        renderer = getRenderer(formatName)  # Uses existing infrastructure
        if not renderer:
            # Format doesn't match any renderer - use txt (simple approach)
            logger.warning(f"Document {doc.get('id')} has format without renderer: {formatName}, using 'txt'")
            doc["outputFormat"] = "txt"
        else:
            # Valid format with renderer - normalize and keep AI result
            doc["outputFormat"] = formatName
            logger.debug(f"Document {doc.get('id')} using AI-determined format: {formatName}")

    # Validation 3.5 & 3.6: Document language
    # Normalize first so surrounding whitespace or casing does not fail the length check
    rawLanguage = doc.get("language")
    normalizedLanguage = rawLanguage.lower().strip() if isinstance(rawLanguage, str) else ""
    if len(normalizedLanguage) != 2:
        # Log before overwriting, so the message reports what the AI actually returned
        if "language" not in doc:
            logger.info(f"Document {doc.get('id')} missing language - using currentUserLanguage: {userPromptLanguage}")
        else:
            logger.warning(f"Document {doc.get('id')} has invalid language format from AI: {rawLanguage!r}, using currentUserLanguage")
        doc["language"] = userPromptLanguage
    else:
        # AI returned valid language format - keep normalized value
        doc["language"] = normalizedLanguage
        logger.debug(f"Document {doc.get('id')} using AI-determined language: {doc['language']}")

    # Validation 3.7: Document missing 'chapters' field
    if "chapters" not in doc:
        raise ValueError(f"Document {doc.get('id')} missing 'chapters' field - cannot auto-fix")

    # Validation 3.8: Chapter missing 'contentParts' field
    for chapter in doc["chapters"]:
        if "contentParts" not in chapter:
            raise ValueError(f"Chapter {chapter.get('id')} missing 'contentParts' field - cannot auto-fix")

return structure
```

### State 4: After Structure Filling

**Location**: `gateway/modules/services/serviceAi/subStructureFilling.py` - After `fillStructure()` returns (at end of method, before return, around line 204)

**Expected State**:
```python
filledStructure: Dict[str, Any]  # Complete content with elements
```

**Implementation Code** (add at end of method, before return):
```python
# Validation and auto-fix

# Validation 4.1: Filled structure missing 'documents' field
if "documents" not in filledStructure:
    raise ValueError("Filled structure missing 'documents' field - cannot auto-fix")

for doc in filledStructure["documents"]:
    # Validation 4.4: Verify language is preserved from input structure
    # Language MUST be preserved from Phase 3 structure (validated in State 3)
    if "language" not in doc:
        raise ValueError(f"Document {doc.get('id')} missing language in filled structure - should have been preserved from Phase 3")

    # Validate language format
    if not isinstance(doc["language"], str) or len(doc["language"]) != 2:
        raise ValueError(f"Document {doc.get('id')} has invalid language format in filled structure: {doc['language']} - should be 2-character ISO 639-1 code")

    for chapter in doc.get("chapters", []):
        for section in chapter.get("sections", []):
            # Validation 4.2: Section missing 'elements' field
            if "elements" not in section:
                section["elements"] = []
                logger.info(f"Section {section.get('id')} missing 'elements' - created empty list")

            # Validation 4.3: Section has empty elements list - ALLOW (intentionally empty is OK)
            # No action needed - empty elements are allowed

return filledStructure
```

### State 5: After Document Rendering

**Location**:
`gateway/modules/services/serviceGeneration/paths/documentPath.py` - After `renderResult()` returns (call at line 151; add the validation after line 157, before building `documentDataList`)

**Expected State**:
```python
renderedDocuments: List[RenderedDocument]  # Final output
```

**Implementation Code** (add after line 157, before building `documentDataList`):
```python
from modules.services.serviceGeneration.subDocumentUtility import getMimeTypeFromExtension

# Validation 5.1: Already implemented at lines 175-176
if not renderedDocuments:
    raise ValueError("No documents were rendered")

# Validation 5.2 & 5.3: Validate and filter rendered documents
validatedRenderedDocs = []
for doc in renderedDocuments:
    # Validation 5.2: Skip documents with empty documentData
    if not doc.documentData:
        logger.warning(f"Skipping rendered document {doc.filename} - empty documentData")
        continue

    # Validation 5.3: Infer mimeType from filename if missing
    if not doc.mimeType:
        if doc.filename:
            inferredMimeType = getMimeTypeFromExtension(doc.filename)
            if inferredMimeType:
                doc.mimeType = inferredMimeType
                logger.info(f"Inferred mimeType '{inferredMimeType}' from filename '{doc.filename}'")
            else:
                logger.warning(f"Could not infer mimeType from filename '{doc.filename}' - keeping as None")
        else:
            logger.warning("Rendered document missing mimeType and filename - cannot infer")

    validatedRenderedDocs.append(doc)

# Use validated list
renderedDocuments = validatedRenderedDocs

# Re-check after filtering
if not renderedDocuments:
    raise ValueError("No valid documents after validation")
```

---

## 4.
Migration Checklist ### Phase 1: Model Updates - [ ] Verify `DocumentIntent` model does NOT include `outputFormat` or `language` - [ ] Intent clarification focuses only on document purpose (intents, extractionPrompt) - [ ] Note: outputFormat and language are determined during structure generation (Phase 3) ### Phase 2: Intent Analysis Updates - [ ] **CRITICAL**: Add fencing around `userPrompt` in intent analysis prompt - [ ] Fence user input with code blocks: ```user_request\n{userPrompt}\n``` - [ ] Test with various user inputs (special chars, JSON, newlines, prompt injection attempts) - [ ] Update prompt to focus only on document intents (extract, render, reference) - [ ] Remove any outputFormat/language determination from intent analysis prompt - [ ] Keep global outputFormat/language as reference only (not for determination) - [ ] **Verify intent mapping logic** (already implemented in `clarifyDocumentIntents`): - [ ] Step 1: Map pre-extracted JSONs to original documents (lines 63-83) - [ ] Step 2: AI analyzes intents for original documents (line 86) - [ ] Step 3: Map intents back to JSON doc IDs (lines 96-104) - [ ] Test with pre-extracted JSONs to verify mapping works correctly ### Phase 3: Content Extraction Updates - [ ] Verify ContentParts do NOT include outputFormat or language in metadata - [ ] ContentParts carry only intent and extraction information - [ ] Verify pre-extracted JSON handling preserves intent information - [ ] **Add filtering to Data Extraction Path** (`_handleDataExtraction`): **Current State (BEFORE filtering)**: ```python # Line 708: Get documents directly from documentList documents = self.services.chat.getChatDocumentsFromDocumentList(documentList) # Line 721: Call extractAndPrepareContent() with ALL documents preparedContentParts = await self.extractAndPrepareContent(documents, ...) 
``` **Problem**: If `documentList` contains both: - Original document: `original_pdf_123.pdf` - Pre-extracted JSON: `pre_extracted_456.json` (contains ContentParts from `original_pdf_123.pdf`) → Both are processed → **DUPLICATE ContentParts created** **How Filtering Works (Reference: `documentPath.py` lines 62-87)**: **Step 1: Identify Pre-Extracted JSONs and Map to Originals** ```python # Collect all original document IDs that are covered by pre-extracted JSONs originalDocIdsCoveredByPreExtracted = set() for doc in documents: preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc) if preExtracted: # Pre-extracted JSON found - get the original document ID it covers originalDocId = preExtracted["originalDocument"]["id"] originalDocIdsCoveredByPreExtracted.add(originalDocId) ``` **Result**: `originalDocIdsCoveredByPreExtracted = {"original_pdf_123"}` (if pre-extracted JSON covers it) **Step 2: Filter Documents List** ```python filteredDocuments = [] for doc in documents: preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc) if preExtracted: # Pre-extracted JSON - KEEP IT (will be processed as ContentParts) filteredDocuments.append(doc) elif doc.id in originalDocIdsCoveredByPreExtracted: # Original document covered by pre-extracted JSON - REMOVE IT logger.info(f"Skipping original document {doc.id} - already covered") # Do NOT append - skip this document else: # Regular document (not pre-extracted, not covered) - KEEP IT filteredDocuments.append(doc) documents = filteredDocuments # Use filtered list ``` **Result**: - ✅ Pre-extracted JSON: `pre_extracted_456.json` → KEPT - ❌ Original document: `original_pdf_123.pdf` → REMOVED (covered by pre-extracted JSON) - ✅ Regular document: `other_doc.pdf` → KEPT (not covered) **Step 3: Use Filtered Documents** ```python # Now call extractAndPrepareContent() with filtered documents only preparedContentParts = await self.extractAndPrepareContent( documents, # Only pre-extracted JSONs + regular docs (no 
originals covered by JSONs) documentIntents or [], extractOperationId ) ``` **Result**: No duplicates - original documents already filtered out **Implementation Steps**: - [ ] Add filtering logic between line 708 (get documents) and line 710 (clarify intents) - [ ] Copy filtering code from `documentPath.py` lines 62-87 - [ ] Adapt to use `self.intentAnalyzer.resolvePreExtractedDocument()` (same method) - [ ] **Filtering Logic**: ```python # Step 1: Identify all original document IDs covered by pre-extracted JSONs originalDocIdsCoveredByPreExtracted = set() for doc in documents: preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc) if preExtracted: originalDocId = preExtracted["originalDocument"]["id"] originalDocIdsCoveredByPreExtracted.add(originalDocId) logger.debug(f"Found pre-extracted JSON {doc.id} covering original document {originalDocId}") # Step 2: Filter documents - remove originals covered by pre-extracted JSONs filteredDocuments = [] for doc in documents: preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc) if preExtracted: filteredDocuments.append(doc) # Keep pre-extracted JSON elif doc.id in originalDocIdsCoveredByPreExtracted: logger.info(f"Skipping original document {doc.id} ({doc.fileName}) - already covered by pre-extracted JSON") else: filteredDocuments.append(doc) # Keep regular document documents = filteredDocuments # Use filtered list ``` - [ ] Test with scenario: original document + pre-extracted JSON → verify no duplicates - [ ] **Remove redundant check from `extractAndPrepareContent()`**: - [ ] Remove pre-extracted JSON check (line 77 in `subContentExtraction.py`) - [ ] Trust that filtering is done upstream - [ ] Cleaner code, single responsibility - [ ] Test merging logic - [ ] Test that both document generation and data extraction paths handle pre-extracted JSONs correctly - [ ] Note: outputFormat and language are NOT propagated here - determined in structure generation ### Phase 4: Structure Generation 
Updates - [ ] **Make outputFormat optional in generateStructure() method signature**: - [ ] Update `subStructureGeneration.py` method signature (line 47): `outputFormat: Optional[str] = None` - [ ] Update `mainServiceAi.py` wrapper method (line 444): Make `outputFormat` optional - [ ] If `outputFormat` not provided, use "txt" as validation fallback (AI determines formats from prompt) - [ ] Add logging: "outputFormat not provided - using 'txt' as validation fallback, formats determined from prompt" - [ ] **Context**: `outputFormat` is only a validation fallback - AI determines per-document formats from user prompt. Multiple documents can have different formats (e.g., one PDF, one HTML). - [ ] **Note on language handling**: Language is accessed via `self.services.currentUserLanguage` (always valid, validated during user intention analysis). No language parameter needed in `generateStructure()` method signature - language is accessed directly from services within the method. - [ ] Verify `currentUserLanguage` is used correctly in `subStructureGeneration.py` (via `self.services.currentUserLanguage`) - [ ] Verify `currentUserLanguage` is used correctly in prompt building (via `self.services.currentUserLanguage`) - [ ] Note: `mainServiceGeneration.py` uses different service - verify if update needed - [ ] Group ContentParts by documentId (for context in prompt) - [ ] Update `_buildChapterStructurePrompt()` to access language via `self.services.currentUserLanguage` (no parameter needed) - [ ] Update structure generation prompt to ask AI to determine per-document outputFormat - [ ] Explicitly require `outputFormat` field in each document JSON structure - [ ] Update example structure to show `outputFormat` field (not just filename) - [ ] Clarify that multiple documents can have different formats - [ ] Update structure generation prompt to ask AI to determine per-document language - [ ] Explicitly require `language` field in each document JSON structure - [ ] Clarify that 
multiple documents can have different languages - [ ] Provide global fallbacks (outputFormat, language) for AI to use if not specified - [ ] `outputFormat` fallback: from parameter or "txt" - [ ] `language` fallback: use `self._getUserLanguage()` (validated currentUserLanguage infrastructure) - [ ] **Parse and validate format/language from AI response**: - [ ] Extract `outputFormat` and `language` from each document in structure JSON - [ ] **Format validation (use existing renderer registry infrastructure)**: - [ ] Import: `from modules.services.serviceGeneration.renderers.registry import getRenderer` - [ ] If `outputFormat` missing or empty → use global fallback (`outputFormat` or "txt") - [ ] If `outputFormat` exists → check if it has a renderer using `getRenderer(formatName)` (existing infrastructure) - [ ] Normalize format name: `formatName.lower().strip()` - [ ] If format doesn't match any renderer → use "txt" (simple approach, no global fallback attempt) - [ ] Log warnings for invalid formats - [ ] **Note**: Infrastructure exists at `mainServiceGeneration.py:529` - reuse `getRenderer()` function - [ ] **Language validation (use existing validated infrastructure)**: - [ ] Validate language (must be 2-character ISO 639-1 code) - [ ] **If language missing**: Set to `self._getUserLanguage()` which uses validated `currentUserLanguage` (always valid, validated during user intention analysis at `workflowManager.py:695-727`) - [ ] **If language invalid format**: Use `self._getUserLanguage()` (always valid) - [ ] Normalize language: `language.lower().strip()[:2]` - [ ] Log warnings for invalid/missing values - [ ] **Note**: `currentUserLanguage` is always valid - safe to use directly via `_getUserLanguage()` method - [ ] **Error handling**: - [ ] If structure JSON is malformed → raise error with details - [ ] If no documents in structure → raise error - [ ] If AI doesn't return format → use global `outputFormat` fallback (or "txt" if not provided), log warning - [ ] 
If AI doesn't return language → use validated `currentUserLanguage` (always valid), log warning - [ ] Verify structure output includes per-document format and language (from AI in JSON response) ### Phase 5: Structure Filling Verification - [ ] Verify two prompt types are correctly used: - [ ] `isAggregation=True`: ContentParts as parameters - [ ] `isAggregation=False`: Only generationHint - [ ] **Verify per-document language is extracted and used**: - [ ] Language MUST be defined in structure (validated in State 3) - [ ] Language extracted from document in structure (per-document) - NO fallback to "en" - [ ] If language missing: Raise error (should not happen after State 3 validation) - [ ] If language invalid format: Raise error (should not happen after State 3 validation) - [ ] Language passed to `_buildSectionGenerationPrompt()` for each section - [ ] Language preserved in filled structure (State 4 validation) - [ ] Test both prompt types with various scenarios - [ ] Verify Vision AI extraction happens during filling phase - [ ] Test with multi-document scenarios (different languages per document) ### Phase 6: Document Rendering Updates - [ ] **Add language parameter to renderResult() method**: - [ ] Update `mainServiceAi.py` renderResult() signature (line 460) - [ ] Pass language to `generationService.renderReport()` (as global fallback) - [ ] **Update renderResult call site** (`documentPath.py` line 151): - [ ] Language comes from structure (per-document), validated in State 3 - [ ] Use validated `currentUserLanguage` as global fallback (always valid) - [ ] Per-document language will be extracted in `renderReport()` from filledStructure - [ ] Code example: ```python # Language is already validated in structure (State 3) and preserved in filled structure (State 4) # Per-document language will be extracted in renderReport() from filledStructure # Use validated currentUserLanguage as global fallback (always valid infrastructure) language = 
self.services.currentUserLanguage or "en" # Uses validated infrastructure renderedDocuments = await self.services.ai.renderResult( filledStructure, outputFormat, language, # ← Global fallback (per-document language extracted from structure in renderReport) title or "Generated Document", userPrompt, docOperationId ) ``` - [ ] **Update renderReport() to handle per-document format and language**: - [ ] Add language parameter to method signature (line 349): `language: str` (global fallback) - [ ] Extract per-document format: `docFormat = doc.get("outputFormat") or doc.get("format") or outputFormat` (check `outputFormat` field first) - [ ] Extract per-document language: `docLanguage = doc.get("language") or language` (from structure, validated in State 3) - [ ] Validate language format (should be 2-character ISO code, validated in State 3) - [ ] Add language to metadata passed to renderers: `metadata["language"] = docLanguage` - [ ] **Note**: Per-document format and language are extracted from structure (validated in State 3). Multiple documents can have different formats and languages. 
- [ ] **Error handling**: - [ ] If no documents in structure → raise error - [ ] If filtering removes all documents → raise error - [ ] If format not supported → log warning, skip document - [ ] Test multi-document rendering with different formats/languages ### Phase 7: ai.process Refactoring - [ ] Remove extraction logic from `ai.process` (lines 72-119) - [x] **Make resultType optional**: ✅ **IMPLEMENTED** - [x] Update `ai.process`: Make `resultType` optional (can be `None`) - ✅ **COMPLETED** - [x] Update `ai.generateDocument`: Make `resultType` optional, removed auto-detection - ✅ **COMPLETED** - [x] Update `ai.generateCode`: Make `resultType` optional, removed auto-detection - ✅ **COMPLETED** - [x] If `resultType` omitted → pass `None` to `callAiContent()` (formats determined from prompt) - ✅ **COMPLETED** - [x] Updated action parameter definitions in `methodAi.py` - ✅ **COMPLETED** **Implementation Status**: - ✅ **ai.process**: `resultType` optional, passes `None` if omitted - ✅ **ai.generateDocument**: `resultType` optional, passes `None` if omitted - ✅ **ai.generateCode**: `resultType` optional, passes `None` if omitted - ✅ **callAiContent**: Already supports optional `outputFormat` (defaults to "txt") - [ ] **generateStructure**: Make `outputFormat` optional (see Phase 4 checklist) - [ ] **Add filtering to Data Extraction Path** (`_handleDataExtraction`): - [ ] **Location**: `mainServiceAi.py` between line 708 (get documents) and line 721 (extract content) - [ ] **Purpose**: Prevent duplicate ContentParts when both original document and pre-extracted JSON are provided - [ ] **Implementation**: Copy filtering logic from `documentPath.py:62-87` - [ ] Filter out original documents covered by pre-extracted JSONs before calling `extractAndPrepareContent()` - [ ] See Phase 3 checklist for detailed filtering code - [ ] Pass `documentList` to `callAiContent()` (currently missing, line 155-162 in `process.py`) - [ ] `documentList` is available in `process.py` (lines 
43-55) but not passed to `callAiContent()` - [ ] Add `documentList=documentList` parameter to `callAiContent()` call - [ ] Pass `contentParts` to `callAiContent()` (already done) - [ ] **Error handling**: - [ ] If no documents and no contentParts → raise error - [ ] If filtering removes all documents → raise error - [ ] Verify intelligent merging in AI service works correctly ### Phase 8: Testing - [ ] Test with pre-extracted JSON documents - [ ] Test with mixed `documentList` + `contentParts` - [ ] Test per-document format/language determination - [ ] Test two prompt types in structure filling - [ ] Test multi-document output with different formats/languages - [ ] Test security: prompt injection attempts with fenced input - [ ] **Test optional outputFormat handling**: - [ ] Test with `resultType` provided → formats used as fallback - [ ] Test with `resultType` omitted → AI determines formats from prompt - [ ] Test format validation: invalid format → uses "txt" - [ ] Test format validation: format without renderer → uses "txt" ### Phase 9: Documentation - [ ] Update API documentation - [ ] Update developer documentation - [ ] Update user documentation (if applicable) --- ## Priority Order **High Priority (Security & Critical Path)**: 1. **Phase 2**: Intent Analysis Updates - Security fix (fencing) is CRITICAL 2. **Phase 7**: ai.process Refactoring - Add filtering to Data Extraction Path (prevents duplicate ContentParts) 3. **Phase 1**: Model Updates - Foundation for all other changes **Medium Priority (Architectural Improvements)**: 4. **Phase 4**: Structure Generation Updates - Make outputFormat optional (AI determines per-document formats) - Implement State 3 validation (use existing renderer registry and language infrastructure) - Update prompt to require outputFormat field per document 5. **Phase 6**: Document Rendering Updates - Extract per-document format/language from structure - Add language parameter to renderResult() and renderReport() 6. 
**Phase 3**: Content Extraction Updates
   - Remove redundant pre-extracted check AFTER filtering added upstream

**Low Priority (Verification & Polish)**:
7. **Phase 5**: Structure Filling Verification (already implemented, verify)
8. **Phase 8**: Testing
9. **Phase 9**: Documentation

---

## Notes

- The two prompt types for Structure Filling (checklist Phase 5) are already implemented via the `isAggregation` flag. That checklist phase focuses on verification and documentation.
- Per-document format/language determination follows the same pattern as existing per-document language handling.
- The security fix (fencing user input) should be implemented immediately as it addresses a potential prompt injection vulnerability.

---

## Architectural Note: Filtering and Redundant Pre-Extracted JSON Checks

### Problem Statement

When a user provides both an original document and a pre-extracted JSON containing ContentParts from that original document, we need to prevent duplicate ContentParts from being created.

### Current State

The pre-extracted JSON check happens **twice**:
1. **Phase 1** (`documentPath.py` lines 67-87): Filters documents before intent clarification
2. **Phase 2** (`subContentExtraction.py` line 77): Checks again during extraction loop

### Why Filtering is Necessary

**The redundant check in `extractAndPrepareContent()` only identifies if a document IS a pre-extracted JSON.
It does NOT identify if a document is an ORIGINAL covered by a pre-extracted JSON.** **Example**: ```python # In extractAndPrepareContent loop: for document in [original_pdf_123, pre_extracted_456]: # Check document 1: original_pdf_123 preExtracted = resolvePreExtractedDocument(original_pdf_123) # Returns: None (it's not a pre-extracted JSON) # → Processes original_pdf_123 → extracts ContentParts # Check document 2: pre_extracted_456 preExtracted = resolvePreExtractedDocument(pre_extracted_456) # Returns: {originalDocument: {id: "original_pdf_123"}, ...} # → Processes pre_extracted_456 → extracts ContentParts # Result: BOTH processed → DUPLICATES ``` **The redundant check doesn't help because**: - It only looks at ONE document at a time - It doesn't know about OTHER documents in the list - It can't compare documents to find relationships ### Why Filtering Works Filtering happens BEFORE the extraction loop, so it can: 1. Look at ALL documents at once 2. Identify relationships between documents 3. 
Remove originals BEFORE extraction starts ### Code Path Analysis #### Path 1: Document Generation Path (`documentPath.py`) **Location**: Line 103 **Filtering**: ✅ YES (lines 62-87) - Identifies pre-extracted JSONs - Filters out original documents covered by pre-extracted JSONs - Only passes filtered documents to `extractAndPrepareContent()` **Result**: ✅ **NO DUPLICATES** - Original document already filtered out #### Path 2: Data Extraction Path (`mainServiceAi.py` `_handleDataExtraction`) **Location**: Line 721 **Filtering**: ❌ **NO** - Gets documents directly from `documentList` (line 708) - Calls `extractAndPrepareContent()` without any filtering - Does NOT filter out original documents covered by pre-extracted JSONs **Result**: ❌ **DUPLICATES CREATED** - Both documents processed, same content extracted twice ### Visual Flow Comparison #### Document Generation Path (WITH Filtering - CURRENT) ``` documentList: [original_pdf_123, pre_extracted_456] ↓ [FILTERING] Identify relationships, remove originals ↓ filteredDocuments: [pre_extracted_456] ← original_pdf_123 removed ↓ extractAndPrepareContent([pre_extracted_456]) ↓ ContentParts from pre_extracted_456 only ↓ ✅ NO DUPLICATES ``` #### Data Extraction Path (WITHOUT Filtering - CURRENT) ``` documentList: [original_pdf_123, pre_extracted_456] ↓ [NO FILTERING] Pass all documents ↓ extractAndPrepareContent([original_pdf_123, pre_extracted_456]) ↓ Process original_pdf_123 → ContentParts Process pre_extracted_456 → ContentParts ↓ ❌ DUPLICATES (same content twice) ``` #### Data Extraction Path (WITH Filtering - TARGET) ``` documentList: [original_pdf_123, pre_extracted_456] ↓ [FILTERING] Identify relationships, remove originals ↓ filteredDocuments: [pre_extracted_456] ← original_pdf_123 removed ↓ extractAndPrepareContent([pre_extracted_456]) ↓ ContentParts from pre_extracted_456 only ↓ ✅ NO DUPLICATES ``` ### Solution **Target State**: Add filtering to Data Extraction Path, then remove redundant check **Steps**: 1. 
**Add filtering logic to `_handleDataExtraction`** (between line 708 and line 721) - Copy filtering code from `documentPath.py` lines 62-87 - Filter out original documents covered by pre-extracted JSONs 2. **Remove redundant check from `extractAndPrepareContent()`** (line 77) - Trust that filtering is done upstream - Cleaner code, single responsibility **Risk Assessment**: - **If we remove redundant check WITHOUT adding filtering**: ⚠️ Duplicates still occur (no change from current state) - **If we add filtering THEN remove redundant check**: ✅ No duplicates, cleaner code ### Conclusion 1. **Filtering is necessary** because it can look at ALL documents and identify relationships 2. **Redundant check is insufficient** because it only looks at ONE document at a time 3. **Current state**: Document Generation Path filters → safe. Data Extraction Path doesn't filter → duplicates possible 4. **Solution**: Add filtering to Data Extraction Path, then remove redundant check (it's not needed if filtering is done) 5. **Risk of removing redundant check**: None IF filtering is added first. High IF filtering is NOT added (but duplicates already exist anyway) --- ## Appendix: Pre-Extracted JSON Document Check Locations ### Where the Check is Done **1. Phase 1 (Before Intent Clarification)**: - **File**: `gateway/modules/services/serviceGeneration/paths/documentPath.py` - **Lines**: 67-87 - **Purpose**: Filter documents before intent analysis - **Method**: `self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc)` - **Action**: Identifies pre-extracted JSONs and filters out original documents covered by them **2. 
Phase 2 (During Content Extraction)**:
- **File**: `gateway/modules/services/serviceAi/subContentExtraction.py`
- **Line**: 77
- **Purpose**: Process each document during extraction loop
- **Method**: `self.intentAnalyzer.resolvePreExtractedDocument(document)`
- **Action**: Extracts ContentParts from the pre-extracted JSON (does not treat it as regular JSON)
- **Note**: ⚠️ **REDUNDANT** - This check happens again even though Phase 1 already filtered documents
- **Reason**: `extractAndPrepareContent()` is called from multiple code paths:
  - Document generation path (`documentPath.py`) - filtering already done
  - Data extraction path (`mainServiceAi.py`) - filtering may not be done
  - The extraction service needs to handle pre-extracted JSONs defensively
- **Optimization Opportunity**: Could pass filtered documents or a flag to skip redundant checks

**3. Check Implementation**:
- **File**: `gateway/modules/services/serviceAi/subDocumentIntents.py`
- **Line**: 122
- **Method**: `resolvePreExtractedDocument(document: ChatDocument)`
- **Logic**:
  - Checks if `mimeType == "application/json"`
  - Parses JSON and checks for `validationMetadata.actionType == "context.extractContent"`
  - Extracts `ContentExtracted` structure from `documentData`
  - Returns dict with `originalDocument` and `contentExtracted` info

### Where Final Merged List is Available

**After Phase 2 (Content Extraction)**:
- **File**: `gateway/modules/services/serviceGeneration/paths/documentPath.py`
- **Line**: 119
- **Code**: `contentParts = preparedContentParts`
- **State**:
  - ✅ All pre-extracted JSON documents processed → ContentParts
  - ✅ All regular documents extracted → ContentParts
  - ✅ All provided contentParts merged
  - ✅ Final clean merged list ready for Phase 3 (Structure Generation)

**Before Phase 3 (Structure Generation)**:
- **File**: `gateway/modules/services/serviceGeneration/paths/documentPath.py`
- **Line**: 129
- **Usage**: `contentParts or []` passed to `generateStructure()`
- **Note**: This is the clean
merged list containing all ContentParts from all sources --- ## Appendix: Intent Mapping Logic for Pre-Extracted JSONs ### How Intent Mapping Works **Problem**: When a pre-extracted JSON document is provided, we need to: 1. Analyze intents for the **original document** (not the JSON file itself) 2. Map the intents back to the **JSON document ID** (so they can be applied to the ContentParts extracted from the JSON) ### Implementation Logic (Already in `clarifyDocumentIntents`) **Location**: `gateway/modules/services/serviceAi/subDocumentIntents.py` lines 63-104 **Step 1: Build Mapping** (lines 63-83) ```python documentMapping = {} # Maps original doc ID → JSON doc ID resolvedDocuments = [] for doc in documents: preExtracted = self.resolvePreExtractedDocument(doc) if preExtracted: # This is a pre-extracted JSON originalDocId = preExtracted["originalDocument"]["id"] jsonDocId = doc.id # Current document is the JSON # Map: original doc ID → JSON doc ID documentMapping[originalDocId] = jsonDocId # Create temporary ChatDocument for original document originalDoc = ChatDocument( id=originalDocId, fileName=preExtracted["originalDocument"]["fileName"], mimeType=preExtracted["originalDocument"]["mimeType"], # ... other fields from preExtracted["originalDocument"] ) resolvedDocuments.append(originalDoc) # Use original doc for intent analysis else: resolvedDocuments.append(doc) # Regular document, use as-is ``` **Result**: - `documentMapping = {"original_pdf_123": "pre_extracted_456"}` - `resolvedDocuments = [ChatDocument(id="original_pdf_123"), ChatDocument(id="other_doc")]` **Step 2: AI Analyzes Intents** (line 86) ```python # AI analyzes intents for resolvedDocuments (original documents, not JSONs) intentPrompt = self._buildIntentAnalysisPrompt(userPrompt, resolvedDocuments, actionParameters) aiResponse = await self.aiService.callAiPlanning(prompt=intentPrompt, ...) 
``` **AI Response**: ```json { "intents": [ { "documentId": "original_pdf_123", // ← Original document ID "intents": ["extract"], "extractionPrompt": "Extract all text", "reasoning": "..." } ] } ``` **Step 3: Map Intents Back to JSON Doc IDs** (lines 96-104) ```python intentsData = json.loads(self.services.utils.jsonExtractString(aiResponse)) documentIntents = [] for intent in intentsData.get("intents", []): docId = intent.get("documentId") # "original_pdf_123" # If intent is for an original document covered by a pre-extracted JSON if docId in documentMapping: # Map back to JSON document ID intent["documentId"] = documentMapping[docId] # "pre_extracted_456" documentIntents.append(DocumentIntent(**intent)) ``` **Result**: - `DocumentIntent(documentId="pre_extracted_456", intents=["extract"], ...)` - Intent is now mapped to the JSON document ID, so it can be applied to ContentParts extracted from the JSON ### Why This Works 1. **AI analyzes original documents**: More meaningful context (file name, MIME type, etc.) 2. **Intents mapped to JSON IDs**: ContentParts extracted from JSON can be tagged with correct intents 3. 
**Consistent with filtering**: Original documents are filtered out, but their intents are preserved via mapping ### Example Flow ``` Input: - documentList: [original_pdf_123.pdf, pre_extracted_456.json] Step 1: Filtering (Phase 1) - Identify: pre_extracted_456.json covers original_pdf_123.pdf - Filter: Remove original_pdf_123.pdf - Result: documents = [pre_extracted_456.json] Step 2: Intent Mapping (Phase 1) - Build mapping: {"original_pdf_123": "pre_extracted_456"} - Resolve: resolvedDocuments = [ChatDocument(id="original_pdf_123")] - AI analyzes: intents for "original_pdf_123" - Map back: intents for "pre_extracted_456" Step 3: Content Extraction (Phase 2) - Extract ContentParts from pre_extracted_456.json - Apply intents (from Step 2) to ContentParts - Result: ContentParts with correct intents ``` --- ## Implementation Notes ### Infrastructure Available The following infrastructure already exists and should be reused: - **Language Validation**: `currentUserLanguage` is validated at `workflowManager.py:695-727` - always valid 2-character ISO code. Access via `self.services.currentUserLanguage` or `_getUserLanguage()` method. - **Format Validation**: Renderer registry exists at `mainServiceGeneration.py:529` (`_getFormatRenderer()` uses `getRenderer()`). Import: `from modules.services.serviceGeneration.renderers.registry import getRenderer`. Returns None if format invalid, falls back to text renderer. - **Language Extraction**: `_getDocumentLanguage()` works correctly at `subStructureFilling.py:349` - extracts per-document language from structure. Used properly during section generation. ### Key Implementation Points 1. **Per-Document Format/Language**: Multiple documents can have different formats and languages. AI determines these from user prompt. Parameters are only validation fallbacks. 2. **Filtering**: Must filter pre-extracted JSONs before content extraction to prevent duplicate ContentParts. 
Filtering logic exists in `documentPath.py:62-87` and should be copied to data extraction path. 3. **State 3 Validation**: Use existing infrastructure (`getRenderer()`, `_getUserLanguage()`) for validation. Infrastructure exists, just needs to be called. 4. **Rendering**: Extract per-document `outputFormat` and `language` from structure (validated in State 3). Check `outputFormat` field first, then `format` field (legacy), then global fallback. --- ## Appendix: Validation Failure Handling Decisions This appendix documents the decision-making process for how to handle each validation failure. The actual implementation code is integrated into Section 3 above. ### Approach - **Try to fix automatically** (use defaults) when validation fails - **All validations are critical** (must not fail - fix or error) - **Validation happens inline** in each phase method ### State 1: After Intent Clarification #### Validation 1.1: Intent count mismatch **Check**: `len(documentIntents) != len(documents)` **Decision**: Documents without intents are OK. Intents for non-existing documents should be skipped. **Rationale**: Not all documents need intents (some may be reference-only). Intents referencing unknown documents are invalid and should be removed. #### Validation 1.2: Intent references unknown document **Check**: `intent.documentId not in documentIds` **Decision**: Skip this intent (remove it) **Rationale**: Cannot map intent to non-existent document. Better to skip than fail. --- ### State 2: After Content Extraction #### Validation 2.1: ContentPart missing documentId **Check**: `not part.metadata.get("documentId")` **Decision**: Skip this ContentPart (remove it) with warning in logger **Rationale**: ContentPart without documentId cannot be properly assigned. Skip with warning for debugging. 
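
The skip-rather-than-fail decisions for Validations 1.2 and 2.1 could be sketched roughly as follows. `SketchIntent` and `SketchPart` are hypothetical stand-ins for `DocumentIntent` and `ContentPart` that carry only the fields the checks read, and the helper names and log messages are illustrative assumptions, not the existing implementation.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set

logger = logging.getLogger(__name__)


@dataclass
class SketchIntent:
    """Hypothetical stand-in for DocumentIntent (only the fields used here)."""
    documentId: str
    intents: List[str] = field(default_factory=list)


@dataclass
class SketchPart:
    """Hypothetical stand-in for ContentPart (only the metadata dict)."""
    metadata: Dict[str, Any] = field(default_factory=dict)


def dropUnknownIntents(intents: List[SketchIntent], documentIds: Set[str]) -> List[SketchIntent]:
    """Validation 1.2: keep only intents whose documentId is a known document."""
    kept = []
    for intent in intents:
        if intent.documentId in documentIds:
            kept.append(intent)
        else:
            logger.warning("Skipping intent for unknown document %s", intent.documentId)
    return kept


def dropPartsWithoutDocumentId(parts: List[SketchPart]) -> List[SketchPart]:
    """Validation 2.1: skip parts that carry no documentId, logging a warning."""
    kept = []
    for part in parts:
        if part.metadata.get("documentId"):
            kept.append(part)
        else:
            logger.warning("Skipping ContentPart without documentId")
    return kept
```

Neither helper raises. A document that ends up with no intent is left alone, in line with the decision for Validation 1.1.
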
#### Validation 2.2: ContentPart has invalid contentFormat

**Check**: `contentFormat not in ["extracted", "object", "reference"]`

**Decision**: Skip this ContentPart (remove it) with warning in logger

**Rationale**: Invalid contentFormat indicates corrupted data. Skip with warning for debugging.

---

### State 3: After Structure Generation

#### Validation 3.1: Structure missing 'documents' field

**Check**: `"documents" not in chapterStructure`

**Decision**: Stop with error (cannot auto-fix - structure is invalid)

**Rationale**: Structure without documents field is fundamentally broken. Cannot proceed.

#### Validation 3.2: Structure has no documents

**Check**: `len(documents) == 0`

**Decision**: Stop with error (cannot generate without documents)

**Rationale**: Cannot generate output without documents. Must have at least one document.

#### Validation 3.3: Document missing 'outputFormat' field

**Check**: `"outputFormat" not in doc`

**Decision**: Use global fallback format (from parameters), if not available use default "txt"

**Rationale**: Format is required for rendering. Use fallback chain: per-document → global → default.

#### Validation 3.4: Document has invalid outputFormat

**Check**: `outputFormat not in valid formats`

**Decision**: Use renderer registry to check if format has a renderer. If no renderer exists, try global fallback, then default "txt"

**Rationale**: Use dynamic renderer registry (not hardcoded list) to check format validity. Fallback chain ensures we always have a valid format.

#### Validation 3.5: Document missing 'language' field

**Check**: `"language" not in doc`

**Decision**: Use user prompt language (from `self.services.currentUserLanguage` via `_getUserLanguage()`), not "en" fallback

**Rationale**: Language is required for content generation. Use user prompt language (detected from user intention analysis) as fallback, not hardcoded "en".
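
The fallback chains in Validations 3.3 to 3.5 might look something like the sketch below. The `hasRenderer` callable is an assumed wrapper around the registry lookup (for example `lambda fmt: getRenderer(fmt) is not None`), and the function names and the `DEFAULT_FORMAT` constant are hypothetical.

```python
from typing import Any, Callable, Dict, Optional

DEFAULT_FORMAT = "txt"  # last link of the chain: per-document → global → default


def resolveOutputFormat(
    doc: Dict[str, Any],
    globalFormat: Optional[str],
    hasRenderer: Callable[[str], bool],
) -> str:
    """Validations 3.3/3.4: first candidate that is present and has a renderer wins."""
    for candidate in (doc.get("outputFormat"), globalFormat):
        if candidate and hasRenderer(candidate):
            return candidate
    return DEFAULT_FORMAT


def resolveLanguage(doc: Dict[str, Any], userLanguage: str) -> str:
    """Validation 3.5: a missing language falls back to the validated user language."""
    return doc.get("language") or userLanguage
```

Passing the renderer check in as a callable keeps the sketch independent of the registry import, so it can be exercised with a plain set of format names.
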
#### Validation 3.6: Document has invalid language

**Check**: `len(doc["language"]) != 2`

**Decision**: Use validated `currentUserLanguage` (always valid, validated during user intention analysis)

**Rationale**: `currentUserLanguage` is validated during user intention analysis and is always a valid 2-character ISO 639-1 code. Safe to use directly.

#### Validation 3.7: Document missing 'chapters' field

**Check**: `"chapters" not in doc`

**Decision**: Stop with error (cannot auto-fix - document structure invalid)

**Rationale**: Document without chapters is structurally invalid. Cannot proceed.

#### Validation 3.8: Chapter missing 'contentParts' field

**Check**: `"contentParts" not in chapter`

**Decision**: Stop with error (cannot auto-fix - chapter structure invalid)

**Rationale**: Chapter without contentParts field is structurally invalid. Cannot proceed.

---

### State 4: After Structure Filling

#### Validation 4.1: Filled structure missing 'documents' field

**Check**: `"documents" not in filledStructure`

**Decision**: Stop with error (cannot auto-fix - structure is invalid)

**Rationale**: Structure without documents field is fundamentally broken. Cannot proceed.

#### Validation 4.2: Section missing 'elements' field

**Check**: `"elements" not in section`

**Decision**: Create empty elements list: `section["elements"] = []`

**Rationale**: Section can be intentionally empty. Create empty list to maintain structure.

#### Validation 4.3: Section has empty elements list

**Check**: `not section["elements"]` (empty list)

**Decision**: Allow empty elements (section might be intentionally empty)

**Rationale**: Empty sections are valid (e.g., placeholder sections). No action needed.

#### Validation 4.4: Document missing 'language' field in filled structure

**Check**: `"language" not in doc` (in filledStructure)

**Decision**: Stop with error (language MUST be preserved from Phase 3)

**Rationale**: Language is validated and set in Phase 3 (State 3). If missing in filled structure, it's a critical error - language must be preserved.

#### Validation 4.5: Document has invalid language format in filled structure

**Check**: `not isinstance(doc["language"], str) or len(doc["language"]) != 2`

**Decision**: Stop with error (language format MUST be valid)

**Rationale**: Language format is validated in Phase 3 (State 3). If invalid in filled structure, it's a critical error.

---

### State 5: After Document Rendering

#### Validation 5.1: No documents rendered

**Check**: `len(renderedDocuments) == 0`

**Decision**: Stop with error (already implemented in `documentPath.py` line 176)

**Rationale**: Cannot return empty result. Error already implemented.

#### Validation 5.2: Rendered document has empty documentData

**Check**: `not doc.documentData`

**Decision**: Skip this document (remove from list)

**Rationale**: Empty document is not useful. Skip it rather than fail entire operation.

#### Validation 5.3: Rendered document missing mimeType

**Check**: `not doc.mimeType`

**Decision**: Infer mimeType from filename extension

**Rationale**: mimeType can be inferred from filename. Use utility function to detect.
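
As a minimal sketch of the State 5 decisions, the three checks could be combined into one pass over the rendered documents. `SketchRenderedDocument` is a hypothetical stand-in for the real rendered document type, the standard-library `mimetypes` module stands in for the unspecified utility function, and the `application/octet-stream` fallback for unknown extensions is an assumption that the plan does not state.

```python
import mimetypes
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchRenderedDocument:
    """Hypothetical stand-in for a rendered document (only the fields checked here)."""
    fileName: str
    documentData: bytes
    mimeType: Optional[str] = None


def validateRenderedDocuments(
    renderedDocuments: List[SketchRenderedDocument],
) -> List[SketchRenderedDocument]:
    # Validation 5.1: nothing rendered is a hard error.
    if len(renderedDocuments) == 0:
        raise ValueError("No documents rendered")

    kept = []
    for doc in renderedDocuments:
        # Validation 5.2: skip documents with empty data instead of failing.
        if not doc.documentData:
            continue
        # Validation 5.3: infer a missing mimeType from the filename extension.
        if not doc.mimeType:
            guessed, _encoding = mimetypes.guess_type(doc.fileName)
            # Assumed fallback for unknown extensions (not specified in the plan).
            doc.mimeType = guessed or "application/octet-stream"
        kept.append(doc)
    return kept
```

The plan does not say what should happen if every document is skipped by Validation 5.2, so the sketch returns an empty list in that case rather than guessing.
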