Implementation Plan: Content Handling Architecture Migration
Overview
This document provides a detailed implementation plan for migrating to the target architecture for content extraction and document generation. The plan focuses on:
- Documents and Content Handling: Intelligent merging of documentList and contentParts with deduplication
- Output Document Formats: Per-document format determination (not global). The AI determines formats from the user prompt, and multiple documents can have different formats
- Languages Handling: Per-document language determination (not global), using the validated currentUserLanguage infrastructure
- Clear Handover States: Defined validation at each phase boundary using existing infrastructure
- Structure Filling: Two prompt types (with content vs. without content)
Verified Infrastructure (Ready to Use)
The following infrastructure already exists and can be reused:
- ✅ Language Validation: currentUserLanguage is validated at workflowManager.py:695-727 and is always a valid 2-character ISO code (validates AI response, falls back to user language, then "en"). Safe to use via self.services.currentUserLanguage or the _getUserLanguage() method.
- ✅ Format Validation: Renderer registry exists at mainServiceGeneration.py:529 (_getFormatRenderer() uses getRenderer()). Can be imported: from modules.services.serviceGeneration.renderers.registry import getRenderer. Returns None if format invalid, falls back to text renderer.
- ✅ Language Extraction: _getDocumentLanguage() works correctly at subStructureFilling.py:349 and extracts the per-document language from the structure. Used properly during section generation.
Context
This implementation plan is based on the analysis documented in:
gateway/modules/services/serviceAi/CONTENT_EXTRACTION_ANALYSIS.md (Section 9.3: Target State)
The target architecture addresses architectural issues identified in the current implementation:
- Single extraction path in AI service (no duplication in ai.process)
- Intelligent merging of contentParts and documentList with deduplication
- Clear separation of concerns: action layer delegates to service layer
- Consistent behavior across all code paths
- Per-document format/language determination (not global)
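The merge-with-deduplication goal can be sketched as follows. This is a minimal illustration under stated assumptions, not the service implementation: the ContentPart dataclass is a simplified stand-in for the real model, mergeContentParts is a hypothetical helper name, and deduplicating by ContentPart id with the first occurrence winning is an assumption (the plan does not fix the deduplication key).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ContentPart:
    """Simplified stand-in for the real ContentPart model (assumption)."""
    id: str
    data: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def mergeContentParts(*sources: List[ContentPart]) -> List[ContentPart]:
    """Merge several ContentPart lists, keeping the first occurrence of each id."""
    seen = set()
    merged: List[ContentPart] = []
    for source in sources:
        for part in source:
            if part.id in seen:
                continue  # Duplicate: an earlier source already supplied this part
            seen.add(part.id)
            merged.append(part)
    return merged
```

Passing the sources in priority order (for example pre-extracted parts first, then freshly extracted parts, then provided contentParts) decides which copy of a duplicate survives.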
1. Overview: Major Phases and Handover States
Phase Flow Diagram
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 1: Document Intent Clarification │
│ ────────────────────────────────────────────────────────────────── │
│ INPUT: │
│ - userPrompt: str (fenced) │
│ - documentList: DocumentReferenceList (optional) │
│ - contentParts: List[ContentPart] (optional) │
│ - actionParameters: Dict (outputFormat, language, etc.) │
│ │
│ THROUGHPUT: │
│ 1. Resolve documents from documentList │
│ 2. Identify pre-extracted JSON documents │
│ - Check if JSON contains ContentExtracted structure │
│ - Map pre-extracted JSONs to original documents │
│ 3. Filter out original documents covered by pre-extracted │
│ 4. AI analyzes document purposes │
│ 5. Map intents back to JSON doc IDs (if applicable) │
│ │
│ OUTPUT: │
│ - documentIntents: List[DocumentIntent] │
│ * documentId: str │
│ * intents: List[str] (["extract", "render", "reference"]) │
│ * extractionPrompt: str (optional) │
│ * reasoning: str │
│ Note: outputFormat and language are NOT determined here - │
│ they're determined in Phase 3 (Structure Generation) │
│ │
│ HANDOVER STATE: │
│ - documentIntents: Complete intent analysis │
│ - documents: Resolved ChatDocuments │
│ - preExtractedMapping: Map[originalDocId, jsonDocId] │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 2: Content Extraction and Preparation │
│ ────────────────────────────────────────────────────────────────── │
│ INPUT: │
│ - documents: List[ChatDocument] │
│ - documentIntents: List[DocumentIntent] │
│ - contentParts: List[ContentPart] (optional, pre-extracted) │
│ - preExtractedMapping: Map[originalDocId, jsonDocId] │
│ │
│ THROUGHPUT: │
│ 1. Process pre-extracted JSON documents → ContentParts │
│ - Extract ContentParts from JSON (not treat as regular JSON) │
│ - Apply intents (extract, render, reference) │
│ - Mark with isPreExtracted=True │
│ 2. RAW extraction (NO AI) for regular documents │
│ - Extract content using extraction service │
│ - Create ContentParts with metadata │
│ 3. Merge all ContentParts │
│ - Pre-extracted parts (from JSON documents) │
│ - Extracted parts (from regular documents) │
│ - Provided parts (from contentParts parameter) │
│ 4. Apply intents to ContentParts (extract, render, reference) │
│ 5. Mark images for Vision AI extraction (deferred) │
│ │
│ OUTPUT: │
│ - finalContentParts: List[ContentPart] │
│ * id: str │
│ * typeGroup: str │
│ * mimeType: str │
│ * data: Union[str, bytes] │
│ * metadata: Dict │
│ - documentId: str │
│ - contentFormat: str ("extracted", "object", "reference") │
│ - intent: str │
│ - needsVisionExtraction: bool (for images) │
│ - extractionPrompt: str (for Vision AI) │
│ - originalFileName: str │
│ - isPreExtracted: bool │
│ Note: outputFormat and language are NOT propagated here - │
│ they're determined in Phase 3 (Structure Generation) │
│ │
│ HANDOVER STATE: │
│ - finalContentParts: Complete merged list │
│ - All pre-extracted JSON documents processed → ContentParts │
│ - All regular documents extracted → ContentParts │
│ - All provided contentParts merged │
│ - All documents processed (extracted or pre-extracted) │
│ - Vision AI extraction deferred to Phase 4 │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 3: Structure Generation │
│ ────────────────────────────────────────────────────────────────── │
│ INPUT: │
│ - userPrompt: str │
│ - finalContentParts: List[ContentPart] │
│ - outputFormat: Optional[str] (optional fallback, defaults to "txt") │
│ - currentUserLanguage: str (always valid, validated during user intention analysis) │
│ * From: self.services.currentUserLanguage (always valid, validated during user intention analysis) │
│ │
│ THROUGHPUT: │
│ 1. Group ContentParts by documentId (for context) │
│ 2. AI generates structure with documents and chapters │
│ 3. AI determines per-document outputFormat in structure JSON │
│ from user prompt → else optional outputFormat fallback (or "txt") │
│ 4. AI determines per-document language in structure JSON │
│ from user prompt → else validated currentUserLanguage (always valid) │
│ 5. Assign ContentParts to chapters │
│ │
│ OUTPUT: │
│ - chapterStructure: Dict │
│ * documents: List[Dict] │
│ - id: str │
│ - title: str │
│ - outputFormat: str (per-document) ← NEW │
│ - language: str (per-document) ← NEW │
│ - chapters: List[Dict] │
│ * id: str │
│ * level: int │
│ * title: str │
│ * generationHint: str │
│ * contentParts: List[str] (ContentPart IDs) │
│ │
│ HANDOVER STATE: │
│ - chapterStructure: Complete structure with ContentPart │
│ assignments │
│ - Per-document format/language determined │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 4: Structure Filling │
│ ────────────────────────────────────────────────────────────────── │
│ INPUT: │
│ - chapterStructure: Dict (with per-document language from Phase 3)│
│ - finalContentParts: List[ContentPart] │
│ - userPrompt: str │
│ │
│ THROUGHPUT: │
│ For each document (with per-document language): │
│ For each chapter: │
│ 1. Generate sections structure (parallel) │
│ 2. For each section: │
│ a. Extract per-document language from structure │
│ b. Check if ContentParts need Vision AI extraction │
│ c. If yes: Call Vision AI (Phase 2 deferred extraction) │
│ d. Determine prompt type: │
│ - WITH CONTENT: If contentParts assigned │
│ → Use aggregation prompt (isAggregation=True) │
│ → ContentParts passed as parameters │
│ → Use per-document language for generation │
│ - WITHOUT CONTENT: If no contentParts │
│ → Use generation prompt (isAggregation=False) │
│ → Only generationHint in prompt │
│ → Use per-document language for generation │
│ e. Generate section content with AI │
│ │
│ OUTPUT: │
│ - filledStructure: Dict │
│ * documents: List[Dict] │
│ - language: str (preserved from input structure, per-document)│
│ - chapters: List[Dict] │
│ * sections: List[Dict] │
│ - id: str │
│ - content_type: str │
│ - elements: List[Dict] │
│ * type: str │
│ * content: str (or base64 for images) │
│ │
│ HANDOVER STATE: │
│ - filledStructure: Complete content, ready for rendering │
│ - Per-document language preserved from structure │
│ - All Vision AI extractions completed │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 5: Document Rendering │
│ ────────────────────────────────────────────────────────────────── │
│ INPUT: │
│ - filledStructure: Dict │
│ - per-document outputFormat (from Phase 3, determined from prompt) │
│ - per-document language (from Phase 3, validated currentUserLanguage) │
│ │
│ THROUGHPUT: │
│ 1. Group sections by document (from structure) │
│ 2. For each document: │
│ a. Use per-document outputFormat │
│ b. Use per-document language │
│ c. Render document in specified format │
│ │
│ OUTPUT: │
│ - renderedDocuments: List[DocumentData] │
│ * documentName: str │
│ * documentData: bytes │
│ * mimeType: str │
│ │
│ HANDOVER STATE: │
│ - renderedDocuments: Final output ready for user │
└─────────────────────────────────────────────────────────────────────┘
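Phase 1, step 3 (filtering out original documents that are covered by a pre-extracted JSON document) can be sketched as below. The function name filterCoveredDocuments is hypothetical, and plain document ID strings are used instead of ChatDocument objects to keep the sketch self-contained.

```python
from typing import Dict, List


def filterCoveredDocuments(documentIds: List[str], preExtractedMapping: Dict[str, str]) -> List[str]:
    """Drop original documents that a pre-extracted JSON document already covers.

    preExtractedMapping maps originalDocId -> jsonDocId, as in the handover state.
    """
    return [docId for docId in documentIds if docId not in preExtractedMapping]


# doc_1 is covered by json_1, so only doc_2 and the JSON document remain
remaining = filterCoveredDocuments(["doc_1", "doc_2", "json_1"], {"doc_1": "json_1"})
```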
2. Detailed Implementation Steps
Step 1: Update DocumentIntent Model
File: gateway/modules/datamodels/datamodelExtraction.py
Changes:
class DocumentIntent(BaseModel):
    documentId: str
    intents: List[str]  # ["extract", "render", "reference"]
    extractionPrompt: Optional[str] = None
    # Note: outputFormat and language are NOT here - determined during
    # structure generation (Phase 3) in the chapter structure JSON
    reasoning: str
Rationale:
- Intent clarification focuses on document purpose (extract, render, reference)
- Output format and language are determined later during structure generation (Phase 3)
- Structure generation has full context (user prompt, ContentParts, chapters) to determine format/language
Step 2: Update Intent Analysis Prompt
File: gateway/modules/services/serviceAi/subDocumentIntents.py
Changes:
- Add fencing around userPrompt (Security Fix):
def _buildIntentAnalysisPrompt(
    self,
    userPrompt: str,
    documents: List[ChatDocument],
    actionParameters: Dict[str, Any]
) -> str:
    # FENCE user input to reduce the risk of prompt injection
    fencedUserPrompt = f"""```user_request
{userPrompt}
```"""
    # Global fallback only - per-document formats are determined in Phase 3
    outputFormat = actionParameters.get("outputFormat") or actionParameters.get("resultType") or "txt"
    # ... existing logic that builds docListText from documents ...
    prompt = f"""USER REQUEST:
{fencedUserPrompt}
DOCUMENTS TO ANALYZE:
{docListText}
TASK: For each document, determine:
1. Intents (can be multiple): "extract", "render", "reference"
Note: Output format and language are NOT determined here - they will be
determined during structure generation (Phase 3) in the chapter structure JSON
OUTPUT FORMAT: {outputFormat} (global fallback - for reference only)
RETURN JSON:
{{
  "intents": [
    {{
      "documentId": "doc_1",
      "intents": ["extract"],
      "extractionPrompt": "Extract all text content",
      "reasoning": "..."
    }}
  ]
}}
"""
    return prompt
- Remove global outputFormat from prompt (keep as fallback):
- Output format should be determined per document based on intent
- Global format remains as fallback if not specified per document
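One weakness of a fixed three-backtick fence is that a user prompt containing its own backtick fence can close it early. A possible hardening, sketched here as an assumption rather than as part of the planned change, is to choose a fence longer than any backtick run in the input. The helper name fenceUserPrompt is hypothetical. Fencing lowers the risk of prompt injection but does not remove it.

```python
import re


def fenceUserPrompt(userPrompt: str, label: str = "user_request") -> str:
    """Wrap user text in a backtick fence longer than any backtick run it contains."""
    # Longest consecutive run of backticks inside the user text (0 if none)
    longestRun = max((len(run) for run in re.findall(r"`+", userPrompt)), default=0)
    # Never shorter than the usual three backticks
    fence = "`" * max(3, longestRun + 1)
    return f"{fence}{label}\n{userPrompt}\n{fence}"
```

The result can be used wherever fencedUserPrompt is interpolated into the intent analysis prompt.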
Step 3: Update ContentPart Metadata Propagation
File: gateway/modules/services/serviceAi/subContentExtraction.py
Changes:
async def extractAndPrepareContent(
    self,
    documents: List[ChatDocument],
    documentIntents: List[DocumentIntent],
    parentOperationId: str,
    getIntentForDocument: Callable[..., Any]  # typing.Callable; the builtin callable() is a function, not a type
) -> List[ContentPart]:
    # ... existing extraction logic ...
    # Note: outputFormat and language are NOT propagated here - they're determined
    # during structure generation (Phase 3) in the chapter structure JSON
    # ContentParts are created with intent information only
Rationale:
- ContentParts carry intent and extraction information only
- Output format and language are determined during structure generation (Phase 3)
- Structure generation has full context to make format/language decisions
Step 4: Update Structure Generation
File: gateway/modules/services/serviceAi/subStructureGeneration.py
Global Format Source Chain
Note: outputFormat parameter is optional. If omitted, formats are determined from user prompt by AI.
If outputFormat provided:
- Action parameters: action_parameters.get("outputFormat") or action_parameters.get("resultType")
- Passed to callAiContent(outputFormat=...) → generateStructure(outputFormat=...) as parameter
- Used as fallback in State 3 validation if AI doesn't return format per document
- Final fallback: "txt" if global format is also missing/invalid
If outputFormat omitted:
- AI determines formats per document from user prompt
- Validation fallback: "txt" (if AI doesn't return format per document)
Rationale: With per-document format determination, AI can determine different formats for different documents based on user prompt. The outputFormat parameter is primarily a fallback for validation, not a requirement.
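The format source chain can be sketched as a small pure function. This is an illustration under assumptions: SUPPORTED_FORMATS is a hypothetical stand-in for the renderer registry lookup, and resolveOutputFormat is not an existing method. It mirrors the State 3 rule that an AI-returned format without a renderer falls back directly to "txt".

```python
from typing import Any, Dict, Optional, Set

# Hypothetical stand-in for the renderer registry (the real check uses getRenderer())
SUPPORTED_FORMATS: Set[str] = {"txt", "pdf", "docx", "html", "xlsx", "pptx"}


def _normalizeFormat(value: Any) -> Optional[str]:
    """Return the normalized format name, or None if it has no renderer."""
    if not isinstance(value, str):
        return None
    name = value.strip().lstrip(".").lower()
    return name if name in SUPPORTED_FORMATS else None


def resolveOutputFormat(doc: Dict[str, Any], outputFormat: Optional[str] = None) -> str:
    """Per-document format first, then the global fallback, then "txt"."""
    rawFormat = doc.get("outputFormat")
    if rawFormat:
        # AI returned a format: keep it if supported, otherwise use "txt"
        return _normalizeFormat(rawFormat) or "txt"
    # AI returned nothing: use the global fallback if it is valid, else "txt"
    return _normalizeFormat(outputFormat) or "txt"
```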
Language Source Chain
Note: currentUserLanguage is always valid (validated during user intention analysis).
- AI determines per-document language in structure JSON response
- If AI doesn't return language: Use validated currentUserLanguage (always valid, validated during user intention analysis)
- currentUserLanguage validation ensures:
  - AI response detectedLanguage is validated (2-character ISO code)
  - If AI didn't return language or invalid → uses user language (self.services.user.language)
  - If user language not set → uses "en"
  - Always safe to use directly without fallback logic
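The fallback order described for currentUserLanguage can be sketched as follows. This is not the code at workflowManager.py:695-727; the function names are hypothetical, and the check only verifies the shape of a 2-character code, not membership in the ISO 639-1 list.

```python
from typing import Optional


def isValidLanguageCode(value: object) -> bool:
    """Shape check only: exactly two ASCII letters."""
    return isinstance(value, str) and len(value) == 2 and value.isascii() and value.isalpha()


def resolveUserLanguage(detectedLanguage: Optional[str], userLanguage: Optional[str]) -> str:
    """AI-detected language first, then the user's configured language, then "en"."""
    for candidate in (detectedLanguage, userLanguage):
        if isinstance(candidate, str):
            normalized = candidate.strip().lower()
            if isValidLanguageCode(normalized):
                return normalized
    return "en"
```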
Changes:
- Make outputFormat optional in generateStructure method signature:
async def generateStructure(
    self,
    userPrompt: str,
    contentParts: List[ContentPart],
    outputFormat: Optional[str] = None,  # ← Optional: if omitted, formats determined from prompt by AI
    *,
    parentOperationId: str  # Keyword-only: a positional parameter without a default cannot follow one with a default
) -> Dict[str, Any]:
    """
    Generate document structure with per-document format determination.
    Multiple documents can be produced with different formats (e.g., one PDF, one HTML).
    AI determines formats per-document from user prompt. The outputFormat parameter is
    only a validation fallback - used if AI doesn't return format per document.

    Args:
        outputFormat: Optional global format fallback. If omitted, formats are determined
            from user prompt by AI. Used as validation fallback if AI doesn't
            return format per document. Defaults to "txt" if not provided.
    """
    # If outputFormat not provided, use "txt" as fallback for validation
    # AI will determine formats per document from user prompt
    if not outputFormat:
        outputFormat = "txt"
        logger.debug("outputFormat not provided - using 'txt' as validation fallback, formats determined from prompt")
    # Group ContentParts by documentId (for context in prompt)
    partsByDocument = {}
    for part in contentParts:
        docId = part.metadata.get("documentId", "default")
        partsByDocument.setdefault(docId, []).append(part)
    # AI determines per-document format and language in structure JSON response
    # Pass global fallback for AI to use if not specified per document
    prompt = self._buildChapterStructurePrompt(
        userPrompt=userPrompt,
        contentParts=contentParts,
        outputFormat=outputFormat  # Fallback for validation (AI determines formats from prompt)
    )
Note:
- outputFormat is optional. If omitted, formats are determined from user prompt by AI.
- Used as validation fallback if AI doesn't return format per document.
- User prompt language comes from self.services.currentUserLanguage, which is validated during user intention analysis (workflowManager._sendFirstMessage()). The validation ensures:
  - AI response detectedLanguage is validated (2-character ISO code)
  - If AI didn't return language or invalid → uses user language (self.services.user.language)
  - If user language not set → uses "en"
- currentUserLanguage is always valid and safe to use directly without fallback logic
- Update prompt to clarify format determination from prompt:
def _buildChapterStructurePrompt(
self,
userPrompt: str,
contentParts: List[ContentPart],
outputFormat: str # Global fallback (for validation only)
) -> str:
# Get language from services (validated currentUserLanguage infrastructure)
language = self._getUserLanguage() # Uses self.services.currentUserLanguage (always valid)
# ... existing prompt building ...
prompt += f"""
## OUTPUT FORMAT (per document)
- Each document can have its own output format (pdf, docx, html, etc.)
- **Determine the format for each document from the USER REQUEST above**
- Multiple documents can have different formats (e.g., one PDF, one HTML)
- Analyze user prompt to identify format requirements:
* Explicit format mentions (e.g., "as PDF", "in Excel", "HTML document")
* Document purpose (e.g., "spreadsheet" → xlsx, "presentation" → pptx)
* Content type requirements
- If format cannot be determined from prompt, use fallback: "{outputFormat}" (for validation only)
- Include "outputFormat" field in each document in the JSON structure
- **CRITICAL**: Formats are determined from user prompt, not from the fallback value
## DOCUMENT LANGUAGE (per document)
- Each document can have its own language (ISO 639-1 code: "de", "en", "fr", etc.)
- Determine the language for each document based on:
* User prompt language/context
* Document content context
* User's explicit language requirements
- If not specified, use validated currentUserLanguage: "{language}" (always valid, validated during user intention analysis)
- Include "language" field in each document in the JSON structure
EXAMPLE JSON STRUCTURE:
{{
"documents": [
{{
"id": "doc_1",
"title": "Document Title",
"outputFormat": "pdf", // ← Determined by AI from user prompt
"language": "de", // ← Determined by AI from user prompt
"chapters": [...]
}},
{{
"id": "doc_2",
"title": "Another Document",
"outputFormat": "html", // ← Different format for different document
"language": "en", // ← Different language for different document
"chapters": [...]
}}
]
}}
"""
Step 5: Update Structure Filling - Two Prompt Types
File: gateway/modules/services/serviceAi/subStructureFilling.py
Changes:
- Ensure two prompt types are used (already implemented, verify):
async def _fillSingleSection(
self,
section: Dict[str, Any],
contentParts: List[ContentPart],
userPrompt: str,
generationHint: str,
document: Dict[str, Any], # ← NEW: Need document to get per-document language
# ... other params ...
) -> List[Dict[str, Any]]:
# Extract per-document language from structure
# Language MUST be defined in structure (validated in State 3)
# If missing, this is an error - should not happen after State 3 validation
if "language" not in document:
raise ValueError(f"Document {document.get('id')} missing 'language' field - should have been set in Phase 3 validation")
docLanguage = document["language"]
# Validate language format (should be 2-character ISO code)
if not isinstance(docLanguage, str) or len(docLanguage) != 2:
raise ValueError(f"Document {document.get('id')} has invalid language format: {docLanguage} - should be 2-character ISO 639-1 code")
contentPartIds = section.get("contentPartIds", [])
hasContentParts = len(contentPartIds) > 0
if hasContentParts:
# PROMPT TYPE 1: WITH CONTENT (Aggregation)
# ContentParts passed as parameters, not in prompt text
isAggregation = True
relevantParts = [p for p in contentParts if p.id in contentPartIds]
generationPrompt = self._buildSectionGenerationPrompt(
section=section,
contentParts=relevantParts, # Passed as parameters
userPrompt=userPrompt,
generationHint=generationHint,
isAggregation=True, # ← Key flag
language=docLanguage # ← Per-document language from structure
)
else:
# PROMPT TYPE 2: WITHOUT CONTENT (Generation)
# Only generationHint in prompt, no ContentParts
isAggregation = False
generationPrompt = self._buildSectionGenerationPrompt(
section=section,
contentParts=[], # Empty
userPrompt=userPrompt,
generationHint=generationHint,
isAggregation=False, # ← Key flag
language=docLanguage # ← Per-document language from structure
)
Note: Language comes from the document in the structure (per-document), not a global parameter. Each document can have its own language as determined in Phase 3. The language MUST be defined and validated in Phase 3 (State 3 validation) - if missing here, it's an error.
- Verify _buildSectionGenerationPrompt handles both cases:
def _buildSectionGenerationPrompt(
self,
section: Dict[str, Any],
contentParts: List[ContentPart],
userPrompt: str,
generationHint: str,
isAggregation: bool, # ← Determines prompt type
language: str
) -> str:
if isAggregation:
# TYPE 1: WITH CONTENT
# ContentParts are passed as parameters to AI call
# Don't include full content in prompt text (token efficiency)
prompt = f"""Generate content for section based on provided ContentParts.
Section: {sectionTitle}
Generation Hint: {generationHint}
Language: {language}
ContentParts are provided as parameters (not shown in prompt for efficiency).
Use the ContentParts data to generate the section content.
"""
else:
# TYPE 2: WITHOUT CONTENT
# Only generationHint, no ContentParts
prompt = f"""Generate content for section based on generation hint.
Section: {sectionTitle}
Generation Hint: {generationHint}
Language: {language}
Generate content based on the generation hint without referencing external content.
"""
Rationale:
- Type 1 (with content): Efficient for large content (ContentParts as parameters)
- Type 2 (without content): Simple generation based on hint only
- Already implemented via isAggregation flag; verify it's used correctly
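The choice between the two prompt types can be isolated in a small helper, which makes the "verify it's used correctly" step easy to unit test. The sketch below is hypothetical (selectPromptType is not an existing method). Unlike the snippet in this step, it only counts contentPartIds that resolve to an available ContentPart, which is an assumption about the desired behavior for dangling IDs.

```python
from typing import Any, Dict, List, Tuple


def selectPromptType(section: Dict[str, Any], availablePartIds: List[str]) -> Tuple[bool, List[str]]:
    """Return (isAggregation, assignedPartIds) for one section.

    isAggregation=True  -> Type 1: WITH CONTENT (aggregation prompt)
    isAggregation=False -> Type 2: WITHOUT CONTENT (generation prompt)
    """
    available = set(availablePartIds)
    # Keep only the IDs that actually resolve to a known ContentPart
    assigned = [partId for partId in section.get("contentPartIds", []) if partId in available]
    return len(assigned) > 0, assigned
```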
Step 6: Update Document Rendering
File: gateway/modules/services/serviceAi/mainServiceAi.py (renderResult method)
File: gateway/modules/services/serviceGeneration/mainServiceGeneration.py (renderReport method)
Current Implementation:
- renderResult() calls generationService.renderReport()
- renderReport() already processes each document separately (line 385)
- Currently checks doc.get("format", outputFormat) (line 397) - but should check outputFormat field
- Language is not handled per-document
Changes:
- Update renderResult to pass language (from structure, validated before rendering):
async def renderResult(
self,
filledStructure: Dict[str, Any],
outputFormat: str, # Global fallback
language: str, # ← NEW: Add language parameter (global fallback)
title: str,
userPrompt: str,
parentOperationId: str
) -> List[RenderedDocument]:
"""
Render filled structure to documents.
Per-document format and language are extracted from structure (validated in State 3).
The outputFormat and language parameters are only used as global fallbacks.
Multiple documents can have different formats and languages.
"""
# Language comes from structure (per-document), validated in State 3
# This parameter is only used as global fallback if structure validation fails
# Use validated currentUserLanguage as fallback (always valid)
if not language:
language = self._getUserLanguage() # Uses validated currentUserLanguage infrastructure
# ... existing code ...
renderedDocuments = await generationService.renderReport(
filledStructure,
outputFormat,
language, # ← Pass language (global fallback, per-document extracted in renderReport)
title,
userPrompt,
self,
parentOperationId=renderOperationId
)
Note:
- Language comes from structure (per-document) as determined in Phase 3
- The language parameter here is only used as a global fallback
- Per-document language is validated in State 3 (Structure Generation) and extracted from structure in renderReport()
- Uses validated currentUserLanguage infrastructure if fallback needed
- Update renderReport to handle per-document format and language:
async def renderReport(
self,
extractedContent: Dict[str, Any],
outputFormat: str, # Global fallback
language: str, # ← NEW: Add language parameter (global fallback)
title: str,
userPrompt: Optional[str] = None,
aiService=None,
parentOperationId: Optional[str] = None
) -> List[RenderedDocument]:
# ... existing validation ...
# Process EACH document separately
for docIndex, doc in enumerate(documents):
# ... existing validation ...
# Determine format for this document
# Check outputFormat field first (per-document), then format field (legacy), then global fallback
docFormat = doc.get("outputFormat") or doc.get("format") or outputFormat
# Determine language for this document
# Extract per-document language from structure (validated in State 3), fallback to global
docLanguage = doc.get("language") or language
# Validate language format (should be 2-character ISO code, validated in State 3)
if not isinstance(docLanguage, str) or len(docLanguage) != 2:
logger.warning(f"Document {doc.get('id')} has invalid language format: {docLanguage}, using fallback")
docLanguage = language # Use global fallback
# Get renderer for this document's format (uses existing renderer registry)
renderer = self._getFormatRenderer(docFormat)
if not renderer:
logger.warning(f"Unsupported format '{docFormat}' for document {doc.get('id', docIndex)}, skipping")
continue
# Create JSON structure with single document (preserving metadata)
singleDocContent = {
"metadata": {**metadata, "language": docLanguage}, # ← Add per-document language to metadata
"documents": [doc]
}
# Render this document (can return multiple files, e.g., HTML + images)
renderedDocs = await renderer.render(singleDocContent, docTitle, userPrompt, aiService)
allRenderedDocuments.extend(renderedDocs)
Note:
- Per-document format and language are extracted from structure (validated in State 3)
- Renderers (RendererPdf, RendererHtml, etc.) receive the structure with language in metadata
- They can use it for language-specific formatting if needed
- Multiple documents can have different formats and languages
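The per-document lookup order used in renderReport() can be checked in isolation with a sketch like the one below. resolveRenderSettings is a hypothetical helper, and documents are plain dicts as in the structure JSON.

```python
from typing import Any, Dict, Tuple


def resolveRenderSettings(doc: Dict[str, Any], outputFormat: str, language: str) -> Tuple[str, str]:
    """Return (format, language) for one document, using the global values as fallbacks."""
    # Per-document outputFormat first, then the legacy "format" field, then the global fallback
    docFormat = doc.get("outputFormat") or doc.get("format") or outputFormat
    # Per-document language must be a 2-character string, otherwise use the global fallback
    docLanguage = doc.get("language")
    if not isinstance(docLanguage, str) or len(docLanguage) != 2:
        docLanguage = language
    return docFormat, docLanguage
```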
Step 7: Update ai.process to Pass documentList and Make outputFormat Optional
File: gateway/modules/workflows/methods/methodAi/actions/process.py
Changes:
# Phase 7.3: Pass both documentList and contentParts to AI service
# (Remove extraction logic from here - handled by AI service)
# resultType is optional - if omitted, formats determined from prompt by AI
# Default "txt" is validation fallback only
resultType = parameters.get("resultType")  # Optional: if None, formats determined from prompt
if resultType:
    # Normalize, e.g. ".PDF" → "pdf"; an empty result falls back to "txt"
    output_format = str(resultType).strip().lstrip('.').lower() or "txt"
else:
    # No format specified - AI will determine formats from prompt
    output_format = None
    logger.debug("resultType not provided - formats will be determined from prompt by AI")
# Use unified callAiContent method with BOTH parameters
aiResponse = await self.services.ai.callAiContent(
    prompt=aiPrompt,
    options=options,
    documentList=documentList,  # ← PASS documentList (was missing)
    contentParts=contentParts,  # ← PASS contentParts
    outputFormat=output_format,  # ← Optional: if None, formats determined from prompt
    parentOperationId=operationId,
    generationIntent=generationIntent
)
Note:
- resultType parameter is optional. If omitted, formats are determined from user prompt by AI.
- Default "txt" (if provided) is used as validation fallback only.
- Language detection from user prompt is already done and validated. self.services.currentUserLanguage is always valid (validated during user intention analysis in workflowManager._sendFirstMessage()).
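The resultType handling can be expressed as a standalone function so that the edge cases are explicit. normalizeResultType is a hypothetical name; the behavior follows the snippet in this step (missing or empty value gives None, a value that normalizes to an empty string gives "txt").

```python
from typing import Any, Optional


def normalizeResultType(resultType: Any) -> Optional[str]:
    """Normalize the optional resultType action parameter to a format name."""
    if not resultType:
        return None  # No format given: the AI determines formats from the prompt
    # ".PDF" -> "pdf"; a value that is only dots or whitespace falls back to "txt"
    return str(resultType).strip().lstrip(".").lower() or "txt"
```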
3. Handover State Definitions and Validation
Purpose: These state definitions document the expected structure and validation rules at each phase boundary.
Implementation Approach:
- Inline validation in each phase method
- Auto-fix where possible (use defaults, skip invalid items)
- Stop with error for critical structural issues
- Log warnings for skipped items
See: Appendix "Validation Failure Handling Decisions" below for detailed Q&A on each validation
Summary of Validation Decisions:
- State 1: Skip intents for unknown documents; documents without intents are OK
- State 2: Skip ContentParts with missing/invalid metadata (with warnings)
- State 3: Auto-fix format/language with fallbacks; error on missing structure fields
- State 4: Auto-fix missing elements field; allow empty elements
- State 5: Skip empty documents; infer mimeType from filename
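As an illustration of the State 5 decision (skip empty documents, infer mimeType from filename), the sketch below uses the standard library mimetypes module. It is an assumption-laden example: rendered documents are modeled as plain dicts instead of DocumentData objects, validateRenderedDocuments is a hypothetical name, and the "application/octet-stream" default for unknown extensions is not specified in the plan.

```python
import mimetypes
from typing import Any, Dict, List


def validateRenderedDocuments(renderedDocuments: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Skip empty documents and fill in a missing mimeType from the file name."""
    validated: List[Dict[str, Any]] = []
    for doc in renderedDocuments:
        if not doc.get("documentData"):
            continue  # Skip empty documents (a real implementation would log a warning)
        if not doc.get("mimeType"):
            guessedType, _ = mimetypes.guess_type(doc.get("documentName", ""))
            # Assumed default when the extension is unknown
            doc = {**doc, "mimeType": guessedType or "application/octet-stream"}
        validated.append(doc)
    return validated
```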
State 1: After Intent Clarification
Location: gateway/modules/services/serviceAi/subDocumentIntents.py - After clarifyDocumentIntents() returns (line 115)
Expected State:
documentIntents: List[DocumentIntent] # Complete intent analysis
documents: List[ChatDocument] # Resolved documents
preExtractedMapping: Dict[str, str] # Map[originalDocId, jsonDocId]
Implementation Code (add after line 115, before return):
# Validation and auto-fix
documentIds = {d.id for d in documents}
validatedIntents = []
for intent in documentIntents:
    # Validation 1.2: Skip intents for unknown documents
    if intent.documentId not in documentIds:
        logger.warning(f"Skipping intent for unknown document: {intent.documentId}")
        continue
    validatedIntents.append(intent)
# Validation 1.1: Documents without intents are OK (not needed)
# Intents for non-existing documents are already filtered above
documentIntents = validatedIntents
State 2: After Content Extraction
Location: gateway/modules/services/serviceAi/subContentExtraction.py - After extractAndPrepareContent() returns (at end of method, before return)
Expected State:
finalContentParts: List[ContentPart] # All content parts ready
Implementation Code (add at end of method, before return):
# Validation and auto-fix
validatedParts = []
for part in finalContentParts:
    # Validation 2.1: Skip ContentParts without documentId
    if not part.metadata.get("documentId"):
        logger.warning(f"Skipping ContentPart {part.id} - missing documentId in metadata")
        continue
    # Validation 2.2: Skip ContentParts with invalid contentFormat
    contentFormat = part.metadata.get("contentFormat")
    if contentFormat not in ["extracted", "object", "reference"]:
        logger.warning(
            f"Skipping ContentPart {part.id} - invalid contentFormat: {contentFormat}"
        )
        continue
    validatedParts.append(part)
return validatedParts
State 3: After Structure Generation
Location: gateway/modules/services/serviceAi/subStructureGeneration.py - After generateStructure() returns (after parsing JSON, before return, around line 182)
Expected State:
chapterStructure: Dict[str, Any] # Complete structure with documents, chapters, outputFormat, language
Implementation Code (add after structure JSON is parsed, before return):
# After structure JSON is parsed (around line 182)
# Validation and auto-fix
# Validation 3.1: Structure missing 'documents' field
if "documents" not in structure:
    raise ValueError("Structure missing 'documents' field - cannot auto-fix")
documents = structure["documents"]
# Validation 3.2: Structure has no documents
if not isinstance(documents, list) or len(documents) == 0:
    raise ValueError("Structure has no documents - cannot generate without documents")
# Import renderer registry for format validation (existing infrastructure)
from modules.services.serviceGeneration.renderers.registry import getRenderer
# Fallbacks are identical for every document, so compute them once
# outputFormat parameter is optional - if omitted, formats determined from prompt by AI
globalFormatFallback = outputFormat or "txt"  # Fallback for validation
# Validated currentUserLanguage (always valid, validated during user intention analysis)
userPromptLanguage = self._getUserLanguage()  # Uses self.services.currentUserLanguage
# Validate and fix each document
for doc in documents:
    # Validation 3.3 & 3.4: Document outputFormat
    # Use fallback only if AI doesn't return format per document
    # Multiple documents can have different formats (e.g., one PDF, one HTML)
    if "outputFormat" not in doc or not doc["outputFormat"]:
        # AI didn't return format or returned empty - use global fallback
        doc["outputFormat"] = globalFormatFallback
        logger.info(f"Document {doc.get('id')} missing outputFormat - using fallback: {doc['outputFormat']}")
    else:
        # AI returned format - validate using existing renderer registry
        formatName = str(doc["outputFormat"]).lower().strip()
        renderer = getRenderer(formatName)  # Uses existing infrastructure
        if not renderer:
            # Format doesn't match any renderer - use txt (simple approach)
            logger.warning(f"Document {doc.get('id')} has format without renderer: {formatName}, using 'txt'")
            doc["outputFormat"] = "txt"
        else:
            # Valid format with renderer - normalize and keep AI result
            doc["outputFormat"] = formatName
            logger.debug(f"Document {doc.get('id')} using AI-determined format: {formatName}")
    # Validation 3.5 & 3.6: Document language
    rawLanguage = doc.get("language")
    if not isinstance(rawLanguage, str) or len(rawLanguage.strip()) != 2:
        # AI didn't return language or invalid format - use validated currentUserLanguage
        # Log before overwriting so the message reflects what the AI actually returned
        if rawLanguage is None:
            logger.info(f"Document {doc.get('id')} missing language - using currentUserLanguage: {userPromptLanguage}")
        else:
            logger.warning(f"Document {doc.get('id')} has invalid language format from AI: {rawLanguage}, using currentUserLanguage")
        doc["language"] = userPromptLanguage
    else:
        # AI returned valid language format - normalize
        doc["language"] = rawLanguage.strip().lower()
        logger.debug(f"Document {doc.get('id')} using AI-determined language: {doc['language']}")
    # Validation 3.7: Document missing 'chapters' field
    if "chapters" not in doc:
        raise ValueError(f"Document {doc.get('id')} missing 'chapters' field - cannot auto-fix")
    # Validation 3.8: Chapter missing 'contentParts' field
    for chapter in doc["chapters"]:
        if "contentParts" not in chapter:
            raise ValueError(f"Chapter {chapter.get('id')} missing 'contentParts' field - cannot auto-fix")
return structure
State 4: After Structure Filling
Location: gateway/modules/services/serviceAi/subStructureFilling.py - At the end of fillStructure() (before the return statement, around line 204)
Expected State:
filledStructure: Dict[str, Any] # Complete content with elements
Implementation Code (add at end of method, before return):
# Validation and auto-fix
# Validation 4.1: Filled structure missing 'documents' field
if "documents" not in filledStructure:
    raise ValueError("Filled structure missing 'documents' field - cannot auto-fix")
for doc in filledStructure["documents"]:
    # Validation 4.4: Verify language is preserved from input structure
    # Language MUST be preserved from Phase 3 structure (validated in State 3)
    if "language" not in doc:
        raise ValueError(f"Document {doc.get('id')} missing language in filled structure - should have been preserved from Phase 3")
    # Validate language format
    if not isinstance(doc["language"], str) or len(doc["language"]) != 2:
        raise ValueError(f"Document {doc.get('id')} has invalid language format in filled structure: {doc['language']} - should be 2-character ISO 639-1 code")
    for chapter in doc.get("chapters", []):
        for section in chapter.get("sections", []):
            # Validation 4.2: Section missing 'elements' field
            if "elements" not in section:
                section["elements"] = []
                logger.info(f"Section {section.get('id')} missing 'elements' - created empty list")
            # Validation 4.3: Section has empty elements list - ALLOW (intentionally empty is OK)
            # No action needed - empty elements are allowed
return filledStructure
State 5: After Document Rendering
Location: gateway/modules/services/serviceGeneration/paths/documentPath.py - After renderResult() returns (call site at line 151; insert after line 157, before building documentDataList)
Expected State:
renderedDocuments: List[RenderedDocument] # Final output
Implementation Code (add after line 157, before building documentDataList):
# Validation 5.1: Already implemented at lines 175-176
if not renderedDocuments:
    raise ValueError("No documents were rendered")
# Validation 5.2 & 5.3: Validate and filter rendered documents
validatedRenderedDocs = []
for doc in renderedDocuments:
    # Validation 5.2: Skip documents with empty documentData
    if not doc.documentData:
        logger.warning(f"Skipping rendered document {doc.filename} - empty documentData")
        continue
    # Validation 5.3: Infer mimeType from filename if missing
    if not doc.mimeType:
        from modules.services.serviceGeneration.subDocumentUtility import getMimeTypeFromExtension
        if doc.filename:
            inferredMimeType = getMimeTypeFromExtension(doc.filename)
            if inferredMimeType:
                doc.mimeType = inferredMimeType
                logger.info(f"Inferred mimeType '{inferredMimeType}' from filename '{doc.filename}'")
            else:
                logger.warning(f"Could not infer mimeType from filename '{doc.filename}' - keeping as None")
        else:
            logger.warning("Rendered document missing mimeType and filename - cannot infer")
    validatedRenderedDocs.append(doc)
# Use validated list
renderedDocuments = validatedRenderedDocs
# Re-check after filtering
if not renderedDocuments:
    raise ValueError("No valid documents after validation")
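The State 5 checks can be exercised in isolation with a small sketch. The RenderedDocument dataclass below is a hypothetical stand-in that carries only the three fields the checks touch, and the standard library's mimetypes.guess_type is used in place of the project's getMimeTypeFromExtension helper, whose exact behavior is not shown in this plan.

```python
import mimetypes
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RenderedDocument:
    # Hypothetical stand-in: only the fields used by the State 5 checks
    filename: Optional[str]
    documentData: Optional[bytes]
    mimeType: Optional[str] = None


def validateRenderedDocuments(renderedDocuments: List[RenderedDocument]) -> List[RenderedDocument]:
    """Drop empty documents and infer a missing mimeType from the filename."""
    if not renderedDocuments:
        raise ValueError("No documents were rendered")
    validated = []
    for doc in renderedDocuments:
        if not doc.documentData:
            continue  # Validation 5.2: skip documents without data
        if not doc.mimeType and doc.filename:
            # ASSUMPTION: stdlib lookup stands in for getMimeTypeFromExtension
            guessedType, _encoding = mimetypes.guess_type(doc.filename)
            if guessedType:
                doc.mimeType = guessedType  # Validation 5.3
        validated.append(doc)
    if not validated:
        raise ValueError("No valid documents after validation")
    return validated
```

Keeping the checks in one pure function like this would make the "filtering removes all documents" case easy to cover in a unit test.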
4. Migration Checklist
Phase 1: Model Updates
- Verify DocumentIntent model does NOT include outputFormat or language
- Intent clarification focuses only on document purpose (intents, extractionPrompt)
- Note: outputFormat and language are determined during structure generation (Phase 3)
Phase 2: Intent Analysis Updates
- CRITICAL: Add fencing around userPrompt in intent analysis prompt
  - Fence user input with code blocks: user_request\n{userPrompt}\n
  - Test with various user inputs (special chars, JSON, newlines, prompt injection attempts)
- Update prompt to focus only on document intents (extract, render, reference)
- Remove any outputFormat/language determination from intent analysis prompt
- Keep global outputFormat/language as reference only (not for determination)
- Verify intent mapping logic (already implemented in clarifyDocumentIntents):
  - Step 1: Map pre-extracted JSONs to original documents (lines 63-83)
  - Step 2: AI analyzes intents for original documents (line 86)
  - Step 3: Map intents back to JSON doc IDs (lines 96-104)
  - Test with pre-extracted JSONs to verify mapping works correctly
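The fencing item can be sketched as a small helper. The function name fenceUserPrompt is hypothetical, and choosing a fence longer than any backtick run in the input is an assumption added here so that user text cannot close the fence early; the plan itself only specifies a fixed code fence labeled user_request.

```python
def fenceUserPrompt(userPrompt: str, label: str = "user_request") -> str:
    """Wrap untrusted user input in a labeled code fence.

    ASSUMPTION: the fence grows beyond the longest backtick run found in the
    input, so the input cannot terminate the fence itself.
    """
    longestRun = 0
    currentRun = 0
    for ch in userPrompt:
        currentRun = currentRun + 1 if ch == "`" else 0
        longestRun = max(longestRun, currentRun)
    fence = "`" * max(3, longestRun + 1)
    return f"{fence}{label}\n{userPrompt}\n{fence}"
```

For ordinary input this produces the three-backtick fence described in the checklist; only input that already contains a run of three or more backticks gets a longer fence.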
Phase 3: Content Extraction Updates
- Verify ContentParts do NOT include outputFormat or language in metadata
- ContentParts carry only intent and extraction information
- Verify pre-extracted JSON handling preserves intent information
- Add filtering to Data Extraction Path (_handleDataExtraction):
  Current State (BEFORE filtering):
    # Line 708: Get documents directly from documentList
    documents = self.services.chat.getChatDocumentsFromDocumentList(documentList)
    # Line 721: Call extractAndPrepareContent() with ALL documents
    preparedContentParts = await self.extractAndPrepareContent(documents, ...)
  Problem: If documentList contains both:
  - Original document: original_pdf_123.pdf
  - Pre-extracted JSON: pre_extracted_456.json (contains ContentParts from original_pdf_123.pdf)
  → Both are processed → DUPLICATE ContentParts created
  How Filtering Works (Reference: documentPath.py lines 62-87):
  Step 1: Identify Pre-Extracted JSONs and Map to Originals
    # Collect all original document IDs that are covered by pre-extracted JSONs
    originalDocIdsCoveredByPreExtracted = set()
    for doc in documents:
        preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc)
        if preExtracted:
            # Pre-extracted JSON found - get the original document ID it covers
            originalDocId = preExtracted["originalDocument"]["id"]
            originalDocIdsCoveredByPreExtracted.add(originalDocId)
  Result: originalDocIdsCoveredByPreExtracted = {"original_pdf_123"} (if pre-extracted JSON covers it)
  Step 2: Filter Documents List
    filteredDocuments = []
    for doc in documents:
        preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc)
        if preExtracted:
            # Pre-extracted JSON - KEEP IT (will be processed as ContentParts)
            filteredDocuments.append(doc)
        elif doc.id in originalDocIdsCoveredByPreExtracted:
            # Original document covered by pre-extracted JSON - REMOVE IT
            logger.info(f"Skipping original document {doc.id} - already covered")
            # Do NOT append - skip this document
        else:
            # Regular document (not pre-extracted, not covered) - KEEP IT
            filteredDocuments.append(doc)
    documents = filteredDocuments  # Use filtered list
  Result:
  - ✅ Pre-extracted JSON: pre_extracted_456.json → KEPT
  - ❌ Original document: original_pdf_123.pdf → REMOVED (covered by pre-extracted JSON)
  - ✅ Regular document: other_doc.pdf → KEPT (not covered)
  Step 3: Use Filtered Documents
    # Now call extractAndPrepareContent() with filtered documents only
    preparedContentParts = await self.extractAndPrepareContent(
        documents,  # Only pre-extracted JSONs + regular docs (no originals covered by JSONs)
        documentIntents or [],
        extractOperationId
    )
  Result: No duplicates - original documents already filtered out
  Implementation Steps:
  - Add filtering logic between line 708 (get documents) and line 710 (clarify intents)
  - Copy filtering code from documentPath.py lines 62-87
  - Adapt to use self.intentAnalyzer.resolvePreExtractedDocument() (same method)
  - Filtering Logic:
      # Step 1: Identify all original document IDs covered by pre-extracted JSONs
      originalDocIdsCoveredByPreExtracted = set()
      for doc in documents:
          preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc)
          if preExtracted:
              originalDocId = preExtracted["originalDocument"]["id"]
              originalDocIdsCoveredByPreExtracted.add(originalDocId)
              logger.debug(f"Found pre-extracted JSON {doc.id} covering original document {originalDocId}")
      # Step 2: Filter documents - remove originals covered by pre-extracted JSONs
      filteredDocuments = []
      for doc in documents:
          preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc)
          if preExtracted:
              filteredDocuments.append(doc)  # Keep pre-extracted JSON
          elif doc.id in originalDocIdsCoveredByPreExtracted:
              logger.info(f"Skipping original document {doc.id} ({doc.fileName}) - already covered by pre-extracted JSON")
          else:
              filteredDocuments.append(doc)  # Keep regular document
      documents = filteredDocuments  # Use filtered list
  - Test with scenario: original document + pre-extracted JSON → verify no duplicates
- Remove redundant check from extractAndPrepareContent():
  - Remove pre-extracted JSON check (line 77 in subContentExtraction.py)
  - Trust that filtering is done upstream
  - Cleaner code, single responsibility
- Test merging logic
- Test that both document generation and data extraction paths handle pre-extracted JSONs correctly
- Note: outputFormat and language are NOT propagated here - determined in structure generation
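The two-pass filter described in this phase could be factored into a standalone helper that is easy to unit-test. This is a minimal sketch: Doc is a hypothetical stand-in for ChatDocument, the resolver is injected as a callable in place of self.intentAnalyzer.resolvePreExtractedDocument, and each document is resolved only once (a small variation on the plan's code, which resolves twice).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Doc:
    # Hypothetical minimal stand-in for ChatDocument
    id: str
    fileName: str


def filterCoveredOriginals(
    documents: List[Doc],
    resolvePreExtracted: Callable[[Doc], Optional[Dict[str, Any]]],
) -> List[Doc]:
    """Remove originals that are already covered by a pre-extracted JSON."""
    resolved: Dict[str, Optional[Dict[str, Any]]] = {}
    coveredOriginalIds = set()
    # Pass 1: resolve every document once and collect covered original IDs
    for doc in documents:
        info = resolvePreExtracted(doc)
        resolved[doc.id] = info
        if info:
            coveredOriginalIds.add(info["originalDocument"]["id"])
    # Pass 2: keep pre-extracted JSONs and any document that is not covered
    return [
        doc for doc in documents
        if resolved[doc.id] or doc.id not in coveredOriginalIds
    ]


def stubResolver(doc: Doc) -> Optional[Dict[str, Any]]:
    # Illustration only: treats one known ID as a pre-extracted JSON
    if doc.id == "pre_extracted_456":
        return {"originalDocument": {"id": "original_pdf_123"}}
    return None


exampleDocuments = [
    Doc("original_pdf_123", "original_pdf_123.pdf"),
    Doc("pre_extracted_456", "pre_extracted_456.json"),
    Doc("other_doc", "other_doc.pdf"),
]
keptIds = [doc.id for doc in filterCoveredOriginals(exampleDocuments, stubResolver)]
```

Because the helper takes the resolver as a parameter, the same function could serve both the document generation path and the data extraction path instead of copying the block.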
Phase 4: Structure Generation Updates
- Make outputFormat optional in generateStructure() method signature:
  - Update subStructureGeneration.py method signature (line 47): outputFormat: Optional[str] = None
  - Update mainServiceAi.py wrapper method (line 444): Make outputFormat optional
  - If outputFormat not provided, use "txt" as validation fallback (AI determines formats from prompt)
  - Add logging: "outputFormat not provided - using 'txt' as validation fallback, formats determined from prompt"
  - Context: outputFormat is only a validation fallback - AI determines per-document formats from user prompt. Multiple documents can have different formats (e.g., one PDF, one HTML).
- Note on language handling: Language is accessed via self.services.currentUserLanguage (always valid, validated during user intention analysis). No language parameter needed in generateStructure() method signature - language is accessed directly from services within the method.
  - Verify currentUserLanguage is used correctly in subStructureGeneration.py (via self.services.currentUserLanguage)
  - Verify currentUserLanguage is used correctly in prompt building (via self.services.currentUserLanguage)
  - Note: mainServiceGeneration.py uses different service - verify if update needed
- Group ContentParts by documentId (for context in prompt)
- Update _buildChapterStructurePrompt() to access language via self.services.currentUserLanguage (no parameter needed)
- Update structure generation prompt to ask AI to determine per-document outputFormat
  - Explicitly require outputFormat field in each document JSON structure
  - Update example structure to show outputFormat field (not just filename)
  - Clarify that multiple documents can have different formats
- Update structure generation prompt to ask AI to determine per-document language
  - Explicitly require language field in each document JSON structure
  - Clarify that multiple documents can have different languages
- Provide global fallbacks (outputFormat, language) for AI to use if not specified
  - outputFormat fallback: from parameter or "txt"
  - language fallback: use self._getUserLanguage() (validated currentUserLanguage infrastructure)
- Parse and validate format/language from AI response:
  - Extract outputFormat and language from each document in structure JSON
  - Format validation (use existing renderer registry infrastructure):
    - Import: from modules.services.serviceGeneration.renderers.registry import getRenderer
    - If outputFormat missing or empty → use global fallback (outputFormat or "txt")
    - If outputFormat exists → check if it has a renderer using getRenderer(formatName) (existing infrastructure)
    - Normalize format name: formatName.lower().strip()
    - If format doesn't match any renderer → use "txt" (simple approach, no global fallback attempt)
    - Log warnings for invalid formats
    - Note: Infrastructure exists at mainServiceGeneration.py:529 - reuse getRenderer() function
  - Language validation (use existing validated infrastructure):
    - Validate language (must be 2-character ISO 639-1 code)
    - If language missing: Set to self._getUserLanguage() which uses validated currentUserLanguage (always valid, validated during user intention analysis at workflowManager.py:695-727)
    - If language invalid format: Use self._getUserLanguage() (always valid)
    - Normalize language: language.lower().strip()[:2]
    - Log warnings for invalid/missing values
    - Note: currentUserLanguage is always valid - safe to use directly via _getUserLanguage() method
- Error handling:
  - If structure JSON is malformed → raise error with details
  - If no documents in structure → raise error
  - If AI doesn't return format → use global outputFormat fallback (or "txt" if not provided), log warning
  - If AI doesn't return language → use validated currentUserLanguage (always valid), log warning
- Verify structure output includes per-document format and language (from AI in JSON response)
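The format and language rules in this checklist can be condensed into one pure function for testing. This is a sketch under stated assumptions: the function name is hypothetical, hasRenderer is an injected predicate standing in for a getRenderer(formatName) lookup, and userLanguage stands in for the value returned by self._getUserLanguage().

```python
from typing import Any, Callable, Dict, Optional


def normalizeDocumentFormatAndLanguage(
    doc: Dict[str, Any],
    hasRenderer: Callable[[str], bool],
    userLanguage: str,
    outputFormat: Optional[str] = None,
) -> Dict[str, Any]:
    """Apply the per-document fallback rules in place and return the document."""
    # Format: a missing value uses the global fallback, an unknown value uses "txt"
    rawFormat = doc.get("outputFormat")
    if not rawFormat:
        doc["outputFormat"] = outputFormat or "txt"
    else:
        formatName = str(rawFormat).lower().strip()
        doc["outputFormat"] = formatName if hasRenderer(formatName) else "txt"
    # Language: must be a 2-character string, otherwise use the validated user language
    rawLanguage = doc.get("language")
    if isinstance(rawLanguage, str) and len(rawLanguage) == 2:
        doc["language"] = rawLanguage.lower().strip()[:2]
    else:
        doc["language"] = userLanguage
    return doc


# Illustration only: a fixed set replaces the renderer registry
knownFormats = {"pdf", "html", "txt"}
exampleDoc = normalizeDocumentFormatAndLanguage(
    {"id": "doc_1", "outputFormat": " PDF ", "language": "DE"},
    lambda name: name in knownFormats,
    userLanguage="en",
)
```

Note that the global outputFormat fallback is applied only when the AI returns no format at all; an unrecognized format goes straight to "txt", matching the "no global fallback attempt" rule above.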
Phase 5: Structure Filling Verification
- Verify two prompt types are correctly used:
  - isAggregation=True: ContentParts as parameters
  - isAggregation=False: Only generationHint
- Verify per-document language is extracted and used:
  - Language MUST be defined in structure (validated in State 3)
  - Language extracted from document in structure (per-document) - NO fallback to "en"
  - If language missing: Raise error (should not happen after State 3 validation)
  - If language invalid format: Raise error (should not happen after State 3 validation)
  - Language passed to _buildSectionGenerationPrompt() for each section
  - Language preserved in filled structure (State 4 validation)
- Test both prompt types with various scenarios
- Verify Vision AI extraction happens during filling phase
- Test with multi-document scenarios (different languages per document)
Phase 6: Document Rendering Updates
- Add language parameter to renderResult() method:
  - Update mainServiceAi.py renderResult() signature (line 460)
  - Pass language to generationService.renderReport() (as global fallback)
- Update renderResult call site (documentPath.py line 151):
  - Language comes from structure (per-document), validated in State 3
  - Use validated currentUserLanguage as global fallback (always valid)
  - Per-document language will be extracted in renderReport() from filledStructure
  - Code example:
      # Language is already validated in structure (State 3) and preserved in filled structure (State 4)
      # Per-document language will be extracted in renderReport() from filledStructure
      # Use validated currentUserLanguage as global fallback (always valid infrastructure)
      language = self.services.currentUserLanguage or "en"  # Uses validated infrastructure
      renderedDocuments = await self.services.ai.renderResult(
          filledStructure,
          outputFormat,
          language,  # ← Global fallback (per-document language extracted from structure in renderReport)
          title or "Generated Document",
          userPrompt,
          docOperationId
      )
- Update renderReport() to handle per-document format and language:
  - Add language parameter to method signature (line 349): language: str (global fallback)
  - Extract per-document format: docFormat = doc.get("outputFormat") or doc.get("format") or outputFormat (check outputFormat field first)
  - Extract per-document language: docLanguage = doc.get("language") or language (from structure, validated in State 3)
  - Validate language format (should be 2-character ISO code, validated in State 3)
  - Add language to metadata passed to renderers: metadata["language"] = docLanguage
  - Note: Per-document format and language are extracted from structure (validated in State 3). Multiple documents can have different formats and languages.
- Error handling:
  - If no documents in structure → raise error
  - If filtering removes all documents → raise error
  - If format not supported → log warning, skip document
- Test multi-document rendering with different formats/languages
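The per-document lookup order for rendering (outputFormat, then the legacy format field, then the global fallback) is simple to sketch. The helper name and the shape of the returned dictionaries are hypothetical; only the two lookup expressions come from the checklist above.

```python
from typing import Any, Dict, List, Optional


def resolveRenderSettings(
    filledStructure: Dict[str, Any],
    outputFormat: Optional[str],
    language: str,
) -> List[Dict[str, Any]]:
    """Resolve format and language per document, using globals only as fallback."""
    documents = filledStructure.get("documents") or []
    if not documents:
        raise ValueError("No documents in structure")
    settings = []
    for doc in documents:
        # Check outputFormat first, then the legacy "format" field, then the global value
        docFormat = doc.get("outputFormat") or doc.get("format") or outputFormat
        docLanguage = doc.get("language") or language
        settings.append({
            "id": doc.get("id"),
            "format": docFormat,
            "metadata": {"language": docLanguage},
        })
    return settings


exampleSettings = resolveRenderSettings(
    {"documents": [
        {"id": "report", "outputFormat": "pdf", "language": "de"},
        {"id": "summary", "format": "html"},
    ]},
    outputFormat="txt",
    language="en",
)
```

In this example the first document keeps its own format and language, while the second picks up the legacy format field and the global language fallback.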
Phase 7: ai.process Refactoring
- Remove extraction logic from ai.process (lines 72-119)
- Make resultType optional: ✅ IMPLEMENTED
  - Update ai.process: Make resultType optional (can be None) - ✅ COMPLETED
  - Update ai.generateDocument: Make resultType optional, removed auto-detection - ✅ COMPLETED
  - Update ai.generateCode: Make resultType optional, removed auto-detection - ✅ COMPLETED
  - If resultType omitted → pass None to callAiContent() (formats determined from prompt) - ✅ COMPLETED
  - Updated action parameter definitions in methodAi.py - ✅ COMPLETED
  Implementation Status:
  - ✅ ai.process: resultType optional, passes None if omitted
  - ✅ ai.generateDocument: resultType optional, passes None if omitted
  - ✅ ai.generateCode: resultType optional, passes None if omitted
  - ✅ callAiContent: Already supports optional outputFormat (defaults to "txt")
  - generateStructure: Make outputFormat optional (see Phase 4 checklist)
- Add filtering to Data Extraction Path (_handleDataExtraction):
  - Location: mainServiceAi.py between line 708 (get documents) and line 721 (extract content)
  - Purpose: Prevent duplicate ContentParts when both original document and pre-extracted JSON are provided
  - Implementation: Copy filtering logic from documentPath.py:62-87
  - Filter out original documents covered by pre-extracted JSONs before calling extractAndPrepareContent()
  - See Phase 3 checklist for detailed filtering code
- Pass documentList to callAiContent() (currently missing, lines 155-162 in process.py)
  - documentList is available in process.py (lines 43-55) but not passed to callAiContent()
  - Add documentList=documentList parameter to callAiContent() call
- Pass contentParts to callAiContent() (already done)
- Error handling:
  - If no documents and no contentParts → raise error
  - If filtering removes all documents → raise error
- Verify intelligent merging in AI service works correctly
Phase 8: Testing
- Test with pre-extracted JSON documents
- Test with mixed documentList + contentParts
- Test per-document format/language determination
- Test two prompt types in structure filling
- Test multi-document output with different formats/languages
- Test security: prompt injection attempts with fenced input
- Test optional outputFormat handling:
  - Test with resultType provided → formats used as fallback
  - Test with resultType omitted → AI determines formats from prompt
  - Test format validation: invalid format → uses "txt"
  - Test format validation: format without renderer → uses "txt"
Phase 9: Documentation
- Update API documentation
- Update developer documentation
- Update user documentation (if applicable)
Priority Order
High Priority (Security & Critical Path):
1. Phase 2: Intent Analysis Updates - Security fix (fencing) is CRITICAL
2. Phase 7: ai.process Refactoring - Add filtering to Data Extraction Path (prevents duplicate ContentParts)
3. Phase 1: Model Updates - Foundation for all other changes
Medium Priority (Architectural Improvements):
4. Phase 4: Structure Generation Updates
   - Make outputFormat optional (AI determines per-document formats)
   - Implement State 3 validation (use existing renderer registry and language infrastructure)
   - Update prompt to require outputFormat field per document
5. Phase 6: Document Rendering Updates
   - Extract per-document format/language from structure
   - Add language parameter to renderResult() and renderReport()
6. Phase 3: Content Extraction Updates
   - Remove redundant pre-extracted check AFTER filtering added upstream
Low Priority (Verification & Polish):
7. Phase 5: Structure Filling Verification (already implemented, verify)
8. Phase 8: Testing
9. Phase 9: Documentation
Notes
- The two prompt types used in structure filling (checklist Phase 5) are already implemented via the isAggregation flag. This step focuses on verification and documentation.
- Per-document format/language determination follows the same pattern as existing per-document language handling.
- The security fix (fencing user input) should be implemented immediately as it addresses a potential prompt injection vulnerability.
Architectural Note: Filtering and Redundant Pre-Extracted JSON Checks
Problem Statement
When a user provides both an original document and a pre-extracted JSON containing ContentParts from that original document, we need to prevent duplicate ContentParts from being created.
Current State
The pre-extracted JSON check happens twice:
- Phase 1 (documentPath.py lines 67-87): Filters documents before intent clarification
- Phase 2 (subContentExtraction.py line 77): Checks again during extraction loop
Why Filtering is Necessary
The redundant check in extractAndPrepareContent() only identifies if a document IS a pre-extracted JSON. It does NOT identify if a document is an ORIGINAL covered by a pre-extracted JSON.
Example:
# In extractAndPrepareContent loop:
for document in [original_pdf_123, pre_extracted_456]:
    # Check document 1: original_pdf_123
    preExtracted = resolvePreExtractedDocument(original_pdf_123)
    # Returns: None (it's not a pre-extracted JSON)
    # → Processes original_pdf_123 → extracts ContentParts
    # Check document 2: pre_extracted_456
    preExtracted = resolvePreExtractedDocument(pre_extracted_456)
    # Returns: {originalDocument: {id: "original_pdf_123"}, ...}
    # → Processes pre_extracted_456 → extracts ContentParts
# Result: BOTH processed → DUPLICATES
The redundant check doesn't help because:
- It only looks at ONE document at a time
- It doesn't know about OTHER documents in the list
- It can't compare documents to find relationships
Why Filtering Works
Filtering happens BEFORE the extraction loop, so it can:
- Look at ALL documents at once
- Identify relationships between documents
- Remove originals BEFORE extraction starts
Code Path Analysis
Path 1: Document Generation Path (documentPath.py)
Location: Line 103
Filtering: ✅ YES (lines 62-87)
- Identifies pre-extracted JSONs
- Filters out original documents covered by pre-extracted JSONs
- Only passes filtered documents to extractAndPrepareContent()
Result: ✅ NO DUPLICATES - Original document already filtered out
Path 2: Data Extraction Path (mainServiceAi.py _handleDataExtraction)
Location: Line 721
Filtering: ❌ NO
- Gets documents directly from documentList (line 708)
- Calls extractAndPrepareContent() without any filtering
- Does NOT filter out original documents covered by pre-extracted JSONs
Result: ❌ DUPLICATES CREATED - Both documents processed, same content extracted twice
Visual Flow Comparison
Document Generation Path (WITH Filtering - CURRENT)
documentList: [original_pdf_123, pre_extracted_456]
↓
[FILTERING] Identify relationships, remove originals
↓
filteredDocuments: [pre_extracted_456] ← original_pdf_123 removed
↓
extractAndPrepareContent([pre_extracted_456])
↓
ContentParts from pre_extracted_456 only
↓
✅ NO DUPLICATES
Data Extraction Path (WITHOUT Filtering - CURRENT)
documentList: [original_pdf_123, pre_extracted_456]
↓
[NO FILTERING] Pass all documents
↓
extractAndPrepareContent([original_pdf_123, pre_extracted_456])
↓
Process original_pdf_123 → ContentParts
Process pre_extracted_456 → ContentParts
↓
❌ DUPLICATES (same content twice)
Data Extraction Path (WITH Filtering - TARGET)
documentList: [original_pdf_123, pre_extracted_456]
↓
[FILTERING] Identify relationships, remove originals
↓
filteredDocuments: [pre_extracted_456] ← original_pdf_123 removed
↓
extractAndPrepareContent([pre_extracted_456])
↓
ContentParts from pre_extracted_456 only
↓
✅ NO DUPLICATES
Solution
Target State: Add filtering to Data Extraction Path, then remove redundant check
Steps:
- Add filtering logic to _handleDataExtraction (between line 708 and line 721)
  - Copy filtering code from documentPath.py lines 62-87
  - Filter out original documents covered by pre-extracted JSONs
- Remove redundant check from extractAndPrepareContent() (line 77)
  - Trust that filtering is done upstream
  - Cleaner code, single responsibility
Risk Assessment:
- If we remove redundant check WITHOUT adding filtering: ⚠️ Duplicates still occur (no change from current state)
- If we add filtering THEN remove redundant check: ✅ No duplicates, cleaner code
Conclusion
- Filtering is necessary because it can look at ALL documents and identify relationships
- Redundant check is insufficient because it only looks at ONE document at a time
- Current state: Document Generation Path filters → safe. Data Extraction Path doesn't filter → duplicates possible
- Solution: Add filtering to Data Extraction Path, then remove redundant check (it's not needed if filtering is done)
- Risk of removing redundant check: None IF filtering is added first. High IF filtering is NOT added (but duplicates already exist anyway)
Appendix: Pre-Extracted JSON Document Check Locations
Where the Check is Done
1. Phase 1 (Before Intent Clarification):
- File: gateway/modules/services/serviceGeneration/paths/documentPath.py
- Lines: 67-87
- Purpose: Filter documents before intent analysis
- Method: self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc)
- Action: Identifies pre-extracted JSONs and filters out original documents covered by them
2. Phase 2 (During Content Extraction):
- File: gateway/modules/services/serviceAi/subContentExtraction.py
- Line: 77
- Purpose: Process each document during extraction loop
- Method: self.intentAnalyzer.resolvePreExtractedDocument(document)
- Action: Extracts ContentParts from pre-extracted JSON (does not treat it as regular JSON)
- Note: ⚠️ REDUNDANT - This check happens again even though Phase 1 already filtered documents
- Reason: extractAndPrepareContent() is called from multiple code paths:
  - Document generation path (documentPath.py) - filtering already done
  - Data extraction path (mainServiceAi.py) - filtering may not be done
  - The extraction service needs to handle pre-extracted JSONs defensively
- Optimization Opportunity: Could pass filtered documents or a flag to skip redundant checks
3. Check Implementation:
- File: gateway/modules/services/serviceAi/subDocumentIntents.py
- Line: 122
- Method: resolvePreExtractedDocument(document: ChatDocument)
- Logic:
  - Checks if mimeType == "application/json"
  - Parses JSON and checks for validationMetadata.actionType == "context.extractContent"
  - Extracts ContentExtracted structure from documentData
  - Returns dict with originalDocument and contentExtracted info
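The four logic steps of the check can be illustrated with a standalone sketch. The function name and the JSON layout are assumptions made for illustration: the payload is taken to hold validationMetadata and documentData at the top level, with originalDocument nested inside documentData. The real ChatDocument and ContentExtracted structures are not shown in this plan and may differ.

```python
import json
from typing import Any, Dict, Optional


def resolvePreExtractedSketch(mimeType: str, rawData: str) -> Optional[Dict[str, Any]]:
    """Return original-document info if rawData looks like a pre-extracted JSON."""
    # Step 1: only JSON documents can be pre-extracted results
    if mimeType != "application/json":
        return None
    # Step 2: parse and check the action type marker
    try:
        payload = json.loads(rawData)
    except (TypeError, ValueError):
        return None
    if not isinstance(payload, dict):
        return None
    metadata = payload.get("validationMetadata")
    if not isinstance(metadata, dict) or metadata.get("actionType") != "context.extractContent":
        return None
    # Step 3: pull out the extracted content (ASSUMED key layout)
    contentExtracted = payload.get("documentData")
    if not isinstance(contentExtracted, dict):
        return None
    originalDocument = contentExtracted.get("originalDocument")
    if not isinstance(originalDocument, dict):
        return None
    # Step 4: return both pieces for the caller
    return {"originalDocument": originalDocument, "contentExtracted": contentExtracted}
```

Every failed check returns None, so callers can use the result directly as a truthiness test, as the filtering code in this plan does.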
Where Final Merged List is Available
After Phase 2 (Content Extraction):
- File: gateway/modules/services/serviceGeneration/paths/documentPath.py
- Line: 119
- Code: contentParts = preparedContentParts
- State:
  - ✅ All pre-extracted JSON documents processed → ContentParts
  - ✅ All regular documents extracted → ContentParts
  - ✅ All provided contentParts merged
  - ✅ Final clean merged list ready for Phase 3 (Structure Generation)
Before Phase 3 (Structure Generation):
- File: gateway/modules/services/serviceGeneration/paths/documentPath.py
- Line: 129
- Usage: contentParts or [] passed to generateStructure()
- Note: This is the clean merged list containing all ContentParts from all sources
Appendix: Intent Mapping Logic for Pre-Extracted JSONs
How Intent Mapping Works
Problem: When a pre-extracted JSON document is provided, we need to:
- Analyze intents for the original document (not the JSON file itself)
- Map the intents back to the JSON document ID (so they can be applied to the ContentParts extracted from the JSON)
Implementation Logic (Already in clarifyDocumentIntents)
Location: gateway/modules/services/serviceAi/subDocumentIntents.py lines 63-104
Step 1: Build Mapping (lines 63-83)
documentMapping = {}  # Maps original doc ID → JSON doc ID
resolvedDocuments = []
for doc in documents:
    preExtracted = self.resolvePreExtractedDocument(doc)
    if preExtracted:
        # This is a pre-extracted JSON
        originalDocId = preExtracted["originalDocument"]["id"]
        jsonDocId = doc.id  # Current document is the JSON
        # Map: original doc ID → JSON doc ID
        documentMapping[originalDocId] = jsonDocId
        # Create temporary ChatDocument for original document
        originalDoc = ChatDocument(
            id=originalDocId,
            fileName=preExtracted["originalDocument"]["fileName"],
            mimeType=preExtracted["originalDocument"]["mimeType"],
            # ... other fields from preExtracted["originalDocument"]
        )
        resolvedDocuments.append(originalDoc)  # Use original doc for intent analysis
    else:
        resolvedDocuments.append(doc)  # Regular document, use as-is
Result:
- documentMapping = {"original_pdf_123": "pre_extracted_456"}
- resolvedDocuments = [ChatDocument(id="original_pdf_123"), ChatDocument(id="other_doc")]
Step 2: AI Analyzes Intents (line 86)
# AI analyzes intents for resolvedDocuments (original documents, not JSONs)
intentPrompt = self._buildIntentAnalysisPrompt(userPrompt, resolvedDocuments, actionParameters)
aiResponse = await self.aiService.callAiPlanning(prompt=intentPrompt, ...)
AI Response:
{
  "intents": [
    {
      "documentId": "original_pdf_123",  // ← Original document ID
      "intents": ["extract"],
      "extractionPrompt": "Extract all text",
      "reasoning": "..."
    }
  ]
}
Step 3: Map Intents Back to JSON Doc IDs (lines 96-104)
intentsData = json.loads(self.services.utils.jsonExtractString(aiResponse))
documentIntents = []
for intent in intentsData.get("intents", []):
    docId = intent.get("documentId")  # "original_pdf_123"
    # If intent is for an original document covered by a pre-extracted JSON
    if docId in documentMapping:
        # Map back to JSON document ID
        intent["documentId"] = documentMapping[docId]  # "pre_extracted_456"
    documentIntents.append(DocumentIntent(**intent))
Result:
- DocumentIntent(documentId="pre_extracted_456", intents=["extract"], ...)
- Intent is now mapped to the JSON document ID, so it can be applied to ContentParts extracted from the JSON
Why This Works
- AI analyzes original documents: More meaningful context (file name, MIME type, etc.)
- Intents mapped to JSON IDs: ContentParts extracted from JSON can be tagged with correct intents
- Consistent with filtering: Original documents are filtered out, but their intents are preserved via mapping
Example Flow
Input:
- documentList: [original_pdf_123.pdf, pre_extracted_456.json]
Step 1: Filtering (Phase 1)
- Identify: pre_extracted_456.json covers original_pdf_123.pdf
- Filter: Remove original_pdf_123.pdf
- Result: documents = [pre_extracted_456.json]
Step 2: Intent Mapping (Phase 1)
- Build mapping: {"original_pdf_123": "pre_extracted_456"}
- Resolve: resolvedDocuments = [ChatDocument(id="original_pdf_123")]
- AI analyzes: intents for "original_pdf_123"
- Map back: intents for "pre_extracted_456"
Step 3: Content Extraction (Phase 2)
- Extract ContentParts from pre_extracted_456.json
- Apply intents (from Step 2) to ContentParts
- Result: ContentParts with correct intents
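The example flow above can be sketched end to end in a few lines. This is a minimal illustration, not the project's code: the DocumentIntent dataclass is a simplified stand-in, filterCoveredOriginals and mapIntentsToJsonIds are hypothetical helper names, and documentMapping is assumed to have been built already (how the coverage is detected is not shown here).

```python
from dataclasses import dataclass, field


@dataclass
class DocumentIntent:
    # Simplified stand-in for the real DocumentIntent model (assumption).
    documentId: str
    intents: list = field(default_factory=list)
    extractionPrompt: str = ""
    reasoning: str = ""


def filterCoveredOriginals(documentIds, documentMapping):
    """Step 1: drop originals that a pre-extracted JSON already covers."""
    return [docId for docId in documentIds if docId not in documentMapping]


def mapIntentsToJsonIds(rawIntents, documentMapping):
    """Step 2: rewrite original document IDs to the IDs of the covering JSONs."""
    mapped = []
    for intent in rawIntents:
        docId = intent.get("documentId")
        # IDs without a mapping entry are kept unchanged.
        targetId = documentMapping.get(docId, docId)
        mapped.append(DocumentIntent(**{**intent, "documentId": targetId}))
    return mapped


# original document ID -> pre-extracted JSON document ID
documentMapping = {"original_pdf_123": "pre_extracted_456"}

documents = filterCoveredOriginals(
    ["original_pdf_123", "pre_extracted_456"], documentMapping
)

# Shape of the AI response shown earlier, keyed by the ORIGINAL document ID.
aiIntents = [
    {
        "documentId": "original_pdf_123",
        "intents": ["extract"],
        "extractionPrompt": "Extract all text",
        "reasoning": "...",
    }
]
documentIntents = mapIntentsToJsonIds(aiIntents, documentMapping)
```

After this runs, documents contains only the JSON document, and the single intent points at pre_extracted_456, so Phase 2 can attach it to the ContentParts extracted from that JSON.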
Implementation Notes
Infrastructure Available
The following infrastructure already exists and should be reused:
- Language Validation: currentUserLanguage is validated at workflowManager.py:695-727 - always a valid 2-character ISO code. Access via self.services.currentUserLanguage or the _getUserLanguage() method.
- Format Validation: Renderer registry exists at mainServiceGeneration.py:529 (_getFormatRenderer() uses getRenderer()). Import: from modules.services.serviceGeneration.renderers.registry import getRenderer. Returns None if the format is invalid, falls back to text renderer.
- Language Extraction: _getDocumentLanguage() works correctly at subStructureFilling.py:349 - extracts per-document language from structure. Used properly during section generation.
Key Implementation Points
- Per-Document Format/Language: Multiple documents can have different formats and languages. AI determines these from the user prompt. Parameters are only validation fallbacks.
- Filtering: Must filter pre-extracted JSONs before content extraction to prevent duplicate ContentParts. Filtering logic exists in documentPath.py:62-87 and should be copied to the data extraction path.
- State 3 Validation: Use existing infrastructure (getRenderer(), _getUserLanguage()) for validation. Infrastructure exists, just needs to be called.
- Rendering: Extract per-document outputFormat and language from structure (validated in State 3). Check the outputFormat field first, then the format field (legacy), then the global fallback.
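The lookup order described for rendering can be sketched as a small helper. This is an illustration under assumptions: resolveOutputFormat is a hypothetical name, documents are treated as plain dicts, and the final "txt" default is borrowed from Validation 3.3 in the appendix.

```python
def resolveOutputFormat(doc, globalFormat=None, defaultFormat="txt"):
    """Pick the render format: outputFormat, then legacy format, then global."""
    for candidate in (doc.get("outputFormat"), doc.get("format"), globalFormat):
        # Ignore missing, empty, or non-string values and try the next source.
        if isinstance(candidate, str) and candidate:
            return candidate
    return defaultFormat


newStyleDoc = {"outputFormat": "pdf", "format": "docx"}
legacyDoc = {"format": "docx"}
bareDoc = {}

formats = [
    resolveOutputFormat(newStyleDoc, globalFormat="html"),
    resolveOutputFormat(legacyDoc, globalFormat="html"),
    resolveOutputFormat(bareDoc, globalFormat="html"),
    resolveOutputFormat(bareDoc),
]
```

Because each document is resolved on its own, two documents in the same structure can end up with different formats, which is the per-document behavior the plan calls for.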
Appendix: Validation Failure Handling Decisions
This appendix documents the decision-making process for how to handle each validation failure. The actual implementation code is integrated into Section 3 above.
Approach
- Try to fix automatically (use defaults) when validation fails
- All validations are critical: every failure must either be fixed automatically or stop with an error
- Validation happens inline in each phase method
State 1: After Intent Clarification
Validation 1.1: Intent count mismatch
Check: len(documentIntents) != len(documents)
Decision: Documents without intents are OK. Intents for non-existing documents should be skipped.
Rationale: Not all documents need intents (some may be reference-only). Intents referencing unknown documents are invalid and should be removed.
Validation 1.2: Intent references unknown document
Check: intent.documentId not in documentIds
Decision: Skip this intent (remove it)
Rationale: Cannot map intent to non-existent document. Better to skip than fail.
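A minimal sketch of the State 1 decisions, assuming a simplified DocumentIntent stand-in and a hypothetical validateIntents helper. Documents without intents are left alone; intents that point at unknown documents are dropped.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentIntent:
    # Simplified stand-in for the real DocumentIntent model (assumption).
    documentId: str
    intents: list = field(default_factory=list)


def validateIntents(documentIntents, documentIds):
    """Keep only intents whose documentId refers to a known document."""
    knownIds = set(documentIds)
    return [intent for intent in documentIntents if intent.documentId in knownIds]


documentIds = ["pre_extracted_456", "reference_789"]
documentIntents = [
    DocumentIntent("pre_extracted_456", ["extract"]),
    DocumentIntent("missing_000", ["extract"]),  # references an unknown document
]
validIntents = validateIntents(documentIntents, documentIds)
```

Note that reference_789 has no intent and is not treated as an error, which matches the decision for Validation 1.1.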
State 2: After Content Extraction
Validation 2.1: ContentPart missing documentId
Check: not part.metadata.get("documentId")
Decision: Skip this ContentPart (remove it) with warning in logger
Rationale: ContentPart without documentId cannot be properly assigned. Skip with warning for debugging.
Validation 2.2: ContentPart has invalid contentFormat
Check: contentFormat not in ["extracted", "object", "reference"]
Decision: Skip this ContentPart (remove it) with warning in logger
Rationale: Invalid contentFormat indicates corrupted data. Skip with warning for debugging.
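The two State 2 checks could look roughly like the following. The ContentPart class here is a hypothetical stand-in, and the sketch assumes contentFormat is stored in metadata next to documentId; if the real model exposes it as an attribute, the lookup changes accordingly.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)

VALID_CONTENT_FORMATS = {"extracted", "object", "reference"}


@dataclass
class ContentPart:
    # Hypothetical stand-in; the real ContentPart model has more fields.
    metadata: dict = field(default_factory=dict)


def validateContentParts(contentParts):
    """Drop parts without documentId or with an unknown contentFormat."""
    validParts = []
    for part in contentParts:
        documentId = part.metadata.get("documentId")
        if not documentId:
            logger.warning("Skipping ContentPart without documentId: %r", part)
            continue
        contentFormat = part.metadata.get("contentFormat")
        if contentFormat not in VALID_CONTENT_FORMATS:
            logger.warning(
                "Skipping ContentPart with invalid contentFormat %r (documentId=%s)",
                contentFormat,
                documentId,
            )
            continue
        validParts.append(part)
    return validParts


parts = [
    ContentPart({"documentId": "pre_extracted_456", "contentFormat": "extracted"}),
    ContentPart({"contentFormat": "object"}),  # missing documentId
    ContentPart({"documentId": "pre_extracted_456", "contentFormat": "bogus"}),
]
keptParts = validateContentParts(parts)
```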
State 3: After Structure Generation
Validation 3.1: Structure missing 'documents' field
Check: "documents" not in chapterStructure
Decision: Stop with error (cannot auto-fix - structure is invalid)
Rationale: Structure without documents field is fundamentally broken. Cannot proceed.
Validation 3.2: Structure has no documents
Check: len(documents) == 0
Decision: Stop with error (cannot generate without documents)
Rationale: Cannot generate output without documents. Must have at least one document.
Validation 3.3: Document missing 'outputFormat' field
Check: "outputFormat" not in doc
Decision: Use the global fallback format (from parameters); if that is not available, use the default "txt"
Rationale: Format is required for rendering. Use fallback chain: per-document → global → default.
Validation 3.4: Document has invalid outputFormat
Check: outputFormat not in valid formats
Decision: Use renderer registry to check if format has a renderer. If no renderer exists, try global fallback, then default "txt"
Rationale: Use dynamic renderer registry (not hardcoded list) to check format validity. Fallback chain ensures we always have a valid format.
Validation 3.5: Document missing 'language' field
Check: "language" not in doc
Decision: Use the user prompt language (from self.services.currentUserLanguage via _getUserLanguage()), not a hardcoded "en" fallback
Rationale: Language is required for content generation. Use user prompt language (detected from user intention analysis) as fallback, not hardcoded "en".
Validation 3.6: Document has invalid language
Check: len(doc["language"]) != 2
Decision: Use validated currentUserLanguage (always valid, validated during user intention analysis)
Rationale: currentUserLanguage is validated during user intention analysis and is always a valid 2-character ISO 639-1 code. Safe to use directly.
Validation 3.7: Document missing 'chapters' field
Check: "chapters" not in doc
Decision: Stop with error (cannot auto-fix - document structure invalid)
Rationale: Document without chapters is structurally invalid. Cannot proceed.
Validation 3.8: Chapter missing 'contentParts' field
Check: "contentParts" not in chapter
Decision: Stop with error (cannot auto-fix - chapter structure invalid)
Rationale: Chapter without contentParts field is structurally invalid. Cannot proceed.
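The State 3 decisions combine hard errors (3.1, 3.2, 3.7, 3.8) with automatic fixes (3.3 to 3.6). The sketch below shows one way to arrange them. It is illustrative only: validateStructure and StructureValidationError are hypothetical names, and isFormatSupported is a caller-supplied callable standing in for a renderer registry lookup, since the exact getRenderer() signature is not shown in this plan.

```python
class StructureValidationError(ValueError):
    """Raised when the structure cannot be repaired automatically."""


def validateStructure(chapterStructure, isFormatSupported, userLanguage,
                      globalFormat=None, defaultFormat="txt"):
    # 3.1 / 3.2: the structure must contain at least one document.
    if "documents" not in chapterStructure:
        raise StructureValidationError("Structure is missing the 'documents' field")
    documents = chapterStructure["documents"]
    if len(documents) == 0:
        raise StructureValidationError("Structure has no documents")

    for doc in documents:
        # 3.7 / 3.8: structural fields cannot be auto-fixed.
        if "chapters" not in doc:
            raise StructureValidationError("Document is missing the 'chapters' field")
        for chapter in doc["chapters"]:
            if "contentParts" not in chapter:
                raise StructureValidationError("Chapter is missing 'contentParts'")

        # 3.3 / 3.4: per-document format, then global fallback, then default.
        candidates = (doc.get("outputFormat"), globalFormat)
        doc["outputFormat"] = next(
            (fmt for fmt in candidates if fmt and isFormatSupported(fmt)),
            defaultFormat,
        )

        # 3.5 / 3.6: a missing or malformed language falls back to the user language.
        language = doc.get("language")
        if not isinstance(language, str) or len(language) != 2:
            doc["language"] = userLanguage
    return chapterStructure


supportedFormats = {"pdf", "docx", "txt"}  # stand-in for the renderer registry
structure = {
    "documents": [
        {"outputFormat": "xyz", "language": "english",
         "chapters": [{"contentParts": []}]},
        {"language": "de", "chapters": [{"contentParts": []}]},
    ]
}
validateStructure(structure, supportedFormats.__contains__, "fr", globalFormat="docx")
```

In this run the first document ends up with outputFormat "docx" and language "fr", while the second keeps "de" and receives "docx" from the global fallback.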
State 4: After Structure Filling
Validation 4.1: Filled structure missing 'documents' field
Check: "documents" not in filledStructure
Decision: Stop with error (cannot auto-fix - structure is invalid)
Rationale: Structure without documents field is fundamentally broken. Cannot proceed.
Validation 4.2: Section missing 'elements' field
Check: "elements" not in section
Decision: Create empty elements list: section["elements"] = []
Rationale: Section can be intentionally empty. Create empty list to maintain structure.
Validation 4.3: Section has empty elements list
Check: not section["elements"] (empty list)
Decision: Allow empty elements (section might be intentionally empty)
Rationale: Empty sections are valid (e.g., placeholder sections). No action needed.
Validation 4.4: Document missing 'language' field in filled structure
Check: "language" not in doc (in filledStructure)
Decision: Stop with error (language MUST be preserved from Phase 3)
Rationale: Language is validated and set in Phase 3 (State 3). If missing in filled structure, it's a critical error - language must be preserved.
Validation 4.5: Document has invalid language format in filled structure
Check: not isinstance(doc["language"], str) or len(doc["language"]) != 2
Decision: Stop with error (language format MUST be valid)
Rationale: Language format is validated in Phase 3 (State 3). If invalid in filled structure, it's a critical error.
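A possible shape for the State 4 checks is sketched below. The names validateFilledStructure, FilledStructureError, and iterSections are hypothetical, and the nesting documents → chapters → sections is an assumption about the filled structure; only the decisions themselves come from the list above.

```python
class FilledStructureError(ValueError):
    """Raised when the filled structure violates a Phase 3 guarantee."""


def iterSections(doc):
    # Assumed layout: each document has chapters, each chapter has sections.
    for chapter in doc.get("chapters", []):
        yield from chapter.get("sections", [])


def validateFilledStructure(filledStructure):
    # 4.1: a structure without documents cannot be repaired.
    if "documents" not in filledStructure:
        raise FilledStructureError("Filled structure is missing 'documents'")

    for doc in filledStructure["documents"]:
        # 4.4: language must survive from Phase 3.
        if "language" not in doc:
            raise FilledStructureError("Document lost its 'language' field")
        # 4.5: language must still be a 2-character string.
        language = doc["language"]
        if not isinstance(language, str) or len(language) != 2:
            raise FilledStructureError(f"Invalid language value: {language!r}")

        for section in iterSections(doc):
            # 4.2: add the missing list. 4.3: an empty list is left as-is.
            section.setdefault("elements", [])
    return filledStructure


filled = {
    "documents": [
        {
            "language": "de",
            "chapters": [
                {"sections": [{"title": "Intro"}, {"title": "Empty", "elements": []}]}
            ],
        }
    ]
}
validateFilledStructure(filled)
```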
State 5: After Document Rendering
Validation 5.1: No documents rendered
Check: len(renderedDocuments) == 0
Decision: Stop with error (already implemented in documentPath.py line 176)
Rationale: Cannot return empty result. Error already implemented.
Validation 5.2: Rendered document has empty documentData
Check: not doc.documentData
Decision: Skip this document (remove from list)
Rationale: Empty document is not useful. Skip it rather than fail entire operation.
Validation 5.3: Rendered document missing mimeType
Check: not doc.mimeType
Decision: Infer mimeType from filename extension
Rationale: mimeType can be inferred from filename. Use utility function to detect.
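The State 5 decisions can be illustrated with the standard library's mimetypes module as the "utility function" for inference. RenderedDocument and its fileName field are hypothetical stand-ins, the "application/octet-stream" default for unknown extensions is an assumption, and so is the choice to run the emptiness check after skipping empty documents (which also covers the case where nothing was rendered at all).

```python
import mimetypes
from dataclasses import dataclass


@dataclass
class RenderedDocument:
    # Hypothetical stand-in for the rendered document model.
    fileName: str
    documentData: bytes
    mimeType: str = ""


def validateRenderedDocuments(renderedDocuments):
    validDocuments = []
    for doc in renderedDocuments:
        # 5.2: skip documents without data instead of failing the operation.
        if not doc.documentData:
            continue
        # 5.3: infer a missing mimeType from the filename extension.
        if not doc.mimeType:
            guessedType, _ = mimetypes.guess_type(doc.fileName)
            doc.mimeType = guessedType or "application/octet-stream"
        validDocuments.append(doc)

    # 5.1: an empty result is an error.
    if len(validDocuments) == 0:
        raise ValueError("No documents rendered")
    return validDocuments


rendered = [
    RenderedDocument("report.pdf", b"%PDF-1.7 ..."),
    RenderedDocument("empty.docx", b""),
    RenderedDocument("notes.txt", b"hello", mimeType="text/plain"),
]
finalDocuments = validateRenderedDocuments(rendered)
```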