# Implementation Plan: Content Handling Architecture Migration

## Overview

This document provides a detailed implementation plan for migrating to the target architecture for content extraction and document generation. The plan focuses on:

- **Documents and Content Handling**: Intelligent merging of `documentList` and `contentParts` with deduplication
- **Output Document Formats**: Per-document format determination (not global). The AI determines formats from the user prompt, and multiple documents can have different formats
- **Languages Handling**: Per-document language determination (not global), using the validated `currentUserLanguage` infrastructure
- **Clear Handover States**: Defined validation at each phase boundary using existing infrastructure
- **Structure Filling**: Two prompt types (with content vs. without content)

## Verified Infrastructure (Ready to Use)

The following infrastructure already exists and can be reused:

- ✅ **Language Validation**: `currentUserLanguage` is validated at `workflowManager.py:695-727` and is always a valid 2-character ISO code (the AI response is validated, with fallback to the user language, then "en"). Safe to use via `self.services.currentUserLanguage` or the `_getUserLanguage()` method.

- ✅ **Format Validation**: Renderer registry exists at `mainServiceGeneration.py:529` (`_getFormatRenderer()` uses `getRenderer()`). Can be imported: `from modules.services.serviceGeneration.renderers.registry import getRenderer`. Returns None if format invalid, falls back to text renderer.

- ✅ **Language Extraction**: `_getDocumentLanguage()` works correctly at `subStructureFilling.py:349` and extracts the per-document language from the structure. Used properly during section generation.

## Context

This implementation plan is based on the analysis documented in:
- `gateway/modules/services/serviceAi/CONTENT_EXTRACTION_ANALYSIS.md` (Section 9.3: Target State)

The target architecture addresses architectural issues identified in the current implementation:
1. **Single extraction path** in AI service (no duplication in `ai.process`)
2. **Intelligent merging** of `contentParts` and `documentList` with deduplication
3. **Clear separation** of concerns: action layer delegates to service layer
4. **Consistent behavior** across all code paths
5. **Per-document format/language** determination (not global)
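
The merging with deduplication named in item 2 can be illustrated with a minimal sketch. It assumes a simplified `ContentPart` stand-in (the real model lives in the project) and assumes that `(documentId, id)` is a suitable deduplication key; both are assumptions of this example, and `mergeContentParts` is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ContentPart:
    """Simplified stand-in for the project's ContentPart model (assumption)."""
    id: str
    data: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def mergeContentParts(*partLists: List[ContentPart]) -> List[ContentPart]:
    """Merge several ContentPart lists, keeping the first part seen per key.

    The dedup key (documentId, id) is an assumption made for this sketch.
    """
    seen = set()
    merged: List[ContentPart] = []
    for parts in partLists:
        for part in parts:
            key = (part.metadata.get("documentId"), part.id)
            if key in seen:
                continue  # duplicate of an earlier part: keep the first one
            seen.add(key)
            merged.append(part)
    return merged
```

Passing the pre-extracted parts first would give them priority over freshly extracted duplicates, since the first occurrence of each key wins.
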

---

## 1. Overview: Major Phases and Handover States

### Phase Flow Diagram

```
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 1: Document Intent Clarification                              │
│ ──────────────────────────────────────────────────────────────────  │
│ INPUT:                                                              │
│   - userPrompt: str (fenced)                                        │
│   - documentList: DocumentReferenceList (optional)                  │
│   - contentParts: List[ContentPart] (optional)                      │
│   - actionParameters: Dict (outputFormat, language, etc.)           │
│                                                                     │
│ THROUGHPUT:                                                         │
│   1. Resolve documents from documentList                            │
│   2. Identify pre-extracted JSON documents                          │
│      - Check if JSON contains ContentExtracted structure            │
│      - Map pre-extracted JSONs to original documents                │
│   3. Filter out original documents covered by pre-extracted         │
│   4. AI analyzes document purposes                                  │
│   5. Map intents back to JSON doc IDs (if applicable)               │
│                                                                     │
│ OUTPUT:                                                             │
│   - documentIntents: List[DocumentIntent]                           │
│     * documentId: str                                               │
│     * intents: List[str] (["extract", "render", "reference"])       │
│     * extractionPrompt: str (optional)                              │
│     * reasoning: str                                                │
│   Note: outputFormat and language are NOT determined here -         │
│   they're determined in Phase 3 (Structure Generation)              │
│                                                                     │
│ HANDOVER STATE:                                                     │
│   - documentIntents: Complete intent analysis                       │
│   - documents: Resolved ChatDocuments                               │
│   - preExtractedMapping: Map[originalDocId, jsonDocId]              │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 2: Content Extraction and Preparation                         │
│ ──────────────────────────────────────────────────────────────────  │
│ INPUT:                                                              │
│   - documents: List[ChatDocument]                                   │
│   - documentIntents: List[DocumentIntent]                           │
│   - contentParts: List[ContentPart] (optional, pre-extracted)       │
│   - preExtractedMapping: Map[originalDocId, jsonDocId]              │
│                                                                     │
│ THROUGHPUT:                                                         │
│   1. Process pre-extracted JSON documents → ContentParts            │
│      - Extract ContentParts from JSON (not treat as regular JSON)   │
│      - Apply intents (extract, render, reference)                   │
│      - Mark with isPreExtracted=True                                │
│   2. RAW extraction (NO AI) for regular documents                   │
│      - Extract content using extraction service                     │
│      - Create ContentParts with metadata                            │
│   3. Merge all ContentParts                                         │
│      - Pre-extracted parts (from JSON documents)                    │
│      - Extracted parts (from regular documents)                     │
│      - Provided parts (from contentParts parameter)                 │
│   4. Apply intents to ContentParts (extract, render, reference)     │
│   5. Mark images for Vision AI extraction (deferred)                │
│                                                                     │
│ OUTPUT:                                                             │
│   - finalContentParts: List[ContentPart]                            │
│     * id: str                                                       │
│     * typeGroup: str                                                │
│     * mimeType: str                                                 │
│     * data: Union[str, bytes]                                       │
│     * metadata: Dict                                                │
│       - documentId: str                                             │
│       - contentFormat: str ("extracted", "object", "reference")     │
│       - intent: str                                                 │
│       - needsVisionExtraction: bool (for images)                    │
│       - extractionPrompt: str (for Vision AI)                       │
│       - originalFileName: str                                       │
│       - isPreExtracted: bool                                        │
│   Note: outputFormat and language are NOT propagated here -         │
│   they're determined in Phase 3 (Structure Generation)              │
│                                                                     │
│ HANDOVER STATE:                                                     │
│   - finalContentParts: Complete merged list                         │
│   - All pre-extracted JSON documents processed → ContentParts       │
│   - All regular documents extracted → ContentParts                  │
│   - All provided contentParts merged                                │
│   - All documents processed (extracted or pre-extracted)            │
│   - Vision AI extraction deferred to Phase 4                        │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 3: Structure Generation                                       │
│ ──────────────────────────────────────────────────────────────────  │
│ INPUT:                                                              │
│   - userPrompt: str                                                 │
│   - finalContentParts: List[ContentPart]                            │
│   - outputFormat: Optional[str]                                     │
│     (optional fallback, defaults to "txt")                          │
│   - currentUserLanguage: str                                        │
│     * From: self.services.currentUserLanguage                       │
│       (always valid, validated during user intention analysis)      │
│                                                                     │
│ THROUGHPUT:                                                         │
│   1. Group ContentParts by documentId (for context)                 │
│   2. AI generates structure with documents and chapters             │
│   3. AI determines per-document outputFormat in structure JSON      │
│      from user prompt → else outputFormat fallback (or "txt")       │
│   4. AI determines per-document language in structure JSON          │
│      from user prompt → else validated currentUserLanguage          │
│   5. Assign ContentParts to chapters                                │
│                                                                     │
│ OUTPUT:                                                             │
│   - chapterStructure: Dict                                          │
│     * documents: List[Dict]                                         │
│       - id: str                                                     │
│       - title: str                                                  │
│       - outputFormat: str (per-document) ← NEW                      │
│       - language: str (per-document) ← NEW                          │
│       - chapters: List[Dict]                                        │
│         * id: str                                                   │
│         * level: int                                                │
│         * title: str                                                │
│         * generationHint: str                                       │
│         * contentParts: List[str] (ContentPart IDs)                 │
│                                                                     │
│ HANDOVER STATE:                                                     │
│   - chapterStructure: Complete structure with ContentPart           │
│     assignments                                                     │
│   - Per-document format/language determined                         │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 4: Structure Filling                                          │
│ ──────────────────────────────────────────────────────────────────  │
│ INPUT:                                                              │
│   - chapterStructure: Dict                                          │
│     (with per-document language from Phase 3)                       │
│   - finalContentParts: List[ContentPart]                            │
│   - userPrompt: str                                                 │
│                                                                     │
│ THROUGHPUT:                                                         │
│   For each document (with per-document language):                   │
│     For each chapter:                                               │
│       1. Generate sections structure (parallel)                     │
│       2. For each section:                                          │
│          a. Extract per-document language from structure            │
│          b. Check if ContentParts need Vision AI extraction         │
│          c. If yes: Call Vision AI (Phase 2 deferred extraction)    │
│          d. Determine prompt type:                                  │
│             - WITH CONTENT: If contentParts assigned                │
│               → Use aggregation prompt (isAggregation=True)         │
│               → ContentParts passed as parameters                   │
│               → Use per-document language for generation            │
│             - WITHOUT CONTENT: If no contentParts                   │
│               → Use generation prompt (isAggregation=False)         │
│               → Only generationHint in prompt                       │
│               → Use per-document language for generation            │
│          e. Generate section content with AI                        │
│                                                                     │
│ OUTPUT:                                                             │
│   - filledStructure: Dict                                           │
│     * documents: List[Dict]                                         │
│       - language: str (per-document, preserved from structure)      │
│       - chapters: List[Dict]                                        │
│         * sections: List[Dict]                                      │
│           - id: str                                                 │
│           - content_type: str                                       │
│           - elements: List[Dict]                                    │
│             * type: str                                             │
│             * content: str (or base64 for images)                   │
│                                                                     │
│ HANDOVER STATE:                                                     │
│   - filledStructure: Complete content, ready for rendering          │
│   - Per-document language preserved from structure                  │
│   - All Vision AI extractions completed                             │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 5: Document Rendering                                         │
│ ──────────────────────────────────────────────────────────────────  │
│ INPUT:                                                              │
│   - filledStructure: Dict                                           │
│   - per-document outputFormat                                       │
│     (from Phase 3, determined from prompt)                          │
│   - per-document language                                           │
│     (from Phase 3, validated currentUserLanguage)                   │
│                                                                     │
│ THROUGHPUT:                                                         │
│   1. Group sections by document (from structure)                    │
│   2. For each document:                                             │
│      a. Use per-document outputFormat                               │
│      b. Use per-document language                                   │
│      c. Render document in specified format                         │
│                                                                     │
│ OUTPUT:                                                             │
│   - renderedDocuments: List[DocumentData]                           │
│     * documentName: str                                             │
│     * documentData: bytes                                           │
│     * mimeType: str                                                 │
│                                                                     │
│ HANDOVER STATE:                                                     │
│   - renderedDocuments: Final output ready for user                  │
└─────────────────────────────────────────────────────────────────────┘
```
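
Phase 1's filtering step (THROUGHPUT item 3) could look roughly like the following. This is a minimal sketch that assumes documents are identified by plain string IDs and that `preExtractedMapping` maps each original document ID to the ID of its pre-extracted JSON document, as in the handover state; `filterCoveredDocuments` is a hypothetical helper name, not an existing function.

```python
from typing import Dict, List


def filterCoveredDocuments(
    documentIds: List[str],
    preExtractedMapping: Dict[str, str],
) -> List[str]:
    """Drop original documents that a pre-extracted JSON document already covers.

    Keys of preExtractedMapping are original document IDs, values are the IDs
    of the JSON documents that carry their pre-extracted content.
    """
    covered = set(preExtractedMapping.keys())
    # Preserve input order; JSON documents themselves are never filtered out.
    return [docId for docId in documentIds if docId not in covered]
```

The JSON documents stay in the list, so Phase 2 can turn them into ContentParts instead of extracting the originals a second time.
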

---

## 2. Detailed Implementation Steps

### Step 1: Update DocumentIntent Model

**File**: `gateway/modules/datamodels/datamodelExtraction.py`

**Changes**:
```python
from typing import List, Optional

from pydantic import BaseModel  # assumed: the data models are pydantic models


class DocumentIntent(BaseModel):
    documentId: str
    intents: List[str]  # ["extract", "render", "reference"]
    extractionPrompt: Optional[str] = None
    # Note: outputFormat and language are NOT here - determined during
    # structure generation (Phase 3) in the chapter structure JSON
    reasoning: str
```

**Rationale**:
- Intent clarification focuses on document purpose (extract, render, reference)
- Output format and language are determined later during structure generation (Phase 3)
- Structure generation has full context (user prompt, ContentParts, chapters) to determine format/language

---

### Step 2: Update Intent Analysis Prompt

**File**: `gateway/modules/services/serviceAi/subDocumentIntents.py`

**Changes**:

1. **Add fencing around userPrompt** (Security Fix):
```python
def _buildIntentAnalysisPrompt(
    self,
    userPrompt: str,
    documents: List[ChatDocument],
    actionParameters: Dict[str, Any]
) -> str:
    # FENCE user input to prevent prompt injection
    fencedUserPrompt = f"""```user_request
{userPrompt}
```"""

    # ... existing code that builds docListText from documents ...

    # Global fallback, shown to the AI for reference only
    # (assumed here to come from actionParameters)
    outputFormat = actionParameters.get("outputFormat", "txt")

    prompt = f"""USER REQUEST:
{fencedUserPrompt}

DOCUMENTS TO ANALYZE:
{docListText}

TASK: For each document, determine:
1. Intents (can be multiple): "extract", "render", "reference"
Note: Output format and language are NOT determined here - they will be
determined during structure generation (Phase 3) in the chapter structure JSON

OUTPUT FORMAT: {outputFormat} (global fallback - for reference only)

RETURN JSON:
{{
  "intents": [
    {{
      "documentId": "doc_1",
      "intents": ["extract"],
      "extractionPrompt": "Extract all text content",
      // Note: outputFormat and language are NOT here - determined during
      // structure generation in the chapter structure JSON
      "reasoning": "..."
    }}
  ]
}}
"""
    return prompt
```

2. **Demote the global outputFormat in the prompt to a fallback**:
- Output format is determined per document during structure generation (Phase 3), not during intent analysis
- The global format remains in the prompt as a reference-only fallback for documents without their own format
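
One caveat with a fixed three-backtick fence is that a user prompt containing its own backtick run could close the fence early. The sketch below shows one possible hardening, choosing a fence longer than any backtick run in the input. `fenceUserPrompt` is a hypothetical helper, not part of the existing code, and whether the target model honors longer fences is an assumption that would need checking.

```python
def fenceUserPrompt(userPrompt: str, label: str = "user_request") -> str:
    """Wrap user text in a backtick fence that the text itself cannot close."""
    # Find the longest run of consecutive backticks inside the user text.
    longestRun = 0
    currentRun = 0
    for char in userPrompt:
        currentRun = currentRun + 1 if char == "`" else 0
        longestRun = max(longestRun, currentRun)

    # Use a fence at least one backtick longer than that run (minimum 3).
    fence = "`" * max(3, longestRun + 1)
    return f"{fence}{label}\n{userPrompt}\n{fence}"
```

For input without backticks this yields the same three-backtick fence as the snippet above, so existing prompts would stay unchanged.
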

---

### Step 3: Update ContentPart Metadata Propagation

**File**: `gateway/modules/services/serviceAi/subContentExtraction.py`

**Changes**:
```python
async def extractAndPrepareContent(
    self,
    documents: List[ChatDocument],
    documentIntents: List[DocumentIntent],
    parentOperationId: str,
    getIntentForDocument: callable
) -> List[ContentPart]:
    # ... existing extraction logic ...

    # Note: outputFormat and language are NOT propagated here - they're determined
    # during structure generation (Phase 3) in the chapter structure JSON
    # ContentParts are created with intent information only
```

**Rationale**:
- ContentParts carry intent and extraction information only
- Output format and language are determined during structure generation (Phase 3)
- Structure generation has full context to make format/language decisions

---

### Step 4: Update Structure Generation

**File**: `gateway/modules/services/serviceAi/subStructureGeneration.py`

#### Global Format Source Chain

**Note**: `outputFormat` parameter is **optional**. If omitted, formats are determined from user prompt by AI.

**If outputFormat provided**:
1. Action parameters: `action_parameters.get("outputFormat")` or `action_parameters.get("resultType")`
2. Passed to `callAiContent(outputFormat=...)` → `generateStructure(outputFormat=...)` as parameter
3. Used as fallback in State 3 validation if AI doesn't return format per document
4. Final fallback: "txt" if global format is also missing/invalid

**If outputFormat omitted**:
1. AI determines formats per document from user prompt
2. Validation fallback: "txt" (if AI doesn't return format per document)

**Rationale**: With per-document format determination, AI can determine different formats for different documents based on user prompt. The `outputFormat` parameter is primarily a fallback for validation, not a requirement.
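
The fallback order described above (per-document format, then global format, then "txt") can be sketched as follows. This is an illustration only: `resolveDocumentFormat` is a hypothetical helper, and `supportedFormats` stands in for a lookup against the renderer registry, which is an assumption of this example.

```python
from typing import Any, Dict, Optional, Set


def resolveDocumentFormat(
    doc: Dict[str, Any],
    globalFormat: Optional[str],
    supportedFormats: Set[str],
) -> str:
    """Pick the first usable format: per-document, then global, then "txt"."""
    for candidate in (doc.get("outputFormat"), globalFormat):
        if isinstance(candidate, str):
            # Normalize values such as ".PDF" or " Html " before checking them.
            normalized = candidate.strip().lstrip(".").lower()
            if normalized in supportedFormats:
                return normalized
    return "txt"  # final fallback when nothing valid was provided
```

Keeping the chain in one place would let State 3 validation and rendering agree on the same result for a given document.
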

#### Language Source Chain

**Note**: `currentUserLanguage` is always valid (validated during user intention analysis).

1. AI determines per-document language in structure JSON response
2. If AI doesn't return language: Use validated `currentUserLanguage` (always valid, validated during user intention analysis)
3. `currentUserLanguage` validation ensures:
   - AI response `detectedLanguage` is validated (2-character ISO code)
   - If AI didn't return language or invalid → uses user language (`self.services.user.language`)
   - If user language not set → uses "en"
   - Always safe to use directly without fallback logic
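
The validation chain in item 3 can be pictured with a small sketch. It is not the code in `workflowManager.py`; `resolveLanguage` is a hypothetical name, and the check only tests the two-letter shape of the code rather than membership in the ISO 639-1 list.

```python
import re
from typing import Optional

# Shape check only: two ASCII letters. Membership in ISO 639-1 is not verified.
_TWO_LETTER_CODE = re.compile(r"^[a-z]{2}$")


def resolveLanguage(
    detectedLanguage: Optional[str],
    userLanguage: Optional[str],
) -> str:
    """Return the first well-formed code: AI-detected, then user setting, then "en"."""
    for candidate in (detectedLanguage, userLanguage):
        if isinstance(candidate, str):
            normalized = candidate.strip().lower()
            if _TWO_LETTER_CODE.match(normalized):
                return normalized
    return "en"
```

Because the function always returns a two-letter string, callers downstream can use the result directly, which mirrors the "always valid" guarantee stated above.
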

**Changes**:

1. **Make outputFormat optional in generateStructure method signature**:
```python
async def generateStructure(
    self,
    userPrompt: str,
    contentParts: List[ContentPart],
    outputFormat: Optional[str] = None,  # ← Optional: if omitted, formats determined from prompt by AI
    *,
    parentOperationId: str  # keyword-only, so it may follow the defaulted outputFormat
) -> Dict[str, Any]:
    """
    Generate document structure with per-document format determination.

    Multiple documents can be produced with different formats (e.g., one PDF, one HTML).
    AI determines formats per-document from user prompt. The outputFormat parameter is
    only a validation fallback - used if AI doesn't return format per document.

    Args:
        outputFormat: Optional global format fallback. If omitted, formats are determined
                      from user prompt by AI. Used as validation fallback if AI doesn't
                      return format per document. Defaults to "txt" if not provided.
    """
    # If outputFormat not provided, use "txt" as fallback for validation
    # AI will determine formats per document from user prompt
    if not outputFormat:
        outputFormat = "txt"
        logger.debug("outputFormat not provided - using 'txt' as validation fallback, formats determined from prompt")

    # Group ContentParts by documentId (for context in prompt)
    partsByDocument = {}
    for part in contentParts:
        docId = part.metadata.get("documentId", "default")
        partsByDocument.setdefault(docId, []).append(part)

    # AI determines per-document format and language in structure JSON response
    # Pass global fallback for AI to use if not specified per document
    prompt = self._buildChapterStructurePrompt(
        userPrompt=userPrompt,
        contentParts=contentParts,
        outputFormat=outputFormat  # Fallback for validation (AI determines formats from prompt)
    )
```

**Note**:
- `outputFormat` is **optional**. If omitted, formats are determined from user prompt by AI.
- Used as validation fallback if AI doesn't return format per document.
- In the signature as originally written, the required `parentOperationId` followed the defaulted `outputFormat`, which Python rejects; it is therefore shown keyword-only above, so callers must pass `parentOperationId=...`.
- User prompt language comes from `self.services.currentUserLanguage` which is validated during user intention analysis (`workflowManager._sendFirstMessage()`). The validation ensures:
  - AI response `detectedLanguage` is validated (2-character ISO code)
  - If AI didn't return language or invalid → uses user language (`self.services.user.language`)
  - If user language not set → uses "en"
  - `currentUserLanguage` is always valid and safe to use directly without fallback logic

2. **Update prompt to clarify format determination from prompt**:
```python
def _buildChapterStructurePrompt(
    self,
    userPrompt: str,
    contentParts: List[ContentPart],
    outputFormat: str  # Global fallback (for validation only)
) -> str:
    # Get language from services (validated currentUserLanguage infrastructure)
    language = self._getUserLanguage()  # Uses self.services.currentUserLanguage (always valid)

    # ... existing prompt building (defines prompt) ...

    prompt += f"""
## OUTPUT FORMAT (per document)
- Each document can have its own output format (pdf, docx, html, etc.)
- **Determine the format for each document from the USER REQUEST above**
- Multiple documents can have different formats (e.g., one PDF, one HTML)
- Analyze user prompt to identify format requirements:
  * Explicit format mentions (e.g., "as PDF", "in Excel", "HTML document")
  * Document purpose (e.g., "spreadsheet" → xlsx, "presentation" → pptx)
  * Content type requirements
- If format cannot be determined from prompt, use fallback: "{outputFormat}" (for validation only)
- Include "outputFormat" field in each document in the JSON structure
- **CRITICAL**: Formats are determined from user prompt, not from the fallback value

## DOCUMENT LANGUAGE (per document)
- Each document can have its own language (ISO 639-1 code: "de", "en", "fr", etc.)
- Determine the language for each document based on:
  * User prompt language/context
  * Document content context
  * User's explicit language requirements
- If not specified, use validated currentUserLanguage: "{language}" (always valid, validated during user intention analysis)
- Include "language" field in each document in the JSON structure

EXAMPLE JSON STRUCTURE:
{{
  "documents": [
    {{
      "id": "doc_1",
      "title": "Document Title",
      "outputFormat": "pdf",  // ← Determined by AI from user prompt
      "language": "de",  // ← Determined by AI from user prompt
      "chapters": [...]
    }},
    {{
      "id": "doc_2",
      "title": "Another Document",
      "outputFormat": "html",  // ← Different format for different document
      "language": "en",  // ← Different language for different document
      "chapters": [...]
    }}
  ]
}}
"""
    return prompt
```

---

### Step 5: Update Structure Filling - Two Prompt Types

**File**: `gateway/modules/services/serviceAi/subStructureFilling.py`

**Changes**:

1. **Ensure two prompt types are used** (already implemented, verify):
```python
async def _fillSingleSection(
    self,
    section: Dict[str, Any],
    contentParts: List[ContentPart],
    userPrompt: str,
    generationHint: str,
    document: Dict[str, Any],  # ← NEW: Need document to get per-document language
    # ... other params ...
) -> List[Dict[str, Any]]:
    # Extract per-document language from structure
    # Language MUST be defined in structure (validated in State 3)
    # If missing, this is an error - should not happen after State 3 validation
    if "language" not in document:
        raise ValueError(f"Document {document.get('id')} missing 'language' field - should have been set in Phase 3 validation")

    docLanguage = document["language"]

    # Validate language format (should be 2-character ISO code)
    if not isinstance(docLanguage, str) or len(docLanguage) != 2:
        raise ValueError(f"Document {document.get('id')} has invalid language format: {docLanguage} - should be 2-character ISO 639-1 code")

    contentPartIds = section.get("contentPartIds", [])
    hasContentParts = len(contentPartIds) > 0

    if hasContentParts:
        # PROMPT TYPE 1: WITH CONTENT (Aggregation)
        # ContentParts passed as parameters, not in prompt text
        isAggregation = True
        relevantParts = [p for p in contentParts if p.id in contentPartIds]

        generationPrompt = self._buildSectionGenerationPrompt(
            section=section,
            contentParts=relevantParts,  # Passed as parameters
            userPrompt=userPrompt,
            generationHint=generationHint,
            isAggregation=True,  # ← Key flag
            language=docLanguage  # ← Per-document language from structure
        )
    else:
        # PROMPT TYPE 2: WITHOUT CONTENT (Generation)
        # Only generationHint in prompt, no ContentParts
        isAggregation = False

        generationPrompt = self._buildSectionGenerationPrompt(
            section=section,
            contentParts=[],  # Empty
            userPrompt=userPrompt,
            generationHint=generationHint,
            isAggregation=False,  # ← Key flag
            language=docLanguage  # ← Per-document language from structure
        )
```

**Note**: Language comes from the document in the structure (per-document), not a global parameter. Each document can have its own language as determined in Phase 3. The language MUST be defined and validated in Phase 3 (State 3 validation) - if missing here, it's an error.

2. **Verify `_buildSectionGenerationPrompt` handles both cases**:
```python
def _buildSectionGenerationPrompt(
    self,
    section: Dict[str, Any],
    contentParts: List[ContentPart],
    userPrompt: str,
    generationHint: str,
    isAggregation: bool,  # ← Determines prompt type
    language: str
) -> str:
    # Assumption: the section title is stored under the "title" key
    sectionTitle = section.get("title", "")

    if isAggregation:
        # TYPE 1: WITH CONTENT
        # ContentParts are passed as parameters to AI call
        # Don't include full content in prompt text (token efficiency)
        prompt = f"""Generate content for section based on provided ContentParts.

Section: {sectionTitle}
Generation Hint: {generationHint}
Language: {language}

ContentParts are provided as parameters (not shown in prompt for efficiency).
Use the ContentParts data to generate the section content.
"""
    else:
        # TYPE 2: WITHOUT CONTENT
        # Only generationHint, no ContentParts
        prompt = f"""Generate content for section based on generation hint.

Section: {sectionTitle}
Generation Hint: {generationHint}
Language: {language}

Generate content based on the generation hint without referencing external content.
"""
    return prompt
```

**Rationale**:
- **Type 1 (with content)**: Efficient for large content (ContentParts as parameters)
- **Type 2 (without content)**: Simple generation based on hint only
- Already implemented via `isAggregation` flag, verify it's used correctly

---

### Step 6: Update Document Rendering

**File**: `gateway/modules/services/serviceAi/mainServiceAi.py` (renderResult method)
**File**: `gateway/modules/services/serviceGeneration/mainServiceGeneration.py` (renderReport method)

**Current Implementation**:
- `renderResult()` calls `generationService.renderReport()`
- `renderReport()` already processes each document separately (line 385)
- Currently checks `doc.get("format", outputFormat)` (line 397) - but should check `outputFormat` field
- Language is not handled per-document

**Changes**:

1. **Update renderResult to pass language (from structure, validated before rendering)**:
```python
async def renderResult(
    self,
    filledStructure: Dict[str, Any],
    outputFormat: str,  # Global fallback
    language: str,  # ← NEW: Add language parameter (global fallback)
    title: str,
    userPrompt: str,
    parentOperationId: str
) -> List[RenderedDocument]:
    """
    Render filled structure to documents.

    Per-document format and language are extracted from structure (validated in State 3).
    The outputFormat and language parameters are only used as global fallbacks.
    Multiple documents can have different formats and languages.
    """
    # Language comes from structure (per-document), validated in State 3
    # This parameter is only used as global fallback if structure validation fails
    # Use validated currentUserLanguage as fallback (always valid)
    if not language:
        language = self._getUserLanguage()  # Uses validated currentUserLanguage infrastructure

    # ... existing code ...

    renderedDocuments = await generationService.renderReport(
        filledStructure,
        outputFormat,
        language,  # ← Pass language (global fallback, per-document extracted in renderReport)
        title,
        userPrompt,
        self,
        parentOperationId=renderOperationId
    )
```

**Note**:
- Language comes from structure (per-document) as determined in Phase 3
- The `language` parameter here is only used as a global fallback
- Per-document language is validated in State 3 (Structure Generation) and extracted from structure in `renderReport()`
- Uses validated `currentUserLanguage` infrastructure if fallback needed

2. **Update renderReport to handle per-document format and language**:
```python
async def renderReport(
    self,
    extractedContent: Dict[str, Any],
    outputFormat: str,  # Global fallback
    language: str,  # ← NEW: Add language parameter (global fallback)
    title: str,
    userPrompt: Optional[str] = None,
    aiService=None,
    parentOperationId: Optional[str] = None
) -> List[RenderedDocument]:
    # ... existing validation ...

    # Process EACH document separately
    for docIndex, doc in enumerate(documents):
        # ... existing validation ...

        # Determine format for this document
        # Check outputFormat field first (per-document), then format field (legacy), then global fallback
        docFormat = doc.get("outputFormat") or doc.get("format") or outputFormat

        # Determine language for this document
        # Extract per-document language from structure (validated in State 3), fallback to global
        docLanguage = doc.get("language") or language

        # Validate language format (should be 2-character ISO code, validated in State 3)
        if not isinstance(docLanguage, str) or len(docLanguage) != 2:
            logger.warning(f"Document {doc.get('id')} has invalid language format: {docLanguage}, using fallback")
            docLanguage = language  # Use global fallback

        # Get renderer for this document's format (uses existing renderer registry)
        renderer = self._getFormatRenderer(docFormat)
        if not renderer:
            logger.warning(f"Unsupported format '{docFormat}' for document {doc.get('id', docIndex)}, skipping")
            continue

        # Create JSON structure with single document (preserving metadata)
        singleDocContent = {
            "metadata": {**metadata, "language": docLanguage},  # ← Add per-document language to metadata
            "documents": [doc]
        }

        # Render this document (can return multiple files, e.g., HTML + images)
        renderedDocs = await renderer.render(singleDocContent, docTitle, userPrompt, aiService)
        allRenderedDocuments.extend(renderedDocs)
```

**Note**:
- Per-document format and language are extracted from structure (validated in State 3)
- Renderers (`RendererPdf`, `RendererHtml`, etc.) receive the structure with language in metadata
- They can use it for language-specific formatting if needed
- Multiple documents can have different formats and languages

---

### Step 7: Update ai.process to Pass documentList and Make outputFormat Optional

**File**: `gateway/modules/workflows/methods/methodAi/actions/process.py`

**Changes**:
```python
# Phase 7.3: Pass both documentList and contentParts to AI service
# (Remove extraction logic from here - handled by AI service)

# resultType is optional - if omitted, formats determined from prompt by AI
# Default "txt" is validation fallback only
resultType = parameters.get("resultType")  # Optional: if None, formats determined from prompt
if resultType:
    # Normalize values such as ".PDF" to "pdf"; fall back to "txt" if nothing remains
    normalized_result_type = (str(resultType).strip().lstrip('.').lower() or "txt")
    output_format = normalized_result_type
else:
    # No format specified - AI will determine formats from prompt
    output_format = None
    logger.debug("resultType not provided - formats will be determined from prompt by AI")

# Use unified callAiContent method with BOTH parameters
aiResponse = await self.services.ai.callAiContent(
    prompt=aiPrompt,
    options=options,
    documentList=documentList,  # ← PASS documentList (was missing)
    contentParts=contentParts,  # ← PASS contentParts
    outputFormat=output_format,  # ← Optional: if None, formats determined from prompt
    parentOperationId=operationId,
    generationIntent=generationIntent
)
```

**Note**:
- `resultType` parameter is **optional**. If omitted, formats are determined from user prompt by AI.
- If `resultType` is omitted or normalizes to an empty string, "txt" serves only as the validation fallback.
- Language detection from user prompt is already done and validated. `self.services.currentUserLanguage` is always valid (validated during user intention analysis in `workflowManager._sendFirstMessage()`).

---

## 3. Handover State Definitions and Validation

**Purpose**: These state definitions document the expected structure and validation rules at each phase boundary.

**Implementation Approach**:
- **Inline validation** in each phase method
- **Auto-fix** where possible (use defaults, skip invalid items)
- **Stop with error** for critical structural issues
- **Log warnings** for skipped items

**See**: Appendix "Validation Failure Handling Decisions" below for detailed Q&A on each validation

**Summary of Validation Decisions**:
- **State 1**: Skip intents for unknown documents; documents without intents are OK
- **State 2**: Skip ContentParts with missing/invalid metadata (with warnings)
- **State 3**: Auto-fix format/language with fallbacks; error on missing structure fields
- **State 4**: Auto-fix missing elements field; allow empty elements
- **State 5**: Skip empty documents; infer mimeType from filename
|
|
|
|
### State 1: After Intent Clarification
|
|
|
|
**Location**: `gateway/modules/services/serviceAi/subDocumentIntents.py` - After `clarifyDocumentIntents()` returns (line 115)
|
|
|
|
**Expected State**:
|
|
```python
documentIntents: List[DocumentIntent]  # Complete intent analysis
documents: List[ChatDocument]          # Resolved documents
preExtractedMapping: Dict[str, str]    # Map[originalDocId, jsonDocId]
```
|
|
|
|
**Implementation Code** (add after line 115, before return):
|
|
```python
# Validation and auto-fix
documentIds = {d.id for d in documents}
validatedIntents = []

for intent in documentIntents:
    # Validation 1.2: Skip intents for unknown documents
    if intent.documentId not in documentIds:
        logger.warning(f"Skipping intent for unknown document: {intent.documentId}")
        continue
    validatedIntents.append(intent)

# Validation 1.1: Documents without intents are OK (not needed)
# Intents for non-existing documents are already filtered above
documentIntents = validatedIntents
```
|
|
|
|
### State 2: After Content Extraction
|
|
|
|
**Location**: `gateway/modules/services/serviceAi/subContentExtraction.py` - After `extractAndPrepareContent()` returns (at end of method, before return)
|
|
|
|
**Expected State**:
|
|
```python
finalContentParts: List[ContentPart]  # All content parts ready
```
|
|
|
|
**Implementation Code** (add at end of method, before return):
|
|
```python
# Validation and auto-fix
validatedParts = []
for part in finalContentParts:
    # Validation 2.1: Skip ContentParts without documentId
    if not part.metadata.get("documentId"):
        logger.warning(f"Skipping ContentPart {part.id} - missing documentId in metadata")
        continue

    # Validation 2.2: Skip ContentParts with invalid contentFormat
    contentFormat = part.metadata.get("contentFormat")
    if contentFormat not in ["extracted", "object", "reference"]:
        logger.warning(
            f"Skipping ContentPart {part.id} - invalid contentFormat: {contentFormat}"
        )
        continue

    validatedParts.append(part)

return validatedParts
```
|
|
|
|
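
For unit testing, the State 2 checks can be exercised without the real service objects. The following is a minimal sketch under stated assumptions: the `ContentPart` dataclass is a hypothetical stand-in for the real model (only `id` and `metadata` are modelled), and `validate_content_parts` is an illustrative name, not an existing function.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, List

logger = logging.getLogger(__name__)

VALID_CONTENT_FORMATS = {"extracted", "object", "reference"}


@dataclass
class ContentPart:  # hypothetical stand-in for the real ContentPart model
    id: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def validate_content_parts(parts: List[ContentPart]) -> List[ContentPart]:
    """Keep parts with a documentId and a known contentFormat; log and skip the rest."""
    validated = []
    for part in parts:
        metadata = part.metadata or {}
        if not metadata.get("documentId"):
            logger.warning("Skipping ContentPart %s - missing documentId", part.id)
            continue
        content_format = metadata.get("contentFormat")
        if content_format not in VALID_CONTENT_FORMATS:
            logger.warning(
                "Skipping ContentPart %s - invalid contentFormat: %r",
                part.id, content_format,
            )
            continue
        validated.append(part)
    return validated
```

The sketch also tolerates `metadata` being `None`, which the inline version above does not guard against.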
### State 3: After Structure Generation
|
|
|
|
**Location**: `gateway/modules/services/serviceAi/subStructureGeneration.py` - After `generateStructure()` returns (after parsing JSON, before return, around line 182)
|
|
|
|
**Expected State**:
|
|
```python
chapterStructure: Dict[str, Any]  # Complete structure with documents, chapters, outputFormat, language
```
|
|
|
|
**Implementation Code** (add after structure JSON is parsed, before return):
|
|
```python
# After structure JSON is parsed (around line 182)
# Validation and auto-fix

# Validation 3.1: Structure missing 'documents' field
if "documents" not in structure:
    raise ValueError("Structure missing 'documents' field - cannot auto-fix")

documents = structure["documents"]

# Validation 3.2: Structure has no documents
if not isinstance(documents, list) or len(documents) == 0:
    raise ValueError("Structure has no documents - cannot generate without documents")

# Import renderer registry for format validation (existing infrastructure)
from modules.services.serviceGeneration.renderers.registry import getRenderer

# outputFormat parameter is optional - if omitted, formats determined from prompt by AI.
# It is used as fallback only if AI doesn't return a format for a document.
globalFormatFallback = outputFormat or "txt"

# Validated currentUserLanguage (always valid, validated during user intention analysis),
# accessed via _getUserLanguage() which uses self.services.currentUserLanguage
userPromptLanguage = self._getUserLanguage()

# Validate and fix each document
for doc in documents:
    # Validation 3.3 & 3.4: Document outputFormat
    # Multiple documents can have different formats (e.g., one PDF, one HTML)
    if "outputFormat" not in doc or not doc["outputFormat"]:
        # AI didn't return format or returned empty - use global fallback
        doc["outputFormat"] = globalFormatFallback
        logger.info(f"Document {doc.get('id')} missing outputFormat - using fallback: {doc['outputFormat']}")
    else:
        # AI returned format - validate using existing renderer registry
        formatName = str(doc["outputFormat"]).lower().strip()
        renderer = getRenderer(formatName)  # Uses existing infrastructure

        if not renderer:
            # Format doesn't match any renderer - use txt (simple approach)
            logger.warning(f"Document {doc.get('id')} has format without renderer: {formatName}, using 'txt'")
            doc["outputFormat"] = "txt"
        else:
            # Valid format with renderer - normalize and keep AI result
            doc["outputFormat"] = formatName
            logger.debug(f"Document {doc.get('id')} using AI-determined format: {formatName}")

    # Validation 3.5 & 3.6: Document language
    # Normalize first, then validate, so that values such as " EN " are accepted
    aiLanguage = doc.get("language")
    normalizedLanguage = aiLanguage.strip().lower() if isinstance(aiLanguage, str) else ""

    if len(normalizedLanguage) != 2:
        # AI didn't return language or invalid format - use validated currentUserLanguage.
        # Log before overwriting so the message reports the original value.
        if "language" not in doc:
            logger.info(f"Document {doc.get('id')} missing language - using currentUserLanguage: {userPromptLanguage}")
        else:
            logger.warning(f"Document {doc.get('id')} has invalid language format from AI: {aiLanguage}, using currentUserLanguage")
        doc["language"] = userPromptLanguage
    else:
        # AI returned valid language format - keep normalized value
        doc["language"] = normalizedLanguage
        logger.debug(f"Document {doc.get('id')} using AI-determined language: {doc['language']}")

    # Validation 3.7: Document missing 'chapters' field
    if "chapters" not in doc:
        raise ValueError(f"Document {doc.get('id')} missing 'chapters' field - cannot auto-fix")

    # Validation 3.8: Chapter missing 'contentParts' field
    for chapter in doc["chapters"]:
        if "contentParts" not in chapter:
            raise ValueError(f"Chapter {chapter.get('id')} missing 'contentParts' field - cannot auto-fix")

return structure
```
|
|
|
|
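
The State 3 format and language fallbacks can be checked in isolation by injecting the renderer lookup. This is a sketch, not the actual implementation: `normalize_document`, `fake_get_renderer`, and the `_RENDERERS` table are hypothetical names used for illustration, and the stub only assumes that the real `getRenderer()` returns `None` for unknown formats, as stated in the infrastructure notes.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger(__name__)


def normalize_document(
    doc: Dict[str, Any],
    get_renderer: Callable[[str], Optional[object]],
    fallback_format: Optional[str],
    user_language: str,
) -> Dict[str, Any]:
    """Apply the State 3 format/language fallbacks to one document dict, in place."""
    raw_format = str(doc.get("outputFormat") or "").strip().lower()
    if not raw_format:
        # Missing or empty: use the global fallback, or "txt" if none was given
        doc["outputFormat"] = fallback_format or "txt"
    elif get_renderer(raw_format) is None:
        logger.warning("No renderer for %r, using 'txt'", raw_format)
        doc["outputFormat"] = "txt"
    else:
        doc["outputFormat"] = raw_format

    raw_language = doc.get("language")
    language = raw_language.strip().lower() if isinstance(raw_language, str) else ""
    # Anything that is not a 2-character code falls back to the user language
    doc["language"] = language if len(language) == 2 else user_language
    return doc


# Hypothetical registry stub: only these formats "have a renderer" in this sketch.
_RENDERERS = {"pdf": object(), "html": object(), "txt": object()}


def fake_get_renderer(name: str) -> Optional[object]:
    return _RENDERERS.get(name)
```

Passing the lookup as a parameter keeps the sketch independent of the real registry import, so the Phase 8 format-validation tests could run without loading any renderers.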
### State 4: After Structure Filling
|
|
|
|
**Location**: `gateway/modules/services/serviceAi/subStructureFilling.py` - After `fillStructure()` returns (at end of method, before return, around line 204)
|
|
|
|
**Expected State**:
|
|
```python
filledStructure: Dict[str, Any]  # Complete content with elements
```
|
|
|
|
**Implementation Code** (add at end of method, before return):
|
|
```python
# Validation and auto-fix

# Validation 4.1: Filled structure missing 'documents' field
if "documents" not in filledStructure:
    raise ValueError("Filled structure missing 'documents' field - cannot auto-fix")

for doc in filledStructure["documents"]:
    # Validation 4.4: Verify language is preserved from input structure
    # Language MUST be preserved from Phase 3 structure (validated in State 3)
    if "language" not in doc:
        raise ValueError(f"Document {doc.get('id')} missing language in filled structure - should have been preserved from Phase 3")

    # Validate language format
    if not isinstance(doc["language"], str) or len(doc["language"]) != 2:
        raise ValueError(f"Document {doc.get('id')} has invalid language format in filled structure: {doc['language']} - should be 2-character ISO 639-1 code")

    for chapter in doc.get("chapters", []):
        for section in chapter.get("sections", []):
            # Validation 4.2: Section missing 'elements' field
            if "elements" not in section:
                section["elements"] = []
                logger.info(f"Section {section.get('id')} missing 'elements' - created empty list")

            # Validation 4.3: Section has empty elements list - ALLOW (intentionally empty is OK)
            # No action needed - empty elements are allowed

return filledStructure
```
|
|
|
|
### State 5: After Document Rendering
|
|
|
|
**Location**: `gateway/modules/services/serviceGeneration/paths/documentPath.py` - After `renderResult()` returns (called at line 151); add the validation after line 157, before building `documentDataList`
|
|
|
|
**Expected State**:
|
|
```python
renderedDocuments: List[RenderedDocument]  # Final output
```
|
|
|
|
**Implementation Code** (add after line 157, before building documentDataList):
|
|
```python
# Validation 5.1: Already implemented at lines 175-176
if not renderedDocuments:
    raise ValueError("No documents were rendered")

# Import once, before the loop
from modules.services.serviceGeneration.subDocumentUtility import getMimeTypeFromExtension

# Validation 5.2 & 5.3: Validate and filter rendered documents
validatedRenderedDocs = []
for doc in renderedDocuments:
    # Validation 5.2: Skip documents with empty documentData
    if not doc.documentData:
        logger.warning(f"Skipping rendered document {doc.filename} - empty documentData")
        continue

    # Validation 5.3: Infer mimeType from filename if missing
    if not doc.mimeType:
        if doc.filename:
            inferredMimeType = getMimeTypeFromExtension(doc.filename)
            if inferredMimeType:
                doc.mimeType = inferredMimeType
                logger.info(f"Inferred mimeType '{inferredMimeType}' from filename '{doc.filename}'")
            else:
                logger.warning(f"Could not infer mimeType from filename '{doc.filename}' - keeping as None")
        else:
            logger.warning("Rendered document missing mimeType and filename - cannot infer")

    validatedRenderedDocs.append(doc)

# Use validated list
renderedDocuments = validatedRenderedDocs

# Re-check after filtering
if not renderedDocuments:
    raise ValueError("No valid documents after validation")
```
|
|
|
|
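
The mimeType inference in Validation 5.3 depends on `getMimeTypeFromExtension`, whose behaviour is not shown in this plan. As a rough sketch of the same idea, assuming nothing about that helper, the standard library's `mimetypes` module can stand in for it in tests; `infer_mime_type` is a hypothetical name.

```python
import mimetypes
from typing import Optional


def infer_mime_type(filename: Optional[str]) -> Optional[str]:
    """Guess a MIME type from the filename extension; None when it cannot be inferred."""
    if not filename:
        return None
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type
```

The `None` result corresponds to the "keeping as None" branch above, where the document is still kept and only a warning is logged.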
---
|
|
|
|
## 4. Migration Checklist
|
|
|
|
### Phase 1: Model Updates
|
|
- [ ] Verify `DocumentIntent` model does NOT include `outputFormat` or `language`
|
|
- [ ] Intent clarification focuses only on document purpose (intents, extractionPrompt)
|
|
- [ ] Note: outputFormat and language are determined during structure generation (Phase 3)
|
|
|
|
### Phase 2: Intent Analysis Updates
|
|
- [ ] **CRITICAL**: Add fencing around `userPrompt` in intent analysis prompt
|
|
- [ ] Fence user input with code blocks: ```user_request\n{userPrompt}\n```
|
|
- [ ] Test with various user inputs (special chars, JSON, newlines, prompt injection attempts)
|
|
- [ ] Update prompt to focus only on document intents (extract, render, reference)
|
|
- [ ] Remove any outputFormat/language determination from intent analysis prompt
|
|
- [ ] Keep global outputFormat/language as reference only (not for determination)
|
|
- [ ] **Verify intent mapping logic** (already implemented in `clarifyDocumentIntents`):
|
|
- [ ] Step 1: Map pre-extracted JSONs to original documents (lines 63-83)
|
|
- [ ] Step 2: AI analyzes intents for original documents (line 86)
|
|
- [ ] Step 3: Map intents back to JSON doc IDs (lines 96-104)
|
|
- [ ] Test with pre-extracted JSONs to verify mapping works correctly
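
The fencing item at the top of this checklist can be sketched as follows. This is an illustration under assumptions, not the project's prompt builder: `fence_user_prompt` is a hypothetical helper, and the `user_request` label is taken from the checklist item. The sketch picks a fence longer than any backtick run in the input so that user text cannot close the fence early.

```python
import re


def fence_user_prompt(user_prompt: str, label: str = "user_request") -> str:
    """Wrap untrusted text in a backtick fence longer than any backtick run it contains."""
    longest_run = max((len(run) for run in re.findall(r"`+", user_prompt)), default=0)
    # At least three backticks, and always one more than the longest run in the input
    fence = "`" * max(3, longest_run + 1)
    return f"{fence}{label}\n{user_prompt}\n{fence}"
```

Fencing makes the boundary of user input explicit to the model but does not by itself rule out prompt injection, which is why the checklist also calls for tests with injection attempts.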
|
|
|
|
### Phase 3: Content Extraction Updates
|
|
- [ ] Verify ContentParts do NOT include outputFormat or language in metadata
|
|
- [ ] ContentParts carry only intent and extraction information
|
|
- [ ] Verify pre-extracted JSON handling preserves intent information
|
|
- [ ] **Add filtering to Data Extraction Path** (`_handleDataExtraction`):
|
|
**Current State (BEFORE filtering)**:
|
|
```python
# Line 708: Get documents directly from documentList
documents = self.services.chat.getChatDocumentsFromDocumentList(documentList)
# Line 721: Call extractAndPrepareContent() with ALL documents
preparedContentParts = await self.extractAndPrepareContent(documents, ...)
```
|
|
**Problem**: If `documentList` contains both:
|
|
- Original document: `original_pdf_123.pdf`
|
|
- Pre-extracted JSON: `pre_extracted_456.json` (contains ContentParts from `original_pdf_123.pdf`)
|
|
→ Both are processed → **DUPLICATE ContentParts created**
|
|
|
|
**How Filtering Works (Reference: `documentPath.py` lines 62-87)**:
|
|
|
|
**Step 1: Identify Pre-Extracted JSONs and Map to Originals**
|
|
```python
# Collect all original document IDs that are covered by pre-extracted JSONs
originalDocIdsCoveredByPreExtracted = set()
for doc in documents:
    preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc)
    if preExtracted:
        # Pre-extracted JSON found - get the original document ID it covers
        originalDocId = preExtracted["originalDocument"]["id"]
        originalDocIdsCoveredByPreExtracted.add(originalDocId)
```
|
|
**Result**: `originalDocIdsCoveredByPreExtracted = {"original_pdf_123"}` (if pre-extracted JSON covers it)
|
|
|
|
**Step 2: Filter Documents List**
|
|
```python
filteredDocuments = []
for doc in documents:
    preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc)
    if preExtracted:
        # Pre-extracted JSON - KEEP IT (will be processed as ContentParts)
        filteredDocuments.append(doc)
    elif doc.id in originalDocIdsCoveredByPreExtracted:
        # Original document covered by pre-extracted JSON - REMOVE IT
        logger.info(f"Skipping original document {doc.id} - already covered")
        # Do NOT append - skip this document
    else:
        # Regular document (not pre-extracted, not covered) - KEEP IT
        filteredDocuments.append(doc)

documents = filteredDocuments  # Use filtered list
```
|
|
**Result**:
|
|
- ✅ Pre-extracted JSON: `pre_extracted_456.json` → KEPT
|
|
- ❌ Original document: `original_pdf_123.pdf` → REMOVED (covered by pre-extracted JSON)
|
|
- ✅ Regular document: `other_doc.pdf` → KEPT (not covered)
|
|
|
|
**Step 3: Use Filtered Documents**
|
|
```python
# Now call extractAndPrepareContent() with filtered documents only
preparedContentParts = await self.extractAndPrepareContent(
    documents,  # Only pre-extracted JSONs + regular docs (no originals covered by JSONs)
    documentIntents or [],
    extractOperationId
)
```
|
|
**Result**: No duplicates - original documents already filtered out
|
|
|
|
**Implementation Steps**:
|
|
- [ ] Add filtering logic between line 708 (get documents) and line 710 (clarify intents)
|
|
- [ ] Copy filtering code from `documentPath.py` lines 62-87
|
|
- [ ] Adapt to use `self.intentAnalyzer.resolvePreExtractedDocument()` (same method)
|
|
- [ ] **Filtering Logic**:
|
|
```python
# Step 1: Identify all original document IDs covered by pre-extracted JSONs
originalDocIdsCoveredByPreExtracted = set()
for doc in documents:
    preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc)
    if preExtracted:
        originalDocId = preExtracted["originalDocument"]["id"]
        originalDocIdsCoveredByPreExtracted.add(originalDocId)
        logger.debug(f"Found pre-extracted JSON {doc.id} covering original document {originalDocId}")

# Step 2: Filter documents - remove originals covered by pre-extracted JSONs
filteredDocuments = []
for doc in documents:
    preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(doc)
    if preExtracted:
        filteredDocuments.append(doc)  # Keep pre-extracted JSON
    elif doc.id in originalDocIdsCoveredByPreExtracted:
        logger.info(f"Skipping original document {doc.id} ({doc.fileName}) - already covered by pre-extracted JSON")
    else:
        filteredDocuments.append(doc)  # Keep regular document

documents = filteredDocuments  # Use filtered list
```
|
|
- [ ] Test with scenario: original document + pre-extracted JSON → verify no duplicates
|
|
- [ ] **Remove redundant check from `extractAndPrepareContent()`**:
|
|
- [ ] Remove pre-extracted JSON check (line 77 in `subContentExtraction.py`)
|
|
- [ ] Trust that filtering is done upstream
|
|
- [ ] Cleaner code, single responsibility
|
|
- [ ] Test merging logic
|
|
- [ ] Test that both document generation and data extraction paths handle pre-extracted JSONs correctly
|
|
- [ ] Note: outputFormat and language are NOT propagated here - determined in structure generation
|
|
|
|
### Phase 4: Structure Generation Updates
|
|
- [ ] **Make outputFormat optional in generateStructure() method signature**:
|
|
- [ ] Update `subStructureGeneration.py` method signature (line 47): `outputFormat: Optional[str] = None`
|
|
- [ ] Update `mainServiceAi.py` wrapper method (line 444): Make `outputFormat` optional
|
|
- [ ] If `outputFormat` not provided, use "txt" as validation fallback (AI determines formats from prompt)
|
|
- [ ] Add logging: "outputFormat not provided - using 'txt' as validation fallback, formats determined from prompt"
|
|
- [ ] **Context**: `outputFormat` is only a validation fallback - AI determines per-document formats from user prompt. Multiple documents can have different formats (e.g., one PDF, one HTML).
|
|
- [ ] **Note on language handling**: Language is accessed via `self.services.currentUserLanguage` (always valid, validated during user intention analysis). No language parameter needed in `generateStructure()` method signature - language is accessed directly from services within the method.
|
|
- [ ] Verify `currentUserLanguage` is used correctly in `subStructureGeneration.py` (via `self.services.currentUserLanguage`)
|
|
- [ ] Verify `currentUserLanguage` is used correctly in prompt building (via `self.services.currentUserLanguage`)
|
|
- [ ] Note: `mainServiceGeneration.py` uses different service - verify if update needed
|
|
- [ ] Group ContentParts by documentId (for context in prompt)
|
|
- [ ] Update `_buildChapterStructurePrompt()` to access language via `self.services.currentUserLanguage` (no parameter needed)
|
|
- [ ] Update structure generation prompt to ask AI to determine per-document outputFormat
|
|
- [ ] Explicitly require `outputFormat` field in each document JSON structure
|
|
- [ ] Update example structure to show `outputFormat` field (not just filename)
|
|
- [ ] Clarify that multiple documents can have different formats
|
|
- [ ] Update structure generation prompt to ask AI to determine per-document language
|
|
- [ ] Explicitly require `language` field in each document JSON structure
|
|
- [ ] Clarify that multiple documents can have different languages
|
|
- [ ] Provide global fallbacks (outputFormat, language) for AI to use if not specified
|
|
- [ ] `outputFormat` fallback: from parameter or "txt"
|
|
- [ ] `language` fallback: use `self._getUserLanguage()` (validated currentUserLanguage infrastructure)
|
|
- [ ] **Parse and validate format/language from AI response**:
|
|
- [ ] Extract `outputFormat` and `language` from each document in structure JSON
|
|
- [ ] **Format validation (use existing renderer registry infrastructure)**:
|
|
- [ ] Import: `from modules.services.serviceGeneration.renderers.registry import getRenderer`
|
|
- [ ] If `outputFormat` missing or empty → use global fallback (`outputFormat` or "txt")
|
|
- [ ] If `outputFormat` exists → check if it has a renderer using `getRenderer(formatName)` (existing infrastructure)
|
|
- [ ] Normalize format name: `formatName.lower().strip()`
|
|
- [ ] If format doesn't match any renderer → use "txt" (simple approach, no global fallback attempt)
|
|
- [ ] Log warnings for invalid formats
|
|
- [ ] **Note**: Infrastructure exists at `mainServiceGeneration.py:529` - reuse `getRenderer()` function
|
|
- [ ] **Language validation (use existing validated infrastructure)**:
|
|
- [ ] Validate language (must be 2-character ISO 639-1 code)
|
|
- [ ] **If language missing**: Set to `self._getUserLanguage()` which uses validated `currentUserLanguage` (always valid, validated during user intention analysis at `workflowManager.py:695-727`)
|
|
- [ ] **If language invalid format**: Use `self._getUserLanguage()` (always valid)
|
|
- [ ] Normalize language: `language.lower().strip()[:2]`
|
|
- [ ] Log warnings for invalid/missing values
|
|
- [ ] **Note**: `currentUserLanguage` is always valid - safe to use directly via `_getUserLanguage()` method
|
|
- [ ] **Error handling**:
|
|
- [ ] If structure JSON is malformed → raise error with details
|
|
- [ ] If no documents in structure → raise error
|
|
- [ ] If AI doesn't return format → use global `outputFormat` fallback (or "txt" if not provided), log warning
|
|
- [ ] If AI doesn't return language → use validated `currentUserLanguage` (always valid), log warning
|
|
- [ ] Verify structure output includes per-document format and language (from AI in JSON response)
|
|
|
|
### Phase 5: Structure Filling Verification
|
|
- [ ] Verify two prompt types are correctly used:
|
|
- [ ] `isAggregation=True`: ContentParts as parameters
|
|
- [ ] `isAggregation=False`: Only generationHint
|
|
- [ ] **Verify per-document language is extracted and used**:
|
|
- [ ] Language MUST be defined in structure (validated in State 3)
|
|
- [ ] Language extracted from document in structure (per-document) - NO fallback to "en"
|
|
- [ ] If language missing: Raise error (should not happen after State 3 validation)
|
|
- [ ] If language invalid format: Raise error (should not happen after State 3 validation)
|
|
- [ ] Language passed to `_buildSectionGenerationPrompt()` for each section
|
|
- [ ] Language preserved in filled structure (State 4 validation)
|
|
- [ ] Test both prompt types with various scenarios
|
|
- [ ] Verify Vision AI extraction happens during filling phase
|
|
- [ ] Test with multi-document scenarios (different languages per document)
|
|
|
|
### Phase 6: Document Rendering Updates
|
|
- [ ] **Add language parameter to renderResult() method**:
|
|
- [ ] Update `mainServiceAi.py` renderResult() signature (line 460)
|
|
- [ ] Pass language to `generationService.renderReport()` (as global fallback)
|
|
- [ ] **Update renderResult call site** (`documentPath.py` line 151):
|
|
- [ ] Language comes from structure (per-document), validated in State 3
|
|
- [ ] Use validated `currentUserLanguage` as global fallback (always valid)
|
|
- [ ] Per-document language will be extracted in `renderReport()` from filledStructure
|
|
- [ ] Code example:
|
|
```python
# Language is already validated in structure (State 3) and preserved in filled structure (State 4)
# Per-document language will be extracted in renderReport() from filledStructure
# Use validated currentUserLanguage as global fallback (always valid infrastructure)
language = self.services.currentUserLanguage or "en"  # Uses validated infrastructure

renderedDocuments = await self.services.ai.renderResult(
    filledStructure,
    outputFormat,
    language,  # ← Global fallback (per-document language extracted from structure in renderReport)
    title or "Generated Document",
    userPrompt,
    docOperationId
)
```
|
|
- [ ] **Update renderReport() to handle per-document format and language**:
|
|
- [ ] Add language parameter to method signature (line 349): `language: str` (global fallback)
|
|
- [ ] Extract per-document format: `docFormat = doc.get("outputFormat") or doc.get("format") or outputFormat` (check `outputFormat` field first)
|
|
- [ ] Extract per-document language: `docLanguage = doc.get("language") or language` (from structure, validated in State 3)
|
|
- [ ] Validate language format (should be 2-character ISO code, validated in State 3)
|
|
- [ ] Add language to metadata passed to renderers: `metadata["language"] = docLanguage`
|
|
- [ ] **Note**: Per-document format and language are extracted from structure (validated in State 3). Multiple documents can have different formats and languages.
|
|
- [ ] **Error handling**:
|
|
- [ ] If no documents in structure → raise error
|
|
- [ ] If filtering removes all documents → raise error
|
|
- [ ] If format not supported → log warning, skip document
|
|
- [ ] Test multi-document rendering with different formats/languages
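
The per-document lookup described for `renderReport()` reduces to two fallback chains. The sketch below shows only that resolution step; `resolve_render_settings` is a hypothetical name, and the real method would go on to select a renderer and build metadata for each document.

```python
from typing import Any, Dict, List, Tuple


def resolve_render_settings(
    filled_structure: Dict[str, Any],
    global_format: str,
    global_language: str,
) -> List[Tuple[str, str]]:
    """Return (format, language) per document, preferring per-document values."""
    settings = []
    for doc in filled_structure.get("documents", []):
        # Check "outputFormat" first, then the "format" field, then the global fallback
        doc_format = doc.get("outputFormat") or doc.get("format") or global_format
        doc_language = doc.get("language") or global_language
        settings.append((doc_format, doc_language))
    return settings
```

Because each document is resolved independently, a structure with one PDF in German and one HTML page in English yields two different setting pairs from the same call.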
|
|
|
|
### Phase 7: ai.process Refactoring
|
|
- [ ] Remove extraction logic from `ai.process` (lines 72-119)
|
|
- [x] **Make resultType optional**: ✅ **IMPLEMENTED**
|
|
- [x] Update `ai.process`: Make `resultType` optional (can be `None`) - ✅ **COMPLETED**
|
|
- [x] Update `ai.generateDocument`: Make `resultType` optional, removed auto-detection - ✅ **COMPLETED**
|
|
- [x] Update `ai.generateCode`: Make `resultType` optional, removed auto-detection - ✅ **COMPLETED**
|
|
- [x] If `resultType` omitted → pass `None` to `callAiContent()` (formats determined from prompt) - ✅ **COMPLETED**
|
|
- [x] Updated action parameter definitions in `methodAi.py` - ✅ **COMPLETED**
|
|
|
|
**Implementation Status**:
|
|
- ✅ **ai.process**: `resultType` optional, passes `None` if omitted
|
|
- ✅ **ai.generateDocument**: `resultType` optional, passes `None` if omitted
|
|
- ✅ **ai.generateCode**: `resultType` optional, passes `None` if omitted
|
|
- ✅ **callAiContent**: Already supports optional `outputFormat` (defaults to "txt")
|
|
- [ ] **generateStructure**: Make `outputFormat` optional (see Phase 4 checklist)
|
|
|
|
- [ ] **Add filtering to Data Extraction Path** (`_handleDataExtraction`):
|
|
- [ ] **Location**: `mainServiceAi.py` between line 708 (get documents) and line 721 (extract content)
|
|
- [ ] **Purpose**: Prevent duplicate ContentParts when both original document and pre-extracted JSON are provided
|
|
- [ ] **Implementation**: Copy filtering logic from `documentPath.py:62-87`
|
|
- [ ] Filter out original documents covered by pre-extracted JSONs before calling `extractAndPrepareContent()`
|
|
- [ ] See Phase 3 checklist for detailed filtering code
|
|
- [ ] Pass `documentList` to `callAiContent()` (currently missing, line 155-162 in `process.py`)
|
|
- [ ] `documentList` is available in `process.py` (lines 43-55) but not passed to `callAiContent()`
|
|
- [ ] Add `documentList=documentList` parameter to `callAiContent()` call
|
|
- [ ] Pass `contentParts` to `callAiContent()` (already done)
|
|
- [ ] **Error handling**:
|
|
- [ ] If no documents and no contentParts → raise error
|
|
- [ ] If filtering removes all documents → raise error
|
|
- [ ] Verify intelligent merging in AI service works correctly
|
|
|
|
### Phase 8: Testing
|
|
- [ ] Test with pre-extracted JSON documents
|
|
- [ ] Test with mixed `documentList` + `contentParts`
|
|
- [ ] Test per-document format/language determination
|
|
- [ ] Test two prompt types in structure filling
|
|
- [ ] Test multi-document output with different formats/languages
|
|
- [ ] Test security: prompt injection attempts with fenced input
|
|
- [ ] **Test optional outputFormat handling**:
|
|
- [ ] Test with `resultType` provided → formats used as fallback
|
|
- [ ] Test with `resultType` omitted → AI determines formats from prompt
|
|
- [ ] Test format validation: invalid format → uses "txt"
|
|
- [ ] Test format validation: format without renderer → uses "txt"
|
|
|
|
### Phase 9: Documentation
|
|
- [ ] Update API documentation
|
|
- [ ] Update developer documentation
|
|
- [ ] Update user documentation (if applicable)
|
|
|
|
---
|
|
|
|
## Priority Order
|
|
|
|
**High Priority (Security & Critical Path)**:
|
|
1. **Phase 2**: Intent Analysis Updates - Security fix (fencing) is CRITICAL
|
|
2. **Phase 7**: ai.process Refactoring - Add filtering to Data Extraction Path (prevents duplicate ContentParts)
|
|
3. **Phase 1**: Model Updates - Foundation for all other changes
|
|
|
|
**Medium Priority (Architectural Improvements)**:
|
|
4. **Phase 4**: Structure Generation Updates
|
|
- Make outputFormat optional (AI determines per-document formats)
|
|
- Implement State 3 validation (use existing renderer registry and language infrastructure)
|
|
- Update prompt to require outputFormat field per document
|
|
5. **Phase 6**: Document Rendering Updates
|
|
- Extract per-document format/language from structure
|
|
- Add language parameter to renderResult() and renderReport()
|
|
6. **Phase 3**: Content Extraction Updates
|
|
- Remove redundant pre-extracted check AFTER filtering added upstream
|
|
|
|
**Low Priority (Verification & Polish)**:
|
|
7. **Phase 5**: Structure Filling Verification (already implemented, verify)
|
|
8. **Phase 8**: Testing
|
|
9. **Phase 9**: Documentation
|
|
|
|
---
|
|
|
|
## Notes
|
|
|
|
- The two prompt types used in structure filling (pipeline phase 4, covered by checklist Phase 5) are already implemented via the `isAggregation` flag. That checklist phase focuses on verification and documentation.
|
|
- Per-document format/language determination follows the same pattern as existing per-document language handling.
|
|
- The security fix (fencing user input) should be implemented immediately as it addresses a potential prompt injection vulnerability.
|
|
|
|
---
|
|
|
|
## Architectural Note: Filtering and Redundant Pre-Extracted JSON Checks
|
|
|
|
### Problem Statement
|
|
|
|
When a user provides both an original document and a pre-extracted JSON containing ContentParts from that original document, we need to prevent duplicate ContentParts from being created.
|
|
|
|
### Current State
|
|
|
|
The pre-extracted JSON check happens **twice**:
|
|
|
|
1. **Phase 1** (`documentPath.py` lines 62-87): Filters documents before intent clarification
|
|
2. **Phase 2** (`subContentExtraction.py` line 77): Checks again during extraction loop
|
|
|
|
### Why Filtering is Necessary
|
|
|
|
**The redundant check in `extractAndPrepareContent()` only identifies if a document IS a pre-extracted JSON. It does NOT identify if a document is an ORIGINAL covered by a pre-extracted JSON.**
|
|
|
|
**Example**:
|
|
```python
# In the extractAndPrepareContent loop, each document is checked on its own:
for document in [original_pdf_123, pre_extracted_456]:
    preExtracted = resolvePreExtractedDocument(document)
    # original_pdf_123:  returns None (it's not a pre-extracted JSON)
    #                    → processed → ContentParts extracted
    # pre_extracted_456: returns {originalDocument: {id: "original_pdf_123"}, ...}
    #                    → processed → ContentParts extracted

# Result: BOTH processed → DUPLICATES
```
|
|
|
|
**The redundant check doesn't help because**:
|
|
- It only looks at ONE document at a time
|
|
- It doesn't know about OTHER documents in the list
|
|
- It can't compare documents to find relationships
|
|
|
|
### Why Filtering Works
|
|
|
|
Filtering happens BEFORE the extraction loop, so it can:
|
|
1. Look at ALL documents at once
|
|
2. Identify relationships between documents
|
|
3. Remove originals BEFORE extraction starts
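
The three points above can be sketched as a standalone function that sees the whole list before anything is extracted. This is a minimal sketch under assumptions: `filter_covered_originals` is a hypothetical name, the resolver is passed in rather than taken from `self.intentAnalyzer`, and documents are assumed to expose an `id` attribute and the resolver to return a dict with `originalDocument.id`, as in the snippets in this plan.

```python
from typing import Any, Callable, Dict, List, Optional


def filter_covered_originals(
    documents: List[Any],
    resolve_pre_extracted: Callable[[Any], Optional[Dict[str, Any]]],
) -> List[Any]:
    """Drop original documents that a pre-extracted JSON in the same list already covers."""
    # Resolve each document once and reuse the result in both passes
    resolved = [(doc, resolve_pre_extracted(doc)) for doc in documents]

    # Pass 1: collect IDs of originals covered by any pre-extracted JSON
    covered_ids = {pre["originalDocument"]["id"] for _doc, pre in resolved if pre}

    # Pass 2: keep pre-extracted JSONs and every document that is not covered
    return [doc for doc, pre in resolved if pre or doc.id not in covered_ids]
```

Resolving each document once avoids the second `resolvePreExtractedDocument()` call per document that the two-loop version makes, and sharing one function between both paths would keep the Document Generation and Data Extraction behaviour identical.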
|
|
|
|
### Code Path Analysis
|
|
|
|
#### Path 1: Document Generation Path (`documentPath.py`)
|
|
|
|
**Location**: Line 103
|
|
**Filtering**: ✅ YES (lines 62-87)
|
|
- Identifies pre-extracted JSONs
|
|
- Filters out original documents covered by pre-extracted JSONs
|
|
- Only passes filtered documents to `extractAndPrepareContent()`
|
|
|
|
**Result**: ✅ **NO DUPLICATES** - Original document already filtered out
|
|
|
|
#### Path 2: Data Extraction Path (`mainServiceAi.py` `_handleDataExtraction`)
|
|
|
|
**Location**: Line 721
|
|
**Filtering**: ❌ **NO**
|
|
- Gets documents directly from `documentList` (line 708)
|
|
- Calls `extractAndPrepareContent()` without any filtering
|
|
- Does NOT filter out original documents covered by pre-extracted JSONs
|
|
|
|
**Result**: ❌ **DUPLICATES CREATED** - Both documents processed, same content extracted twice
|
|
|
|
### Visual Flow Comparison
|
|
|
|
#### Document Generation Path (WITH Filtering - CURRENT)
|
|
```
documentList: [original_pdf_123, pre_extracted_456]
    ↓
[FILTERING] Identify relationships, remove originals
    ↓
filteredDocuments: [pre_extracted_456]  ← original_pdf_123 removed
    ↓
extractAndPrepareContent([pre_extracted_456])
    ↓
ContentParts from pre_extracted_456 only
    ↓
✅ NO DUPLICATES
```

#### Data Extraction Path (WITHOUT Filtering - CURRENT)
```
documentList: [original_pdf_123, pre_extracted_456]
    ↓
[NO FILTERING] Pass all documents
    ↓
extractAndPrepareContent([original_pdf_123, pre_extracted_456])
    ↓
Process original_pdf_123 → ContentParts
Process pre_extracted_456 → ContentParts
    ↓
❌ DUPLICATES (same content twice)
```

#### Data Extraction Path (WITH Filtering - TARGET)
```
documentList: [original_pdf_123, pre_extracted_456]
    ↓
[FILTERING] Identify relationships, remove originals
    ↓
filteredDocuments: [pre_extracted_456]  ← original_pdf_123 removed
    ↓
extractAndPrepareContent([pre_extracted_456])
    ↓
ContentParts from pre_extracted_456 only
    ↓
✅ NO DUPLICATES
```

### Solution

**Target State**: Add filtering to the Data Extraction Path, then remove the redundant check

**Steps**:
1. **Add filtering logic to `_handleDataExtraction`** (between line 708 and line 721)
   - Copy filtering code from `documentPath.py` lines 62-87
   - Filter out original documents covered by pre-extracted JSONs
2. **Remove redundant check from `extractAndPrepareContent()`** (line 77)
   - Trust that filtering is done upstream
   - Cleaner code, single responsibility

**Risk Assessment**:
- **If we remove redundant check WITHOUT adding filtering**: ⚠️ Duplicates still occur (no change from current state)
- **If we add filtering THEN remove redundant check**: ✅ No duplicates, cleaner code

### Conclusion

1. **Filtering is necessary** because it can look at ALL documents and identify relationships
2. **Redundant check is insufficient** because it only looks at ONE document at a time
3. **Current state**: Document Generation Path filters → safe. Data Extraction Path doesn't filter → duplicates possible
4. **Solution**: Add filtering to Data Extraction Path, then remove redundant check (it's not needed if filtering is done)
5. **Risk of removing redundant check**: None IF filtering is added first. If filtering is NOT added, duplicates persist exactly as they do today, so the order of the two steps matters

---

## Appendix: Pre-Extracted JSON Document Check Locations

### Where the Check is Done

**1. Phase 1 (Before Intent Clarification)**:
- **File**: `gateway/modules/services/serviceGeneration/paths/documentPath.py`
- **Lines**: 67-87
- **Purpose**: Filter documents before intent analysis
- **Method**: `self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc)`
- **Action**: Identifies pre-extracted JSONs and filters out original documents covered by them

**2. Phase 2 (During Content Extraction)**:
- **File**: `gateway/modules/services/serviceAi/subContentExtraction.py`
- **Line**: 77
- **Purpose**: Process each document during extraction loop
- **Method**: `self.intentAnalyzer.resolvePreExtractedDocument(document)`
- **Action**: Extracts ContentParts from the pre-extracted JSON (rather than treating it as regular JSON)
- **Note**: ⚠️ **REDUNDANT** - This check happens again even though Phase 1 already filtered documents
- **Reason**: `extractAndPrepareContent()` is called from multiple code paths:
  - Document generation path (`documentPath.py`) - filtering already done
  - Data extraction path (`mainServiceAi.py`) - filtering may not be done
  - The extraction service needs to handle pre-extracted JSONs defensively
- **Optimization Opportunity**: Could pass filtered documents or a flag to skip redundant checks

**3. Check Implementation**:
- **File**: `gateway/modules/services/serviceAi/subDocumentIntents.py`
- **Line**: 122
- **Method**: `resolvePreExtractedDocument(document: ChatDocument)`
- **Logic**:
  - Checks if `mimeType == "application/json"`
  - Parses JSON and checks for `validationMetadata.actionType == "context.extractContent"`
  - Extracts `ContentExtracted` structure from `documentData`
  - Returns dict with `originalDocument` and `contentExtracted` info
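
The check logic listed above could look roughly like the sketch below. It is not the code at `subDocumentIntents.py:122`: the function name `resolvePreExtractedSketch` is hypothetical, it takes a MIME type and raw JSON string instead of a `ChatDocument`, and the placement of `originalDocument` at the top level of the payload is an assumption made for illustration.

```python
import json
from typing import Optional

EXTRACT_ACTION = "context.extractContent"


def resolvePreExtractedSketch(mimeType: str, rawData: str) -> Optional[dict]:
    """Return resolution info for a pre-extracted JSON, or None for any other document."""
    # Check 1: only JSON documents can be pre-extracted results.
    if mimeType != "application/json":
        return None

    # Check 2: the payload must parse and carry the extraction marker.
    try:
        payload = json.loads(rawData)
    except (TypeError, ValueError):
        return None
    if not isinstance(payload, dict):
        return None
    metadata = payload.get("validationMetadata")
    if not isinstance(metadata, dict) or metadata.get("actionType") != EXTRACT_ACTION:
        return None

    # Assumed layout: ContentExtracted under "documentData",
    # original document reference at the top level.
    return {
        "originalDocument": payload.get("originalDocument"),
        "contentExtracted": payload.get("documentData"),
    }


sample = json.dumps({
    "validationMetadata": {"actionType": "context.extractContent"},
    "originalDocument": {"id": "original_pdf_123"},
    "documentData": {"parts": []},
})
resolved = resolvePreExtractedSketch("application/json", sample)
print(resolved["originalDocument"]["id"])  # original_pdf_123
print(resolvePreExtractedSketch("application/pdf", sample))  # None
```

Returning `None` for every non-matching case lets callers use a simple truthiness test, which is how the mapping code in the next appendix treats the result.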

### Where Final Merged List is Available

**After Phase 2 (Content Extraction)**:
- **File**: `gateway/modules/services/serviceGeneration/paths/documentPath.py`
- **Line**: 119
- **Code**: `contentParts = preparedContentParts`
- **State**:
  - ✅ All pre-extracted JSON documents processed → ContentParts
  - ✅ All regular documents extracted → ContentParts
  - ✅ All provided contentParts merged
  - ✅ Final clean merged list ready for Phase 3 (Structure Generation)

**Before Phase 3 (Structure Generation)**:
- **File**: `gateway/modules/services/serviceGeneration/paths/documentPath.py`
- **Line**: 129
- **Usage**: `contentParts or []` passed to `generateStructure()`
- **Note**: This is the clean merged list containing all ContentParts from all sources

---

## Appendix: Intent Mapping Logic for Pre-Extracted JSONs

### How Intent Mapping Works

**Problem**: When a pre-extracted JSON document is provided, we need to:
1. Analyze intents for the **original document** (not the JSON file itself)
2. Map the intents back to the **JSON document ID** (so they can be applied to the ContentParts extracted from the JSON)

### Implementation Logic (Already in `clarifyDocumentIntents`)

**Location**: `gateway/modules/services/serviceAi/subDocumentIntents.py` lines 63-104

**Step 1: Build Mapping** (lines 63-83)
```python
documentMapping = {}  # Maps original doc ID → JSON doc ID
resolvedDocuments = []

for doc in documents:
    preExtracted = self.resolvePreExtractedDocument(doc)
    if preExtracted:
        # This is a pre-extracted JSON
        originalDocId = preExtracted["originalDocument"]["id"]
        jsonDocId = doc.id  # Current document is the JSON

        # Map: original doc ID → JSON doc ID
        documentMapping[originalDocId] = jsonDocId

        # Create temporary ChatDocument for original document
        originalDoc = ChatDocument(
            id=originalDocId,
            fileName=preExtracted["originalDocument"]["fileName"],
            mimeType=preExtracted["originalDocument"]["mimeType"],
            # ... other fields from preExtracted["originalDocument"]
        )
        resolvedDocuments.append(originalDoc)  # Use original doc for intent analysis
    else:
        resolvedDocuments.append(doc)  # Regular document, use as-is
```

**Result**:
- `documentMapping = {"original_pdf_123": "pre_extracted_456"}`
- `resolvedDocuments = [ChatDocument(id="original_pdf_123"), ChatDocument(id="other_doc")]`

**Step 2: AI Analyzes Intents** (line 86)
```python
# AI analyzes intents for resolvedDocuments (original documents, not JSONs)
intentPrompt = self._buildIntentAnalysisPrompt(userPrompt, resolvedDocuments, actionParameters)
aiResponse = await self.aiService.callAiPlanning(prompt=intentPrompt, ...)
```

**AI Response** (`documentId` is the original document ID, not the JSON document ID):
```json
{
  "intents": [
    {
      "documentId": "original_pdf_123",
      "intents": ["extract"],
      "extractionPrompt": "Extract all text",
      "reasoning": "..."
    }
  ]
}
```

**Step 3: Map Intents Back to JSON Doc IDs** (lines 96-104)
```python
intentsData = json.loads(self.services.utils.jsonExtractString(aiResponse))
documentIntents = []

for intent in intentsData.get("intents", []):
    docId = intent.get("documentId")  # "original_pdf_123"

    # If intent is for an original document covered by a pre-extracted JSON
    if docId in documentMapping:
        # Map back to JSON document ID
        intent["documentId"] = documentMapping[docId]  # "pre_extracted_456"

    documentIntents.append(DocumentIntent(**intent))
```

**Result**:
- `DocumentIntent(documentId="pre_extracted_456", intents=["extract"], ...)`
- Intent is now mapped to the JSON document ID, so it can be applied to ContentParts extracted from the JSON

### Why This Works

1. **AI analyzes original documents**: More meaningful context (file name, MIME type, etc.)
2. **Intents mapped to JSON IDs**: ContentParts extracted from JSON can be tagged with correct intents
3. **Consistent with filtering**: Original documents are filtered out, but their intents are preserved via mapping

### Example Flow

```
Input:
  - documentList: [original_pdf_123.pdf, pre_extracted_456.json]

Step 1: Filtering (Phase 1)
  - Identify: pre_extracted_456.json covers original_pdf_123.pdf
  - Filter: Remove original_pdf_123.pdf
  - Result: documents = [pre_extracted_456.json]

Step 2: Intent Mapping (Phase 1)
  - Build mapping: {"original_pdf_123": "pre_extracted_456"}
  - Resolve: resolvedDocuments = [ChatDocument(id="original_pdf_123")]
  - AI analyzes: intents for "original_pdf_123"
  - Map back: intents for "pre_extracted_456"

Step 3: Content Extraction (Phase 2)
  - Extract ContentParts from pre_extracted_456.json
  - Apply intents (from Step 2) to ContentParts
  - Result: ContentParts with correct intents
```

---

## Implementation Notes

### Infrastructure Available

The following infrastructure already exists and should be reused:

- **Language Validation**: `currentUserLanguage` is validated at `workflowManager.py:695-727` - always valid 2-character ISO code. Access via `self.services.currentUserLanguage` or `_getUserLanguage()` method.

- **Format Validation**: Renderer registry exists at `mainServiceGeneration.py:529` (`_getFormatRenderer()` uses `getRenderer()`). Import: `from modules.services.serviceGeneration.renderers.registry import getRenderer`. Returns None if format invalid, falls back to text renderer.

- **Language Extraction**: `_getDocumentLanguage()` works correctly at `subStructureFilling.py:349` - extracts per-document language from structure. Used properly during section generation.

### Key Implementation Points

1. **Per-Document Format/Language**: Multiple documents can have different formats and languages. AI determines these from user prompt. Parameters are only validation fallbacks.

2. **Filtering**: Must filter pre-extracted JSONs before content extraction to prevent duplicate ContentParts. Filtering logic exists in `documentPath.py:62-87` and should be copied to data extraction path.

3. **State 3 Validation**: Use existing infrastructure (`getRenderer()`, `_getUserLanguage()`) for validation. Infrastructure exists, just needs to be called.

4. **Rendering**: Extract per-document `outputFormat` and `language` from structure (validated in State 3). Check `outputFormat` field first, then `format` field (legacy), then global fallback.

---

## Appendix: Validation Failure Handling Decisions

This appendix documents the decision-making process for how to handle each validation failure. The actual implementation code is integrated into Section 3 above.

### Approach
- **Try to fix automatically** (use defaults) when validation fails
- **All validations are critical** (must not fail - fix or error)
- **Validation happens inline** in each phase method

### State 1: After Intent Clarification

#### Validation 1.1: Intent count mismatch
**Check**: `len(documentIntents) != len(documents)`
**Decision**: Documents without intents are OK. Intents for non-existing documents should be skipped.
**Rationale**: Not all documents need intents (some may be reference-only). Intents referencing unknown documents are invalid and should be removed.

#### Validation 1.2: Intent references unknown document
**Check**: `intent.documentId not in documentIds`
**Decision**: Skip this intent (remove it)
**Rationale**: Cannot map intent to non-existent document. Better to skip than fail.
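
A minimal sketch of the State 1 decisions, assuming intents expose a `documentId` attribute as in the mapping code earlier. `DocumentIntentStub` and `dropUnknownIntents` are hypothetical names used only for this illustration.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class DocumentIntentStub:
    """Hypothetical stand-in for DocumentIntent with only the fields used here."""
    documentId: str
    intents: list = field(default_factory=list)


def dropUnknownIntents(documentIntents, documentIds):
    """Keep only intents whose documentId refers to a known document (Validation 1.2)."""
    knownIds = set(documentIds)
    validIntents = []
    for intent in documentIntents:
        if intent.documentId not in knownIds:
            logger.warning("Skipping intent for unknown document %s", intent.documentId)
            continue
        validIntents.append(intent)
    # A count mismatch is acceptable (Validation 1.1): documents may have no intent.
    return validIntents


intents = [
    DocumentIntentStub("pre_extracted_456", ["extract"]),
    DocumentIntentStub("ghost_doc", ["extract"]),
]
validIntents = dropUnknownIntents(intents, ["pre_extracted_456", "other_doc"])
print([intent.documentId for intent in validIntents])  # ['pre_extracted_456']
```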

---

### State 2: After Content Extraction

#### Validation 2.1: ContentPart missing documentId
**Check**: `not part.metadata.get("documentId")`
**Decision**: Skip this ContentPart (remove it) with warning in logger
**Rationale**: ContentPart without documentId cannot be properly assigned. Skip with warning for debugging.

#### Validation 2.2: ContentPart has invalid contentFormat
**Check**: `contentFormat not in ["extracted", "object", "reference"]`
**Decision**: Skip this ContentPart (remove it) with warning in logger
**Rationale**: Invalid contentFormat indicates corrupted data. Skip with warning for debugging.
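
Both State 2 checks can be applied in a single pass, as in the sketch below. `ContentPartStub` and `dropInvalidContentParts` are hypothetical, and reading `contentFormat` from `part.metadata` is an assumption; the real ContentPart may store it elsewhere.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)

VALID_CONTENT_FORMATS = {"extracted", "object", "reference"}


@dataclass
class ContentPartStub:
    """Hypothetical stand-in for ContentPart; only metadata is modelled."""
    metadata: dict = field(default_factory=dict)


def dropInvalidContentParts(contentParts):
    """Remove parts that fail Validation 2.1 or 2.2, logging a warning for each."""
    validParts = []
    for part in contentParts:
        if not part.metadata.get("documentId"):
            logger.warning("Skipping ContentPart without documentId")
            continue
        contentFormat = part.metadata.get("contentFormat")  # assumed location
        if contentFormat not in VALID_CONTENT_FORMATS:
            logger.warning("Skipping ContentPart with invalid contentFormat %r", contentFormat)
            continue
        validParts.append(part)
    return validParts


parts = [
    ContentPartStub({"documentId": "pre_extracted_456", "contentFormat": "extracted"}),
    ContentPartStub({"contentFormat": "extracted"}),
    ContentPartStub({"documentId": "other_doc", "contentFormat": "binary"}),
]
validParts = dropInvalidContentParts(parts)
print([part.metadata["documentId"] for part in validParts])  # ['pre_extracted_456']
```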

---

### State 3: After Structure Generation

#### Validation 3.1: Structure missing 'documents' field
**Check**: `"documents" not in chapterStructure`
**Decision**: Stop with error (cannot auto-fix - structure is invalid)
**Rationale**: Structure without documents field is fundamentally broken. Cannot proceed.

#### Validation 3.2: Structure has no documents
**Check**: `len(documents) == 0`
**Decision**: Stop with error (cannot generate without documents)
**Rationale**: Cannot generate output without documents. Must have at least one document.

#### Validation 3.3: Document missing 'outputFormat' field
**Check**: `"outputFormat" not in doc`
**Decision**: Use global fallback format (from parameters), if not available use default "txt"
**Rationale**: Format is required for rendering. Use fallback chain: per-document → global → default.

#### Validation 3.4: Document has invalid outputFormat
**Check**: `outputFormat not in valid formats`
**Decision**: Use renderer registry to check if format has a renderer. If no renderer exists, try global fallback, then default "txt"
**Rationale**: Use dynamic renderer registry (not hardcoded list) to check format validity. Fallback chain ensures we always have a valid format.

#### Validation 3.5: Document missing 'language' field
**Check**: `"language" not in doc`
**Decision**: Use user prompt language (from `self.services.currentUserLanguage` via `_getUserLanguage()`), not "en" fallback
**Rationale**: Language is required for content generation. Use user prompt language (detected from user intention analysis) as fallback, not hardcoded "en".

#### Validation 3.6: Document has invalid language
**Check**: `len(doc["language"]) != 2`
**Decision**: Use validated `currentUserLanguage` (always valid, validated during user intention analysis)
**Rationale**: `currentUserLanguage` is validated during user intention analysis and is always a valid 2-character ISO 639-1 code. Safe to use directly.
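
The fallback chains in Validations 3.3 to 3.6 can be sketched as two small helpers. The names `resolveOutputFormat` and `resolveLanguage` are hypothetical, and `hasRenderer` is a stand-in predicate for a lookup against the renderer registry (for example, checking that `getRenderer()` does not return `None`); the set of supported formats in the demo is invented for illustration.

```python
from typing import Callable, Optional


def resolveOutputFormat(
    doc: dict,
    globalFormat: Optional[str],
    hasRenderer: Callable[[str], bool],
) -> str:
    """Apply the chain per-document → global → default "txt" (Validations 3.3 and 3.4)."""
    for candidate in (doc.get("outputFormat"), globalFormat, "txt"):
        if candidate and hasRenderer(candidate):
            return candidate
    return "txt"


def resolveLanguage(doc: dict, currentUserLanguage: str) -> str:
    """Fall back to the validated user language (Validations 3.5 and 3.6)."""
    language = doc.get("language")
    if isinstance(language, str) and len(language) == 2:
        return language
    return currentUserLanguage


def hasRenderer(fmt: str) -> bool:
    # Hypothetical registry contents, for the demo only.
    return fmt in {"pdf", "docx", "txt"}


print(resolveOutputFormat({"outputFormat": "pdf"}, "docx", hasRenderer))  # pdf
print(resolveOutputFormat({"outputFormat": "xyz"}, "docx", hasRenderer))  # docx
print(resolveOutputFormat({}, None, hasRenderer))                         # txt
print(resolveLanguage({"language": "german"}, "de"))                      # de
```

Treating a missing format and an invalid format through the same loop keeps the two validations from drifting apart.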

#### Validation 3.7: Document missing 'chapters' field
**Check**: `"chapters" not in doc`
**Decision**: Stop with error (cannot auto-fix - document structure invalid)
**Rationale**: Document without chapters is structurally invalid. Cannot proceed.

#### Validation 3.8: Chapter missing 'contentParts' field
**Check**: `"contentParts" not in chapter`
**Decision**: Stop with error (cannot auto-fix - chapter structure invalid)
**Rationale**: Chapter without contentParts field is structurally invalid. Cannot proceed.

---

### State 4: After Structure Filling

#### Validation 4.1: Filled structure missing 'documents' field
**Check**: `"documents" not in filledStructure`
**Decision**: Stop with error (cannot auto-fix - structure is invalid)
**Rationale**: Structure without documents field is fundamentally broken. Cannot proceed.

#### Validation 4.2: Section missing 'elements' field
**Check**: `"elements" not in section`
**Decision**: Create empty elements list: `section["elements"] = []`
**Rationale**: Section can be intentionally empty. Create empty list to maintain structure.

#### Validation 4.3: Section has empty elements list
**Check**: `not section["elements"]` (empty list)
**Decision**: Allow empty elements (section might be intentionally empty)
**Rationale**: Empty sections are valid (e.g., placeholder sections). No action needed.

#### Validation 4.4: Document missing 'language' field in filled structure
**Check**: `"language" not in doc` (in filledStructure)
**Decision**: Stop with error (language MUST be preserved from Phase 3)
**Rationale**: Language is validated and set in Phase 3 (State 3). If missing in filled structure, it's a critical error - language must be preserved.

#### Validation 4.5: Document has invalid language format in filled structure
**Check**: `not isinstance(doc["language"], str) or len(doc["language"]) != 2`
**Decision**: Stop with error (language format MUST be valid)
**Rationale**: Language format is validated in Phase 3 (State 3). If invalid in filled structure, it's a critical error.
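
A compact sketch of the State 4 checks follows. The function name `validateFilledStructure`, the use of `ValueError` for "stop with error", and the nesting `documents → chapters → sections` are assumptions for illustration; the real structure may nest sections differently.

```python
def validateFilledStructure(filledStructure: dict) -> dict:
    """Raise on unrecoverable problems, auto-fix missing 'elements' lists."""
    # Validation 4.1: cannot auto-fix a structure without documents.
    if "documents" not in filledStructure:
        raise ValueError("Filled structure is missing the 'documents' field")

    for doc in filledStructure["documents"]:
        # Validation 4.4: language must survive from Phase 3.
        if "language" not in doc:
            raise ValueError("Document lost its 'language' field after Phase 3")
        # Validation 4.5: language must still be a 2-character string.
        language = doc["language"]
        if not isinstance(language, str) or len(language) != 2:
            raise ValueError(f"Document has invalid language {language!r}")

        # Assumed nesting: chapters contain sections.
        for chapter in doc.get("chapters", []):
            for section in chapter.get("sections", []):
                # Validation 4.2: create the list if missing.
                # Validation 4.3: an existing empty list is left alone.
                section.setdefault("elements", [])
    return filledStructure


structure = {
    "documents": [
        {"language": "de", "chapters": [{"sections": [{"title": "Intro"}]}]}
    ]
}
validateFilledStructure(structure)
print(structure["documents"][0]["chapters"][0]["sections"][0]["elements"])  # []
```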

---

### State 5: After Document Rendering

#### Validation 5.1: No documents rendered
**Check**: `len(renderedDocuments) == 0`
**Decision**: Stop with error (already implemented in `documentPath.py` line 176)
**Rationale**: Cannot return empty result. Error already implemented.

#### Validation 5.2: Rendered document has empty documentData
**Check**: `not doc.documentData`
**Decision**: Skip this document (remove from list)
**Rationale**: Empty document is not useful. Skip it rather than fail entire operation.

#### Validation 5.3: Rendered document missing mimeType
**Check**: `not doc.mimeType`
**Decision**: Infer mimeType from filename extension
**Rationale**: mimeType can be inferred from filename. Use utility function to detect.
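
The State 5 decisions might be combined as below. This sketch uses Python's standard `mimetypes.guess_type` in place of the project's own utility function, which is not shown in this plan; `finalizeRenderedDocuments` is a hypothetical name, and the `application/octet-stream` fallback for unknown extensions is an assumption.

```python
import mimetypes
from types import SimpleNamespace


def finalizeRenderedDocuments(renderedDocuments: list) -> list:
    """Apply Validations 5.1 to 5.3 to the rendered documents."""
    # Validation 5.1: an empty result is an error.
    if len(renderedDocuments) == 0:
        raise ValueError("No documents were rendered")

    keptDocuments = []
    for doc in renderedDocuments:
        # Validation 5.2: drop documents without data.
        if not doc.documentData:
            continue
        # Validation 5.3: infer the MIME type from the file name extension.
        if not doc.mimeType:
            guessedType, _ = mimetypes.guess_type(doc.fileName)
            doc.mimeType = guessedType or "application/octet-stream"  # assumed fallback
        keptDocuments.append(doc)
    return keptDocuments


rendered = [
    SimpleNamespace(fileName="report.pdf", mimeType=None, documentData=b"%PDF-"),
    SimpleNamespace(fileName="empty.txt", mimeType="text/plain", documentData=b""),
]
finalDocuments = finalizeRenderedDocuments(rendered)
print([(doc.fileName, doc.mimeType) for doc in finalDocuments])
# [('report.pdf', 'application/pdf')]
```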