diff --git a/modules/services/serviceAi/CONTENT_EXTRACTION_ANALYSIS.md b/modules/services/serviceAi/CONTENT_EXTRACTION_ANALYSIS.md new file mode 100644 index 00000000..b83d328f --- /dev/null +++ b/modules/services/serviceAi/CONTENT_EXTRACTION_ANALYSIS.md @@ -0,0 +1,2564 @@ +# Content Extraction Logic Analysis - ai.process Action + +## Overview +This document provides a stepwise structured analysis of the content extraction logic in the main AI call (`ai.process` action). It covers input formats, document processing, AI service communication, and content handling. + +--- + +## 1. Input Content Formats + +### 1.1 Document Input Formats +The `ai.process` action accepts documents in the following formats: + +#### Supported Document Types (via Extraction Service) +- **PDF** (`application/pdf`) - Extracted via `PdfExtractor` +- **Word Documents** (`application/vnd.openxmlformats-officedocument.wordprocessingml.document`) - Extracted via `DocxExtractor` +- **Excel** (`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`) - Extracted via `XlsxExtractor` +- **PowerPoint** (`application/vnd.openxmlformats-officedocument.presentationml.presentation`) - Extracted via `PptxExtractor` +- **CSV** (`text/csv`) - Extracted via `CsvExtractor` +- **HTML** (`text/html`) - Extracted via `HtmlExtractor` +- **XML** (`application/xml`, `text/xml`) - Extracted via `XmlExtractor` +- **JSON** (`application/json`) - Extracted via `JsonExtractor` +- **Images** (`image/jpeg`, `image/png`, `image/gif`, `image/webp`) - Extracted via `ImageExtractor` +- **Text** (`text/plain`) - Extracted via `TextExtractor` +- **SQL** (`application/sql`) - Extracted via `SqlExtractor` +- **Binary** (other formats) - Extracted via `BinaryExtractor` + +#### Document Reference Formats +Documents are provided via the `documentList` parameter which accepts: +- `DocumentReferenceList` object (preferred) +- List of strings (document references) +- Single string (single document reference) +- `None` (no documents) + +### 1.2 Content Parts Input Format +Alternatively, pre-extracted content can be provided via `contentParts` parameter: +- **Type**: `List[ContentPart]` +- **ContentPart Structure**: + ```python + ContentPart( + id: str, # Unique identifier + parentId: Optional[str], # Parent part ID (for hierarchical content) + label: str, # Human-readable label + typeGroup: str, # "text", "table", "image", "structure", "container", "binary" + mimeType: str, # MIME type of the content + data: Union[str, bytes], # Actual content data + metadata: Dict[str, Any] # Metadata including: + # - documentId + # - documentMimeType + # - originalFileName + # - contentFormat ("extracted", "object", "reference") + # - intent ("extract", "display", "analyze") + # - usageHint + # - extractionPrompt + # - sourceAction + ) + ``` + +### 1.3 Prompt Input Format +- **Type**: `str` +- **Required**: Yes +- **Description**: Instruction for the AI describing what processing to perform + +### 1.4 Result Type Format +- **Type**: `str` +- **Default**: `"txt"` +- **Supported Formats**: `txt`, `json`, `md`, `csv`, `xml`, `html`, `pdf`, `docx`, `xlsx`, `pptx`, `png`, `jpg`, `jpeg`, `gif`, `webp` +- **Purpose**: Determines output file extension and generation intent + +--- + +## 2. Document Processing Flow + +### 2.1 Entry Point: `ai.process` Action +**Location**: `gateway/modules/workflows/methods/methodAi/actions/process.py` + +**Flow**: +1. **Parameter Extraction** (lines 35-55) + - Extract `aiPrompt` from parameters + - Extract `documentList` and convert to `DocumentReferenceList` + - Extract `resultType` (default: "txt") + - Extract `contentParts` if already provided + +2. **Content Extraction Decision** (lines 72-119) + - **Path A**: If `contentParts` already provided → Skip extraction, use provided parts + - **Path B**: If `documentList` provided but no `contentParts` → Extract content from documents + - **Path C**: If BOTH `contentParts` AND `documentList` provided: + - **In `ai.process` action** (lines 85-86, 167-174): + - Condition: `if not contentParts and documentList.references:` (line 86) + - **Behavior**: Only extracts from `documentList` if `contentParts` is NOT provided + - **Result**: If both provided, `contentParts` takes precedence + - **Important**: `documentList` is **NOT passed** to `callAiContent()` (line 167) + - Only `contentParts` is passed to the AI service + - **Conclusion**: `documentList` is **ignored** when `contentParts` is provided + - **Note**: Merging logic exists in document generation path (`DocumentGenerationPath.generateDocument`, lines 109-119), but this only applies when `documentList` is passed separately to `callAiContent()` (not from `ai.process` action) + - **Note**: Similar merging exists in data extraction path (`_handleDataExtraction`, lines 727-733), but also requires `documentList` to be passed to `callAiContent()` + +### 2.2 Content Extraction Process (Path B) + +**Location**: `gateway/modules/services/serviceExtraction/mainServiceExtraction.py` + +#### Step 1: Document Resolution (lines 86-94 in process.py) +```python +chatDocuments = self.services.chat.getChatDocumentsFromDocumentList(documentList) +``` +- Converts `DocumentReferenceList` to `List[ChatDocument]` +- Each `ChatDocument` contains: + - `id`: Document ID + - `fileId`: File ID for database lookup + - `fileName`: Original filename + - `mimeType`: MIME type + +#### Step 2: Extraction Options Preparation (lines 96-108 in process.py) +```python +extractionOptions = ExtractionOptions( + prompt="Extract all content from the document", + mergeStrategy=MergeStrategy( + mergeType="concatenate", + groupBy="typeGroup", + orderBy="id" + ), + processDocumentsIndividually=True +) +``` + +#### Step 3: Content Extraction (line 111 in process.py) +```python +extractedResults = self.services.extraction.extractContent(chatDocuments, extractionOptions) +``` + +**Extraction Service Flow** (`mainServiceExtraction.py:extractContent`): + +1. **For each document** (lines 69-288): + - **Load document bytes** (line 96): + ```python + documentBytes = dbInterface.getFileData(doc.fileId) + ``` + + - **Run extraction pipeline** (lines 113-120): + ```python + ec = runExtraction( + extractorRegistry=self._extractorRegistry, + chunkerRegistry=self._chunkerRegistry, + documentBytes=documentData["bytes"], + fileName=documentData["fileName"], + mimeType=documentData["mimeType"], + options=options + ) + ``` + + - **Extraction Process**: + - **Extractor Selection**: Based on MIME type, select appropriate extractor (PDF, DOCX, XLSX, etc.) + - **Content Parsing**: Extractor parses document and extracts structured content + - **Chunking** (if needed): Large content is chunked based on size limits + - **ContentPart Creation**: Each extracted piece becomes a `ContentPart` with: + - `typeGroup`: "text", "table", "image", "structure", "container", "binary" + - `data`: Extracted content (text, table data, base64 image, etc.) + - `mimeType`: Original MIME type + - `label`: Descriptive label + + - **Metadata Attachment** (lines 132-166): + ```python + # Required metadata fields + p.metadata["documentId"] = documentData["id"] + p.metadata["documentMimeType"] = documentData["mimeType"] + p.metadata["originalFileName"] = documentData["fileName"] + p.metadata["contentFormat"] = "extracted" # Default + p.metadata["intent"] = "extract" # Default + p.metadata["extractionPrompt"] = options.prompt + p.metadata["usageHint"] = f"Use extracted content from {documentData['fileName']}" + p.metadata["sourceAction"] = "extraction.extractContent" + ``` + +2. **Return Results**: + - Returns `List[ContentExtracted]` (one per input document) + - Each `ContentExtracted` contains: + - `id`: Document ID + - `parts`: `List[ContentPart]` - All extracted content parts + +#### Step 4: Combine ContentParts (lines 113-119 in process.py) +```python +contentParts = [] +for extracted in extractedResults: + if extracted.parts: + contentParts.extend(extracted.parts) +``` + +**Result**: Single `List[ContentPart]` containing all extracted content from all documents. + +--- + +## 3. What is Sent to the AI Service + +### 3.1 AI Service Call +**Location**: `gateway/modules/workflows/methods/methodAi/actions/process.py` (line 167) + +```python +aiResponse = await self.services.ai.callAiContent( + prompt=aiPrompt, + options=options, + contentParts=contentParts, # Already extracted (or None if no documents) + outputFormat=output_format, + parentOperationId=operationId, + generationIntent=generationIntent # REQUIRED for DATA_GENERATE +) +``` + +### 3.2 Parameters Sent to AI Service + +#### 3.2.1 Prompt +- **Type**: `str` +- **Content**: User-provided instruction describing what processing to perform +- **Example**: "Extract all content from the document" + +#### 3.2.2 Options (`AiCallOptions`) +```python +options = AiCallOptions( + resultFormat=output_format, # e.g., "txt", "json", "docx" + operationType=OperationTypeEnum.DATA_GENERATE # or IMAGE_GENERATE +) +``` + +**Operation Types**: +- `DATA_GENERATE`: Generate structured content (documents, code) +- `IMAGE_GENERATE`: Generate images +- `DATA_EXTRACT`: Extract and process content +- `DATA_ANALYSE`: Analyze content +- `IMAGE_ANALYSE`: Analyze images + +#### 3.2.3 ContentParts (`List[ContentPart]`) +**Structure per ContentPart**: +```python +ContentPart( + id="part_123", + parentId=None, + label="Chapter 1 Text", + typeGroup="text", # or "table", "image", "structure", "container", "binary" + mimeType="text/plain", + data="Actual content text here...", # or base64 for images + metadata={ + "documentId": "doc_456", + "documentMimeType": "application/pdf", + "originalFileName": "document.pdf", + "contentFormat": "extracted", + "intent": "extract", + "usageHint": "Use extracted content from document.pdf", + "extractionPrompt": "Extract all content from the document", + "sourceAction": "extraction.extractContent" + } +) +``` + +#### 3.2.4 Output Format +- **Type**: `str` +- **Examples**: `"txt"`, `"json"`, `"docx"`, `"pdf"`, `"xlsx"`, `"png"` + +#### 3.2.5 Generation Intent +- **Type**: `str` +- **Values**: `"document"`, `"code"`, `"image"` +- **Default Logic** (lines 142-160 in process.py): + - Document formats (xlsx, docx, pdf, txt, md, html, csv, xml, pptx) → `"document"` + - Code formats (py, js, ts, java, cpp, c, go, rs, rb, php, swift, kt) → `"code"` + - Image formats (png, jpg, jpeg, gif, webp) → `"image"` (handled separately) + +--- + +## 4. What the AI Service Does with Documents and Contents + +### 4.1 AI Service Entry Point +**Location**: `gateway/modules/services/serviceAi/mainServiceAi.py:callAiContent` (line 540) + +### 4.2 Operation Type Routing + +#### 4.2.1 IMAGE_GENERATE (lines 599-601) +- Routes to `_handleImageGeneration()` +- Generates images from prompt (no document processing) + +#### 4.2.2 DATA_GENERATE (lines 607-640) +- **Requires**: `generationIntent` parameter +- **Routes based on intent**: + - `generationIntent == "code"` → `_handleCodeGeneration()` + - `generationIntent == "document"` → `_handleDocumentGeneration()` + +#### 4.2.3 DATA_EXTRACT (lines 643-653) +- Routes to `_handleDataExtraction()` +- Extracts content from documents, then processes with AI + +### 4.3 Document Generation Flow (`_handleDocumentGeneration`) + +**Location**: `mainServiceAi.py:_handleDocumentGeneration` (referenced at line 631) + +**CRITICAL**: When called from `ai.process` action: +- **Only `contentParts` is passed** to `callAiContent()` (line 167 in `process.py`) +- **`documentList` is NOT passed** (it's `None`) +- Therefore, **extraction does NOT happen again** in the document generation path +- The `contentParts` already extracted in `ai.process` are used directly +- **Steps 1-2 below are SKIPPED** for `ai.process` flow (no `documentList` to process) + +**Note**: `DocumentGenerationPath.generateDocument()` can also be called directly from other code paths with `documentList`, so it handles both cases. The following steps describe the general flow when `documentList` IS provided (not from `ai.process`). + +#### Step 1: Document Intent Clarification +- **Condition**: `if documentList:` AND `documentIntents` not provided +- If documents exist: + - Calls `clarifyDocumentIntents()` to analyze document purposes + - Determines how each document should be used (extract, display, analyze) +- **For `ai.process` flow**: This step is **skipped** (no `documentList` passed) + +#### Step 2: Content Extraction and Preparation +- **Condition**: `if documents:` (i.e., if `documentList` was provided and converted to documents) +- If documents exist: + - Calls `extractAndPrepareContent()`: + - **RAW Extraction (NO AI)**: Uses `extractContent()` service for pure document parsing + - **What it does**: Parses PDF, DOCX, XLSX, etc. to extract structured content + - **What it creates**: ContentParts with raw extracted data + - **AI involved**: NONE - this is pure parsing/parsing, no AI calls + - **Prompt Used**: `intent.extractionPrompt` or default `"Extract all content from the document"` + - **Important**: This prompt is stored in metadata but NOT used for AI extraction here + - It's only used later during section generation (Step 4) for Vision AI extraction + - **Purpose**: Just metadata storage, not actual AI prompt execution + - **ContentPart Preparation**: + - **For Images**: + - Creates image ContentPart with base64 image data + - Marks with `needsVisionExtraction: True` + - Stores `extractionPrompt` in metadata for later use + - **Reason**: Vision AI extraction is expensive, so it's deferred to section generation + - **No AI extraction happens here** - image is just parsed and stored + - **For Text**: + - Creates text ContentPart with extracted text (from PDF text layer, DOCX text, etc.) + - Marks with `skipExtraction: True` (already extracted from parsing, no AI needed) + - **No AI extraction happens here** - text is already extracted from document parsing + - **For Objects**: Creates object ContentParts for rendering (images, videos, etc.) + - Then merges with provided `contentParts` (if any) +- **For `ai.process` flow**: This step is **skipped** (no `documentList` passed, `contentParts` already extracted) +- **Why Extract (Parse) Before Structure Generation?** + - **ContentParts are needed BEFORE structure generation** so AI can assign them to chapters + - Structure generation needs to know: + - What documents exist (documentId) + - What content types are available (typeGroup: text, image, table, etc.) + - What content formats exist (contentFormat: extracted, object, reference) + - **Structure generation doesn't need AI-extracted text from images** - it just needs to know images exist + - Vision AI extraction (converting images to text) is deferred to section generation (Step 4) for efficiency + - **Key Point**: Only RAW parsing happens here - NO AI calls, NO Vision AI, NO text extraction from images + +#### Step 3: Structure Generation (for document formats) +- Calls `structureGenerator.generateStructure()`: + - Generates document structure (chapters, sections) + - Creates JSON structure with: + - `metadata`: Title, language + - `documents`: Array of document structures + - `chapters`: Array of chapter structures with: + - `id`, `level`, `title` + - `contentParts`: Assignment of ContentParts to chapters + - `generationHint`: Description of chapter content + +#### Step 4: Structure Filling +- Calls `structureFiller.fillStructure()`: + - For each chapter: + - Extracts relevant ContentParts assigned to chapter + - **Vision AI Extraction (if needed)**: + - Checks for ContentParts with `needsVisionExtraction == True` (images) + - Calls Vision AI with `extractionPrompt` from metadata (line 651 in `subStructureFilling.py`) + - Converts image ContentPart to text ContentPart with extracted text + - **Prompt Used**: `part.metadata.get("extractionPrompt")` or default `"Extract all text content from this image..."` + - **Section Generation**: + - Generates section content using AI with processed ContentParts + - Processes ContentParts with model-aware chunking if needed + - Merges results intelligently +- **Two-Phase Extraction Explained**: + - **Phase 1 (Step 2)**: RAW extraction (parsing) - creates ContentParts for structure generation + - **Phase 2 (Step 4)**: Vision AI extraction (for images only) - happens during section generation + - **Why Two Phases?** + - Structure generation needs ContentParts early (to assign to chapters) + - Vision AI extraction is expensive and only needed when generating content + - Text content doesn't need AI extraction (already extracted in Phase 1) + +#### Step 5: Document Rendering +- Converts filled structure to final document format (PDF, DOCX, XLSX, etc.) +- Returns `AiResponse` with rendered documents + +### 4.4 Content Parts Processing (`processContentPartsWithAi`) + +**Location**: `gateway/modules/services/serviceExtraction/mainServiceExtraction.py:processContentPartsWithAi` (line 1499) + +#### Step 1: Model Selection +```python +availableModels = modelRegistry.getAvailableModels() +failoverModelList = modelSelector.getFailoverModelList(prompt, "", options, availableModels) +``` +- Selects appropriate AI models based on: + - Operation type + - Content type (text, images, etc.) + - Model capabilities + +#### Step 2: Parallel Processing +- Processes all ContentParts in parallel (max 5 concurrent by default) +- For each ContentPart: + - Calls `processContentPartWithFallback()` + +#### Step 3: ContentPart Processing (`processContentPartWithFallback`) + +**Location**: `mainServiceExtraction.py:processContentPartWithFallback` (line 1232) + +**Flow**: + +1. **Size Check** (lines 1328-1379): + ```python + # Calculate if content fits in model context + partSize = len(contentPart.data.encode('utf-8')) + modelContextTokens = model.contextLength + availableContentTokens = int((modelContextTokens - totalReservedTokens) * 0.8) + ``` + +2. **Chunking Decision**: + - If content exceeds model limits → **Chunk content** + - If content fits → **Process directly** + +3. **Chunking Process** (`chunkContentPartForAi`, line 1146): + - Calculates model-specific chunk sizes: + ```python + # Reserve tokens for: + # - Prompt + # - System message wrapper + # - Max output tokens + # - Message overhead + availableContentTokens = int((modelContextTokens - totalReservedTokens) * 0.60) + ``` + - Uses appropriate chunker based on `typeGroup`: + - `TextChunker` for text + - `StructureChunker` for JSON/structured content + - `TableChunker` for tables + - `ImageChunker` for images + +4. **AI Call**: + - **For chunks**: Process each chunk separately, then merge results + - **For single part**: Call AI directly + - **For images**: Special handling with vision models (base64 encoding) + +5. **Model Fallback**: + - If model fails → Try next model in failover list + - Continues until success or all models exhausted + +#### Step 4: Result Merging (`mergePartResults`) + +**Location**: `mainServiceExtraction.py:mergePartResults` (line 615) + +**Merging Strategies**: + +1. **Elements Response Format** (detected at line 657): + - Merges JSON responses with `"elements"` array + - Specifically merges tables by headers + - Combines rows from tables with same headers + +2. **JSON Extraction Response Format** (detected at line 669): + - Merges `{"extracted_content": {...}}` structures + - Combines: + - Text blocks + - Tables (by headers) + - Headings + - Lists + - Images + +3. **Regular Merging** (line 680): + - Uses `MergeStrategy`: + - `groupBy`: "typeGroup" or "documentId" + - `orderBy`: "id" or "originalIndex" + - `mergeType`: "concatenate" + - Applies intelligent token-aware merging if enabled + - Preserves ContentPart metadata + +#### Step 5: Return Merged Content +- Returns single `AiCallResponse` with: + - `content`: Merged content string + - `modelName`: "multiple" (if multiple models used) + - `priceUsd`: Sum of all model costs + - `processingTime`: Sum of all processing times + - `bytesSent`: Sum of all bytes sent + - `bytesReceived`: Sum of all bytes received + +--- + +## 5. Summary Flow Diagram + +``` +ai.process Action + │ + ├─→ Extract Parameters (aiPrompt, documentList, resultType) + │ + ├─→ Check contentParts + │ ├─→ If provided → Use directly + │ └─→ If not provided → Extract from documents + │ │ + │ ├─→ Convert documentList → ChatDocuments + │ │ + │ ├─→ For each document: + │ │ ├─→ Load document bytes from database + │ │ ├─→ Select extractor (PDF, DOCX, XLSX, etc.) + │ │ ├─→ Extract content → ContentParts + │ │ ├─→ Chunk if needed (size-based) + │ │ └─→ Attach metadata + │ │ + │ └─→ Combine all ContentParts + │ + ├─→ Determine operationType (DATA_GENERATE, IMAGE_GENERATE, etc.) + │ + ├─→ Determine generationIntent (document, code, image) + │ + └─→ Call AI Service (callAiContent) + │ + ├─→ Route by operationType + │ │ + │ ├─→ DATA_GENERATE + document → Document Generation + │ │ ├─→ Clarify document intents + │ │ ├─→ Extract/prepare content + │ │ ├─→ Generate structure (chapters, sections) + │ │ ├─→ Fill structure (generate content per section) + │ │ └─→ Render document (PDF, DOCX, etc.) + │ │ + │ ├─→ DATA_GENERATE + code → Code Generation + │ │ └─→ Generate code directly + │ │ + │ └─→ DATA_EXTRACT → Data Extraction + │ ├─→ Extract content from documents + │ └─→ Process with AI (simple text processing) + │ + └─→ Process ContentParts (if provided) + │ + ├─→ For each ContentPart: + │ ├─→ Check size vs model limits + │ ├─→ If too large → Chunk (model-aware) + │ ├─→ Call AI with chunk/part + │ ├─→ Handle model fallback if needed + │ └─→ Collect results + │ + └─→ Merge results + ├─→ Detect response format (elements, extraction, regular) + ├─→ Apply merging strategy + └─→ Return merged content +``` + +--- + +## 6. Key Data Structures + +### 6.1 ContentPart +```python +ContentPart( + id: str, # Unique identifier + parentId: Optional[str], # Parent part ID + label: str, # Human-readable label + typeGroup: str, # "text", "table", "image", "structure", "container", "binary" + mimeType: str, # MIME type + data: Union[str, bytes], # Content data + metadata: Dict[str, Any] # Metadata dictionary +) +``` + +### 6.2 ContentExtracted +```python +ContentExtracted( + id: str, # Document ID + parts: List[ContentPart] # Extracted content parts +) +``` + +### 6.3 AiCallOptions +```python +AiCallOptions( + resultFormat: str, # Output format ("txt", "json", "docx", etc.) + operationType: OperationTypeEnum, # Operation type + priority: PriorityEnum, # Quality vs speed + processingMode: ProcessingModeEnum, # Detailed vs fast + compressPrompt: bool, # Compress prompt + compressContext: bool # Compress context +) +``` + +### 6.4 AiCallResponse +```python +AiCallResponse( + content: str, # Generated/processed content + modelName: str, # Model used + priceUsd: float, # Cost in USD + processingTime: float, # Processing time in seconds + bytesSent: int, # Bytes sent to model + bytesReceived: int, # Bytes received from model + errorCount: int # Number of errors +) +``` + +--- + +## 7. Important Notes + +### 7.1 Content Extraction Separation +- **Extraction** (no AI): Pure document parsing and content extraction +- **AI Processing**: Content analysis, generation, transformation + +### 7.2 Model-Aware Chunking +- Chunking considers: + - Model context length + - Model max output tokens + - Prompt size + - System message overhead + - Conservative safety margins (60% of available tokens) + +### 7.3 Parallel Processing +- ContentParts are processed in parallel (max 5 concurrent) +- Improves performance for multiple documents/parts + +### 7.4 Intelligent Merging +- Merges content intelligently: + - Tables by headers + - Text blocks with separators + - Preserves document structure + - Token-aware optimization + +### 7.5 Metadata Preservation +- ContentPart metadata is preserved throughout the pipeline +- Includes document source, extraction prompt, usage hints +- Enables traceability and proper content assignment + +--- + +## 8. Debug Files Generated + +During processing, the following debug files may be generated: + +1. **Extraction Results**: `extraction_result_{filename}.txt` + - Contains extraction summary per document + - Includes part metadata and data previews + +2. **Text Parts**: `extraction_text_part_{N}_{filename}.txt` + - Contains full extracted text for each text part + +3. **Per-Part Extracted Data**: `content_extraction_per_part.txt` + - Contains per-part extracted content summary + +4. **Original Parts Extracted Data**: `content_extraction_original_parts.txt` + - Contains original parts with extracted content + +5. **Generation Prompts/Responses**: `generation_contentPart_{id}_{label}_{prompt|response}.txt` + - Contains prompts and responses for generation phase + +6. **Structure Generation**: `chapter_structure_generation_{prompt|response}.txt` + - Contains structure generation prompts and responses + +--- + +## 9. Recommendations and Next Steps + +This section documents architectural findings, recommendations, and planned improvements. Topics will be added step by step as analysis progresses. + +### 9.1 Architectural Inconsistency: contentParts + documentList Merging Behavior + +#### Problem Statement + +The `ai.process` action exhibits **inconsistent behavior** when both `contentParts` and `documentList` parameters are provided: + +**Current Behavior Across Code Paths:** + +1. **`ai.process` Action** (`process.py` lines 85-86): + - **Logic**: `if not contentParts and documentList.references:` + - **Behavior**: If both provided → Only `contentParts` used, `documentList` ignored + - **Issue**: `documentList` is not passed to `callAiContent()`, so it's completely ignored + +2. **Document Generation Path** (`documentPath.py` lines 109-119): + - **Logic**: Extracts from `documentList`, then merges with `contentParts` + - **Behavior**: If both provided → **MERGES** both + - **Code**: `preparedContentParts.extend(contentParts)` + +3. **Data Extraction Path** (`mainServiceAi.py` lines 727-733): + - **Logic**: Extracts from `documentList`, then merges with `contentParts` + - **Behavior**: If both provided → **MERGES** both + - **Code**: `preparedContentParts.extend(contentParts)` + +#### Analysis + +**Arguments FOR Current Behavior (Skip documentList):** +- Performance: Avoids redundant extraction if contentParts already provided +- Explicit Intent: If user provides contentParts, they may want only those +- Pre-extracted Content: contentParts might be pre-processed/filtered content +- Simplicity: Simpler logic, fewer edge cases + +**Arguments AGAINST Current Behavior (Should Merge):** +- **Inconsistency**: Other paths merge, creating confusion +- **User Intent**: If user provides both, they likely want both used +- **Flexibility**: Allows combining pre-extracted content with additional documents +- **Architectural Pattern**: Document generation path already handles this correctly +- **No Performance Issue**: Extraction is fast, merging is trivial + +#### Recommendation + +**The current behavior in `ai.process` does NOT make architectural sense** because: + +1. **Inconsistency**: The action routes to paths that DO merge, but the action itself doesn't +2. **Lost Functionality**: User cannot combine pre-extracted contentParts with additional documents +3. **Unexpected Behavior**: Users might expect both to be used (like in other paths) + +#### Proposed Fix + +Change `ai.process` to merge both with intelligent deduplication: + +**Logic Requirements:** +- Extract content parts from documents (without AI) **only if** that document is not already represented in the `contentParts` list +- Merge all contentParts +- Result: Complete list of contentParts for all provided documents (no duplicates) + +**Current Implementation** (lines 85-119): +```python +# If contentParts not provided but documentList is, extract content first +if not contentParts and documentList.references: + # Extract from documentList + extractedResults = self.services.extraction.extractContent(...) + contentParts = [] + for extracted in extractedResults: + if extracted.parts: + contentParts.extend(extracted.parts) +``` + +**Proposed Implementation**: +```python +# Step 1: Identify documents already represented in contentParts +documentsAlreadyExtracted = set() +if contentParts: + for part in contentParts: + documentId = part.metadata.get("documentId") + if documentId: + documentsAlreadyExtracted.add(documentId) + logger.info(f"Found {len(documentsAlreadyExtracted)} documents already represented in contentParts: {documentsAlreadyExtracted}") + +# Step 2: Extract from documentList only for documents NOT already in contentParts +extractedParts = [] +if documentList and documentList.references: + self.services.chat.progressLogUpdate(operationId, 0.3, "Extracting content from documents") + chatDocuments = self.services.chat.getChatDocumentsFromDocumentList(documentList) + + if chatDocuments: + # Filter: Only extract documents not already represented + documentsToExtract = [ + doc for doc in chatDocuments + if doc.id not in documentsAlreadyExtracted + ] + + if documentsToExtract: + logger.info(f"Extracting content from {len(documentsToExtract)} new documents (skipping {len(chatDocuments) - len(documentsToExtract)} already represented)") + + # Prepare extraction options + extractionOptions = parameters.get("extractionOptions") + if not extractionOptions: + extractionOptions = ExtractionOptions( + prompt="Extract all content from the document", + mergeStrategy=MergeStrategy( + mergeType="concatenate", + groupBy="typeGroup", + orderBy="id" + ), + processDocumentsIndividually=True + ) + + # Extract content (without AI - pure extraction) + extractedResults = self.services.extraction.extractContent(documentsToExtract, extractionOptions) + + # Combine all ContentParts from extracted results + for extracted in extractedResults: + if extracted.parts: + extractedParts.extend(extracted.parts) + + logger.info(f"Extracted {len(extractedParts)} content parts from {len(extractedResults)} documents") + else: + logger.info(f"All documents from documentList are already represented in contentParts, skipping extraction") + +# Step 3: Merge all contentParts +if contentParts: + # Preserve pre-extracted content metadata + for part in contentParts: + if part.metadata.get("skipExtraction", False): + part.metadata.setdefault("contentFormat", "extracted") + part.metadata.setdefault("isPreExtracted", True) + + # Merge: extracted parts first, then provided contentParts + # This ensures extracted content comes before pre-extracted content + finalContentParts = extractedParts + contentParts + contentParts = finalContentParts + logger.info(f"Merged contentParts: {len(extractedParts)} extracted + {len(contentParts) - len(extractedParts)} provided = {len(contentParts)} total") +elif extractedParts: + contentParts = extractedParts +``` + +**Benefits:** +- Makes behavior consistent across all paths +- Allows users to combine pre-extracted content with documents +- Matches user expectations +- Follows the architectural pattern already established in document generation path + +#### Edge Cases Handled + +1. **Duplicate Documents**: Same document in both `contentParts` and `documentList` + - **Solution**: Check `documentId` in `contentParts` metadata before extracting + - **Implementation**: Build set of `documentsAlreadyExtracted` from `part.metadata.get("documentId")` + - **Result**: Only extract documents NOT already represented in `contentParts` + - **Benefit**: Avoids redundant extraction, prevents duplicate content + +2. **Different Extraction Options**: contentParts might have different extraction settings + - **Solution**: Preserve metadata, let AI handle differences + - **Note**: Each ContentPart retains its own metadata (extractionPrompt, etc.) + - **Behavior**: Documents extracted with current options, pre-extracted parts keep their original metadata + +3. **Ordering**: Which comes first - extracted or provided? + - **Solution**: Extracted parts first, then provided contentParts + - **Rationale**: Newly extracted content comes first, pre-extracted content follows + - **Implementation**: `finalContentParts = extractedParts + contentParts` + +4. **Performance**: Avoids unnecessary extraction + - **Solution**: Only extracts documents not already in `contentParts` + - **Benefit**: Skips extraction for documents already represented + - **Logging**: Logs which documents are skipped and why + +5. **Missing documentId in Metadata**: What if contentPart doesn't have documentId? + - **Solution**: Only documents with `documentId` in metadata are considered "already extracted" + - **Behavior**: If `documentId` missing, document will be extracted (safe default) + - **Note**: Extraction service always sets `documentId` in metadata, so this is rare + +#### Implementation Steps + +1. **Update `ai.process` action** (`process.py` lines 85-119): + - **Step 1**: Build set of `documentsAlreadyExtracted` from `contentParts` metadata + - **Step 2**: Filter `chatDocuments` to only include documents NOT in `documentsAlreadyExtracted` + - **Step 3**: Extract content only from filtered documents (pure extraction, no AI) + - **Step 4**: Merge extracted parts with provided `contentParts` (extracted first, then provided) + - **Step 5**: Preserve metadata for pre-extracted contentParts + - **Step 6**: Add logging for transparency (which documents skipped, counts, etc.) + +2. **Update Documentation**: + - Update action parameter documentation to clarify deduplication behavior + - Document that extraction only happens for documents not already in `contentParts` + - Add examples showing both parameters used together + - Explain how `documentId` metadata is used for deduplication + +3. **Testing**: + - **Test Case 1**: Both parameters provided, no overlap → Both extracted and merged + - **Test Case 2**: Both parameters provided, full overlap → Only contentParts used, no extraction + - **Test Case 3**: Both parameters provided, partial overlap → Extract only new documents, merge all + - **Test Case 4**: Only contentParts → Use as-is + - **Test Case 5**: Only documentList → Extract all documents + - **Test Case 6**: contentParts without documentId metadata → Extract all documents (safe default) + +4. **Migration**: + - No breaking changes expected (only adds functionality) + - Existing code using only one parameter continues to work + - New behavior: When both provided, intelligently deduplicates before merging + +### 9.2 Architectural Redundancy: Duplicate Extraction Logic + +#### Problem Statement + +**Current Architecture:** +- `ai.process` action extracts documents and creates `contentParts` (lines 86-119) +- Then passes only `contentParts` to `callAiContent()` (line 167) +- `callAiContent()` accepts both `contentParts` AND `documentList` (line 545) +- Document generation path has `extractAndPrepareContent()` logic (line 103 in `documentPath.py`) +- But this extraction logic is **never used** when called from `ai.process` (because `documentList` is not passed) + +**Question**: Why does `ai.process` extract documents when the AI service already has extraction logic? + +#### Analysis + +**Current Flow:** +``` +ai.process + ├─→ Extract documents → contentParts (lines 86-119) + ├─→ Pass contentParts to callAiContent() (line 167) + └─→ callAiContent() routes to document generation path + └─→ extractAndPrepareContent() exists but is SKIPPED (no documentList) +``` + +**Alternative Flow (More Logical):** +``` +ai.process + ├─→ Pass documentList to callAiContent() (line 167) + └─→ callAiContent() routes to document generation path + └─→ extractAndPrepareContent() handles extraction +``` + +#### Issues with Current Architecture + +1. **Code Duplication**: Extraction logic exists in both `ai.process` and document generation path +2. **Inconsistency**: Different extraction paths use different extraction options/logic +3. **Maintenance Burden**: Changes to extraction logic must be made in multiple places +4. **Unused Code**: `extractAndPrepareContent()` in document generation path is unused when called from `ai.process` +5. **Loss of Flexibility**: `ai.process` can't leverage document intent clarification and other features in `extractAndPrepareContent()` + +#### Why Current Architecture Exists (Possible Reasons) + +1. **Historical**: Extraction may have been added to `ai.process` before AI service had extraction +2. **Separation of Concerns**: `ai.process` might be intended as a simpler entry point +3. **Progress Tracking**: Early extraction allows better progress tracking at action level +4. **Performance**: Early extraction might allow parallel processing + +However, these don't justify the duplication and inconsistency. + +#### Recommendation + +**Option A: Remove Extraction from `ai.process` (Preferred)** +- `ai.process` should pass `documentList` to `callAiContent()` instead of extracting +- Let the AI service handle all extraction through `extractAndPrepareContent()` +- Benefits: + - Single source of truth for extraction logic + - Consistent extraction options and behavior + - Leverages document intent clarification + - Simpler `ai.process` action + - Better separation: action layer vs service layer + +**Option B: Keep Extraction in `ai.process` but Make it Optional** +- Add parameter to control whether extraction happens in `ai.process` or AI service +- Still creates complexity and potential inconsistency + +**Option C: Keep Current Architecture (Not Recommended)** +- Document the duplication and accept it +- Maintain extraction logic in both places +- Risk of divergence over time + +#### Proposed Refactoring (Option A) + +**Current Implementation** (`process.py` lines 85-119): +```python +# Extract in ai.process +if not contentParts and documentList.references: + extractedResults = self.services.extraction.extractContent(...) + contentParts = combineExtractedResults(extractedResults) + +# Pass only contentParts +aiResponse = await self.services.ai.callAiContent( + contentParts=contentParts, # documentList NOT passed + ... +) +``` + +**Proposed Implementation**: +```python +# Don't extract in ai.process - let AI service handle it +# Pass documentList to AI service +aiResponse = await self.services.ai.callAiContent( + prompt=aiPrompt, + options=options, + documentList=documentList, # Pass documentList instead + contentParts=contentParts, # Still support pre-extracted contentParts + outputFormat=output_format, + parentOperationId=operationId, + generationIntent=generationIntent +) +``` + +**Benefits:** +- Single extraction path in AI service +- Consistent extraction behavior +- Leverages document intent clarification +- Simpler `ai.process` action +- Better architecture: action layer delegates to service layer + +**Migration Path:** +1. Update `ai.process` to pass `documentList` to `callAiContent()` +2. Remove extraction logic from `ai.process` (or make it optional) +3. Ensure `extractAndPrepareContent()` handles all extraction cases +4. Test that all existing workflows continue to work +5. Update documentation + +**Edge Cases:** +- Pre-extracted `contentParts` should still be supported (merge with extracted) +- Extraction options should be configurable via parameters +- Progress tracking should work at both levels + +### 9.3 Target State: Ideal Architecture and Flow + +#### Target Architecture Overview + +The target state addresses all architectural issues identified: +1. **Single extraction path** in AI service (no duplication in `ai.process`) +2. **Intelligent merging** of `contentParts` and `documentList` with deduplication +3. **Clear separation** of concerns: action layer delegates to service layer +4. **Consistent behavior** across all code paths + +#### Target Flow Diagram + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ ai.process Action │ +│ │ +│ 1. Extract Parameters │ +│ ├─→ aiPrompt │ +│ ├─→ documentList (optional) │ +│ ├─→ contentParts (optional) │ +│ ├─→ resultType │ +│ └─→ generationIntent │ +│ │ +│ 2. Determine Operation Type │ +│ ├─→ IMAGE_GENERATE → Route to image generation │ +│ ├─→ DATA_GENERATE → Route to document/code generation │ +│ └─→ DATA_EXTRACT → Route to data extraction │ +│ │ +│ 3. Pass Parameters to AI Service │ +│ └─→ callAiContent( │ +│ prompt=aiPrompt, │ +│ documentList=documentList, ← PASS documentList │ +│ contentParts=contentParts, ← PASS contentParts │ +│ options=options, │ +│ generationIntent=generationIntent │ +│ ) │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ AI Service: callAiContent() │ +│ │ +│ 1. Route by Operation Type │ +│ └─→ DATA_GENERATE → _handleDocumentGeneration() │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Document Generation Path: generateDocument() │ +│ │ +│ Phase 1: Document Intent Clarification │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ if documentList: │ │ +│ │ documents = getChatDocumentsFromDocumentList() │ │ +│ │ │ │ +│ │ # Step 1: Map pre-extracted JSONs to original docs │ │ +│ │ # (for intent analysis, analyze original docs, not JSON)│ │ +│ │ documentMapping = {} │ │ +│ │ resolvedDocuments = [] │ │ +│ │ for doc in documents: │ │ +│ │ preExtracted = resolvePreExtractedDocument(doc) │ │ +│ │ if preExtracted: │ │ +│ │ originalDocId = preExtracted["originalDocument"]["id"]│ +│ │ documentMapping[originalDocId] = doc.id │ │ +│ │ resolvedDocuments.append(originalDoc) │ │ +│ │ else: │ │ +│ │ resolvedDocuments.append(doc) │ │ +│ │ │ │ +│ │ # Step 2: AI analyzes document purposes │ │ +│ │ documentIntents = clarifyDocumentIntents( │ │ +│ │ resolvedDocuments, │ │ +│ │ userPrompt, │ │ +│ │ actionParameters │ │ +│ │ ) │ │ +│ │ │ │ +│ │ # Step 3: Map intents back to JSON doc IDs │ │ +│ │ # (if intent was for original doc, map to JSON doc) │ │ +│ │ for intent in documentIntents: │ │ +│ │ if intent.documentId in documentMapping: │ │ +│ │ intent.documentId = documentMapping[intent.documentId]│ +│ │ │ │ +│ │ # Result: List[DocumentIntent] with: │ │ +│ │ # - documentId: Document ID │ │ +│ │ # - intents: ["extract", "render", "reference"] │ │ +│ │ # - extractionPrompt: Prompt for extraction │ │ +│ │ # - reasoning: Why these intents were chosen │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ Phase 2: Content Extraction and Preparation │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Step 1: Identify Pre-Extracted JSON Documents │ │ +│ │ preExtractedDocs = [] │ │ +│ │ originalDocIdsCovered = set() │ │ +│ │ for doc in documents: │ │ +│ │ preExtracted = resolvePreExtractedDocument(doc) │ │ +│ │ if preExtracted: │ │ +│ │ preExtractedDocs.append(doc) │ │ +│ │ originalDocId = preExtracted["originalDocument"]["id"]│ +│ │ originalDocIdsCovered.add(originalDocId) │ │ +│ │ │ │ +│ │ Step 2: Filter Out Original Documents │ │ +│ │ # Remove original documents covered by pre-extracted │ │ +│ │ filteredDocuments = [ │ │ +│ │ doc for doc in documents │ │ +│ │ if doc.id not in originalDocIdsCovered │ │ +│ │ ] │ │ +│ │ │ │ +│ │ Step 3: Identify Already Extracted Documents │ │ +│ │ documentsAlreadyExtracted = set() │ │ +│ │ for part in contentParts: │ │ +│ │ if part.metadata.get("documentId"): │ │ +│ │ documentsAlreadyExtracted.add(documentId) │ │ +│ │ │ │ +│ │ Step 4: Filter Documents to Extract │ │ +│ │ documentsToExtract = [ │ │ +│ │ doc for doc in filteredDocuments │ │ +│ │ if doc.id not in documentsAlreadyExtracted │ │ +│ │ ] │ │ +│ │ │ │ +│ │ Step 5: Process Pre-Extracted JSON Documents │ │ +│ │ preExtractedParts = [] │ │ +│ │ for doc in preExtractedDocs: │ │ +│ │ preExtracted = resolvePreExtractedDocument(doc) │ │ +│ │ contentExtracted = preExtracted["contentExtracted"] │ │ +│ │ # Extract ContentParts from JSON (not regular JSON) │ │ +│ │ for part in contentExtracted.parts: │ │ +│ │ # Process nested parts if structure part │ │ +│ │ # Apply intents (extract, render, reference) │ │ +│ │ # Mark as pre-extracted │ │ +│ │ part.metadata["isPreExtracted"] = True │ │ +│ │ part.metadata["fromPreExtractedJson"] = True │ │ +│ │ preExtractedParts.append(part) │ │ +│ │ │ │ +│ │ Step 6: RAW Extraction (NO AI) for Regular Documents │ │ +│ │ if documentsToExtract: │ │ +│ │ extractedResults = extractContent( │ │ +│ │ documentsToExtract, │ │ +│ │ extractionOptions │ │ +│ │ ) │ │ +│ │ extractedParts = combineResults(extractedResults) │ │ +│ │ else: │ │ +│ │ extractedParts = [] │ │ +│ │ │ │ +│ │ Step 7: Merge All ContentParts │ │ +│ │ allParts = [] │ │ +│ │ allParts.extend(preExtractedParts) # Pre-extracted first│ +│ │ allParts.extend(extractedParts) # Then extracted │ │ +│ │ if contentParts: │ │ +│ │ # Preserve metadata │ │ +│ │ for part in contentParts: │ │ +│ │ part.metadata.setdefault("isPreExtracted", True) │ │ +│ │ allParts.extend(contentParts) # Then provided │ │ +│ │ │ │ +│ │ finalContentParts = allParts │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ Phase 3: Structure Generation │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ structure = generateStructure( │ │ +│ │ userPrompt, │ │ +│ │ finalContentParts, ← Uses ContentParts metadata │ │ +│ │ outputFormat │ │ +│ │ ) │ │ +│ │ │ │ +│ │ Result: JSON structure with chapters │ │ +│ │ - Each chapter has contentParts assignments │ │ +│ │ - Based on ContentPart metadata (documentId, etc.) │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ Phase 4: Structure Filling │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ filledStructure = fillStructure( │ │ +│ │ structure, │ │ +│ │ finalContentParts, │ │ +│ │ userPrompt │ │ +│ │ ) │ │ +│ │ │ │ +│ │ For each section: │ │ +│ │ 1. Check if ContentPart needsVisionExtraction │ │ +│ │ 2. If yes: Call Vision AI (Phase 2 extraction) │ │ +│ │ 3. Generate section content with AI │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ Phase 5: Document Rendering │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ renderedDocuments = renderDocuments( │ │ +│ │ filledStructure, │ │ +│ │ outputFormat │ │ +│ │ ) │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +#### Key Differences from Current State + +**Current State Issues:** +1. ❌ `ai.process` extracts documents (duplication) +2. ❌ `ai.process` doesn't pass `documentList` to AI service +3. ❌ No deduplication when both `contentParts` and `documentList` provided +4. ❌ Inconsistent behavior across code paths +5. ❌ Pre-extracted JSON documents in `documentList` may not be properly identified + +**Target State Benefits:** +1. ✅ Single extraction path in AI service +2. ✅ `ai.process` passes both `documentList` and `contentParts` +3. ✅ Intelligent deduplication (extract only new documents) +4. ✅ Pre-extracted JSON documents identified and processed as ContentParts (not regular JSON) +5. ✅ Original documents filtered out if covered by pre-extracted JSON +6. ✅ Consistent behavior across all code paths +7. ✅ Better separation of concerns + +#### Document Intent Clarification Details + +**What Happens in Phase 1:** + +1. **Document Resolution**: + - Maps pre-extracted JSON documents to their original documents + - Creates `documentMapping` to track original → JSON document ID mapping + - Resolves documents for intent analysis (analyze original docs, not JSON) + +2. **AI Analysis** (`clarifyDocumentIntents`): + - **Input**: User prompt, resolved documents, action parameters (outputFormat, etc.) + - **Process**: Uses AI (`callAiPlanning()`) to analyze how each document should be used + - **Output**: List of `DocumentIntent` objects, one per document + - **AI Call**: Structured JSON response with intents and reasoning + +3. **Intent Determination**: + - **"extract"**: Content extraction needed (text, structure, OCR, etc.) + - Used for: PDFs, DOCX, images with text, tables, etc. + - Generates `extractionPrompt` for specific extraction needs + - Example: `"Extract all text content, preserving structure"` + - **"render"**: Image/binary should be rendered as-is (visual element) + - Used for: Images that should appear in final document + - No extraction prompt needed + - Example: Image that should be displayed in PDF/DOCX + - **"reference"**: Document reference/attachment (no extraction) + - Used for: Documents mentioned but not extracted + - No extraction prompt needed + - Example: Template document referenced but not included + +4. **Multiple Intents**: + - A document can have multiple intents (e.g., `["extract", "render"]`) + - Example: Image that needs text extraction AND visual rendering + - Each intent creates a separate ContentPart later in extraction phase + +5. **Extraction Prompt Generation**: + - AI generates specific extraction prompt for each document + - Based on user prompt, document type, and output format + - Examples: + - `"Extract all text content, preserving structure"` + - `"Extract text content from image using vision AI"` + - `"Extract tables and data, preserving formatting"` + - Stored in `DocumentIntent.extractionPrompt` for later use + +6. **Mapping Back**: + - If intent was for original document, map back to JSON document ID + - Ensures intents are associated with correct documents + - Pre-extracted JSON documents get intents mapped correctly + +**Example Flow**: +``` +Input: + documents = [ + ChatDocument(id="doc_1", fileName="report.pdf"), + ChatDocument(id="doc_2", fileName="image.jpg"), + ChatDocument(id="json_3", fileName="pre_extracted.json") # Pre-extracted + ] + userPrompt = "Create a report with the PDF content and show the image" + +Step 1: Map pre-extracted JSON + → json_3 maps to original_doc_3 + → resolvedDocuments = [doc_1, doc_2, original_doc_3] + +Step 2: AI Analysis + → Analyzes: "Create report with PDF content and show image" + → Determines: + - doc_1: ["extract"] (needs text extraction) + extractionPrompt: "Extract all text content, preserving structure" + - doc_2: ["render"] (needs visual rendering) + extractionPrompt: null + - original_doc_3: ["extract"] (needs extraction) + extractionPrompt: "Extract all text content, preserving structure" + +Step 3: Map back + → original_doc_3 intent mapped to json_3 + → Final intents: + - doc_1: ["extract"] + - doc_2: ["render"] + - json_3: ["extract"] +``` + +**Why This Matters**: +- Determines HOW each document should be processed (extract vs. render vs. reference) +- Generates appropriate extraction prompts for each document +- Handles pre-extracted JSON documents correctly (maps to original for analysis) +- Enables multiple intents per document (extract + render for images) +- Guides content extraction phase (Phase 2) on what to extract and how + +**Output Structure**: +```python +DocumentIntent( + documentId: str, # Document ID + intents: List[str], # ["extract", "render", "reference"] + extractionPrompt: Optional[str], # Prompt for extraction (if extract intent) + reasoning: str # Why these intents were chosen +) +``` + +#### Pre-Extracted JSON Documents Handling + +**Scenario**: ContentParts are already extracted and handed over as JSON documents in `documentList` + +**Target State Behavior**: + +1. **Identification** (Step 1 in Phase 2): + - Use `resolvePreExtractedDocument()` to identify JSON documents containing `ContentExtracted` structure + - These are NOT regular JSON documents - they contain pre-processed ContentParts + - Map back to original document ID to identify which original documents are covered + +2. **Filtering** (Step 2 in Phase 2): + - Keep pre-extracted JSON documents (will be processed as ContentParts) + - Remove original documents if covered by pre-extracted JSON (prevents duplicate extraction) + - Keep regular documents (not pre-extracted, not covered) + +3. **Processing** (Step 5 in Phase 2): + - Extract ContentParts from pre-extracted JSON (not treat as regular JSON) + - Process nested parts if structure parts contain nested ContentParts + - Apply intents (extract, render, reference) to each ContentPart + - Mark with metadata: + - `isPreExtracted: True` + - `fromPreExtractedJson: True` + - `originalFileName`: Original document filename + - `documentId`: Pre-extracted JSON document ID + +4. **Merging** (Step 7 in Phase 2): + - Merge order: pre-extracted parts → extracted parts → provided contentParts + - All ContentParts treated equally regardless of source + +**Example Flow**: +``` +documentList = [ + "doc:original_pdf_123", # Original PDF document + "doc:pre_extracted_json_456" # Pre-extracted JSON (contains ContentParts from original_pdf_123) +] + +Step 1: Identify pre-extracted JSON + → pre_extracted_json_456 is identified as pre-extracted + → Maps to original_pdf_123 + +Step 2: Filter documents + → Keep pre_extracted_json_456 (will extract ContentParts from JSON) + → Remove original_pdf_123 (covered by pre-extracted JSON) + +Step 5: Process pre-extracted JSON + → Extract ContentParts from pre_extracted_json_456 + → Mark as isPreExtracted=True, fromPreExtractedJson=True + +Step 6: Extract regular documents + → No documents to extract (all filtered out or pre-extracted) + +Step 7: Merge + → finalContentParts = [ContentParts from pre_extracted_json_456] +``` + +**Key Point**: Pre-extracted JSON documents are identified BEFORE deduplication and processed as ContentParts, NOT as regular JSON documents. This prevents treating them as regular JSON and ensures ContentParts are properly extracted and used. + +#### Migration Steps + +**Phase 1: Update `ai.process` Action** + +**Step 1.1: Remove Extraction Logic from `ai.process`** +- **File**: `gateway/modules/workflows/methods/methodAi/actions/process.py` +- **Lines**: 85-119 +- **Action**: Remove or comment out extraction logic +- **Code Change**: + ```python + # REMOVE THIS: + # if not contentParts and documentList.references: + # extractedResults = self.services.extraction.extractContent(...) + # contentParts = combineExtractedResults(extractedResults) + ``` + +**Step 1.2: Pass `documentList` to `callAiContent()`** +- **File**: `gateway/modules/workflows/methods/methodAi/actions/process.py` +- **Line**: 167 +- **Action**: Add `documentList` parameter +- **Code Change**: + ```python + # CURRENT: + aiResponse = await self.services.ai.callAiContent( + prompt=aiPrompt, + options=options, + contentParts=contentParts, # Only contentParts + outputFormat=output_format, + parentOperationId=operationId, + generationIntent=generationIntent + ) + + # TARGET: + aiResponse = await self.services.ai.callAiContent( + prompt=aiPrompt, + options=options, + documentList=documentList, # ADD documentList + contentParts=contentParts, # Keep contentParts + outputFormat=output_format, + parentOperationId=operationId, + generationIntent=generationIntent + ) + ``` + +**Step 1.3: Update Progress Tracking** +- **File**: `gateway/modules/workflows/methods/methodAi/actions/process.py` +- **Action**: Remove extraction progress tracking (moved to AI service) +- **Note**: Progress tracking will happen in `extractAndPrepareContent()` + +**Phase 2: Update Document Generation Path** + +**Step 2.1: Document Intent Clarification (Already Exists)** +- **File**: `gateway/modules/services/serviceAi/subDocumentIntents.py` +- **Lines**: 30-120 +- **Action**: Verify intent clarification works correctly with new flow +- **What it does**: + - **AI Analysis**: Uses AI to analyze user prompt and documents + - **Determines Intents**: For each document, determines how it should be used: + - `"extract"`: Content extraction needed (text, structure, OCR, etc.) + - `"render"`: Image/binary should be rendered as-is (visual element) + - `"reference"`: Document reference/attachment (no extraction, just reference) + - **Multiple Intents**: A document can have multiple intents (e.g., `["extract", "render"]` for images) + - **Extraction Prompt**: Generates specific extraction prompt for each document + - **Pre-Extracted JSON Handling**: Maps pre-extracted JSONs to original documents for analysis, then maps back +- **Example Output**: + ```python + [ + DocumentIntent( + documentId="doc_1", + intents=["extract"], + extractionPrompt="Extract all text content, preserving structure", + reasoning="User needs text content for document generation" + ), + DocumentIntent( + documentId="doc_2", + intents=["extract", "render"], # Both! + extractionPrompt="Extract text content from image using vision AI", + reasoning="Image contains text that needs extraction, but also should be rendered visually" + ) + ] + ``` +- **Note**: This step already exists and works correctly, just needs to be verified with new flow + +**Step 2.2: Identify Pre-Extracted JSON Documents** +- **File**: `gateway/modules/services/serviceGeneration/paths/documentPath.py` +- **Lines**: 62-87 (already exists, but needs to be integrated with deduplication) +- **Action**: Ensure pre-extracted JSON documents are identified BEFORE deduplication +- **Code Change**: + ```python + # Step 1: Identify pre-extracted JSON documents + preExtractedDocs = [] + originalDocIdsCoveredByPreExtracted = set() + for doc in documents: + preExtracted = self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc) + if preExtracted: + preExtractedDocs.append(doc) + originalDocId = preExtracted["originalDocument"]["id"] + originalDocIdsCoveredByPreExtracted.add(originalDocId) + logger.info(f"Found pre-extracted JSON {doc.id} covering original document {originalDocId}") + + # Step 2: Filter out original documents covered by pre-extracted JSONs + filteredDocuments = [] + for doc in documents: + preExtracted = self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc) + if preExtracted: + # Pre-extracted JSON - keep it (will be processed as ContentParts, not regular JSON) + filteredDocuments.append(doc) + elif doc.id in originalDocIdsCoveredByPreExtracted: + # Original document covered by pre-extracted JSON - skip it + logger.info(f"Skipping original document {doc.id} - already covered by pre-extracted JSON") + else: + # Regular document - keep it + filteredDocuments.append(doc) + + documents = filteredDocuments + ``` + +**Step 2.2: Add Deduplication Logic for Regular Documents** +- **File**: `gateway/modules/services/serviceGeneration/paths/documentPath.py` +- **Lines**: 101-119 +- **Action**: Add deduplication before extraction (after pre-extracted JSON handling) +- **Code Change**: + ```python + # Step 3: Identify already extracted documents (from contentParts) + documentsAlreadyExtracted = set() + if contentParts: + for part in contentParts: + documentId = part.metadata.get("documentId") + if documentId: + documentsAlreadyExtracted.add(documentId) + + # Step 4: Filter documents to extract (exclude pre-extracted JSONs and already extracted) + documentsToExtract = [ + doc for doc in documents + if doc.id not in documentsAlreadyExtracted + and not self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc) # Not pre-extracted JSON + ] + + # Step 5: Process pre-extracted JSON documents (handled in extractAndPrepareContent) + # Step 6: Extract regular documents + if documentsToExtract: + preparedContentParts = await extractAndPrepareContent( + documentsToExtract, # Only new documents (not pre-extracted, not already extracted) + documentIntents or [], + docOperationId + ) + + # Merge: pre-extracted parts + extracted parts + provided contentParts + if contentParts: + # Preserve metadata + for part in contentParts: + part.metadata.setdefault("isPreExtracted", True) + preparedContentParts.extend(contentParts) + + contentParts = preparedContentParts + elif contentParts: + # All documents already extracted or pre-extracted, use contentParts as-is + contentParts = contentParts + ``` + +**Step 2.4: Ensure Pre-Extracted JSON Processing** +- **File**: `gateway/modules/services/serviceAi/subContentExtraction.py` +- **Lines**: 75-253 +- **Action**: Ensure `extractAndPrepareContent()` properly handles pre-extracted JSON documents +- **Note**: This logic already exists (lines 75-253) but needs to be verified: + - Pre-extracted JSON documents are identified via `resolvePreExtractedDocument()` + - ContentParts are extracted from JSON (not treated as regular JSON) + - Original documents are skipped if covered by pre-extracted JSON + - Metadata is preserved (`isPreExtracted`, `fromPreExtractedJson`) + +**Step 2.5: Verify Pre-Extracted JSON Identification** +- **File**: `gateway/modules/services/serviceAi/subDocumentIntents.py` +- **Action**: Ensure `resolvePreExtractedDocument()` correctly identifies pre-extracted JSON documents +- **Requirements**: + - Must identify JSON documents containing `ContentExtracted` structure + - Must map back to original document ID + - Must extract ContentParts from JSON (not treat as regular JSON) + - Must preserve metadata (`isPreExtracted`, `fromPreExtractedJson`) + +**Step 2.6: Update Extraction Logic** +- **File**: `gateway/modules/services/serviceAi/subContentExtraction.py` +- **Action**: Ensure extraction handles deduplication gracefully +- **Note**: Extraction service already supports this, just need to pass filtered documents +- **Important**: Pre-extracted JSON documents should be processed BEFORE regular extraction + +**Phase 3: Testing and Validation** + +**Step 3.1: Unit Tests** +- Test `ai.process` with only `documentList` +- Test `ai.process` with only `contentParts` +- Test `ai.process` with both `documentList` and `contentParts` (no overlap) +- Test `ai.process` with both `documentList` and `contentParts` (full overlap) +- Test `ai.process` with both `documentList` and `contentParts` (partial overlap) + +**Step 3.2: Integration Tests** +- Test full document generation flow +- Test progress tracking at all levels +- Test error handling (missing documents, extraction failures) +- Test performance (no duplicate extraction) + +**Step 3.3: Regression Tests** +- Ensure existing workflows continue to work +- Test backward compatibility +- Test edge cases (empty lists, missing metadata, etc.) + +**Phase 4: Documentation Updates** + +**Step 4.1: Update Action Documentation** +- **File**: `gateway/modules/workflows/methods/methodAi/methodAi.py` +- **Action**: Update parameter descriptions to clarify merging behavior +- **Content**: Document that both parameters can be provided and will be merged intelligently + +**Step 4.2: Update API Documentation** +- Document new behavior in API docs +- Add examples showing both parameters used together +- Explain deduplication logic + +**Step 4.3: Update This Analysis Document** +- Mark current state sections as "Current State (Pre-Migration)" +- Add "Target State" sections (this chapter) +- Document migration progress + +**Phase 5: Rollout Strategy** + +**Step 5.1: Feature Flag (Optional)** +- Add feature flag to control new vs. old behavior +- Allows gradual rollout +- Easy rollback if issues found + +**Step 5.2: Gradual Migration** +- Migrate one workflow at a time +- Monitor for issues +- Collect feedback + +**Step 5.3: Full Migration** +- Remove old extraction logic from `ai.process` +- Remove feature flag +- Update all documentation + +#### Migration Checklist + +- [ ] **Phase 1: Update `ai.process` Action** + - [ ] Remove extraction logic from `ai.process` + - [ ] Pass `documentList` to `callAiContent()` + - [ ] Update progress tracking + - [ ] Test `ai.process` with new parameters + +- [ ] **Phase 2: Update Document Generation Path** + - [ ] Identify pre-extracted JSON documents (before deduplication) + - [ ] Filter out original documents covered by pre-extracted JSONs + - [ ] Add deduplication logic for regular documents + - [ ] Ensure pre-extracted JSON processing (extract ContentParts, not treat as JSON) + - [ ] Update extraction to handle filtered documents + - [ ] Test merging behavior (pre-extracted + extracted + provided) + - [ ] Test pre-extracted JSON identification + +- [ ] **Phase 3: Testing and Validation** + - [ ] Unit tests for all scenarios + - [ ] Integration tests for full flow + - [ ] Regression tests for existing workflows + - [ ] Performance tests (no duplicate extraction) + +- [ ] **Phase 4: Documentation Updates** + - [ ] Update action parameter documentation + - [ ] Update API documentation + - [ ] Update analysis document + +- [ ] **Phase 5: Rollout** + - [ ] Feature flag (if needed) + - [ ] Gradual migration + - [ ] Full migration + - [ ] Remove old code + +- [ ] **Phase 6: Security and Design Improvements** + - [ ] **CRITICAL: Fix unfenced user input** (Finding 1) + - [ ] Add fencing around `userPrompt` in intent analysis prompt + - [ ] Test with various user inputs (special chars, JSON, newlines) + - [ ] Verify AI still correctly parses user request + - [ ] **IMPROVEMENT: Per-document output format** (Finding 2) + - [ ] Add `outputFormat` field to `DocumentIntent` model (optional) + - [ ] Update intent analysis prompt to determine format per document + - [ ] Update structure generation to use per-document format + - [ ] Fallback to global format if not specified + +#### Expected Benefits After Migration + +1. **Architectural Improvements**: + - Single source of truth for extraction logic + - Consistent behavior across all code paths + - Better separation of concerns + +2. **Functional Improvements**: + - Users can combine pre-extracted content with documents + - Intelligent deduplication prevents redundant extraction + - More flexible and powerful API + +3. **Maintenance Improvements**: + - Less code duplication + - Easier to maintain and extend + - Clearer code organization + +4. **Performance Improvements**: + - No duplicate extraction + - Better resource utilization + - Faster processing for common cases + +### 9.4 Two-Phase Extraction: Why Extract Before Structure Generation? + +#### Problem Statement + +**Question**: Why do we extract content (Step 2) BEFORE structure generation (Step 3), when we need AI to fill sections (Step 4) anyway? Are we extracting twice? + +**Answer**: Yes, but it's intentional and necessary. There are TWO different types of extraction happening at different phases: + +1. **Phase 1 (Step 2)**: RAW extraction (parsing) - NO AI +2. **Phase 2 (Step 4)**: Vision AI extraction (for images only) - WITH AI + +#### Analysis + +**Phase 1: RAW Extraction (Step 2 - `extractAndPrepareContent`)** + +**What happens:** +- Uses `extractContent()` service for pure document parsing +- Parses PDF, DOCX, XLSX, etc. to extract structured content +- Creates ContentParts with raw extracted data +- **No AI involved** - just parsing/parsing + +**Prompt used:** +- `intent.extractionPrompt` or default `"Extract all content from the document"` +- **Important**: This prompt is stored in metadata but NOT used for AI extraction here +- It's only used later during section generation (Step 4) for Vision AI + +**ContentPart preparation:** +- **For Images**: + - Marks with `needsVisionExtraction: True` + - Stores `extractionPrompt` in metadata + - **Reason**: Vision AI extraction is expensive, so it's deferred to section generation +- **For Text**: + - Marks with `skipExtraction: True` (already extracted, no AI needed) + - Text is already extracted from document parsing +- **For Objects**: + - Creates object ContentParts for rendering (images, videos, etc.) + +**Why extract before structure generation?** +- ContentParts are needed BEFORE structure generation so AI can assign them to chapters +- Structure generation needs to know what content is available to assign to chapters +- The AI needs ContentPart metadata (documentId, typeGroup, etc.) to make intelligent assignments + +**Phase 2: Vision AI Extraction (Step 4 - `fillStructure`)** + +**What happens:** +- During section generation, checks for ContentParts with `needsVisionExtraction == True` +- Calls Vision AI with `extractionPrompt` from metadata (line 651 in `subStructureFilling.py`) +- Converts image ContentPart to text ContentPart with extracted text +- Then uses the text part for section generation + +**Prompt used:** +- `part.metadata.get("extractionPrompt")` or default `"Extract all text content from this image. Return only the extracted text, no additional formatting."` +- This is the actual AI extraction prompt + +**Why extract during section generation?** +- Vision AI extraction is expensive (costs tokens, takes time) +- Only needed when actually generating content for a section +- Not needed for structure generation (just needs to know images exist) +- Deferred extraction saves costs and improves performance + +#### Current Flow + +``` +Step 2: extractAndPrepareContent() + ├─→ RAW extraction (parsing PDF/DOCX/etc.) - NO AI + ├─→ Creates ContentParts with raw data + ├─→ For images: marks needsVisionExtraction=True, stores extractionPrompt + └─→ For text: marks skipExtraction=True (already extracted) + +Step 3: generateStructure() + ├─→ Uses ContentParts metadata to assign to chapters + └─→ Creates structure with contentPart assignments + +Step 4: fillStructure() + ├─→ For each section: + │ ├─→ Check if ContentPart needsVisionExtraction==True + │ ├─→ If yes: Call Vision AI with extractionPrompt (Phase 2 extraction) + │ ├─→ Convert image → text ContentPart + │ └─→ Generate section content with processed ContentParts + └─→ Text ContentParts: Used directly (skipExtraction=True) +``` + +#### Is This Optimal? + +**Arguments FOR current approach:** +- Structure generation needs ContentParts early (to assign to chapters) +- Vision AI extraction is expensive - deferring saves costs +- Text content doesn't need AI extraction (already extracted in Phase 1) +- Clear separation: parsing vs. AI extraction + +**Arguments AGAINST current approach:** +- Two-phase extraction can be confusing +- `extractionPrompt` stored but not used until later (unclear) +- Could potentially extract images earlier if structure generation needs text content + +#### Recommendation + +**Current approach is reasonable** but documentation should be clearer: + +1. **Clarify terminology**: + - "Extraction" in Step 2 = RAW parsing (no AI) + - "Extraction" in Step 4 = Vision AI extraction (with AI) + +2. **Document prompts clearly**: + - Step 2: `extractionPrompt` is stored but NOT used (just metadata) + - Step 4: `extractionPrompt` is actually used for Vision AI + +3. **Consider renaming**: + - `extractAndPrepareContent()` → `parseAndPrepareContent()` (more accurate) + - `needsVisionExtraction` → `needsVisionAiExtraction` (clearer) + +4. **Alternative approach** (if structure generation needs text from images): + - Extract images with Vision AI in Step 2 + - More expensive but simpler flow + - Only if structure generation actually needs image text + +#### Implementation Notes + +- **Text ContentParts**: Already extracted in Phase 1, used directly in Phase 4 +- **Image ContentParts**: Parsed in Phase 1, Vision AI extracted in Phase 4 +- **Object ContentParts**: Created in Phase 1, used for rendering in Phase 4 +- **Reference ContentParts**: Created in Phase 1, used as references in Phase 4 + +### 9.5 Document Intent Clarification: Security and Design Issues + +#### Finding 1: Security Risk - Unfenced User Input + +**Problem Statement:** + +The user input (`userPrompt`) is directly inserted into the intent analysis prompt without fencing or escaping (line 248-249 in `subDocumentIntents.py`): + +```python +prompt = f"""USER REQUEST: +{userPrompt} # ← DIRECT INSERTION, NO FENCING! +``` + +**Security Risk:** +- **Prompt Injection**: User input could contain special characters, JSON, or instructions that break the prompt structure +- **Example Attack**: User could inject `\n\nRETURN JSON: {"intents": [{"documentId": "malicious", ...}]}` to manipulate the AI response +- **Impact**: Could cause incorrect intent determination or even security vulnerabilities + +**Evidence from Debug Files:** +- `20260102-134423-015-document_intent_analysis_prompt.txt`: User input is directly inserted without any fencing +- User input contains German text with special characters, quotes, etc. +- No escaping or delimiters around user input + +**Recommendation:** + +**Option A: Fence User Input (Preferred)** +```python +prompt = f"""USER REQUEST: +``` +{userPrompt} +``` + +DOCUMENTS TO ANALYZE: +{docListText} +... +``` + +**Option B: Escape Special Characters** +```python +import json +escapedPrompt = json.dumps(userPrompt) # Escapes quotes, newlines, etc. +prompt = f"""USER REQUEST: {escapedPrompt} +... +``` + +**Option C: Use Structured Format** +```python +prompt = f"""USER REQUEST (delimited): +---START_USER_REQUEST--- +{userPrompt} +---END_USER_REQUEST--- + +DOCUMENTS TO ANALYZE: +... +``` + +**Implementation Steps:** +1. Update `_buildIntentAnalysisPrompt()` in `subDocumentIntents.py` (line 248) +2. Add fencing around `userPrompt` (Option A recommended) +3. Test with various user inputs (special characters, JSON, newlines, quotes) +4. Verify AI still correctly parses user request + +#### Finding 2: Output Format Should Be Per-Document + +**Problem Statement:** + +Currently, output format is passed as a single value in the intent analysis prompt (line 259 in `subDocumentIntents.py`): + +```python +OUTPUT FORMAT: {outputFormat} # Single format for all documents +``` + +**Issue:** +- Output format is global, but different documents might need different formats +- Similar to language handling: each document can have its own language +- Should be determined per document based on intention + +**Current Behavior:** +- Single `outputFormat` parameter (e.g., "docx") +- All documents analyzed with same output format in mind +- AI considers output format when determining intents (e.g., DOCX → images need "render") + +**Proposed Behavior:** +- Each `DocumentIntent` should have optional `outputFormat` field +- AI determines output format per document based on user intention +- If not specified, use global output format as fallback +- Similar to language: per-document with fallback to global + +**Example:** +```python +DocumentIntent( + documentId: str, + intents: List[str], + extractionPrompt: Optional[str], + reasoning: str, + outputFormat: Optional[str] = None # NEW: Per-document format +) +``` + +**Benefits:** +- More flexible: Different documents can have different output formats +- Better intention analysis: AI can determine format based on document purpose +- Consistent with language handling (per-document with fallback) + +**Migration Steps:** +1. Add `outputFormat` field to `DocumentIntent` model (optional) +2. Update intent analysis prompt to ask AI to determine format per document +3. Update prompt to show: "OUTPUT FORMAT (default: {outputFormat})" instead of "OUTPUT FORMAT: {outputFormat}" +4. Update structure generation to use per-document format if available +5. Fallback to global format if not specified per document + +**Updated Prompt Structure:** +```python +OUTPUT FORMAT (default: {outputFormat}): +- If not specified per document, use default format above +- Determine format per document based on user intention +- Examples: "docx", "pdf", "html", "json", etc. + +RETURN JSON: +{{ + "intents": [ + {{ + "documentId": "doc_1", + "intents": ["extract"], + "extractionPrompt": "...", + "outputFormat": "docx", # NEW: Per-document format + "reasoning": "..." + }} + ] +}} +``` + +#### Implementation Priority + +**High Priority:** +- Finding 1 (Security Risk): **CRITICAL** - Fix immediately + - Security vulnerability that could be exploited + - Easy to fix (add fencing) + - Low risk change + +**Medium Priority:** +- Finding 2 (Output Format): **IMPROVEMENT** - Plan for next iteration + - Architectural improvement + - Requires model changes + - More complex migration + +--- + +## 10. Implementation Plan: Target State Migration + +This section provides a detailed implementation plan for migrating to the target architecture described in Section 9.3. The plan focuses on documents/content handling, output formats, languages, and clear handover states between phases. + +### 10.1 Overview: Major Phases and Handover States + +#### Phase Flow Diagram + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ PHASE 1: Document Intent Clarification │ +│ ────────────────────────────────────────────────────────────────── │ +│ INPUT: │ +│ - userPrompt: str (fenced) │ +│ - documentList: DocumentReferenceList (optional) │ +│ - contentParts: List[ContentPart] (optional) │ +│ - actionParameters: Dict (outputFormat, language, etc.) │ +│ │ +│ THROUGHPUT: │ +│ 1. Resolve documents from documentList │ +│ 2. Map pre-extracted JSONs to original documents │ +│ 3. AI analyzes document purposes │ +│ 4. Map intents back to JSON doc IDs (if applicable) │ +│ │ +│ OUTPUT: │ +│ - documentIntents: List[DocumentIntent] │ +│ * documentId: str │ +│ * intents: List[str] (["extract", "render", "reference"]) │ +│ * extractionPrompt: str (optional) │ +│ * outputFormat: str (optional, per-document) ← NEW │ +│ * language: str (optional, per-document) ← NEW │ +│ * reasoning: str │ +│ │ +│ HANDOVER STATE: │ +│ - documentIntents: Complete intent analysis │ +│ - documents: Resolved ChatDocuments │ +│ - preExtractedMapping: Map[originalDocId, jsonDocId] │ +└─────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────┐ +│ PHASE 2: Content Extraction and Preparation │ +│ ────────────────────────────────────────────────────────────────── │ +│ INPUT: │ +│ - documents: List[ChatDocument] │ +│ - documentIntents: List[DocumentIntent] │ +│ - contentParts: List[ContentPart] (optional, pre-extracted) │ +│ - preExtractedMapping: Map[originalDocId, jsonDocId] │ +│ │ +│ THROUGHPUT: │ +│ 1. Identify pre-extracted JSON documents │ +│ 2. Filter out original documents covered by pre-extracted │ +│ 3. Identify already extracted documents (from contentParts) │ +│ 4. Filter documents to extract (exclude duplicates) │ +│ 5. Process pre-extracted JSON documents → ContentParts │ +│ 6. RAW extraction (NO AI) for regular documents │ +│ 7. Merge: pre-extracted + extracted + provided contentParts │ +│ 8. Apply intents to ContentParts (extract, render, reference) │ +│ 9. Mark images for Vision AI extraction (deferred) │ +│ │ +│ OUTPUT: │ +│ - finalContentParts: List[ContentPart] │ +│ * id: str │ +│ * typeGroup: str │ +│ * mimeType: str │ +│ * data: Union[str, bytes] │ +│ * metadata: Dict │ +│ - documentId: str │ +│ - contentFormat: str ("extracted", "object", "reference") │ +│ - intent: str │ +│ - needsVisionExtraction: bool (for images) │ +│ - extractionPrompt: str (for Vision AI) │ +│ - originalFileName: str │ +│ - isPreExtracted: bool │ +│ - outputFormat: str (from DocumentIntent) ← NEW │ +│ - language: str (from DocumentIntent) ← NEW │ +│ │ +│ HANDOVER STATE: │ +│ - finalContentParts: Complete, ready for structure generation │ +│ - All documents processed (extracted or pre-extracted) │ +│ - Vision AI extraction deferred to Phase 4 │ +└─────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────┐ +│ PHASE 3: Structure Generation │ +│ ────────────────────────────────────────────────────────────────── │ +│ INPUT: │ +│ - userPrompt: str │ +│ - finalContentParts: List[ContentPart] │ +│ - globalOutputFormat: str (fallback) │ +│ - globalLanguage: str (fallback) │ +│ │ +│ THROUGHPUT: │ +│ 1. Group ContentParts by documentId │ +│ 2. Determine per-document outputFormat (from ContentPart.metadata│ +│ or global fallback) │ +│ 3. Determine per-document language (from ContentPart.metadata │ +│ or global fallback) │ +│ 4. AI generates structure with chapters │ +│ 5. Assign ContentParts to chapters │ +│ │ +│ OUTPUT: │ +│ - chapterStructure: Dict │ +│ * documents: List[Dict] │ +│ - id: str │ +│ - title: str │ +│ - outputFormat: str (per-document) ← NEW │ +│ - language: str (per-document) ← NEW │ +│ - chapters: List[Dict] │ +│ * id: str │ +│ * level: int │ +│ * title: str │ +│ * generationHint: str │ +│ * contentParts: List[str] (ContentPart IDs) │ +│ │ +│ HANDOVER STATE: │ +│ - chapterStructure: Complete structure with ContentPart │ +│ assignments │ +│ - Per-document format/language determined │ +└─────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────┐ +│ PHASE 4: Structure Filling │ +│ ────────────────────────────────────────────────────────────────── │ +│ INPUT: │ +│ - chapterStructure: Dict │ +│ - finalContentParts: List[ContentPart] │ +│ - userPrompt: str │ +│ │ +│ THROUGHPUT: │ +│ For each chapter: │ +│ 1. Generate sections structure (parallel) │ +│ 2. For each section: │ +│ a. Check if ContentParts need Vision AI extraction │ +│ b. If yes: Call Vision AI (Phase 2 deferred extraction) │ +│ c. Determine prompt type: │ +│ - WITH CONTENT: If contentParts assigned │ +│ → Use aggregation prompt (isAggregation=True) │ +│ → ContentParts passed as parameters │ +│ - WITHOUT CONTENT: If no contentParts │ +│ → Use generation prompt (isAggregation=False) │ +│ → Only generationHint in prompt │ +│ d. Generate section content with AI │ +│ │ +│ OUTPUT: │ +│ - filledStructure: Dict │ +│ * documents: List[Dict] │ +│ - chapters: List[Dict] │ +│ * sections: List[Dict] │ +│ - id: str │ +│ - content_type: str │ +│ - elements: List[Dict] │ +│ * type: str │ +│ * content: str (or base64 for images) │ +│ │ +│ HANDOVER STATE: │ +│ - filledStructure: Complete content, ready for rendering │ +│ - All Vision AI extractions completed │ +└─────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────┐ +│ PHASE 5: Document Rendering │ +│ ────────────────────────────────────────────────────────────────── │ +│ INPUT: │ +│ - filledStructure: Dict │ +│ - per-document outputFormat (from Phase 3) │ +│ - per-document language (from Phase 3) │ +│ │ +│ THROUGHPUT: │ +│ 1. Group sections by document (from structure) │ +│ 2. For each document: │ +│ a. Use per-document outputFormat │ +│ b. Use per-document language │ +│ c. Render document in specified format │ +│ │ +│ OUTPUT: │ +│ - renderedDocuments: List[DocumentData] │ +│ * documentName: str │ +│ * documentData: bytes │ +│ * mimeType: str │ +│ │ +│ HANDOVER STATE: │ +│ - renderedDocuments: Final output ready for user │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +### 10.2 Detailed Implementation Steps + +#### Step 1: Update DocumentIntent Model + +**File**: `gateway/modules/datamodels/datamodelExtraction.py` + +**Changes**: +```python +class DocumentIntent(BaseModel): + documentId: str + intents: List[str] # ["extract", "render", "reference"] + extractionPrompt: Optional[str] = None + outputFormat: Optional[str] = None # ← NEW: Per-document format + language: Optional[str] = None # ← NEW: Per-document language + reasoning: str +``` + +**Rationale**: +- Enables per-document output format and language determination +- Aligns with existing language handling pattern +- Allows AI to determine format/language based on document purpose + +#### Step 2: Update Intent Analysis Prompt + +**File**: `gateway/modules/services/serviceAi/subDocumentIntents.py` + +**Changes**: + +1. **Add fencing around userPrompt** (Security Fix): +```python +def _buildIntentAnalysisPrompt( + self, + userPrompt: str, + documents: List[ChatDocument], + actionParameters: Dict[str, Any] +) -> str: + # FENCE user input to prevent prompt injection + fencedUserPrompt = f"""```user_request +{userPrompt} +```""" + + prompt = f"""USER REQUEST: +{fencedUserPrompt} + +DOCUMENTS TO ANALYZE: +{docListText} + +TASK: For each document, determine: +1. Intents (can be multiple): "extract", "render", "reference" +2. Output format (optional): If document should be rendered in specific format +3. Language (optional): If document content should be in specific language + +OUTPUT FORMAT: {outputFormat} (global fallback) + +RETURN JSON: +{{ + "intents": [ + {{ + "documentId": "doc_1", + "intents": ["extract"], + "extractionPrompt": "Extract all text content", + "outputFormat": "pdf", // ← NEW: Optional, per-document + "language": "de", // ← NEW: Optional, per-document + "reasoning": "..." + }} + ] +}} +""" +``` + +2. **Remove global outputFormat from prompt** (or keep as fallback only): + - Output format should be determined per document based on intent + - Global format remains as fallback if not specified per document + +#### Step 3: Update ContentPart Metadata Propagation + +**File**: `gateway/modules/services/serviceAi/subContentExtraction.py` + +**Changes**: +```python +async def extractAndPrepareContent( + self, + documents: List[ChatDocument], + documentIntents: List[DocumentIntent], + parentOperationId: str, + getIntentForDocument: callable +) -> List[ContentPart]: + # ... existing extraction logic ... + + # When creating ContentParts, propagate outputFormat and language from DocumentIntent + for part in allContentParts: + intent = getIntentForDocument(part.metadata.get("documentId"), documentIntents) + if intent: + # Propagate per-document format and language to ContentPart + if intent.outputFormat: + part.metadata["outputFormat"] = intent.outputFormat + if intent.language: + part.metadata["language"] = intent.language +``` + +**Rationale**: +- ContentParts carry format/language information through pipeline +- Enables per-document rendering in Phase 5 + +#### Step 4: Update Structure Generation + +**File**: `gateway/modules/services/serviceAi/subStructureGeneration.py` + +**Changes**: + +1. **Determine per-document format/language from ContentParts**: +```python +def generateStructure( + self, + userPrompt: str, + contentParts: List[ContentPart], + outputFormat: str, # Global fallback + language: str, # Global fallback + parentOperationId: str +) -> Dict[str, Any]: + # Group ContentParts by documentId + partsByDocument = {} + for part in contentParts: + docId = part.metadata.get("documentId", "default") + if docId not in partsByDocument: + partsByDocument[docId] = [] + partsByDocument[docId].append(part) + + # Determine per-document format and language + documentFormats = {} + documentLanguages = {} + for docId, parts in partsByDocument.items(): + # Get format from first ContentPart (all parts from same doc should have same format) + docFormat = parts[0].metadata.get("outputFormat") or outputFormat + docLanguage = parts[0].metadata.get("language") or language + documentFormats[docId] = docFormat + documentLanguages[docId] = docLanguage + + # Update prompt to include per-document format/language + prompt = self._buildStructureGenerationPrompt( + userPrompt=userPrompt, + contentParts=contentParts, + documentFormats=documentFormats, # ← NEW + documentLanguages=documentLanguages, # ← NEW + globalOutputFormat=outputFormat, # Fallback + globalLanguage=language # Fallback + ) +``` + +2. **Update prompt to include per-document format/language**: +```python +def _buildStructureGenerationPrompt( + self, + userPrompt: str, + contentParts: List[ContentPart], + documentFormats: Dict[str, str], # ← NEW + documentLanguages: Dict[str, str], # ← NEW + globalOutputFormat: str, + globalLanguage: str +) -> str: + # ... existing prompt building ... + + # Add per-document format/language information + formatLanguageInfo = "\n## PER-DOCUMENT OUTPUT FORMATS AND LANGUAGES\n" + for docId, docFormat in documentFormats.items(): + docLanguage = documentLanguages.get(docId, globalLanguage) + formatLanguageInfo += f"- Document {docId}: Format={docFormat}, Language={docLanguage}\n" + + prompt += formatLanguageInfo + + prompt += """ +## DOCUMENT LANGUAGE +- Each document can have its own language (ISO 639-1 code: "de", "en", "fr", etc.) +- Per-document languages are listed above +- If not specified, use global language: "{globalLanguage}" + +## OUTPUT FORMAT +- Each document can have its own output format +- Per-document formats are listed above +- If not specified, use global format: "{globalOutputFormat}" +""" +``` + +#### Step 5: Update Structure Filling - Two Prompt Types + +**File**: `gateway/modules/services/serviceAi/subStructureFilling.py` + +**Changes**: + +1. **Ensure two prompt types are used** (already implemented, verify): +```python +async def _fillSingleSection( + self, + section: Dict[str, Any], + contentParts: List[ContentPart], + userPrompt: str, + generationHint: str, + # ... other params ... +) -> List[Dict[str, Any]]: + contentPartIds = section.get("contentPartIds", []) + hasContentParts = len(contentPartIds) > 0 + + if hasContentParts: + # PROMPT TYPE 1: WITH CONTENT (Aggregation) + # ContentParts passed as parameters, not in prompt text + isAggregation = True + relevantParts = [p for p in contentParts if p.id in contentPartIds] + + generationPrompt = self._buildSectionGenerationPrompt( + section=section, + contentParts=relevantParts, # Passed as parameters + userPrompt=userPrompt, + generationHint=generationHint, + isAggregation=True, # ← Key flag + language=language + ) + else: + # PROMPT TYPE 2: WITHOUT CONTENT (Generation) + # Only generationHint in prompt, no ContentParts + isAggregation = False + + generationPrompt = self._buildSectionGenerationPrompt( + section=section, + contentParts=[], # Empty + userPrompt=userPrompt, + generationHint=generationHint, + isAggregation=False, # ← Key flag + language=language + ) +``` + +2. **Verify `_buildSectionGenerationPrompt` handles both cases**: +```python +def _buildSectionGenerationPrompt( + self, + section: Dict[str, Any], + contentParts: List[ContentPart], + userPrompt: str, + generationHint: str, + isAggregation: bool, # ← Determines prompt type + language: str +) -> str: + if isAggregation: + # TYPE 1: WITH CONTENT + # ContentParts are passed as parameters to AI call + # Don't include full content in prompt text (token efficiency) + prompt = f"""Generate content for section based on provided ContentParts. + +Section: {sectionTitle} +Generation Hint: {generationHint} +Language: {language} + +ContentParts are provided as parameters (not shown in prompt for efficiency). +Use the ContentParts data to generate the section content. +""" + else: + # TYPE 2: WITHOUT CONTENT + # Only generationHint, no ContentParts + prompt = f"""Generate content for section based on generation hint. + +Section: {sectionTitle} +Generation Hint: {generationHint} +Language: {language} + +Generate content based on the generation hint without referencing external content. +""" +``` + +**Rationale**: +- **Type 1 (with content)**: Efficient for large content (ContentParts as parameters) +- **Type 2 (without content)**: Simple generation based on hint only +- Already implemented via `isAggregation` flag, verify it's used correctly + +#### Step 6: Update Document Rendering + +**File**: `gateway/modules/services/serviceGeneration/paths/documentPath.py` + +**Changes**: +```python +async def renderDocuments( + self, + filledStructure: Dict[str, Any], + outputFormat: str, # Global fallback + language: str # Global fallback +) -> List[DocumentData]: + renderedDocuments = [] + + for doc in filledStructure.get("documents", []): + docId = doc.get("id") + docFormat = doc.get("outputFormat") or outputFormat # ← Use per-document format + docLanguage = doc.get("language") or language # ← Use per-document language + + # Render document with per-document format and language + renderedDoc = await self._renderSingleDocument( + doc=doc, + outputFormat=docFormat, + language=docLanguage + ) + renderedDocuments.append(renderedDoc) + + return renderedDocuments +``` + +#### Step 7: Update ai.process to Pass documentList + +**File**: `gateway/modules/workflows/methods/methodAi/actions/process.py` + +**Changes**: +```python +# Phase 7.3: Pass both documentList and contentParts to AI service +# (Remove extraction logic from here - handled by AI service) + +# Use unified callAiContent method with BOTH parameters +aiResponse = await self.services.ai.callAiContent( + prompt=aiPrompt, + options=options, + documentList=documentList, # ← PASS documentList (was missing) + contentParts=contentParts, # ← PASS contentParts + outputFormat=output_format, + parentOperationId=operationId, + generationIntent=generationIntent +) +``` + +**Rationale**: +- Centralizes extraction logic in AI service +- Enables intelligent merging with deduplication +- Consistent behavior across all code paths + +### 10.3 Handover State Definitions + +#### State 1: After Intent Clarification +```python +class IntentClarificationState: + documentIntents: List[DocumentIntent] # Complete intent analysis + documents: List[ChatDocument] # Resolved documents + preExtractedMapping: Dict[str, str] # Map[originalDocId, jsonDocId] + + # Validation + assert len(documentIntents) == len(documents) # One intent per document + assert all(intent.documentId in [d.id for d in documents] for intent in documentIntents) +``` + +#### State 2: After Content Extraction +```python +class ContentExtractionState: + finalContentParts: List[ContentPart] # All content parts ready + + # Validation + assert all(part.metadata.get("documentId") for part in finalContentParts) + assert all(part.metadata.get("contentFormat") in ["extracted", "object", "reference"] + for part in finalContentParts) + # All documents either extracted or pre-extracted + assert len(set(p.metadata.get("documentId") for p in finalContentParts)) == len(documents) +``` + +#### State 3: After Structure Generation +```python +class StructureGenerationState: + chapterStructure: Dict[str, Any] # Complete structure + + # Validation + assert "documents" in chapterStructure + for doc in chapterStructure["documents"]: + assert "outputFormat" in doc # Per-document format + assert "language" in doc # Per-document language + assert "chapters" in doc + for chapter in doc["chapters"]: + assert "contentParts" in chapter # ContentPart assignments +``` + +#### State 4: After Structure Filling +```python +class StructureFillingState: + filledStructure: Dict[str, Any] # Complete content + + # Validation + assert "documents" in filledStructure + for doc in filledStructure["documents"]: + for chapter in doc.get("chapters", []): + for section in chapter.get("sections", []): + assert "elements" in section # Generated elements + # All Vision AI extractions completed + assert not any(p.metadata.get("needsVisionExtraction") + for p in contentParts) +``` + +#### State 5: After Document Rendering +```python +class DocumentRenderingState: + renderedDocuments: List[DocumentData] # Final output + + # Validation + assert len(renderedDocuments) > 0 + for doc in renderedDocuments: + assert doc.documentData # Non-empty + assert doc.mimeType # Valid MIME type +``` + +### 10.4 Migration Checklist + +#### Phase 1: Model Updates +- [ ] Add `outputFormat` and `language` to `DocumentIntent` model +- [ ] Update intent analysis prompt parser to handle new fields +- [ ] Add validation for new fields + +#### Phase 2: Intent Analysis Updates +- [ ] **CRITICAL**: Add fencing around `userPrompt` in intent analysis prompt +- [ ] Update prompt to ask for per-document format/language +- [ ] Update prompt to remove global outputFormat dependency (or keep as fallback) +- [ ] Test with various user inputs (special chars, JSON, newlines) + +#### Phase 3: Content Extraction Updates +- [ ] Propagate `outputFormat` and `language` from `DocumentIntent` to `ContentPart.metadata` +- [ ] Verify pre-extracted JSON handling preserves format/language +- [ ] Test merging logic with format/language propagation + +#### Phase 4: Structure Generation Updates +- [ ] Group ContentParts by documentId +- [ ] Determine per-document format/language from ContentPart metadata +- [ ] Update structure generation prompt to include per-document info +- [ ] Update structure output to include per-document format/language + +#### Phase 5: Structure Filling Verification +- [ ] Verify two prompt types are correctly used: + - [ ] `isAggregation=True`: ContentParts as parameters + - [ ] `isAggregation=False`: Only generationHint +- [ ] Test both prompt types with various scenarios +- [ ] Verify Vision AI extraction happens during filling phase + +#### Phase 6: Document Rendering Updates +- [ ] Use per-document format from structure +- [ ] Use per-document language from structure +- [ ] Fallback to global format/language if not specified +- [ ] Test multi-document rendering with different formats/languages + +#### Phase 7: ai.process Refactoring +- [ ] Remove extraction logic from `ai.process` +- [ ] Pass `documentList` to `callAiContent()` +- [ ] Pass `contentParts` to `callAiContent()` +- [ ] Verify intelligent merging in AI service works correctly + +#### Phase 8: Testing +- [ ] Test with pre-extracted JSON documents +- [ ] Test with mixed `documentList` + `contentParts` +- [ ] Test per-document format/language determination +- [ ] Test two prompt types in structure filling +- [ ] Test multi-document output with different formats/languages +- [ ] Test security: prompt injection attempts with fenced input + +#### Phase 9: Documentation +- [ ] Update API documentation +- [ ] Update developer documentation +- [ ] Update user documentation (if applicable) + +--- + +## End of Analysis + +This document provides a comprehensive overview of the content extraction and processing logic in the `ai.process` action. For implementation details, refer to the source files referenced throughout this document. + +**Note**: The "Recommendations and Next Steps" section (Section 9) will be expanded with additional findings and improvements as analysis continues. diff --git a/modules/services/serviceAi/subAiCallLooping.py b/modules/services/serviceAi/subAiCallLooping.py index 62c91ce6..3d4a0866 100644 --- a/modules/services/serviceAi/subAiCallLooping.py +++ b/modules/services/serviceAi/subAiCallLooping.py @@ -17,6 +17,7 @@ from modules.datamodels.datamodelExtraction import ContentPart from modules.shared.jsonUtils import buildContinuationContext, extractJsonString, tryParseJson from modules.services.serviceAi.subJsonResponseHandling import JsonResponseHandler from modules.services.serviceAi.subLoopingUseCases import LoopingUseCaseRegistry +from modules.workflows.processing.shared.stateTools import checkWorkflowStopped logger = logging.getLogger(__name__) @@ -134,6 +135,7 @@ class AiCallLooper: # Make AI call try: + checkWorkflowStopped(self.services) if iterationOperationId: self.services.chat.progressLogUpdate(iterationOperationId, 0.3, "Calling AI model") # ARCHITECTURE: Pass ContentParts directly to AiCallRequest @@ -621,6 +623,7 @@ If no trackable items can be identified, return: {{"kpis": []}} # Write KPI definition prompt to debug file self.services.utils.writeDebugFile(kpiDefinitionPrompt, f"{debugPrefix}_kpi_definition_prompt") + checkWorkflowStopped(self.services) response = await self.aiService.callAi(request) # Write KPI definition response to debug file diff --git a/modules/services/serviceAi/subContentExtraction.py b/modules/services/serviceAi/subContentExtraction.py index 3eff0855..40bf5bf5 100644 --- a/modules/services/serviceAi/subContentExtraction.py +++ b/modules/services/serviceAi/subContentExtraction.py @@ -16,6 +16,7 @@ from typing import Dict, Any, List, Optional from modules.datamodels.datamodelChat import ChatDocument from modules.datamodels.datamodelExtraction import ContentPart, DocumentIntent +from modules.workflows.processing.shared.stateTools import checkWorkflowStopped logger = logging.getLogger(__name__) @@ -70,6 +71,7 @@ class ContentExtractor: allContentParts = [] for document in documents: + checkWorkflowStopped(self.services) # Check if document is already a ContentExtracted document (pre-extracted JSON) logger.debug(f"Checking document {document.id} ({document.fileName}, mimeType={document.mimeType}) for pre-extracted content") preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(document) @@ -92,12 +94,28 @@ class ContentExtractor: logger.warning(f" ⚠️ No intent found for pre-extracted document {document.id}! Available intent documentIds: {[i.documentId for i in documentIntents]}") if contentExtracted.parts: + # CRITICAL: Process pre-extracted parts - analyze structure parts for nested content + processedParts = [] for part in contentExtracted.parts: # Überspringe leere Parts (Container ohne Daten) if not part.data or (isinstance(part.data, str) and len(part.data.strip()) == 0): if part.typeGroup == "container": continue # Überspringe leere Container + # CRITICAL: Check if structure part contains nested parts (e.g., JSON with documentData.parts) + if part.typeGroup == "structure" and part.mimeType == "application/json" and part.data: + nestedParts = self._extractNestedPartsFromStructure(part, document, preExtracted, intent) + if nestedParts: + # Replace structure part with extracted nested parts + processedParts.extend(nestedParts) + logger.info(f"✅ Extracted {len(nestedParts)} nested parts from structure part {part.id}") + continue # Skip original structure part + + # Keep original part if no nested parts found + processedParts.append(part) + + # Use processed parts (with nested parts extracted) + for part in processedParts: if not part.metadata: part.metadata = {} @@ -352,6 +370,7 @@ class ContentExtractor: ) # extractContent ist nicht async - keine await nötig + checkWorkflowStopped(self.services) extractedResults = self.services.extraction.extractContent( [document], extractionOptions, @@ -431,6 +450,7 @@ class ContentExtractor: ) # Verwende AI-Service für Vision AI-Verarbeitung + checkWorkflowStopped(self.services) response = await self.aiService.callAi(request) # Debug-Log für Response (harmonisiert) @@ -504,6 +524,7 @@ class ContentExtractor: ) # Verwende AI-Service für Text-Verarbeitung + checkWorkflowStopped(self.services) response = await self.aiService.callAi(request) # Debug-Log für Response (harmonisiert) @@ -537,4 +558,84 @@ class ContentExtractor: "application/x-zip-compressed" ] return mimeType in binaryTypes or mimeType.startswith("image/") or mimeType.startswith("video/") or mimeType.startswith("audio/") + + def _extractNestedPartsFromStructure( + self, + structurePart: ContentPart, + document: ChatDocument, + preExtracted: Dict[str, Any], + intent: Optional[Any] + ) -> List[ContentPart]: + """ + Extract nested parts from a structure ContentPart (e.g., JSON with documentData.parts). + + This is a generic function that analyzes pre-processed ContentParts and extracts + any nested parts that are embedded in structure data (typically JSON). + + Works with standard ContentExtracted format: documentData.parts array. + Each nested part is extracted as a separate ContentPart with proper metadata. + + Args: + structurePart: ContentPart with typeGroup="structure" containing nested parts + document: The document this part belongs to + preExtracted: Pre-extracted document metadata + intent: Document intent for nested parts + + Returns: + List of extracted ContentParts, empty if no nested parts found + """ + nestedParts = [] + + try: + # Parse JSON structure + jsonData = json.loads(structurePart.data) + + # Check for standard ContentExtracted format: documentData.parts + if isinstance(jsonData, dict): + documentData = jsonData.get("documentData") + if isinstance(documentData, dict): + parts = documentData.get("parts", []) + if isinstance(parts, list) and len(parts) > 0: + # Extract each nested part + for nestedPartData in parts: + if not isinstance(nestedPartData, dict): + continue + + nestedPartId = nestedPartData.get("id") or f"nested_{len(nestedParts)}" + nestedTypeGroup = nestedPartData.get("typeGroup", "text") + nestedMimeType = nestedPartData.get("mimeType", "text/plain") + nestedLabel = nestedPartData.get("label", structurePart.label) + nestedData = nestedPartData.get("data", "") + nestedMetadata = nestedPartData.get("metadata", {}) + + # Create ContentPart for nested part + nestedPart = ContentPart( + id=f"{structurePart.id}_{nestedPartId}", + parentId=structurePart.id, + label=nestedLabel, + typeGroup=nestedTypeGroup, + mimeType=nestedMimeType, + data=nestedData, + metadata={ + **nestedMetadata, + "documentId": document.id, + "fromNestedStructure": True, + "parentStructurePartId": structurePart.id, + "originalFileName": preExtracted["originalDocument"]["fileName"] + } + ) + + nestedParts.append(nestedPart) + logger.debug(f"✅ Extracted nested part: {nestedPart.id} (typeGroup={nestedTypeGroup}, mimeType={nestedMimeType})") + + # If no nested parts found, return empty list (original part will be kept) + if not nestedParts: + logger.debug(f"No nested parts found in structure part {structurePart.id}") + + except json.JSONDecodeError as e: + logger.warning(f"Could not parse structure part {structurePart.id} as JSON: {str(e)}") + except Exception as e: + logger.error(f"Error extracting nested parts from structure part {structurePart.id}: {str(e)}") + + return nestedParts diff --git a/modules/services/serviceAi/subDocumentIntents.py b/modules/services/serviceAi/subDocumentIntents.py index c1faba39..d81f6e4e 100644 --- a/modules/services/serviceAi/subDocumentIntents.py +++ b/modules/services/serviceAi/subDocumentIntents.py @@ -14,6 +14,7 @@ from typing import Dict, Any, List, Optional from modules.datamodels.datamodelChat import ChatDocument from modules.datamodels.datamodelExtraction import DocumentIntent +from modules.workflows.processing.shared.stateTools import checkWorkflowStopped logger = logging.getLogger(__name__) @@ -86,6 +87,7 @@ class DocumentIntentAnalyzer: # AI-Call (verwende callAiPlanning für einfache JSON-Responses) # Debug-Logs werden bereits von callAiPlanning geschrieben + checkWorkflowStopped(self.services) aiResponse = await self.aiService.callAiPlanning( prompt=intentPrompt, debugType="document_intent_analysis" diff --git a/modules/services/serviceAi/subStructureFilling.py b/modules/services/serviceAi/subStructureFilling.py index fae9a5bf..86bcf04d 100644 --- a/modules/services/serviceAi/subStructureFilling.py +++ b/modules/services/serviceAi/subStructureFilling.py @@ -16,6 +16,7 @@ from typing import Dict, Any, List, Optional, Tuple from modules.datamodels.datamodelExtraction import ContentPart from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum, PriorityEnum, ProcessingModeEnum +from modules.workflows.processing.shared.stateTools import checkWorkflowStopped logger = logging.getLogger(__name__) @@ -51,6 +52,33 @@ class StructureFiller: pass return 'en' # Default fallback + def _getDocumentLanguage(self, structure: Dict[str, Any], documentId: str) -> str: + """ + Get language for a specific document from structure. + Falls back to user language if not specified. + + Args: + structure: The document structure with documents array + documentId: The ID of the document to get language for + + Returns: + ISO 639-1 language code (e.g., "de", "en", "fr") + """ + # Try to find document in structure + for doc in structure.get("documents", []): + if doc.get("id") == documentId: + docLanguage = doc.get("language") + if docLanguage: + return docLanguage + + # Fallback to metadata language + metadataLanguage = structure.get("metadata", {}).get("language") + if metadataLanguage: + return metadataLanguage + + # Fallback to user language + return self._getUserLanguage() + def _extractContentPartInfo(self, chapter: Dict[str, Any]) -> Tuple[List[str], Dict[str, Any]]: """ Extract contentPartIds and contentPartInstructions from chapter's contentParts structure. @@ -60,11 +88,15 @@ class StructureFiller: """ contentParts = chapter.get("contentParts", {}) contentPartIds = list(contentParts.keys()) - # Extract instructions (only entries with "instruction" field) + # Extract instructions (entries with "instruction" field) and captions (entries with "caption" field) contentPartInstructions = {} for partId, partInfo in contentParts.items(): - if isinstance(partInfo, dict) and "instruction" in partInfo: - contentPartInstructions[partId] = {"instruction": partInfo["instruction"]} + if isinstance(partInfo, dict): + if "instruction" in partInfo: + contentPartInstructions[partId] = {"instruction": partInfo["instruction"]} + elif "caption" in partInfo: + # For entries with only caption (no instruction), still add to dict so it's available + contentPartInstructions[partId] = {"caption": partInfo["caption"]} return contentPartIds, contentPartInstructions def _getContentPartCaption(self, chapter: Dict[str, Any], partId: str) -> Optional[str]: @@ -219,6 +251,7 @@ class StructureFiller: # AI-Call für Chapter-Struktur-Generierung # Note: Debug logging is handled by callAiPlanning + checkWorkflowStopped(self.services) aiResponse = await self.aiService.callAiPlanning( prompt=chapterPrompt, debugType=f"chapter_structure_{chapterId}" @@ -311,6 +344,10 @@ class StructureFiller: chapterIndex = 0 for doc in chapterStructure.get("documents", []): + docId = doc.get("id", "unknown") + # Get language for this specific document + docLanguage = self._getDocumentLanguage(chapterStructure, docId) + for chapter in doc.get("chapters", []): chapterIndex += 1 chapterId = chapter.get("id", "unknown") @@ -320,7 +357,8 @@ class StructureFiller: contentPartIds, contentPartInstructions = self._extractContentPartInfo(chapter) # Create task for parallel processing with semaphore - async def processChapterWithSemaphore(chapter, chapterIndex, chapterId, chapterLevel, chapterTitle, generationHint, contentPartIds, contentPartInstructions): + async def processChapterWithSemaphore(chapter, chapterIndex, chapterId, chapterLevel, chapterTitle, generationHint, contentPartIds, contentPartInstructions, docLanguage): + checkWorkflowStopped(self.services) async with semaphore: return await self._generateSingleChapterSectionsStructure( chapter=chapter, @@ -333,13 +371,13 @@ class StructureFiller: contentPartInstructions=contentPartInstructions, contentParts=contentParts, userPrompt=userPrompt, - language=language, + language=docLanguage, # Use document-specific language parentOperationId=parentOperationId, totalChapters=totalChapters ) task = processChapterWithSemaphore( - chapter, chapterIndex, chapterId, chapterLevel, chapterTitle, generationHint, contentPartIds, contentPartInstructions + chapter, chapterIndex, chapterId, chapterLevel, chapterTitle, generationHint, contentPartIds, contentPartInstructions, docLanguage ) chapterTasks.append((chapterIndex, chapter, task)) @@ -367,7 +405,8 @@ class StructureFiller: operationType: OperationTypeEnum, sectionId: str, generationHint: str, - generatedElements: List[Dict[str, Any]] + generatedElements: List[Dict[str, Any]], + section: Dict[str, Any] ) -> List[Dict[str, Any]]: """ Helper method to process AI response and extract elements. @@ -424,13 +463,16 @@ class StructureFiller: # Image already processed as JSON, skip pass elif base64Data: + # Get caption from section if available + caption = section.get("caption") or section.get("metadata", {}).get("caption") or "" elements.append({ "type": "image", "content": { "base64Data": base64Data, "altText": generationHint or "Generated image", - "caption": "" - } + "caption": caption # Use caption from section if available + }, + "caption": caption # Also at element level for compatibility }) logger.debug(f"Created proper JSON image structure with base64Data length: {len(base64Data)}") else: @@ -566,14 +608,26 @@ class StructureFiller: }) elif contentFormat == "object": if part.typeGroup == "image": - elements.append({ - "type": "image", - "content": { - "base64Data": part.data, - "altText": part.metadata.get("usageHint", part.label), - "caption": part.metadata.get("caption", "") - } - }) + # Validate that image data exists + if not part.data: + logger.warning(f"Section {sectionId}: Image ContentPart {part.id} has no data (object format). Skipping image element.") + elements.append({ + "type": "error", + "message": f"Image ContentPart {part.id} has no data", + "sectionId": sectionId + }) + else: + # Get caption from section (priority: section.caption > part.metadata.caption) + caption = section.get("caption") or section.get("metadata", {}).get("caption") or part.metadata.get("caption", "") + elements.append({ + "type": "image", + "content": { + "base64Data": part.data, + "altText": part.metadata.get("usageHint", part.label), + "caption": caption # Use caption from section + }, + "caption": caption # Also at element level for compatibility + }) else: elements.append({ "type": part.typeGroup, @@ -615,6 +669,7 @@ class StructureFiller: contentParts=[part] ) + checkWorkflowStopped(self.services) visionResponse = await self.aiService.callAi(visionRequest) # Write debug file for image extraction response @@ -715,6 +770,7 @@ class StructureFiller: processingMode=ProcessingModeEnum.DETAILED ) ) + checkWorkflowStopped(self.services) aiResponse = await self.aiService.callAi(request) generatedElements = [] @@ -773,6 +829,7 @@ The JSON should be a fragment that can be merged with the previous response.""" processingMode=ProcessingModeEnum.DETAILED ) + checkWorkflowStopped(self.services) aiResponseJson = await self.aiService.callAiWithLooping( prompt=generationPrompt, options=options, @@ -858,7 +915,8 @@ The JSON should be a fragment that can be merged with the previous response.""" operationType=operationType, sectionId=sectionId, generationHint=generationHint, - generatedElements=generatedElements + generatedElements=generatedElements, + section=section ) elements.extend(responseElements) @@ -1061,7 +1119,8 @@ The JSON should be a fragment that can be merged with the previous response.""" operationType=operationType, sectionId=sectionId, generationHint=generationHint, - generatedElements=generatedElements + generatedElements=generatedElements, + section=section ) elements.extend(responseElements) @@ -1106,14 +1165,26 @@ The JSON should be a fragment that can be merged with the previous response.""" elif contentFormat == "object": if part.typeGroup == "image": - elements.append({ - "type": "image", - "content": { - "base64Data": part.data, - "altText": part.metadata.get("usageHint", part.label), - "caption": part.metadata.get("caption", "") - } - }) + # Validate that image data exists + if not part.data: + logger.warning(f"Section {sectionId}: Image ContentPart {part.id} has no data (object format). Skipping image element.") + elements.append({ + "type": "error", + "message": f"Image ContentPart {part.id} has no data", + "sectionId": sectionId + }) + else: + # Get caption from section (priority: section.caption > part.metadata.caption) + caption = section.get("caption") or section.get("metadata", {}).get("caption") or part.metadata.get("caption", "") + elements.append({ + "type": "image", + "content": { + "base64Data": part.data, + "altText": part.metadata.get("usageHint", part.label), + "caption": caption # Use caption from section + }, + "caption": caption # Also at element level for compatibility + }) else: elements.append({ "type": part.typeGroup, @@ -1125,6 +1196,12 @@ The JSON should be a fragment that can be merged with the previous response.""" }) elif contentFormat == "extracted": + # CRITICAL: If useAiCall is true, extracted parts are used as input for AI generation + # and should NOT be added as elements. Only add extracted text as element if useAiCall is false. + if useAiCall: + # Extracted part will be used as input for AI call - skip adding as element + logger.debug(f"Section {sectionId}: Skipping extracted part {part.id} as element (useAiCall=true, will be used as AI input)") + # Continue to process this part for AI call, but don't add as element yet # Check if this is an image that needs Vision AI extraction originalPartId = part.id if (part.typeGroup == "image" and @@ -1143,6 +1220,7 @@ The JSON should be a fragment that can be merged with the previous response.""" contentParts=[part] ) + checkWorkflowStopped(self.services) visionResponse = await self.aiService.callAi(visionRequest) if visionResponse and visionResponse.content: @@ -1344,7 +1422,8 @@ The JSON should be a fragment that can be merged with the previous response.""" operationType=operationType, sectionId=sectionId, generationHint=generationHint, - generatedElements=generatedElements + generatedElements=generatedElements, + section=section ) elements.extend(responseElements) @@ -1373,24 +1452,114 @@ The JSON should be a fragment that can be merged with the previous response.""" ) else: # Füge extrahierten Content direkt hinzu (kein AI-Call) - if part.typeGroup == "image": - logger.debug(f"Processing section {sectionId}: Single extracted IMAGE part WITHOUT AI call") - elements.append({ - "type": "image", - "content": { - "base64Data": part.data, - "altText": part.metadata.get("usageHint", part.label), - "caption": part.metadata.get("caption", "") - } - }) + # CRITICAL: If content_type is "image", we must render an image, not extracted text + if contentType == "image": + # Section wants to display an image - find the image part + if part.typeGroup == "image": + # Direct image part - use it + logger.debug(f"Processing section {sectionId}: Single extracted IMAGE part WITHOUT AI call") + # Validate that image data exists + if not part.data: + logger.warning(f"Section {sectionId}: Image ContentPart {part.id} has no data (extracted format without AI call). Skipping image element.") + elements.append({ + "type": "error", + "message": f"Image ContentPart {part.id} has no data", + "sectionId": sectionId + }) + else: + # Get caption from section (priority: section.caption > part.metadata.caption) + caption = section.get("caption") or section.get("metadata", {}).get("caption") or part.metadata.get("caption", "") + elements.append({ + "type": "image", + "content": { + "base64Data": part.data, + "altText": part.metadata.get("usageHint", part.label), + "caption": caption # Use caption from section + }, + "caption": caption # Also at element level for compatibility + }) + elif part.typeGroup == "text" and part.metadata.get("sourceImagePartId"): + # This is a vision-extracted text part - find the original image object part + sourceImagePartId = part.metadata.get("sourceImagePartId") + logger.debug(f"Processing section {sectionId}: Found vision-extracted text part, looking for original image object part: {sourceImagePartId}") + + # Try to find the object part (format: "obj_...") + objectPartId = part.metadata.get("relatedObjectPartId") + objectPart = None + + if objectPartId: + objectPart = self._findContentPartById(objectPartId, contentParts) + + # If not found via metadata, search through all contentParts for object part + if not objectPart: + # Search for object part that references the source image part ID + for candidatePart in contentParts: + if (candidatePart.metadata.get("contentFormat") == "object" and + candidatePart.typeGroup == "image" and + sourceImagePartId in candidatePart.id): + objectPart = candidatePart + objectPartId = candidatePart.id + logger.debug(f"Section {sectionId}: Found object part {objectPartId} by searching all contentParts") + break + + if objectPart and objectPart.typeGroup == "image" and objectPart.data: + logger.info(f"Section {sectionId}: Found object part {objectPartId} for image rendering") + caption = section.get("caption") or section.get("metadata", {}).get("caption") or objectPart.metadata.get("caption", "") + elements.append({ + "type": "image", + "content": { + "base64Data": objectPart.data, + "altText": objectPart.metadata.get("usageHint", objectPart.label), + "caption": caption + }, + "caption": caption + }) + else: + logger.warning(f"Section {sectionId}: No object part found for vision-extracted text part {part.id} (sourceImagePartId={sourceImagePartId}), cannot render image") + elements.append({ + "type": "error", + "message": f"Cannot render image: no object part found for extracted text part (sourceImagePartId={sourceImagePartId})", + "sectionId": sectionId + }) + else: + logger.warning(f"Section {sectionId}: ContentPart {part.id} is not an image (typeGroup={part.typeGroup}), but section content_type is 'image'. Cannot render image.") + elements.append({ + "type": "error", + "message": f"Cannot render image: ContentPart is not an image type", + "sectionId": sectionId + }) else: - logger.debug(f"Processing section {sectionId}: Single extracted TEXT part WITHOUT AI call") - elements.append({ - "type": "extracted_text", - "content": part.data, - "source": part.metadata.get("documentId"), - "extractionPrompt": part.metadata.get("extractionPrompt") - }) + # content_type is not "image" - add extracted text as normal + if part.typeGroup == "image": + logger.debug(f"Processing section {sectionId}: Single extracted IMAGE part WITHOUT AI call") + # Validate that image data exists + if not part.data: + logger.warning(f"Section {sectionId}: Image ContentPart {part.id} has no data (extracted format without AI call). Skipping image element.") + elements.append({ + "type": "error", + "message": f"Image ContentPart {part.id} has no data", + "sectionId": sectionId + }) + else: + # Get caption from section (priority: section.caption > part.metadata.caption) + caption = section.get("caption") or section.get("metadata", {}).get("caption") or part.metadata.get("caption", "") + elements.append({ + "type": "image", + "content": { + "base64Data": part.data, + "altText": part.metadata.get("usageHint", part.label), + "caption": caption # Use caption from section + }, + "caption": caption # Also at element level for compatibility + }) + else: + logger.debug(f"Processing section {sectionId}: Single extracted TEXT part WITHOUT AI call") + elements.append({ + "type": "extracted_text", + "content": part.data, + "source": part.metadata.get("documentId"), + "extractionPrompt": part.metadata.get("extractionPrompt") + }) # Update progress after section completion chapterProgress = (sectionIndex + 1) / totalSections if totalSections > 0 else 1.0 @@ -1462,6 +1631,10 @@ The JSON should be a fragment that can be merged with the previous response.""" # Process chapters sequentially with chapter-level progress chapterIndex = 0 for doc in chapterStructure.get("documents", []): + docId = doc.get("id", "unknown") + # Get language for this specific document + docLanguage = self._getDocumentLanguage(chapterStructure, docId) + for chapter in doc.get("chapters", []): chapterIndex += 1 chapterId = chapter.get("id", "unknown") @@ -1483,7 +1656,8 @@ The JSON should be a fragment that can be merged with the previous response.""" sectionTasks = [] for sectionIndex, section in enumerate(sections): # Create task wrapper with semaphore for parallel processing - async def processSectionWithSemaphore(section, sectionIndex, totalSections, chapterIndex, totalChapters, chapterId, chapterOperationId, fillOperationId, contentParts, userPrompt, all_sections_list, language, calculateOverallProgress): + async def processSectionWithSemaphore(section, sectionIndex, totalSections, chapterIndex, totalChapters, chapterId, chapterOperationId, fillOperationId, contentParts, userPrompt, all_sections_list, docLanguage, calculateOverallProgress): + checkWorkflowStopped(self.services) async with sectionSemaphore: return await self._processSingleSection( section=section, @@ -1497,12 +1671,12 @@ The JSON should be a fragment that can be merged with the previous response.""" contentParts=contentParts, userPrompt=userPrompt, all_sections_list=all_sections_list, - language=language, + language=docLanguage, # Use document-specific language calculateOverallProgress=calculateOverallProgress ) task = processSectionWithSemaphore( - section, sectionIndex, totalSections, chapterIndex, totalChapters, chapterId, chapterOperationId, fillOperationId, contentParts, userPrompt, all_sections_list, language, calculateOverallProgress + section, sectionIndex, totalSections, chapterIndex, totalChapters, chapterId, chapterOperationId, fillOperationId, contentParts, userPrompt, all_sections_list, docLanguage, calculateOverallProgress ) sectionTasks.append((sectionIndex, section, task)) @@ -1675,15 +1849,30 @@ The JSON should be a fragment that can be merged with the previous response.""" for partId in contentPartIds: part = self._findContentPartById(partId, contentParts) if not part: + # Part not found - try to show info from chapter structure + partInfo = contentPartInstructions.get(partId, {}) + if partInfo: + logger.warning(f"Chapter {chapterId}: ContentPart {partId} not found in contentParts list, but has chapter structure info.") + contentPartsIndex += f"\n- ContentPart ID: {partId}\n" + if "instruction" in partInfo: + contentPartsIndex += f" Instruction: {partInfo['instruction']}\n" + if "caption" in partInfo: + contentPartsIndex += f" Caption: {partInfo['caption']}\n" + contentPartsIndex += f" Note: ContentPart not found in contentParts list (ID may be from nested structure)\n" continue contentFormat = part.metadata.get("contentFormat", "unknown") - instruction = contentPartInstructions.get(partId, {}).get("instruction", "Use content as needed") + partInfo = contentPartInstructions.get(partId, {}) + instruction = partInfo.get("instruction", "Use content as needed") + caption = partInfo.get("caption") contentPartsIndex += f"\n- ContentPart ID: {partId}\n" contentPartsIndex += f" Format: {contentFormat}\n" contentPartsIndex += f" Type: {part.typeGroup}\n" - contentPartsIndex += f" Instruction: {instruction}\n" + if instruction and instruction != "Use content as needed": + contentPartsIndex += f" Instruction: {instruction}\n" + if caption: + contentPartsIndex += f" Caption: {caption}\n" if not contentPartsIndex: contentPartsIndex = "\n(No content parts specified for this chapter)" @@ -1695,6 +1884,8 @@ LANGUAGE: Generate all content in {language.upper()} language. All text, titles, CHAPTER: {chapterTitle} (Level {chapterLevel}, ID: {chapterId}) GENERATION HINT: {generationHint} +**CRITICAL**: The chapter's generationHint above describes what content this chapter should generate. If the generationHint references documents/images/data, then EACH section that generates content for this chapter MUST assign the relevant ContentParts from AVAILABLE CONTENT PARTS below. + NOTE: Chapter already has a heading section. Do NOT generate a heading for the chapter title. ## SECTION INDEPENDENCE @@ -1705,7 +1896,18 @@ NOTE: Chapter already has a heading section. Do NOT generate a heading for the c AVAILABLE CONTENT PARTS: {contentPartsIndex} -CONTENT TYPES: table, bullet_list, heading, paragraph, code_block, image +## CONTENT ASSIGNMENT RULE - CRITICAL +If AVAILABLE CONTENT PARTS are listed above, then EVERY section that generates content related to those ContentParts MUST assign them explicitly. + +**Assignment logic:** +- If section generates text content ABOUT a ContentPart → assign "extracted" format ContentPart with appropriate instruction +- If section DISPLAYS a ContentPart → assign "object" format ContentPart +- If section's generationHint or purpose relates to a ContentPart listed above → it MUST have contentPartIds assigned +- If chapter's generationHint references documents/images/data AND section generates content for that chapter → section MUST assign relevant ContentParts +- Empty contentPartIds [] are only allowed if section generates content WITHOUT referencing any available ContentParts AND WITHOUT relating to chapter's generationHint + +## CONTENT TYPES +Available content types for sections: table, bullet_list, heading, paragraph, code_block, image useAiCall RULES: - useAiCall: true ONLY if ContentPart Format is "extracted" AND transformation needed @@ -1728,15 +1930,12 @@ RETURN JSON: ] }} -EXAMPLES (all content types): -- paragraph: {{"id": "s1", "content_type": "paragraph", "contentPartIds": ["extracted_1"], "generationHint": "Include full text", "useAiCall": false, "elements": []}} -- bullet_list: {{"id": "s2", "content_type": "bullet_list", "contentPartIds": ["extracted_1"], "generationHint": "Create bullet list", "useAiCall": true, "elements": []}} -- table: {{"id": "s3", "content_type": "table", "contentPartIds": ["extracted_1", "extracted_2"], "generationHint": "Create table", "useAiCall": true, "elements": []}} -- heading: {{"id": "s4", "content_type": "heading", "contentPartIds": ["extracted_1"], "generationHint": "Extract heading", "useAiCall": true, "elements": []}} -- code_block: {{"id": "s5", "content_type": "code_block", "contentPartIds": ["extracted_1"], "generationHint": "Format code", "useAiCall": true, "elements": []}} -- image: {{"id": "s6", "content_type": "image", "contentPartIds": ["obj_1"], "generationHint": "Display image", "caption": "Figure 1: Description of the image", "useAiCall": false, "elements": []}} -- reference: {{"id": "s7", "content_type": "paragraph", "contentPartIds": ["ref_1"], "generationHint": "Reference", "useAiCall": false, "elements": []}} -- NO CONTENT PARTS (generate from scratch): {{"id": "s8", "content_type": "paragraph", "contentPartIds": [], "generationHint": "Write a detailed professional paragraph explaining [specific topic or purpose]. Include [key points to cover]. Address [important aspects]. Conclude with [summary or recommendations].", "useAiCall": true, "elements": []}} +**MANDATORY CONTENT ASSIGNMENT CHECK:** +For each section, verify: +1. Are ContentParts listed in AVAILABLE CONTENT PARTS above? +2. Does this section's generationHint or purpose relate to those ContentParts? +3. If YES to both → section MUST have contentPartIds assigned (cannot be empty []) +4. Assign ContentPart IDs exactly as listed in AVAILABLE CONTENT PARTS above IMAGE SECTIONS: - For image sections, always provide a "caption" field with a descriptive caption for the image. @@ -1793,13 +1992,40 @@ Return only valid JSON. Do not include any explanatory text outside the JSON. contentPartsText += f" Source file: {part.metadata.get('originalFileName')}\n" if contentFormat == "extracted": - # Zeige Preview von extrahiertem Text (länger für besseren Kontext) - previewLength = 1000 - if part.data: - preview = part.data[:previewLength] + "..." if len(part.data) > previewLength else part.data - contentPartsText += f" Content preview:\n```\n{preview}\n```\n" + # CRITICAL: Check if this is binary/image data - NEVER include in text prompt! + isBinaryOrImage = ( + part.typeGroup == "image" or + part.typeGroup == "binary" or + (part.mimeType and ( + part.mimeType.startswith("image/") or + part.mimeType.startswith("video/") or + part.mimeType.startswith("audio/") or + self._isBinaryMimeType(part.mimeType) + )) or + # Heuristic check: if data looks like base64 (long string with base64 chars) + (part.data and isinstance(part.data, str) and + len(part.data) > 100 and + self._looksLikeBase64(part.data)) + ) + + if isBinaryOrImage: + # NEVER include binary/base64 data in text prompt - security risk and token explosion! + dataLength = len(part.data) if part.data else 0 + contentPartsText += f" Type: {part.typeGroup}\n" + contentPartsText += f" MIME type: {part.mimeType or 'unknown'}\n" + contentPartsText += f" Data size: {dataLength} chars (binary/base64 - not shown in prompt)\n" + if part.metadata.get("needsVisionExtraction"): + contentPartsText += f" Note: Will be processed with Vision AI\n" + if part.metadata.get("usageHint"): + contentPartsText += f" Usage hint: {part.metadata.get('usageHint')}\n" else: - contentPartsText += f" Content: (empty)\n" + # Only for text data: Show preview + previewLength = 1000 + if part.data: + preview = part.data[:previewLength] + "..." if len(part.data) > previewLength else part.data + contentPartsText += f" Content preview:\n```\n{preview}\n```\n" + else: + contentPartsText += f" Content: (empty)\n" elif contentFormat == "reference": contentPartsText += f" Reference: {part.metadata.get('documentReference')}\n" if part.metadata.get("usageHint"): @@ -1901,7 +2127,12 @@ Output requirements: {contextText if contextText else ""} """ else: - prompt = f"""# TASK: Generate Section Content + # Determine if we have ContentParts or need to generate from scratch + hasContentParts = len(validParts) > 0 + + if hasContentParts: + # EXTRACT MODE: Extract data from provided ContentParts + prompt = f"""# TASK: Extract Section Content from Provided Data LANGUAGE: Generate all content in {language.upper()} language. All text, titles, headings, paragraphs, and content must be written in {language.upper()}. @@ -1911,7 +2142,7 @@ LANGUAGE: Generate all content in {language.upper()} language. All text, titles, - Generation Hint: {generationHint} ## AVAILABLE CONTENT FOR THIS SECTION -{contentPartsText if contentPartsText else "(No content parts specified for this section)"} +{contentPartsText} ## INSTRUCTIONS 1. Extract data only from provided ContentParts. Never invent or generate data. @@ -1942,6 +2173,49 @@ Output requirements: {userPrompt} ``` +## CONTEXT +{contextText if contextText else ""} +""" + else: + # GENERATE MODE: Generate content from scratch based on generationHint + prompt = f"""# TASK: Generate Section Content + +LANGUAGE: Generate all content in {language.upper()} language. All text, titles, headings, paragraphs, and content must be written in {language.upper()}. + +## SECTION METADATA +- Section ID: {sectionId} +- Content Type: {contentType} +- Generation Hint: {generationHint} + +## INSTRUCTIONS +1. Generate content based on the Generation Hint above. +2. Create appropriate content that matches the content_type ({contentType}). +3. The content should be relevant to the USER REQUEST and fit the context of surrounding sections. +4. Return only valid JSON with "elements" array. +5. No HTML/styling: Plain text only, no markup. + +## OUTPUT FORMAT +Return a JSON object with this structure: + +{{ + "elements": [ + {{ + "type": "{contentType}", + "content": {contentStructureExample} + }} + ] +}} + +Output requirements: +- "content" must be an object (never a string) +- Return only valid JSON, no explanatory text +- Generate meaningful content based on the Generation Hint + +## USER REQUEST +``` +{userPrompt} +``` + ## CONTEXT {contextText if contextText else ""} """ @@ -2174,6 +2448,41 @@ Output requirements: } } + def _isBinaryMimeType(self, mimeType: str) -> bool: + """Check if MIME type is binary.""" + binaryTypes = [ + "application/octet-stream", + "application/pdf", + "application/zip", + "application/x-zip-compressed" + ] + return mimeType in binaryTypes + + def _looksLikeBase64(self, data: str) -> bool: + """ + Heuristic check if string looks like base64-encoded data. + + Base64 contains only: A-Z, a-z, 0-9, +, /, =, and whitespace. + If >95% of characters are base64 chars and no normal text patterns, likely base64. + """ + if not data or len(data) < 100: + return False + + base64Chars = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=\n\r\t ") + sample = data[:500] # Check first 500 chars + if not sample: + return False + + base64Ratio = sum(1 for c in sample if c in base64Chars) / len(sample) + + # If >95% base64 chars and no normal text patterns (like spaces between words) → likely base64 + # Base64 typically has very long strings without spaces or punctuation + hasNormalTextPatterns = any( + c in sample[:200] for c in ".,!?;:()[]{}\"'" + ) or " " in sample[:200] # Double spaces suggest text + + return base64Ratio > 0.95 and not hasNormalTextPatterns + def _findContentPartById(self, partId: str, contentParts: List[ContentPart]) -> Optional[ContentPart]: """Finde ContentPart nach ID.""" for part in contentParts: diff --git a/modules/services/serviceAi/subStructureGeneration.py b/modules/services/serviceAi/subStructureGeneration.py index a3db2072..f16bacd6 100644 --- a/modules/services/serviceAi/subStructureGeneration.py +++ b/modules/services/serviceAi/subStructureGeneration.py @@ -13,6 +13,7 @@ from typing import Dict, Any, List, Optional from modules.datamodels.datamodelExtraction import ContentPart from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum, PriorityEnum, ProcessingModeEnum +from modules.workflows.processing.shared.stateTools import checkWorkflowStopped logger = logging.getLogger(__name__) @@ -139,6 +140,7 @@ Continue generating the remaining chapters now. # NOTE: Do NOT pass contentParts here - we only need metadata for structure generation # The contentParts metadata is already included in the prompt (contentPartsIndex) # Actual content extraction happens later during section generation + checkWorkflowStopped(self.services) aiResponseJson = await self.aiService.callAiWithLooping( prompt=structurePrompt, options=options, @@ -259,36 +261,50 @@ This is a PLANNING task. Return EXACTLY ONE complete JSON object. Do not generat {userPrompt} ``` -LANGUAGE: Generate all content in {language.upper()} language. All text, titles, headings, paragraphs, and content must be written in {language.upper()}. +DEFAULT LANGUAGE: If no language is specified for a document, use "{language}" (from user prompt). Each document can have its own language specified in the "language" field. Use ISO 639-1 language codes in lowercase (e.g., "de", "en", "fr", "it"). ## AVAILABLE CONTENT PARTS {contentPartsIndex} -## CHAPTER INDEPENDENCE -- Each chapter is independent and self-contained -- One chapter does NOT have information about another chapter -- Each chapter must provide its own context and be understandable alone +## CONTENT ASSIGNMENT RULE - CRITICAL +If the user request mentions documents/images/data, then EVERY chapter that generates content related to those references MUST assign the relevant ContentParts explicitly. -## CONTENT ASSIGNMENT -- Assign ContentParts to chapters via contentParts object -- For data extraction, the type of a contentPart (image, text, etc.) is not relevant - only what is specified in the instruction matters -- Include all relevant parts from same source when needed for structured data extraction -- Each contentPart can have either: - - "instruction": For AI extraction prompts (how to process/extract from this part) - - "caption": For user-facing presentation (how to display/reference this part in the document) - - Both can be present if needed -- Chapters without contentParts can only generate generic content (not document-specific) +**Assignment logic:** +- If chapter DISPLAYS a document/image → assign "object" format ContentPart with "caption" +- If chapter generates text content ABOUT a document/image/data → assign ContentPart with "instruction": + - Prefer "extracted" format if available (contains analyzed/extracted content) + - If only "object" format is available, use "object" format with "instruction" (to write ABOUT the image/document) +- If chapter's generationHint or purpose relates to a document/image/data mentioned in user request → it MUST have ContentParts assigned +- Multiple chapters might assign the same ContentPart (e.g., one chapter displays image, another writes about it) +- Use ContentPart IDs exactly as listed in AVAILABLE CONTENT PARTS above +- Empty contentParts are only allowed if chapter generates content WITHOUT referencing any documents/images/data from the user request + +**CRITICAL RULE**: If the user request mentions BOTH: + a) Documents/images/data (listed in AVAILABLE CONTENT PARTS above), AND + b) Generic content types (article text, main content, body text, etc.) +Then chapters that generate those generic content types MUST assign the relevant ContentParts, because the content should relate to or be based on the provided documents/images/data. ## FORMATTING - Formatting is handled automatically - focus on content and structure only -## CHAPTER STRUCTURE -- chapter id, level (1, 2, 3, etc.), title -- contentParts: {{"partId": {{"instruction": "..."}} or {{"caption": "..."}} or both}} - Compact mapping of part IDs to their extraction instructions and/or presentation captions -- generationHint: Self-contained description that reflects the user's intent for the specific data. If contentParts is empty, must be detailed. If contentParts are present, the hint should guide how to extract and structure the data according to the user's requirements (e.g., specific columns, format, structure) +## CHAPTER STRUCTURE REQUIREMENTS +- Generate chapters based on USER REQUEST - analyze what structure the user wants +- Each chapter needs: id, level (1, 2, 3, etc.), title +- contentParts: {{"partId": {{"instruction": "..."}} or {{"caption": "..."}} or both}} - Assign ContentParts as required by CONTENT ASSIGNMENT RULE above +- generationHint: Description of what content to generate for this chapter +- The number of chapters depends on the user request - create only what is requested + +## DOCUMENT LANGUAGE +- Each document can have its own language (ISO 639-1 code in lowercase: "de", "en", "fr", "it", etc.) +- If no language is specified for a document, use the user prompt language: "{language}" +- The language determines in which language the content of that document will be generated +- Multiple documents can have different languages if needed +- Always use lowercase ISO 639-1 codes in the JSON output (e.g., "de", not "DE") ## OUTPUT FORMAT -Return JSON: +Generate the chapter structure based on the USER REQUEST above. The number and types of chapters depend entirely on what the user requested - do NOT copy the example structure below. + +EXAMPLE STRUCTURE (for reference only - adapt to user request): {{ "metadata": {{ "title": "Document Title", @@ -298,38 +314,39 @@ Return JSON: "id": "doc_1", "title": "Document Title", "filename": "document.{outputFormat}", + "language": "{language}", "chapters": [ {{ "id": "chapter_1", "level": 1, - "title": "Introduction", + "title": "Chapter Title", "contentParts": {{ - "part_ext_1": {{ - "instruction": "Use full extracted text" - }}, - "part_img_1": {{ - "instruction": "Analyze image for additional details" - }}, - "part_img_2": {{ - "instruction": "Analyze image for additional details", - "caption": "Figure 1: Overview diagram" + "extracted_part_id": {{ + "instruction": "Use extracted content..." }} }}, - "generationHint": "Create introduction section", - "sections": [] - }}, - {{ - "id": "chapter_2", - "level": 1, - "title": "Main Title", - "contentParts": {{}}, - "generationHint": "Create [specific content description] with [formatting details]. Include [required information]. Purpose: [explanation of what this chapter provides].", + "generationHint": "Description of chapter content", "sections": [] }} ] }}] }} +CRITICAL INSTRUCTIONS: +- Generate chapters based on USER REQUEST, NOT based on the example above +- The example shows the JSON structure format, NOT the required chapters +- Create only the chapters that match the user's request +- Adapt chapter titles and structure to match the user's specific request + +**MANDATORY CONTENT ASSIGNMENT CHECK:** +For each chapter, verify: +1. Does the user request mention documents/images/data? (e.g., "photo", "image", "document", "data", "based on", "about") +2. Does this chapter's generationHint, title, or purpose relate to those documents/images/data mentioned in step 1? + - Examples: "article about the photo", "text describing the image", "analysis of the document", "content based on the data" + - Even if chapter doesn't explicitly say "about the image", if user request mentions both the image AND this chapter's content type → relate them +3. If YES to both → chapter MUST have contentParts assigned (cannot be empty {{}}) +4. If ContentPart is "object" format and chapter needs to write ABOUT it → assign with "instruction" field, not just "caption" + OUTPUT FORMAT: Start with {{ and end with }}. Do NOT use markdown code fences (```json). Do NOT add explanatory text before or after the JSON. Return ONLY the JSON object itself. """ return prompt diff --git a/modules/services/serviceGeneration/paths/documentPath.py b/modules/services/serviceGeneration/paths/documentPath.py index cb2caae3..94cbda54 100644 --- a/modules/services/serviceGeneration/paths/documentPath.py +++ b/modules/services/serviceGeneration/paths/documentPath.py @@ -14,6 +14,7 @@ from modules.datamodels.datamodelWorkflow import AiResponse, AiResponseMetadata, from modules.datamodels.datamodelExtraction import ContentPart, DocumentIntent from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum from modules.datamodels.datamodelDocument import RenderedDocument +from modules.workflows.processing.shared.stateTools import checkWorkflowStopped logger = logging.getLogger(__name__) @@ -58,6 +59,35 @@ class DocumentGenerationPath: if documentList: documents = self.services.chat.getChatDocumentsFromDocumentList(documentList) + # Filter: Entferne Original-Dokumente, wenn bereits Pre-Extracted JSONs existieren + # (um Duplikate zu vermeiden - Pre-Extracted JSONs enthalten bereits die ContentParts) + # Schritt 1: Identifiziere alle Original-Dokument-IDs, die durch Pre-Extracted JSONs abgedeckt werden + originalDocIdsCoveredByPreExtracted = set() + for doc in documents: + preExtracted = self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc) + if preExtracted: + originalDocId = preExtracted["originalDocument"]["id"] + originalDocIdsCoveredByPreExtracted.add(originalDocId) + logger.debug(f"Found pre-extracted JSON {doc.id} covering original document {originalDocId}") + + # Schritt 2: Filtere Dokumente - entferne Original-Dokumente, die bereits durch Pre-Extracted JSONs abgedeckt werden + filteredDocuments = [] + for doc in documents: + preExtracted = self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc) + if preExtracted: + # Pre-Extracted JSON behalten + filteredDocuments.append(doc) + elif doc.id in originalDocIdsCoveredByPreExtracted: + # Original-Dokument, das bereits durch Pre-Extracted JSON abgedeckt wird - entfernen + logger.info(f"Skipping original document {doc.id} ({doc.fileName}) - already covered by pre-extracted JSON") + else: + # Normales Dokument ohne Pre-Extracted JSON - behalten + filteredDocuments.append(doc) + + documents = filteredDocuments + + checkWorkflowStopped(self.services) + if not documentIntents and documents: documentIntents = await self.services.ai.clarifyDocumentIntents( documents, @@ -66,6 +96,8 @@ class DocumentGenerationPath: docOperationId ) + checkWorkflowStopped(self.services) + # Schritt 5B: Extrahiere und bereite Content vor if documents: preparedContentParts = await self.services.ai.extractAndPrepareContent( @@ -91,6 +123,8 @@ class DocumentGenerationPath: if contentParts: logger.info(f"Using {len(contentParts)} content parts for generation (no AI extraction at this stage)") + checkWorkflowStopped(self.services) + # Schritt 5C: Generiere Struktur structure = await self.services.ai.generateStructure( userPrompt, @@ -99,6 +133,8 @@ class DocumentGenerationPath: docOperationId ) + checkWorkflowStopped(self.services) + # Schritt 5D: Fülle Struktur # Language will be extracted from services (user intention analysis) in fillStructure filledStructure = await self.services.ai.fillStructure( @@ -108,6 +144,8 @@ class DocumentGenerationPath: docOperationId ) + checkWorkflowStopped(self.services) + # Schritt 5E: Rendere Resultat # Jedes Dokument wird einzeln gerendert, kann 1..n Dateien zurückgeben (z.B. HTML + Bilder) renderedDocuments = await self.services.ai.renderResult( diff --git a/modules/services/serviceGeneration/renderers/rendererCsv.py b/modules/services/serviceGeneration/renderers/rendererCsv.py index 83ca41c1..15be4d96 100644 --- a/modules/services/serviceGeneration/renderers/rendererCsv.py +++ b/modules/services/serviceGeneration/renderers/rendererCsv.py @@ -71,8 +71,9 @@ class RendererCsv(BaseRenderer): sections = self._extractSections(jsonContent) metadata = self._extractMetadata(jsonContent) - # Use title from JSON metadata if available, otherwise use provided title - documentTitle = metadata.get("title", title) + # Use provided title (which comes from documents[].title) as primary source + # Fallback to metadata.title only if title parameter is empty + documentTitle = title if title else metadata.get("title", "Generated Document") # Generate CSV content csvRows = [] diff --git a/modules/services/serviceGeneration/renderers/rendererDocx.py b/modules/services/serviceGeneration/renderers/rendererDocx.py index 9d7eaaeb..b0f62394 100644 --- a/modules/services/serviceGeneration/renderers/rendererDocx.py +++ b/modules/services/serviceGeneration/renderers/rendererDocx.py @@ -121,8 +121,9 @@ class RendererDocx(BaseRenderer): sections = self._extractSections(json_content) metadata = self._extractMetadata(json_content) - # Use title from JSON metadata if available, otherwise use provided title - document_title = metadata.get("title", title) + # Use provided title (which comes from documents[].title) as primary source + # Fallback to metadata.title only if title parameter is empty + document_title = title if title else metadata.get("title", "Generated Document") # Add document title using Title style if document_title: diff --git a/modules/services/serviceGeneration/renderers/rendererHtml.py b/modules/services/serviceGeneration/renderers/rendererHtml.py index 1f013e50..1797af6d 100644 --- a/modules/services/serviceGeneration/renderers/rendererHtml.py +++ b/modules/services/serviceGeneration/renderers/rendererHtml.py @@ -107,8 +107,9 @@ class RendererHtml(BaseRenderer): sections = self._extractSections(jsonContent) metadata = self._extractMetadata(jsonContent) - # Use title from JSON metadata if available, otherwise use provided title - documentTitle = metadata.get("title", title) + # Use provided title (which comes from documents[].title) as primary source + # Fallback to metadata.title only if title parameter is empty + documentTitle = title if title else metadata.get("title", "Generated Document") # Build HTML document htmlParts = [] diff --git a/modules/services/serviceGeneration/renderers/rendererImage.py b/modules/services/serviceGeneration/renderers/rendererImage.py index 479881df..197560d1 100644 --- a/modules/services/serviceGeneration/renderers/rendererImage.py +++ b/modules/services/serviceGeneration/renderers/rendererImage.py @@ -86,8 +86,9 @@ class RendererImage(BaseRenderer): # Extract metadata from standardized schema metadata = self._extractMetadata(extractedContent) - # Use title from JSON metadata if available, otherwise use provided title - documentTitle = metadata.get("title", title) + # Use provided title (which comes from documents[].title) as primary source + # Fallback to metadata.title only if title parameter is empty + documentTitle = title if title else metadata.get("title", "Generated Document") # Create AI prompt for image generation imagePrompt = await self._createImageGeneratePrompt(extractedContent, documentTitle, userPrompt, aiService) diff --git a/modules/services/serviceGeneration/renderers/rendererMarkdown.py b/modules/services/serviceGeneration/renderers/rendererMarkdown.py index 84644485..2bdbf114 100644 --- a/modules/services/serviceGeneration/renderers/rendererMarkdown.py +++ b/modules/services/serviceGeneration/renderers/rendererMarkdown.py @@ -82,8 +82,9 @@ class RendererMarkdown(BaseRenderer): sections = self._extractSections(jsonContent) metadata = self._extractMetadata(jsonContent) - # Use title from JSON metadata if available, otherwise use provided title - documentTitle = metadata.get("title", title) + # Use provided title (which comes from documents[].title) as primary source + # Fallback to metadata.title only if title parameter is empty + documentTitle = title if title else metadata.get("title", "Generated Document") # Build markdown content markdownParts = [] diff --git a/modules/services/serviceGeneration/renderers/rendererPdf.py b/modules/services/serviceGeneration/renderers/rendererPdf.py index e27abce7..32aca32c 100644 --- a/modules/services/serviceGeneration/renderers/rendererPdf.py +++ b/modules/services/serviceGeneration/renderers/rendererPdf.py @@ -110,8 +110,9 @@ class RendererPdf(BaseRenderer): sections = self._extractSections(json_content) metadata = self._extractMetadata(json_content) - # Use title from JSON metadata if available, otherwise use provided title - document_title = metadata.get("title", title) + # Use provided title (which comes from documents[].title) as primary source + # Fallback to metadata.title only if title parameter is empty + document_title = title if title else metadata.get("title", "Generated Document") # Make title shorter to prevent wrapping/overlapping if len(document_title) > 40: diff --git a/modules/services/serviceGeneration/renderers/rendererPptx.py b/modules/services/serviceGeneration/renderers/rendererPptx.py index 5525ae89..ff3d005d 100644 --- a/modules/services/serviceGeneration/renderers/rendererPptx.py +++ b/modules/services/serviceGeneration/renderers/rendererPptx.py @@ -601,8 +601,9 @@ JSON ONLY. NO OTHER TEXT.""" sections = self._extractSections(json_content) metadata = self._extractMetadata(json_content) - # Use title from JSON metadata if available, otherwise use provided title - document_title = metadata.get("title", title) + # Use provided title (which comes from documents[].title) as primary source + # Fallback to metadata.title only if title parameter is empty + document_title = title if title else metadata.get("title", "Generated Document") # Create title slide slides.append({ diff --git a/modules/services/serviceGeneration/renderers/rendererText.py b/modules/services/serviceGeneration/renderers/rendererText.py index 116d73f4..52035014 100644 --- a/modules/services/serviceGeneration/renderers/rendererText.py +++ b/modules/services/serviceGeneration/renderers/rendererText.py @@ -104,8 +104,9 @@ class RendererText(BaseRenderer): sections = self._extractSections(jsonContent) metadata = self._extractMetadata(jsonContent) - # Use title from JSON metadata if available, otherwise use provided title - documentTitle = metadata.get("title", title) + # Use provided title (which comes from documents[].title) as primary source + # Fallback to metadata.title only if title parameter is empty + documentTitle = title if title else metadata.get("title", "Generated Document") # Build text content textParts = [] diff --git a/modules/services/serviceGeneration/renderers/rendererXlsx.py b/modules/services/serviceGeneration/renderers/rendererXlsx.py index d0074394..404abf31 100644 --- a/modules/services/serviceGeneration/renderers/rendererXlsx.py +++ b/modules/services/serviceGeneration/renderers/rendererXlsx.py @@ -290,8 +290,9 @@ class RendererXlsx(BaseRenderer): # Extract metadata from standardized schema metadata = self._extractMetadata(jsonContent) - # Use title from JSON metadata if available, otherwise use provided title - document_title = metadata.get("title", title) + # Use provided title (which comes from documents[].title) as primary source + # Fallback to metadata.title only if title parameter is empty + document_title = title if title else metadata.get("title", "Generated Document") # Create workbook wb = Workbook() @@ -689,7 +690,12 @@ class RendererXlsx(BaseRenderer): # If no level 1 headings found, use document title if not sheetNames: - documentTitle = jsonContent.get("metadata", {}).get("title", "Document") + # Use documents[].title as primary source, fallback to metadata.title + documents = jsonContent.get("documents", []) + if documents and isinstance(documents[0], dict) and documents[0].get("title"): + documentTitle = documents[0].get("title") + else: + documentTitle = jsonContent.get("metadata", {}).get("title", "Document") sheetNames.append(self._sanitizeSheetName(documentTitle)) return sheetNames @@ -825,8 +831,12 @@ class RendererXlsx(BaseRenderer): def _populateMainSheet(self, sheet, jsonContent: Dict[str, Any], styles: Dict[str, Any]): """Populate the main sheet with document overview and all content.""" try: - # Document title - documentTitle = jsonContent.get("metadata", {}).get("title", "Generated Report") + # Document title - use documents[].title as primary source, fallback to metadata.title + documents = jsonContent.get("documents", []) + if documents and isinstance(documents[0], dict) and documents[0].get("title"): + documentTitle = documents[0].get("title") + else: + documentTitle = jsonContent.get("metadata", {}).get("title", "Generated Report") sheet['A1'] = documentTitle # Safety check for title style diff --git a/modules/services/serviceGeneration/subContentGenerator.py b/modules/services/serviceGeneration/subContentGenerator.py index 2f90a09a..86464ef6 100644 --- a/modules/services/serviceGeneration/subContentGenerator.py +++ b/modules/services/serviceGeneration/subContentGenerator.py @@ -13,6 +13,7 @@ import re import traceback from typing import Dict, Any, Optional, List, Callable from modules.services.serviceGeneration.subContentIntegrator import ContentIntegrator +from modules.workflows.processing.shared.stateTools import checkWorkflowStopped logger = logging.getLogger(__name__) @@ -167,6 +168,7 @@ class ContentGenerator: contentPartsMap[partId] = part for idx, section in enumerate(sections): + checkWorkflowStopped(self.services) try: contentType = section.get("content_type", "content") sectionId = section.get("id", f"section_{idx}") @@ -229,7 +231,8 @@ class ContentGenerator: sections: List[Dict[str, Any]], cachedContent: Optional[Dict[str, Any]], userPrompt: str, - documentMetadata: Dict[str, Any], + contentParts: Optional[List[Any]] = None, + documentMetadata: Dict[str, Any] = {}, progressCallback: Optional[Callable] = None, batchSize: int = 10 ) -> List[Dict[str, Any]]: @@ -240,6 +243,7 @@ class ContentGenerator: sections: List of sections to generate cachedContent: Extracted content cache userPrompt: Original user prompt + contentParts: List of all available ContentParts (for mapping by contentPartIds) documentMetadata: Document metadata progressCallback: Progress callback function batchSize: Number of sections to process in parallel per batch @@ -253,6 +257,14 @@ class ContentGenerator: if totalSections == 0: return [] + # Create ContentParts lookup map by ID + contentPartsMap = {} + if contentParts: + for part in contentParts: + partId = part.id if hasattr(part, 'id') else part.get('id', '') + if partId: + contentPartsMap[partId] = part + # Adjust batch size based on section types (images take longer) imageCount = sum(1 for s in sections if s.get("content_type") == "image") if imageCount > 0: @@ -277,6 +289,7 @@ class ContentGenerator: ) async def generateWithProgress(section: Dict[str, Any], globalIndex: int, localIndex: int, batchPreviousSections: List[Dict[str, Any]]): + checkWorkflowStopped(self.services) try: contentType = section.get("content_type", "content") sectionId = section.get("id", f"section_{globalIndex}") @@ -422,6 +435,7 @@ class ContentGenerator: resultFormat="json" ) + checkWorkflowStopped(self.services) aiResponse = await self.services.ai.callAiContent( prompt=sectionPrompt, options=options, @@ -603,6 +617,59 @@ class ContentGenerator: ) -> Dict[str, Any]: """Generate image for image section or include existing image""" try: + # First, check if section has image ContentParts to integrate directly + sectionContentParts = context.get("sectionContentParts", []) + if sectionContentParts: + # Look for image ContentParts + for part in sectionContentParts: + partTypeGroup = part.typeGroup if hasattr(part, 'typeGroup') else part.get('typeGroup', '') + partMimeType = part.mimeType if hasattr(part, 'mimeType') else part.get('mimeType', '') + isImage = partTypeGroup == "image" or (partMimeType and partMimeType.startswith("image/")) + + if isImage: + # Extract image data from ContentPart + partData = part.data if hasattr(part, 'data') else part.get('data', '') + partId = part.id if hasattr(part, 'id') else part.get('id', '') + + # Get base64 data + base64Data = None + if isinstance(partData, str): + # Check if it's already base64 or needs extraction + if partData.startswith("data:image"): + # Extract base64 from data URL + base64Data = partData.split(",", 1)[1] if "," in partData else partData + elif len(partData) > 100: # Likely base64 string + base64Data = partData + elif isinstance(partData, bytes): + import base64 + base64Data = base64.b64encode(partData).decode('utf-8') + + if base64Data: + # Get caption from section (priority: section.caption > metadata.caption) + caption = section.get("caption") or section.get("metadata", {}).get("caption") + + # Get alt text from ContentPart metadata or section + altText = part.metadata.get("altText") if hasattr(part, 'metadata') else part.get('metadata', {}).get('altText') + if not altText: + altText = section.get("generation_hint", "Image") + + # Get mime type + mimeType = partMimeType or "image/png" + + # Create image element with caption + section["elements"] = [{ + "type": "image", + "content": { + "base64Data": base64Data, + "altText": altText, + "caption": caption # Include caption from section + }, + "caption": caption # Also at element level for compatibility + }] + + logger.info(f"Successfully integrated image from ContentPart {partId} for section {section.get('id')} with caption: {caption}") + return section + # Check if this is an existing image to include or render imageSource = section.get("image_source", "generate") @@ -623,12 +690,17 @@ class ContentGenerator: # Create image element from existing/render image altText = imageDoc.get("altText", section.get("generation_hint", "Image")) mimeType = imageDoc.get("mimeType", "image/png") + caption = section.get("caption") or section.get("metadata", {}).get("caption") + # Use nested content structure for consistency with renderers section["elements"] = [{ - "base64Data": imageDoc.get("base64Data"), - "altText": altText, - "mimeType": mimeType, - "caption": section.get("caption") or section.get("metadata", {}).get("caption") + "type": "image", + "content": { + "base64Data": imageDoc.get("base64Data"), + "altText": altText, + "caption": caption # Include caption in content structure + }, + "caption": caption # Also at element level for compatibility }] logger.info(f"Successfully integrated image {imageRefId} for section {section.get('id')} (source={imageSource})") @@ -666,6 +738,7 @@ class ContentGenerator: logger.info(f"Starting image generation for section {section.get('id')}: {imagePrompt[:100]}...") # Call AI for image generation + checkWorkflowStopped(self.services) aiResponse = await self.services.ai.callAiContent( prompt=promptJson, options=options, @@ -704,11 +777,15 @@ class ContentGenerator: caption = section.get("caption") or section.get("metadata", {}).get("caption") + # Use nested content structure for consistency with renderers section["elements"] = [{ - "url": f"data:image/png;base64,{base64Data}", - "base64Data": base64Data, - "altText": altText, - "caption": caption + "type": "image", + "content": { + "base64Data": base64Data, + "altText": altText, + "caption": caption # Include caption in content structure + }, + "caption": caption # Also at element level for compatibility }] logger.info(f"Successfully generated image for section {section.get('id')}") diff --git a/modules/workflows/methods/methodAi/actions/generateCode.py b/modules/workflows/methods/methodAi/actions/generateCode.py index 52e36316..4f9bbd21 100644 --- a/modules/workflows/methods/methodAi/actions/generateCode.py +++ b/modules/workflows/methods/methodAi/actions/generateCode.py @@ -17,25 +17,11 @@ async def generateCode(self, parameters: Dict[str, Any]) -> ActionResult: return ActionResult.isFailure(error="prompt is required") documentList = parameters.get("documentList", []) + # Optional: if omitted, formats determined from prompt by AI resultType = parameters.get("resultType") - # Auto-detect format from prompt if not provided if not resultType: - promptLower = prompt.lower() - if ".html" in promptLower or "html file" in promptLower: - resultType = "html" - elif ".js" in promptLower or "javascript" in promptLower: - resultType = "js" - elif ".py" in promptLower or "python" in promptLower: - resultType = "py" - elif ".ts" in promptLower or "typescript" in promptLower: - resultType = "ts" - elif ".java" in promptLower: - resultType = "java" - elif ".cpp" in promptLower or ".c++" in promptLower: - resultType = "cpp" - else: - resultType = "txt" # Default + logger.debug("resultType not provided - formats will be determined from prompt by AI") # Create operation ID for progress tracking workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}" @@ -67,11 +53,12 @@ async def generateCode(self, parameters: Dict[str, Any]) -> ActionResult: processingMode=ProcessingModeEnum.DETAILED ) + # outputFormat: Optional - if None, formats determined from prompt by AI aiResponse: AiResponse = await self.services.ai.callAiContent( prompt=prompt, options=options, documentList=docRefList, - outputFormat=resultType, + outputFormat=resultType, # Can be None - AI determines from prompt title=title, parentOperationId=parentOperationId, generationIntent="code" # Explicit intent, skips detection @@ -93,7 +80,8 @@ async def generateCode(self, parameters: Dict[str, Any]) -> ActionResult: # If no documents but content exists, create a document from content if not documents and aiResponse.content: # Determine document name from metadata - docName = f"code.{resultType}" + resultTypeFallback = resultType or "txt" # Fallback for file naming + docName = f"code.{resultTypeFallback}" if aiResponse.metadata and aiResponse.metadata.filename: docName = aiResponse.metadata.filename elif aiResponse.metadata and aiResponse.metadata.title: @@ -101,8 +89,8 @@ async def generateCode(self, parameters: Dict[str, Any]) -> ActionResult: sanitized = re.sub(r"[^a-zA-Z0-9._-]", "_", aiResponse.metadata.title) sanitized = re.sub(r"_+", "_", sanitized).strip("_") if sanitized: - if not sanitized.lower().endswith(f".{resultType}"): - docName = f"{sanitized}.{resultType}" + if not sanitized.lower().endswith(f".{resultTypeFallback}"): + docName = f"{sanitized}.{resultTypeFallback}" else: docName = sanitized diff --git a/modules/workflows/methods/methodAi/actions/generateDocument.py b/modules/workflows/methods/methodAi/actions/generateDocument.py index 4e67251b..65e95a32 100644 --- a/modules/workflows/methods/methodAi/actions/generateDocument.py +++ b/modules/workflows/methods/methodAi/actions/generateDocument.py @@ -18,23 +18,11 @@ async def generateDocument(self, parameters: Dict[str, Any]) -> ActionResult: documentList = parameters.get("documentList", []) documentType = parameters.get("documentType") - resultType = parameters.get("resultType", "txt") + # Optional: if omitted, formats determined from prompt by AI + resultType = parameters.get("resultType") - # Auto-detect format from prompt if not explicitly provided - if resultType == "txt" and prompt: - promptLower = prompt.lower() - if "html" in promptLower or "html5" in promptLower: - resultType = "html" - logger.info(f"Auto-detected HTML format from prompt") - elif "pdf" in promptLower: - resultType = "pdf" - logger.info(f"Auto-detected PDF format from prompt") - elif "markdown" in promptLower or " md " in promptLower or promptLower.endswith(" md"): - resultType = "md" - logger.info(f"Auto-detected Markdown format from prompt") - elif ("text" in promptLower or "txt" in promptLower) and "html" not in promptLower: - resultType = "txt" - logger.info(f"Auto-detected Text format from prompt") + if not resultType: + logger.debug("resultType not provided - formats will be determined from prompt by AI") # Create operation ID for progress tracking workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}" @@ -69,11 +57,12 @@ async def generateDocument(self, parameters: Dict[str, Any]) -> ActionResult: compressContext=False ) + # outputFormat: Optional - if None, formats determined from prompt by AI aiResponse: AiResponse = await self.services.ai.callAiContent( prompt=prompt, options=options, documentList=docRefList, # Übergebe documentList direkt - callAiContent macht Phasen 5A-5E - outputFormat=resultType, + outputFormat=resultType, # Can be None - AI determines from prompt title=title, parentOperationId=parentOperationId, generationIntent="document" # NEW: Explicit intent, skips detection @@ -95,7 +84,8 @@ async def generateDocument(self, parameters: Dict[str, Any]) -> ActionResult: # If no documents but content exists, create a document from content if not documents and aiResponse.content: # Determine document name from metadata - docName = f"document.{resultType}" + resultTypeFallback = resultType or "txt" # Fallback for file naming + docName = f"document.{resultTypeFallback}" if aiResponse.metadata and aiResponse.metadata.filename: docName = aiResponse.metadata.filename elif aiResponse.metadata and aiResponse.metadata.title: @@ -103,8 +93,8 @@ async def generateDocument(self, parameters: Dict[str, Any]) -> ActionResult: sanitized = re.sub(r"[^a-zA-Z0-9._-]", "_", aiResponse.metadata.title) sanitized = re.sub(r"_+", "_", sanitized).strip("_") if sanitized: - if not sanitized.lower().endswith(f".{resultType}"): - docName = f"{sanitized}.{resultType}" + if not sanitized.lower().endswith(f".{resultTypeFallback}"): + docName = f"{sanitized}.{resultTypeFallback}" else: docName = sanitized diff --git a/modules/workflows/methods/methodAi/actions/process.py b/modules/workflows/methods/methodAi/actions/process.py index 5f05afed..16cc3307 100644 --- a/modules/workflows/methods/methodAi/actions/process.py +++ b/modules/workflows/methods/methodAi/actions/process.py @@ -54,8 +54,8 @@ async def process(self, parameters: Dict[str, Any]) -> ActionResult: logger.error(f"Invalid documentList type: {type(documentListParam)}") documentList = DocumentReferenceList(references=[]) - resultType = parameters.get("resultType", "txt") - + # Optional: if omitted, formats determined from prompt. Default "txt" is validation fallback only. + resultType = parameters.get("resultType") if not aiPrompt: logger.error(f"aiPrompt is missing or empty. Parameters: {parameters}") @@ -63,11 +63,20 @@ async def process(self, parameters: Dict[str, Any]) -> ActionResult: error="AI prompt is required" ) - # Determine output extension and default MIME type without duplicating service logic - normalized_result_type = (str(resultType).strip().lstrip('.').lower() or "txt") - output_extension = f".{normalized_result_type}" + # Handle optional resultType: if None, formats determined from prompt by AI + if resultType: + normalized_result_type = (str(resultType).strip().lstrip('.').lower() or "txt") + output_extension = f".{normalized_result_type}" + output_format = output_extension.replace('.', '') or 'txt' + logger.info(f"Using result type: {resultType} -> {output_extension}") + else: + # No format specified - AI will determine formats from prompt + normalized_result_type = None + output_extension = None + output_format = None + logger.debug("resultType not provided - formats will be determined from prompt by AI") + output_mime_type = "application/octet-stream" # Prefer service-provided mimeType when available - logger.info(f"Using result type: {resultType} -> {output_extension}") # Phase 7.3: Extract content first if documents provided, then use contentParts # Check if contentParts are already provided (preferred path) @@ -121,54 +130,33 @@ async def process(self, parameters: Dict[str, Any]) -> ActionResult: # Update progress - preparing AI call self.services.chat.progressLogUpdate(operationId, 0.4, "Preparing AI call") - # Detect image generation from resultType + # Detect image generation from resultType (if provided) imageFormats = ["png", "jpg", "jpeg", "gif", "webp"] - isImageGeneration = normalized_result_type in imageFormats + isImageGeneration = normalized_result_type in imageFormats if normalized_result_type else False # Build options with correct operationType - output_format = output_extension.replace('.', '') or 'txt' from modules.datamodels.datamodelAi import OperationTypeEnum options = AiCallOptions( - resultFormat=output_format, + resultFormat=output_format or "txt", # Fallback for options, but outputFormat can be None for callAiContent operationType=OperationTypeEnum.IMAGE_GENERATE if isImageGeneration else OperationTypeEnum.DATA_GENERATE ) - # Get generationIntent from parameters - generationIntent = parameters.get("generationIntent") - - # For DATA_GENERATE, generationIntent is REQUIRED - # If not provided, default to "document" for document formats (xlsx, docx, pdf, txt, html, etc.) - # This is format-based defaulting, not prompt-based auto-detection - if options.operationType == OperationTypeEnum.DATA_GENERATE and not generationIntent: - # Document formats (default to document generation) - documentFormats = ["xlsx", "docx", "pdf", "txt", "md", "html", "csv", "xml", "json", "pptx"] - # Code formats (should use ai.generateCode instead, but default to code if ai.process is used) - codeFormats = ["py", "js", "ts", "java", "cpp", "c", "go", "rs", "rb", "php", "swift", "kt"] - - if normalized_result_type in documentFormats: - generationIntent = "document" - logger.info(f"Defaulting generationIntent to 'document' for resultType '{normalized_result_type}'") - elif normalized_result_type in codeFormats: - generationIntent = "code" - logger.info(f"Defaulting generationIntent to 'code' for resultType '{normalized_result_type}'") - else: - # Unknown format - default to document (most common use case) - generationIntent = "document" - logger.warning( - f"Unknown resultType '{normalized_result_type}', defaulting generationIntent to 'document'. " - f"For code generation, use ai.generateCode action or explicitly pass generationIntent='code'." - ) + # Get generationIntent from parameters (required for DATA_GENERATE) + # Default to "document" if not provided (most common use case) + # For code generation, use ai.generateCode action or explicitly pass generationIntent="code" + generationIntent = parameters.get("generationIntent", "document") # Update progress - calling AI self.services.chat.progressLogUpdate(operationId, 0.6, "Calling AI") # Use unified callAiContent method with contentParts (extraction is now separate) # ContentParts are already extracted above (or None if no documents) + # outputFormat: Optional - if None, formats determined from prompt by AI aiResponse = await self.services.ai.callAiContent( prompt=aiPrompt, options=options, contentParts=contentParts, # Already extracted (or None if no documents) - outputFormat=output_format, + outputFormat=output_format, # Can be None - AI determines from prompt parentOperationId=operationId, generationIntent=generationIntent # REQUIRED for DATA_GENERATE ) @@ -198,7 +186,7 @@ async def process(self, parameters: Dict[str, Any]) -> ActionResult: final_documents = action_documents else: # Text response - create document from content - extension = output_extension.lstrip('.') + extension = output_extension.lstrip('.') if output_extension else "txt" meaningful_name = self._generateMeaningfulFileName( base_name="ai", extension=extension, @@ -206,8 +194,8 @@ async def process(self, parameters: Dict[str, Any]) -> ActionResult: ) validationMetadata = { "actionType": "ai.process", - "resultType": normalized_result_type, - "outputFormat": output_format, + "resultType": normalized_result_type or "auto", + "outputFormat": output_format or "auto", "hasDocuments": False, "contentType": "text" } diff --git a/modules/workflows/methods/methodAi/methodAi.py b/modules/workflows/methods/methodAi/methodAi.py index 4cd98f14..6aff6047 100644 --- a/modules/workflows/methods/methodAi/methodAi.py +++ b/modules/workflows/methods/methodAi/methodAi.py @@ -60,7 +60,7 @@ class MethodAi(MethodBase): frontendOptions=["txt", "json", "md", "csv", "xml", "html", "pdf", "docx", "xlsx", "pptx", "png", "jpg"], required=False, default="txt", - description="Output file extension. All output documents will use this format" + description="Output file extension. Optional: if omitted, formats are determined from prompt by AI. Default \"txt\" is validation fallback only. With per-document format determination, AI can determine different formats for different documents based on prompt." ), "generationIntent": WorkflowActionParameter( name="generationIntent", @@ -68,7 +68,8 @@ class MethodAi(MethodBase): frontendType=FrontendType.SELECT, frontendOptions=["document", "code", "image"], required=False, - description="Explicit generation intent (\"document\" | \"code\" | \"image\"). For DATA_GENERATE operations, if not provided, defaults based on resultType: document formats (xlsx, docx, pdf, etc.) → \"document\", code formats (py, js, ts, etc.) → \"code\". For IMAGE_GENERATE operations, this parameter is ignored. Best practice: Use qualified actions (ai.generateDocument, ai.generateCode) instead of ai.process." + default="document", + description="Explicit generation intent (\"document\" | \"code\" | \"image\"). Required for DATA_GENERATE operations. Defaults to \"document\" if not provided. For code generation, use ai.generateCode action or explicitly pass generationIntent=\"code\". For IMAGE_GENERATE operations, this parameter is ignored." ) }, execute=process.__get__(self, self.__class__) @@ -267,7 +268,7 @@ class MethodAi(MethodBase): frontendType=FrontendType.TEXT, required=False, default="txt", - description="Output format (e.g., txt, html, pdf, docx, md, json, csv, xlsx, pptx, png, jpg). Any format supported by renderers is acceptable. Default: txt" + description="Output format (e.g., txt, html, pdf, docx, md, json, csv, xlsx, pptx, png, jpg). Optional: if omitted, formats are determined from prompt by AI. Default \"txt\" is validation fallback only. With per-document format determination, AI can determine different formats for different documents based on prompt." ) }, execute=generateDocument.__get__(self, self.__class__) @@ -297,7 +298,7 @@ class MethodAi(MethodBase): frontendType=FrontendType.SELECT, frontendOptions=["py", "js", "ts", "html", "java", "cpp", "txt"], required=False, - description="Output format (html, js, py, etc.). Default: based on prompt" + description="Output format (html, js, py, etc.). Optional: if omitted, formats are determined from prompt by AI. With per-document format determination, AI can determine different formats for different documents based on prompt." ) }, execute=generateCode.__get__(self, self.__class__) diff --git a/modules/workflows/workflowManager.py b/modules/workflows/workflowManager.py index 4d1abd0c..01db9438 100644 --- a/modules/workflows/workflowManager.py +++ b/modules/workflows/workflowManager.py @@ -693,12 +693,38 @@ The following is the user's original input message. Analyze intent, normalize th setattr(self.services, '_needsWorkflowHistory', False) # Update services state + # CRITICAL: Validate language from AI response + # If AI didn't return language or invalid → use user language + # If user language not set → use "en" + validatedLanguage = None + + # Validate AI-detected language if detectedLanguage and isinstance(detectedLanguage, str): - self._setUserLanguage(detectedLanguage) - try: - setattr(self.services, 'currentUserLanguage', detectedLanguage) - except Exception: - pass + detectedLanguage = detectedLanguage.strip().lower() + # Check if it's a valid 2-character ISO code + if len(detectedLanguage) == 2 and detectedLanguage.isalpha(): + validatedLanguage = detectedLanguage + + # If AI didn't return valid language, use user language + if not validatedLanguage: + userLanguage = getattr(self.services.user, 'language', None) if hasattr(self.services, 'user') and self.services.user else None + if userLanguage and isinstance(userLanguage, str): + userLanguage = userLanguage.strip().lower() + if len(userLanguage) == 2 and userLanguage.isalpha(): + validatedLanguage = userLanguage + + # Final fallback to "en" + if not validatedLanguage: + validatedLanguage = "en" + logger.warning("Language not detected from AI and user language not set - using default 'en'") + + # Set validated language + self._setUserLanguage(validatedLanguage) + try: + setattr(self.services, 'currentUserLanguage', validatedLanguage) + logger.debug(f"Set currentUserLanguage to validated value: {validatedLanguage}") + except Exception: + pass self.services.currentUserPrompt = intentText or userInput.prompt # Always set currentUserPromptNormalized - use normalizedRequest if available, otherwise fallback to currentUserPrompt normalizedValue = normalizedRequest or intentText or userInput.prompt