PowerOn/gateway

ValueOn AG 64b44473aa fixed data extraction and generation handling with parts

2026-01-02 00:05:54 +01:00

Document Generation Architecture Analysis

Current Flow

1. Document Input → ContentParts (`extractAndPrepareContent`)

Location: gateway/modules/services/serviceAi/subContentExtraction.py

Flow:

Regular documents → extractContent() (non-AI extraction) → contentParts with raw extracted text
After that, two kinds of parts are additionally sent to AI:
- Images with "extract" intent → Vision AI (line 190) → AI extraction
- Text with "extract" intent and an extractionPrompt → AI processing (line 265) → AI extraction
Pre-extracted JSON → contentParts are used directly (no AI)

Result: ContentParts may already be AI-processed before structure generation
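The branching described above can be summarized in a small routing function. This is only an illustrative sketch: the `ContentPart` dataclass, its field names, and `extraction_route` are hypothetical stand-ins and are not taken from subContentExtraction.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContentPart:
    # Hypothetical, simplified stand-in for the gateway's real contentPart objects.
    kind: str                               # "text", "image", "table", ...
    intent: Optional[str] = None            # e.g. "extract"
    extraction_prompt: Optional[str] = None
    pre_extracted: bool = False             # True when loaded from pre-extracted JSON


def extraction_route(part: ContentPart) -> str:
    """Return which path a part takes in the current flow, as described above."""
    if part.pre_extracted:
        return "direct"      # used as-is, no AI
    if part.kind == "image" and part.intent == "extract":
        return "vision_ai"   # Vision AI extraction
    if part.kind == "text" and part.intent == "extract" and part.extraction_prompt:
        return "text_ai"     # AI processing driven by the extraction prompt
    return "raw"             # only the non-AI extractContent() output
```

Only the "vision_ai" and "text_ai" routes involve a model call at this stage; those are the paths the rest of this analysis is concerned with.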

2. Structure Generation

Location: gateway/modules/services/serviceAi/subStructureGeneration.py

Flow:

Uses contentParts (may already be AI-processed)
Generates document structure (chapters, sections)

3. Section Generation (`_processSingleSection`)

Location: gateway/modules/services/serviceAi/subStructureFilling.py

Flow:

Uses contentParts (which may already be AI-processed)
Aggregates "extracted" contentParts with AI (lines 554-682)
Generates section content using callAiWithLooping with useCaseId="section_content"

Issues Identified

Issue 1: Duplicate AI Processing

AI extraction happens in extractAndPrepareContent (for images/text)
AI generation happens again in section generation
The same content can therefore be sent to an AI model twice, which is redundant and inefficient
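To make the duplication concrete, the following sketch counts how many AI passes a single part goes through in the current flow. It is a simplified model under stated assumptions: `count_ai_passes` and its parameters are hypothetical, only the extraction and section-generation stages are counted, and it assumes every part is actually used by some section.

```python
def count_ai_passes(kind, intent=None, has_extraction_prompt=False, pre_extracted=False):
    """Count how often a part's content is sent to an AI model in the current flow."""
    passes = 0
    # Stage 1: extractAndPrepareContent may already call AI for this part.
    if not pre_extracted and intent == "extract":
        if kind == "image" or (kind == "text" and has_extraction_prompt):
            passes += 1
    # Stage 3: section generation runs AI over the parts it uses (assumed: all of them).
    passes += 1
    return passes
```

Under this model an image with "extract" intent is processed twice, while a pre-extracted JSON part is processed once.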

Issue 2: Architecture Inconsistency

Pre-extracted JSON files → contentParts directly (no AI)
Regular documents → contentParts plus AI extraction (inconsistent with the pre-extracted JSON path)
User wants: Documents → contentParts (like pre-extracted JSON) → AI only in section generation

Issue 3: Image Processing

Images need Vision AI to extract text
Currently happens in extractAndPrepareContent
Question: Should this happen during section generation instead?

Proposed Architecture

Option A: Remove All AI from `extractAndPrepareContent`

Documents → extractContent() → Raw contentParts (text, tables, etc.)
Images → Keep as image contentParts (no Vision AI extraction)
Section generation → Handle images with Vision AI when needed

Pros:

Consistent with pre-extracted JSON flow
Single point of AI processing (section generation)
Clear separation of concerns

Cons:

Images won't have extracted text until section generation
May need to handle images differently in section generation
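A minimal sketch of what Option A could look like, assuming a simplified part representation. `RawPart`, `prepare_content`, `build_section_input`, the `mime`/`payload` keys, and the `vision` callable are all hypothetical names introduced for illustration; they do not reflect the gateway's actual signatures.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class RawPart:
    kind: str                # "text" or "image"
    data: Union[str, bytes]  # raw text, or the untouched image bytes


def prepare_content(documents: List[dict]) -> List[RawPart]:
    """Option A extraction: no AI calls, images stay binary."""
    parts = []
    for doc in documents:
        kind = "image" if doc["mime"].startswith("image/") else "text"
        parts.append(RawPart(kind=kind, data=doc["payload"]))
    return parts


def build_section_input(parts: List[RawPart], vision: Callable[[bytes], str]) -> str:
    """Section generation: the only place where the vision model is invoked."""
    texts = []
    for part in parts:
        if part.kind == "image":
            texts.append(vision(part.data))  # deferred Vision AI call
        else:
            texts.append(part.data)
    return "\n".join(texts)
```

Passing the vision model in as a callable keeps the extraction step free of AI dependencies and makes the deferred call easy to stub in tests.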

Option B: Keep Vision AI for Images Only

Documents → extractContent() → Raw contentParts
Images → Vision AI extraction → Text contentParts
Section generation → Uses text contentParts (no additional AI extraction)

Pros:

Images get text extracted early
Section generation can use text directly

Cons:

Still has AI extraction before structure generation
Inconsistent with user's request
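For comparison, Option B could be sketched as follows. The `(kind, data)` tuple representation, `prepare_content_option_b`, and the `vision` callable are hypothetical and chosen only to keep the example short.

```python
from typing import Callable, List, Tuple

Part = Tuple[str, object]  # (kind, data): hypothetical minimal representation


def prepare_content_option_b(parts: List[Part], vision: Callable[[bytes], str]) -> List[Part]:
    """Option B extraction: images become text parts, everything else passes through."""
    prepared = []
    for kind, data in parts:
        if kind == "image":
            # AI call happens here, before structure generation.
            prepared.append(("text", vision(data)))
        else:
            prepared.append((kind, data))
    return prepared
```

After this step no image parts remain, so section generation works on text only, but the extraction stage still depends on an AI model.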

Recommendation

Follow Option A - Remove all AI extraction from extractAndPrepareContent:

Documents → ContentParts (like pre-extracted JSON):
- Call extractContent() (NON-AI)
- Create contentParts with raw extracted content
- Images remain as image contentParts (no Vision AI)
Section Generation:
- Handle images with Vision AI when needed
- Aggregate all contentParts with AI
- Single point of AI processing

Benefits:

Clear architecture: Documents = raw contentParts
Consistent with pre-extracted JSON flow
AI processing only where needed (section generation)
Easier to understand and maintain

Questions to Resolve

Image handling: How should images be processed during section generation?
- Option 1: Vision AI extraction happens automatically when image contentParts are used
- Option 2: Images are passed to AI with Vision models during section generation
- Option 3: Images remain as binary and are rendered directly (no text extraction)
Text with extractionPrompt: Should text contentParts with extractionPrompt be processed differently?
- Currently: AI processing in extractAndPrepareContent
- Proposed: Raw text → AI processing during section generation
Performance: Will deferring image extraction to section generation cause performance issues?
- Need to test with multiple images
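The three image-handling options could be compared side by side with a small dispatch function. This is a sketch only: `ImageMode`, `resolve_image`, and the shape of the returned dictionary are assumptions made for illustration, not existing gateway code.

```python
from enum import Enum
from typing import Callable


class ImageMode(Enum):
    EXTRACT_TEXT = 1   # Option 1: run Vision AI extraction, then use the text
    PASS_TO_MODEL = 2  # Option 2: hand the image itself to a vision-capable model
    RENDER_ONLY = 3    # Option 3: keep binary and render directly, no text extraction


def resolve_image(image: bytes, mode: ImageMode, vision: Callable[[bytes], str]) -> dict:
    """Describe what section generation would do with one image under each option."""
    if mode is ImageMode.EXTRACT_TEXT:
        return {"prompt_text": vision(image), "attachments": [], "render": []}
    if mode is ImageMode.PASS_TO_MODEL:
        return {"prompt_text": "", "attachments": [image], "render": []}
    return {"prompt_text": "", "attachments": [], "render": [image]}
```

Only the first mode triggers a separate extraction call per image, which makes it the one most relevant to the performance question above.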
