wiki/implementation/implementation_extraction.md at 1a7cea983501323c56c537ceec30f28618c92b5e


ValueOn AG d9d13dfc12 fix: refactored structure and removed mandates data

2025-11-11 19:16:25 +01:00


PowerON Extraction Service – Concept and Architecture

Goals

Normalize any document into a small set of processing‑ready typeGroups: text, table, structure, image, binary, metadata, container.
Decouple extraction (split/normalize) from chunking and merging.
Scale to multi‑part/container formats (PDF, Office) using recursive splitting.
Control cost and latency by honoring maxSize and chunkAllowed, invoking AI only when needed.
Integrate with the AI Prompt Builder entry point and support operationType‑specific behavior.

New Service Location

Base: gateway/modules/services/serviceExtraction/mainServiceExtraction.py
Sub‑modules:
- gateway/modules/services/serviceExtraction/subRegistry.py (extractor and chunker registries)
- gateway/modules/services/serviceExtraction/subPipeline.py (3‑pass orchestration)
- gateway/modules/services/serviceExtraction/formats/ (per‑format extractors)
- gateway/modules/services/serviceExtraction/chunking/ (per‑typeGroup chunkers)
- gateway/modules/services/serviceExtraction/merging/ (per‑typeGroup mergers)
- gateway/modules/services/serviceExtraction/utils/ (encoding, mime, helpers)

No backwards compatibility is required; this is a clean introduction.

Core Data Model (standardized outputs)

ContentPart
- id: str
- parentId: Optional[str] (preserve hierarchy; root has None)
- label: str (e.g., "page_2", "sheet_Jan", "table_1")
- typeGroup: Literal["text","table","structure","image","binary","metadata","container"]
- mimeType: str
- data: str (utf‑8 text for text|table|structure; base64 for image|binary; empty for container)
- metadata: Dict[str, Any] (size, pages, width/height, pageIndex, sheetName, sourceRanges, checksum, confidence, warnings)
ExtractedContent
- id: str (document id)
- parts: List[ContentPart] (flat list; hierarchy via parentId)
- summary: Optional[Dict[str, Any]]

Notes:

metadata.sourceRanges or page/sheet indices provide provenance for merges and summaries.
metadata.confidence and metadata.warnings guide downstream AI/UX decisions.
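
The data model above could be expressed with dataclasses. This is a minimal sketch, not the service's actual definition: the default values and the `childrenOf` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Literal, Optional

TypeGroup = Literal["text", "table", "structure", "image", "binary", "metadata", "container"]


@dataclass
class ContentPart:
    id: str
    parentId: Optional[str]  # None for the root part
    label: str
    typeGroup: TypeGroup
    mimeType: str
    data: str = ""  # utf-8 text, base64 payload, or empty for containers
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ExtractedContent:
    id: str
    parts: List[ContentPart] = field(default_factory=list)
    summary: Optional[Dict[str, Any]] = None

    def childrenOf(self, parentId: Optional[str]) -> List[ContentPart]:
        # Hypothetical helper: resolve hierarchy from the flat list via parentId.
        return [p for p in self.parts if p.parentId == parentId]


root = ContentPart(id="p0", parentId=None, label="report.pdf",
                   typeGroup="container", mimeType="application/pdf")
page = ContentPart(id="p1", parentId="p0", label="page_1", typeGroup="text",
                   mimeType="text/plain", data="Hello", metadata={"pageIndex": 0})
doc = ExtractedContent(id="doc-1", parts=[root, page])
```

Keeping `parts` flat and deriving the tree from `parentId` means chunkers and mergers can iterate without recursion, while the hierarchy stays recoverable.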

MIME → typeGroup mapping (deterministic first)

text/plain, text/markdown → text
text/csv → table
application/json, application/xml, text/html, image/svg+xml → structure
image/* (except image/svg+xml, which is matched above) → image
application/pdf, application/vnd.openxmlformats-officedocument.* → container
otherwise → binary

Container extractors are responsible for disaggregating into basic typeGroups.
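
One way to implement the mapping is a lookup that checks exact MIME types before wildcard rules, so that image/svg+xml resolves to structure rather than image. The function name `resolveTypeGroup` and the stripping of MIME parameters (such as `charset`) are assumptions of this sketch.

```python
def resolveTypeGroup(mimeType: str) -> str:
    """Map a MIME type to a typeGroup; exact matches win over wildcard rules."""
    # Assumption: parameters such as "; charset=utf-8" are ignored.
    mime = mimeType.split(";", 1)[0].strip().lower()
    exact = {
        "text/plain": "text",
        "text/markdown": "text",
        "text/csv": "table",
        "application/json": "structure",
        "application/xml": "structure",
        "text/html": "structure",
        "image/svg+xml": "structure",
        "application/pdf": "container",
    }
    if mime in exact:
        return exact[mime]
    if mime.startswith("application/vnd.openxmlformats-officedocument."):
        return "container"
    if mime.startswith("image/"):
        return "image"
    return "binary"
```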

3‑Pass Pipeline

Identify and normalize (Split/Extract)
- Start with a root container part representing the raw file.
- Resolve extractor by mimeType/extension via registry.
- Recursively split container formats into child parts until only basic typeGroups remain (text|table|structure|image|binary|metadata).
- Output a single ExtractedContent per input document.
Chunk
- Route each basic typeGroup to its chunker:
  - text → size‑bounded, line/paragraph‑aware
  - table → row‑bounded (CSV lines), optionally schema‑aware
  - structure → JSON object/XML subtree/HTML block‑aware
  - image, binary, metadata, container → no chunking by default
- Chunkers return chunks: List[Dict] with back‑references (partId, order).
Merge
- Strategy driven by call options and workflow:
  - text → concatenate by logical order (page/section) or keep per part
  - table → keep separate per table/sheet; optional schema merge
  - structure → preserve keys/paths; avoid lossy merges
  - image|binary → usually pass‑through
  - metadata|container → excluded by default
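
The recursive splitting in the first pass can be sketched as follows. Parts are plain dicts standing in for ContentPart; `splitRecursively`, the `maxDepth` guard, and the two demo MIME types are hypothetical and only illustrate the control flow.

```python
from typing import Any, Callable, Dict, List

Part = Dict[str, Any]  # simplified stand-in for ContentPart
Splitter = Callable[[Part], List[Part]]


def splitRecursively(part: Part, splitters: Dict[str, Splitter], maxDepth: int = 8) -> List[Part]:
    """Return the part followed by all descendants, depth-first, as a flat list."""
    flat = [part]
    splitter = splitters.get(part["mimeType"])
    if part["typeGroup"] != "container" or splitter is None or maxDepth <= 0:
        return flat  # basic typeGroup, unknown container, or depth guard reached
    for child in splitter(part):
        child["parentId"] = part["id"]  # hierarchy is kept via parentId only
        flat.extend(splitRecursively(child, splitters, maxDepth - 1))
    return flat


# Demo splitters for two made-up container formats.
def _splitBundle(part: Part) -> List[Part]:
    return [
        {"id": "b1", "typeGroup": "text", "mimeType": "text/plain", "data": "intro"},
        {"id": "b2", "typeGroup": "container", "mimeType": "application/x-demo-inner", "data": ""},
    ]


def _splitInner(part: Part) -> List[Part]:
    return [{"id": "i1", "typeGroup": "table", "mimeType": "text/csv", "data": "a,b\n1,2"}]


splitters = {"application/x-demo-bundle": _splitBundle, "application/x-demo-inner": _splitInner}
rootPart = {"id": "r", "parentId": None, "typeGroup": "container",
            "mimeType": "application/x-demo-bundle", "data": ""}
flat = splitRecursively(rootPart, splitters)
```

Container parts stay in the flat list so the hierarchy can be rebuilt later; the merge pass excludes them by default.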

Registries

ExtractorRegistry (in subRegistry.py)
- Maps mimeType/extension to an Extractor instance.
- Fallbacks: content sniffing, default binary extractor.
ChunkerRegistry (in subRegistry.py)
- Maps typeGroup to a Chunker.
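
A registry with the described fallbacks might look like the sketch below. The method names (`register`, `registerSniffer`, `resolve`) and the lookup order (MIME type, then extension, then content sniffing, then default) are assumptions; extractors are represented by plain strings to keep the example self-contained.

```python
import os
from typing import Any, Callable, Dict, Iterable, List, Tuple


class ExtractorRegistry:
    def __init__(self, defaultExtractor: Any) -> None:
        self._byMime: Dict[str, Any] = {}
        self._byExtension: Dict[str, Any] = {}
        self._sniffers: List[Tuple[Callable[[bytes], bool], Any]] = []
        self._default = defaultExtractor

    def register(self, extractor: Any, mimeTypes: Iterable[str] = (),
                 extensions: Iterable[str] = ()) -> None:
        for mimeType in mimeTypes:
            self._byMime[mimeType.lower()] = extractor
        for extension in extensions:
            self._byExtension[extension.lower()] = extractor

    def registerSniffer(self, predicate: Callable[[bytes], bool], extractor: Any) -> None:
        self._sniffers.append((predicate, extractor))

    def resolve(self, fileName: str, mimeType: str, headBytes: bytes = b"") -> Any:
        # 1) declared MIME type, 2) file extension, 3) content sniffing, 4) default
        if mimeType.lower() in self._byMime:
            return self._byMime[mimeType.lower()]
        extension = os.path.splitext(fileName)[1].lower()
        if extension in self._byExtension:
            return self._byExtension[extension]
        for predicate, extractor in self._sniffers:
            if predicate(headBytes):
                return extractor
        return self._default


registry = ExtractorRegistry(defaultExtractor="binaryExtractor")
registry.register("pdfExtractor", mimeTypes=["application/pdf"], extensions=[".pdf"])
registry.register("csvExtractor", mimeTypes=["text/csv"], extensions=[".csv"])
# PDF files start with the magic bytes "%PDF-".
registry.registerSniffer(lambda head: head.startswith(b"%PDF-"), "pdfExtractor")
```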

Base Interfaces

Use camelCase and prefix internal methods with _.

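from typing import Any, Dict, List

# ContentPart refers to the model defined under "Core Data Model".

class Extractor:
    # Return True if this extractor can handle the file (name, MIME type, or leading bytes).
    def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool: ...
    # Split/normalize the raw bytes into parts.
    def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]: ...

class Chunker:
    # Split one part into chunk dicts carrying back-references (partId, order).
    def chunk(self, part: ContentPart, options: Dict[str, Any]) -> List[Dict[str, Any]]: ...

class Merger:
    # Combine parts according to the given strategy.
    def merge(self, parts: List[ContentPart], strategy: Dict[str, Any]) -> List[ContentPart]: ...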

Format Extractors (under `formats/`)

text_extractor.py → emits one text part
csv_extractor.py → emits one table part (CSV payload)
json_extractor.py, xml_extractor.py, html_extractor.py, svg_extractor.py → emit structure parts
image_extractor.py → emits one image part; optional OCR is handled by AI post‑processing
pdf_extractor.py → emits container root with children:
- per page: text part if text found
- per page: extracted images as image parts
- per page/section metadata as metadata
docx_extractor.py → container + children: headings structure, paragraphs text, tables table, comments metadata
xlsx_extractor.py → container + children: each sheet as table CSV; properties metadata; charts as image or structure
pptx_extractor.py → container + slides: text boxes text, tables table, images image, notes metadata
legacy_*_extractor.py → emit a metadata part plus a binary fallback part; limitations are stated clearly (for example in metadata.warnings)
binary_extractor.py → single binary part
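
As an illustration of the simplest case, a plain-text extractor could look like this. It is a sketch under assumptions: parts are plain dicts standing in for ContentPart, the function name `extractText` and the `context` keys are hypothetical, and the latin-1 fallback is one possible decoding policy rather than the service's actual one.

```python
import hashlib
import uuid
from typing import Any, Dict, List


def extractText(fileBytes: bytes, context: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Emit exactly one text part; record decoding problems as warnings."""
    warnings: List[str] = []
    try:
        text = fileBytes.decode("utf-8-sig")  # also strips a leading BOM
    except UnicodeDecodeError:
        text = fileBytes.decode("latin-1")  # maps every byte, so it cannot fail
        warnings.append("utf-8 decoding failed; decoded as latin-1")
    return [{
        "id": uuid.uuid4().hex,
        "parentId": context.get("parentId"),
        "label": context.get("fileName", "text"),
        "typeGroup": "text",
        "mimeType": "text/plain",
        "data": text,
        "metadata": {
            "size": len(fileBytes),
            "checksum": hashlib.sha256(fileBytes).hexdigest(),
            "warnings": warnings,
        },
    }]


utf8Parts = extractText(b"\xef\xbb\xbfhello", {"fileName": "a.txt"})
latinParts = extractText("héllo".encode("latin-1"), {"fileName": "b.txt"})
```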

Chunkers (under `chunking/`)

text_chunker.py → size/paragraph aware; configurable sizes
table_chunker.py → split by row count/bytes; propagate the header row to every chunk
structure_chunker.py → JSON object buckets, XML subtree buckets, HTML block buckets
binary_chunker.py → byte slicing when explicitly requested
noop_chunker.py → for image/metadata/container
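
Header propagation in the table chunker can be sketched as below. The example assumes the first CSV row is the header and that the option is called `maxRows`; both are illustrative choices, and byte-based limits are left out for brevity.

```python
import csv
import io
from typing import Any, Dict, List


def chunkTable(part: Dict[str, Any], options: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Split a CSV table part into row-bounded chunks, repeating the header in each."""
    maxRows = int(options.get("maxRows", 1000))  # hypothetical option name
    rows = list(csv.reader(io.StringIO(part["data"])))
    if not rows:
        return []
    header, body = rows[0], rows[1:]  # assumption: first row is the header
    chunks = []
    for order, start in enumerate(range(0, max(len(body), 1), maxRows)):
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(body[start:start + maxRows])
        # Back-references let the merger trace each chunk to its source part.
        chunks.append({"partId": part["id"], "order": order, "data": buffer.getvalue()})
    return chunks


tablePart = {"id": "t1", "typeGroup": "table", "data": "name,qty\na,1\nb,2\nc,3\n"}
tableChunks = chunkTable(tablePart, {"maxRows": 2})
```

Repeating the header costs a few bytes per chunk but lets each chunk be interpreted on its own, which matters when chunks are sent to AI independently.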

Mergers (under `merging/`)

text_merger.py → page/section aware concatenation
table_merger.py → per sheet/table; optional schema merge
structure_merger.py → key/path preserving grouping
default_merger.py → pass‑through
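
A page-aware text merge might be implemented along these lines. The strategy keys `concatenate` and `separator`, the `sourceParts` metadata key, and the fixed id `merged_text` are assumptions of this sketch; parts are plain dicts standing in for ContentPart.

```python
from typing import Any, Dict, List


def mergeText(parts: List[Dict[str, Any]], strategy: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Concatenate text parts in page order; pass all other parts through."""
    if not strategy.get("concatenate", True):
        return list(parts)  # keep one part per page/section
    textParts = [p for p in parts if p["typeGroup"] == "text"]
    others = [p for p in parts if p["typeGroup"] != "text"]
    if not textParts:
        return others
    # sorted() is stable, so parts on the same page keep their original order.
    ordered = sorted(textParts, key=lambda p: p["metadata"].get("pageIndex", 0))
    merged = {
        "id": "merged_text",
        "parentId": ordered[0]["parentId"],
        "label": "merged_text",
        "typeGroup": "text",
        "mimeType": "text/plain",
        "data": strategy.get("separator", "\n\n").join(p["data"] for p in ordered),
        "metadata": {"sourceParts": [p["id"] for p in ordered]},  # provenance
    }
    return [merged] + others


pageParts = [
    {"id": "p2", "parentId": "r", "typeGroup": "text", "data": "second", "metadata": {"pageIndex": 1}},
    {"id": "img", "parentId": "r", "typeGroup": "image", "data": "", "metadata": {"pageIndex": 0}},
    {"id": "p1", "parentId": "r", "typeGroup": "text", "data": "first", "metadata": {"pageIndex": 0}},
]
mergedParts = mergeText(pageParts, {"separator": "\n"})
```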

Orchestration (in `subPipeline.py`)

High‑level flow for one document:

def runExtraction(document: bytes, fileName: str, mimeType: str, options: Dict[str, Any]) -> ExtractedContent:
    # Pass 1: extract/normalize
    parts = _extractAll(document, fileName, mimeType, options)

    # Pass 2: chunk if allowed
    if options.get("chunkAllowed", False):
        chunks = _chunkParts(parts, options)
    else:
        chunks = []

    # Pass 3: merge per strategy
    merged = _merge(parts, chunks, options.get("mergeStrategy", {}))

    return ExtractedContent(id=_makeId(), parts=merged, summary=_buildSummary(parts))

Entry Point and Options (in `mainServiceExtraction.py`)

The service is invoked by the AI Prompt Builder with (documentList, options).

Supported options and effects:

prompt: str
- If present, enables optional AI augmentation on extracted content/chunks based on operationType.
operationType: Literal["general","generate_plan","analyse_content","generate_content","web_research"]
- general/analyse_content: prefer deterministic extraction; AI can summarize or answer over chunks.
- generate_plan: produce structured outputs as structure parts (bullet points, tasks) from text chunks.
- generate_content: allow AI synthesis over merged text parts within maxSize.
- web_research: treat extracted structure and text as context; AI orchestrator may fetch more docs upstream.
processDocumentsIndividually: bool
- True: run the 3‑pass pipeline per document; apply maxSize per document; return list of results.
- False: extract all docs → pool parts → global chunk/merge → apply maxSize across the pool; keep provenance by parentId and documentId.
maxSize: int and chunkAllowed: bool
- Hard cap on total size of content passed to AI.
- If chunkAllowed=True → prefer chunking to stay under maxSize; process chunks iteratively in priority order (e.g., text before images, or by page order).
- If chunkAllowed=False → do not chunk; instead summarize down (per part, then hierarchical) until under maxSize.

Size governance policy:

Compute sizes for candidate parts/chunks.
If total ≤ maxSize → pass through.
If total > maxSize and chunkAllowed → progressively include highest‑value chunks until the cap; optionally add a final global summary.
If total > maxSize and not chunkAllowed → summarize per part, then merge summaries; ensure final text ≤ cap.
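
The policy above can be sketched as a single function. Several details are assumptions here: size is measured as `len()` of the data string, "highest value" is modeled by a hypothetical `priority` field (lower is more valuable), items that do not fit are skipped rather than ending the selection, and `summarize` is a caller-supplied callable standing in for an AI summarizer that must respect the budget it is given.

```python
from typing import Any, Callable, Dict, List


def applySizeCap(items: List[Dict[str, Any]], maxSize: int, chunkAllowed: bool,
                 summarize: Callable[[str, int], str]) -> List[Dict[str, Any]]:
    """Return items whose combined data size stays within maxSize."""
    total = sum(len(item["data"]) for item in items)
    if total <= maxSize:
        return list(items)  # under the cap: pass through unchanged
    if chunkAllowed:
        selected, used = [], 0
        for item in sorted(items, key=lambda i: i.get("priority", 0)):
            size = len(item["data"])
            if used + size <= maxSize:  # greedy: take what still fits
                selected.append(item)
                used += size
        return selected
    # No chunking: give every part an equal share of the budget and summarize.
    budget = maxSize // max(len(items), 1)
    return [{**item, "data": summarize(item["data"], budget)} for item in items]


def truncate(text: str, budget: int) -> str:
    # Placeholder for a real summarizer; it only guarantees the size bound.
    return text[:budget]


candidates = [
    {"id": "a", "data": "x" * 60, "priority": 0},
    {"id": "b", "data": "y" * 60, "priority": 1},
    {"id": "c", "data": "z" * 30, "priority": 2},
]
chunked = applySizeCap(candidates, maxSize=100, chunkAllowed=True, summarize=truncate)
summarized = applySizeCap(candidates, maxSize=100, chunkAllowed=False, summarize=truncate)
```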

AI Integration

AI is optional and strictly after extraction.
Recommended placements:
- OCR/VLM for image parts when requested.
- LLM summarization for large text|structure|table parts to respect maxSize when chunkAllowed=False.
- LLM question answering (analyse_content) over selected chunks.
All AI calls must respect budget/time guards and the size cap.

Error Handling

Every extractor must return either valid parts or a metadata part with warnings/error plus a binary fallback when applicable.
Include enough context in metadata to diagnose issues (library missing, parse error details) without leaking sensitive content.
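
One way to enforce this contract is a wrapper around every extractor call. The sketch below is an assumption about how that could look: `safeExtract`, the part ids, and the `partId` context key are hypothetical. Only the exception type is recorded, not its message, because messages can contain document content.

```python
import base64
from typing import Any, Callable, Dict, List


def safeExtract(extractFn: Callable[[bytes, Dict[str, Any]], List[Dict[str, Any]]],
                fileBytes: bytes, context: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Run an extractor; on failure return a metadata part plus a binary fallback."""
    try:
        parts = extractFn(fileBytes, context)
        if parts:
            return parts
        warning = "extractor returned no parts"
    except Exception as exc:  # deliberately broad: extractors wrap third-party parsers
        warning = f"extraction failed: {type(exc).__name__}"
    rootId = context.get("partId", "root")
    return [
        {"id": f"{rootId}_error", "parentId": rootId, "label": "extraction_error",
         "typeGroup": "metadata", "mimeType": "application/json", "data": "",
         "metadata": {"warnings": [warning], "size": len(fileBytes)}},
        {"id": f"{rootId}_raw", "parentId": rootId, "label": "raw",
         "typeGroup": "binary", "mimeType": "application/octet-stream",
         "data": base64.b64encode(fileBytes).decode("ascii"),
         "metadata": {"size": len(fileBytes)}},
    ]


def brokenExtractor(fileBytes: bytes, context: Dict[str, Any]) -> List[Dict[str, Any]]:
    raise ValueError("confidential text from the document")


fallbackParts = safeExtract(brokenExtractor, b"abc", {})
```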

Ordering and Provenance

Preserve logical order within a document (page index, slide index, sheet index).
Maintain parentId links to reconstruct hierarchy during merge and summarization.
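
Reconstructing the hierarchy from the flat list is a matter of grouping by parentId and sorting siblings by their index. In this sketch the function name `outlineParts` is hypothetical, parts are plain dicts, and only metadata.pageIndex is used for ordering; slide or sheet indices would be handled the same way.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional


def outlineParts(parts: List[Dict[str, Any]]) -> List[str]:
    """Return an indented outline of part labels in logical order."""
    children: Dict[Optional[str], List[Dict[str, Any]]] = defaultdict(list)
    for part in parts:
        children[part["parentId"]].append(part)

    def visit(parentId: Optional[str], depth: int) -> List[str]:
        lines: List[str] = []
        siblings = sorted(children.get(parentId, []),
                          key=lambda p: p["metadata"].get("pageIndex", 0))
        for part in siblings:
            lines.append("  " * depth + part["label"])
            lines.extend(visit(part["id"], depth + 1))
        return lines

    return visit(None, 0)


flatParts = [
    {"id": "r", "parentId": None, "label": "doc.pdf", "metadata": {}},
    {"id": "p2", "parentId": "r", "label": "page_2", "metadata": {"pageIndex": 1}},
    {"id": "p1", "parentId": "r", "label": "page_1", "metadata": {"pageIndex": 0}},
]
outline = outlineParts(flatParts)
```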

Testing Strategy

Unit tests per extractor on small fixtures for each format.
Contract tests for the 3‑pass pipeline (end‑to‑end) with mixed multi‑part documents.
Size‑cap tests validating chunking vs summarization paths.

Migration Notes

Existing monolithic logic can be moved into formats/* and utils/*, preserving the robust decoding and Office/PDF heuristics while removing AI calls from the extractors.
ContentItem usage should shift to ContentPart (no backward compatibility required).

Minimal Pseudocode – processDocumentsIndividually

def extractDocuments(documentList: List[Dict[str, Any]], options: Dict[str, Any]):
    if options.get("processDocumentsIndividually", True):
        # Per-document mode: run the full pipeline and apply the size cap per document.
        results = []
        for doc in documentList:
            ec = runExtraction(doc["bytes"], doc["fileName"], doc["mimeType"], options)
            ec = _applyAiIfRequested(ec, options)  # respects maxSize + chunkAllowed
            results.append(ec)
        return results
    else:
        # Pooled mode: collect parts from all documents, then apply the cap globally.
        parts = []
        for doc in documentList:
            ec = runExtraction(doc["bytes"], doc["fileName"], doc["mimeType"], options)
            # Tag each part with its source document to keep provenance in the pool.
            parts.extend(_tagWithDocumentId(ec.parts, doc["id"]))
        pooled = _poolAndLimit(parts, options)  # chunk/summarize to cap
        pooled = _applyAiIfRequestedOverPool(pooled, options)
        return pooled

Defaults and Configuration

Chunk sizes per typeGroup are centralized and configurable.
Merge strategies (text concat policy, table schema inference) are pluggable.
Registries support runtime extension (new formats) without touching the pipeline.

Summary

This design introduces a small, stable contract (ContentPart with typeGroup) and a 3‑pass pipeline that:

normalizes diverse documents into uniform parts,
chunks only what benefits from chunking,
merges predictably for downstream AI and workflow steps, while strictly enforcing maxSize and honoring chunkAllowed and processDocumentsIndividually.
