## PowerON Extraction Service – Concept and Architecture

### Goals
- **Normalize** any document into a small set of processing‑ready typeGroups: `text`, `table`, `structure`, `image`, `binary`, `metadata`, `container`.
- **Decouple** extraction (split/normalize) from chunking and merging.
- **Scale** to multi‑part/container formats (pdf, office) using recursive splitting.
- **Control** cost/latency by honoring `maxSize` and `chunkAllowed` with AI only when needed.
- **Integrate** with AI Prompt Builder entrypoint and support `operationType` behavior.

### New Service Location
- Base: `gateway/modules/services/serviceExtraction/mainServiceExtraction.py`
- Sub‑modules:
  - `gateway/modules/services/serviceExtraction/subRegistry.py` (extractor and chunker registries)
  - `gateway/modules/services/serviceExtraction/subPipeline.py` (3‑pass orchestration)
  - `gateway/modules/services/serviceExtraction/formats/` (per‑format extractors)
  - `gateway/modules/services/serviceExtraction/chunking/` (per‑typeGroup chunkers)
  - `gateway/modules/services/serviceExtraction/merging/` (per‑typeGroup mergers)
  - `gateway/modules/services/serviceExtraction/utils/` (encoding, mime, helpers)

No backwards compatibility is required; this is a clean introduction.
### Core Data Model (standardized outputs)
- ContentPart
  - `id: str`
  - `parentId: Optional[str]` (preserve hierarchy; root has `None`)
  - `label: str` (e.g., "page_2", "sheet_Jan", "table_1")
  - `typeGroup: Literal["text","table","structure","image","binary","metadata","container"]`
  - `mimeType: str`
  - `data: str` (utf‑8 text for `text|table|structure`; base64 for `image|binary`; empty for `container`)
  - `metadata: Dict[str, Any]` (size, pages, width/height, pageIndex, sheetName, sourceRanges, checksum, confidence, warnings)
- ExtractedContent
  - `id: str` (document id)
  - `parts: List[ContentPart]` (flat list; hierarchy via `parentId`)
  - `summary: Optional[Dict[str, Any]]`

Notes:
- `metadata.sourceRanges` or page/sheet indices allow provenance for merges/summaries.
- `metadata.confidence` and `metadata.warnings` guide downstream AI/UX decisions.

### MIME → typeGroup mapping (deterministic first)
- `text/plain`, `text/markdown` → `text`
- `text/csv` → `table`
- `application/json`, `application/xml`, `text/html`, `image/svg+xml` → `structure`
- `image/*` → `image`
- `application/pdf`, `application/vnd.openxmlformats-officedocument.*` → `container`
- otherwise → `binary`

Container extractors are responsible for disaggregating into basic typeGroups.

### 3‑Pass Pipeline
1) Identify and normalize (Split/Extract)
   - Start with a root `container` part representing the raw file.
   - Resolve extractor by `mimeType`/extension via registry.
   - Recursively split container formats into child parts until only basic typeGroups remain (`text|table|structure|image|binary|metadata`).
   - Output a single `ExtractedContent` per input document.
2) Chunk
   - Route each basic typeGroup to its chunker:
     - `text` → size‑bounded line/paragraph aware
     - `table` → row‑bounded (CSV lines), schema aware optional
     - `structure` → JSON object/XML subtree/HTML block aware
     - `image`, `binary`, `metadata`, `container` → no chunking by default
   - Chunkers return `chunks: List[Dict]` with back‑references (`partId`, `order`).

3) Merge
   - Strategy driven by call options and workflow:
     - `text` → concatenate by logical order (page/section) or keep per part
     - `table` → keep separate per table/sheet; optional schema merge
     - `structure` → preserve keys/paths; avoid lossy merges
     - `image|binary` → usually pass‑through
     - `metadata|container` → excluded by default

### Registries
- ExtractorRegistry (in `subRegistry.py`)
  - Maps `mimeType`/extension to an `Extractor` instance.
  - Fallbacks: content sniffing, default binary extractor.
- ChunkerRegistry (in `subRegistry.py`)
  - Maps `typeGroup` to a `Chunker`.

### Base Interfaces
Use camelCase and prefix internal methods with `_`.

```python
from __future__ import annotations  # ContentPart (see Core Data Model) is resolved lazily

from typing import Any, Dict, List


class Extractor:
    # Decide whether this extractor can handle the file.
    def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool: ...
    # Split the raw file into normalized parts.
    def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]: ...


class Chunker:
    def chunk(self, part: ContentPart, options: Dict[str, Any]) -> List[Dict[str, Any]]: ...


class Merger:
    def merge(self, parts: List[ContentPart], strategy: Dict[str, Any]) -> List[ContentPart]: ...
```

### Format Extractors (under `formats/`)
- `text_extractor.py` → emits one `text` part
- `csv_extractor.py` → emits one `table` part (CSV payload)
- `json_extractor.py`, `xml_extractor.py`, `html_extractor.py`, `svg_extractor.py` → emit `structure` parts
- `image_extractor.py` → emits one `image` part; optional OCR is handled by AI post‑processing
- `pdf_extractor.py` → emits `container` root with children:
  - per page: `text` part if text found
  - per page: extracted images as `image` parts
  - per page/section metadata as `metadata`
- `docx_extractor.py` → `container` + children: headings `structure`, paragraphs `text`, tables `table`, comments `metadata`
- `xlsx_extractor.py` → `container` + children: each sheet as `table` CSV; properties `metadata`; charts as `image` or `structure`
- `pptx_extractor.py` → `container` + slides: text boxes `text`, tables `table`, images `image`, notes `metadata`
- `legacy_*_extractor.py` → `metadata` + `binary` with clear limitations
- `binary_extractor.py` → single `binary` part

### Chunkers (under `chunking/`)
- `text_chunker.py` → size/paragraph aware; configurable sizes
- `table_chunker.py` → split by row count/bytes, keep header propagation
- `structure_chunker.py` → JSON object buckets, XML subtree buckets, HTML block buckets
- `binary_chunker.py` → byte slicing when explicitly requested
- `noop_chunker.py` → for image/metadata/container

### Mergers (under `merging/`)
- `text_merger.py` → page/section aware concatenation
- `table_merger.py` → per sheet/table; optional schema merge
- `structure_merger.py` → key/path preserving grouping
- `default_merger.py` → pass‑through

### Orchestration (in `subPipeline.py`)
High‑level flow for one document:

```python
def runExtraction(document: bytes, fileName: str, mimeType: str, options: Dict[str, Any]) -> ExtractedContent:
    # Pass 1: extract/normalize
    parts = _extractAll(document, fileName, mimeType, options)

    # Pass 2: chunk if allowed
    if options.get("chunkAllowed", False):
        chunks = _chunkParts(parts, options)
    else:
        chunks = []

    # Pass 3: merge per strategy
    merged = _merge(parts, chunks, options.get("mergeStrategy", {}))
    return ExtractedContent(id=_makeId(), parts=merged, summary=_buildSummary(parts))
```

### Entry Point and Options (in `mainServiceExtraction.py`)
The service is invoked by AI Prompt Builder with `(documentList, options)`.

Supported options and effects:
- `prompt: str`
  - If present, enables optional AI augmentation on extracted content/chunks based on `operationType`.
- `operationType: Literal["general","generate_plan","analyse_content","generate_content","web_research"]`
  - `general`/`analyse_content`: prefer deterministic extraction; AI can summarize or answer over chunks.
  - `generate_plan`: produce structured `structure` outputs (bullet points, tasks) from `text` chunks.
  - `generate_content`: allow AI synthesis over merged `text` parts within `maxSize`.
  - `web_research`: treat extracted `structure` and `text` as context; AI orchestrator may fetch more docs upstream.
- `processDocumentsIndividually: bool`
  - `True`: run the 3‑pass pipeline per document; apply `maxSize` per document; return list of results.
  - `False`: extract all docs → pool parts → global chunk/merge → apply `maxSize` across the pool; keep provenance by `parentId` and `documentId`.
- `maxSize: int` and `chunkAllowed: bool`
  - Hard cap on total size of content passed to AI.
  - If `chunkAllowed=True` → prefer chunking to stay under `maxSize`; process chunks iteratively in priority order (e.g., text before images, or by page order).
  - If `chunkAllowed=False` → do not chunk; instead summarize down (per part, then hierarchical) until under `maxSize`.

Size governance policy:
1) Compute sizes for candidate parts/chunks.
2) If total ≤ `maxSize` → pass through.
3) If total > `maxSize` and `chunkAllowed` → progressively include highest‑value chunks until the cap; optionally add a final global summary.
4) If total > `maxSize` and not `chunkAllowed` → summarize per part, then merge summaries; ensure final text ≤ cap.

### AI Integration
- AI is optional and strictly after extraction.
- Recommended placements:
  - OCR/VLM for `image` parts when requested.
  - LLM summarization for large `text|structure|table` parts to respect `maxSize` when `chunkAllowed=False`.
  - LLM question answering (`analyse_content`) over selected chunks.
- All AI calls must respect budget/time guards and the size cap.

### Error Handling
- Every extractor must return either valid parts or a `metadata` part with `warnings/error` plus a `binary` fallback when applicable.
- Include enough context in `metadata` to diagnose issues (library missing, parse error details) without leaking sensitive content.

### Ordering and Provenance
- Preserve logical order within a document (page index, slide index, sheet index).
- Maintain `parentId` links to reconstruct hierarchy during merge and summarization.

### Testing Strategy
- Unit tests per extractor on small fixtures for each format.
- Contract tests for the 3‑pass pipeline (end‑to‑end) with mixed multi‑part documents.
- Size‑cap tests validating chunking vs summarization paths.

### Migration Notes
- Existing monolithic logic can be moved into `formats/*` and `utils/*` preserving robust decoding and Office/PDF heuristics, while removing AI calls from extractors.
- `ContentItem` usage should shift to `ContentPart` (no backward compatibility required).
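The four-step size governance policy described under Entry Point and Options can be sketched roughly as follows. This is only an illustration under stated assumptions: `applySizeCap`, the `priority` key, and the `summarize` callback are hypothetical names that do not appear in the design; size is measured as the character length of `data`; and the per-part summary budget is an equal share of `maxSize`, which is one possible policy among several.

```python
from typing import Callable, Dict, List


def applySizeCap(
    items: List[Dict],
    maxSize: int,
    chunkAllowed: bool,
    summarize: Callable[[str, int], str],
) -> List[Dict]:
    """Return items whose combined data length stays within maxSize.

    Assumption: summarize(text, limit) returns at most `limit` characters.
    """
    # Step 1: compute sizes for the candidate parts/chunks.
    total = sum(len(item["data"]) for item in items)

    # Step 2: everything fits, so pass it through unchanged.
    if total <= maxSize:
        return list(items)

    if chunkAllowed:
        # Step 3: greedily include the highest-priority items that still fit.
        # Items that would exceed the cap are skipped, not truncated.
        selected: List[Dict] = []
        used = 0
        ranked = sorted(items, key=lambda i: i.get("priority", 0), reverse=True)
        for item in ranked:
            size = len(item["data"])
            if used + size <= maxSize:
                selected.append(item)
                used += size
        return selected

    # Step 4: chunking is not allowed, so summarize every part down to an
    # equal share of the cap (hypothetical budget policy).
    budget = maxSize // max(len(items), 1)
    return [{**item, "data": summarize(item["data"], budget)} for item in items]
```

In a real deployment the `summarize` callback would presumably wrap an LLM call guarded by the budget/time limits mentioned under AI Integration; for tests, a plain truncation such as `lambda text, limit: text[:limit]` is enough to exercise both paths.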
### Minimal Pseudocode – processDocumentsIndividually

```python
def extractDocuments(documentList: List[Dict], options: Dict[str, Any]):
    if options.get("processDocumentsIndividually", True):
        # Per-document mode: one ExtractedContent per input document
        results = []
        for doc in documentList:
            ec = runExtraction(doc["bytes"], doc["fileName"], doc["mimeType"], options)
            ec = _applyAiIfRequested(ec, options)  # respects maxSize + chunkAllowed
            results.append(ec)
        return results
    else:
        # global pool
        parts = []
        for doc in documentList:
            ec = runExtraction(doc["bytes"], doc["fileName"], doc["mimeType"], options)
            parts.extend(_tagWithDocumentId(ec.parts, doc["id"]))
        pooled = _poolAndLimit(parts, options)  # chunk/summarize to cap
        pooled = _applyAiIfRequestedOverPool(pooled, options)
        return pooled
```

### Defaults and Configuration
- Chunk sizes per typeGroup are centralized and configurable.
- Merge strategies (text concat policy, table schema inference) are pluggable.
- Registries support runtime extension (new formats) without touching the pipeline.

### Summary
This design introduces a small, stable contract (`ContentPart` with `typeGroup`) and a 3‑pass pipeline that:
- normalizes diverse documents into uniform parts,
- chunks only what benefits from chunking,
- merges predictably for downstream AI and workflow steps,

while strictly enforcing `maxSize` and honoring `chunkAllowed` and `processDocumentsIndividually`.
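
To make the `ContentPart` contract and the deterministic MIME → typeGroup mapping more concrete, here is a minimal sketch. The field list follows the Core Data Model; the function name `resolveTypeGroup`, the module-level constants, and the handling of MIME parameters (stripping anything after `;` and lowercasing) are assumptions made for illustration, not part of the design.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ContentPart:
    id: str
    parentId: Optional[str]  # None for the root part
    label: str
    typeGroup: str           # one of the seven typeGroups
    mimeType: str
    data: str = ""           # empty for container parts
    metadata: Dict[str, Any] = field(default_factory=dict)


# Hypothetical constants mirroring the mapping table in this document.
_STRUCTURE_MIMES = {"application/json", "application/xml", "text/html", "image/svg+xml"}
_OFFICE_PREFIX = "application/vnd.openxmlformats-officedocument."


def resolveTypeGroup(mimeType: str) -> str:
    """Map a MIME type to a typeGroup using exact and prefix rules only."""
    # Assumption: parameters such as "; charset=utf-8" are ignored.
    mime = mimeType.split(";")[0].strip().lower()
    if mime in ("text/plain", "text/markdown"):
        return "text"
    if mime == "text/csv":
        return "table"
    # Checked before the image/* rule because image/svg+xml is a structure type.
    if mime in _STRUCTURE_MIMES:
        return "structure"
    if mime.startswith("image/"):
        return "image"
    if mime == "application/pdf" or mime.startswith(_OFFICE_PREFIX):
        return "container"
    return "binary"


# Usage: build the root part that Pass 1 starts from.
root = ContentPart(
    id="doc-1",
    parentId=None,
    label="root",
    typeGroup=resolveTypeGroup("application/pdf"),
    mimeType="application/pdf",
)
```

Rule order matters in such a sketch: the explicit `image/svg+xml` → `structure` entry has to be evaluated before the `image/*` wildcard, otherwise SVG files would be classified as `image`.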