PowerON Extraction Service – Concept and Architecture
Goals
- Normalize any document into a small set of processing‑ready typeGroups: text, table, structure, image, binary, metadata, container.
- Decouple extraction (split/normalize) from chunking and merging.
- Scale to multi‑part/container formats (pdf, office) using recursive splitting.
- Control cost/latency by honoring maxSize and chunkAllowed, with AI only when needed.
- Integrate with AI Prompt Builder entrypoint and support operationType behavior.
New Service Location
- Base: gateway/modules/services/serviceExtraction/mainServiceExtraction.py
- Sub‑modules:
  - gateway/modules/services/serviceExtraction/subRegistry.py (extractor and chunker registries)
  - gateway/modules/services/serviceExtraction/subPipeline.py (3‑pass orchestration)
  - gateway/modules/services/serviceExtraction/formats/ (per‑format extractors)
  - gateway/modules/services/serviceExtraction/chunking/ (per‑typeGroup chunkers)
  - gateway/modules/services/serviceExtraction/merging/ (per‑typeGroup mergers)
  - gateway/modules/services/serviceExtraction/utils/ (encoding, mime, helpers)
No backwards compatibility is required; this is a clean introduction.
Core Data Model (standardized outputs)
- ContentPart
  - id: str
  - parentId: Optional[str] (preserve hierarchy; root has None)
  - label: str (e.g., "page_2", "sheet_Jan", "table_1")
  - typeGroup: Literal["text","table","structure","image","binary","metadata","container"]
  - mimeType: str
  - data: str (utf‑8 text for text|table|structure; base64 for image|binary; empty for container)
  - metadata: Dict[str, Any] (size, pages, width/height, pageIndex, sheetName, sourceRanges, checksum, confidence, warnings)
- ExtractedContent
  - id: str (document id)
  - parts: List[ContentPart] (flat list; hierarchy via parentId)
  - summary: Optional[Dict[str, Any]]
Notes:
- metadata.sourceRanges or page/sheet indices allow provenance for merges/summaries.
- metadata.confidence and metadata.warnings guide downstream AI/UX decisions.
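The data model above could be expressed with dataclasses. This is a minimal sketch rather than the service's actual definitions: the field names and types come from the list above, while the defaults (empty data, empty metadata, no summary) and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Literal, Optional

TypeGroup = Literal["text", "table", "structure", "image", "binary", "metadata", "container"]


@dataclass
class ContentPart:
    id: str
    parentId: Optional[str]  # None marks the root part
    label: str
    typeGroup: TypeGroup
    mimeType: str
    data: str = ""  # assumed default; containers carry no payload
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ExtractedContent:
    id: str
    parts: List[ContentPart] = field(default_factory=list)  # flat list
    summary: Optional[Dict[str, Any]] = None


# Hypothetical example: a PDF root with one text page hanging off it.
root = ContentPart(id="p0", parentId=None, label="report.pdf",
                   typeGroup="container", mimeType="application/pdf")
page = ContentPart(id="p1", parentId="p0", label="page_1",
                   typeGroup="text", mimeType="text/plain",
                   data="Hello", metadata={"pageIndex": 0})
extracted = ExtractedContent(id="doc-1", parts=[root, page])
```

Keeping parts flat and linking them through parentId means a consumer can filter by typeGroup without walking a tree, and still rebuild the tree when it needs to.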
MIME → typeGroup mapping (deterministic first)
- text/plain, text/markdown → text
- text/csv → table
- application/json, application/xml, text/html, image/svg+xml → structure
- image/* → image
- application/pdf, application/vnd.openxmlformats-officedocument.* → container
- otherwise → binary
Container extractors are responsible for disaggregating into basic typeGroups.
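One way to implement the mapping is an exact-match table consulted before the wildcard rules, so that image/svg+xml resolves to structure before the image/* rule can claim it. The function name resolveTypeGroup and the stripping of MIME parameters such as "; charset=utf-8" are assumptions of this sketch, not part of the design above.

```python
# Exact matches are checked first so that image/svg+xml wins over image/*.
_EXACT_TYPE_GROUPS = {
    "text/plain": "text",
    "text/markdown": "text",
    "text/csv": "table",
    "application/json": "structure",
    "application/xml": "structure",
    "text/html": "structure",
    "image/svg+xml": "structure",
    "application/pdf": "container",
}

_OFFICE_PREFIX = "application/vnd.openxmlformats-officedocument."


def resolveTypeGroup(mimeType: str) -> str:
    # Assumption: parameters after ";" are irrelevant to the typeGroup.
    mime = mimeType.split(";", 1)[0].strip().lower()
    if mime in _EXACT_TYPE_GROUPS:
        return _EXACT_TYPE_GROUPS[mime]
    if mime.startswith("image/"):
        return "image"
    if mime.startswith(_OFFICE_PREFIX):
        return "container"
    return "binary"
```

For instance, resolveTypeGroup("image/png") yields "image", while an unknown type such as "application/zip" falls through to "binary".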
3‑Pass Pipeline
- Identify and normalize (Split/Extract)
  - Start with a root container part representing the raw file.
  - Resolve extractor by mimeType/extension via registry.
  - Recursively split container formats into child parts until only basic typeGroups remain (text|table|structure|image|binary|metadata).
  - Output a single ExtractedContent per input document.
- Chunk
  - Route each basic typeGroup to its chunker:
    - text → size‑bounded line/paragraph aware
    - table → row‑bounded (CSV lines), schema aware optional
    - structure → JSON object/XML subtree/HTML block aware
    - image, binary, metadata, container → no chunking by default
  - Chunkers return chunks: List[Dict] with back‑references (partId, order).
- Merge
  - Strategy driven by call options and workflow:
    - text → concatenate by logical order (page/section) or keep per part
    - table → keep separate per table/sheet; optional schema merge
    - structure → preserve keys/paths; avoid lossy merges
    - image|binary → usually pass‑through
    - metadata|container → excluded by default
Registries
- ExtractorRegistry (in subRegistry.py)
  - Maps mimeType/extension to an Extractor instance.
  - Fallbacks: content sniffing, default binary extractor.
- ChunkerRegistry (in subRegistry.py)
  - Maps typeGroup to a Chunker.
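The lookup order described for the ExtractorRegistry (declared MIME type, then extension, then content sniffing, then the default binary extractor) might look roughly like the sketch below. The register and resolve method names, the constructor argument, and the two stub extractor classes are hypothetical; only the detect signature follows the base interface in this document.

```python
import os
from typing import Dict, List


class PdfExtractor:
    """Illustrative stub; only detect() is shown."""

    def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
        # PDF files begin with the "%PDF-" signature.
        return headBytes.startswith(b"%PDF-")


class BinaryExtractor:
    """Catch-all used when nothing else matches."""

    def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
        return True


class ExtractorRegistry:
    def __init__(self, default) -> None:
        self._byMime: Dict[str, object] = {}
        self._byExtension: Dict[str, object] = {}
        self._sniffers: List[object] = []
        self._default = default

    def register(self, extractor, mimeTypes=(), extensions=()) -> None:
        for mimeType in mimeTypes:
            self._byMime[mimeType.lower()] = extractor
        for extension in extensions:
            self._byExtension[extension.lower()] = extractor
        self._sniffers.append(extractor)

    def resolve(self, fileName: str, mimeType: str, headBytes: bytes = b""):
        # 1) declared MIME type, 2) extension, 3) content sniffing, 4) default.
        extractor = self._byMime.get(mimeType.lower())
        if extractor is None:
            extension = os.path.splitext(fileName)[1].lower()
            extractor = self._byExtension.get(extension)
        if extractor is None:
            extractor = next(
                (e for e in self._sniffers if e.detect(fileName, mimeType, headBytes)),
                None,
            )
        return extractor if extractor is not None else self._default


pdfExtractor = PdfExtractor()
binaryExtractor = BinaryExtractor()
registry = ExtractorRegistry(default=binaryExtractor)
registry.register(pdfExtractor, mimeTypes=["application/pdf"], extensions=[".pdf"])
```

Because registration happens at runtime, a new format only needs a new register call; the pipeline code that calls resolve does not change.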
Base Interfaces
Use camelCase and prefix internal methods with _.
    from typing import Any, Dict, List

    class Extractor:
        def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool: ...
        def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]: ...

    class Chunker:
        def chunk(self, part: ContentPart, options: Dict[str, Any]) -> List[Dict[str, Any]]: ...

    class Merger:
        def merge(self, parts: List[ContentPart], strategy: Dict[str, Any]) -> List[ContentPart]: ...
Format Extractors (under formats/)
- text_extractor.py → emits one text part
- csv_extractor.py → emits one table part (CSV payload)
- json_extractor.py, xml_extractor.py, html_extractor.py, svg_extractor.py → emit structure parts
- image_extractor.py → emits one image part; optional OCR is handled by AI post‑processing
- pdf_extractor.py → emits container root with children:
  - per page: text part if text found
  - per page: extracted images as image parts
  - per page/section metadata as metadata
- docx_extractor.py → container + children: headings as structure, paragraphs as text, tables as table, comments as metadata
- xlsx_extractor.py → container + children: each sheet as table CSV; properties as metadata; charts as image or structure
- pptx_extractor.py → container + slides: text boxes as text, tables as table, images as image, notes as metadata
- legacy_*_extractor.py → metadata + binary with clear limitations
- binary_extractor.py → single binary part
Chunkers (under chunking/)
- text_chunker.py → size/paragraph aware; configurable sizes
- table_chunker.py → split by row count/bytes, keep header propagation
- structure_chunker.py → JSON object buckets, XML subtree buckets, HTML block buckets
- binary_chunker.py → byte slicing when explicitly requested
- noop_chunker.py → for image/metadata/container
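Header propagation in the table chunker can be illustrated with the standard csv module: every chunk repeats the header row so it can be read on its own. This is a sketch under assumptions; the function name chunkTable, the maxRows parameter, and the "data" key are hypothetical, while partId and order follow the back‑references named in the pipeline section.

```python
import csv
import io
from typing import Any, Dict, List


def chunkTable(partId: str, csvText: str, maxRows: int) -> List[Dict[str, Any]]:
    if maxRows < 1:
        raise ValueError("maxRows must be at least 1")
    rows = list(csv.reader(io.StringIO(csvText)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    chunks: List[Dict[str, Any]] = []
    # max(..., 1) keeps a header-only table as a single header-only chunk.
    for order, start in enumerate(range(0, max(len(body), 1), maxRows)):
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)  # header repeated in every chunk
        writer.writerows(body[start:start + maxRows])
        chunks.append({"partId": partId, "order": order, "data": buffer.getvalue()})
    return chunks


tableChunks = chunkTable("t1", "a,b\n1,2\n3,4\n5,6\n", maxRows=2)
```

Parsing with csv.reader rather than splitting on newlines keeps quoted fields that contain line breaks inside a single row.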
Mergers (under merging/)
- text_merger.py → page/section aware concatenation
- table_merger.py → per sheet/table; optional schema merge
- structure_merger.py → key/path preserving grouping
- default_merger.py → pass‑through
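As an illustration of page‑aware concatenation, the sketch below sorts text parts by metadata.pageIndex and joins them, recording which parts contributed. Parts are plain dicts here to keep the example self-contained; the function name mergeTextParts, the separator default, and the sourceParts metadata key are assumptions, not the service's contract.

```python
from typing import Any, Dict, List


def mergeTextParts(parts: List[Dict[str, Any]], separator: str = "\n\n") -> Dict[str, Any]:
    textParts = [p for p in parts if p["typeGroup"] == "text"]
    # sorted() is stable, so parts on the same page keep their input order.
    ordered = sorted(textParts, key=lambda p: p["metadata"].get("pageIndex", 0))
    return {
        "typeGroup": "text",
        "data": separator.join(p["data"] for p in ordered),
        # Hypothetical provenance field listing the contributing part ids.
        "metadata": {"sourceParts": [p["id"] for p in ordered]},
    }


mergedText = mergeTextParts([
    {"id": "b", "typeGroup": "text", "data": "second", "metadata": {"pageIndex": 1}},
    {"id": "i", "typeGroup": "image", "data": "", "metadata": {"pageIndex": 0}},
    {"id": "a", "typeGroup": "text", "data": "first", "metadata": {"pageIndex": 0}},
])
```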
Orchestration (in subPipeline.py)
High‑level flow for one document:
    def runExtraction(document: bytes, fileName: str, mimeType: str, options: Dict[str, Any]) -> ExtractedContent:
        # Pass 1: extract/normalize
        parts = _extractAll(document, fileName, mimeType, options)
        # Pass 2: chunk if allowed
        if options.get("chunkAllowed", False):
            chunks = _chunkParts(parts, options)
        else:
            chunks = []
        # Pass 3: merge per strategy
        merged = _merge(parts, chunks, options.get("mergeStrategy", {}))
        return ExtractedContent(id=_makeId(), parts=merged, summary=_buildSummary(parts))
Entry Point and Options (in mainServiceExtraction.py)
The service is invoked by AI Prompt Builder with (documentList, options).
Supported options and effects:
- prompt: str
  - If present, enables optional AI augmentation on extracted content/chunks based on operationType.
- operationType: Literal["general","generate_plan","analyse_content","generate_content","web_research"]
  - general / analyse_content: prefer deterministic extraction; AI can summarize or answer over chunks.
  - generate_plan: produce structured structure outputs (bullet points, tasks) from text chunks.
  - generate_content: allow AI synthesis over merged text parts within maxSize.
  - web_research: treat extracted structure and text as context; AI orchestrator may fetch more docs upstream.
- processDocumentsIndividually: bool
  - True: run the 3‑pass pipeline per document; apply maxSize per document; return list of results.
  - False: extract all docs → pool parts → global chunk/merge → apply maxSize across the pool; keep provenance by parentId and documentId.
- maxSize: int and chunkAllowed: bool
  - Hard cap on total size of content passed to AI.
  - If chunkAllowed=True → prefer chunking to stay under maxSize; process chunks iteratively in priority order (e.g., text before images, or by page order).
  - If chunkAllowed=False → do not chunk; instead summarize down (per part, then hierarchical) until under maxSize.
Size governance policy:
- Compute sizes for candidate parts/chunks.
- If total ≤ maxSize → pass through.
- If total > maxSize and chunkAllowed → progressively include highest‑value chunks until the cap; optionally add a final global summary.
- If total > maxSize and not chunkAllowed → summarize per part, then merge summaries; ensure final text ≤ cap.
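The three branches of the policy can be sketched as one function. Several things here are assumptions: size is measured in characters, the input list is already sorted by priority, selection stops at the first chunk that would exceed the cap, and the summarize callback stands in for an LLM summarizer (the example passes a plain truncation so the sketch runs without any AI call). The name applySizeCap is hypothetical.

```python
from typing import Callable, List


def applySizeCap(
    texts: List[str],
    maxSize: int,
    chunkAllowed: bool,
    summarize: Callable[[str, int], str],
) -> List[str]:
    total = sum(len(t) for t in texts)
    if total <= maxSize:
        return list(texts)  # under the cap: pass through
    if chunkAllowed:
        selected: List[str] = []
        used = 0
        for text in texts:  # assumed to be in priority order
            if used + len(text) > maxSize:
                break  # stop at the first chunk that does not fit
            selected.append(text)
            used += len(text)
        return selected
    # No chunking: give every part an equal share of the budget.
    budget = maxSize // len(texts)
    return [summarize(text, budget) for text in texts]


def truncate(text: str, limit: int) -> str:
    # Placeholder for a real summarizer; it only guarantees the length bound.
    return text[:limit]


kept = applySizeCap(["aaaa", "bbbb", "cccc"], 8, True, truncate)
shrunk = applySizeCap(["aaaa", "bbbb", "cccc"], 8, False, truncate)
```

An equal split is the simplest budget that provably keeps the total under the cap; a weighted split by part size or priority would be a natural refinement.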
AI Integration
- AI is optional and strictly after extraction.
- Recommended placements:
  - OCR/VLM for image parts when requested.
  - LLM summarization for large text|structure|table parts to respect maxSize when chunkAllowed=False.
  - LLM question answering (analyse_content) over selected chunks.
- All AI calls must respect budget/time guards and the size cap.
Error Handling
- Every extractor must return either valid parts or a metadata part with warnings/error plus a binary fallback when applicable.
- Include enough context in metadata to diagnose issues (library missing, parse error details) without leaking sensitive content.
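A wrapper around an extractor call is one way to guarantee this contract. In the sketch below, a failure yields a metadata part carrying the warning plus a base64 binary fallback, and only the exception type is recorded so that document content quoted in an error message cannot leak. The function name safeExtract, the part ids and labels, and the dict-based parts are all hypothetical.

```python
import base64
from typing import Any, Callable, Dict, List


def safeExtract(
    extractFn: Callable[[bytes], List[Dict[str, Any]]],
    fileBytes: bytes,
    mimeType: str,
) -> List[Dict[str, Any]]:
    try:
        return extractFn(fileBytes)
    except Exception as exc:
        errorName = type(exc).__name__  # type only; the message may quote content
        warningPart = {
            "id": "warn-0", "parentId": None, "label": "extraction_error",
            "typeGroup": "metadata", "mimeType": "application/json", "data": "",
            "metadata": {"warnings": ["extraction failed: " + errorName],
                         "error": errorName},
        }
        fallbackPart = {
            "id": "bin-0", "parentId": None, "label": "raw",
            "typeGroup": "binary", "mimeType": mimeType,
            "data": base64.b64encode(fileBytes).decode("ascii"),
            "metadata": {"size": len(fileBytes)},
        }
        return [warningPart, fallbackPart]


def _brokenExtractor(fileBytes: bytes) -> List[Dict[str, Any]]:
    raise ValueError("secret text from the document")


fallbackParts = safeExtract(_brokenExtractor, b"abc", "application/pdf")
```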
Ordering and Provenance
- Preserve logical order within a document (page index, slide index, sheet index).
- Maintain parentId links to reconstruct hierarchy during merge and summarization.
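Reconstructing the hierarchy from a flat list needs only a parent‑to‑children index. The sketch below walks that index depth‑first and preserves the input order among siblings; it assumes the parentId links are acyclic and uses minimal dict parts. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional


def buildChildIndex(parts: List[Dict[str, Any]]) -> Dict[Optional[str], List[str]]:
    children: Dict[Optional[str], List[str]] = defaultdict(list)
    for part in parts:
        children[part["parentId"]].append(part["id"])
    return dict(children)


def walkDepthFirst(parts: List[Dict[str, Any]]) -> List[str]:
    children = buildChildIndex(parts)
    ordered: List[str] = []

    def _visit(parentId: Optional[str]) -> None:
        for childId in children.get(parentId, []):
            ordered.append(childId)
            _visit(childId)

    _visit(None)  # roots are the parts whose parentId is None
    return ordered


flatParts = [
    {"id": "root", "parentId": None},
    {"id": "page_1", "parentId": "root"},
    {"id": "page_2", "parentId": "root"},
    {"id": "img_1", "parentId": "page_1"},
]
visitOrder = walkDepthFirst(flatParts)
```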
Testing Strategy
- Unit tests per extractor on small fixtures for each format.
- Contract tests for the 3‑pass pipeline (end‑to‑end) with mixed multi‑part documents.
- Size‑cap tests validating chunking vs summarization paths.
Migration Notes
- Existing monolithic logic can be moved into formats/* and utils/* preserving robust decoding and Office/PDF heuristics, while removing AI calls from extractors.
- ContentItem usage should shift to ContentPart (no backward compatibility required).
Minimal Pseudocode – processDocumentsIndividually
    def extractDocuments(documentList: List[Dict], options: Dict[str, Any]):
        if options.get("processDocumentsIndividually", True):
            # per-document: each result is capped independently
            results = []
            for doc in documentList:
                ec = runExtraction(doc["bytes"], doc["fileName"], doc["mimeType"], options)
                ec = _applyAiIfRequested(ec, options)  # respects maxSize + chunkAllowed
                results.append(ec)
            return results
        else:
            # global pool
            parts = []
            for doc in documentList:
                ec = runExtraction(doc["bytes"], doc["fileName"], doc["mimeType"], options)
                parts.extend(_tagWithDocumentId(ec.parts, doc["id"]))
            pooled = _poolAndLimit(parts, options)  # chunk/summarize to cap
            pooled = _applyAiIfRequestedOverPool(pooled, options)
            return pooled
Defaults and Configuration
- Chunk sizes per typeGroup are centralized and configurable.
- Merge strategies (text concat policy, table schema inference) are pluggable.
- Registries support runtime extension (new formats) without touching the pipeline.
Summary
This design introduces a small, stable contract (ContentPart with typeGroup) and a 3‑pass pipeline that:
- normalizes diverse documents into uniform parts,
- chunks only what benefits from chunking,
- merges predictably for downstream AI and workflow steps,
while strictly enforcing maxSize and honoring chunkAllowed and processDocumentsIndividually.