Normalization Service Integration (Design + Refactor)
Goal
Introduce a deterministic Normalization stage between extraction merge and format-specific rendering to guarantee a single canonical, aligned table for downstream outputs (CSV/HTML/Excel/XML/JSON objects) without language-specific code.
Scope
- Applies to all workflows that use per-chunk extraction and merged JSON.
- Lives after merged JSON assembly and before Generation/Rendering.
- Single-path flow (no fallbacks), testable end-to-end.
- Cache mapping for the current workflow run; prompts remain generic.
Canonical Output Contract
- One consolidated table section in merged JSON:
  - headers: ["Date","Merchant","CreditCardNumber","TotalAmount","Currency","VATRate"]
  - rows: normalized and aligned values
- This canonical table is the only input the format engines require.
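As an illustration, the canonical table section could look like the structure below. Only the header names come from the contract above; the `sections` and `type` keys are assumptions about the merged JSON layout, and the sample row and the `isAligned` helper are invented for the example.

```python
# Canonical header order, taken from the contract above.
CANONICAL_HEADERS = ["Date", "Merchant", "CreditCardNumber", "TotalAmount", "Currency", "VATRate"]

# Hypothetical merged JSON shape: the "sections" and "type" keys are assumed.
canonicalMerged = {
    "sections": [
        {
            "type": "table",
            "headers": CANONICAL_HEADERS,
            "rows": [
                # Invented sample row, one value per canonical header.
                ["2024-03-01", "Example Store", "1234", "19.99", "EUR", "19"],
            ],
        }
    ]
}


def isAligned(section):
    """True when the headers are canonical and every row has one value per header."""
    return section["headers"] == CANONICAL_HEADERS and all(
        len(row) == len(CANONICAL_HEADERS) for row in section["rows"]
    )
```

A format engine that receives this section can rely on column positions alone, which is what makes a single rendering path possible.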
High-Level Flow
- Extraction per chunk → merged JSON (unchanged)
- Normalization Service:
  - Analyze structures: discover candidate headers/paths and value samples (no AI)
  - Get mapping from AI (bounded, deterministic mapping JSON only)
  - Apply mapping deterministically in code (flatten + rename + value normalize)
  - Validate canonical table; fail if zero rows
  - Emit debug artifacts for traceability
- Generation/Rendering consumes canonical table
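The flow above can be sketched as one orchestrating function. This is a minimal sketch, not the actual wiring: the stage names match the public API listed under Module and Integration Points, but passing them in through a `steps` dictionary, the `rowCount` report key, and the artifact key names are assumptions made to keep the example self-contained.

```python
def runNormalization(mergedJson, canonicalSpec, cache, steps, debug=False):
    """Run the stages in order; `steps` maps stage names to callables (assumed wiring)."""
    inventory = steps["discoverStructures"](mergedJson)
    mappingSpec = steps["requestHeaderMapping"](inventory, canonicalSpec, cache)
    canonicalJson = steps["applyMapping"](mergedJson, mappingSpec)
    report = steps["validateCanonical"](canonicalJson)

    # Persist before failing so that a zero-row run still leaves artifacts behind.
    artifacts = {
        "header_inventory": inventory,
        "mapping": mappingSpec,
        "normalization_report": report,
        "canonical_merged": canonicalJson,
    }
    steps["persistArtifacts"](artifacts, whenDebug=debug)

    # Single path: there is no fallback when nothing survives normalization.
    if report["rowCount"] == 0:
        raise ValueError("Normalization produced zero rows; see debug artifacts")
    return canonicalJson
```

Injecting the stages keeps the flow testable end-to-end with stubs, including the AI mapping call.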
Module and Integration Points
- Add modules/services/serviceNormalization/mainServiceNormalization.py with public API:
  - discoverStructures(mergedJson) -> StructureInventory
  - requestHeaderMapping(inventory, canonicalSpec, cache) -> MappingSpec
  - applyMapping(mergedJson, mappingSpec) -> CanonicalMergedJson
  - validateCanonical(canonicalJson) -> ValidationReport
  - persistArtifacts(artifacts, whenDebug=True)
- Call site: SubDocumentProcessing.processDocumentsPerChunkJson(...)
  - After _mergeChunkResultsJson(...)
  - Before SubDocumentGeneration/GenerationService rendering
Detailed Steps
1) Discover Structures (No AI)
- Input: merged JSON with sections (tables, paragraphs, headings). For JSON/XML sources, also consider objects.
- Output: StructureInventory containing:
  - tableHeaders: set of distinct header labels (deduped across sections)
  - headerSamples: small value samples per header (e.g., up to 5 distinct examples)
  - objectPaths: candidate field paths for JSON/XML object sources with value samples (optional)
Notes:
- Process in memory; streaming is not required for this step.
- No language rules in code.
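A minimal sketch of the discovery step for table sections is shown below. It assumes the merged JSON exposes a `sections` list whose table entries carry `type`, `headers`, and `rows` keys; that layout, the `maxSamples` parameter, and returning `tableHeaders` as a sorted list (so the inventory stays JSON-serializable) are assumptions. Object path discovery is left out.

```python
def discoverStructures(mergedJson, maxSamples=5):
    """Collect distinct table headers and a few distinct sample values per header."""
    headerSamples = {}
    for section in mergedJson.get("sections", []):
        if section.get("type") != "table":
            continue  # paragraphs and headings carry no headers
        for index, header in enumerate(section.get("headers", [])):
            samples = headerSamples.setdefault(header, [])
            for row in section.get("rows", []):
                if index >= len(row):
                    continue  # tolerate short rows
                value = row[index]
                if value not in samples and len(samples) < maxSamples:
                    samples.append(value)
    return {
        "tableHeaders": sorted(headerSamples),
        "headerSamples": headerSamples,
    }
```

The function only counts and samples; it never inspects what a label means, so no language rule ends up in code.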
2) AI Mapping (Headers/Paths → Canonical)
- Prompt inputs (generic):
  - Canonical schema (names + short definitions + constraints)
  - Discovered header inventory (and object paths, if any) + minimal value samples
- Response (strict JSON):
  - mappings: { "Datum": "Date", "Händler": "Merchant", ... }
  - Optional objectPathMappings for JSON/XML flattening
  - normalizationPolicy with specific rules per field (e.g., decimalSeparator, dateFormatCandidates, currencyPlacement)
Constraints:
- One-to-one or null mappings only (no invented fields)
- No free text; reject if schema invalid
- Cache mapping for the current workflow run (in-memory + test-chat/ai/mapping.json when debug)
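The constraints above could be enforced with a small check on the AI response before it is cached or applied. This is a sketch under assumptions: `validateMappingSpec` is a hypothetical helper, and "one-to-one" is read here as "each source header maps to exactly one canonical name or to null", so several source headers may still converge on the same canonical field.

```python
CANONICAL_HEADERS = {"Date", "Merchant", "CreditCardNumber", "TotalAmount", "Currency", "VATRate"}
ALLOWED_KEYS = {"mappings", "objectPathMappings", "normalizationPolicy"}


def validateMappingSpec(spec, discoveredHeaders):
    """Return a list of problems; an empty list means the mapping is acceptable."""
    if not isinstance(spec, dict):
        return ["response is not a JSON object"]
    problems = []
    extra = set(spec) - ALLOWED_KEYS
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    mappings = spec.get("mappings")
    if not isinstance(mappings, dict):
        problems.append("'mappings' must be an object")
        return problems
    for source, target in mappings.items():
        if source not in discoveredHeaders:
            problems.append(f"unknown source header: {source!r}")
        if target is None:
            continue  # null mapping: the header is intentionally dropped
        if not isinstance(target, str) or target not in CANONICAL_HEADERS:
            problems.append(f"not a canonical field: {target!r}")
    return problems
```

A non-empty result would be treated as an invalid mapping and rejected rather than repaired.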
3) Apply Mapping (Deterministic Code)
- Flatten JSON/XML objects per objectPathMappings (when provided) into rows.
- Transform table sections to the canonical headers per mappings.
- Normalize values per normalizationPolicy:
  - Decimals: comma → dot
  - Currency: split symbol/code into Currency, keep numeric in TotalAmount
  - Date: parse with candidate formats → ISO or target format
  - CreditCardNumber: extract last 4 if masked
- Merge all rows into a single table with canonical headers.
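The table part of this step can be sketched as follows. Several details are assumptions rather than the defined contract: the `sections`/`type` layout of the merged JSON, a flat `normalizationPolicy` dictionary, `dateFormatCandidates` expressed as `strptime` patterns, the small symbol table, and treating `.` as a thousands separator when the decimal separator is a comma. Object flattening via objectPathMappings is omitted.

```python
import re
from datetime import datetime

CANONICAL_HEADERS = ["Date", "Merchant", "CreditCardNumber", "TotalAmount", "Currency", "VATRate"]
CURRENCY_SYMBOLS = {"€": "EUR", "$": "USD", "£": "GBP"}  # hypothetical symbol table


def normalizeDecimal(text, decimalSeparator=","):
    """Return the amount with a dot as decimal separator and no grouping separators."""
    cleaned = text.strip().replace(" ", "")
    if decimalSeparator == ",":
        return cleaned.replace(".", "").replace(",", ".")
    return cleaned.replace(",", "")


def splitCurrency(text):
    """Split '19,99 €' into ('19,99', 'EUR'); currency is None when absent."""
    match = re.search(r"[A-Z]{3}|[€$£]", text)
    if match is None:
        return text.strip(), None
    amount = (text[:match.start()] + text[match.end():]).strip()
    token = match.group(0)
    return amount, CURRENCY_SYMBOLS.get(token, token)


def normalizeDate(text, candidates):
    """Try each candidate format and return an ISO date, or None if none match."""
    for fmt in candidates:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalizeCardNumber(text):
    """Keep only the last four digits when the value is masked."""
    if re.search(r"[*xX•]", text):
        return re.sub(r"\D", "", text)[-4:]
    return text.strip()


def applyMapping(mergedJson, mappingSpec):
    """Rename headers, normalize values, and merge all rows into one canonical table."""
    mappings = mappingSpec["mappings"]
    policy = mappingSpec.get("normalizationPolicy", {})
    rows = []
    for section in mergedJson.get("sections", []):
        if section.get("type") != "table":
            continue
        targets = [mappings.get(header) for header in section["headers"]]
        for row in section["rows"]:
            record = dict.fromkeys(CANONICAL_HEADERS, "")
            for target, value in zip(targets, row):
                if target is not None:
                    record[target] = value
            amount, currency = splitCurrency(record["TotalAmount"])
            record["TotalAmount"] = normalizeDecimal(amount, policy.get("decimalSeparator", ","))
            if currency and not record["Currency"]:
                record["Currency"] = currency
            isoDate = normalizeDate(record["Date"], policy.get("dateFormatCandidates", ["%Y-%m-%d"]))
            if isoDate is not None:
                record["Date"] = isoDate
            record["CreditCardNumber"] = normalizeCardNumber(record["CreditCardNumber"])
            rows.append([record[header] for header in CANONICAL_HEADERS])
    return {"sections": [{"type": "table", "headers": CANONICAL_HEADERS, "rows": rows}]}
```

Values that cannot be parsed are left as they are in this sketch, so the validation step can report them instead of silently dropping rows.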
4) Validate
- Field-level validators (parseable date, numeric amounts, currency in allowed set).
- If zero rows after normalization: fail with a clear error and persist artifacts.
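A sketch of the validators is shown below. The `ALLOWED_CURRENCIES` set, the report keys (`rowCount`, `errors`, `ok`), and the assumption that all cell values are strings are illustrative choices. The function reports a zero-row table through `ok` instead of raising, so that the caller can persist artifacts first and then fail.

```python
from datetime import date

ALLOWED_CURRENCIES = {"EUR", "USD", "CHF", "GBP"}  # hypothetical allowed set


def validateCanonical(canonicalJson):
    """Check each row's date, amount, and currency; report problems per field."""
    section = canonicalJson["sections"][0]
    headers, rows = section["headers"], section["rows"]
    errors = []
    for index, row in enumerate(rows):
        record = dict(zip(headers, row))
        try:
            date.fromisoformat(record["Date"])
        except ValueError:
            errors.append((index, "Date", record["Date"]))
        try:
            float(record["TotalAmount"])
        except ValueError:
            errors.append((index, "TotalAmount", record["TotalAmount"]))
        if record["Currency"] not in ALLOWED_CURRENCIES:
            errors.append((index, "Currency", record["Currency"]))
    # Zero rows is the hard failure; field errors are reported for diagnosis.
    return {"rowCount": len(rows), "errors": errors, "ok": len(rows) > 0}
```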
Artifacts (Debug Mode)
- *_header_inventory.json / *_path_inventory.json
- *_mapping.json (AI output)
- *_normalization_report.json (counts, per-field conversions)
- *_canonical_merged.json (final single table)
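Writing these files could look like the sketch below. The `outputDir` and `prefix` parameters are assumptions added so the example is self-contained; the documented signature is persistArtifacts(artifacts, whenDebug=True), and where the real service takes its target directory and file prefix from is not specified here.

```python
import json
from pathlib import Path


def persistArtifacts(artifacts, outputDir, prefix, whenDebug=True):
    """Write each artifact as <prefix>_<name>.json; do nothing outside debug mode."""
    if not whenDebug:
        return []
    targetDir = Path(outputDir)
    targetDir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, payload in artifacts.items():
        path = targetDir / f"{prefix}_{name}.json"
        # ensure_ascii=False keeps headers such as "Händler" readable in the file.
        path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
        written.append(path)
    return written
```

Returning the written paths makes it easy to attach them to an error message when normalization fails.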
Error Handling
- Single-path only (no fallback). If the mapping is invalid or yields zero rows:
  - Raise an error with pointers to artifacts for diagnosis
  - The workflow surface shows a concise failure reason
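One way to carry both the concise reason and the artifact pointers is a dedicated exception type. `NormalizationError` and its attributes are hypothetical names used for illustration only.

```python
class NormalizationError(Exception):
    """Failure of the normalization stage, with pointers to debug artifacts."""

    def __init__(self, reason, artifactPaths=()):
        self.reason = reason  # short text for the workflow surface
        self.artifactPaths = [str(path) for path in artifactPaths]
        detail = ", ".join(self.artifactPaths) or "no artifacts written"
        super().__init__(f"{reason} (artifacts: {detail})")


def failIfUnusable(mappingProblems, rowCount, artifactPaths):
    """Raise on an invalid mapping or an empty canonical table; otherwise return None."""
    if mappingProblems:
        raise NormalizationError("AI mapping failed schema validation", artifactPaths)
    if rowCount == 0:
        raise NormalizationError("Normalization produced zero rows", artifactPaths)
```

The workflow surface would display `reason`, while the full message with paths goes to the log.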
Testing Strategy
- Unit: mapping schema validation; deterministic value normalization; row merge.
- Integration: feed synthetic merged JSON with heterogeneous headers; assert canonical table shape and non-empty rows.
- Golden files: store sample inventories, mappings, canonical outputs under test resources.
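A unit test in this style might look like the following. `renameHeaders` is a stand-in defined inside the example so that it runs on its own; it is not the service's real function, and the test names are illustrative.

```python
import unittest


def renameHeaders(headers, mappings):
    """Stand-in for the header step: map each source header to its canonical name or None."""
    return [mappings.get(header) for header in headers]


class RenameHeadersTest(unittest.TestCase):
    def testHeterogeneousHeadersConverge(self):
        mappings = {"Datum": "Date", "Date": "Date", "Händler": "Merchant"}
        self.assertEqual(renameHeaders(["Datum", "Händler"], mappings), ["Date", "Merchant"])
        self.assertEqual(renameHeaders(["Date"], mappings), ["Date"])

    def testUnmappedHeaderBecomesNone(self):
        self.assertEqual(renameHeaders(["Notiz"], {}), [None])
```

Saved in a test module, this runs with `python -m unittest`. A golden-file variant would load the inventory and mapping from test resources and compare the result against a stored canonical output.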
Non-Goals
- Do not encode language heuristics in code.
- Do not normalize via AI over full data rows (AI only returns mapping + policies; code applies them).
Performance Considerations
- RAM-based processing acceptable for current document sizes.
- Mapping prompt uses only inventories and samples (small payload), not full data.
Security & Compliance
- Debug artifacts gated by existing debug flag.
- Mapping cache limited to current workflow run unless explicitly persisted.
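A run-scoped cache can be as small as a dictionary created at the start of the workflow run and discarded with it. In this sketch, `getCachedMapping`, `inventoryKey`, and the `requestFn` callback are hypothetical names, and keying on the sorted header set is an assumption about what makes two inventories equivalent.

```python
import hashlib
import json


def inventoryKey(inventory):
    """Stable key for an inventory: hash of its sorted header labels."""
    payload = json.dumps(sorted(inventory["tableHeaders"]), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def getCachedMapping(inventory, cache, requestFn):
    """Return the cached mapping for this inventory, requesting it only once per run."""
    key = inventoryKey(inventory)
    if key not in cache:
        cache[key] = requestFn(inventory)  # the single AI call for this header set
    return cache[key]
```

Because the dictionary lives only in memory, nothing outlives the run unless debug mode writes the mapping artifact.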
Rollout Plan
- Implement NormalizationService with feature flag (default on in dev/test).
- Wire into SubDocumentProcessing.processDocumentsPerChunkJson.
- Add debug artifact writing.
- Create unit/integration tests and golden fixtures.
- Validate CSV/HTML outputs are populated for heterogeneous sources.
Open Extensions
- Multi-table merging strategies (per VAT rate, per document) before final consolidation.
- Domain-specific canonical schemas selectable by task type.