## Normalization Service Integration (Design + Refactor)

### Goal
Introduce a deterministic Normalization stage between extraction merge and format-specific rendering to guarantee a single canonical, aligned table for downstream outputs (CSV/HTML/Excel/XML/JSON objects) without language-specific code.

### Scope
- Applies to all workflows that use per-chunk extraction and merged JSON.
- Lives after merged JSON assembly and before Generation/Rendering.
- Single-path flow (no fallbacks), testable end-to-end.
- Cache mapping for the current workflow run; prompts remain generic.

### Canonical Output Contract
- One consolidated table section in merged JSON:
  - headers: ["Date","Merchant","CreditCardNumber","TotalAmount","Currency","VATRate"]
  - rows: normalized and aligned values
- This canonical table is the only input the format engines require.

### High-Level Flow
1) Extraction per chunk → merged JSON (unchanged)
2) Normalization Service:
   - Analyze structures: discover candidate headers/paths and value samples (no AI)
   - Get mapping from AI (bounded, deterministic mapping JSON only)
   - Apply mapping deterministically in code (flatten + rename + value normalize)
   - Validate canonical table; fail if zero rows
   - Emit debug artifacts for traceability
3) Generation/Rendering consumes canonical table

### Module and Integration Points
- Add `modules/services/serviceNormalization/mainServiceNormalization.py` with public API:
  - `discoverStructures(mergedJson) -> StructureInventory`
  - `requestHeaderMapping(inventory, canonicalSpec, cache) -> MappingSpec`
  - `applyMapping(mergedJson, mappingSpec) -> CanonicalMergedJson`
  - `validateCanonical(canonicalJson) -> ValidationReport`
  - `persistArtifacts(artifacts, whenDebug=True)`
- Call site: `SubDocumentProcessing.processDocumentsPerChunkJson(...)`
  - After `_mergeChunkResultsJson(...)`
  - Before `SubDocumentGeneration`/`GenerationService` rendering

### Detailed Steps

#### 1) Discover Structures (No AI)
- Input: merged JSON with sections
  (tables, paragraphs, headings). For JSON/XML sources, also consider objects.
- Output: `StructureInventory` containing:
  - `tableHeaders`: set of distinct header labels (deduped across sections)
  - `headerSamples`: small value samples per header (e.g., up to 5 distinct examples)
  - `objectPaths`: candidate field paths for JSON/XML object sources with value samples (optional)

Notes:
- Keep RAM processing; streaming not required for this step.
- No language rules in code.

#### 2) AI Mapping (Headers/Paths → Canonical)
- Prompt inputs (generic):
  - Canonical schema (names + short definitions + constraints)
  - Discovered header inventory (and object paths, if any) + minimal value samples
- Response (strict JSON):
  - `mappings`: { "Datum": "Date", "Händler": "Merchant", ... }
  - Optional `objectPathMappings` for JSON/XML flattening
  - `normalizationPolicy` with specific rules per field (e.g., decimalSeparator, dateFormatCandidates, currencyPlacement)

Constraints:
- One-to-one or null mappings only (no invented fields)
- No free text; reject if schema invalid
- Cache mapping for the current workflow run (in-memory + `test-chat/ai/mapping.json` when debug)

#### 3) Apply Mapping (Deterministic Code)
- Flatten JSON/XML objects per `objectPathMappings` (when provided) into rows.
- Transform table sections to the canonical headers per `mappings`.
- Normalize values per `normalizationPolicy`:
  - Decimals: comma → dot
  - Currency: split symbol/code into `Currency`, keep numeric in `TotalAmount`
  - Date: parse with candidate formats → ISO or target format
  - CreditCardNumber: extract last 4 if masked
- Merge all rows into a single table with canonical headers.

#### 4) Validate
- Field-level validators (parseable date, numeric amounts, currency in allowed set).
- If zero rows after normalization: fail with a clear error and persist artifacts.
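
Steps 3 and 4 might be sketched roughly as follows. This is a minimal illustration, not the service's actual implementation: the snake_case helper names, the `allowedCurrencies` policy key, the source headers `Karte`, `Betrag`, `MwSt`, and `Notiz`, and the sample rows are all hypothetical. The sketch also assumes that all cell values are strings, that currencies appear as three-letter codes (symbols are not handled), and that a dot is a thousands separator whenever the decimal separator is a comma.

```python
import re
from datetime import datetime

CANONICAL_HEADERS = ["Date", "Merchant", "CreditCardNumber",
                     "TotalAmount", "Currency", "VATRate"]

# Hypothetical mapping spec, shaped like the strict-JSON response of step 2.
MAPPING_SPEC = {
    "mappings": {"Datum": "Date", "Händler": "Merchant", "Karte": "CreditCardNumber",
                 "Betrag": "TotalAmount", "MwSt": "VATRate", "Notiz": None},
    "normalizationPolicy": {
        "decimalSeparator": ",",
        "dateFormatCandidates": ["%d.%m.%Y", "%Y-%m-%d"],
        "allowedCurrencies": ["EUR", "CHF", "USD"],  # assumed policy key
    },
}

def normalize_amount(raw, policy):
    """Split e.g. '1.234,50 EUR' into ('1234.50', 'EUR'); raises ValueError if not numeric."""
    match = re.search(r"[A-Z]{3}", raw)
    currency = match.group(0) if match else ""
    number = re.sub(r"[^0-9,.\-]", "", raw)
    if policy.get("decimalSeparator") == ",":
        number = number.replace(".", "").replace(",", ".")  # assumes '.' groups thousands
    else:
        number = number.replace(",", "")
    return f"{float(number):.2f}", currency

def normalize_date(raw, policy):
    """Try each candidate format in order and return an ISO date string."""
    for fmt in policy["dateFormatCandidates"]:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {raw!r}")

def normalize_card(raw):
    """Keep only the last 4 digits when the number is masked with '*' or 'x'."""
    digits = re.sub(r"\D", "", raw)
    return digits[-4:] if re.search(r"[*xX]", raw) else digits

def apply_mapping(tables, spec):
    """Rename headers, normalize values, and merge all rows into one canonical table."""
    mappings, policy = spec["mappings"], spec["normalizationPolicy"]
    rows = []
    for table in tables:
        targets = [mappings.get(header) for header in table["headers"]]  # None = dropped
        for source_row in table["rows"]:
            record = dict.fromkeys(CANONICAL_HEADERS, "")
            for target, value in zip(targets, source_row):
                if target is None or not value:
                    continue
                if target == "Date":
                    record["Date"] = normalize_date(value, policy)
                elif target == "TotalAmount":
                    amount, currency = normalize_amount(value, policy)
                    record["TotalAmount"] = amount
                    record["Currency"] = record["Currency"] or currency
                elif target == "CreditCardNumber":
                    record["CreditCardNumber"] = normalize_card(value)
                else:
                    record[target] = value.strip()
            rows.append([record[header] for header in CANONICAL_HEADERS])
    return {"headers": list(CANONICAL_HEADERS), "rows": rows}

def validate_canonical(table, policy):
    """Fail on zero rows; otherwise return (rowIndex, field, value) problems."""
    if not table["rows"]:
        raise ValueError("normalization produced zero rows; see debug artifacts")
    allowed = set(policy["allowedCurrencies"])
    problems = []
    for index, row in enumerate(table["rows"]):
        record = dict(zip(table["headers"], row))
        if record["Currency"] and record["Currency"] not in allowed:
            problems.append((index, "Currency", record["Currency"]))
    return problems

# Two heterogeneous table sections, as they might appear in merged JSON.
SAMPLE_TABLES = [
    {"headers": ["Datum", "Händler", "Karte", "Betrag", "Notiz"],
     "rows": [["03.02.2024", "Bäckerei Müller", "**** **** **** 1234", "1.234,50 EUR", "n/a"]]},
    {"headers": ["Datum", "Betrag"], "rows": [["2024-02-04", "12,00 CHF"]]},
]

canonical = apply_mapping(SAMPLE_TABLES, MAPPING_SPEC)
for row in canonical["rows"]:
    print(row)
print(validate_canonical(canonical, MAPPING_SPEC["normalizationPolicy"]))
```

In this sketch, date and amount problems surface as exceptions during normalization, while the currency check is collected into a report. A real `ValidationReport` would likely gather all three kinds of finding in one place.
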
### Artifacts (Debug Mode)
- `*_header_inventory.json` / `*_path_inventory.json`
- `*_mapping.json` (AI output)
- `*_normalization_report.json` (counts, per-field conversions)
- `*_canonical_merged.json` (final single table)

### Error Handling
- Single-path only (no fallback). If mapping invalid or yields zero rows:
  - Raise an error with pointers to artifacts for diagnosis
  - The workflow surface shows a concise failure reason

### Testing Strategy
- Unit: mapping schema validation; deterministic value normalization; row merge.
- Integration: feed synthetic merged JSON with heterogeneous headers; assert canonical table shape and non-empty rows.
- Golden files: store sample inventories, mappings, canonical outputs under test resources.

### Non-Goals
- Do not encode language heuristics in code.
- Do not normalize via AI over full data rows (AI only returns mapping + policies; code applies them).

### Performance Considerations
- RAM-based processing acceptable for current document sizes.
- Mapping prompt uses only inventories and samples (small payload), not full data.

### Security & Compliance
- Debug artifacts gated by existing debug flag.
- Mapping cache limited to current workflow run unless explicitly persisted.

### Rollout Plan
1) Implement NormalizationService with feature flag (default on in dev/test).
2) Wire into `SubDocumentProcessing.processDocumentsPerChunkJson`.
3) Add debug artifact writing.
4) Create unit/integration tests and golden fixtures.
5) Validate CSV/HTML outputs are populated for heterogeneous sources.

### Open Extensions
- Multi-table merging strategies (per VAT rate, per document) before final consolidation.
- Domain-specific canonical schemas selectable by task type.
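
One possible shape for the second extension is a small registry of canonical specs keyed by task type. The sketch below is only an assumption about how this could look: `CanonicalField`, `CanonicalSpec`, `select_canonical_spec`, the `"expense_report"` task type, and the field definitions are hypothetical placeholders. It keeps the single-path rule by raising on an unknown task type instead of falling back to a default schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CanonicalField:
    name: str
    definition: str  # short definition passed to the mapping prompt

@dataclass(frozen=True)
class CanonicalSpec:
    task_type: str
    fields: tuple

    @property
    def headers(self):
        """Canonical header names in output order."""
        return [field.name for field in self.fields]

# Hypothetical spec mirroring the canonical headers of this document;
# the definitions are placeholders, not agreed wording.
EXPENSE_SPEC = CanonicalSpec(
    task_type="expense_report",
    fields=(
        CanonicalField("Date", "transaction date"),
        CanonicalField("Merchant", "merchant or vendor name"),
        CanonicalField("CreditCardNumber", "card number, last 4 digits if masked"),
        CanonicalField("TotalAmount", "numeric total without currency"),
        CanonicalField("Currency", "currency code"),
        CanonicalField("VATRate", "VAT rate applied"),
    ),
)

SPECS = {EXPENSE_SPEC.task_type: EXPENSE_SPEC}

def select_canonical_spec(task_type):
    """Return the spec registered for a task type; no fallback for unknown types."""
    try:
        return SPECS[task_type]
    except KeyError:
        raise ValueError(f"no canonical spec registered for task type {task_type!r}") from None

print(select_canonical_spec("expense_report").headers)
```
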