From 23e752fa8de42a763aaa60b08c1c140024d71dc7 Mon Sep 17 00:00:00 2001
From: ValueOn AG
Date: Tue, 14 Oct 2025 17:54:00 +0200
Subject: [PATCH] doc added

---
 README.md                              | Bin 13 -> 0 bytes
 .../implement_normalization_service.md | 119 ++++++++++++++++++
 2 files changed, 119 insertions(+)
 delete mode 100644 README.md
 create mode 100644 poweron/implementation/implement_normalization_service.md

diff --git a/README.md b/README.md
deleted file mode 100644
index ac181358fe6238104d5d4550779fbf5df66b2d61..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

literal 13
ScmY#pP+%x$$YjU{VlDs;o&o{@

diff --git a/poweron/implementation/implement_normalization_service.md b/poweron/implementation/implement_normalization_service.md
new file mode 100644
index 0000000..8ea91ad
--- /dev/null
+++ b/poweron/implementation/implement_normalization_service.md
@@ -0,0 +1,119 @@
+## Normalization Service Integration (Design + Refactor)
+
+### Goal
+Introduce a deterministic Normalization stage between extraction merge and format-specific rendering to guarantee a single canonical, aligned table for downstream outputs (CSV/HTML/Excel/XML/JSON objects) without language-specific code.
+
+### Scope
+- Applies to all workflows that use per-chunk extraction and merged JSON.
+- Lives after merged JSON assembly and before Generation/Rendering.
+- Single-path flow (no fallbacks), testable end-to-end.
+- Cache mapping for the current workflow run; prompts remain generic.
+
+### Canonical Output Contract
+- One consolidated table section in merged JSON:
+  - headers: ["Date","Merchant","CreditCardNumber","TotalAmount","Currency","VATRate"]
+  - rows: normalized and aligned values
+- This canonical table is the only input the format engines require.
+
+### High-Level Flow
+1) Extraction per chunk → merged JSON (unchanged)
+2) Normalization Service:
+   - Analyze structures: discover candidate headers/paths and value samples (no AI)
+   - Get mapping from AI (bounded, deterministic mapping JSON only)
+   - Apply mapping deterministically in code (flatten + rename + value normalize)
+   - Validate canonical table; fail if zero rows
+   - Emit debug artifacts for traceability
+3) Generation/Rendering consumes canonical table
+
+### Module and Integration Points
+- Add `modules/services/serviceNormalization/mainServiceNormalization.py` with public API:
+  - `discoverStructures(mergedJson) -> StructureInventory`
+  - `requestHeaderMapping(inventory, canonicalSpec, cache) -> MappingSpec`
+  - `applyMapping(mergedJson, mappingSpec) -> CanonicalMergedJson`
+  - `validateCanonical(canonicalJson) -> ValidationReport`
+  - `persistArtifacts(artifacts, whenDebug=True)`
+
+- Call site: `SubDocumentProcessing.processDocumentsPerChunkJson(...)`
+  - After `_mergeChunkResultsJson(...)`
+  - Before `SubDocumentGeneration`/`GenerationService` rendering
+
+### Detailed Steps
+#### 1) Discover Structures (No AI)
+- Input: merged JSON with sections (tables, paragraphs, headings). For JSON/XML sources, also consider objects.
+- Output: `StructureInventory` containing:
+  - `tableHeaders`: set of distinct header labels (deduped across sections)
+  - `headerSamples`: small value samples per header (e.g., up to 5 distinct examples)
+  - `objectPaths`: candidate field paths for JSON/XML object sources with value samples (optional)
+
+Notes:
+- Keep RAM processing; streaming not required for this step.
+- No language rules in code.
+
+#### 2) AI Mapping (Headers/Paths → Canonical)
+- Prompt inputs (generic):
+  - Canonical schema (names + short definitions + constraints)
+  - Discovered header inventory (and object paths, if any) + minimal value samples
+- Response (strict JSON):
+  - `mappings`: { "Datum": "Date", "Händler": "Merchant", ... }
+  - Optional `objectPathMappings` for JSON/XML flattening
+  - `normalizationPolicy` with specific rules per field (e.g., decimalSeparator, dateFormatCandidates, currencyPlacement)
+
+Constraints:
+- One-to-one or null mappings only (no invented fields)
+- No free text; reject if schema invalid
+- Cache mapping for the current workflow run (in-memory + test-chat/ai/mapping.json when debug)
+
+#### 3) Apply Mapping (Deterministic Code)
+- Flatten JSON/XML objects per `objectPathMappings` (when provided) into rows.
+- Transform table sections to the canonical headers per `mappings`.
+- Normalize values per `normalizationPolicy`:
+  - Decimals: comma → dot
+  - Currency: split symbol/code into `Currency`, keep numeric in `TotalAmount`
+  - Date: parse with candidate formats → ISO or target format
+  - CreditCardNumber: extract last 4 if masked
+- Merge all rows into a single table with canonical headers.
+
+#### 4) Validate
+- Field-level validators (parseable date, numeric amounts, currency in allowed set).
+- If zero rows after normalization: fail with a clear error and persist artifacts.
+
+### Artifacts (Debug Mode)
+- `*_header_inventory.json` / `*_path_inventory.json`
+- `*_mapping.json` (AI output)
+- `*_normalization_report.json` (counts, per-field conversions)
+- `*_canonical_merged.json` (final single table)
+
+### Error Handling
+- Single-path only (no fallback). If mapping invalid or yields zero rows:
+  - Raise an error with pointers to artifacts for diagnosis
+  - The workflow surface shows a concise failure reason
+
+### Testing Strategy
+- Unit: mapping schema validation; deterministic value normalization; row merge.
+- Integration: feed synthetic merged JSON with heterogeneous headers; assert canonical table shape and non-empty rows.
+- Golden files: store sample inventories, mappings, canonical outputs under test resources.
+
+### Non-Goals
+- Do not encode language heuristics in code.
+- Do not normalize via AI over full data rows (AI only returns mapping + policies; code applies them).
+
+### Performance Considerations
+- RAM-based processing acceptable for current document sizes.
+- Mapping prompt uses only inventories and samples (small payload), not full data.
+
+### Security & Compliance
+- Debug artifacts gated by existing debug flag.
+- Mapping cache limited to current workflow run unless explicitly persisted.
+
+### Rollout Plan
+1) Implement NormalizationService with feature flag (default on in dev/test).
+2) Wire into `SubDocumentProcessing.processDocumentsPerChunkJson`.
+3) Add debug artifact writing.
+4) Create unit/integration tests and golden fixtures.
+5) Validate CSV/HTML outputs are populated for heterogeneous sources.
+
+### Open Extensions
+- Multi-table merging strategies (per VAT rate, per document) before final consolidation.
+- Domain-specific canonical schemas selectable by task type.
+
+
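Note on step 3 (Apply Mapping) and the zero-row check in step 4: the following is a minimal sketch of how the deterministic part might look, not the implementation this patch describes. It assumes each merged-JSON table section is a dict with `headers` and `rows` (mirroring the canonical contract), that `normalizationPolicy` carries `decimalSeparator` and `dateFormatCandidates`, and that `*`, `x` or `X` marks a masked card number. The snake_case function names are hypothetical, and `objectPathMappings` flattening is left out.

```python
import re
from datetime import datetime

CANONICAL_HEADERS = ["Date", "Merchant", "CreditCardNumber",
                     "TotalAmount", "Currency", "VATRate"]


def normalize_amount(raw, decimal_separator):
    """Return (amount with dot decimal, currency symbol/code or None)."""
    text = raw.strip()
    match = re.search(r"[A-Z]{3}|[€$£]", text)
    currency = match.group(0) if match else None
    number = re.sub(r"[^0-9,.\-]", "", text)
    if decimal_separator == ",":
        # Dots are thousands separators here; the comma becomes the decimal dot.
        number = number.replace(".", "").replace(",", ".")
    else:
        number = number.replace(",", "")
    return number, currency


def normalize_date(raw, candidate_formats):
    """Try each candidate format in order; return an ISO date or None."""
    for fmt in candidate_formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_card(raw):
    """Keep only the last 4 digits when the value looks masked (assumed markers)."""
    digits = re.sub(r"\D", "", raw)
    return digits[-4:] if re.search(r"[*xX]", raw) else digits


def apply_mapping(sections, mapping_spec):
    """Rename headers per `mappings`, normalize values, merge into one table."""
    mappings = mapping_spec["mappings"]
    policy = mapping_spec["normalizationPolicy"]
    rows = []
    for section in sections:
        # Column index -> canonical header; null/unmapped headers are dropped.
        targets = [mappings.get(header) for header in section["headers"]]
        for source_row in section["rows"]:
            record = dict.fromkeys(CANONICAL_HEADERS, "")
            for target, value in zip(targets, source_row):
                if target is None:
                    continue
                if target == "TotalAmount":
                    amount, currency = normalize_amount(
                        value, policy["decimalSeparator"])
                    record["TotalAmount"] = amount
                    if currency and not record["Currency"]:
                        record["Currency"] = currency
                elif target == "Date":
                    record["Date"] = normalize_date(
                        value, policy["dateFormatCandidates"]) or ""
                elif target == "CreditCardNumber":
                    record["CreditCardNumber"] = normalize_card(value)
                else:
                    record[target] = value.strip()
            rows.append([record[h] for h in CANONICAL_HEADERS])
    if not rows:
        # Single-path flow: no fallback, fail loudly.
        raise ValueError("Normalization produced zero rows")
    return {"headers": CANONICAL_HEADERS, "rows": rows}
```

With a mapping such as `{"Datum": "Date", "Händler": "Merchant", "Betrag": "TotalAmount"}` and a policy of `{"decimalSeparator": ",", "dateFormatCandidates": ["%d.%m.%Y"]}`, a source row `["14.10.2025", "Migros", "12,50 CHF"]` would come out aligned to the six canonical headers, with the currency code split off the amount. Keeping the per-field rules in small pure functions matches the unit-test split named under Testing Strategy.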