From 23e752fa8de42a763aaa60b08c1c140024d71dc7 Mon Sep 17 00:00:00 2001
From: ValueOn AG
Date: Tue, 14 Oct 2025 17:54:00 +0200
Subject: [PATCH] doc added
---
README.md | Bin 13 -> 0 bytes
.../implement_normalization_service.md | 119 ++++++++++++++++++
2 files changed, 119 insertions(+)
delete mode 100644 README.md
create mode 100644 poweron/implementation/implement_normalization_service.md
diff --git a/README.md b/README.md
deleted file mode 100644
index ac181358fe6238104d5d4550779fbf5df66b2d61..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001
literal 13
ScmY#pP+%x$$YjU{VlDs;o&o{@
diff --git a/poweron/implementation/implement_normalization_service.md b/poweron/implementation/implement_normalization_service.md
new file mode 100644
index 0000000..8ea91ad
--- /dev/null
+++ b/poweron/implementation/implement_normalization_service.md
@@ -0,0 +1,119 @@
+## Normalization Service Integration (Design + Refactor)
+
+### Goal
+Introduce a deterministic Normalization stage between extraction merge and format-specific rendering to guarantee a single canonical, aligned table for downstream outputs (CSV/HTML/Excel/XML/JSON objects) without language-specific code.
+
+### Scope
+- Applies to all workflows that use per-chunk extraction and merged JSON.
+- Lives after merged JSON assembly and before Generation/Rendering.
+- Single-path flow (no fallbacks), testable end-to-end.
+- Cache the mapping for the duration of the current workflow run; prompts remain generic.
+
+### Canonical Output Contract
+- One consolidated table section in merged JSON:
+ - headers: ["Date","Merchant","CreditCardNumber","TotalAmount","Currency","VATRate"]
+ - rows: normalized and aligned values
+- This canonical table is the only input the format engines require.
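+
+As an illustration of the contract above, a minimal sketch of the canonical table as an in-memory structure. The `CanonicalTable` class and its `addRow` helper are hypothetical names introduced here, not part of the existing code base; only the header list is taken from the contract.
+
+```python
+from dataclasses import dataclass, field
+
+# Header order as listed in the contract above.
+CANONICAL_HEADERS = ["Date", "Merchant", "CreditCardNumber", "TotalAmount", "Currency", "VATRate"]
+
+
+@dataclass
+class CanonicalTable:
+    """Hypothetical container for the single consolidated table."""
+    headers: list = field(default_factory=lambda: list(CANONICAL_HEADERS))
+    rows: list = field(default_factory=list)
+
+    def addRow(self, values: dict) -> None:
+        # Align values to the canonical header order; missing fields become None.
+        self.rows.append([values.get(header) for header in self.headers])
+
+
+table = CanonicalTable()
+table.addRow({"Date": "2025-10-14", "Merchant": "Example AG", "TotalAmount": "12.50", "Currency": "CHF"})
+```
+
+Every row has exactly one cell per canonical header, so a format engine can iterate `headers` and `rows` without knowing anything about the source documents.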
+
+### High-Level Flow
+1) Extraction per chunk → merged JSON (unchanged)
+2) Normalization Service:
+ - Analyze structures: discover candidate headers/paths and value samples (no AI)
+ - Request a mapping from the AI (response restricted to a bounded, schema-checked mapping JSON)
+ - Apply mapping deterministically in code (flatten + rename + value normalize)
+ - Validate canonical table; fail if zero rows
+ - Emit debug artifacts for traceability
+3) Generation/Rendering consumes canonical table
+
+### Module and Integration Points
+- Add `modules/services/serviceNormalization/mainServiceNormalization.py` with public API:
+ - `discoverStructures(mergedJson) -> StructureInventory`
+ - `requestHeaderMapping(inventory, canonicalSpec, cache) -> MappingSpec`
+ - `applyMapping(mergedJson, mappingSpec) -> CanonicalMergedJson`
+ - `validateCanonical(canonicalJson) -> ValidationReport`
+ - `persistArtifacts(artifacts, whenDebug=True)`
+
+- Call site: `SubDocumentProcessing.processDocumentsPerChunkJson(...)`
+ - After `_mergeChunkResultsJson(...)`
+ - Before `SubDocumentGeneration`/`GenerationService` rendering
+
+### Detailed Steps
+#### 1) Discover Structures (No AI)
+- Input: merged JSON with sections (tables, paragraphs, headings). For JSON/XML sources, also consider objects.
+- Output: `StructureInventory` containing:
+ - `tableHeaders`: set of distinct header labels (deduped across sections)
+ - `headerSamples`: small value samples per header (e.g., up to 5 distinct examples)
+ - `objectPaths`: candidate field paths for JSON/XML object sources with value samples (optional)
+
+Notes:
+- Process in memory; streaming is not required for this step.
+- Do not hard-code language-specific rules.
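+
+The discovery step could be sketched as follows. The merged JSON shape used here (`sections`, each with `type`, `headers`, `rows`) is an assumption for illustration, and the sketch returns a plain dict rather than a `StructureInventory` type. Headers are kept in first-seen order instead of an unordered set so that the output is reproducible; object paths are omitted.
+
+```python
+def discoverStructures(mergedJson: dict, maxSamples: int = 5) -> dict:
+    """Collect distinct table headers and a few distinct sample values per header."""
+    tableHeaders = []    # distinct labels in first-seen order
+    headerSamples = {}   # header label -> list of distinct sample values
+    for section in mergedJson.get("sections", []):
+        if section.get("type") != "table":
+            continue  # paragraphs and headings carry no headers
+        headers = section.get("headers", [])
+        for header in headers:
+            if header not in headerSamples:
+                tableHeaders.append(header)
+                headerSamples[header] = []
+        for row in section.get("rows", []):
+            for header, value in zip(headers, row):
+                samples = headerSamples[header]
+                isEmpty = value is None or value == ""
+                if not isEmpty and value not in samples and len(samples) < maxSamples:
+                    samples.append(value)
+    return {"tableHeaders": tableHeaders, "headerSamples": headerSamples}
+```
+
+The function only looks at labels and raw values; it makes no attempt to interpret what a header means, which keeps language knowledge out of the code.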
+
+#### 2) AI Mapping (Headers/Paths → Canonical)
+- Prompt inputs (generic):
+ - Canonical schema (names + short definitions + constraints)
+ - Discovered header inventory (and object paths, if any) + minimal value samples
+- Response (strict JSON):
+ - `mappings`: { "Datum": "Date", "Händler": "Merchant", ... }
+ - Optional `objectPathMappings` for JSON/XML flattening
+ - `normalizationPolicy` with specific rules per field (e.g., decimalSeparator, dateFormatCandidates, currencyPlacement)
+
+Constraints:
+- One-to-one or null mappings only (no invented fields)
+- No free text; reject the response if it fails schema validation
+- Cache the mapping for the current workflow run (in memory, plus `test-chat/ai/mapping.json` in debug mode)
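+
+A minimal sketch of how the constraints above might be enforced on the raw AI response. `parseMappingResponse` is a hypothetical helper name. The sketch reads "one-to-one or null" as "each source header maps to exactly one canonical field or to null", so several source headers (for example `Datum` and `Date` from different documents) may share a target; a stricter reading would need an extra uniqueness check.
+
+```python
+import json
+
+CANONICAL_HEADERS = {"Date", "Merchant", "CreditCardNumber", "TotalAmount", "Currency", "VATRate"}
+
+
+def parseMappingResponse(rawResponse: str, discoveredHeaders: set) -> dict:
+    """Parse the AI response and reject anything outside the strict mapping schema."""
+    try:
+        payload = json.loads(rawResponse)
+    except json.JSONDecodeError as exc:
+        raise ValueError(f"mapping response is not valid JSON: {exc}") from exc
+    mappings = payload.get("mappings") if isinstance(payload, dict) else None
+    if not isinstance(mappings, dict):
+        raise ValueError("mapping response must contain a 'mappings' object")
+    for source, target in mappings.items():
+        if source not in discoveredHeaders:
+            raise ValueError(f"mapping refers to an undiscovered header: {source!r}")
+        if target is not None and target not in CANONICAL_HEADERS:
+            raise ValueError(f"mapping invents a non-canonical field: {target!r}")
+    return mappings
+```
+
+Because any violation raises, the caller never sees a partially valid mapping, which matches the single-path, no-fallback flow.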
+
+#### 3) Apply Mapping (Deterministic Code)
+- Flatten JSON/XML objects per `objectPathMappings` (when provided) into rows.
+- Transform table sections to the canonical headers per `mappings`.
+- Normalize values per `normalizationPolicy`:
+ - Decimals: comma → dot
+ - Currency: split symbol/code into `Currency`, keep numeric in `TotalAmount`
+ - Date: parse with candidate formats → ISO or target format
+ - CreditCardNumber: extract the last 4 digits if the number is masked
+- Merge all rows into a single table with canonical headers.
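+
+The value normalization above could look roughly like this for a single row. All helper names, the policy keys as used here, the set of mask characters, and the symbol-to-code table are assumptions made for illustration; in particular, mapping `$` to `USD` is a simplification.
+
+```python
+import re
+from datetime import datetime
+
+_CURRENCY_TOKEN = re.compile(r"[A-Z]{3}|[€$£]")
+_SYMBOL_TO_CODE = {"€": "EUR", "$": "USD", "£": "GBP"}  # assumption, not from the design
+
+
+def normalizeDecimal(text: str, decimalSeparator: str = ",") -> str:
+    # Drop the thousands separator, then turn the decimal separator into a dot.
+    thousands = "." if decimalSeparator == "," else ","
+    return text.strip().replace(thousands, "").replace(decimalSeparator, ".")
+
+
+def splitCurrency(text: str):
+    # Return (amount text, currency code or None).
+    match = _CURRENCY_TOKEN.search(text)
+    if match is None:
+        return text.strip(), None
+    token = match.group(0)
+    amount = (text[:match.start()] + text[match.end():]).strip()
+    return amount, _SYMBOL_TO_CODE.get(token, token)
+
+
+def normalizeDate(text: str, candidateFormats: list):
+    # Try each candidate format in order; the first successful parse wins.
+    for fmt in candidateFormats:
+        try:
+            return datetime.strptime(text.strip(), fmt).date().isoformat()
+        except ValueError:
+            continue
+    return None
+
+
+def normalizeCardNumber(text: str):
+    # Only masked numbers are reduced to their last four digits.
+    if re.search(r"[*xX•#]", text):
+        digits = re.sub(r"\D", "", text)
+        return digits[-4:] if len(digits) >= 4 else None
+    return text.strip()
+
+
+def normalizeRow(sourceRow: dict, mappings: dict, policy: dict) -> dict:
+    # Rename headers per the mapping; unmapped or null-mapped fields are dropped.
+    canonical = {}
+    for sourceHeader, value in sourceRow.items():
+        target = mappings.get(sourceHeader)
+        if target is not None and value is not None:
+            canonical[target] = value
+    if "TotalAmount" in canonical:
+        amount, currency = splitCurrency(canonical["TotalAmount"])
+        canonical["TotalAmount"] = normalizeDecimal(amount, policy.get("decimalSeparator", ","))
+        if currency is not None and "Currency" not in canonical:
+            canonical["Currency"] = currency
+    if "Date" in canonical:
+        canonical["Date"] = normalizeDate(canonical["Date"], policy.get("dateFormatCandidates", []))
+    if "CreditCardNumber" in canonical:
+        canonical["CreditCardNumber"] = normalizeCardNumber(canonical["CreditCardNumber"])
+    return canonical
+```
+
+All language- and locale-dependent knowledge arrives through `mappings` and `policy`; the code itself only applies them.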
+
+#### 4) Validate
+- Field-level validators (parseable date, numeric amounts, currency in allowed set).
+- If zero rows after normalization: fail with a clear error and persist artifacts.
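+
+A possible shape for the validation step, assuming dates have already been normalized to ISO format and amounts use a dot as decimal separator. The allowed currency set and the report fields are placeholders chosen for this sketch, and artifact persistence on failure is left out.
+
+```python
+from datetime import date
+from decimal import Decimal, InvalidOperation
+
+ALLOWED_CURRENCIES = {"CHF", "EUR", "USD"}  # placeholder set, not from the design
+
+
+def validateCanonical(canonicalJson: dict) -> dict:
+    """Fail hard on an empty table; otherwise collect field-level errors."""
+    headers = canonicalJson.get("headers", [])
+    rows = canonicalJson.get("rows", [])
+    if not rows:
+        raise ValueError("canonical table has zero rows after normalization")
+    errors = []
+    for index, row in enumerate(rows):
+        record = dict(zip(headers, row))
+        if record.get("Date") is not None:
+            try:
+                date.fromisoformat(record["Date"])
+            except ValueError:
+                errors.append(f"row {index}: unparseable Date {record['Date']!r}")
+        if record.get("TotalAmount") is not None:
+            try:
+                Decimal(record["TotalAmount"])
+            except InvalidOperation:
+                errors.append(f"row {index}: non-numeric TotalAmount {record['TotalAmount']!r}")
+        currency = record.get("Currency")
+        if currency is not None and currency not in ALLOWED_CURRENCIES:
+            errors.append(f"row {index}: Currency {currency!r} not in allowed set")
+    return {"rowCount": len(rows), "errors": errors, "isValid": not errors}
+```
+
+Field-level problems are reported rather than raised, so one run surfaces every offending row; only the zero-row case aborts immediately.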
+
+### Artifacts (Debug Mode)
+- `*_header_inventory.json` / `*_path_inventory.json`
+- `*_mapping.json` (AI output)
+- `*_normalization_report.json` (counts, per-field conversions)
+- `*_canonical_merged.json` (final single table)
+
+### Error Handling
+- Single-path only (no fallback). If the mapping is invalid or yields zero rows:
+ - Raise an error that points to the persisted artifacts for diagnosis
+ - Show a concise failure reason in the workflow surface
+
+### Testing Strategy
+- Unit: mapping schema validation; deterministic value normalization; row merge.
+- Integration: feed synthetic merged JSON with heterogeneous headers; assert canonical table shape and non-empty rows.
+- Golden files: store sample inventories, mappings, canonical outputs under test resources.
+
+### Non-Goals
+- Do not encode language heuristics in code.
+- Do not normalize via AI over full data rows (AI only returns mapping + policies; code applies them).
+
+### Performance Considerations
+- In-memory processing is acceptable for current document sizes.
+- Mapping prompt uses only inventories and samples (small payload), not full data.
+
+### Security & Compliance
+- Debug artifacts gated by existing debug flag.
+- Mapping cache limited to current workflow run unless explicitly persisted.
+
+### Rollout Plan
+1) Implement NormalizationService with feature flag (default on in dev/test).
+2) Wire into `SubDocumentProcessing.processDocumentsPerChunkJson`.
+3) Add debug artifact writing.
+4) Create unit/integration tests and golden fixtures.
+5) Validate CSV/HTML outputs are populated for heterogeneous sources.
+
+### Open Extensions
+- Multi-table merging strategies (per VAT rate, per document) before final consolidation.
+- Domain-specific canonical schemas selectable by task type.
+
+