doc added

2025-10-14 17:54:00 +02:00 · 23e752fa8d
commit 23e752fa8d
parent 5eb5bc0154
2 changed files with 119 additions and 0 deletions
--- a/README.md
+++ b/README.md
--- a/poweron/implementation/implement_normalization_service.md
+++ b/poweron/implementation/implement_normalization_service.md
@@ -0,0 +1,119 @@
+## Normalization Service Integration (Design + Refactor)
+
+### Goal
+Introduce a deterministic Normalization stage between the extraction merge and format-specific rendering. It guarantees a single canonical, aligned table for all downstream outputs (CSV, HTML, Excel, XML, JSON objects) and requires no code tied to the natural language of the source documents.
+
+### Scope
+- Applies to all workflows that use per-chunk extraction and merged JSON.
+- Lives after merged JSON assembly and before Generation/Rendering.
+- Single-path flow (no fallbacks), testable end-to-end.
+- The AI mapping is cached for the current workflow run; prompts remain generic.
+
+### Canonical Output Contract
+- One consolidated table section in merged JSON:
+  - headers: ["Date","Merchant","CreditCardNumber","TotalAmount","Currency","VATRate"]
+  - rows: normalized and aligned values
+- This canonical table is the only input the format engines require.
+
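A minimal sketch of the contract above as a Python structure. Only the header list comes from the document; the section keys (`type`, `headers`, `rows`) and the helper `makeCanonicalSection` are assumptions made for illustration.

```python
from typing import Any

# Canonical headers, taken from the contract above.
CANONICAL_HEADERS = ["Date", "Merchant", "CreditCardNumber",
                     "TotalAmount", "Currency", "VATRate"]


def makeCanonicalSection(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Build one aligned table section from per-row dicts (hypothetical helper).

    Missing fields become None so every row has exactly one cell per header.
    The "type"/"headers"/"rows" keys are an assumed section layout.
    """
    rows = [[record.get(header) for header in CANONICAL_HEADERS]
            for record in records]
    return {"type": "table", "headers": list(CANONICAL_HEADERS), "rows": rows}


section = makeCanonicalSection([
    {"Date": "2025-10-14", "Merchant": "ACME",
     "TotalAmount": "12.50", "Currency": "EUR"},
])
```

Padding missing cells with `None` keeps rows aligned, which is what lets the format engines treat this table as their only input.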
+### High-Level Flow
+1) Extraction per chunk → merged JSON (unchanged)
+2) Normalization Service:
+   - Analyze structures: discover candidate headers/paths and value samples (no AI)
+   - Get mapping from AI (bounded, deterministic mapping JSON only)
+   - Apply mapping deterministically in code (flatten + rename + value normalize)
+   - Validate canonical table; fail if zero rows
+   - Emit debug artifacts for traceability
+3) Generation/Rendering consumes canonical table
+
+### Module and Integration Points
+- Add `modules/services/serviceNormalization/mainServiceNormalization.py` with public API:
+  - `discoverStructures(mergedJson) -> StructureInventory`
+  - `requestHeaderMapping(inventory, canonicalSpec, cache) -> MappingSpec`
+  - `applyMapping(mergedJson, mappingSpec) -> CanonicalMergedJson`
+  - `validateCanonical(canonicalJson) -> ValidationReport`
+  - `persistArtifacts(artifacts, whenDebug=True)`
+
+- Call site: `SubDocumentProcessing.processDocumentsPerChunkJson(...)`
+  - After `_mergeChunkResultsJson(...)`
+  - Before `SubDocumentGeneration`/`GenerationService` rendering
+
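One way the call site might chain the public API, sketched with the stages passed in as callables because their real implementations are not shown here. The wrapper name `normalizeMergedJson`, the `Json` alias, and the single-argument form of `requestHeaderMapping` are assumptions.

```python
from typing import Any, Callable

Json = dict[str, Any]  # assumed shape for merged JSON, inventory, and mapping


def normalizeMergedJson(mergedJson: Json,
                        discoverStructures: Callable[[Json], Json],
                        requestHeaderMapping: Callable[[Json], Json],
                        applyMapping: Callable[[Json, Json], Json],
                        validateCanonical: Callable[[Json], Json],
                        ) -> tuple[Json, Json]:
    """Run the stages in the documented order; single path, no fallback.

    Any exception raised by a stage propagates to the caller unchanged.
    """
    inventory = discoverStructures(mergedJson)
    mappingSpec = requestHeaderMapping(inventory)
    canonicalJson = applyMapping(mergedJson, mappingSpec)
    report = validateCanonical(canonicalJson)
    return canonicalJson, report
```

Injecting the stages keeps the wrapper testable without an AI backend: a test can pass a stub for `requestHeaderMapping` that returns a fixed mapping.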
+### Detailed Steps
+#### 1) Discover Structures (No AI)
+- Input: merged JSON with sections (tables, paragraphs, headings). For JSON/XML sources, also consider objects.
+- Output: `StructureInventory` containing:
+  - `tableHeaders`: set of distinct header labels (deduped across sections)
+  - `headerSamples`: small value samples per header (e.g., up to 5 distinct examples)
+  - `objectPaths`: candidate field paths for JSON/XML object sources with value samples (optional)
+
+Notes:
+- Process in memory (RAM); streaming is not required for this step.
+- No language rules in code.
+
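A rough sketch of the discovery step for table sections only. It assumes `mergedJson["sections"]` is a list of dicts and that table sections carry `type`, `headers`, and `rows`; that layout is not confirmed by this document, and `objectPaths` is left out.

```python
from typing import Any

MAX_SAMPLES = 5  # "up to 5 distinct examples" per header


def discoverStructures(mergedJson: dict[str, Any]) -> dict[str, Any]:
    """Collect distinct table headers and a few sample values per header.

    Purely structural: no AI call and no language-specific rules.
    """
    tableHeaders: list[str] = []            # ordered and deduplicated
    headerSamples: dict[str, list[Any]] = {}
    for section in mergedJson.get("sections", []):
        if section.get("type") != "table":
            continue  # paragraphs and headings carry no headers
        headers = section.get("headers", [])
        for header in headers:
            if header not in headerSamples:
                headerSamples[header] = []
                tableHeaders.append(header)
        for row in section.get("rows", []):
            for header, value in zip(headers, row):
                samples = headerSamples[header]
                isEmpty = value is None or value == ""
                if not isEmpty and value not in samples and len(samples) < MAX_SAMPLES:
                    samples.append(value)
    return {"tableHeaders": tableHeaders, "headerSamples": headerSamples}
```

The inventory stays small regardless of row count, since each header contributes at most `MAX_SAMPLES` values.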
+#### 2) AI Mapping (Headers/Paths → Canonical)
+- Prompt inputs (generic):
+  - Canonical schema (names + short definitions + constraints)
+  - Discovered header inventory (and object paths, if any) + minimal value samples
+- Response (strict JSON):
+  - `mappings`: { "Datum": "Date", "Händler": "Merchant", ... }
+  - Optional `objectPathMappings` for JSON/XML flattening
+  - `normalizationPolicy` with specific rules per field (e.g., decimalSeparator, dateFormatCandidates, currencyPlacement)
+
+Constraints:
+- One-to-one or null mappings only (no invented fields)
+- No free text; reject if schema invalid
+- Cache the mapping for the current workflow run (in memory, plus `test-chat/ai/mapping.json` when debug is enabled)
+
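The constraints above could be enforced roughly as follows. This is a sketch, not the service's actual code: `parseMappingResponse`, `getMapping`, and the `askAi` callable are hypothetical names, and "one-to-one or null" is read here as "each source header maps to exactly one canonical field or to null", so several source headers may share a target.

```python
import json
from typing import Callable, Optional

CANONICAL_HEADERS = ["Date", "Merchant", "CreditCardNumber",
                     "TotalAmount", "Currency", "VATRate"]

Mapping = dict[str, Optional[str]]


def parseMappingResponse(rawResponse: str, knownHeaders: set[str]) -> Mapping:
    """Parse the AI reply and reject anything outside the strict schema."""
    # Free text fails here: json.JSONDecodeError is a subclass of ValueError.
    payload = json.loads(rawResponse)
    mappings = payload.get("mappings") if isinstance(payload, dict) else None
    if not isinstance(mappings, dict):
        raise ValueError("response must be a JSON object with a 'mappings' object")
    for source, target in mappings.items():
        if source not in knownHeaders:
            raise ValueError(f"mapping for undiscovered header: {source!r}")
        if target is not None and target not in CANONICAL_HEADERS:
            raise ValueError(f"not a canonical field: {target!r}")
    return mappings


def getMapping(headers: list[str], cache: dict[tuple[str, ...], Mapping],
               askAi: Callable[[list[str]], str]) -> Mapping:
    """Return the cached mapping for this header set, asking the AI once."""
    key = tuple(sorted(headers))  # same headers in any order hit the same entry
    if key not in cache:
        cache[key] = parseMappingResponse(askAi(headers), set(headers))
    return cache[key]
```

Keeping the cache as a plain dict owned by the workflow run means it disappears with the run, which matches the run-scoped caching described above.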
+#### 3) Apply Mapping (Deterministic Code)
+- Flatten JSON/XML objects per `objectPathMappings` (when provided) into rows.
+- Transform table sections to the canonical headers per `mappings`.
+- Normalize values per `normalizationPolicy`:
+  - Decimals: comma → dot
+  - Currency: split symbol/code into `Currency`, keep numeric in `TotalAmount`
+  - Date: parse with candidate formats → ISO or target format
+  - CreditCardNumber: extract the last 4 digits if the number is masked
+- Merge all rows into a single table with canonical headers.
+
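A sketch of how one table section might be renamed and normalized in code. It assumes the policy says the decimal separator is a comma (so dots and spaces are grouping characters) and uses an illustrative symbol table; all helper names, the symbol-to-code choices, and the section layout are assumptions. Object flattening via `objectPathMappings` is not shown.

```python
import re
from datetime import datetime
from typing import Any, Optional

CANONICAL_HEADERS = ["Date", "Merchant", "CreditCardNumber",
                     "TotalAmount", "Currency", "VATRate"]
CURRENCY_SYMBOLS = {"€": "EUR", "$": "USD", "£": "GBP"}  # illustrative subset


def normalizeDecimal(text: str) -> str:
    """Comma decimal -> dot, dropping '.' and ' ' as grouping characters."""
    return text.strip().replace(" ", "").replace(".", "").replace(",", ".")


def splitAmount(text: str) -> tuple[str, Optional[str]]:
    """Separate a 3-letter code or known symbol from the numeric part."""
    currency = None
    match = re.search(r"[A-Z]{3}", text)
    if match:
        currency = match.group(0)
        text = text.replace(currency, "")
    else:
        for symbol, code in CURRENCY_SYMBOLS.items():
            if symbol in text:
                currency, text = code, text.replace(symbol, "")
                break
    return normalizeDecimal(text), currency


def parseDate(text: str, candidateFormats: list[str]) -> str:
    """Try each candidate format in order and return an ISO date string."""
    for fmt in candidateFormats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {text!r}")


def normalizeCardNumber(text: str) -> str:
    """Keep only the last 4 digits when the number is masked."""
    if re.search(r"[*xX•]", text):
        return "".join(re.findall(r"\d", text)[-4:])
    return text.strip()


def applyMapping(section: dict[str, Any], mappings: dict[str, Optional[str]],
                 dateFormats: list[str]) -> list[list[Any]]:
    """Rename columns per `mappings` and return rows aligned to the canonical headers."""
    rows = []
    for rawRow in section["rows"]:
        record: dict[str, Any] = {}
        for header, value in zip(section["headers"], rawRow):
            target = mappings.get(header)
            if target is not None:          # null-mapped columns are dropped
                record[target] = value
        if "TotalAmount" in record:
            amount, currency = splitAmount(str(record["TotalAmount"]))
            record["TotalAmount"] = amount
            if currency:
                record.setdefault("Currency", currency)  # explicit column wins
        if "Date" in record:
            record["Date"] = parseDate(str(record["Date"]), dateFormats)
        if "CreditCardNumber" in record:
            record["CreditCardNumber"] = normalizeCardNumber(
                str(record["CreditCardNumber"]))
        rows.append([record.get(header) for header in CANONICAL_HEADERS])
    return rows
```

Rows from every section can then be concatenated, since each one already has the canonical column order.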
+#### 4) Validate
+- Field-level validators (parseable date, numeric amounts, currency in allowed set).
+- If zero rows after normalization: fail with a clear error and persist artifacts.
+
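A possible shape for the validation step, assuming dates are already ISO strings and amounts are dot-decimal strings after normalization. The allowed currency set and the report keys (`rowCount`, `errors`, `ok`) are placeholders, not values from this document.

```python
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Any

ALLOWED_CURRENCIES = {"EUR", "USD", "CHF", "GBP"}  # assumed set


def validateCanonical(section: dict[str, Any]) -> dict[str, Any]:
    """Run field-level checks; raise if the table has no rows at all."""
    if not section["rows"]:
        raise ValueError("normalization produced zero rows")
    errors: list[str] = []
    for index, row in enumerate(section["rows"]):
        record = dict(zip(section["headers"], row))
        try:
            date.fromisoformat(record.get("Date"))
        except (TypeError, ValueError):
            errors.append(f"row {index}: unparseable Date {record.get('Date')!r}")
        try:
            Decimal(record.get("TotalAmount"))
        except (TypeError, ValueError, InvalidOperation):
            errors.append(f"row {index}: non-numeric TotalAmount "
                          f"{record.get('TotalAmount')!r}")
        if record.get("Currency") not in ALLOWED_CURRENCIES:
            errors.append(f"row {index}: Currency {record.get('Currency')!r} not allowed")
    return {"rowCount": len(section["rows"]), "errors": errors, "ok": not errors}
```

Field-level problems are collected into the report, while the zero-row case raises immediately, mirroring the two bullets above.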
+### Artifacts (Debug Mode)
+- `*_header_inventory.json` / `*_path_inventory.json`
+- `*_mapping.json` (AI output)
+- `*_normalization_report.json` (counts, per-field conversions)
+- `*_canonical_merged.json` (final single table)
+
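Artifact writing could look like the sketch below. The documented signature is `persistArtifacts(artifacts, whenDebug=True)`; the extra `outputDir` and `prefix` parameters, and the `run42` prefix in the usage lines, are assumptions standing in for whatever replaces the `*` in the file names above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def persistArtifacts(artifacts: dict[str, Any], outputDir: Path,
                     prefix: str, whenDebug: bool = True) -> list[Path]:
    """Write each artifact as <prefix>_<name>.json; do nothing unless debugging."""
    if not whenDebug:
        return []
    outputDir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, payload in artifacts.items():
        path = outputDir / f"{prefix}_{name}.json"
        # ensure_ascii=False keeps headers such as "Händler" readable on disk.
        path.write_text(json.dumps(payload, ensure_ascii=False, indent=2),
                        encoding="utf-8")
        written.append(path)
    return written


# Usage with a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = persistArtifacts(
        {"mapping": {"mappings": {"Datum": "Date"}},
         "normalization_report": {"rows": 1}},
        Path(tmp), "run42")
    writtenNames = sorted(path.name for path in paths)
    reloaded = json.loads(paths[0].read_text(encoding="utf-8"))
```

Returning the written paths makes it easy to attach them to an error raised later in the same run.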
+### Error Handling
+- Single-path only (no fallback). If the mapping is invalid or yields zero rows:
+  - Raise an error with pointers to artifacts for diagnosis
+  - The workflow surface shows a concise failure reason
+
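One way to carry artifact pointers on the error, as a sketch. The class name `NormalizationError` and the helper `requireRows` are hypothetical.

```python
class NormalizationError(Exception):
    """Raised when the mapping is invalid or normalization yields zero rows."""

    def __init__(self, reason: str, artifactPaths: list[str]):
        super().__init__(reason)
        self.reason = reason                      # concise text for the workflow
        self.artifactPaths = list(artifactPaths)  # pointers for diagnosis

    def __str__(self) -> str:
        pointers = ", ".join(self.artifactPaths) or "none written"
        return f"{self.reason} (artifacts: {pointers})"


def requireRows(rows: list, artifactPaths: list[str]) -> list:
    """Single path: either return the rows or fail; there is no fallback branch."""
    if not rows:
        raise NormalizationError("normalization produced zero rows", artifactPaths)
    return rows
```

The workflow can show `error.reason` to the user while logs keep the full message with the artifact paths.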
+### Testing Strategy
+- Unit: mapping schema validation; deterministic value normalization; row merge.
+- Integration: feed synthetic merged JSON with heterogeneous headers; assert canonical table shape and non-empty rows.
+- Golden files: store sample inventories, mappings, canonical outputs under test resources.
+
+### Non-Goals
+- Do not encode language heuristics in code.
+- Do not normalize via AI over full data rows (AI only returns mapping + policies; code applies them).
+
+### Performance Considerations
+- In-memory (RAM) processing is acceptable for current document sizes.
+- Mapping prompt uses only inventories and samples (small payload), not full data.
+
+### Security & Compliance
+- Debug artifacts are gated by the existing debug flag.
+- The mapping cache is limited to the current workflow run unless explicitly persisted.
+
+### Rollout Plan
+1) Implement NormalizationService behind a feature flag (default on in dev/test).
+2) Wire into `SubDocumentProcessing.processDocumentsPerChunkJson`.
+3) Add debug artifact writing.
+4) Create unit/integration tests and golden fixtures.
+5) Validate CSV/HTML outputs are populated for heterogeneous sources.
+
+### Open Extensions
+- Multi-table merging strategies (per VAT rate, per document) before final consolidation.
+- Domain-specific canonical schemas selectable by task type.
+
+