doc added
parent 5eb5bc0154
commit 23e752fa8d
2 changed files with 119 additions and 0 deletions
README.md: Binary file not shown.
poweron/implementation/implement_normalization_service.md (normal file, 119 additions)
@@ -0,0 +1,119 @@

## Normalization Service Integration (Design + Refactor)

### Goal
Introduce a deterministic Normalization stage between the extraction merge and format-specific rendering. The stage guarantees a single canonical, aligned table for all downstream outputs (CSV/HTML/Excel/XML/JSON objects) and contains no code that is specific to the language of the source documents.

### Scope
- Applies to all workflows that use per-chunk extraction and merged JSON.
- Runs after merged JSON assembly and before Generation/Rendering.
- Single-path flow (no fallbacks), testable end-to-end.
- The AI mapping is cached for the current workflow run; prompts remain generic.

### Canonical Output Contract
- One consolidated table section in merged JSON:
  - headers: ["Date","Merchant","CreditCardNumber","TotalAmount","Currency","VATRate"]
  - rows: normalized and aligned values
- This canonical table is the only input the format engines require.

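The contract above can be pictured as a small data structure. This is only a sketch: the `CanonicalTable` class and its `addRow` helper are hypothetical names, and filling missing fields with `None` is an assumption, since the design does not say how absent values are represented.

```python
from dataclasses import dataclass, field

# Header order taken from the contract above.
CANONICAL_HEADERS = [
    "Date", "Merchant", "CreditCardNumber", "TotalAmount", "Currency", "VATRate",
]


@dataclass
class CanonicalTable:
    """Hypothetical container for the single consolidated table section."""

    headers: list = field(default_factory=lambda: list(CANONICAL_HEADERS))
    rows: list = field(default_factory=list)

    def addRow(self, values: dict) -> None:
        # Align values to the header order; missing fields become None (assumption).
        self.rows.append([values.get(header) for header in self.headers])


table = CanonicalTable()
table.addRow({"Date": "2024-03-01", "Merchant": "ACME",
              "TotalAmount": "12.50", "Currency": "EUR"})
```

Every row has exactly one slot per canonical header, which is what lets the format engines treat the table as their only input.
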
### High-Level Flow
1) Extraction per chunk → merged JSON (unchanged)
2) Normalization Service:
   - Analyze structures: discover candidate headers/paths and value samples (no AI)
   - Get mapping from AI (bounded: the response is deterministic mapping JSON only)
   - Apply mapping deterministically in code (flatten + rename + value normalize)
   - Validate the canonical table; fail if it has zero rows
   - Emit debug artifacts for traceability
3) Generation/Rendering consumes the canonical table

### Module and Integration Points
- Add `modules/services/serviceNormalization/mainServiceNormalization.py` with public API:
  - `discoverStructures(mergedJson) -> StructureInventory`
  - `requestHeaderMapping(inventory, canonicalSpec, cache) -> MappingSpec`
  - `applyMapping(mergedJson, mappingSpec) -> CanonicalMergedJson`
  - `validateCanonical(canonicalJson) -> ValidationReport`
  - `persistArtifacts(artifacts, whenDebug=True)`

- Call site: `SubDocumentProcessing.processDocumentsPerChunkJson(...)`
  - After `_mergeChunkResultsJson(...)`
  - Before `SubDocumentGeneration`/`GenerationService` rendering

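One way the call site might chain the public API in the order given by the high-level flow is sketched below. The `NormalizationService` protocol only mirrors the method names listed above; `normalizeMergedJson`, its `debug` parameter, the artifact keys, and the assumption that `validateCanonical` raises on failure are all hypothetical.

```python
from typing import Any, Protocol


class NormalizationService(Protocol):
    """Structural type mirroring the public API listed above."""

    def discoverStructures(self, mergedJson: dict) -> Any: ...
    def requestHeaderMapping(self, inventory: Any, canonicalSpec: dict, cache: dict) -> Any: ...
    def applyMapping(self, mergedJson: dict, mappingSpec: Any) -> dict: ...
    def validateCanonical(self, canonicalJson: dict) -> Any: ...
    def persistArtifacts(self, artifacts: dict, whenDebug: bool = True) -> None: ...


def normalizeMergedJson(service: NormalizationService, mergedJson: dict,
                        canonicalSpec: dict, cache: dict, debug: bool = False) -> dict:
    """Hypothetical glue, meant to run after the merge and before rendering."""
    artifacts: dict = {}
    try:
        artifacts["inventory"] = service.discoverStructures(mergedJson)
        artifacts["mapping"] = service.requestHeaderMapping(
            artifacts["inventory"], canonicalSpec, cache)
        artifacts["canonical"] = service.applyMapping(mergedJson, artifacts["mapping"])
        # Assumption: validateCanonical raises when the table is invalid or empty.
        artifacts["report"] = service.validateCanonical(artifacts["canonical"])
        return artifacts["canonical"]
    finally:
        # Persist whatever was produced, on success and on failure alike.
        service.persistArtifacts(artifacts, whenDebug=debug)
```

Using `try`/`finally` keeps the flow single-path while still leaving artifacts behind when a step fails.
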
### Detailed Steps
#### 1) Discover Structures (No AI)
- Input: merged JSON with sections (tables, paragraphs, headings). For JSON/XML sources, also consider objects.
- Output: `StructureInventory` containing:
  - `tableHeaders`: set of distinct header labels (deduped across sections)
  - `headerSamples`: small value samples per header (e.g., up to 5 distinct examples)
  - `objectPaths`: candidate field paths for JSON/XML object sources with value samples (optional)

Notes:
- Process in memory; streaming is not required for this step.
- No language rules in code.

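A minimal sketch of the table part of this discovery step follows. The merged JSON shape used here (`sections` entries with `type`, `headers`, and `rows`) is an assumption, since the document does not spell it out, and the inventory is returned as a plain dict rather than a `StructureInventory` type. Object paths are left out.

```python
def discoverStructures(mergedJson: dict, maxSamples: int = 5) -> dict:
    """Collect distinct table headers and a few distinct sample values per header."""
    tableHeaders = []    # deduped across sections, kept in first-seen order
    headerSamples = {}   # header label -> up to maxSamples distinct values
    for section in mergedJson.get("sections", []):
        if section.get("type") != "table":
            continue  # paragraphs and headings carry no headers
        headers = section.get("headers", [])
        for header in headers:
            if header not in headerSamples:
                tableHeaders.append(header)
                headerSamples[header] = []
        for row in section.get("rows", []):
            for header, value in zip(headers, row):
                samples = headerSamples[header]
                isEmpty = value is None or value == ""
                if not isEmpty and value not in samples and len(samples) < maxSamples:
                    samples.append(value)
    return {"tableHeaders": tableHeaders, "headerSamples": headerSamples}


merged = {"sections": [
    {"type": "table", "headers": ["Datum", "Betrag"],
     "rows": [["01.03.2024", "12,50"], ["01.03.2024", "7,00"]]},
    {"type": "paragraph", "text": "ignored"},
    {"type": "table", "headers": ["Date", "Betrag"], "rows": [["2024-03-02", "3,10"]]},
]}
inventory = discoverStructures(merged)
```

The function only collects labels and values; it never inspects what a header means, which keeps language rules out of the code.
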
#### 2) AI Mapping (Headers/Paths → Canonical)
- Prompt inputs (generic):
  - Canonical schema (names + short definitions + constraints)
  - Discovered header inventory (and object paths, if any) + minimal value samples
- Response (strict JSON):
  - `mappings`: { "Datum": "Date", "Händler": "Merchant", ... }
  - Optional `objectPathMappings` for JSON/XML flattening
  - `normalizationPolicy` with specific rules per field (e.g., decimalSeparator, dateFormatCandidates, currencyPlacement)

Constraints:
- One-to-one or null mappings only (no invented fields)
- No free text; reject the response if it does not match the schema
- Cache the mapping for the current workflow run (in-memory + `test-chat/ai/mapping.json` when debug)

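The constraints above could be enforced with a small validator before the mapping is cached. This is a sketch under assumptions: `parseMappingResponse` is a hypothetical helper, and only the `mappings` key is checked here. Each source header may map to one canonical name or to `null`, so several source headers may share the same target.

```python
import json


def parseMappingResponse(rawResponse: str, knownHeaders, canonicalHeaders) -> dict:
    """Parse the AI response and reject anything outside the mapping contract."""
    data = json.loads(rawResponse)  # free text fails here (JSONDecodeError is a ValueError)
    if not isinstance(data, dict) or not isinstance(data.get("mappings"), dict):
        raise ValueError("Response must be a JSON object with a 'mappings' object")
    for source, target in data["mappings"].items():
        if source not in knownHeaders:
            raise ValueError(f"Mapping refers to an undiscovered header: {source!r}")
        if target is not None and target not in canonicalHeaders:
            raise ValueError(f"Mapping invents a non-canonical field: {target!r}")
    return data


spec = parseMappingResponse(
    '{"mappings": {"Datum": "Date", "Notiz": null}}',
    knownHeaders={"Datum", "Notiz"},
    canonicalHeaders={"Date", "Merchant"},
)
```
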
#### 3) Apply Mapping (Deterministic Code)
- Flatten JSON/XML objects per `objectPathMappings` (when provided) into rows.
- Transform table sections to the canonical headers per `mappings`.
- Normalize values per `normalizationPolicy`:
  - Decimals: comma → dot
  - Currency: split symbol/code into `Currency`, keep numeric in `TotalAmount`
  - Date: parse with candidate formats → ISO or target format
  - CreditCardNumber: extract the last 4 digits if masked
- Merge all rows into a single table with canonical headers.

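The value rules listed above might look roughly like the helpers below. The function names, the regular expressions, and the set of mask characters are assumptions made for illustration; the parameters `decimalSeparator` and `dateFormatCandidates` follow the policy keys named in step 2.

```python
import re
from datetime import datetime
from decimal import Decimal


def normalizeAmount(raw: str, decimalSeparator: str = ","):
    """Split a value such as '1.234,56 €' into (numeric amount, currency)."""
    currency = "".join(re.findall(r"[^\d\s.,\-+]+", raw)) or None
    number = re.sub(r"[^\d.,\-]", "", raw)
    if decimalSeparator == ",":
        # Dots are thousands separators here; the comma becomes the decimal dot.
        number = number.replace(".", "").replace(",", ".")
    else:
        number = number.replace(",", "")
    return Decimal(number), currency


def normalizeDate(raw: str, dateFormatCandidates) -> str:
    """Try each candidate format in order and return an ISO date string."""
    for fmt in dateFormatCandidates:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unparseable date: {raw!r}")


def normalizeCardNumber(raw: str) -> str:
    """Keep only the last 4 digits when the value is masked (assumed mask chars)."""
    digits = re.sub(r"\D", "", raw)
    isMasked = bool(re.search(r"[*xX•]", raw))
    return digits[-4:] if isMasked else digits
```

For example, `normalizeAmount("EUR 12,50")` yields the amount `12.50` with currency `"EUR"`, and `normalizeDate("01.03.2024", ["%Y-%m-%d", "%d.%m.%Y"])` yields `"2024-03-01"`.
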
#### 4) Validate
- Field-level validators (parseable date, numeric amounts, currency in allowed set).
- If zero rows after normalization: fail with a clear error and persist artifacts.

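A field-level validator along these lines is one possible reading of this step. The allowed currency set, the ISO date expectation, and the report shape (`rowCount` plus a list of `(rowIndex, field)` errors) are assumptions, not part of the design.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

ALLOWED_CURRENCIES = {"EUR", "USD", "CHF"}  # assumption: the real set is not specified


def validateCanonical(canonicalJson: dict) -> dict:
    """Fail on an empty table; otherwise collect per-field problems."""
    headers, rows = canonicalJson["headers"], canonicalJson["rows"]
    if not rows:
        raise ValueError("Normalization produced zero rows; see debug artifacts")
    errors = []
    for index, row in enumerate(rows):
        record = dict(zip(headers, row))
        try:
            date.fromisoformat(record.get("Date"))  # assumes ISO dates after step 3
        except (TypeError, ValueError):
            errors.append((index, "Date"))
        try:
            amountOk = Decimal(str(record.get("TotalAmount"))).is_finite()
        except InvalidOperation:
            amountOk = False
        if not amountOk:
            errors.append((index, "TotalAmount"))
        if record.get("Currency") not in ALLOWED_CURRENCIES:
            errors.append((index, "Currency"))
    return {"rowCount": len(rows), "errors": errors}
```
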
### Artifacts (Debug Mode)
- `*_header_inventory.json` / `*_path_inventory.json`
- `*_mapping.json` (AI output)
- `*_normalization_report.json` (counts, per-field conversions)
- `*_canonical_merged.json` (final single table)

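Writing these files could be as simple as the sketch below. The `outputDir` and `prefix` parameters are hypothetical additions to the `persistArtifacts(artifacts, whenDebug=True)` signature named earlier, and the `<prefix>_<name>.json` naming is only one way to satisfy the `*_mapping.json` style patterns.

```python
import json
from pathlib import Path


def persistArtifacts(artifacts: dict, outputDir: str, prefix: str,
                     whenDebug: bool = True) -> list:
    """Write each artifact as <prefix>_<name>.json; do nothing outside debug mode."""
    if not whenDebug:
        return []
    target = Path(outputDir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, payload in artifacts.items():
        path = target / f"{prefix}_{name}.json"
        # default=str keeps non-JSON values (e.g. Decimal) from aborting the dump.
        path.write_text(
            json.dumps(payload, ensure_ascii=False, indent=2, default=str),
            encoding="utf-8",
        )
        written.append(path)
    return written
```

Returning the written paths gives the error handling something concrete to point at.
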
### Error Handling
- Single-path only (no fallback). If mapping invalid or yields zero rows:
  - Raise an error with pointers to artifacts for diagnosis
  - The workflow surface shows a concise failure reason

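An error type that carries both the concise reason and the artifact pointers might look like this. `NormalizationError` and its two methods are hypothetical names, sketched here only to show how the two audiences (workflow surface and diagnosis) could be served by one exception.

```python
class NormalizationError(Exception):
    """Hypothetical error raised when the mapping is invalid or yields zero rows."""

    def __init__(self, reason: str, artifactPaths=()):
        super().__init__(reason)
        self.reason = reason
        self.artifactPaths = [str(path) for path in artifactPaths]

    def userMessage(self) -> str:
        # Concise text intended for the workflow surface.
        return f"Normalization failed: {self.reason}"

    def diagnosticMessage(self) -> str:
        # Longer text with pointers to the debug artifacts.
        pointers = ", ".join(self.artifactPaths) or "no artifacts written"
        return f"{self.userMessage()} (artifacts: {pointers})"


error = NormalizationError("zero rows after mapping", ["run1_mapping.json"])
```
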
### Testing Strategy
- Unit: mapping schema validation; deterministic value normalization; row merge.
- Integration: feed synthetic merged JSON with heterogeneous headers; assert canonical table shape and non-empty rows.
- Golden files: store sample inventories, mappings, canonical outputs under test resources.

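An integration-style test in the spirit of the second bullet could look like the following. The `applyMapping` shown here is a simplified stand-in (header rename, column alignment, row merge only) rather than the real service, and the merged JSON shape is again an assumption.

```python
CANONICAL_HEADERS = [
    "Date", "Merchant", "CreditCardNumber", "TotalAmount", "Currency", "VATRate",
]


def applyMapping(mergedJson: dict, mappings: dict) -> dict:
    """Stand-in for the service: rename headers, align columns, merge rows."""
    rows = []
    for section in mergedJson["sections"]:
        if section.get("type") != "table":
            continue
        targets = [mappings.get(header) for header in section["headers"]]
        for row in section["rows"]:
            record = {t: v for t, v in zip(targets, row) if t is not None}
            rows.append([record.get(header) for header in CANONICAL_HEADERS])
    return {"headers": list(CANONICAL_HEADERS), "rows": rows}


def test_heterogeneous_headers_yield_one_canonical_table():
    merged = {"sections": [
        {"type": "table", "headers": ["Datum", "Händler"],
         "rows": [["2024-03-01", "ACME"]]},
        {"type": "table", "headers": ["Merchant", "Date"],
         "rows": [["Globex", "2024-03-02"]]},
    ]}
    mappings = {"Datum": "Date", "Händler": "Merchant",
                "Merchant": "Merchant", "Date": "Date"}
    canonical = applyMapping(merged, mappings)
    assert canonical["headers"] == CANONICAL_HEADERS
    assert len(canonical["rows"]) == 2
    assert all(len(row) == len(CANONICAL_HEADERS) for row in canonical["rows"])
    # Column order follows the canonical headers, not the source order.
    assert canonical["rows"][1][:2] == ["2024-03-02", "Globex"]
```
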
### Non-Goals
- Do not encode language heuristics in code.
- Do not normalize via AI over full data rows (AI only returns mapping + policies; code applies them).

### Performance Considerations
- In-memory (RAM-based) processing is acceptable for current document sizes.
- Mapping prompt uses only inventories and samples (small payload), not full data.

### Security & Compliance
- Debug artifacts are gated by the existing debug flag.
- The mapping cache is limited to the current workflow run unless explicitly persisted.

### Rollout Plan
1) Implement NormalizationService with feature flag (default on in dev/test).
2) Wire into `SubDocumentProcessing.processDocumentsPerChunkJson`.
3) Add debug artifact writing.
4) Create unit/integration tests and golden fixtures.
5) Validate CSV/HTML outputs are populated for heterogeneous sources.

### Open Extensions
- Multi-table merging strategies (per VAT rate, per document) before final consolidation.
- Domain-specific canonical schemas selectable by task type.