wiki/poweron/implementation/implement_normalization_service.md
2025-10-14 17:54:00 +02:00

Normalization Service Integration (Design + Refactor)

Goal

Introduce a deterministic Normalization stage between the extraction merge and format-specific rendering. It guarantees a single canonical, aligned table for all downstream outputs (CSV, HTML, Excel, XML, JSON objects) without natural-language-specific rules in code.

Scope

  • Applies to all workflows that use per-chunk extraction and merged JSON.
  • Lives after merged JSON assembly and before Generation/Rendering.
  • Single-path flow (no fallbacks), testable end-to-end.
  • Cache mapping for the current workflow run; prompts remain generic.

Canonical Output Contract

  • One consolidated table section in merged JSON:
    • headers: ["Date","Merchant","CreditCardNumber","TotalAmount","Currency","VATRate"]
    • rows: normalized and aligned values
  • This canonical table is the only input the format engines require.
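
As an illustration of the contract above, a canonical table section could look like the following. This is a minimal sketch: only `headers` and `rows` come from this document, while the `type` key, the sample row, and the `isAligned` helper are assumptions made for the example.

```python
# Canonical headers as listed in the contract above.
CANONICAL_HEADERS = [
    "Date", "Merchant", "CreditCardNumber", "TotalAmount", "Currency", "VATRate",
]

# Hypothetical section layout; the "type" key and the row values are invented.
canonicalSection = {
    "type": "table",
    "headers": CANONICAL_HEADERS,
    "rows": [
        ["2025-10-01", "Example Store", "1234", "42.50", "EUR", "19"],
    ],
}


def isAligned(section):
    """True if the headers are canonical and every row has one value per header."""
    return section["headers"] == CANONICAL_HEADERS and all(
        len(row) == len(CANONICAL_HEADERS) for row in section["rows"]
    )
```

A format engine could call `isAligned(canonicalSection)` as a cheap precondition before rendering.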

High-Level Flow

  1. Extraction per chunk → merged JSON (unchanged)
  2. Normalization Service:
    • Analyze structures: discover candidate headers/paths and value samples (no AI)
    • Get mapping from AI (bounded; the response is strict mapping JSON only)
    • Apply mapping deterministically in code (flatten + rename + value normalize)
    • Validate canonical table; fail if zero rows
    • Emit debug artifacts for traceability
  3. Generation/Rendering consumes canonical table

Module and Integration Points

  • Add modules/services/serviceNormalization/mainServiceNormalization.py with public API:

    • discoverStructures(mergedJson) -> StructureInventory
    • requestHeaderMapping(inventory, canonicalSpec, cache) -> MappingSpec
    • applyMapping(mergedJson, mappingSpec) -> CanonicalMergedJson
    • validateCanonical(canonicalJson) -> ValidationReport
    • persistArtifacts(artifacts, whenDebug=True)
  • Call site: SubDocumentProcessing.processDocumentsPerChunkJson(...)

    • After _mergeChunkResultsJson(...)
    • Before SubDocumentGeneration/GenerationService rendering
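
The call order implied by the public API could be sketched as below. The function name `runNormalization`, the `service` parameter, and the artifact dictionary keys are hypothetical; only the five method names and their argument order come from the list above. Failure handling is intentionally left out of this sketch.

```python
def runNormalization(service, mergedJson, canonicalSpec, cache, debug=False):
    """Hypothetical orchestration: call the five public functions in order."""
    inventory = service.discoverStructures(mergedJson)
    mappingSpec = service.requestHeaderMapping(inventory, canonicalSpec, cache)
    canonicalJson = service.applyMapping(mergedJson, mappingSpec)
    report = service.validateCanonical(canonicalJson)
    if debug:
        # Artifact keys are assumptions, not names defined by this document.
        service.persistArtifacts(
            {
                "inventory": inventory,
                "mapping": mappingSpec,
                "report": report,
                "canonical": canonicalJson,
            },
            whenDebug=True,
        )
    return canonicalJson
```

Under this sketch, the call site would invoke `runNormalization(...)` on the result of `_mergeChunkResultsJson(...)` and hand the returned canonical JSON to rendering.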

Detailed Steps

1) Discover Structures (No AI)

  • Input: merged JSON with sections (tables, paragraphs, headings). For JSON/XML sources, also consider objects.
  • Output: StructureInventory containing:
    • tableHeaders: set of distinct header labels (deduped across sections)
    • headerSamples: small value samples per header (e.g., up to 5 distinct examples)
    • objectPaths: candidate field paths for JSON/XML object sources with value samples (optional)

Notes:

  • Process in memory; streaming is not required for this step.
  • No natural-language-specific rules in code.
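
A minimal sketch of the discovery step for table sections follows. It assumes the merged JSON has the shape `{"sections": [{"type": "table", "headers": [...], "rows": [[...]]}]}`, which this document does not specify, and it returns a plain dict instead of a `StructureInventory` type. Object paths are omitted.

```python
def discoverStructures(mergedJson, maxSamples=5):
    """Collect distinct table headers and up to maxSamples distinct values each."""
    headerSamples = {}
    for section in mergedJson.get("sections", []):
        if section.get("type") != "table":  # assumed section discriminator
            continue
        headers = section.get("headers", [])
        for header in headers:
            headerSamples.setdefault(header, [])
        for row in section.get("rows", []):
            for header, value in zip(headers, row):
                samples = headerSamples[header]
                if value not in samples and len(samples) < maxSamples:
                    samples.append(value)
    return {
        "tableHeaders": sorted(headerSamples),  # deduped across sections
        "headerSamples": headerSamples,
    }
```

Nothing here inspects what a header means; the function only counts and samples, which keeps natural-language knowledge out of the code.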

2) AI Mapping (Headers/Paths → Canonical)

  • Prompt inputs (generic):
    • Canonical schema (names + short definitions + constraints)
    • Discovered header inventory (and object paths, if any) + minimal value samples
  • Response (strict JSON):
    • mappings: { "Datum": "Date", "Händler": "Merchant", ... }
    • Optional objectPathMappings for JSON/XML flattening
    • normalizationPolicy with specific rules per field (e.g., decimalSeparator, dateFormatCandidates, currencyPlacement)

Constraints:

  • One-to-one or null mappings only (no invented fields)
  • No free text; reject the response if it fails schema validation
  • Cache mapping for the current workflow run (in memory, plus test-chat/ai/mapping.json when debug is enabled)
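
The constraints above could be enforced with a check along these lines. This is a sketch under assumptions: the function name `parseMappingResponse` is hypothetical, "one-to-one" is read as "each source header maps to exactly one canonical field or to null", and `objectPathMappings` and `normalizationPolicy` are not validated here.

```python
import json

CANONICAL_HEADERS = {
    "Date", "Merchant", "CreditCardNumber", "TotalAmount", "Currency", "VATRate",
}


def parseMappingResponse(rawText, discoveredHeaders):
    """Parse the AI response and raise ValueError if it violates the constraints."""
    # json.JSONDecodeError is a subclass of ValueError, so free text is rejected here.
    data = json.loads(rawText)
    mappings = data.get("mappings") if isinstance(data, dict) else None
    if not isinstance(mappings, dict):
        raise ValueError("response must be a JSON object with a 'mappings' object")
    for source, target in mappings.items():
        if source not in discoveredHeaders:
            raise ValueError(f"mapping for undiscovered header: {source!r}")
        if target is not None and target not in CANONICAL_HEADERS:
            raise ValueError(f"invented canonical field: {target!r}")
    return mappings
```

A validated result can then be stored in the per-run cache so later chunks of the same workflow run reuse it without another AI call.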

3) Apply Mapping (Deterministic Code)

  • Flatten JSON/XML objects per objectPathMappings (when provided) into rows.
  • Transform table sections to the canonical headers per mappings.
  • Normalize values per normalizationPolicy:
    • Decimals: comma → dot
    • Currency: split symbol/code into Currency, keep numeric in TotalAmount
    • Date: parse with candidate formats → ISO or target format
    • CreditCardNumber: extract the last 4 digits if masked
  • Merge all rows into a single table with canonical headers.
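
The value normalization rules above could be sketched as follows. All four function names are hypothetical, the mask characters (`*`, `x`, `X`) are an assumption, and thousands separators are deliberately not handled; in the real service the candidate date formats would come from `normalizationPolicy`.

```python
import re
from datetime import datetime


def normalizeDecimal(text):
    """Replace a decimal comma with a dot (thousands separators not handled)."""
    return text.strip().replace(",", ".")


def splitCurrency(text):
    """Split e.g. '12,50 €' or 'EUR 12,50' into (amount, currency)."""
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        return ("", text.strip())
    amount = normalizeDecimal(match.group())
    currency = (text[:match.start()] + text[match.end():]).strip()
    return (amount, currency)


def normalizeDate(text, candidateFormats):
    """Try each strptime format in order; return an ISO date string or None."""
    for fmt in candidateFormats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def lastFourIfMasked(text):
    """Return the last 4 digits if the value looks masked, else the value as-is."""
    if re.search(r"[*xX]", text):
        return re.sub(r"\D", "", text)[-4:]
    return text.strip()
```

For example, `splitCurrency("12,50 €")` yields an amount of `"12.50"` and a currency of `"€"`, which map to TotalAmount and Currency respectively.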

4) Validate

  • Field-level validators (parseable date, numeric amounts, currency in allowed set).
  • If normalization yields zero rows: fail with a clear error and persist artifacts.
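
A minimal sketch of these checks is shown below. It assumes dates have already been normalized to ISO format and amounts to dot-decimal strings, takes the allowed currency set as a parameter, and returns a plain dict rather than a `ValidationReport` type; all of these are assumptions for illustration. Persisting artifacts on failure is left to the caller.

```python
from datetime import date


def validateCanonical(canonicalSection, allowedCurrencies):
    """Return a report dict; raise ValueError if the table has zero rows."""
    headers = canonicalSection["headers"]
    rows = canonicalSection["rows"]
    if not rows:
        raise ValueError("normalization produced zero rows")
    errors = []  # list of (rowIndex, fieldName) pairs
    for index, row in enumerate(rows):
        record = dict(zip(headers, row))
        try:
            date.fromisoformat(record.get("Date"))
        except (TypeError, ValueError):
            errors.append((index, "Date"))
        try:
            float(record.get("TotalAmount"))
        except (TypeError, ValueError):
            errors.append((index, "TotalAmount"))
        if record.get("Currency") not in allowedCurrencies:
            errors.append((index, "Currency"))
    return {"rowCount": len(rows), "errors": errors}
```

Whether field-level errors should fail the run or only be reported is not stated here; this sketch only reports them.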

Artifacts (Debug Mode)

  • *_header_inventory.json / *_path_inventory.json
  • *_mapping.json (AI output)
  • *_normalization_report.json (counts, per-field conversions)
  • *_canonical_merged.json (final single table)

Error Handling

  • Single-path only (no fallback). If mapping invalid or yields zero rows:
    • Raise an error with pointers to artifacts for diagnosis
    • The workflow surfaces a concise failure reason
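
One way to carry both the concise reason and the artifact pointers is a dedicated exception type. The class name `NormalizationError` and its attributes are hypothetical; the document only requires that the error points to the artifacts.

```python
class NormalizationError(Exception):
    """Hypothetical error carrying a short reason plus artifact paths."""

    def __init__(self, reason, artifactPaths):
        self.reason = reason  # concise text for the workflow to display
        self.artifactPaths = list(artifactPaths)  # pointers for diagnosis
        pointers = ", ".join(self.artifactPaths) or "none"
        super().__init__(f"{reason} (artifacts: {pointers})")
```

The workflow could display `error.reason` to the user and log the full message, which includes the artifact paths.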

Testing Strategy

  • Unit: mapping schema validation; deterministic value normalization; row merge.
  • Integration: feed synthetic merged JSON with heterogeneous headers; assert canonical table shape and non-empty rows.
  • Golden files: store sample inventories, mappings, canonical outputs under test resources.

Non-Goals

  • Do not encode natural-language heuristics in code.
  • Do not normalize via AI over full data rows (AI only returns mapping + policies; code applies them).

Performance Considerations

  • In-memory processing is acceptable for current document sizes.
  • Mapping prompt uses only inventories and samples (small payload), not full data.

Security & Compliance

  • Debug artifacts gated by existing debug flag.
  • Mapping cache limited to current workflow run unless explicitly persisted.

Rollout Plan

  1. Implement NormalizationService behind a feature flag (on by default in dev/test).
  2. Wire into SubDocumentProcessing.processDocumentsPerChunkJson.
  3. Add debug artifact writing.
  4. Create unit/integration tests and golden fixtures.
  5. Validate that CSV/HTML outputs are populated for heterogeneous sources.

Open Extensions

  • Multi-table merging strategies (per VAT rate, per document) before final consolidation.
  • Domain-specific canonical schemas selectable by task type.