working on unified knowledge ingestion

2026-04-21 12:46:41 +02:00 · 2026-04-21 12:46:41 +02:00 · a5910121b9
commit a5910121b9
parent 0cc57f5e5e
2 changed files with 27 additions and 4 deletions
--- a/TOPICS.md
+++ b/TOPICS.md
@ -44,6 +44,7 @@ Lade immer zuerst diese Datei. Dann gezielt die passende(n) Referenz-Datei(en).
 | Thema | Datei | Wann laden |
 |-------|-------|------------|
 | Automation Unification | c-work/1-plan/2026-04-automation-unification.md | Refactoring v1/v2/Workspace |
+| Unified Knowledge Indexing (RAG) | c-work/2-build/2026-04-id-unified-knowledge-indexing-rag-concept.md | Ingestion-Fassade `requestIngestion`, Idempotenz, Connector-Lifecycle |
 | Zentrale Workflow-Admin (Meine Sicht) | c-work/1-plan/2026-04-automation-central-admin.md | `/automations` Tabs Dashboard + Workflows, `GET .../workflow-runs/workflows` |
 | Web Image Search | c-work/1-plan/2026-03-web-image-search.md | WEB_SEARCH_MEDIA Feature |
 | UI i18n / Sprachsets (done) | c-work/3-validate/2026-04-ui-i18n-dynamic-language-sets.md | Mehrsprachigkeit, `t()`, Sprachset-API, Admin-UI, AI-Übersetzung |
--- a/c-work/2-build/2026-04-id-unified-knowledge-indexing-rag-concept.md
+++ b/c-work/2-build/2026-04-id-unified-knowledge-indexing-rag-concept.md
@ -1,5 +1,6 @@
-<!-- status: plan -->
+<!-- status: build -->
 <!-- started: 2026-04-16 -->
+<!-- lastReviewed: 2026-04-21 -->
 <!-- component: gateway | platform | frontend-nyla -->

 # Unified Knowledge Indexing — One RAG Corpus for All Platform Information
@ -141,6 +142,19 @@ The implementation can stay in-process at first (asyncio task queue) and move to
 - **Skip** when hash unchanged to control embedding cost.
 - **Tombstone** or **scope-disable** when a source is deleted or access revoked (see Teil 2).

+#### Implementation pitfalls (observed during P0 build, 2026-04-21)
+
+The first end-to-end AC4 test on a 500-page PDF revealed **three** independent bugs that all had to be fixed before `ingestion.skipped.duplicate` could ever fire. Each is a **design rule** that every future ingestion lane must honor:
+
+1. **Hash must derive only from content.** `_computeIngestionHash` initially hashed over `(contentObjectId, contentType, data)`, but `contentObjectId` came from `uuid.uuid4()` inside the extractors and was therefore a fresh value on every run. The hash was effectively a per-run nonce — the duplicate check could never match. **Rule:** hashes MUST be a pure function of payload (`contentType`, `data`, and extractor order); never of caller-supplied per-run identifiers. (Tests: `tests/unit/services/test_ingestion_hash_stability.py`.)
+2. **Pre-upserts must preserve `_ingestion` metadata and the `indexed` status.** `routeDataFiles._autoIndexFile` persisted a fresh `FileContentIndex` from the pre-scan **before** calling `requestIngestion`, overwriting `structure._ingestion.hash` and `status="indexed"` from any prior successful run. The duplicate check saw a row with empty metadata and re-ran the whole embedding stage. **Rule:** any upsert on the idempotency row taken outside `requestIngestion` MUST read the existing row first and merge forward both `_ingestion` and (where applicable) the terminal `indexed` status.
+3. **Extraction-pipeline defaults must preserve granularity for RAG.** `ExtractionOptions.mergeStrategy` defaulted to concatenating every text `ContentPart` into one blob, collapsing a 500-page PDF into a single chunk whose embedding is a blurred average of the whole document — unusable for targeted retrieval. **Rule:** every ingestion lane passes `mergeStrategy=None` explicitly until the default itself can be safely flipped after auditing non-RAG callers. (Tests: `tests/unit/services/test_extraction_merge_strategy.py`.)
+
+**Deferred to P1** (uncovered during P0, not blocking AC1–AC5):
+
+- **In-flight duplicate detection.** The current duplicate check only matches when `status == "indexed"`, so two nearly-simultaneous calls for the same `sourceId` both run full embedding. Fix candidates: accept `status ∈ {"extracted", "embedding", "indexed"}` with matching hash as "already in progress", or a per-`sourceId` `asyncio.Lock` in `KnowledgeService`.
+- **Pre-extraction byte-hash shortcut.** `requestIngestion`'s duplicate check runs **after** extraction, so re-indexing a 1.6 MB PDF still spends ~15 s in `runExtraction` before the content hash is computed. The file-bytes SHA already exists in `interfaceDbManagement` for upload-dedup — a short-circuit in `_autoIndexFile` (and symmetric paths) could skip extraction entirely for an unchanged file.
+
 ---

 ## Teil 2 — Triggers: not only “file write”, but every information delta
@ -286,6 +300,8 @@ These domains **do not** call **`runAgent`** in their **`modules/features/*`** t

 ### 3.3 Summary matrix (per `modules/features/` domain)

+*Matrix verified by audit on 2026-04-21 (P0):* Under `gateway/modules/features/`, only `workspace`, `graphicalEditor`, and `commcoach` resolve `getService("agent")` / `getService("knowledge")` or call `runAgent`; only `commcoach/serviceCommcoachIndexer.py` and `commcoach/serviceCommcoach.py` touch `indexFile` / `buildAgentContext` inside the feature tree. All other domains (`chatbot`, `trustee`, `realEstate`, `teamsbot`, `neutralization`) match the "No" rows below.
+
 | Feature | **`AgentService.runAgent`** | **Retrieval injection** (platform RAG prompt) | **Corpus injection** (typical today) | **Likely gap** (this document) |
 |---------|----------------------------|-----------------------------------------------|-------------------------------------|--------------------------------|
 | **workspace** | Yes | Yes | Upload **`_autoIndexFile`**; optional **`indexFile`** via agent **tools** | **Automatic** corpus for artifacts that never become **`FileItem`** or tool outputs (exports, structured summaries). |
@ -336,7 +352,7 @@ Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog)

 | Phase | Outcome |
 |-------|---------|
-| **P0 — Catalog & façade** | Document all current `indexFile` call sites; wrap them in `requestIngestion`; add metrics and structured logging; complete **Teil 3.3** matrix (retrieval vs corpus vs gap) per feature. |
+| **P0 — Façade + idempotency** *(done, 2026-04-21)* | Single `requestIngestion` / `getIngestionStatus` entry point on `KnowledgeService` with content-hash idempotency, provenance in `structure._ingestion`, and structured logging (`ingestion.queued` / `ingestion.indexed` / `ingestion.skipped.duplicate` / `ingestion.failed`). All prior `indexFile` call sites now route through the façade: `routeDataFiles._autoIndexFile`, `commcoach/serviceCommcoachIndexer.indexSessionData`, `serviceAgent/coreTools/_workspaceTools.readFile`, `serviceAgent/coreTools/_documentTools.describeImage`. Agent tools no longer carry on-demand extraction + ingestion fallbacks — they are pure consumers of the knowledge store. **Teil 3.3** matrix audited. Three implementation bugs fixed during verification: stable content hash, pre-upsert `_ingestion` preservation, `mergeStrategy=None` for per-page granularity (see **§1.4 Implementation pitfalls**). |
 | **P1 — User-connection hooks** | On connection success/failure/revoke, enqueue bootstrap/delta/purge jobs per **Teil 2.2**; SharePoint and one mail provider as pilots. |
 | **P2 — Profile & mandate snapshots** | Allowlisted fields only (**Teil 2.3**); regenerate on events; explicit admin toggle per mandate if needed. |
 | **P3 — Event bus** | Move direct calls to async consumer where load requires it (**Teil 2.4** scalable target). |
@ -377,6 +393,8 @@ Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog)
 | **Cost caps** | Per-mandate embedding budget; defer large backfills |
 | **Neutralization** | Mandatory for certain `sourceKind`s even when not file-upload |
 | **Provenance shape** | First-class DB columns vs **documented `chunkMetadata` keys** for `connectionId`, external id, revision (must support **Teil 2** purge rules). |
+| **In-flight duplicate handling** | Accept `status ∈ {"extracted","embedding","indexed"}` with matching hash as in-progress (cheap, lossy under failure) **vs** per-`sourceId` `asyncio.Lock` in `KnowledgeService` (strict, requires singleton) — see **§1.4 Deferred to P1**. |
+| **Pre-extraction dedup shortcut** | Short-circuit `_autoIndexFile` via the file-bytes SHA in `interfaceDbManagement` before running `runExtraction` (~15 s saved per re-index of a large PDF) — see **§1.4 Deferred to P1**. |

 ---

@ -384,7 +402,8 @@ Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog)

 - **How-to / orientation:** [Unified knowledge & RAG ingestion (guide)](../../d-guides/unified-knowledge-rag.md)  
 - **Gateway reference (retrieval + knowledge):** `wiki/b-reference/gateway/architecture.md`, `wiki/b-reference/gateway/ai-agent.md`  
- **Implementation touchpoints (indicative):** `gateway/modules/serviceCenter/services/serviceKnowledge/mainServiceKnowledge.py`, `gateway/modules/routes/routeDataFiles.py`, `gateway/modules/features/commcoach/serviceCommcoachIndexer.py`, agent `coreTools` `_documentTools` / `_workspaceTools`.
+- **Implementation touchpoints (indicative):** `gateway/modules/serviceCenter/services/serviceKnowledge/mainServiceKnowledge.py`, `gateway/modules/routes/routeDataFiles.py`, `gateway/modules/features/commcoach/serviceCommcoachIndexer.py`, agent `coreTools` `_documentTools` / `_workspaceTools`, `gateway/modules/datamodels/datamodelExtraction.py` (`ExtractionOptions.mergeStrategy: Optional[MergeStrategy]`).
+- **Unit tests (P0 guardrails):** `gateway/tests/unit/services/test_ingestion_hash_stability.py`, `gateway/tests/unit/services/test_extraction_merge_strategy.py`.

 ## Akzeptanzkriterien (Plan-Ebene)

@ -393,7 +412,7 @@ Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog)
 | 1 | Every new **file** that should be searchable triggers ingestion **without** requiring an agent session. | must |
 | 2 | **User connection** connect / disconnect has defined ingestion or purge behavior documented and implementable. | must |
 | 3 | **Profile/mandate** snapshots use an explicit allowlist; secrets never enter the embedding pipeline. | must |
-| 4 | Ingestion is **idempotent** for unchanged content (no duplicate embedding work). | should |
+| 4 | Ingestion is **idempotent** for unchanged content (no duplicate embedding work). Verified 2026-04-21 on a 500-page PDF: second re-index trigger logs `ingestion.skipped.duplicate` with a stable hash, zero embedding API calls. See **§1.4 pitfalls** for the three bug classes that had to be fixed first. | must |
 | 5 | **Teil 3.3** matrix completed: every `modules/features/*` product row has **retrieval** (agent vs none), **corpus** (upload / tools / feature indexer), and **gap** explicitly stated—not “non-injecting” if **`AgentService`** already provides retrieval injection. | should |

 ---
@ -407,3 +426,6 @@ Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog)
 | T3 | Bleibt **plattform-RAG-Retrieval** (`buildAgentContext` über `AgentService`) unveraendert in seiner Rolle fuer Workspace/Grafikeditor/CommCoach? | Review `agentLoop` + `mainServiceAgent._createBuildRagContextFn` |
 | T4 | Ist Revoke/Purge fuer Connector-Chunks ohne **connectionId-Spalte** heute als **Metadata-Konvention** spezifizierbar? | Review **Teil 2.1** + **Offene Entscheidungen** Provenance |
 | T5 | Ist Revoke/Purge pro **User-Connection-Authority / Integrationsoberfläche** (Teil 2.2, nur `msft` / `google` / `clickup` laut `routeDataConnections`) in einem Threat-Model abgedeckt? | Datenfluss Connection → `FileItem` / virtuelles Doc → Chunks |
+| T6 | Ist der Content-Hash stabil zwischen zwei Extraktions-Runs desselben Files (verschiedene `contentObjectId`-UUIDs, identisches Payload)? | Unit: `tests/unit/services/test_ingestion_hash_stability.py` (5 Cases: UUID-Regen, Daten-Delta, Order-Delta, Type-Delta, Empty). Live: zweiter Trigger auf bereits indexiertes File loggt `ingestion.skipped.duplicate` mit identischem Hash (verifiziert 2026-04-21). |
+| T7 | Bleiben bei Multi-Page-PDFs die Per-Page-Chunks erhalten (keine `MergeStrategy`-Konkatenation)? | Unit: `tests/unit/services/test_extraction_merge_strategy.py`. Live: 500-Seiten-PDF → 563 ContentObjects, 567 Embedding-Chunks in 24 Batches (verifiziert 2026-04-21). |
+| T8 | Überleben `_ingestion.hash` und `status="indexed"` einen Pre-Scan-Re-Upsert in `_autoIndexFile`? | Review `routeDataFiles._autoIndexFile` Zeile ~127: existing row wird vor upsert gelesen und `_ingestion` + `indexed` in frischen `contentIndex` gemerged. Live: zweiter Trigger → `ingestion.skipped.duplicate` statt Re-Embedding. |