wiki/c-work/4-done/2026-04-id-unified-knowledge-indexing-rag-concept.md
2026-05-05 14:52:20 +02:00

546 lines
66 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!-- status: build -->
<!-- started: 2026-04-16 -->
<!-- lastReviewed: 2026-04-24 -->
<!-- component: gateway | platform | frontend-nyla -->
# Unified Knowledge Indexing — One RAG Corpus for All Platform Information
## How to read this document
| Section | Content |
|---------|---------|
| **Beschreibung und Kontext** | Scope (**ingestion vs retrieval**), **terminology** (feature / service / connector / interface), **as-is vs target**, business case and risks. |
| **Teil 1** | Ingestion as its **own lifecycle**: façade API, idempotency, orchestration—**not** owned by `agentLoop`. |
| **Teil 2** | **Triggers** (beyond upload): **user connections**, account snapshots, purge; **per-connection-type** indexing guidance; event-driven option. |
| **Teil 3** | **Feature injection** split into **retrieval** (agent + `buildAgentContext`) vs **corpus** (`indexFile`); **matrix** per `modules/features/*` product; real **gaps** vs false “non-injection”. |
| **Implementation phases · Ziele · AC · Testplan** | Rollout, explicit non-goals, acceptance criteria, verification. |
**Single sentence summary:** Keep **retrieval** on **`AgentService`**; unify **when and how** the shared **`interfaceDbKnowledge`** corpus is **filled** (routes, **user connections**, **feature commit points**) behind one **ingestion contract**. **Current roadmap scope:** user-connection lifecycle (**P1a/P1b**), **daily refresh** to close the post-connect delta gap (**P1c**), **explicit user consent + per-connection ingestion preferences** (incl. optional **neutralization**) in **frontend + API**, then **scalable event bus** (**P3**). **Out of current roadmap:** standalone **profile/mandate snapshot** ingestion (former roadmap **P2** — content remains in Teil 2.3 as future option only).
## Beschreibung und Kontext
### Scope of this document
We distinguish **ingestion** (chunking, embedding, persisting into **`interfaceDbKnowledge`**) from **retrieval** (semantic search + `buildAgentContext` for the LLM). **Retrieval** for the **unified knowledge store** is consumed primarily through **`serviceAgent`** / `runAgent` (workspace, graphical editor, CommCoach). Other products (e.g. **chatbot**, **teamsbot**) may use **different** LLM stacks—**Teil 3** maps who gets platform RAG vs who does not. This plan does **not** mandate one global retrieval path for every feature; it **does** mandate a **single ingestion story** into the same corpus where that corpus is used. The gap we address is **how and when** the corpus is **filled**, not how every LLM entry point **reads** it.
### Terminology (Gateway — see `wiki/b-reference/gateway/architecture.md`)
This concept separates **feature modules**, **services**, **connectors**, and **interfaces**. Conflating them produces wrong ownership (e.g. treating “SharePoint” as a `modules/features/` product, or treating “mail” as if it were `serviceKnowledge`).
| Term | Where in Gateway | Role for indexing |
|------|------------------|-------------------|
| **Feature** | `modules/features/*` (e.g. `workspace`, `graphicalEditor`, `commcoach`, `trustee`, `chatbot`) | Product domains: UI, feature routers, orchestration. They **trigger** actions (upload, sync UX, feature-specific indexers) but must not be the **only** place that starts embedding work. |
| **Service** | `modules/serviceCenter/services/*` | Cross-cutting facades: **`serviceKnowledge`** (indexing, search, `buildAgentContext`); **`serviceExtraction`** (content objects); **`serviceChat`** (chat/workflow documents); **`serviceMessaging`** (e-mail, notifications); **`serviceAgent`** (tools that may *indirectly* call indexing). **Unified ingestion** is primarily a **service-layer** responsibility. |
| **Connector** | `modules/connectors/*` (Microsoft, Google, …) | Vendor adapters: OAuth, list, download. **SharePoint** and **mailbox** I/O live here; routes/features **call** connectors—they are not interchangeable with a feature or with `serviceKnowledge`. |
| **Interface** | `modules/interfaces/*` | Persistence contracts: **`interfaceDbKnowledge`** (`FileContentIndex`, `ContentChunk`, …), **`interfaceDbManagement`** (`FileItem`, `DataSource`), **`interfaceDbApp`** (User, Mandate, `UserConnections`, Preferences). Profile, mandate, and connection rows are **interface-backed**, not a single “profile feature”. |
### What we have today (as-is)
**1. A strong technical “write” implementation, but no product-wide ingestion contract**
- **`serviceKnowledge`** (`mainServiceKnowledge.py`) already implements the heavy lifting: **`indexFile`** resolves scope from **`FileItem`** (single source of truth), optional neutralization, sentence-aware chunking, **`serviceAi`** embeddings, **`ContentChunk`** persistence, and status on **`FileContentIndex`**. That is the right **unit of work** once **content objects** exist.
- **Retrieval** is also centralized in the same service: **`buildAgentContext`** composes multi-layer context for agents. So **read** and **write** to the vector store are **service-owned**; what is **not** unified is **who may call writes**, with which **idempotency**, and on which **lifecycle events**.
**2. Multiple invocation lanes, same underlying method**
Indexing is **operationally** reached from a **mix of layers** (not one façade):
- **HTTP routes** — e.g. file pipeline in **`routeDataFiles`**: pre-scan / extraction → **`indexFile`**. This is the “happy path” for uploads.
- **Agent tools** — **`serviceAgent`** core tools (e.g. document/workspace helpers) can call **`indexFile`** when the user interacts through the agent. That ties **embedding** to **an agent session** even when the same file could have been indexed on upload.
- **Feature-specific code** — e.g. **CommCoach** indexer paths that call **`indexFile`** for that products artifacts. Correct for the feature, but it is **another** ad hoc entry point with its own assumptions.
- **Connectors** — Microsoft/Google (and similar) packages can fetch bytes and ultimately produce files or blobs; **OAuth and delta sync** are not yet modeled everywhere as **first-class ingestion lifecycles** (connect → backfill → incremental → revoke) that all funnel through the same API and metadata.
There is **no single** `requestIngestion(...)`, **no standard job identity** for “this external item revision”, and **no one place** that records “this mandate revoked access → tombstone these chunks”.
**3. Extraction vs indexing: clear in code, not enforced at the platform edge**
- **`serviceExtraction`** (and preprocessing helpers) produce **content objects**; **`indexFile`** consumes them. The boundary is clean **inside** the pipeline, but **not every** new binary or external document **must** pass through a single orchestrated “extract then index” step—some paths may skip, duplicate, or call **`indexFile`** with partial metadata.
**4. Truth for scope and identity lives in interfaces—not in “features”**
- **`interfaceDbManagement`** (`FileItem`, …) and **`interfaceDbApp`** (mandate, `UserConnections`, user profile fields) define **who may see what**. **`indexFile`** already mirrors **`FileItem`** for scope; that pattern is **good** but **not generalized** to connector-backed items, virtual documents, or curated “account snapshot” chunks. If a connector writes under a different mental model of `mandateId` / `featureInstanceId`, **`interfaceDbKnowledge`** can drift from app/management truth without a systematic reconcile.
**5. User/mandate/profile deltas are not first-class ingestion events**
- Changes to membership, preferences, or connections update **`interfaceDbApp`** (and related tables). They affect **searchability and personalization** but are **not** consistently reflected as **versioned, allowlisted** chunks in the knowledge store—unless a feature manually adds text somewhere. That leaves agents either **under-informed** or dependent on **non-RAG** code paths for the same facts.
**Summary (as-is):** The **engine** for indexing is **`serviceKnowledge.indexFile`**; the **policy graph** for *when* to run it is **implicit** and **spread across** routes, tools, and features. **Connectors** and **account/mandate** data are **not** uniformly treated as **ingestion sources** with connect/sync/revoke semantics.
### What would make more sense (target)
**1. One ingestion façade at the service boundary (not inside `agentLoop`)**
- A small, stable API (conceptually **`requestIngestion` / `getIngestionStatus`**, implemented atop or beside **`KnowledgeService`**) that **every** lane calls: routes, feature hooks, **connector sync workers**, and (if needed) agent tools as **thin** delegates.
- **Idempotency** (content hash, external revision, `eTag`, …) enforced **here**, so routes and tools cannot accidentally **double-embed** the same logical object.
**2. Lifecycle parity for connectors and “connections”**
- **Establish** → register datasource + optional short **non-secret** summary chunk + enqueue **backfill**.
- **Delta** → incremental jobs with persisted cursors.
- **Revoke / token invalid / GDPR** → **tombstone or purge** by `connectionId` / `sourceKind`, aligned with RBAC—not ad hoc deletes scattered in UI code.
**3. Provenance / `sourceKind` (schema or `chunkMetadata`)**
- Today chunks are **file-anchored**; extended provenance (internal file vs SharePoint item vs mailbox artifact vs `profile_snapshot`, **`connectionId`** for purge, revision keys) should be **consistent**—either **first-class fields** on `ContentChunk` / index rows **or** a **defined convention** inside **`chunkMetadata` / `contextRef`** until a migration is justified. Goal: retrieval, audit, and **connector revoke** cleanup are **data-driven**, not inferred only from call site.
**4. Curated snapshots for interface-backed facts**
- **Allowlisted** projections of mandate membership, locale, entitlements (labels), etc., regenerated on **interface-level** events—**not** dumping full user rows or secrets into embeddings.
**5. Keep retrieval exactly where it is**
- **`buildAgentContext`** remains the agents way to **consume** the corpus; ingestion only ensures that corpus is **complete, scoped, and attributable** when the agent runs.
**6. Observability and cost in one place**
- Queue depth, embedding spend, failures, and “skipped duplicate” counts attach to the **ingestion façade**, not to each feature.
### Business goal
Whenever **meaningful information** appears—files, bytes from **connectors**, configuration that should shape answers, and **bounded** user/mandate context—the platform should **ingest it once** into a **unified, scoped** knowledge layer so agents see **one coherent corpus** with clear **provenance** and **permissions**.
### Why this matters now
Information deltas arrive through **routes**, **features**, **`serviceAgent`** tools, **connectors**, and **`interfaceDbApp`** / **`interfaceDbManagement`** updates. Without **one** ingestion contract and triggers per **source**, you get: **missing** indexes, **duplicate** work, **scope drift** between knowledge rows and app truth, and **repeated** engineering per entry path instead of **once** at the **service** layer.
### Risk if we do not unify
Fragmented memory, inconsistent agent answers, compliance gaps (over-indexing sensitive fields or under-indexing allowed context), and duplicated work **per route/feature/tool** instead of at a **single service boundary**.
---
## Teil 1 — Indexing as its own lifecycle (not owned by the agent)
### 1.1 Current useful core
*(Same technical point as **“What we have today” §1** above; repeated here for readers who start at Teil 1.)* After structured **content objects** exist, **`KnowledgeService.indexFile`** performs chunking, embedding (via **`AiService`**), neutralization when required, and persistence via **`interfaceDbKnowledge`**. The **gap** is not the lack of a service method but the lack of a **single product-wide contract** for *when* and *what* enters that pipeline.
### 1.2 Target responsibility split
| Concern | Owner | Notes |
|--------|--------|--------|
| **Ingestion** (normalize → chunk → embed → store) | **Knowledge ingestion service** (logical module; may remain `KnowledgeService` + new façade) | No dependency on `agentLoop`. |
| **Retrieval** (query → ranked context string) | **Agent** (and similar LLM entry points) | Unchanged by this concept. |
| **Orchestration** (queues, retries, backoff) | **Job runner / worker** (new or existing infra) | Keeps API latency low. |
### 1.3 Public ingestion contract (conceptual)
Introduce a small, stable API surface that **all features** call—never “only if an agent runs”:
- **`requestIngestion(job: IngestionJob) -> IngestionHandle`**
- Idempotent key: `(sourceKind, sourceId, contentVersion | hash)`
- Returns immediately with `queued` / `duplicate` / `skipped` and optional `jobId` for status polling.
- **`getIngestionStatus(handle)`**
- Surfaces the same states already used on `FileContentIndex` (`pending`, `extracted`, `embedding`, `indexed`, `failed`) plus connection- or source-specific substates if needed.
The implementation can stay in-process at first (asyncio task queue) and move to Redis/Celery/ARQ later without changing callers.
### 1.4 Idempotency and versioning
- **Re-index** when content changes: compare **content hash** or **external revision** (SharePoint `eTag`, email `Message-ID` + folder cursor, file `updatedAt`).
- **Skip** when hash unchanged to control embedding cost.
- **Tombstone** or **scope-disable** when a source is deleted or access revoked (see Teil 2).
#### Implementation pitfalls (observed during P0 build, 2026-04-21)
The first end-to-end AC4 test on a 500-page PDF revealed **three** independent bugs that all had to be fixed before `ingestion.skipped.duplicate` could ever fire. Each is a **design rule** that every future ingestion lane must honor:
1. **Hash must derive only from content.** `_computeIngestionHash` initially hashed over `(contentObjectId, contentType, data)`, but `contentObjectId` came from `uuid.uuid4()` inside the extractors and was therefore a fresh value on every run. The hash was effectively a per-run nonce — the duplicate check could never match. **Rule:** hashes MUST be a pure function of payload (`contentType`, `data`, and extractor order); never of caller-supplied per-run identifiers. (Tests: `tests/unit/services/test_ingestion_hash_stability.py`.)
2. **Pre-upserts must preserve `_ingestion` metadata and the `indexed` status.** `routeDataFiles._autoIndexFile` persisted a fresh `FileContentIndex` from the pre-scan **before** calling `requestIngestion`, overwriting `structure._ingestion.hash` and `status="indexed"` from any prior successful run. The duplicate check saw a row with empty metadata and re-ran the whole embedding stage. **Rule:** any upsert on the idempotency row taken outside `requestIngestion` MUST read the existing row first and merge forward both `_ingestion` and (where applicable) the terminal `indexed` status.
3. **Extraction-pipeline defaults must preserve granularity for RAG.** `ExtractionOptions.mergeStrategy` defaulted to concatenating every text `ContentPart` into one blob, collapsing a 500-page PDF into a single chunk whose embedding is a blurred average of the whole document — unusable for targeted retrieval. **Rule:** every ingestion lane passes `mergeStrategy=None` explicitly until the default itself can be safely flipped after auditing non-RAG callers. (Tests: `tests/unit/services/test_extraction_merge_strategy.py`.)
**Deferred (ingestion idempotency hardening)** (uncovered during P0, not blocking AC1AC5; naming here is **not** the same milestone as **P1 user-connection hooks** below):
- **In-flight duplicate detection.** The current duplicate check only matches when `status == "indexed"`, so two nearly-simultaneous calls for the same `sourceId` both run full embedding. Fix candidates: accept `status ∈ {"extracted", "embedding", "indexed"}` with matching hash as "already in progress", or a per-`sourceId` `asyncio.Lock` in `KnowledgeService`.
- **Pre-extraction byte-hash shortcut.** `requestIngestion`'s duplicate check runs **after** extraction, so re-indexing a 1.6 MB PDF still spends ~15 s in `runExtraction` before the content hash is computed. The file-bytes SHA already exists in `interfaceDbManagement` for upload-dedup — a short-circuit in `_autoIndexFile` (and symmetric paths) could skip extraction entirely for an unchanged file.
---
## Teil 2 — Triggers: not only “file write”, but every information delta
“Write path” is too narrow if we read it as “HTTP upload only”. The unified model should treat **any authoritative addition or change of platform-visible information** as a potential ingestion trigger.
### 2.1 Trigger taxonomy
| Trigger category | Examples | Ingestion behavior (conceptual) |
|------------------|----------|----------------------------------|
| **Artifact persisted** | User uploads PDF; paste text saved as file; export from a feature | Existing pipeline: extract → `indexFile` (or equivalent). |
| **User connection added / re-authorized** | SharePoint OAuth success; Microsoft/Google mail connection; new API credential with data scope | **Register datasource** + enqueue **initial sync** (backfill) + index a **short connection summary document** (site name, root path, principal, *no secrets*). |
| **Sync for an existing connection** | Scheduled delta; webhook (if available); manual “refresh” | Incremental fetch → map to content objects or rows → **upsert** chunks keyed by external id. |
| **Connection revoked / token invalid** | User disconnects; admin removes mandate integration | **Tombstone** or **purge** chunks keyed by **connection / external source** (today: enforce via **`chunkMetadata` / `contextRef`** convention or future columns); ensure retrieval never serves stale data from that connection. |
| **Mandate / membership** | User added to mandate; role change; feature instance attached | Regenerate **mandate-safe summary** documents (see Section 2.3) if policy allows; **re-resolve scope** for existing chunks (may be heavy—often better to store immutable `mandateId` on chunks at write time and rely on retrieval filters). |
| **User profile (bounded)** | Display name, locale, timezone, **non-sensitive** preferences | Optional **UserContextDocument** for personalization—not a dump of the whole `User` row. |
| **Feature configuration** | Instance labels, data source labels, automation descriptions | If they should influence answers, emit structured **FeatureConfigSnapshot** chunks (small, text-first). |
| **Artifact deleted / data subject erasure** | User deletes a stored file; mandate/user erase | Purge or tombstone the corresponding **`FileContentIndex` / `ContentChunk`** rows (by `fileId`); erasure jobs cascade by **`userId`** / mandate policy. **Connection-wide** revoke remains the **connection** row above. |
### 2.2 User connections (added by the user) as first-class ingestion sources — lifecycle and **what to index per connection type**
**Conceptual focus:** The trigger is OAuth success, saved credential, or linked account in **`UserConnection`** that grants access to an external system. **Implementation** still flows through provider code under `gateway/modules/connectors/` (e.g. **`providerMsft`**, **`providerGoogle`**, **`providerClickup`**); that mapping is **technical**, not the product wording.
**Scope — what counts as a user connection here:** `gateway/modules/routes/routeDataConnections.py` only allows **creating** connections with `type` **`msft`**, **`google`**, or **`clickup`** (`create_connection` → OAuth via `connect_service`). The **authorities options** endpoint also lists **`local`**, but that path is **not** wired in `create_connection`. **This subsection only covers those user-connection authorities** (plus the surfaces each OAuth integration can reach, e.g. Graph mail for Microsoft). Other Gateway connector packages (FTP, Jira, preprocessor, outbound-only mail, geo APIs, …) are **out of scope** in §2.2 until they are exposed the same way as **`UserConnection`** rows.
**Cross-cutting rules (every user-added connection):**
- **Never index:** OAuth tokens, refresh payloads, raw credentials, webhook signing secrets.
- **Always safe to index (metadata only):** human-readable **connection** label, tenant/site name, root path / mailbox address **as display string**, last sync cursor (store in DB, not necessarily as embedding), **external id** + **revision** for idempotency.
- **Prefer file pipeline for binaries:** download → store as `FileItem` (or equivalent) → reuse existing **extract → `indexFile`** path so neutralization and scope mirror upload behavior.
- **Prefer virtual documents** for small text-native items (mail headers/snippets, issue titles/descriptions) to avoid N binary copies.
- **Quota:** per-mandate max documents, max bytes, and “index only last N days” for mail are **product** knobs, not defaults baked into each adapter.
**Lifecycle pattern (target) — tied to the connection row, not to “a connector class”:**
1. **Connection event** (`ConnectionEstablished`) fires when the user **adds** or **re-authorizes** a connection (OAuth / credential storage, **`UserConnection`**, authority **`msft`**, **`google`**, or **`clickup`** per current API).
2. **Ingestion registry** records: `{ connectionId, featureCode, mandateId, userId, scope, externalRoot, adapterKind }` (adapter kind = which integration backs this connection).
3. **Sync planner** enqueues jobs **for that connection**:
- **Bootstrap:** list roots, respect quotas, prioritize recently modified.
- **Delta:** cursor per drive/site/folder/mailbox/label; persist cursor in DB.
4. **Normalizer** maps each external item to either:
- **File-like** → persist bytes + run extraction + **`indexFile`**, or
- **Virtual document** → build `contentObjects` in memory + **`indexFile`** with a synthetic `fileId` / stable external key.
---
#### When the user connects **Microsoft** (Graph — SharePoint, OneDrive, Outlook, Teams) — `providerMsft`
| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
|--------|----------------------------------|------------------------------|-----------|
| **SharePoint** (`SharepointAdapter`) | Document libraries: **PDF, Office, text, markdown**; list **metadata** (library name, path, item name) as `contextRef`. | Huge video blobs, raw executables, duplicates already indexed via another path. | Use **driveItem id + eTag** for revision. Respect **library/folder allowlist** on this **connection**. |
| **OneDrive** (`OneDriveAdapter`) | Same as SharePoint for **personal files** reachable through the users connection. | System/temp folders if exposed. | Scope = **personal** unless shared into mandate explicitly. |
| **Outlook** (`OutlookAdapter`) | **Mailbox:** subjects, **from/to/cc**, **received date**, **body** (plain or stripped HTML) per policy; **calendar** titles/locations/descriptions if product enables. | Full MIME raw, embedded images as separate media unless needed; **entire mailbox** without date window in v1. | Strong **retention + PII** policy: optional “headers + snippet only”; strip signatures/quoted threads; **attachments** → child **file-like** jobs (virus/size limits). |
| **Teams** (`TeamsAdapter`) | **Channel messages** (text), **meeting chat** exports if API allows; **files shared in channel** as file-like. | Message reactions, per-user read receipts; continuous full channel history without bounds. | Often **high volume** — default to **recent window** or **keyword/subscription** driven sync. |
---
#### When the user connects **Google** (Drive, Gmail) — `providerGoogle`
| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
|--------|----------------------------------|------------------------------|-----------|
| **Drive** (`DriveAdapter`) | Native Google files after **export** to Office/PDF (existing export MIME map); standard uploaded files **download → extract**. | Trashed items; shared drives the user did not authorize. | Use **file id + modifiedTime**; Google Docs need **export** before text extraction. |
| **Gmail** (`GmailAdapter`) | **Threads:** subject, participants, internalDate, **snippet** or **body** per policy; **attachments** as separate ingest jobs. | Entire “All Mail” unbounded; **labels** that are purely system. | Same mail cautions as Outlook; **Message-ID** + **History-ID**/cursor for delta. |
---
#### When the user connects **ClickUp** — `providerClickup` (`AuthAuthority.CLICKUP`)
| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
|--------|----------------------------------|------------------------------|-----------|
| **ClickUp** (`providerClickup`) | Task **name**, **description**, **comments**; **attachment** content if downloaded. | Activity stream noise, every status micro-change unless text changed. | Rate limits → prioritize **recently updated** tasks. |
---
**Email and messaging (Outlook + Gmail via Microsoft / Google user connections) — shared cautions**
- Default tiers: **metadata only****snippet****full body****attachments** (most expensive / sensitive). **Product default** vs **user override** is defined in **§2.6** (per-connection mail depth + attachments).
- Apply **quoted-thread stripping**, **signature removal**, and **max body length** before embed.
- **Legal hold / retention:** ingestion must respect mandate **delete** and **export** rules; **disconnecting** or **revoking** the mail **connection** must **purge** mail-sourced chunks.
### 2.3 “Account and stuff” — what to index vs. what never to index
**Roadmap note:** Standalone **profile/mandate snapshot** ingestion (formerly roadmap **P2**) is **out of current scope**; the table below remains the **target model** when that work is picked up again.
**Goal:** Give agents **useful, permission-safe** context (“who is this user in this mandate”, “which features are on”, “preferred language”) without creating a **second copy of sensitive credentials** in the vector store.
| Data | Typical treatment |
|------|-------------------|
| Passwords, refresh tokens, API secrets | **Never** index; never pass through embedding pipeline. |
| Email, phone, government IDs | **Default deny**; only if product explicitly enables “index PII” with neutralization and mandate policy. |
| Display name, locale, feature entitlements (labels) | **Allow** as a small structured **UserMandateSnapshot** document regenerated on change. |
| Full `User` or `Mandate` DB row | **Avoid**; generate **curated** JSON/text snapshots with field allowlists. |
Snapshots should be stored with the same **scope model** as file chunks (`personal`, `featureInstance`, `mandate`, `global`) so `semanticSearch` filters stay consistent.
### 2.4 Event-driven vs. direct calls
**Minimum viable:** each feature calls `requestIngestion` at the end of its own transaction (direct call).
**Scalable target:** emit **domain events** (`FileCommitted`, `UserConnectionReady` / provider-specific ready event, `ProfileUpdated`) and a single **KnowledgeIngestionConsumer** subscribes. Benefits: one place for metrics, retries, and rate limits; features stay thin.
**Storage (already implemented — not redesigned here):** The platform already uses **one** knowledge persistence stack: **`FileContentIndex`** (incl. `mandateId`, `scope`, status) and **`ContentChunk`** (pgvector embeddings, `fileId`, `userId`, `featureInstanceId`, `contextRef`, optional **`chunkMetadata`**), accessed via **`interfaceDbKnowledge`**. Chunks are **file-anchored** today; **connection- / source-specific** provenance (e.g. `connectionId`, external ids) can ride in **`contextRef` / `chunkMetadata`** until optional schema extensions are justified. **This document targets ingestion triggers and lifecycles**, not a second corpus or a duplicate storage model.
### 2.5 Lifecycle gap and daily refresh (roadmap **P1c**, v1)
**Gap:** After a successful connect, **bootstrap** runs once (initial fill). **New** mail, files, or tasks that arrive **after** that run are **not** indexed automatically until a **delta** path exists (webhook, `historyId` / `changes` cursors, etc. — see Teil **2.1** row *“Sync for an existing connection”*).
**Pragmatic mitigation (deliberately simple):** A **daily scheduler** (e.g. once per night, staggered by tenant/load) re-invokes the same **bootstrap walkers** for every **active** `UserConnection` that has **knowledge ingestion enabled** (see **§2.6**). Idempotency + fast-path skips unchanged items; **new** and **changed** items are picked up.
- **Pros:** No new external dependencies (Pub/Sub, watch renewal) in v1; fits existing BackgroundJob + cron/feature-flag patterns.
- **Con:** Data can lag up to **~24 h** before it appears in RAG — acceptable for v1 product choice.
- **Later (without replacing P1c):** Add per-authority **delta APIs** (Gmail `users.history.list`, Drive `changes.list`, ClickUp tighter polling) to reduce latency and API cost.
### 2.6 User consent, frontend flow, and per-connection preferences (incl. neutralization)
**Goal:** The user **explicitly** chooses whether this connection may feed the **shared knowledge store** used for AI/RAG — and **how much**. Without consent, **no** knowledge bootstrap is started for that connection (OAuth may still unlock other product features; that split must be obvious in the UI).
**Frontend (`frontend_nyla`):** extend the **add connection** flow (and later **connection settings**) with the dialog and controls below; persist choices via Gateway API **before** or **when** triggering knowledge ingestion.
#### UX when adding a connection
1. User starts OAuth as today.
2. **Before** or **immediately after** successful authorization: a **dialog** that clearly separates “establish connection” from “add to knowledge base”.
3. **No:** Connection remains usable for other features; either skip `KnowledgeIngestionConsumer.onConnectionEstablished` for the knowledge lane or persist `knowledgeIngestionEnabled=false` and never schedule walkers.
4. **Yes:** Show **advanced settings** (second step or accordion) per **settings catalog** below; persist **per `connectionId`** (or a dedicated preferences row); only then enqueue **bootstrap** (and later **P1c** refresh) with allowed surfaces and tiers.
**Suggested copy (DE — pick one tone / A-B test):**
- **Formal:** „Möchten Sie Inhalte aus dieser Verbindung in Ihre **Wissensdatenbank** übernehmen? KI-Funktionen können dann passender auf **Ihre** Dokumente und Nachrichten Bezug nehmen — **nur** mit Ihrer ausdrücklichen Zustimmung und in dem Umfang, den Sie festlegen.“
- **Approachable:** „Sollen wir aus dieser Verbindung ausgewählte Inhalte sicher in Ihre **persönliche Wissensdatenbank** legen, damit die KI für Sie **besser helfen** kann? Sie entscheiden **was** und **wie stark anonymisiert** — und können das jederzeit in den Einstellungen ändern oder die Daten entfernen.“
Mirror in EN if the UI is bilingual.
#### Minimum settings catalog (all **per connection** where technically applicable)
| Layer | Setting | Meaning |
|--------|-----------|---------|
| **Master** | **Knowledge ingestion for this connection** | `off` / `on`: gates bootstrap + **§2.5** (P1c) refresh for the knowledge store. |
| **Protection** | **Neutralize / anonymize before embedding** | When `on`: apply the same (or stricter) **neutralization** pipeline as for uploads (`FileItem.neutralize` / platform rules) to connector-sourced text **before** chunking — names, e-mail addresses, phone-like patterns, IBAN-like patterns, per policy. User-facing label **„anonymisiert“** maps to this pipeline (not a cryptographic guarantee). |
| **Mail** (Outlook / Gmail) | **Content depth** | At least: **metadata only** (subject, participants, dates — no body) / **snippet** / **full cleaned body** (after `cleanEmailBody` and caps). |
| **Mail** | **Index attachments** | `off` / `on` (with size/type caps). |
| **Files** (Drive / SharePoint / OneDrive) | **Index binary files** | `off` / `on`; optional **MIME allowlist** (Office/PDF/text only) as a simplified UX preset. |
| **ClickUp** | **Scope** | `titles only` / `title + description` / `+ comments` / optional `attachments`. |
| **Microsoft** | **Parity** | Same dimensions where Graph surfaces mirror Google (mailbox / drive-like). |
| **General** | **Time window** | “Only index items from the last **N** days” (aligns with existing walker caps; slider with a sensible max). |
| **General** | **Help: what RAG is not** | Short explainer: not real-time mail; delay until next scheduled run (**§2.5**). |
**Optional power-user toggles (same screen, collapsed):** per authority **which surfaces** ingest (e.g. **Google:** Gmail on/off, Drive on/off; **Microsoft:** SharePoint on/off, Outlook on/off — when product exposes both). Reduces accidental over-breadth without extra wizard steps.
**Backend consequence:** Walkers read persisted preferences for `connectionId` each run and **filter** surfaces and payload tiers **before** `indexFile`. On preference change, product decision: trigger **re-sync**, or apply only to **new** items — document the chosen rule.
#### Neutralization when the user opts in
- **Ingestion on** + **neutralization on:** After content is obtained (virtual text or extraction output), apply the **neutralization stage** **before** chunking/embedding; **that** text is what gets embedded.
- **Neutralization off:** Still apply baseline **hygiene** where already defined (e.g. `cleanEmailBody` for quotes/signatures) — hygiene **≠** full PII removal.
- **Compliance copy:** If the user chooses **full body**, state clearly that **perfect** anonymization is not guaranteed without neutralization.
---
## Teil 3 — Feature injection: retrieval vs corpus, agent loop, and real gaps
“Injection” is ambiguous. This section uses **two** precise meanings:
| Kind | What happens | Primary implementation today |
|------|----------------|------------------------------|
| **Retrieval injection** | Relevant **existing** chunks and workflow context are **assembled** and **inserted into the LLM prompt** (system message) each agent round. | **`AgentService.runAgent`** → `buildRagContextFn`**`KnowledgeService.buildAgentContext`** → **`ConversationManager.injectRagContext`**. CommCoach wraps the same **`buildAgentContext`** and adds coaching-specific context. |
| **Corpus injection (indexing)** | **New** text/binary is **chunked and embedded** and written to **`interfaceDbKnowledge`** so it can be retrieved later. | **`KnowledgeService.indexFile`**; callers include **`routeDataFiles._autoIndexFile`**, **`serviceAgent`** tools (**`_documentTools`**, **`_workspaceTools`**), and **CommCoach** **`serviceCommcoachIndexer`**. |
A feature can **already participate fully in retrieval injection** by using **`AgentService`** without ever calling **`indexFile`** in its own folder. **Corpus** growth can still happen **indirectly** (upload pipeline, agent tools). Planning must **not** label such features as “non-injecting.”
### 3.1 Features that already use **`AgentService.runAgent`** (retrieval injection is on by default)
These **`modules/features/*`** entry points resolve **`getService("agent", ctx)`** and stream **`agentService.runAgent(...)`** (code audit):
- **`workspace`** (`routeFeatureWorkspace.py`)
- **`graphicalEditor`** (`routeFeatureGraphicalEditor.py`)
- **`commcoach`** (`serviceCommcoach.py` — custom **`buildRagContextFn`**, still uses platform **`buildAgentContext`** inside)
For all three, **every agent round** gets **retrieval injection** unless RAG fails or returns empty. **Corpus** updates for the same sessions still depend on **separate** mechanisms:
| Corpus path | When it runs |
|-------------|----------------|
| **Upload / `FileItem`** | **`routeDataFiles`** **`_autoIndexFile`** after storage (feature-agnostic). |
| **Agent tools** | If the model invokes tools in **`_documentTools`** / **`_workspaceTools`** that call **`indexFile`**, **corpus** changes **during** that agent run—implemented in **`serviceAgent`**, not in the features route file. |
So **workspace** and **graphicalEditor** **do** “inject” in the **retrieval** sense today; they **can** “inject” in the **corpus** sense when users **upload** files or when the **agent** runs indexing-capable tools. What they **often lack** is **feature-owned, automatic corpus** logic (e.g. “on every graph publish, index a snapshot”) without an upload or tool call.
### 3.2 Features that do **not** use **`AgentService`** (no platform RAG prompt injection from this stack)
These domains **do not** call **`runAgent`** in their **`modules/features/*`** trees (audit). They therefore **do not** receive **`buildAgentContext`** through the **workspace agent** loop:
| Feature | Notes |
|---------|--------|
| **chatbot** | Uses an **internal** LangGraph-style flow (SQL / Tavily / answer nodes). **No** `getService("knowledge")` / **`buildAgentContext`** usage under **`modules/features/chatbot/`** in the audited tree—**retrieval injection** and the **unified corpus** are **not** wired the same way as the workspace agent. |
| **trustee** | Domain CRUD and quick actions (e.g. **`agentPrompt`** is a **UI hint** to open the workspace with a prefilled prompt—not **`AgentService` inside trustee**). Corpus: **only** via shared **upload** or if the user later uses **workspace agent** with tools. |
| **realEstate** | No **`AgentService`** hook in feature tree; same **upload** story for files. |
| **teamsbot** | Uses **`serviceAi`** (and related) for the meeting pipeline; **`sessionContext`** is **ephemeral** prompt text. **No** **`AgentService`** / **`buildAgentContext`** in the same pattern as workspace. |
| **neutralization** | **Service/pipeline** used **inside** **`indexFile`** when **`FileItem.neutralize`** applies—not a feature that “injects” either kind by itself. |
### 3.3 Summary matrix (per `modules/features/` domain)
*Matrix verified by audit on 2026-04-21 (P0):* Under `gateway/modules/features/`, only `workspace`, `graphicalEditor`, and `commcoach` resolve `getService("agent")` / `getService("knowledge")` or call `runAgent`; only `commcoach/serviceCommcoachIndexer.py` and `commcoach/serviceCommcoach.py` touch `indexFile` / `buildAgentContext` inside the feature tree. All other domains (`chatbot`, `trustee`, `realEstate`, `teamsbot`, `neutralization`) match the "No" rows below.
| Feature | **`AgentService.runAgent`** | **Retrieval injection** (platform RAG prompt) | **Corpus injection** (typical today) | **Likely gap** (this document) |
|---------|----------------------------|-----------------------------------------------|-------------------------------------|--------------------------------|
| **workspace** | Yes | Yes | Upload **`_autoIndexFile`**; optional **`indexFile`** via agent **tools** | **Automatic** corpus for artifacts that never become **`FileItem`** or tool outputs (exports, structured summaries). |
| **graphicalEditor** | Yes | Yes | Same as workspace | **Published graph / metadata** as searchable corpus without manual upload. |
| **commcoach** | Yes | Yes (+ custom RAG layer) | Session **`indexFile`** (**`serviceCommcoachIndexer`**) + upload/tools | Extend only if new artifact types need the same **feature-local** indexer pattern. |
| **chatbot** | No | **No** (unified store) | No feature-local **`indexFile`** | Decide if chatbot should call **`buildAgentContext`** / **`indexFile`** or stay on SQL/Tavily; **FAQ / grounding** text may need **corpus** hooks. |
| **trustee** | No | Only if user works in **workspace** | Upload path; agent tools only in workspace | **Trustee-native** persist events → ingestion when files are not the only representation. |
| **realEstate** | No | Only via workspace | Upload path | Same as trustee for case/property narratives. |
| **teamsbot** | No | No | None from unified store by default | Persisted **transcripts / notes****`indexFile`** if they should be mandate-searchable. |
| **neutralization** | N/A | N/A | Preconditions for **`indexFile`** | Ensure all **new** ingest paths honor **`FileItem.neutralize`**. |
### 3.4 Shared corpus mechanisms (not feature-local, but serve agent features)
| Mechanism | Role |
|-----------|------|
| **`routeDataFiles` + `_autoIndexFile`** | Indexes **uploaded** `FileItem`s for **any** UI that uses the upload API—including workspace. |
| **`serviceAgent`** **`_documentTools`** / **`_workspaceTools`** | **Corpus** writes when the **model** chooses tools; available to **workspace** and **graphicalEditor** agent sessions (and **CommCoach** when those tools are in the toolset). |
| **CommCoach** **`serviceCommcoachIndexer`** | **Feature-local** corpus: coaching session text → **`indexFile`** without requiring an upload. |
### 3.5 Where **additional feature-native corpus injection** is still needed
Use this checklist **only** after accounting for §3.13.4:
1. **Content is authoritative** in the feature DB or blob store **without** a guaranteed **`FileItem`** + **`_autoIndexFile`** path.
2. **Retrieval injection alone** is insufficient because nothing ever **wrote** chunks (e.g. chatbot never hits **`indexFile`**).
3. **Relying on the agent to call tools** is too fragile for compliance or UX (“user must remember to index”).
Then add **`requestIngestion` / `indexFile`** at the **feature commit point** (or emit a domain event), with **`contextRef` / `chunkMetadata`** for **`feature_code`**, business ids, and **no secrets**.
### 3.6 Implementation pattern (feature-native corpus only)
1. **Commit point** — authoritative write in the feature or shared storage.
2. **Scope** — align with **`FileItem`** / **`ServiceCenterContext`** rules already used in **`indexFile`**.
3. **Unified façade** — one ingestion API; avoid a second embedding pipeline.
4. **Purge** — tie to **`fileId`**, business key, or future connector purge keys on revoke/delete.
### 3.7 Phasing (feature matrix — **not** the same numbering as roadmap **P1c/P1d/P3** above)
- **FM0:** For **each** row in §3.3, confirm **retrieval** vs **corpus** paths; document “satisfied by agent+upload+tools” vs “needs feature hook.”
- **FM1:** Implement **feature-native corpus** for one domain with a clear §3.5 gap (e.g. **trustee** entity text, **teamsbot** persisted transcript).
- **FM2:** **Chatbot** architecture decision: integrate **`serviceKnowledge`** or keep parallel retrieval; if integrate, add explicit **corpus** rules for config/FAQ.
---
## Implementation phases (suggested)
Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog), and **Teil 3.7** (feature matrix and feature-native corpus pilots). **P0** overlaps **Teil 3.7 P0** (complete the per-feature matrix before large builds).
**Authority rollout (2026-04-24):** The **user-connection ingestion lane** (bootstrap + purge tied to **`UserConnection`**) is delivered **per OAuth authority**: **`msft` (P1a)**, **`google`** + **`clickup` (P1b)** — same consumer, dispatcher fan-out, purge-by-`connectionId`, and unit tests for walkers + consumer. **Next product slices:** **P1c** (daily refresh, **§2.5**), **consent + per-connection preferences + frontend** (**§2.6**), then **P3** (event bus at scale).
| Phase | Outcome |
|-------|---------|
| **P0 — Façade + idempotency** *(done, 2026-04-21)* | Single `requestIngestion` / `getIngestionStatus` entry point on `KnowledgeService` with content-hash idempotency, provenance in `structure._ingestion`, and structured logging (`ingestion.queued` / `ingestion.indexed` / `ingestion.skipped.duplicate` / `ingestion.failed`). All prior `indexFile` call sites now route through the façade: `routeDataFiles._autoIndexFile`, `commcoach/serviceCommcoachIndexer.indexSessionData`, `serviceAgent/coreTools/_workspaceTools.readFile`, `serviceAgent/coreTools/_documentTools.describeImage`. Agent tools no longer carry on-demand extraction + ingestion fallbacks — they are pure consumers of the knowledge store. **Teil 3.3** matrix audited. Three implementation bugs fixed during verification: stable content hash, pre-upsert `_ingestion` preservation, `mergeStrategy=None` for per-page granularity (see **§1.4 Implementation pitfalls**). |
| **P1a — User-connection hooks (Microsoft `msft`)** *(done, 2026-04-21)* | **`connection.established`** / **`connection.revoked`** emitted from **Microsoft** data-OAuth success paths and from **disconnect/delete** when the row is **`msft`** (incl. **`ConnectionStatus.REVOKED`** fix where **`INACTIVE`** was invalid). Central **`KnowledgeIngestionConsumer`** (`subConnectorIngestConsumer.py`, **`app.py`** lifespan) maps **`established`** → **`connection.bootstrap`** BackgroundJob and **`revoked`** → synchronous **`KnowledgeService.purgeConnection`** → **`interfaceDbKnowledge.deleteFileContentIndexByConnectionId`**. **`FileContentIndex.connectionId`** + **`sourceKind`** (and **`IngestionJob`** carrying both) make connector-sourced rows purgeable. **Bootstrap modules live for Microsoft:** **`subConnectorSyncSharepoint.py`** (`sourceKind="sharepoint_item"`, **`eTag`** as `contentVersion`, **`SharepointAdapter.browse`** with **`@odata.nextLink`** pagination) and **`subConnectorSyncOutlook.py`** (virtual **`outlook_message`** docs — header / snippet / cleaned body via **`cleanEmailBody`**, **`changeKey`** revisions, optional **`outlook_attachment`** child jobs). Dispatcher **`_bootstrapJobHandler`** runs **SharePoint + Outlook in parallel** for **`msft`**. Structured logs: **§ Structured ingestion logs**. **Retrieval threshold calibration (2026-04-21):** **`buildAgentContext`** **`minScore`** layers lowered to **`0.35`** so **`text-embedding-3-small`** matches real cosine scores; validated on **Outlook/SharePointindexed** content. **Tests (P1a):** purge, consumer **msft** dispatch, **`cleanEmailBody`**, **`bootstrapSharepoint`**, **`bootstrapOutlook`**. |
| **P1b — User-connection hooks (Google + ClickUp)** *(done, 2026-04)* | Parity with **`msft`**: **`routeSecurityGoogle`** / **`routeSecurityClickup`** call **`KnowledgeIngestionConsumer.onConnectionEstablished`** after token save; **`routeDataConnections`** disconnect/delete call **`onConnectionRevoked`** for **all** authorities. **`_bootstrapJobHandler`** fans out **google → `bootstrapGdrive` + `bootstrapGmail`** in parallel and **clickup → `bootstrapClickup`**. Walkers: `subConnectorSyncGdrive.py`, `subConnectorSyncGmail.py`, `subConnectorSyncClickup.py` + `subTextClean.py`. Unit tests: `test_bootstrap_gdrive.py`, `test_bootstrap_gmail.py`, `test_bootstrap_clickup.py`, extended `test_knowledge_ingest_consumer.py`. |
| **P1c — Connection refresh (lifecycle v1)** *(next)* | **Daily** (or nightly) **scheduled** re-run of the same bootstrap walkers for connections with **knowledge ingestion enabled** (**§2.6**). Reuses idempotency + fast-path; closes the **post-connect delta gap** without webhooks in v1. Observability: same log family as bootstrap; optional `event` suffix or `reason=scheduled_refresh` for shippers. |
| **P1d — Consent + preferences + UI** *(next)* | Persist **§2.6** settings **per `connectionId`**; Gate **`onConnectionEstablished`** / P1c jobs on user choice; **`frontend_nyla`** connection wizard + settings screen; walkers honor mail/file/ClickUp depth and **neutralization** flag. |
| **~~P2 — Profile & mandate snapshots~~** | **Removed from active roadmap** (focus: connections + feature corpus + scale). Target content remains documented in **§2.3** for a future re-entry when needed. |
| **P3 — Event bus** | Move direct calls to async consumer where load requires it (**Teil 2.4** scalable target). Remains in scope. |
### P1b checklist *(completed — kept for audit trail)*
1. **`routeSecurityGoogle`:** after successful **data** OAuth, enqueue **same** ingestion consumer path as Microsoft (pass **`connectionId`**, **`AuthAuthority.google`**, mandate/user scope).
2. **`routeSecurityClickup`:** after successful OAuth / token persistence, same.
3. **`routeDataConnections`:** verify **disconnect_service** / **delete_connection** emit **revoke** (or call **`purgeConnection`**) for **google** and **clickup** rows, not only **msft**.
4. **`_bootstrapJobHandler`:** remove any **“unsupported_authority”** skip for **`google`** / **`clickup`** once walkers are registered; keep skip only for **future** authorities.
5. **Quality bar:** T10/T12T15 in the testplan — extend from **Microsoft-only** assumptions to **all three** **`routeDataConnections`** OAuth authorities.
### P1c / P1d checklist *(next engineering slices)*
1. **P1c:** BackgroundJob or cron entry; feature flag; per-tenant stagger; only connections with **knowledge ingestion = on**; metrics on `indexed` vs `skippedDup` per run.
2. **P1d ✅ — implemented:**
- [x] **`UserConnection`** extended with `knowledgeIngestionEnabled: bool` (default `False` = strict opt-in) and `knowledgePreferences: Optional[Dict]` (`schemaVersion=1`); DB auto-migration adds columns on startup.
- [x] **`routeDataConnections` `create_connection`** accepts `knowledgeIngestionEnabled` + `knowledgePreferences` in request body and persists them before returning.
- [x] **OAuth callbacks** (`routeSecurityGoogle`, `routeSecurityMsft`, `routeSecurityClickup`) gate `callbackRegistry.trigger("connection.established", …)` on `connection.knowledgeIngestionEnabled`; emit structured log `ingestion.connection.bootstrap.skipped reason=consent_disabled` when disabled.
- [x] **`_bootstrapJobHandler`** defensive re-check: loads connection via `getUserConnectionById` and no-ops if flag was disabled after OAuth (race protection).
- [x] **`IngestionJob.neutralize: bool`** added; `requestIngestion` + `_indexFileInternal` thread it through; for `sourceKind != "file"` the flag drives `_shouldNeutralize` directly; for `sourceKind == "file"` the `FileItem.neutralize` column remains authoritative.
- [x] **`subConnectorPrefs.py`** — `loadConnectionPrefs(connectionId)` helper + `ConnectionIngestionPrefs` dataclass with safe defaults for all §2.6 keys.
- [x] **All five walkers** (Gmail, GDrive, ClickUp, Outlook, SharePoint) load prefs at bootstrap start; limits structs gain `mailContentDepth` + `neutralize` (mail walkers), `filesIndexBinaries` (Drive), `clickupScope` (ClickUp), and `neutralize` (all).
- [x] **Unit tests** (`test_p1d_consent_prefs.py` — 10 tests): consent gate no-op, prefs defaults + full mapping, Gmail depth modes (metadata/snippet/full), ClickUp scope (titles vs description).
- [x] **Frontend** (`frontend_nyla`): `AddConnectionWizard` 4-step modal (connector → consent → preferences → summary + OAuth); old three-button row replaced with single „Verbindung hinzufügen“ button; `createConnectionAndAuth` hook method; `KnowledgePreferences` type in `connectionApi.ts`.
**Default policy (document for deploy):** `knowledgeIngestionEnabled` defaults to `False` for all new connections. Existing connections (before P1d deploy) have the column `NULL`/`False` — **no bootstrap is triggered retroactively**. Users must explicitly opt in via the wizard or connection settings. If the team decides to migrate existing connections to `True`, a one-time migration script must be run and communicated via release note.
---
## Ziel und Nicht-Ziele
**Ziel:**
- One **ingestion contract** for all features and connector lifecycles.
- Indexing **decoupled** from the agent loop (agents may still *invoke* tools that ultimately call ingestion, but ingestion must not *depend* on an agent run).
- **Explicit** handling of connection establishment, sync, and revocation.
- **Bounded** indexing of user/mandate context with a clear PII policy.
- **Explicit user consent** and **per-connection** ingestion preferences (incl. optional **neutralization**) before connector content enters the knowledge store (**§2.6**).
**Explizit NICHT:**
- Moving **retrieval** (`buildAgentContext`) out of agents.
- Guaranteeing **real-time** indexing for every byte without async jobs (latency targets are product decisions).
- Indexing **everything** in the database “because we can”—only curated, policy-approved surfaces.
---
## Betroffene Module (erwartet)
- **Gateway:** `serviceKnowledge`, file upload routes, connector OAuth handlers, sync workers, possibly new `serviceKnowledgeIngest` or package under `modules/serviceCenter/services/`.
- **Interfaces:** `interfaceDbKnowledge` extensions for source metadata if needed; **`interfaceDbApp`** (or adjacent) for **per-`connectionId`** ingestion preferences from **§2.6**.
- **Frontend:** `frontend_nyla` — connection wizard + connection detail settings (consent, depth toggles, neutralization, time window).
- **Wiki / Reference:** `b-reference/gateway/ai-agent.md` (ingestion vs. retrieval) after implementation.
---
## Offene Entscheidungen
| Thema | Optionen |
|-------|----------|
| **Email bodies** | Default product stance is **user-configurable per connection** (**§2.6** table: metadata / snippet / full cleaned body); mandate policy may still cap max tier. |
| **Multi-tenant isolation audits** | Periodic job to verify chunk `mandateId` matches connection |
| **Cost caps** | Per-mandate embedding budget; defer large backfills |
| **Neutralization** | **User opt-in** per connection (**§2.6**); optional **mandate floor** (“never below snippet+neutralize for mail”) remains a separate governance decision. |
| **Provenance shape** | First-class DB columns vs **documented `chunkMetadata` keys** for `connectionId`, external id, revision (must support **Teil 2** purge rules). |
| **In-flight duplicate handling** | Accept `status ∈ {"extracted","embedding","indexed"}` with matching hash as in-progress (cheap, lossy under failure) **vs** per-`sourceId` `asyncio.Lock` in `KnowledgeService` (strict, requires singleton) — see **§1.4 Deferred (ingestion idempotency hardening)**. |
| **Pre-extraction dedup shortcut** | Short-circuit `_autoIndexFile` via the file-bytes SHA in `interfaceDbManagement` before running `runExtraction` (~15 s saved per re-index of a large PDF) — see **§1.4 Deferred (ingestion idempotency hardening)**. |
---
## Structured ingestion logs (P1 schema)
The connection-lifecycle lane emits the following structured log events. **`part`** values **`sharepoint`**, **`outlook`**, **`gdrive`**, **`gmail`**, and **`clickup`** are all **implemented** for bootstrap; **P1c** may add the same events with a distinguishable `reason` / `jobType` for **scheduled refresh** (exact field TBD in implementation). Each event is a single `logger.info` / `.warning` / `.error` call with a stable `extra={"event": ...}` field so downstream log shippers can route on `event` without parsing the message string.
| `event` | Severity | Emitter | Required `extra` keys | Meaning |
|---------|----------|---------|------------------------|---------|
| `ingestion.connection.bootstrap.queued` | info | `KnowledgeIngestionConsumer._onConnectionEstablished` | `connectionId`, `authority` | A `connection.established` callback was received and a `connection.bootstrap` BackgroundJob is being enqueued. |
| `ingestion.connection.bootstrap.started` | info | `bootstrap{Sharepoint,Outlook,Gdrive,Gmail,Clickup}` | `connectionId`, `part` (`sharepoint` \| `outlook` \| `gdrive` \| `gmail` \| `clickup`) | The per-part bootstrap walker has begun work. |
| `ingestion.connection.bootstrap.progress` | info | bootstrap walkers | `connectionId`, `part`, `processed`, `skippedDup`, `failed` | Heart-beat every ~50 items so long-running runs are observable. |
| `ingestion.connection.bootstrap.done` | info | bootstrap walkers + façade-level totals | `connectionId`, `part`, `indexed`, `skippedDup`, `skippedPolicy`, `failed`, `durationMs` (Outlook/Gmail add `attachmentsIndexed`; SharePoint/Drive add `bytes`; ClickUp adds `workspaces` + `lists`) | Walker finished cleanly. |
| `ingestion.connection.bootstrap.failed` | error | `_bootstrapJobHandler` | `part`, `connectionId`, `error` | One bootstrap part raised — recorded but the other parts still complete. |
| `ingestion.connection.bootstrap.skipped` | info | `_bootstrapJobHandler` + OAuth callbacks + defensive check in `_bootstrapJobHandler` | `connectionId`, `authority`, `reason` (`unsupported_authority` │ `consent_disabled`) | Authority has no bootstrap module registered (e.g. a future provider) — **or** user has not consented (`knowledgeIngestionEnabled=False`). |
| `ingestion.connection.purged` | info | `_onConnectionRevoked` | `connectionId`, `authority`, `reason`, `indexRows`, `chunks` | Knowledge purge for a revoked connection completed; numbers reflect the deleted rows. |
| `ingestion.connection.purged.failed` | error | `_onConnectionRevoked` | `connectionId`, `error` | Purge raised; the revoke event was still acknowledged upstream. |
All events should keep field naming consistent with the existing `ingestion.queued / .indexed / .skipped.duplicate / .failed` family from P0 (camelCase, `connectionId`, `mandateId`, `userId`). Counters are integers, durations are in milliseconds.
## Links
- **How-to / orientation:** [Unified knowledge & RAG ingestion (guide)](../../d-guides/unified-knowledge-rag.md)
- **Gateway reference (retrieval + knowledge):** `wiki/b-reference/gateway/architecture.md`, `wiki/b-reference/gateway/ai-agent.md`
- **Implementation touchpoints (indicative):** `gateway/modules/serviceCenter/services/serviceKnowledge/mainServiceKnowledge.py`, `gateway/modules/routes/routeDataFiles.py`, `gateway/modules/features/commcoach/serviceCommcoachIndexer.py`, agent `coreTools` `_documentTools` / `_workspaceTools`, `gateway/modules/datamodels/datamodelExtraction.py` (`ExtractionOptions.mergeStrategy: Optional[MergeStrategy]`).
- **Unit tests (P0 guardrails):** `gateway/tests/unit/services/test_ingestion_hash_stability.py`, `gateway/tests/unit/services/test_extraction_merge_strategy.py`.
- **Unit tests (P1a — Microsoft, done):** `gateway/tests/unit/services/test_connection_purge.py`, `gateway/tests/unit/services/test_knowledge_ingest_consumer.py` (incl. **msft** fan-out), `gateway/tests/unit/services/test_clean_email_body.py`, `gateway/tests/unit/services/test_bootstrap_sharepoint.py`, `gateway/tests/unit/services/test_bootstrap_outlook.py`.
- **Unit tests (P1b — Google + ClickUp, done):** **`test_knowledge_ingest_consumer`** (google / clickup fan-out), **`test_bootstrap_gmail.py`**, **`test_bootstrap_gdrive.py`**, **`test_bootstrap_clickup.py`**. **P1d (done):** **`test_p1d_consent_prefs.py`** (10 tests: consent gate, prefs parsing, Gmail depth modes, ClickUp scope). **P1c:** add scheduler tests when implemented.
- **P1 implementation touchpoints:** `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorIngestConsumer.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncSharepoint.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncOutlook.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGdrive.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGmail.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncClickup.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subTextClean.py`, `gateway/modules/interfaces/interfaceDbKnowledge.py` (`deleteFileContentIndexByConnectionId`), `gateway/modules/datamodels/datamodelKnowledge.py` (`FileContentIndex.connectionId` + `sourceKind`), `gateway/modules/connectors/providerMsft/connectorMsft.py` (`@odata.nextLink`-loop in `SharepointAdapter.browse`, `eTag` in `_graphItemToExternalEntry`), `gateway/modules/connectors/providerGoogle/connectorGoogle.py` (P1b: Drive + Gmail revision keys and download/export paths), `gateway/modules/routes/routeSecurityMsft.py` (P1a callbacks), `gateway/modules/routes/routeSecurityGoogle.py` and `gateway/modules/routes/routeSecurityClickup.py` (P1b: parity callbacks), `gateway/modules/routes/routeDataConnections.py` (revoke for **all** authorities), `gateway/app.py` (consumer registration in lifespan).
## Akzeptanzkriterien (Plan-Ebene)
| # | Kriterium | Prio |
|---|-----------|------|
| 1 | Every new **file** that should be searchable triggers ingestion **without** requiring an agent session. | must |
| 2 | **User connection** connect / disconnect has defined ingestion or purge behavior **for each** OAuth authority **`routeDataConnections`** supports (**P1a** **`msft`**, **P1b** **`google`** / **`clickup`**); **plus** user-controlled **opt-in** and **preference bundle** before ingestion (**P1d**, **§2.6**). | must |
| 3 | **Profile/mandate** snapshot ingestion (**former roadmap P2**) is **deferred**; when re-opened, snapshots must use an explicit allowlist and never embed secrets. Until then, **§2.6** consent + neutralization covers connector-sourced PII risk. | should (reactivated when P2 returns) |
| 4 | Ingestion is **idempotent** for unchanged content (no duplicate embedding work). Verified 2026-04-21 on a 500-page PDF: second re-index trigger logs `ingestion.skipped.duplicate` with a stable hash, zero embedding API calls. See **§1.4 pitfalls** for the three bug classes that had to be fixed first. | must |
| 5 | **Teil 3.3** matrix completed: every `modules/features/*` product row has **retrieval** (agent vs none), **corpus** (upload / tools / feature indexer), and **gap** explicitly stated—not “non-injecting” if **`AgentService`** already provides retrieval injection. | should |
---
## Testplan (Konzept-Verifikation)
| ID | Frage | Methode |
|----|-------|---------|
| T1 | Sind alle bestehenden Index-Entry-Points inventarisiert? | Code-Audit + Tabelle in Build-Phase |
| T2 | Ist klar welche Features **Retrieval** (Agent) vs nur **Corpus** vs beides nutzen? | Review **Teil 3.3** Matrix gegen `runAgent` / `indexFile` Call-Sites |
| T3 | Bleibt **plattform-RAG-Retrieval** (`buildAgentContext` über `AgentService`) unveraendert in seiner Rolle fuer Workspace/Grafikeditor/CommCoach? | Review `agentLoop` + `mainServiceAgent._createBuildRagContextFn` |
| T4 | Ist Revoke/Purge fuer Connector-Chunks ohne **connectionId-Spalte** heute als **Metadata-Konvention** spezifizierbar? | Review **Teil 2.1** + **Offene Entscheidungen** Provenance |
| T5 | Ist Revoke/Purge pro **User-Connection-Authority / Integrationsoberfläche** (Teil 2.2, nur `msft` / `google` / `clickup` laut `routeDataConnections`) in einem Threat-Model abgedeckt? | Datenfluss Connection → `FileItem` / virtuelles Doc → Chunks |
| T6 | Ist der Content-Hash stabil zwischen zwei Extraktions-Runs desselben Files (verschiedene `contentObjectId`-UUIDs, identisches Payload)? | Unit: `tests/unit/services/test_ingestion_hash_stability.py` (5 Cases: UUID-Regen, Daten-Delta, Order-Delta, Type-Delta, Empty). Live: zweiter Trigger auf bereits indexiertes File loggt `ingestion.skipped.duplicate` mit identischem Hash (verifiziert 2026-04-21). |
| T7 | Bleiben bei Multi-Page-PDFs die Per-Page-Chunks erhalten (keine `MergeStrategy`-Konkatenation)? | Unit: `tests/unit/services/test_extraction_merge_strategy.py`. Live: 500-Seiten-PDF → 563 ContentObjects, 567 Embedding-Chunks in 24 Batches (verifiziert 2026-04-21). |
| T8 | Überleben `_ingestion.hash` und `status="indexed"` einen Pre-Scan-Re-Upsert in `_autoIndexFile`? | Review `routeDataFiles._autoIndexFile` Zeile ~127: existing row wird vor upsert gelesen und `_ingestion` + `indexed` in frischen `contentIndex` gemerged. Live: zweiter Trigger → `ingestion.skipped.duplicate` statt Re-Embedding. |
| T9 | Räumt ein `connection.revoked` Event **alle** `FileContentIndex`-Rows + `ContentChunk`s einer Connection und **nichts anderes** auf (Uploads ohne `connectionId`, andere Connections bleiben intakt)? | Unit: `tests/unit/services/test_connection_purge.py` (3 Cases: positive purge, leerer connectionId-Noop, unbekannter connectionId). |
| T10 | Dispatcht der `KnowledgeIngestionConsumer` `connection.established` korrekt als asynchroner `connection.bootstrap` Job (**P1a:** **msft** → SharePoint + Outlook parallel; **P1b:** **google** → Drive + Gmail parallel; **clickup** → Tasks) und `connection.revoked` synchron als Purge — **für jede** der drei **`routeDataConnections`**-Authorities? | **P1a + P1b (done):** `test_knowledge_ingest_consumer.py` — alle drei Authorities + revoke; unbekannte Authorities `skipped.reason="unsupported_authority"`. **P1d:** zusätzlich nur bei **Consent = ja** dispatch. |
| T11 | Reduziert `cleanEmailBody` ein realistisches Outlook-HTML auf den eigenen Body-Anteil (HTML strip, Quote-Strip EN+DE, Signature-Strip, Whitespace-Collapse, `maxChars`-Truncate)? | Unit: `tests/unit/services/test_clean_email_body.py` (8 Cases). Konsequenz: `bootstrapOutlook` schickt nie HTML/Quoted-Replies/Signaturen in den Embedding-Pipeline-Schritt. |
| T12 | Sind die Bootstrap-Walker für SharePoint und Outlook idempotent gegen ein zweites Run mit unveränderten `eTag` / `changeKey`? | Unit: `tests/unit/services/test_bootstrap_sharepoint.py` + `tests/unit/services/test_bootstrap_outlook.py`. Mock-Adapter liefern stable revisions; KnowledgeService-Fake meldet `duplicate` und das Result-Objekt bilanziert `skippedDuplicate`. |
| T13 | Walked `bootstrapGmail` `INBOX + SENT`, parsed MIME-Bodies (preferring `text/plain`, falling back to `text/html`), folgt `nextPageToken`-Pagination und ist idempotent gegen identische `historyId` Revisions? | **P1b (done):** Unit `test_bootstrap_gmail.py`. **P1d:** Walker respektiert **Content depth** aus **§2.6** (Metadaten/Snippet/Body). |
| T14 | Walked `bootstrapGdrive` My Drive rekursiv (Folder-MIME-Erkennung, `maxDepth`), respektiert den `maxAgeDays`-Recency-Filter und ist idempotent gegen identische `modifiedTime` Revisions? | **P1b (done):** Unit `test_bootstrap_gdrive.py`. **P1d:** „Binärdateien“ / MIME-Allowlist aus **§2.6**. |
| T15 | Walked `bootstrapClickup` Workspaces → Spaces → Folder/Folderless Lists → Tasks unter `maxWorkspaces` / `maxListsPerWorkspace` / `maxTasks` Caps, respektiert den `maxAgeDays`-Recency-Filter und ist idempotent gegen identische `date_updated` Revisions? | **P1b (done):** Unit `test_bootstrap_clickup.py`. **P1d:** ClickUp-**Scope** (Titel/Beschreibung/Kommentare) aus **§2.6**. |
| T16 | Führt der **P1c**-Tagesjob nur Verbindungen mit **Wissens-Injektion = ein** aus und bleiben Kosten/API-Limits durch Idempotenz + Fast-Path beherrschbar? | Integration oder Unit mit Fake-Clock: zweiter Lauf → überwiegend `skippedDup`; Logs `ingestion.connection.bootstrap.*` mit erkennbarem Scheduled-`reason` (falls implementiert). |