# Unified Knowledge Indexing — One RAG Corpus for All Platform Information

## How to read this document

| Section | Content |
|---------|---------|
| **Beschreibung und Kontext** | Scope (**ingestion vs retrieval**), **terminology** (feature / service / connector / interface), **as-is vs target**, business case and risks. |
| **Teil 1** | Ingestion as its **own lifecycle**: façade API, idempotency, orchestration—**not** owned by `agentLoop`. |
| **Teil 2** | **Triggers** (beyond upload): **user connections**, account snapshots, purge; **per-connection-type** indexing guidance; event-driven option. |
| **Teil 3** | **Feature injection** split into **retrieval** (agent + `buildAgentContext`) vs **corpus** (`indexFile`); **matrix** per `modules/features/*` product; real **gaps** vs false “non-injection”. |
| **Implementation phases · Ziele · AC · Testplan** | Rollout, explicit non-goals, acceptance criteria, verification. |

**Single sentence summary:** Keep **retrieval** on **`AgentService`**; unify **when and how** the shared **`interfaceDbKnowledge`** corpus is **filled** (routes, **user connections**, **feature commit points**) behind one **ingestion contract**.

**Current roadmap scope:** user-connection lifecycle (**P1a/P1b**), **daily refresh** to close the post-connect delta gap (**P1c**), **explicit user consent + per-connection ingestion preferences** (incl. optional **neutralization**) in **frontend + API**, then **scalable event bus** (**P3**).

**Out of current roadmap:** standalone **profile/mandate snapshot** ingestion (former roadmap **P2** — content remains in Teil 2.3 as future option only).

## Beschreibung und Kontext

### Scope of this document

We distinguish **ingestion** (chunking, embedding, persisting into **`interfaceDbKnowledge`**) from **retrieval** (semantic search + `buildAgentContext` for the LLM).
**Retrieval** for the **unified knowledge store** is consumed primarily through **`serviceAgent`** / `runAgent` (workspace, graphical editor, CommCoach). Other products (e.g. **chatbot**, **teamsbot**) may use **different** LLM stacks—**Teil 3** maps who gets platform RAG vs who does not.

This plan does **not** mandate one global retrieval path for every feature; it **does** mandate a **single ingestion story** into the same corpus where that corpus is used. The gap we address is **how and when** the corpus is **filled**, not how every LLM entry point **reads** it.

### Terminology (Gateway — see `wiki/b-reference/gateway/architecture.md`)

This concept separates **feature modules**, **services**, **connectors**, and **interfaces**. Conflating them produces wrong ownership (e.g. treating “SharePoint” as a `modules/features/` product, or treating “mail” as if it were `serviceKnowledge`).

| Term | Where in Gateway | Role for indexing |
|------|------------------|-------------------|
| **Feature** | `modules/features/*` (e.g. `workspace`, `graphicalEditor`, `commcoach`, `trustee`, `chatbot`) | Product domains: UI, feature routers, orchestration. They **trigger** actions (upload, sync UX, feature-specific indexers) but must not be the **only** place that starts embedding work. |
| **Service** | `modules/serviceCenter/services/*` | Cross-cutting facades: **`serviceKnowledge`** (indexing, search, `buildAgentContext`); **`serviceExtraction`** (content objects); **`serviceChat`** (chat/workflow documents); **`serviceMessaging`** (e-mail, notifications); **`serviceAgent`** (tools that may *indirectly* call indexing). **Unified ingestion** is primarily a **service-layer** responsibility. |
| **Connector** | `modules/connectors/*` (Microsoft, Google, …) | Vendor adapters: OAuth, list, download. **SharePoint** and **mailbox** I/O live here; routes/features **call** connectors—they are not interchangeable with a feature or with `serviceKnowledge`. |
| **Interface** | `modules/interfaces/*` | Persistence contracts: **`interfaceDbKnowledge`** (`FileContentIndex`, `ContentChunk`, …), **`interfaceDbManagement`** (`FileItem`, `DataSource`), **`interfaceDbApp`** (User, Mandate, `UserConnections`, Preferences). Profile, mandate, and connection rows are **interface-backed**, not a single “profile feature”. |

### What we have today (as-is)

**1. A strong technical “write” implementation, but no product-wide ingestion contract**

- **`serviceKnowledge`** (`mainServiceKnowledge.py`) already implements the heavy lifting: **`indexFile`** resolves scope from **`FileItem`** (single source of truth), optional neutralization, sentence-aware chunking, **`serviceAi`** embeddings, **`ContentChunk`** persistence, and status on **`FileContentIndex`**. That is the right **unit of work** once **content objects** exist.
- **Retrieval** is also centralized in the same service: **`buildAgentContext`** composes multi-layer context for agents. So **read** and **write** to the vector store are **service-owned**; what is **not** unified is **who may call writes**, with which **idempotency**, and on which **lifecycle events**.

**2. Multiple invocation lanes, same underlying method**

Indexing is **operationally** reached from a **mix of layers** (not one façade):

- **HTTP routes** — e.g. file pipeline in **`routeDataFiles`**: pre-scan / extraction → **`indexFile`**. This is the “happy path” for uploads.
- **Agent tools** — **`serviceAgent`** core tools (e.g. document/workspace helpers) can call **`indexFile`** when the user interacts through the agent. That ties **embedding** to **an agent session** even when the same file could have been indexed on upload.
- **Feature-specific code** — e.g. **CommCoach** indexer paths that call **`indexFile`** for that product’s artifacts. Correct for the feature, but it is **another** ad hoc entry point with its own assumptions.
- **Connectors** — Microsoft/Google (and similar) packages can fetch bytes and ultimately produce files or blobs; **OAuth and delta sync** are not yet modeled everywhere as **first-class ingestion lifecycles** (connect → backfill → incremental → revoke) that all funnel through the same API and metadata.

There is **no single** `requestIngestion(...)`, **no standard job identity** for “this external item revision”, and **no one place** that records “this mandate revoked access → tombstone these chunks”.

**3. Extraction vs indexing: clear in code, not enforced at the platform edge**

- **`serviceExtraction`** (and preprocessing helpers) produce **content objects**; **`indexFile`** consumes them. The boundary is clean **inside** the pipeline, but **not every** new binary or external document **must** pass through a single orchestrated “extract then index” step—some paths may skip, duplicate, or call **`indexFile`** with partial metadata.

**4. Truth for scope and identity lives in interfaces—not in “features”**

- **`interfaceDbManagement`** (`FileItem`, …) and **`interfaceDbApp`** (mandate, `UserConnections`, user profile fields) define **who may see what**. **`indexFile`** already mirrors **`FileItem`** for scope; that pattern is **good** but **not generalized** to connector-backed items, virtual documents, or curated “account snapshot” chunks. If a connector writes under a different mental model of `mandateId` / `featureInstanceId`, **`interfaceDbKnowledge`** can drift from app/management truth without a systematic reconcile.

**5. User/mandate/profile deltas are not first-class ingestion events**

- Changes to membership, preferences, or connections update **`interfaceDbApp`** (and related tables). They affect **searchability and personalization** but are **not** consistently reflected as **versioned, allowlisted** chunks in the knowledge store—unless a feature manually adds text somewhere. That leaves agents either **under-informed** or dependent on **non-RAG** code paths for the same facts.

**Summary (as-is):** The **engine** for indexing is **`serviceKnowledge.indexFile`**; the **policy graph** for *when* to run it is **implicit** and **spread across** routes, tools, and features. **Connectors** and **account/mandate** data are **not** uniformly treated as **ingestion sources** with connect/sync/revoke semantics.

### What would make more sense (target)

**1. One ingestion façade at the service boundary (not inside `agentLoop`)**

- A small, stable API (conceptually **`requestIngestion` / `getIngestionStatus`**, implemented atop or beside **`KnowledgeService`**) that **every** lane calls: routes, feature hooks, **connector sync workers**, and (if needed) agent tools as **thin** delegates.
- **Idempotency** (content hash, external revision, `eTag`, …) enforced **here**, so routes and tools cannot accidentally **double-embed** the same logical object.

**2. Lifecycle parity for connectors and “connections”**

- **Establish** → register datasource + optional short **non-secret** summary chunk + enqueue **backfill**.
- **Delta** → incremental jobs with persisted cursors.
- **Revoke / token invalid / GDPR** → **tombstone or purge** by `connectionId` / `sourceKind`, aligned with RBAC—not ad hoc deletes scattered in UI code.

**3. Provenance / `sourceKind` (schema or `chunkMetadata`)**

- Today chunks are **file-anchored**; extended provenance (internal file vs SharePoint item vs mailbox artifact vs `profile_snapshot`, **`connectionId`** for purge, revision keys) should be **consistent**—either **first-class fields** on `ContentChunk` / index rows **or** a **defined convention** inside **`chunkMetadata` / `contextRef`** until a migration is justified. Goal: retrieval, audit, and **connector revoke** cleanup are **data-driven**, not inferred only from call site.

**4. Curated snapshots for interface-backed facts**

- **Allowlisted** projections of mandate membership, locale, entitlements (labels), etc., regenerated on **interface-level** events—**not** dumping full user rows or secrets into embeddings.

**5. Keep retrieval exactly where it is**

- **`buildAgentContext`** remains the agent’s way to **consume** the corpus; ingestion only ensures that corpus is **complete, scoped, and attributable** when the agent runs.

**6. Observability and cost in one place**

- Queue depth, embedding spend, failures, and “skipped duplicate” counts attach to the **ingestion façade**, not to each feature.

### Business goal

Whenever **meaningful information** appears—files, bytes from **connectors**, configuration that should shape answers, and **bounded** user/mandate context—the platform should **ingest it once** into a **unified, scoped** knowledge layer so agents see **one coherent corpus** with clear **provenance** and **permissions**.

### Why this matters now

Information deltas arrive through **routes**, **features**, **`serviceAgent`** tools, **connectors**, and **`interfaceDbApp`** / **`interfaceDbManagement`** updates. Without **one** ingestion contract and triggers per **source**, you get: **missing** indexes, **duplicate** work, **scope drift** between knowledge rows and app truth, and **repeated** engineering per entry path instead of **once** at the **service** layer.

### Risk if we do not unify

Fragmented memory, inconsistent agent answers, compliance gaps (over-indexing sensitive fields or under-indexing allowed context), and duplicated work **per route/feature/tool** instead of at a **single service boundary**.
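As an illustration of target item 1 above (one façade that enforces idempotency for every lane), the following is a minimal in-memory sketch, not the Gateway implementation. The class names `IngestionJob`, `IngestionHandle`, and `IngestionFacade`, the helper `contentHash`, and the example `sourceKind` values are assumptions made for this sketch; only the `requestIngestion` name and the idea of a content-derived key come from this document.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionJob:
    """Hypothetical job shape; field names are illustrative only."""
    sourceKind: str    # e.g. "file" or "sharepoint_item" (assumed values)
    sourceId: str      # stable id of the logical object
    contentType: str
    data: bytes


@dataclass(frozen=True)
class IngestionHandle:
    status: str        # "queued" or "duplicate"
    jobKey: tuple


def contentHash(job: IngestionJob) -> str:
    """Hash the payload only, never per-run identifiers."""
    digest = hashlib.sha256()
    digest.update(job.contentType.encode("utf-8"))
    digest.update(b"\x00")  # separator so type and data cannot blur together
    digest.update(job.data)
    return digest.hexdigest()


class IngestionFacade:
    """Single entry point: every lane calls requestIngestion."""

    def __init__(self) -> None:
        self._seen: set = set()
        self.queue: list = []

    def requestIngestion(self, job: IngestionJob) -> IngestionHandle:
        key = (job.sourceKind, job.sourceId, contentHash(job))
        if key in self._seen:
            # Same logical object, same content: skip the embedding cost.
            return IngestionHandle("duplicate", key)
        self._seen.add(key)
        self.queue.append(job)  # a real worker would chunk and embed later
        return IngestionHandle("queued", key)


facade = IngestionFacade()
job = IngestionJob("file", "f1", "text/plain", b"hello")
first = facade.requestIngestion(job)
second = facade.requestIngestion(job)  # identical content, no second embed
```

Because the key is derived from the payload, a route and an agent tool that both submit the same file end up with one queued job. A production version would persist the key (for example on the index row) instead of holding it in process memory.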
---

## Teil 1 — Indexing as its own lifecycle (not owned by the agent)

### 1.1 Current useful core

*(Same technical point as **“What we have today” §1** above; repeated here for readers who start at Teil 1.)*

After structured **content objects** exist, **`KnowledgeService.indexFile`** performs chunking, embedding (via **`AiService`**), neutralization when required, and persistence via **`interfaceDbKnowledge`**. The **gap** is not the lack of a service method but the lack of a **single product-wide contract** for *when* and *what* enters that pipeline.

### 1.2 Target responsibility split

| Concern | Owner | Notes |
|--------|--------|--------|
| **Ingestion** (normalize → chunk → embed → store) | **Knowledge ingestion service** (logical module; may remain `KnowledgeService` + new façade) | No dependency on `agentLoop`. |
| **Retrieval** (query → ranked context string) | **Agent** (and similar LLM entry points) | Unchanged by this concept. |
| **Orchestration** (queues, retries, backoff) | **Job runner / worker** (new or existing infra) | Keeps API latency low. |

### 1.3 Public ingestion contract (conceptual)

Introduce a small, stable API surface that **all features** call—never “only if an agent runs”:

- **`requestIngestion(job: IngestionJob) -> IngestionHandle`**
  - Idempotent key: `(sourceKind, sourceId, contentVersion | hash)`
  - Returns immediately with `queued` / `duplicate` / `skipped` and optional `jobId` for status polling.
- **`getIngestionStatus(handle)`**
  - Surfaces the same states already used on `FileContentIndex` (`pending`, `extracted`, `embedding`, `indexed`, `failed`) plus connection- or source-specific substates if needed.

The implementation can stay in-process at first (asyncio task queue) and move to Redis/Celery/ARQ later without changing callers.

### 1.4 Idempotency and versioning

- **Re-index** when content changes: compare **content hash** or **external revision** (SharePoint `eTag`, email `Message-ID` + folder cursor, file `updatedAt`).
- **Skip** when hash unchanged to control embedding cost.
- **Tombstone** or **scope-disable** when a source is deleted or access revoked (see Teil 2).

#### Implementation pitfalls (observed during P0 build, 2026-04-21)

The first end-to-end AC4 test on a 500-page PDF revealed **three** independent bugs that all had to be fixed before `ingestion.skipped.duplicate` could ever fire. Each is a **design rule** that every future ingestion lane must honor:

1. **Hash must derive only from content.** `_computeIngestionHash` initially hashed over `(contentObjectId, contentType, data)`, but `contentObjectId` came from `uuid.uuid4()` inside the extractors and was therefore a fresh value on every run. The hash was effectively a per-run nonce — the duplicate check could never match. **Rule:** hashes MUST be a pure function of payload (`contentType`, `data`, and extractor order); never of caller-supplied per-run identifiers. (Tests: `tests/unit/services/test_ingestion_hash_stability.py`.)
2. **Pre-upserts must preserve `_ingestion` metadata and the `indexed` status.** `routeDataFiles._autoIndexFile` persisted a fresh `FileContentIndex` from the pre-scan **before** calling `requestIngestion`, overwriting `structure._ingestion.hash` and `status="indexed"` from any prior successful run. The duplicate check saw a row with empty metadata and re-ran the whole embedding stage. **Rule:** any upsert on the idempotency row taken outside `requestIngestion` MUST read the existing row first and merge forward both `_ingestion` and (where applicable) the terminal `indexed` status.
3. **Extraction-pipeline defaults must preserve granularity for RAG.** `ExtractionOptions.mergeStrategy` defaulted to concatenating every text `ContentPart` into one blob, collapsing a 500-page PDF into a single chunk whose embedding is a blurred average of the whole document — unusable for targeted retrieval.
   **Rule:** every ingestion lane passes `mergeStrategy=None` explicitly until the default itself can be safely flipped after auditing non-RAG callers. (Tests: `tests/unit/services/test_extraction_merge_strategy.py`.)

**Deferred (ingestion idempotency hardening)** (uncovered during P0, not blocking AC1–AC5; naming here is **not** the same milestone as **P1 user-connection hooks** below):

- **In-flight duplicate detection.** The current duplicate check only matches when `status == "indexed"`, so two nearly-simultaneous calls for the same `sourceId` both run full embedding. Fix candidates: accept `status ∈ {"extracted", "embedding", "indexed"}` with matching hash as "already in progress", or a per-`sourceId` `asyncio.Lock` in `KnowledgeService`.
- **Pre-extraction byte-hash shortcut.** `requestIngestion`'s duplicate check runs **after** extraction, so re-indexing a 1.6 MB PDF still spends ~15 s in `runExtraction` before the content hash is computed. The file-bytes SHA already exists in `interfaceDbManagement` for upload-dedup — a short-circuit in `_autoIndexFile` (and symmetric paths) could skip extraction entirely for an unchanged file.

---

## Teil 2 — Triggers: not only “file write”, but every information delta

“Write path” is too narrow if we read it as “HTTP upload only”. The unified model should treat **any authoritative addition or change of platform-visible information** as a potential ingestion trigger.

### 2.1 Trigger taxonomy

| Trigger category | Examples | Ingestion behavior (conceptual) |
|------------------|----------|----------------------------------|
| **Artifact persisted** | User uploads PDF; paste text saved as file; export from a feature | Existing pipeline: extract → `indexFile` (or equivalent). |
| **User connection added / re-authorized** | SharePoint OAuth success; Microsoft/Google mail connection; new API credential with data scope | **Register datasource** + enqueue **initial sync** (backfill) + index a **short connection summary document** (site name, root path, principal, *no secrets*). |
| **Sync for an existing connection** | Scheduled delta; webhook (if available); manual “refresh” | Incremental fetch → map to content objects or rows → **upsert** chunks keyed by external id. |
| **Connection revoked / token invalid** | User disconnects; admin removes mandate integration | **Tombstone** or **purge** chunks keyed by **connection / external source** (today: enforce via **`chunkMetadata` / `contextRef`** convention or future columns); ensure retrieval never serves stale data from that connection. |
| **Mandate / membership** | User added to mandate; role change; feature instance attached | Regenerate **mandate-safe summary** documents (see Section 2.3) if policy allows; **re-resolve scope** for existing chunks (may be heavy—often better to store immutable `mandateId` on chunks at write time and rely on retrieval filters). |
| **User profile (bounded)** | Display name, locale, timezone, **non-sensitive** preferences | Optional **UserContextDocument** for personalization—not a dump of the whole `User` row. |
| **Feature configuration** | Instance labels, data source labels, automation descriptions | If they should influence answers, emit structured **FeatureConfigSnapshot** chunks (small, text-first). |
| **Artifact deleted / data subject erasure** | User deletes a stored file; mandate/user erase | Purge or tombstone the corresponding **`FileContentIndex` / `ContentChunk`** rows (by `fileId`); erasure jobs cascade by **`userId`** / mandate policy. **Connection-wide** revoke remains the **connection** row above. |

### 2.2 User connections (added by the user) as first-class ingestion sources — lifecycle and **what to index per connection type**

**Conceptual focus:** The trigger is OAuth success, saved credential, or linked account in **`UserConnection`** that grants access to an external system. **Implementation** still flows through provider code under `gateway/modules/connectors/` (e.g. **`providerMsft`**, **`providerGoogle`**, **`providerClickup`**); that mapping is **technical**, not the product wording.

**Scope — what counts as a user connection here:** `gateway/modules/routes/routeDataConnections.py` only allows **creating** connections with `type` **`msft`**, **`google`**, or **`clickup`** (`create_connection` → OAuth via `connect_service`). The **authorities options** endpoint also lists **`local`**, but that path is **not** wired in `create_connection`. **This subsection only covers those user-connection authorities** (plus the surfaces each OAuth integration can reach, e.g. Graph mail for Microsoft). Other Gateway connector packages (FTP, Jira, preprocessor, outbound-only mail, geo APIs, …) are **out of scope** in §2.2 until they are exposed the same way as **`UserConnection`** rows.

**Cross-cutting rules (every user-added connection):**

- **Never index:** OAuth tokens, refresh payloads, raw credentials, webhook signing secrets.
- **Always safe to index (metadata only):** human-readable **connection** label, tenant/site name, root path / mailbox address **as display string**, last sync cursor (store in DB, not necessarily as embedding), **external id** + **revision** for idempotency.
- **Prefer file pipeline for binaries:** download → store as `FileItem` (or equivalent) → reuse existing **extract → `indexFile`** path so neutralization and scope mirror upload behavior.
- **Prefer virtual documents** for small text-native items (mail headers/snippets, issue titles/descriptions) to avoid N binary copies.
- **Quota:** per-mandate max documents, max bytes, and “index only last N days” for mail are **product** knobs, not defaults baked into each adapter.

**Lifecycle pattern (target) — tied to the connection row, not to “a connector class”:**

1. **Connection event** (`ConnectionEstablished`) fires when the user **adds** or **re-authorizes** a connection (OAuth / credential storage, **`UserConnection`**, authority **`msft`**, **`google`**, or **`clickup`** per current API).
2. **Ingestion registry** records: `{ connectionId, featureCode, mandateId, userId, scope, externalRoot, adapterKind }` (adapter kind = which integration backs this connection).
3. **Sync planner** enqueues jobs **for that connection**:
   - **Bootstrap:** list roots, respect quotas, prioritize recently modified.
   - **Delta:** cursor per drive/site/folder/mailbox/label; persist cursor in DB.
4. **Normalizer** maps each external item to either:
   - **File-like** → persist bytes + run extraction + **`indexFile`**, or
   - **Virtual document** → build `contentObjects` in memory + **`indexFile`** with a synthetic `fileId` / stable external key.

---

#### When the user connects **Microsoft** (Graph — SharePoint, OneDrive, Outlook, Teams) — `providerMsft`

| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
|--------|----------------------------------|------------------------------|-----------|
| **SharePoint** (`SharepointAdapter`) | Document libraries: **PDF, Office, text, markdown**; list **metadata** (library name, path, item name) as `contextRef`. | Huge video blobs, raw executables, duplicates already indexed via another path. | Use **driveItem id + eTag** for revision. Respect **library/folder allowlist** on this **connection**. |
| **OneDrive** (`OneDriveAdapter`) | Same as SharePoint for **personal files** reachable through the user’s connection. | System/temp folders if exposed. | Scope = **personal** unless shared into mandate explicitly. |
| **Outlook** (`OutlookAdapter`) | **Mailbox:** subjects, **from/to/cc**, **received date**, **body** (plain or stripped HTML) per policy; **calendar** titles/locations/descriptions if product enables. | Full MIME raw, embedded images as separate media unless needed; **entire mailbox** without date window in v1. | Strong **retention + PII** policy: optional “headers + snippet only”; strip signatures/quoted threads; **attachments** → child **file-like** jobs (virus/size limits). |
| **Teams** (`TeamsAdapter`) | **Channel messages** (text), **meeting chat** exports if API allows; **files shared in channel** as file-like. | Message reactions, per-user read receipts; continuous full channel history without bounds. | Often **high volume** — default to **recent window** or **keyword/subscription** driven sync. |

---

#### When the user connects **Google** (Drive, Gmail) — `providerGoogle`

| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
|--------|----------------------------------|------------------------------|-----------|
| **Drive** (`DriveAdapter`) | Native Google files after **export** to Office/PDF (existing export MIME map); standard uploaded files **download → extract**. | Trashed items; shared drives the user did not authorize. | Use **file id + modifiedTime**; Google Docs need **export** before text extraction. |
| **Gmail** (`GmailAdapter`) | **Threads:** subject, participants, internalDate, **snippet** or **body** per policy; **attachments** as separate ingest jobs. | Entire “All Mail” unbounded; **labels** that are purely system. | Same mail cautions as Outlook; **Message-ID** + **History-ID**/cursor for delta. |

---

#### When the user connects **ClickUp** — `providerClickup` (`AuthAuthority.CLICKUP`)

| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
|--------|----------------------------------|------------------------------|-----------|
| **ClickUp** (`providerClickup`) | Task **name**, **description**, **comments**; **attachment** content if downloaded. | Activity stream noise, every status micro-change unless text changed. | Rate limits → prioritize **recently updated** tasks. |

---

**Email and messaging (Outlook + Gmail via Microsoft / Google user connections) — shared cautions**

- Default tiers: **metadata only** → **snippet** → **full body** → **attachments** (most expensive / sensitive). **Product default** vs **user override** is defined in **§2.6** (per-connection mail depth + attachments).
- Apply **quoted-thread stripping**, **signature removal**, and **max body length** before embed.
- **Legal hold / retention:** ingestion must respect mandate **delete** and **export** rules; **disconnecting** or **revoking** the mail **connection** must **purge** mail-sourced chunks.

### 2.3 “Account and stuff” — what to index vs. what never to index

**Roadmap note:** Standalone **profile/mandate snapshot** ingestion (formerly roadmap **P2**) is **out of current scope**; the table below remains the **target model** when that work is picked up again.

**Goal:** Give agents **useful, permission-safe** context (“who is this user in this mandate”, “which features are on”, “preferred language”) without creating a **second copy of sensitive credentials** in the vector store.

| Data | Typical treatment |
|------|-------------------|
| Passwords, refresh tokens, API secrets | **Never** index; never pass through embedding pipeline. |
| Email, phone, government IDs | **Default deny**; only if product explicitly enables “index PII” with neutralization and mandate policy. |
| Display name, locale, feature entitlements (labels) | **Allow** as a small structured **UserMandateSnapshot** document regenerated on change. |
| Full `User` or `Mandate` DB row | **Avoid**; generate **curated** JSON/text snapshots with field allowlists. |

Snapshots should be stored with the same **scope model** as file chunks (`personal`, `featureInstance`, `mandate`, `global`) so `semanticSearch` filters stay consistent.

### 2.4 Event-driven vs. direct calls

**Minimum viable:** each feature calls `requestIngestion` at the end of its own transaction (direct call).

**Scalable target:** emit **domain events** (`FileCommitted`, `UserConnectionReady` / provider-specific ready event, `ProfileUpdated`) and a single **KnowledgeIngestionConsumer** subscribes. Benefits: one place for metrics, retries, and rate limits; features stay thin.

**Storage (already implemented — not redesigned here):** The platform already uses **one** knowledge persistence stack: **`FileContentIndex`** (incl. `mandateId`, `scope`, status) and **`ContentChunk`** (pgvector embeddings, `fileId`, `userId`, `featureInstanceId`, `contextRef`, optional **`chunkMetadata`**), accessed via **`interfaceDbKnowledge`**. Chunks are **file-anchored** today; **connection- / source-specific** provenance (e.g. `connectionId`, external ids) can ride in **`contextRef` / `chunkMetadata`** until optional schema extensions are justified. **This document targets ingestion triggers and lifecycles**, not a second corpus or a duplicate storage model.

### 2.5 Lifecycle gap and daily refresh (roadmap **P1c**, v1)

**Gap:** After a successful connect, **bootstrap** runs once (initial fill). **New** mail, files, or tasks that arrive **after** that run are **not** indexed automatically until a **delta** path exists (webhook, `historyId` / `changes` cursors, etc. — see Teil **2.1** row *“Sync for an existing connection”*).

**Pragmatic mitigation (deliberately simple):** A **daily scheduler** (e.g.
once per night, staggered by tenant/load) re-invokes the same **bootstrap walkers** for every **active** `UserConnection` that has **knowledge ingestion enabled** (see **§2.6**). Idempotency + fast-path skips unchanged items; **new** and **changed** items are picked up.

- **Pros:** No new external dependencies (Pub/Sub, watch renewal) in v1; fits existing BackgroundJob + cron/feature-flag patterns.
- **Con:** Data can lag up to **~24 h** before it appears in RAG — acceptable for v1 product choice.
- **Later (without replacing P1c):** Add per-authority **delta APIs** (Gmail `users.history.list`, Drive `changes.list`, ClickUp tighter polling) to reduce latency and API cost.

### 2.6 User consent, frontend flow, and per-connection preferences (incl. neutralization)

**Goal:** The user **explicitly** chooses whether this connection may feed the **shared knowledge store** used for AI/RAG — and **how much**. Without consent, **no** knowledge bootstrap is started for that connection (OAuth may still unlock other product features; that split must be obvious in the UI).

**Frontend (`frontend_nyla`):** extend the **add connection** flow (and later **connection settings**) with the dialog and controls below; persist choices via Gateway API **before** or **when** triggering knowledge ingestion.

#### UX when adding a connection

1. User starts OAuth as today.
2. **Before** or **immediately after** successful authorization: a **dialog** that clearly separates “establish connection” from “add to knowledge base”.
3. **No:** Connection remains usable for other features; either skip `KnowledgeIngestionConsumer.onConnectionEstablished` for the knowledge lane or persist `knowledgeIngestionEnabled=false` and never schedule walkers.
4. **Yes:** Show **advanced settings** (second step or accordion) per **settings catalog** below; persist **per `connectionId`** (or a dedicated preferences row); only then enqueue **bootstrap** (and later **P1c** refresh) with allowed surfaces and tiers.
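The yes/no branch in steps 3 and 4 can be sketched as a small gate that turns persisted preferences into a bootstrap plan, or into nothing when consent is missing. This is a hedged illustration only: `ConnectionKnowledgePrefs`, `planBootstrap`, the field names other than `knowledgeIngestionEnabled`, the depth values, and the conservative defaults are all assumptions for the sketch, not the Gateway schema.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed encoding of the mail depth tiers (metadata only / snippet / full body).
MAIL_DEPTHS = ("metadata", "snippet", "full")


@dataclass(frozen=True)
class ConnectionKnowledgePrefs:
    """Hypothetical per-connection preference row."""
    connectionId: str
    knowledgeIngestionEnabled: bool = False  # no consent by default
    neutralize: bool = True                  # assumed conservative default
    mailDepth: str = "metadata"
    indexAttachments: bool = False


def planBootstrap(prefs: ConnectionKnowledgePrefs) -> Optional[dict]:
    """Return a bootstrap plan, or None when the user has not opted in."""
    if not prefs.knowledgeIngestionEnabled:
        # "No" branch: the connection stays usable, no walker is scheduled.
        return None
    if prefs.mailDepth not in MAIL_DEPTHS:
        raise ValueError(f"unknown mail depth: {prefs.mailDepth}")
    # "Yes" branch: the plan carries exactly what the user allowed.
    return {
        "connectionId": prefs.connectionId,
        "neutralize": prefs.neutralize,
        "mailDepth": prefs.mailDepth,
        "indexAttachments": prefs.indexAttachments,
    }


declined = planBootstrap(ConnectionKnowledgePrefs("conn-1"))
accepted = planBootstrap(
    ConnectionKnowledgePrefs("conn-2", knowledgeIngestionEnabled=True, mailDepth="snippet")
)
```

Keeping the gate in one function means the daily refresh can call the same check before each run, so a user who later switches ingestion off is not re-walked.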
**Suggested copy (DE — pick one tone / A-B test):**

- **Formal:** „Möchten Sie Inhalte aus dieser Verbindung in Ihre **Wissensdatenbank** übernehmen? KI-Funktionen können dann passender auf **Ihre** Dokumente und Nachrichten Bezug nehmen — **nur** mit Ihrer ausdrücklichen Zustimmung und in dem Umfang, den Sie festlegen.“
- **Approachable:** „Sollen wir aus dieser Verbindung ausgewählte Inhalte sicher in Ihre **persönliche Wissensdatenbank** legen, damit die KI für Sie **besser helfen** kann? Sie entscheiden **was** und **wie stark anonymisiert** — und können das jederzeit in den Einstellungen ändern oder die Daten entfernen.“

Mirror in EN if the UI is bilingual.

#### Minimum settings catalog (all **per connection** where technically applicable)

| Layer | Setting | Meaning |
|--------|-----------|---------|
| **Master** | **Knowledge ingestion for this connection** | `off` / `on`: gates bootstrap + **§2.5** (P1c) refresh for the knowledge store. |
| **Protection** | **Neutralize / anonymize before embedding** | When `on`: apply the same (or stricter) **neutralization** pipeline as for uploads (`FileItem.neutralize` / platform rules) to connector-sourced text **before** chunking — names, e-mail addresses, phone-like patterns, IBAN-like patterns, per policy. User-facing label **„anonymisiert“** maps to this pipeline (not a cryptographic guarantee). |
| **Mail** (Outlook / Gmail) | **Content depth** | At least: **metadata only** (subject, participants, dates — no body) / **snippet** / **full cleaned body** (after `cleanEmailBody` and caps). |
| **Mail** | **Index attachments** | `off` / `on` (with size/type caps). |
| **Files** (Drive / SharePoint / OneDrive) | **Index binary files** | `off` / `on`; optional **MIME allowlist** (Office/PDF/text only) as a simplified UX preset. |
| **ClickUp** | **Scope** | `titles only` / `title + description` / `+ comments` / optional `attachments`. |
| **Microsoft** | **Parity** | Same dimensions where Graph surfaces mirror Google (mailbox / drive-like). |
| **General** | **Time window** | “Only index items from the last **N** days” (aligns with existing walker caps; slider with a sensible max). |
| **General** | **Help: what RAG is not** | Short explainer: not real-time mail; delay until next scheduled run (**§2.5**). |

**Optional power-user toggles (same screen, collapsed):** per authority, **which surfaces** ingest (e.g. **Google:** Gmail on/off, Drive on/off; **Microsoft:** SharePoint on/off, Outlook on/off — when the product exposes both). This reduces accidental over-breadth without extra wizard steps.

**Backend consequence:** Walkers read persisted preferences for `connectionId` on each run and **filter** surfaces and payload tiers **before** `indexFile`. A preference change requires a product decision: trigger a **re-sync**, or apply the change only to **new** items. Document the chosen rule.

#### Neutralization when the user opts in

- **Ingestion on** + **neutralization on:** After content is obtained (virtual text or extraction output), apply the **neutralization stage** **before** chunking/embedding; **that** text is what gets embedded.
- **Neutralization off:** Still apply baseline **hygiene** where already defined (e.g. `cleanEmailBody` for quotes/signatures) — hygiene **≠** full PII removal.
- **Compliance copy:** If the user chooses **full body**, state clearly that **perfect** anonymization is not guaranteed without neutralization.

---

## Teil 3 — Feature injection: retrieval vs corpus, agent loop, and real gaps

“Injection” is ambiguous. This section uses **two** precise meanings:

| Kind | What happens | Primary implementation today |
|------|----------------|------------------------------|
| **Retrieval injection** | Relevant **existing** chunks and workflow context are **assembled** and **inserted into the LLM prompt** (system message) each agent round. | **`AgentService.runAgent`** → `buildRagContextFn` → **`KnowledgeService.buildAgentContext`** → **`ConversationManager.injectRagContext`**. CommCoach wraps the same **`buildAgentContext`** and adds coaching-specific context. |
| **Corpus injection (indexing)** | **New** text/binary is **chunked and embedded** and written to **`interfaceDbKnowledge`** so it can be retrieved later. | **`KnowledgeService.indexFile`**; callers include **`routeDataFiles._autoIndexFile`**, **`serviceAgent`** tools (**`_documentTools`**, **`_workspaceTools`**), and **CommCoach** **`serviceCommcoachIndexer`**. |

A feature can **already participate fully in retrieval injection** by using **`AgentService`** without ever calling **`indexFile`** in its own folder. **Corpus** growth can still happen **indirectly** (upload pipeline, agent tools). Planning must **not** label such features as “non-injecting.”

### 3.1 Features that already use **`AgentService.runAgent`** (retrieval injection is on by default)

These **`modules/features/*`** entry points resolve **`getService("agent", ctx)`** and stream **`agentService.runAgent(...)`** (code audit):

- **`workspace`** (`routeFeatureWorkspace.py`)
- **`graphicalEditor`** (`routeFeatureGraphicalEditor.py`)
- **`commcoach`** (`serviceCommcoach.py` — custom **`buildRagContextFn`**, still uses platform **`buildAgentContext`** inside)

For all three, **every agent round** gets **retrieval injection** unless RAG fails or returns empty. **Corpus** updates for the same sessions still depend on **separate** mechanisms:

| Corpus path | When it runs |
|-------------|----------------|
| **Upload / `FileItem`** | **`routeDataFiles`** **`_autoIndexFile`** after storage (feature-agnostic). |
| **Agent tools** | If the model invokes tools in **`_documentTools`** / **`_workspaceTools`** that call **`indexFile`**, **corpus** changes **during** that agent run—implemented in **`serviceAgent`**, not in the feature’s route file. |

So **workspace** and **graphicalEditor** **do** “inject” in the **retrieval** sense today; they **can** “inject” in the **corpus** sense when users **upload** files or when the **agent** runs indexing-capable tools. What they **often lack** is **feature-owned, automatic corpus** logic (e.g. “on every graph publish, index a snapshot”) without an upload or tool call.

### 3.2 Features that do **not** use **`AgentService`** (no platform RAG prompt injection from this stack)

These domains **do not** call **`runAgent`** in their **`modules/features/*`** trees (audit). They therefore **do not** receive **`buildAgentContext`** through the **workspace agent** loop:

| Feature | Notes |
|---------|--------|
| **chatbot** | Uses an **internal** LangGraph-style flow (SQL / Tavily / answer nodes). **No** `getService("knowledge")` / **`buildAgentContext`** usage under **`modules/features/chatbot/`** in the audited tree—**retrieval injection** and the **unified corpus** are **not** wired the same way as the workspace agent. |
| **trustee** | Domain CRUD and quick actions (e.g. **`agentPrompt`** is a **UI hint** to open the workspace with a prefilled prompt—not **`AgentService` inside trustee**). Corpus: **only** via shared **upload** or if the user later uses **workspace agent** with tools. |
| **realEstate** | No **`AgentService`** hook in feature tree; same **upload** story for files. |
| **teamsbot** | Uses **`serviceAi`** (and related) for the meeting pipeline; **`sessionContext`** is **ephemeral** prompt text. **No** **`AgentService`** / **`buildAgentContext`** in the same pattern as workspace. |
| **neutralization** | **Service/pipeline** used **inside** **`indexFile`** when **`FileItem.neutralize`** applies—not a feature that “injects” either kind by itself. |

### 3.3 Summary matrix (per `modules/features/` domain)

*Matrix verified by audit on 2026-04-21 (P0):* Under `gateway/modules/features/`, only `workspace`, `graphicalEditor`, and `commcoach` resolve `getService("agent")` / `getService("knowledge")` or call `runAgent`; only `commcoach/serviceCommcoachIndexer.py` and `commcoach/serviceCommcoach.py` touch `indexFile` / `buildAgentContext` inside the feature tree. All other domains (`chatbot`, `trustee`, `realEstate`, `teamsbot`, `neutralization`) match the "No" rows below.

| Feature | **`AgentService.runAgent`** | **Retrieval injection** (platform RAG prompt) | **Corpus injection** (typical today) | **Likely gap** (this document) |
|---------|----------------------------|-----------------------------------------------|-------------------------------------|--------------------------------|
| **workspace** | Yes | Yes | Upload **`_autoIndexFile`**; optional **`indexFile`** via agent **tools** | **Automatic** corpus for artifacts that never become **`FileItem`** or tool outputs (exports, structured summaries). |
| **graphicalEditor** | Yes | Yes | Same as workspace | **Published graph / metadata** as searchable corpus without manual upload. |
| **commcoach** | Yes | Yes (+ custom RAG layer) | Session **`indexFile`** (**`serviceCommcoachIndexer`**) + upload/tools | Extend only if new artifact types need the same **feature-local** indexer pattern. |
| **chatbot** | No | **No** (unified store) | No feature-local **`indexFile`** | Decide if chatbot should call **`buildAgentContext`** / **`indexFile`** or stay on SQL/Tavily; **FAQ / grounding** text may need **corpus** hooks. |
| **trustee** | No | Only if user works in **workspace** | Upload path; agent tools only in workspace | **Trustee-native** persist events → ingestion when files are not the only representation. |
| **realEstate** | No | Only via workspace | Upload path | Same as trustee for case/property narratives. |
| **teamsbot** | No | No | None from unified store by default | Persisted **transcripts / notes** → **`indexFile`** if they should be mandate-searchable. |
| **neutralization** | N/A | N/A | Preconditions for **`indexFile`** | Ensure all **new** ingest paths honor **`FileItem.neutralize`**. |

### 3.4 Shared corpus mechanisms (not feature-local, but serve agent features)

| Mechanism | Role |
|-----------|------|
| **`routeDataFiles` + `_autoIndexFile`** | Indexes **uploaded** `FileItem`s for **any** UI that uses the upload API—including workspace. |
| **`serviceAgent`** **`_documentTools`** / **`_workspaceTools`** | **Corpus** writes when the **model** chooses tools; available to **workspace** and **graphicalEditor** agent sessions (and **CommCoach** when those tools are in the toolset). |
| **CommCoach** **`serviceCommcoachIndexer`** | **Feature-local** corpus: coaching session text → **`indexFile`** without requiring an upload. |

### 3.5 Where **additional feature-native corpus injection** is still needed

Use this checklist **only** after accounting for §3.1–3.4:

1. **Content is authoritative** in the feature DB or blob store **without** a guaranteed **`FileItem`** + **`_autoIndexFile`** path.
2. **Retrieval injection alone** is insufficient because nothing ever **wrote** chunks (e.g. chatbot never hits **`indexFile`**).
3. **Relying on the agent to call tools** is too fragile for compliance or UX (“user must remember to index”).

If any of these applies, add **`requestIngestion` / `indexFile`** at the **feature commit point** (or emit a domain event), with **`contextRef` / `chunkMetadata`** for **`feature_code`**, business ids, and **no secrets**.

### 3.6 Implementation pattern (feature-native corpus only)

1. **Commit point** — authoritative write in the feature or shared storage.
2. **Scope** — align with **`FileItem`** / **`ServiceCenterContext`** rules already used in **`indexFile`**.
3. **Unified façade** — one ingestion API; avoid a second embedding pipeline.
4. **Purge** — tie to **`fileId`**, business key, or future connector purge keys on revoke/delete.

### 3.7 Phasing (feature matrix — **not** the same numbering as roadmap **P1c/P1d/P3** above)

- **FM0:** For **each** row in §3.3, confirm **retrieval** vs **corpus** paths; document “satisfied by agent+upload+tools” vs “needs feature hook.”
- **FM1:** Implement **feature-native corpus** for one domain with a clear §3.5 gap (e.g. **trustee** entity text, **teamsbot** persisted transcript).
- **FM2:** **Chatbot** architecture decision: integrate **`serviceKnowledge`** or keep parallel retrieval; if integrated, add explicit **corpus** rules for config/FAQ.

---

## Implementation phases (suggested)

Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog), and **Teil 3.7** (feature matrix and feature-native corpus pilots). **P0** overlaps **Teil 3.7 FM0** (complete the per-feature matrix before large builds).

**Authority rollout (2026-04-24):** The **user-connection ingestion lane** (bootstrap + purge tied to **`UserConnection`**) is delivered **per OAuth authority**: **`msft` (P1a)**, **`google`** + **`clickup` (P1b)** — same consumer, dispatcher fan-out, purge-by-`connectionId`, and unit tests for walkers + consumer. **Next product slices:** **P1c** (daily refresh, **§2.5**), **consent + per-connection preferences + frontend** (**§2.6**), then **P3** (event bus at scale).

| Phase | Outcome |
|-------|---------|
| **P0 — Façade + idempotency** *(done, 2026-04-21)* | Single `requestIngestion` / `getIngestionStatus` entry point on `KnowledgeService` with content-hash idempotency, provenance in `structure._ingestion`, and structured logging (`ingestion.queued` / `ingestion.indexed` / `ingestion.skipped.duplicate` / `ingestion.failed`). All prior `indexFile` call sites now route through the façade: `routeDataFiles._autoIndexFile`, `commcoach/serviceCommcoachIndexer.indexSessionData`, `serviceAgent/coreTools/_workspaceTools.readFile`, `serviceAgent/coreTools/_documentTools.describeImage`. Agent tools no longer carry on-demand extraction + ingestion fallbacks — they are pure consumers of the knowledge store. **Teil 3.3** matrix audited. Three implementation bugs fixed during verification: stable content hash, pre-upsert `_ingestion` preservation, `mergeStrategy=None` for per-page granularity (see **§1.4 Implementation pitfalls**). |
| **P1a — User-connection hooks (Microsoft `msft`)** *(done, 2026-04-21)* | **`connection.established`** / **`connection.revoked`** emitted from **Microsoft** data-OAuth success paths and from **disconnect/delete** when the row is **`msft`** (incl. **`ConnectionStatus.REVOKED`** fix where **`INACTIVE`** was invalid). Central **`KnowledgeIngestionConsumer`** (`subConnectorIngestConsumer.py`, **`app.py`** lifespan) maps **`established`** → **`connection.bootstrap`** BackgroundJob and **`revoked`** → synchronous **`KnowledgeService.purgeConnection`** → **`interfaceDbKnowledge.deleteFileContentIndexByConnectionId`**. **`FileContentIndex.connectionId`** + **`sourceKind`** (and **`IngestionJob`** carrying both) make connector-sourced rows purgeable. **Bootstrap modules live for Microsoft:** **`subConnectorSyncSharepoint.py`** (`sourceKind="sharepoint_item"`, **`eTag`** as `contentVersion`, **`SharepointAdapter.browse`** with **`@odata.nextLink`** pagination) and **`subConnectorSyncOutlook.py`** (virtual **`outlook_message`** docs — header / snippet / cleaned body via **`cleanEmailBody`**, **`changeKey`** revisions, optional **`outlook_attachment`** child jobs). Dispatcher **`_bootstrapJobHandler`** runs **SharePoint + Outlook in parallel** for **`msft`**. Structured logs: **§ Structured ingestion logs**. **Retrieval threshold calibration (2026-04-21):** **`buildAgentContext`** **`minScore`** layers lowered to **`0.35`** so **`text-embedding-3-small`** matches real cosine scores; validated on **Outlook/SharePoint–indexed** content. **Tests (P1a):** purge, consumer **msft** dispatch, **`cleanEmailBody`**, **`bootstrapSharepoint`**, **`bootstrapOutlook`**. |
| **P1b — User-connection hooks (Google + ClickUp)** *(done, 2026-04)* | Parity with **`msft`**: **`routeSecurityGoogle`** / **`routeSecurityClickup`** call **`KnowledgeIngestionConsumer.onConnectionEstablished`** after token save; **`routeDataConnections`** disconnect/delete call **`onConnectionRevoked`** for **all** authorities. **`_bootstrapJobHandler`** fans out **google → `bootstrapGdrive` + `bootstrapGmail`** in parallel and **clickup → `bootstrapClickup`**. Walkers: `subConnectorSyncGdrive.py`, `subConnectorSyncGmail.py`, `subConnectorSyncClickup.py` + `subTextClean.py`. Unit tests: `test_bootstrap_gdrive.py`, `test_bootstrap_gmail.py`, `test_bootstrap_clickup.py`, extended `test_knowledge_ingest_consumer.py`. |
| **P1c — Connection refresh (lifecycle v1)** *(next)* | **Daily** (or nightly) **scheduled** re-run of the same bootstrap walkers for connections with **knowledge ingestion enabled** (**§2.6**). Reuses idempotency + fast-path; closes the **post-connect delta gap** without webhooks in v1. Observability: same log family as bootstrap; optional `event` suffix or `reason=scheduled_refresh` for shippers. |
| **P1d — Consent + preferences + UI** *(implemented; see the P1c / P1d checklist)* | Persist **§2.6** settings **per `connectionId`**; gate **`onConnectionEstablished`** / P1c jobs on user choice; **`frontend_nyla`** connection wizard + settings screen; walkers honor mail/file/ClickUp depth and **neutralization** flag. |
| **~~P2 — Profile & mandate snapshots~~** | **Removed from active roadmap** (focus: connections + feature corpus + scale). Target content remains documented in **§2.3** for a future re-entry when needed. |
| **P3 — Event bus** | Move direct calls to async consumer where load requires it (**Teil 2.4** scalable target). Remains in scope. |

### P1b checklist *(completed — kept for audit trail)*

1. **`routeSecurityGoogle`:** after successful **data** OAuth, enqueue the **same** ingestion consumer path as Microsoft (pass **`connectionId`**, **`AuthAuthority.google`**, mandate/user scope).
2. **`routeSecurityClickup`:** after successful OAuth / token persistence, same.
3. **`routeDataConnections`:** verify **disconnect_service** / **delete_connection** emit **revoke** (or call **`purgeConnection`**) for **google** and **clickup** rows, not only **msft**.
4. **`_bootstrapJobHandler`:** remove any **“unsupported_authority”** skip for **`google`** / **`clickup`** once walkers are registered; keep the skip only for **future** authorities.
5. **Quality bar:** T10/T12–T15 in the testplan — extend from **Microsoft-only** assumptions to **all three** **`routeDataConnections`** OAuth authorities.

### P1c / P1d checklist *(P1c next; P1d implemented)*

1. **P1c:** BackgroundJob or cron entry; feature flag; per-tenant stagger; only connections with **knowledge ingestion = on**; metrics on `indexed` vs `skippedDup` per run.
2. **P1d ✅ — implemented:**
   - [x] **`UserConnection`** extended with `knowledgeIngestionEnabled: bool` (default `False` = strict opt-in) and `knowledgePreferences: Optional[Dict]` (`schemaVersion=1`); DB auto-migration adds columns on startup.
   - [x] **`routeDataConnections` `create_connection`** accepts `knowledgeIngestionEnabled` + `knowledgePreferences` in request body and persists them before returning.
   - [x] **OAuth callbacks** (`routeSecurityGoogle`, `routeSecurityMsft`, `routeSecurityClickup`) gate `callbackRegistry.trigger("connection.established", …)` on `connection.knowledgeIngestionEnabled`; emit structured log `ingestion.connection.bootstrap.skipped reason=consent_disabled` when disabled.
   - [x] **`_bootstrapJobHandler`** defensive re-check: loads connection via `getUserConnectionById` and no-ops if flag was disabled after OAuth (race protection).
   - [x] **`IngestionJob.neutralize: bool`** added; `requestIngestion` + `_indexFileInternal` thread it through; for `sourceKind != "file"` the flag drives `_shouldNeutralize` directly; for `sourceKind == "file"` the `FileItem.neutralize` column remains authoritative.
   - [x] **`subConnectorPrefs.py`** — `loadConnectionPrefs(connectionId)` helper + `ConnectionIngestionPrefs` dataclass with safe defaults for all §2.6 keys.
   - [x] **All five walkers** (Gmail, GDrive, ClickUp, Outlook, SharePoint) load prefs at bootstrap start; limits structs gain `mailContentDepth` + `neutralize` (mail walkers), `filesIndexBinaries` (Drive), `clickupScope` (ClickUp), and `neutralize` (all).
   - [x] **Unit tests** (`test_p1d_consent_prefs.py` — 10 tests): consent gate no-op, prefs defaults + full mapping, Gmail depth modes (metadata/snippet/full), ClickUp scope (titles vs description).
   - [x] **Frontend** (`frontend_nyla`): `AddConnectionWizard` 4-step modal (connector → consent → preferences → summary + OAuth); old three-button row replaced with single „Verbindung hinzufügen“ button; `createConnectionAndAuth` hook method; `KnowledgePreferences` type in `connectionApi.ts`.

**Default policy (document for deploy):** `knowledgeIngestionEnabled` defaults to `False` for all new connections. Existing connections (before P1d deploy) have the column `NULL`/`False` — **no bootstrap is triggered retroactively**. Users must explicitly opt in via the wizard or connection settings. If the team decides to migrate existing connections to `True`, a one-time migration script must be run and communicated via release note.

---

## Ziel und Nicht-Ziele

**Goals:**

- One **ingestion contract** for all features and connector lifecycles.
- Indexing **decoupled** from the agent loop (agents may still *invoke* tools that ultimately call ingestion, but ingestion must not *depend* on an agent run).
- **Explicit** handling of connection establishment, sync, and revocation.
- **Bounded** indexing of user/mandate context with a clear PII policy.
- **Explicit user consent** and **per-connection** ingestion preferences (incl. optional **neutralization**) before connector content enters the knowledge store (**§2.6**).

**Explicitly not:**

- Moving **retrieval** (`buildAgentContext`) out of agents.
- Guaranteeing **real-time** indexing for every byte without async jobs (latency targets are product decisions).
- Indexing **everything** in the database “because we can”—only curated, policy-approved surfaces.

---

## Betroffene Module (erwartet)

- **Gateway:** `serviceKnowledge`, file upload routes, connector OAuth handlers, sync workers, possibly new `serviceKnowledgeIngest` or package under `modules/serviceCenter/services/`.
- **Interfaces:** `interfaceDbKnowledge` extensions for source metadata if needed; **`interfaceDbApp`** (or adjacent) for **per-`connectionId`** ingestion preferences from **§2.6**.
- **Frontend:** `frontend_nyla` — connection wizard + connection detail settings (consent, depth toggles, neutralization, time window).
- **Wiki / Reference:** `b-reference/gateway/ai-agent.md` (ingestion vs. retrieval) after implementation.

---

## Offene Entscheidungen

| Topic | Options |
|-------|----------|
| **Email bodies** | Default product stance is **user-configurable per connection** (**§2.6** table: metadata / snippet / full cleaned body); mandate policy may still cap max tier. |
| **Multi-tenant isolation audits** | Periodic job to verify chunk `mandateId` matches connection |
| **Cost caps** | Per-mandate embedding budget; defer large backfills |
| **Neutralization** | **User opt-in** per connection (**§2.6**); optional **mandate floor** (“never below snippet+neutralize for mail”) remains a separate governance decision. |
| **Provenance shape** | First-class DB columns vs **documented `chunkMetadata` keys** for `connectionId`, external id, revision (must support **Teil 2** purge rules). |
| **In-flight duplicate handling** | Accept `status ∈ {"extracted","embedding","indexed"}` with matching hash as in-progress (cheap, lossy under failure) **vs** per-`sourceId` `asyncio.Lock` in `KnowledgeService` (strict, requires singleton) — see **§1.4 Deferred (ingestion idempotency hardening)**. |
| **Pre-extraction dedup shortcut** | Short-circuit `_autoIndexFile` via the file-bytes SHA in `interfaceDbManagement` before running `runExtraction` (~15 s saved per re-index of a large PDF) — see **§1.4 Deferred (ingestion idempotency hardening)**. |

---

## Structured ingestion logs (P1 schema)

The connection-lifecycle lane emits the following structured log events. **`part`** values **`sharepoint`**, **`outlook`**, **`gdrive`**, **`gmail`**, and **`clickup`** are all **implemented** for bootstrap; **P1c** may add the same events with a distinguishable `reason` / `jobType` for **scheduled refresh** (exact field TBD in implementation).

Each event is a single `logger.info` / `.warning` / `.error` call with a stable `extra={"event": ...}` field so downstream log shippers can route on `event` without parsing the message string.
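
As an illustration of the `extra={"event": ...}` convention, the sketch below uses only the standard `logging` module. The logger name, the `ListHandler` class, and the helper `logBootstrapQueued` are hypothetical; the event name and keys are taken from this section's schema.

```python
import logging
from typing import List


class ListHandler(logging.Handler):
    """Collects records in memory; stands in for a real log shipper."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("ingestion.sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)


def logBootstrapQueued(connectionId: str, authority: str) -> None:
    # Keys passed via `extra` become attributes on the LogRecord.
    logger.info(
        "bootstrap queued for connection %s",
        connectionId,
        extra={
            "event": "ingestion.connection.bootstrap.queued",
            "connectionId": connectionId,
            "authority": authority,
        },
    )


logBootstrapQueued("conn-42", "msft")

# A shipper can route on the stable attribute, never on the message text.
queuedEvents = [
    r for r in handler.records
    if getattr(r, "event", None) == "ingestion.connection.bootstrap.queued"
]
```

Note that `extra` keys must not collide with built-in `LogRecord` attributes such as `message` or `name`; the camelCase keys used in this schema avoid that.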
| `event` | Severity | Emitter | Required `extra` keys | Meaning | |---------|----------|---------|------------------------|---------| | `ingestion.connection.bootstrap.queued` | info | `KnowledgeIngestionConsumer._onConnectionEstablished` | `connectionId`, `authority` | A `connection.established` callback was received and a `connection.bootstrap` BackgroundJob is being enqueued. | | `ingestion.connection.bootstrap.started` | info | `bootstrap{Sharepoint,Outlook,Gdrive,Gmail,Clickup}` | `connectionId`, `part` (`sharepoint` \| `outlook` \| `gdrive` \| `gmail` \| `clickup`) | The per-part bootstrap walker has begun work. | | `ingestion.connection.bootstrap.progress` | info | bootstrap walkers | `connectionId`, `part`, `processed`, `skippedDup`, `failed` | Heart-beat every ~50 items so long-running runs are observable. | | `ingestion.connection.bootstrap.done` | info | bootstrap walkers + façade-level totals | `connectionId`, `part`, `indexed`, `skippedDup`, `skippedPolicy`, `failed`, `durationMs` (Outlook/Gmail add `attachmentsIndexed`; SharePoint/Drive add `bytes`; ClickUp adds `workspaces` + `lists`) | Walker finished cleanly. | | `ingestion.connection.bootstrap.failed` | error | `_bootstrapJobHandler` | `part`, `connectionId`, `error` | One bootstrap part raised — recorded but the other parts still complete. | | `ingestion.connection.bootstrap.skipped` | info | `_bootstrapJobHandler` + OAuth callbacks + defensive check in `_bootstrapJobHandler` | `connectionId`, `authority`, `reason` (`unsupported_authority` │ `consent_disabled`) | Authority has no bootstrap module registered (e.g. a future provider) — **or** user has not consented (`knowledgeIngestionEnabled=False`). | | `ingestion.connection.purged` | info | `_onConnectionRevoked` | `connectionId`, `authority`, `reason`, `indexRows`, `chunks` | Knowledge purge for a revoked connection completed; numbers reflect the deleted rows. 
| | `ingestion.connection.purged.failed` | error | `_onConnectionRevoked` | `connectionId`, `error` | Purge raised; the revoke event was still acknowledged upstream. | All events should keep field naming consistent with the existing `ingestion.queued / .indexed / .skipped.duplicate / .failed` family from P0 (camelCase, `connectionId`, `mandateId`, `userId`). Counters are integers, durations are in milliseconds. ## Links - **How-to / orientation:** [Unified knowledge & RAG ingestion (guide)](../../d-guides/unified-knowledge-rag.md) - **Gateway reference (retrieval + knowledge):** `wiki/b-reference/gateway/architecture.md`, `wiki/b-reference/gateway/ai-agent.md` - **Implementation touchpoints (indicative):** `gateway/modules/serviceCenter/services/serviceKnowledge/mainServiceKnowledge.py`, `gateway/modules/routes/routeDataFiles.py`, `gateway/modules/features/commcoach/serviceCommcoachIndexer.py`, agent `coreTools` `_documentTools` / `_workspaceTools`, `gateway/modules/datamodels/datamodelExtraction.py` (`ExtractionOptions.mergeStrategy: Optional[MergeStrategy]`). - **Unit tests (P0 guardrails):** `gateway/tests/unit/services/test_ingestion_hash_stability.py`, `gateway/tests/unit/services/test_extraction_merge_strategy.py`. - **Unit tests (P1a — Microsoft, done):** `gateway/tests/unit/services/test_connection_purge.py`, `gateway/tests/unit/services/test_knowledge_ingest_consumer.py` (incl. **msft** fan-out), `gateway/tests/unit/services/test_clean_email_body.py`, `gateway/tests/unit/services/test_bootstrap_sharepoint.py`, `gateway/tests/unit/services/test_bootstrap_outlook.py`. - **Unit tests (P1b — Google + ClickUp, done):** **`test_knowledge_ingest_consumer`** (google / clickup fan-out), **`test_bootstrap_gmail.py`**, **`test_bootstrap_gdrive.py`**, **`test_bootstrap_clickup.py`**. **P1d (done):** **`test_p1d_consent_prefs.py`** (10 tests: consent gate, prefs parsing, Gmail depth modes, ClickUp scope). **P1c:** add scheduler tests when implemented. 
- **P1 implementation touchpoints:** `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorIngestConsumer.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncSharepoint.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncOutlook.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGdrive.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGmail.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncClickup.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subTextClean.py`, `gateway/modules/interfaces/interfaceDbKnowledge.py` (`deleteFileContentIndexByConnectionId`), `gateway/modules/datamodels/datamodelKnowledge.py` (`FileContentIndex.connectionId` + `sourceKind`), `gateway/modules/connectors/providerMsft/connectorMsft.py` (`@odata.nextLink`-loop in `SharepointAdapter.browse`, `eTag` in `_graphItemToExternalEntry`), `gateway/modules/connectors/providerGoogle/connectorGoogle.py` (P1b: Drive + Gmail revision keys and download/export paths), `gateway/modules/routes/routeSecurityMsft.py` (P1a callbacks), `gateway/modules/routes/routeSecurityGoogle.py` and `gateway/modules/routes/routeSecurityClickup.py` (P1b: parity callbacks), `gateway/modules/routes/routeDataConnections.py` (revoke for **all** authorities), `gateway/app.py` (consumer registration in lifespan). ## Akzeptanzkriterien (Plan-Ebene) | # | Kriterium | Prio | |---|-----------|------| | 1 | Every new **file** that should be searchable triggers ingestion **without** requiring an agent session. | must | | 2 | **User connection** connect / disconnect has defined ingestion or purge behavior **for each** OAuth authority **`routeDataConnections`** supports (**P1a** **`msft`**, **P1b** **`google`** / **`clickup`**); **plus** user-controlled **opt-in** and **preference bundle** before ingestion (**P1d**, **§2.6**). 
| must |
| 3 | **Profile/mandate** snapshot ingestion (**former roadmap P2**) is **deferred**; when re-opened, snapshots must use an explicit allowlist and never embed secrets. Until then, **§2.6** consent + neutralization covers connector-sourced PII risk. | should (reactivated when P2 returns) |
| 4 | Ingestion is **idempotent** for unchanged content (no duplicate embedding work). Verified 2026-04-21 on a 500-page PDF: second re-index trigger logs `ingestion.skipped.duplicate` with a stable hash, zero embedding API calls. See **§1.4 pitfalls** for the three bug classes that had to be fixed first. | must |
| 5 | **Teil 3.3** matrix completed: every `modules/features/*` product row has **retrieval** (agent vs none), **corpus** (upload / tools / feature indexer), and **gap** explicitly stated; a product is not labeled “non-injecting” if **`AgentService`** already provides retrieval injection. | should |

---

## Testplan (concept verification)

| ID | Question | Method |
|----|----------|--------|
| T1 | Are all existing index entry points inventoried? | Code audit + table in the build phase |
| T2 | Is it clear which features use **retrieval** (agent), only **corpus**, or both? | Review the **Teil 3.3** matrix against the `runAgent` / `indexFile` call sites |
| T3 | Does **platform RAG retrieval** (`buildAgentContext` via `AgentService`) keep its current role for workspace / graphical editor / CommCoach unchanged? | Review `agentLoop` + `mainServiceAgent._createBuildRagContextFn` |
| T4 | Can revoke/purge for connector chunks be specified today as a **metadata convention**, without a **connectionId column**? | Review **Teil 2.1** + **Offene Entscheidungen** (provenance) |
| T5 | Is revoke/purge per **user-connection authority / integration surface** (Teil 2.2; only `msft` / `google` / `clickup` according to `routeDataConnections`) covered in a threat model? | Data flow: connection → `FileItem` / virtual doc → chunks |
| T6 | Is the content hash stable between two extraction runs of the same file (different `contentObjectId` UUIDs, identical payload)? | Unit: `tests/unit/services/test_ingestion_hash_stability.py` (5 cases: UUID regeneration, data delta, order delta, type delta, empty). Live: a second trigger on an already indexed file logs `ingestion.skipped.duplicate` with an identical hash (verified 2026-04-21). |
| T7 | Are per-page chunks preserved for multi-page PDFs (no `MergeStrategy` concatenation)? | Unit: `tests/unit/services/test_extraction_merge_strategy.py`. Live: 500-page PDF → 563 ContentObjects, 567 embedding chunks in 24 batches (verified 2026-04-21). |
| T8 | Do `_ingestion.hash` and `status="indexed"` survive a pre-scan re-upsert in `_autoIndexFile`? | Review `routeDataFiles._autoIndexFile`, line ~127: the existing row is read before the upsert, and `_ingestion` + `indexed` are merged into the fresh `contentIndex`. Live: second trigger → `ingestion.skipped.duplicate` instead of re-embedding. |
| T9 | Does a `connection.revoked` event remove **all** `FileContentIndex` rows + `ContentChunk`s of a connection and **nothing else** (uploads without `connectionId` and other connections stay intact)? | Unit: `tests/unit/services/test_connection_purge.py` (3 cases: positive purge, empty `connectionId` no-op, unknown `connectionId`). |
| T10 | Does the `KnowledgeIngestionConsumer` correctly dispatch `connection.established` as an asynchronous `connection.bootstrap` job (**P1a:** **msft** → SharePoint + Outlook in parallel; **P1b:** **google** → Drive + Gmail in parallel; **clickup** → Tasks) and `connection.revoked` synchronously as a purge, **for each** of the three **`routeDataConnections`** authorities? | **P1a + P1b (done):** `test_knowledge_ingest_consumer.py` covers all three authorities + revoke; unknown authorities yield `skipped.reason="unsupported_authority"`. **P1d:** additionally, dispatch only when **Consent = yes**. |
| T11 | Does `cleanEmailBody` reduce a realistic Outlook HTML message to the sender's own body content (HTML strip, quote strip EN+DE, signature strip, whitespace collapse, `maxChars` truncation)? | Unit: `tests/unit/services/test_clean_email_body.py` (8 cases). Consequence: `bootstrapOutlook` never sends HTML, quoted replies, or signatures into the embedding pipeline step. |
| T12 | Are the bootstrap walkers for SharePoint and Outlook idempotent across a second run with unchanged `eTag` / `changeKey`? | Unit: `tests/unit/services/test_bootstrap_sharepoint.py` + `tests/unit/services/test_bootstrap_outlook.py`. Mock adapters return stable revisions; the KnowledgeService fake reports `duplicate`, and the result object tallies `skippedDuplicate`. |
| T13 | Does `bootstrapGmail` walk `INBOX + SENT`, parse MIME bodies (preferring `text/plain`, falling back to `text/html`), follow `nextPageToken` pagination, and stay idempotent against identical `historyId` revisions? | **P1b (done):** Unit `test_bootstrap_gmail.py`. **P1d:** the walker respects the **Content depth** from **§2.6** (metadata/snippet/body). |
| T14 | Does `bootstrapGdrive` walk My Drive recursively (folder MIME detection, `maxDepth`), respect the `maxAgeDays` recency filter, and stay idempotent against identical `modifiedTime` revisions? | **P1b (done):** Unit `test_bootstrap_gdrive.py`. **P1d:** “binary files” / MIME allowlist from **§2.6**. |
| T15 | Does `bootstrapClickup` walk workspaces → spaces → folder/folderless lists → tasks under the `maxWorkspaces` / `maxListsPerWorkspace` / `maxTasks` caps, respect the `maxAgeDays` recency filter, and stay idempotent against identical `date_updated` revisions? | **P1b (done):** Unit `test_bootstrap_clickup.py`. **P1d:** ClickUp **scope** (title/description/comments) from **§2.6**. |
| T16 | Does the **P1c** daily job run only connections with **knowledge injection = on**, and do costs/API limits stay manageable through idempotency + fast path? | Integration or unit test with a fake clock: second run → mostly `skippedDup`; logs `ingestion.connection.bootstrap.*` with a recognizable scheduled `reason` (if implemented). |
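
T6 depends on a content hash that ignores per-run identifiers and reacts only to payload, type, and order. The following is a minimal sketch of that idea, not the gateway's actual implementation: the function name `stable_content_hash`, the helper `extract`, and the field names `id`, `contentType`, and `data` are hypothetical and chosen only for illustration.

```python
import hashlib
import json
import uuid


def stable_content_hash(content_objects):
    """Hash type and payload of each object in order; ignore per-run ids.

    Assumption: each content object is a dict with hypothetical keys
    'id' (regenerated per extraction run), 'contentType', and 'data'.
    """
    digest = hashlib.sha256()
    for obj in content_objects:
        # Canonical JSON so key order and whitespace cannot change the hash.
        record = json.dumps(
            {"type": obj.get("contentType", ""), "data": obj.get("data", "")},
            sort_keys=True,
            ensure_ascii=False,
            separators=(",", ":"),
        ).encode("utf-8")
        # Length prefix keeps record boundaries unambiguous.
        digest.update(len(record).to_bytes(8, "big"))
        digest.update(record)
    return digest.hexdigest()


def extract(pages, content_type="text"):
    """Stand-in for an extraction run: fresh UUIDs, caller-supplied payload."""
    return [
        {"id": str(uuid.uuid4()), "contentType": content_type, "data": page}
        for page in pages
    ]


first_run = extract(["page one", "page two"])
second_run = extract(["page one", "page two"])

# Same payload, different UUIDs: the hash must not change.
assert stable_content_hash(first_run) == stable_content_hash(second_run)
# Reordered pages count as a change in this sketch.
assert stable_content_hash(first_run) != stable_content_hash(
    extract(["page two", "page one"])
)
```

Under these assumptions, a second trigger on an unchanged file produces the same digest and can be skipped before any embedding call, while a data, order, or type change produces a new digest and triggers re-indexing.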