From 44e34a4cd7e74da091bbb16c54a7b7c0e484179c Mon Sep 17 00:00:00 2001
From: Ida <i.dittrich@valueon.ch>
Date: Thu, 16 Apr 2026 18:25:10 +0200
Subject: [PATCH] unified RAG plan

---
 ...-unified-knowledge-indexing-rag-concept.md | 409 ++++++++++++++++++
 1 file changed, 409 insertions(+)
 create mode 100644 c-work/1-plan/2026-04-unified-knowledge-indexing-rag-concept.md

diff --git a/c-work/1-plan/2026-04-unified-knowledge-indexing-rag-concept.md b/c-work/1-plan/2026-04-unified-knowledge-indexing-rag-concept.md
new file mode 100644
index 0000000..672ed44
--- /dev/null
+++ b/c-work/1-plan/2026-04-unified-knowledge-indexing-rag-concept.md
@@ -0,0 +1,409 @@
+<!-- status: plan -->
+<!-- started: 2026-04-16 -->
+<!-- component: gateway | platform | frontend-nyla -->
+
+# Unified Knowledge Indexing — One RAG Corpus for All Platform Information
+
+## How to read this document
+
+| Section | Content |
+|---------|---------|
+| **Beschreibung und Kontext** | Scope (**ingestion vs retrieval**), **terminology** (feature / service / connector / interface), **as-is vs target**, business case and risks. |
+| **Teil 1** | Ingestion as its **own lifecycle**: façade API, idempotency, orchestration—**not** owned by `agentLoop`. |
+| **Teil 2** | **Triggers** (beyond upload): **user connections**, account snapshots, purge; **per-connection-type** indexing guidance; event-driven option. |
+| **Teil 3** | **Feature injection** split into **retrieval** (agent + `buildAgentContext`) vs **corpus** (`indexFile`); **matrix** per `modules/features/*` product; real **gaps** vs false “non-injection”. |
+| **Implementation phases · Ziele · AC · Testplan** | Rollout, explicit non-goals, acceptance criteria, verification. |
+
+**Single-sentence summary:** Keep **retrieval** on **`AgentService`**; unify **when and how** the shared **`interfaceDbKnowledge`** corpus is **filled** (routes, **user connections** / integrations, features, snapshots) behind one **ingestion contract**, without assuming every product uses the workspace agent.
+
+## Beschreibung und Kontext
+
+### Scope of this document
+
+We distinguish **ingestion** (chunking, embedding, persisting into **`interfaceDbKnowledge`**) from **retrieval** (semantic search + `buildAgentContext` for the LLM). **Retrieval** for the **unified knowledge store** is consumed primarily through **`serviceAgent`** / `runAgent` (workspace, graphical editor, CommCoach). Other products (e.g. **chatbot**, **teamsbot**) may use **different** LLM stacks—**Teil 3** maps who gets platform RAG vs who does not. This plan does **not** mandate one global retrieval path for every feature; it **does** mandate a **single ingestion story** into the same corpus where that corpus is used. The gap we address is **how and when** the corpus is **filled**, not how every LLM entry point **reads** it.
+
+### Terminology (Gateway — see `wiki/b-reference/gateway/architecture.md`)
+
+This concept separates **feature modules**, **services**, **connectors**, and **interfaces**. Conflating them produces wrong ownership (e.g. treating “SharePoint” as a `modules/features/` product, or treating “mail” as if it were `serviceKnowledge`).
+
+| Term | Where in Gateway | Role for indexing |
+|------|------------------|-------------------|
+| **Feature** | `modules/features/*` (e.g. `workspace`, `graphicalEditor`, `commcoach`, `trustee`, `chatbot`) | Product domains: UI, feature routers, orchestration. They **trigger** actions (upload, sync UX, feature-specific indexers) but must not be the **only** place that starts embedding work. |
+| **Service** | `modules/serviceCenter/services/*` | Cross-cutting facades: **`serviceKnowledge`** (indexing, search, `buildAgentContext`); **`serviceExtraction`** (content objects); **`serviceChat`** (chat/workflow documents); **`serviceMessaging`** (e-mail, notifications); **`serviceAgent`** (tools that may *indirectly* call indexing). **Unified ingestion** is primarily a **service-layer** responsibility. |
+| **Connector** | `modules/connectors/*` (Microsoft, Google, …) | Vendor adapters: OAuth, list, download. **SharePoint** and **mailbox** I/O live here; routes/features **call** connectors—they are not interchangeable with a feature or with `serviceKnowledge`. |
+| **Interface** | `modules/interfaces/*` | Persistence contracts: **`interfaceDbKnowledge`** (`FileContentIndex`, `ContentChunk`, …), **`interfaceDbManagement`** (`FileItem`, `DataSource`), **`interfaceDbApp`** (User, Mandate, `UserConnections`, Preferences). Profile, mandate, and connection rows are **interface-backed**, not a single “profile feature”. |
+
+### What we have today (as-is)
+
+**1. A strong technical “write” implementation, but no product-wide ingestion contract**
+
+- **`serviceKnowledge`** (`mainServiceKnowledge.py`) already implements the heavy lifting: **`indexFile`** resolves scope from **`FileItem`** (single source of truth), optional neutralization, sentence-aware chunking, **`serviceAi`** embeddings, **`ContentChunk`** persistence, and status on **`FileContentIndex`**. That is the right **unit of work** once **content objects** exist.
+- **Retrieval** is also centralized in the same service: **`buildAgentContext`** composes multi-layer context for agents. So **read** and **write** to the vector store are **service-owned**; what is **not** unified is **who may call writes**, with which **idempotency**, and on which **lifecycle events**.
+
+**2. Multiple invocation lanes, same underlying method**
+
+Indexing is **operationally** reached from a **mix of layers** (not one façade):
+
+- **HTTP routes** — e.g. file pipeline in **`routeDataFiles`**: pre-scan / extraction → **`indexFile`**. This is the “happy path” for uploads.
+- **Agent tools** — **`serviceAgent`** core tools (e.g. document/workspace helpers) can call **`indexFile`** when the user interacts through the agent. That ties **embedding** to **an agent session** even when the same file could have been indexed on upload.
+- **Feature-specific code** — e.g. **CommCoach** indexer paths that call **`indexFile`** for that product’s artifacts. Correct for the feature, but it is **another** ad hoc entry point with its own assumptions.
+- **Connectors** — Microsoft/Google (and similar) packages can fetch bytes and ultimately produce files or blobs; **OAuth and delta sync** are not yet modeled everywhere as **first-class ingestion lifecycles** (connect → backfill → incremental → revoke) that all funnel through the same API and metadata.
+
+There is **no single** `requestIngestion(...)`, **no standard job identity** for “this external item revision”, and **no one place** that records “this mandate revoked access → tombstone these chunks”.
+
+**3. Extraction vs indexing: clear in code, not enforced at the platform edge**
+
+- **`serviceExtraction`** (and preprocessing helpers) produce **content objects**; **`indexFile`** consumes them. The boundary is clean **inside** the pipeline, but nothing **forces** every new binary or external document through a single orchestrated “extract then index” step, so some paths may skip that step, duplicate it, or call **`indexFile`** with partial metadata.
+
+**4. Truth for scope and identity lives in interfaces—not in “features”**
+
+- **`interfaceDbManagement`** (`FileItem`, …) and **`interfaceDbApp`** (mandate, `UserConnections`, user profile fields) define **who may see what**. **`indexFile`** already mirrors **`FileItem`** for scope; that pattern is **good** but **not generalized** to connector-backed items, virtual documents, or curated “account snapshot” chunks. If a connector writes under a different mental model of `mandateId` / `featureInstanceId`, **`interfaceDbKnowledge`** can drift from app/management truth without a systematic reconcile.
+
+**5. User/mandate/profile deltas are not first-class ingestion events**
+
+- Changes to membership, preferences, or connections update **`interfaceDbApp`** (and related tables). They affect **searchability and personalization** but are **not** consistently reflected as **versioned, allowlisted** chunks in the knowledge store—unless a feature manually adds text somewhere. That leaves agents either **under-informed** or dependent on **non-RAG** code paths for the same facts.
+
+**Summary (as-is):** The **engine** for indexing is **`serviceKnowledge.indexFile`**; the **policy graph** for *when* to run it is **implicit** and **spread across** routes, tools, and features. **Connectors** and **account/mandate** data are **not** uniformly treated as **ingestion sources** with connect/sync/revoke semantics.
+
+### What would make more sense (target)
+
+**1. One ingestion façade at the service boundary (not inside `agentLoop`)**
+
+- A small, stable API (conceptually **`requestIngestion` / `getIngestionStatus`**, implemented atop or beside **`KnowledgeService`**) that **every** lane calls: routes, feature hooks, **connector sync workers**, and (if needed) agent tools as **thin** delegates.
+- **Idempotency** (content hash, external revision, `eTag`, …) enforced **here**, so routes and tools cannot accidentally **double-embed** the same logical object.
+
+**2. Lifecycle parity for connectors and “connections”**
+
+- **Establish** → register datasource + optional short **non-secret** summary chunk + enqueue **backfill**.
+- **Delta** → incremental jobs with persisted cursors.
+- **Revoke / token invalid / GDPR** → **tombstone or purge** by `connectionId` / `sourceKind`, aligned with RBAC—not ad hoc deletes scattered in UI code.
+
+**3. Provenance / `sourceKind` (schema or `chunkMetadata`)**
+
+- Today chunks are **file-anchored**; extended provenance (internal file vs SharePoint item vs mailbox artifact vs `profile_snapshot`, **`connectionId`** for purge, revision keys) should be **consistent**—either **first-class fields** on `ContentChunk` / index rows **or** a **defined convention** inside **`chunkMetadata` / `contextRef`** until a migration is justified. Goal: retrieval, audit, and **connector revoke** cleanup are **data-driven**, not inferred only from call site.
+
+**4. Curated snapshots for interface-backed facts**
+
+- **Allowlisted** projections of mandate membership, locale, entitlements (labels), etc., regenerated on **interface-level** events—**not** dumping full user rows or secrets into embeddings.
+
+**5. Keep retrieval exactly where it is**
+
+- **`buildAgentContext`** remains the agent’s way to **consume** the corpus; ingestion only ensures that corpus is **complete, scoped, and attributable** when the agent runs.
+
+**6. Observability and cost in one place**
+
+- Queue depth, embedding spend, failures, and “skipped duplicate” counts attach to the **ingestion façade**, not to each feature.
+
+### Business goal
+
+Whenever **meaningful information** appears—files, bytes from **connectors**, configuration that should shape answers, and **bounded** user/mandate context—the platform should **ingest it once** into a **unified, scoped** knowledge layer so agents see **one coherent corpus** with clear **provenance** and **permissions**.
+
+### Why this matters now
+
+Information deltas arrive through **routes**, **features**, **`serviceAgent`** tools, **connectors**, and **`interfaceDbApp`** / **`interfaceDbManagement`** updates. Without **one** ingestion contract and triggers per **source**, you get: **missing** indexes, **duplicate** work, **scope drift** between knowledge rows and app truth, and **repeated** engineering per entry path instead of **once** at the **service** layer.
+
+### Risk if we do not unify
+
+Fragmented memory, inconsistent agent answers, compliance gaps (over-indexing sensitive fields or under-indexing allowed context), and duplicated work **per route/feature/tool** instead of at a **single service boundary**.
+
+---
+
+## Teil 1 — Indexing as its own lifecycle (not owned by the agent)
+
+### 1.1 Current useful core
+
+*(Same technical point as **“What we have today” §1** above; repeated here for readers who start at Teil 1.)* After structured **content objects** exist, **`KnowledgeService.indexFile`** performs chunking, embedding (via **`AiService`**), neutralization when required, and persistence via **`interfaceDbKnowledge`**. The **gap** is not the lack of a service method but the lack of a **single product-wide contract** for *when* and *what* enters that pipeline.
+
+### 1.2 Target responsibility split
+
+| Concern | Owner | Notes |
+|--------|--------|--------|
+| **Ingestion** (normalize → chunk → embed → store) | **Knowledge ingestion service** (logical module; may remain `KnowledgeService` + new façade) | No dependency on `agentLoop`. |
+| **Retrieval** (query → ranked context string) | **Agent** (and similar LLM entry points) | Unchanged by this concept. |
+| **Orchestration** (queues, retries, backoff) | **Job runner / worker** (new or existing infra) | Keeps API latency low. |
+
+### 1.3 Public ingestion contract (conceptual)
+
+Introduce a small, stable API surface that **all features** call, whether or not an agent session ever runs:
+
+- **`requestIngestion(job: IngestionJob) -> IngestionHandle`**  
+  - Idempotent key: `(sourceKind, sourceId, contentVersion | hash)`  
+  - Returns immediately with `queued` / `duplicate` / `skipped` and optional `jobId` for status polling.
+
+- **`getIngestionStatus(handle)`**  
+  - Surfaces the same states already used on `FileContentIndex` (`pending`, `extracted`, `embedding`, `indexed`, `failed`) plus connection- or source-specific substates if needed.
+
+The implementation can stay in-process at first (asyncio task queue) and move to Redis/Celery/ARQ later without changing callers.
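+
+The contract above can be sketched as follows. This is a minimal in-memory illustration, not the Gateway implementation: the class `InMemoryIngestionFacade`, the field names on `IngestionJob` / `IngestionHandle`, and the example `sourceKind` values are assumptions made for this sketch. Only the idempotency key `(sourceKind, sourceId, contentVersion)` and the `queued` / `duplicate` results come from the text.
+
+```python
+import uuid
+from dataclasses import dataclass
+from typing import Dict, Optional, Tuple
+
+
+@dataclass(frozen=True)
+class IngestionJob:
+    sourceKind: str       # hypothetical values, e.g. "file" or "sharepoint_item"
+    sourceId: str
+    contentVersion: str   # content hash or external revision
+
+
+@dataclass(frozen=True)
+class IngestionHandle:
+    jobId: Optional[str]
+    state: str            # "queued" or "duplicate" in this sketch
+
+
+class InMemoryIngestionFacade:
+    """Illustrative stand-in for the facade; not the Gateway implementation."""
+
+    def __init__(self) -> None:
+        self._jobIdByKey: Dict[Tuple[str, str, str], str] = {}
+        self._statusByJobId: Dict[str, str] = {}
+
+    def requestIngestion(self, job: IngestionJob) -> IngestionHandle:
+        key = (job.sourceKind, job.sourceId, job.contentVersion)
+        existingJobId = self._jobIdByKey.get(key)
+        if existingJobId is not None:
+            # Same logical object and version: do not embed twice.
+            return IngestionHandle(jobId=existingJobId, state="duplicate")
+        jobId = uuid.uuid4().hex
+        self._jobIdByKey[key] = jobId
+        self._statusByJobId[jobId] = "pending"  # first FileContentIndex-style state
+        return IngestionHandle(jobId=jobId, state="queued")
+
+    def getIngestionStatus(self, handle: IngestionHandle) -> str:
+        return self._statusByJobId.get(handle.jobId or "", "unknown")
+```
+
+Because callers only see `requestIngestion` and `getIngestionStatus`, the dictionaries could later be replaced by a persistent queue without touching routes, features, or tools.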
+
+### 1.4 Idempotency and versioning
+
+- **Re-index** when content changes: compare **content hash** or **external revision** (SharePoint `eTag`, email `Message-ID` + folder cursor, file `updatedAt`).
+- **Skip** when hash unchanged to control embedding cost.
+- **Tombstone** or **scope-disable** when a source is deleted or access revoked (see Teil 2).
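+
+The three rules above reduce to one comparison between the stored version and the current one. The sketch below assumes a version is either a content hash or an external revision string, and that `None` means "no stored index" or "source no longer available"; the function names and the returned action labels are hypothetical.
+
+```python
+import hashlib
+from typing import Optional
+
+
+def contentHash(data: bytes) -> str:
+    """Hex SHA-256 digest, usable as contentVersion for internal files."""
+    return hashlib.sha256(data).hexdigest()
+
+
+def decideIngestionAction(storedVersion: Optional[str],
+                          currentVersion: Optional[str]) -> str:
+    """Return "index", "reindex", "skip", or "tombstone"."""
+    if currentVersion is None:
+        # Source deleted or access revoked.
+        return "tombstone" if storedVersion is not None else "skip"
+    if storedVersion is None:
+        return "index"
+    # Unchanged hash or revision: skip to control embedding cost.
+    return "skip" if storedVersion == currentVersion else "reindex"
+```
+
+For connector items the same function would receive the external revision (for example a SharePoint `eTag`) instead of a hash, so no download is needed to decide that nothing changed.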
+
+---
+
+## Teil 2 — Triggers: not only “file write”, but every information delta
+
+“Write path” is too narrow if we read it as “HTTP upload only”. The unified model should treat **any authoritative addition or change of platform-visible information** as a potential ingestion trigger.
+
+### 2.1 Trigger taxonomy
+
+| Trigger category | Examples | Ingestion behavior (conceptual) |
+|------------------|----------|----------------------------------|
+| **Artifact persisted** | User uploads PDF; paste text saved as file; export from a feature | Existing pipeline: extract → `indexFile` (or equivalent). |
+| **User connection added / re-authorized** | SharePoint OAuth success; Microsoft/Google mail connection; new API credential with data scope | **Register datasource** + enqueue **initial sync** (backfill) + index a **short connection summary document** (site name, root path, principal, *no secrets*). |
+| **Sync for an existing connection** | Scheduled delta; webhook (if available); manual “refresh” | Incremental fetch → map to content objects or rows → **upsert** chunks keyed by external id. |
+| **Connection revoked / token invalid** | User disconnects; admin removes mandate integration | **Tombstone** or **purge** chunks keyed by **connection / external source** (today: enforce via **`chunkMetadata` / `contextRef`** convention or future columns); ensure retrieval never serves stale data from that connection. |
+| **Mandate / membership** | User added to mandate; role change; feature instance attached | Regenerate **mandate-safe summary** documents (see §2.3) if policy allows; **re-resolve scope** for existing chunks (may be heavy; it is often better to store an immutable `mandateId` on chunks at write time and rely on retrieval filters). |
+| **User profile (bounded)** | Display name, locale, timezone, **non-sensitive** preferences | Optional **UserContextDocument** for personalization—not a dump of the whole `User` row. |
+| **Feature configuration** | Instance labels, data source labels, automation descriptions | If they should influence answers, emit structured **FeatureConfigSnapshot** chunks (small, text-first). |
+| **Artifact deleted / data subject erasure** | User deletes a stored file; mandate/user erase | Purge or tombstone the corresponding **`FileContentIndex` / `ContentChunk`** rows (by `fileId`); erasure jobs cascade by **`userId`** / mandate policy. **Connection-wide** revoke remains the **connection** row above. |
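+
+For the "connection revoked" row, the purge can stay data-driven if every connector-sourced chunk carries its `connectionId` in `chunkMetadata`. The sketch below works on plain in-memory rows; `ChunkRow`, its `tombstoned` flag, and both function names are assumptions for illustration and do not describe the actual `ContentChunk` schema.
+
+```python
+from dataclasses import dataclass, field
+from typing import Any, Dict, List
+
+
+@dataclass
+class ChunkRow:
+    chunkId: str
+    fileId: str
+    chunkMetadata: Dict[str, Any] = field(default_factory=dict)
+    tombstoned: bool = False   # hypothetical flag, not an existing column
+
+
+def tombstoneByConnection(chunks: List[ChunkRow], connectionId: str) -> int:
+    """Mark every live chunk of the given connection; return how many changed."""
+    changed = 0
+    for chunk in chunks:
+        if chunk.tombstoned:
+            continue
+        if chunk.chunkMetadata.get("connectionId") == connectionId:
+            chunk.tombstoned = True
+            changed += 1
+    return changed
+
+
+def retrievableChunks(chunks: List[ChunkRow]) -> List[ChunkRow]:
+    """Retrieval must never serve chunks from a revoked connection."""
+    return [chunk for chunk in chunks if not chunk.tombstoned]
+```
+
+Chunks from uploaded files have no `connectionId` in their metadata, so a connection revoke leaves them untouched; they are handled by the `fileId`-based row above.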
+
+### 2.2 User connections (added by the user) as first-class ingestion sources — lifecycle and **what to index per connection type**
+
+**Conceptual focus:** The trigger is OAuth success, saved credential, or linked account in **`UserConnection`** that grants access to an external system. **Implementation** still flows through provider code under `gateway/modules/connectors/` (e.g. **`providerMsft`**, **`providerGoogle`**, **`providerClickup`**); that mapping is **technical**, not the product wording.
+
+**Scope — what counts as a user connection here:** `gateway/modules/routes/routeDataConnections.py` only allows **creating** connections with `type` **`msft`**, **`google`**, or **`clickup`** (`create_connection` → OAuth via `connect_service`). The **authorities options** endpoint also lists **`local`**, but that path is **not** wired in `create_connection`. **This subsection only covers those user-connection authorities** (plus the surfaces each OAuth integration can reach, e.g. Graph mail for Microsoft). Other Gateway connector packages (FTP, Jira, preprocessor, outbound-only mail, geo APIs, …) are **out of scope** in §2.2 until they are exposed the same way as **`UserConnection`** rows.
+
+**Cross-cutting rules (every user-added connection):**
+
+- **Never index:** OAuth tokens, refresh payloads, raw credentials, webhook signing secrets.
+- **Always safe to index (metadata only):** human-readable **connection** label, tenant/site name, root path / mailbox address **as display string**, last sync cursor (store in DB, not necessarily as embedding), **external id** + **revision** for idempotency.
+- **Prefer file pipeline for binaries:** download → store as `FileItem` (or equivalent) → reuse existing **extract → `indexFile`** path so neutralization and scope mirror upload behavior.
+- **Prefer virtual documents** for small text-native items (mail headers/snippets, issue titles/descriptions) to avoid N binary copies.
+- **Quota:** per-mandate max documents, max bytes, and “index only last N days” for mail are **product** knobs, not defaults baked into each adapter.
+
+**Lifecycle pattern (target) — tied to the connection row, not to “a connector class”:**
+
+1. **Connection event** (`ConnectionEstablished`) fires when the user **adds** or **re-authorizes** a connection (OAuth / credential storage, **`UserConnection`**, authority **`msft`**, **`google`**, or **`clickup`** per current API).
+2. **Ingestion registry** records: `{ connectionId, featureCode, mandateId, userId, scope, externalRoot, adapterKind }` (adapter kind = which integration backs this connection).
+3. **Sync planner** enqueues jobs **for that connection**:
+   - **Bootstrap:** list roots, respect quotas, prioritize recently modified.
+   - **Delta:** cursor per drive/site/folder/mailbox/label; persist cursor in DB.
+4. **Normalizer** maps each external item to either:
+   - **File-like** → persist bytes + run extraction + **`indexFile`**, or  
+   - **Virtual document** → build `contentObjects` in memory + **`indexFile`** with a synthetic `fileId` / stable external key.
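+
+Step 4 of the lifecycle can be sketched as a pure mapping function. Everything named here (`ExternalItem`, `NormalizedItem`, the `ext:` key format, the `hasBinary` flag) is hypothetical; the sketch only shows the file-like versus virtual-document decision and a stable key derived from the connection and the external id.
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass(frozen=True)
+class ExternalItem:
+    connectionId: str
+    externalId: str
+    revision: str                 # e.g. eTag or modifiedTime
+    text: Optional[str] = None    # set for small text-native items
+    hasBinary: bool = False       # True when bytes must be downloaded
+
+
+@dataclass(frozen=True)
+class NormalizedItem:
+    kind: str             # "file_like" or "virtual_document"
+    stableKey: str        # synthetic fileId / stable external key
+    contentVersion: str
+
+
+def stableExternalKey(item: ExternalItem) -> str:
+    """Same connection + external id always yields the same key."""
+    return f"ext:{item.connectionId}:{item.externalId}"
+
+
+def normalize(item: ExternalItem) -> NormalizedItem:
+    if item.hasBinary:
+        kind = "file_like"          # persist bytes, extract, then indexFile
+    elif item.text is not None:
+        kind = "virtual_document"   # build contentObjects in memory
+    else:
+        raise ValueError("item has neither binary content nor text")
+    return NormalizedItem(kind, stableExternalKey(item), item.revision)
+```
+
+Keeping the key independent of the revision means a changed item maps to the same logical document and can be upserted instead of duplicated.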
+
+---
+
+#### When the user connects **Microsoft** (Graph — SharePoint, OneDrive, Outlook, Teams) — `providerMsft`
+
+| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
+|--------|----------------------------------|------------------------------|-----------|
+| **SharePoint** (`SharepointAdapter`) | Document libraries: **PDF, Office, text, markdown**; list **metadata** (library name, path, item name) as `contextRef`. | Huge video blobs, raw executables, duplicates already indexed via another path. | Use **driveItem id + eTag** for revision. Respect **library/folder allowlist** on this **connection**. |
+| **OneDrive** (`OneDriveAdapter`) | Same as SharePoint for **personal files** reachable through the user’s connection. | System/temp folders if exposed. | Scope = **personal** unless shared into mandate explicitly. |
+| **Outlook** (`OutlookAdapter`) | **Mailbox:** subjects, **from/to/cc**, **received date**, **body** (plain or stripped HTML) per policy; **calendar** titles/locations/descriptions if product enables. | Full MIME raw, embedded images as separate media unless needed; **entire mailbox** without date window in v1. | Strong **retention + PII** policy: optional “headers + snippet only”; strip signatures/quoted threads; **attachments** → child **file-like** jobs (virus/size limits). |
+| **Teams** (`TeamsAdapter`) | **Channel messages** (text), **meeting chat** exports if API allows; **files shared in channel** as file-like. | Message reactions, per-user read receipts; continuous full channel history without bounds. | Often **high volume** — default to **recent window** or **keyword/subscription** driven sync. |
+
+---
+
+#### When the user connects **Google** (Drive, Gmail) — `providerGoogle`
+
+| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
+|--------|----------------------------------|------------------------------|-----------|
+| **Drive** (`DriveAdapter`) | Native Google files after **export** to Office/PDF (existing export MIME map); standard uploaded files **download → extract**. | Trashed items; shared drives the user did not authorize. | Use **file id + modifiedTime**; Google Docs need **export** before text extraction. |
+| **Gmail** (`GmailAdapter`) | **Threads:** subject, participants, internalDate, **snippet** or **body** per policy; **attachments** as separate ingest jobs. | Entire “All Mail” unbounded; **labels** that are purely system. | Same mail cautions as Outlook; **Message-ID** + **History-ID**/cursor for delta. |
+
+---
+
+#### When the user connects **ClickUp** — `providerClickup` (`AuthAuthority.CLICKUP`)
+
+| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
+|--------|----------------------------------|------------------------------|-----------|
+| **ClickUp** (`providerClickup`) | Task **name**, **description**, **comments**; **attachment** content if downloaded. | Activity stream noise, every status micro-change unless text changed. | Rate limits → prioritize **recently updated** tasks. |
+
+---
+
+**Email and messaging (Outlook + Gmail via Microsoft / Google user connections) — shared cautions**
+
+- Default tiers: **metadata only** → **snippet** → **full body** → **attachments** (most expensive / sensitive).
+- Apply **quoted-thread stripping**, **signature removal**, and **max body length** before embed.
+- **Legal hold / retention:** ingestion must respect mandate **delete** and **export** rules; **disconnecting** or **revoking** the mail **connection** must **purge** mail-sourced chunks.
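+
+A rough sketch of the first three tiers and the pre-embedding cleanup follows. It assumes mails arrive as plain dictionaries with `subject`, `from`, `date`, and `body` keys, treats lines starting with `>` as quoted text, and cuts at the conventional `-- ` signature delimiter; real mail needs more robust heuristics, and the attachment tier is left out. All names and default limits are illustrative.
+
+```python
+import re
+from typing import Dict
+
+TIERS = ("metadata", "snippet", "full_body")
+_QUOTED_LINE = re.compile(r"^\s*>")
+
+
+def stripQuotedAndSignature(body: str) -> str:
+    kept = []
+    for line in body.splitlines():
+        if line.rstrip() == "--":
+            break       # signature delimiter: drop everything after it
+        if _QUOTED_LINE.match(line):
+            continue    # quoted reply line
+        kept.append(line)
+    return "\n".join(kept).strip()
+
+
+def prepareMailText(mail: Dict[str, str], tier: str,
+                    snippetLength: int = 200, maxBodyLength: int = 2000) -> str:
+    if tier not in TIERS:
+        raise ValueError(f"unknown tier: {tier}")
+    header = f"Subject: {mail['subject']}\nFrom: {mail['from']}\nDate: {mail['date']}"
+    if tier == "metadata":
+        return header
+    body = stripQuotedAndSignature(mail.get("body", ""))
+    limit = snippetLength if tier == "snippet" else maxBodyLength
+    return f"{header}\n\n{body[:limit]}"
+```
+
+The tier would be a per-mandate policy value passed in by the sync planner, so the adapter itself never decides how much mail content is embedded.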
+
+### 2.3 Account and profile data: what to index vs. what never to index
+
+**Goal:** Give agents **useful, permission-safe** context (“who is this user in this mandate”, “which features are on”, “preferred language”) without creating a **second copy of sensitive credentials** in the vector store.
+
+| Data | Typical treatment |
+|------|-------------------|
+| Passwords, refresh tokens, API secrets | **Never** index; never pass through embedding pipeline. |
+| Email, phone, government IDs | **Default deny**; only if product explicitly enables “index PII” with neutralization and mandate policy. |
+| Display name, locale, feature entitlements (labels) | **Allow** as a small structured **UserMandateSnapshot** document regenerated on change. |
+| Full `User` or `Mandate` DB row | **Avoid**; generate **curated** JSON/text snapshots with field allowlists. |
+
+Snapshots should be stored with the same **scope model** as file chunks (`personal`, `featureInstance`, `mandate`, `global`) so `semanticSearch` filters stay consistent.
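+
+A field allowlist is the simplest way to guarantee that a snapshot never contains more than intended. The sketch below assumes the user record is available as a dictionary; the allowlisted field names, the returned structure, and the function name are hypothetical, while `profile_snapshot` as a `sourceKind` and the scope values come from this document.
+
+```python
+import json
+from typing import Any, Dict
+
+# Only these fields may ever reach the embedding pipeline (assumed names).
+SNAPSHOT_ALLOWLIST = ("displayName", "locale", "entitlementLabels")
+
+
+def buildUserMandateSnapshot(userRow: Dict[str, Any], mandateId: str,
+                             scope: str = "mandate") -> Dict[str, Any]:
+    """Project a user row onto the allowlist; unknown fields are dropped."""
+    fields = {key: userRow[key] for key in SNAPSHOT_ALLOWLIST if key in userRow}
+    return {
+        "sourceKind": "profile_snapshot",
+        "mandateId": mandateId,
+        "scope": scope,
+        # Sorted keys give a stable text, so an unchanged profile hashes the same.
+        "text": json.dumps(fields, sort_keys=True, ensure_ascii=False),
+    }
+```
+
+Because the projection is allowlist-based, a new sensitive column added to the user table later is excluded by default rather than indexed by accident.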
+
+### 2.4 Event-driven vs. direct calls
+
+**Minimum viable:** each feature calls `requestIngestion` at the end of its own transaction (direct call).
+
+**Scalable target:** emit **domain events** (`FileCommitted`, `UserConnectionReady` / provider-specific ready event, `ProfileUpdated`) and a single **KnowledgeIngestionConsumer** subscribes. Benefits: one place for metrics, retries, and rate limits; features stay thin.
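+
+The event-driven variant can be illustrated with a tiny synchronous dispatcher. `EventBus` and the constructor signature of `KnowledgeIngestionConsumer` are assumptions for this sketch; a real deployment would more likely use an existing queue or message broker, and the ingestion callback would be the façade from Teil 1.
+
+```python
+from collections import defaultdict
+from typing import Callable, DefaultDict, Dict, List
+
+Payload = Dict[str, str]
+Handler = Callable[[str, Payload], None]
+
+
+class EventBus:
+    """Tiny synchronous dispatcher used only for illustration."""
+
+    def __init__(self) -> None:
+        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
+
+    def subscribe(self, eventName: str, handler: Handler) -> None:
+        self._handlers[eventName].append(handler)
+
+    def emit(self, eventName: str, payload: Payload) -> None:
+        for handler in list(self._handlers[eventName]):
+            handler(eventName, payload)
+
+
+class KnowledgeIngestionConsumer:
+    SUBSCRIBED_EVENTS = ("FileCommitted", "UserConnectionReady", "ProfileUpdated")
+
+    def __init__(self, bus: EventBus,
+                 requestIngestion: Callable[[Payload], None]) -> None:
+        self._requestIngestion = requestIngestion
+        self.handledCounts: Dict[str, int] = {}
+        for eventName in self.SUBSCRIBED_EVENTS:
+            bus.subscribe(eventName, self._onEvent)
+
+    def _onEvent(self, eventName: str, payload: Payload) -> None:
+        # Single place for metrics; retries and rate limits would also live here.
+        self.handledCounts[eventName] = self.handledCounts.get(eventName, 0) + 1
+        self._requestIngestion({"trigger": eventName, **payload})
+```
+
+Features only call `bus.emit(...)` at their commit point; they do not need to know whether, or how, the event leads to indexing.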
+
+**Storage (already implemented — not redesigned here):** The platform already uses **one** knowledge persistence stack: **`FileContentIndex`** (incl. `mandateId`, `scope`, status) and **`ContentChunk`** (pgvector embeddings, `fileId`, `userId`, `featureInstanceId`, `contextRef`, optional **`chunkMetadata`**), accessed via **`interfaceDbKnowledge`**. Chunks are **file-anchored** today; **connection- / source-specific** provenance (e.g. `connectionId`, external ids) can ride in **`contextRef` / `chunkMetadata`** until optional schema extensions are justified. **This document targets ingestion triggers and lifecycles**, not a second corpus or a duplicate storage model.
+
+---
+
+## Teil 3 — Feature injection: retrieval vs corpus, agent loop, and real gaps
+
+“Injection” is ambiguous. This section uses **two** precise meanings:
+
+| Kind | What happens | Primary implementation today |
+|------|----------------|------------------------------|
+| **Retrieval injection** | Relevant **existing** chunks and workflow context are **assembled** and **inserted into the LLM prompt** (system message) each agent round. | **`AgentService.runAgent`** → `buildRagContextFn` → **`KnowledgeService.buildAgentContext`** → **`ConversationManager.injectRagContext`**. CommCoach wraps the same **`buildAgentContext`** and adds coaching-specific context. |
+| **Corpus injection (indexing)** | **New** text/binary is **chunked and embedded** and written to **`interfaceDbKnowledge`** so it can be retrieved later. | **`KnowledgeService.indexFile`**; callers include **`routeDataFiles._autoIndexFile`**, **`serviceAgent`** tools (**`_documentTools`**, **`_workspaceTools`**), and **CommCoach** **`serviceCommcoachIndexer`**. |
+
+A feature can **already participate fully in retrieval injection** by using **`AgentService`** without ever calling **`indexFile`** in its own folder. **Corpus** growth can still happen **indirectly** (upload pipeline, agent tools). Planning must **not** label such features as “non-injecting.”
+
+### 3.1 Features that already use **`AgentService.runAgent`** (retrieval injection is on by default)
+
+These **`modules/features/*`** entry points resolve **`getService("agent", ctx)`** and stream **`agentService.runAgent(...)`** (code audit):
+
+- **`workspace`** (`routeFeatureWorkspace.py`)
+- **`graphicalEditor`** (`routeFeatureGraphicalEditor.py`)
+- **`commcoach`** (`serviceCommcoach.py` — custom **`buildRagContextFn`**, still uses platform **`buildAgentContext`** inside)
+
+For all three, **every agent round** gets **retrieval injection** unless RAG fails or returns empty. **Corpus** updates for the same sessions still depend on **separate** mechanisms:
+
+| Corpus path | When it runs |
+|-------------|----------------|
+| **Upload / `FileItem`** | **`routeDataFiles`** **`_autoIndexFile`** after storage (feature-agnostic). |
+| **Agent tools** | If the model invokes tools in **`_documentTools`** / **`_workspaceTools`** that call **`indexFile`**, **corpus** changes **during** that agent run—implemented in **`serviceAgent`**, not in the feature’s route file. |
+
+So **workspace** and **graphicalEditor** **do** “inject” in the **retrieval** sense today; they **can** “inject” in the **corpus** sense when users **upload** files or when the **agent** runs indexing-capable tools. What they **often lack** is **feature-owned, automatic corpus** logic (e.g. “on every graph publish, index a snapshot”) without an upload or tool call.
+
+### 3.2 Features that do **not** use **`AgentService`** (no platform RAG prompt injection from this stack)
+
+These domains **do not** call **`runAgent`** in their **`modules/features/*`** trees (audit). They therefore **do not** receive **`buildAgentContext`** through the **workspace agent** loop:
+
+| Feature | Notes |
+|---------|--------|
+| **chatbot** | Uses an **internal** LangGraph-style flow (SQL / Tavily / answer nodes). **No** `getService("knowledge")` / **`buildAgentContext`** usage under **`modules/features/chatbot/`** in the audited tree—**retrieval injection** and the **unified corpus** are **not** wired the same way as the workspace agent. |
+| **trustee** | Domain CRUD and quick actions (e.g. **`agentPrompt`** is a **UI hint** to open the workspace with a prefilled prompt—not **`AgentService` inside trustee**). Corpus: **only** via shared **upload** or if the user later uses **workspace agent** with tools. |
+| **realEstate** | No **`AgentService`** hook in feature tree; same **upload** story for files. |
+| **teamsbot** | Uses **`serviceAi`** (and related) for the meeting pipeline; **`sessionContext`** is **ephemeral** prompt text. **No** **`AgentService`** / **`buildAgentContext`** in the same pattern as workspace. |
+| **neutralization** | **Service/pipeline** used **inside** **`indexFile`** when **`FileItem.neutralize`** applies—not a feature that “injects” either kind by itself. |
+
+### 3.3 Summary matrix (per `modules/features/` domain)
+
+| Feature | **`AgentService.runAgent`** | **Retrieval injection** (platform RAG prompt) | **Corpus injection** (typical today) | **Likely gap** (this document) |
+|---------|----------------------------|-----------------------------------------------|-------------------------------------|--------------------------------|
+| **workspace** | Yes | Yes | Upload **`_autoIndexFile`**; optional **`indexFile`** via agent **tools** | **Automatic** corpus for artifacts that never become **`FileItem`** or tool outputs (exports, structured summaries). |
+| **graphicalEditor** | Yes | Yes | Same as workspace | **Published graph / metadata** as searchable corpus without manual upload. |
+| **commcoach** | Yes | Yes (+ custom RAG layer) | Session **`indexFile`** (**`serviceCommcoachIndexer`**) + upload/tools | Extend only if new artifact types need the same **feature-local** indexer pattern. |
+| **chatbot** | No | **No** (unified store) | No feature-local **`indexFile`** | Decide if chatbot should call **`buildAgentContext`** / **`indexFile`** or stay on SQL/Tavily; **FAQ / grounding** text may need **corpus** hooks. |
+| **trustee** | No | Only if user works in **workspace** | Upload path; agent tools only in workspace | **Trustee-native** persist events → ingestion when files are not the only representation. |
+| **realEstate** | No | Only via workspace | Upload path | Same as trustee for case/property narratives. |
+| **teamsbot** | No | No | None from unified store by default | Persisted **transcripts / notes** → **`indexFile`** if they should be mandate-searchable. |
+| **neutralization** | N/A | N/A | Preconditions for **`indexFile`** | Ensure all **new** ingest paths honor **`FileItem.neutralize`**. |
+
+### 3.4 Shared corpus mechanisms (not feature-local, but serve agent features)
+
+| Mechanism | Role |
+|-----------|------|
+| **`routeDataFiles` + `_autoIndexFile`** | Indexes **uploaded** `FileItem`s for **any** UI that uses the upload API—including workspace. |
+| **`serviceAgent`** **`_documentTools`** / **`_workspaceTools`** | **Corpus** writes when the **model** chooses tools; available to **workspace** and **graphicalEditor** agent sessions (and **CommCoach** when those tools are in the toolset). |
+| **CommCoach** **`serviceCommcoachIndexer`** | **Feature-local** corpus: coaching session text → **`indexFile`** without requiring an upload. |
+
+### 3.5 Where **additional feature-native corpus injection** is still needed
+
+Use this checklist **only** after accounting for §3.1–3.4:
+
+1. **Content is authoritative** in the feature DB or blob store **without** a guaranteed **`FileItem`** + **`_autoIndexFile`** path.  
+2. **Retrieval injection alone** is insufficient because nothing ever **wrote** chunks (e.g. chatbot never hits **`indexFile`**).  
+3. **Relying on the agent to call tools** is too fragile for compliance or UX (“user must remember to index”).
+
+Where these conditions apply, add **`requestIngestion` / `indexFile`** at the **feature commit point** (or emit a domain event), and set **`contextRef` / `chunkMetadata`** to carry **`feature_code`** and business ids, but **no secrets**.
+
+### 3.6 Implementation pattern (feature-native corpus only)
+
+1. **Commit point** — authoritative write in the feature or shared storage.  
+2. **Scope** — align with **`FileItem`** / **`ServiceCenterContext`** rules already used in **`indexFile`**.  
+3. **Unified façade** — one ingestion API; avoid a second embedding pipeline.  
+4. **Purge** — tie to **`fileId`**, business key, or future connector purge keys on revoke/delete.
+
+### 3.7 Phasing
+
+- **P0:** For **each** row in §3.3, confirm **retrieval** vs **corpus** paths; document “satisfied by agent+upload+tools” vs “needs feature hook.”  
+- **P1:** Implement **feature-native corpus** for one domain with a clear §3.5 gap (e.g. **trustee** entity text, **teamsbot** persisted transcript).  
+- **P2:** **Chatbot** architecture decision: integrate **`serviceKnowledge`** or keep parallel retrieval; if integrating, add explicit **corpus** rules for config/FAQ.
+
+---
+
+## Implementation phases (suggested)
+
+Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog), and **Teil 3.7** (feature matrix and feature-native corpus pilots). **P0** overlaps **Teil 3.7 P0** (complete the per-feature matrix before large builds).
+
+| Phase | Outcome |
+|-------|---------|
+| **P0 — Catalog & façade** | Document all current `indexFile` call sites; wrap them in `requestIngestion`; add metrics and structured logging; complete **Teil 3.3** matrix (retrieval vs corpus vs gap) per feature. |
+| **P1 — User-connection hooks** | On connection success/failure/revoke, enqueue bootstrap/delta/purge jobs per **Teil 2.2**; SharePoint and one mail provider as pilots. |
+| **P2 — Profile & mandate snapshots** | Allowlisted fields only (**Teil 2.3**); regenerate on events; explicit admin toggle per mandate if needed. |
+| **P3 — Event bus** | Move direct calls to async consumer where load requires it (**Teil 2.4** scalable target). |
+
+---
+
+## Goals and non-goals
+
+**Goals:**
+
+- One **ingestion contract** for all features and connector lifecycles.  
+- Indexing **decoupled** from the agent loop (agents may still *invoke* tools that ultimately call ingestion, but ingestion must not *depend* on an agent run).  
+- **Explicit** handling of connection establishment, sync, and revocation.  
+- **Bounded** indexing of user/mandate context with a clear PII policy.
+
+**Explicit non-goals:**
+
+- Moving **retrieval** (`buildAgentContext`) out of agents.  
+- Guaranteeing **real-time** indexing for every byte without async jobs (latency targets are product decisions).  
+- Indexing **everything** in the database “because we can”—only curated, policy-approved surfaces.
+
+---
+
+## Affected modules (expected)
+
+- **Gateway:** `serviceKnowledge`, file upload routes, connector OAuth handlers, sync workers, possibly new `serviceKnowledgeIngest` or package under `modules/serviceCenter/services/`.  
+- **Interfaces:** `interfaceDbKnowledge` extensions for source metadata if needed.  
+- **Wiki / Reference:** update `b-reference/gateway/ai-agent.md` (ingestion vs. retrieval) after implementation.
+
+---
+
+## Open decisions
+
+| Topic | Options |
+|-------|----------|
+| **Email bodies** | Full text vs. summary-only vs. attachment-only |
+| **Multi-tenant isolation audits** | Periodic job to verify chunk `mandateId` matches connection |
+| **Cost caps** | Per-mandate embedding budget; defer large backfills |
+| **Neutralization** | Mandatory for certain `sourceKind`s even when not file-upload |
+| **Provenance shape** | First-class DB columns vs **documented `chunkMetadata` keys** for `connectionId`, external id, revision (must support **Teil 2** purge rules). |
+
+---
+
+## Links
+
+- **How-to / orientation:** [Unified knowledge & RAG ingestion (guide)](../../d-guides/unified-knowledge-rag.md)  
+- **Gateway reference (retrieval + knowledge):** `wiki/b-reference/gateway/architecture.md`, `wiki/b-reference/gateway/ai-agent.md`  
+- **Implementation touchpoints (indicative):** `gateway/modules/serviceCenter/services/serviceKnowledge/mainServiceKnowledge.py`, `gateway/modules/routes/routeDataFiles.py`, `gateway/modules/features/commcoach/serviceCommcoachIndexer.py`, agent `coreTools` `_documentTools` / `_workspaceTools`.
+
+## Acceptance criteria (plan level)
+
+| # | Criterion | Priority |
+|---|-----------|------|
+| 1 | Every new **file** that should be searchable triggers ingestion **without** requiring an agent session. | must |
+| 2 | **User connection** connect / disconnect has defined ingestion or purge behavior documented and implementable. | must |
+| 3 | **Profile/mandate** snapshots use an explicit allowlist; secrets never enter the embedding pipeline. | must |
+| 4 | Ingestion is **idempotent** for unchanged content (no duplicate embedding work). | should |
+| 5 | **Teil 3.3** matrix completed: every `modules/features/*` product row explicitly states **retrieval** (agent vs none), **corpus** (upload / tools / feature indexer), and **gap**; no feature is labeled “non-injecting” if **`AgentService`** already provides retrieval injection. | should |
+
+---
+
+## Test plan (concept verification)
+
+| ID | Question | Method |
+|----|-------|---------|
+| T1 | Are all existing indexing entry points inventoried? | Code audit + table in the build phase |
+| T2 | Is it clear which features use **retrieval** (agent), only **corpus**, or both? | Review the **Teil 3.3** matrix against `runAgent` / `indexFile` call sites |
+| T3 | Does **platform RAG retrieval** (`buildAgentContext` via `AgentService`) keep its role unchanged for workspace / graphicalEditor / CommCoach? | Review `agentLoop` + `mainServiceAgent._createBuildRagContextFn` |
+| T4 | Can revoke/purge for connector chunks be specified today as a **metadata convention**, without a **`connectionId` column**? | Review **Teil 2.1** + the **Provenance shape** row of the open-decisions table |
+| T5 | Is revoke/purge per **user-connection authority / integration surface** (Teil 2.2; only `msft` / `google` / `clickup` according to `routeDataConnections`) covered in a threat model? | Data flow: connection → `FileItem` / virtual doc → chunks |