From 44e34a4cd7e74da091bbb16c54a7b7c0e484179c Mon Sep 17 00:00:00 2001 From: Ida Date: Thu, 16 Apr 2026 18:25:10 +0200 Subject: [PATCH] unified RAG plan --- ...-unified-knowledge-indexing-rag-concept.md | 409 ++++++++++++++++++ 1 file changed, 409 insertions(+) create mode 100644 c-work/1-plan/2026-04-unified-knowledge-indexing-rag-concept.md diff --git a/c-work/1-plan/2026-04-unified-knowledge-indexing-rag-concept.md b/c-work/1-plan/2026-04-unified-knowledge-indexing-rag-concept.md new file mode 100644 index 0000000..672ed44 --- /dev/null +++ b/c-work/1-plan/2026-04-unified-knowledge-indexing-rag-concept.md @@ -0,0 +1,409 @@ + + + + +# Unified Knowledge Indexing — One RAG Corpus for All Platform Information + +## How to read this document + +| Section | Content | +|---------|---------| +| **Beschreibung und Kontext** | Scope (**ingestion vs retrieval**), **terminology** (feature / service / connector / interface), **as-is vs target**, business case and risks. | +| **Teil 1** | Ingestion as its **own lifecycle**: façade API, idempotency, orchestration—**not** owned by `agentLoop`. | +| **Teil 2** | **Triggers** (beyond upload): **user connections**, account snapshots, purge; **per-connection-type** indexing guidance; event-driven option. | +| **Teil 3** | **Feature injection** split into **retrieval** (agent + `buildAgentContext`) vs **corpus** (`indexFile`); **matrix** per `modules/features/*` product; real **gaps** vs false “non-injection”. | +| **Implementation phases · Ziele · AC · Testplan** | Rollout, explicit non-goals, acceptance criteria, verification. | + +**Single sentence summary:** Keep **retrieval** on **`AgentService`**; unify **when and how** the shared **`interfaceDbKnowledge`** corpus is **filled** (routes, **user connections** / integrations, features, snapshots) behind one **ingestion contract**, without assuming every product uses the workspace agent. + +## Beschreibung und Kontext + +### Scope of this document + +We distinguish **ingestion** (chunking, embedding, persisting into **`interfaceDbKnowledge`**) from **retrieval** (semantic search + `buildAgentContext` for the LLM). **Retrieval** for the **unified knowledge store** is consumed primarily through **`serviceAgent`** / `runAgent` (workspace, graphical editor, CommCoach). Other products (e.g. **chatbot**, **teamsbot**) may use **different** LLM stacks—**Teil 3** maps who gets platform RAG vs who does not. This plan does **not** mandate one global retrieval path for every feature; it **does** mandate a **single ingestion story** into the same corpus where that corpus is used. The gap we address is **how and when** the corpus is **filled**, not how every LLM entry point **reads** it. + +### Terminology (Gateway — see `wiki/b-reference/gateway/architecture.md`) + +This concept separates **feature modules**, **services**, **connectors**, and **interfaces**. Conflating them produces wrong ownership (e.g. treating “SharePoint” as a `modules/features/` product, or treating “mail” as if it were `serviceKnowledge`). + +| Term | Where in Gateway | Role for indexing | +|------|------------------|-------------------| +| **Feature** | `modules/features/*` (e.g. `workspace`, `graphicalEditor`, `commcoach`, `trustee`, `chatbot`) | Product domains: UI, feature routers, orchestration. They **trigger** actions (upload, sync UX, feature-specific indexers) but must not be the **only** place that starts embedding work. | +| **Service** | `modules/serviceCenter/services/*` | Cross-cutting facades: **`serviceKnowledge`** (indexing, search, `buildAgentContext`); **`serviceExtraction`** (content objects); **`serviceChat`** (chat/workflow documents); **`serviceMessaging`** (e-mail, notifications); **`serviceAgent`** (tools that may *indirectly* call indexing). **Unified ingestion** is primarily a **service-layer** responsibility. | +| **Connector** | `modules/connectors/*` (Microsoft, Google, …) | Vendor adapters: OAuth, list, download. **SharePoint** and **mailbox** I/O live here; routes/features **call** connectors—they are not interchangeable with a feature or with `serviceKnowledge`. | +| **Interface** | `modules/interfaces/*` | Persistence contracts: **`interfaceDbKnowledge`** (`FileContentIndex`, `ContentChunk`, …), **`interfaceDbManagement`** (`FileItem`, `DataSource`), **`interfaceDbApp`** (User, Mandate, `UserConnections`, Preferences). Profile, mandate, and connection rows are **interface-backed**, not a single “profile feature”. | + +### What we have today (as-is) + +**1. A strong technical “write” implementation, but no product-wide ingestion contract** + +- **`serviceKnowledge`** (`mainServiceKnowledge.py`) already implements the heavy lifting: **`indexFile`** resolves scope from **`FileItem`** (single source of truth), optional neutralization, sentence-aware chunking, **`serviceAi`** embeddings, **`ContentChunk`** persistence, and status on **`FileContentIndex`**. That is the right **unit of work** once **content objects** exist. +- **Retrieval** is also centralized in the same service: **`buildAgentContext`** composes multi-layer context for agents. So **read** and **write** to the vector store are **service-owned**; what is **not** unified is **who may call writes**, with which **idempotency**, and on which **lifecycle events**. + +**2. Multiple invocation lanes, same underlying method** + +Indexing is **operationally** reached from a **mix of layers** (not one façade): + +- **HTTP routes** — e.g. file pipeline in **`routeDataFiles`**: pre-scan / extraction → **`indexFile`**. This is the “happy path” for uploads. +- **Agent tools** — **`serviceAgent`** core tools (e.g. document/workspace helpers) can call **`indexFile`** when the user interacts through the agent. That ties **embedding** to **an agent session** even when the same file could have been indexed on upload. +- **Feature-specific code** — e.g. **CommCoach** indexer paths that call **`indexFile`** for that product’s artifacts. Correct for the feature, but it is **another** ad hoc entry point with its own assumptions. +- **Connectors** — Microsoft/Google (and similar) packages can fetch bytes and ultimately produce files or blobs; **OAuth and delta sync** are not yet modeled everywhere as **first-class ingestion lifecycles** (connect → backfill → incremental → revoke) that all funnel through the same API and metadata. + +There is **no single** `requestIngestion(...)`, **no standard job identity** for “this external item revision”, and **no one place** that records “this mandate revoked access → tombstone these chunks”. + +**3. Extraction vs indexing: clear in code, not enforced at the platform edge** + +- **`serviceExtraction`** (and preprocessing helpers) produce **content objects**; **`indexFile`** consumes them. The boundary is clean **inside** the pipeline, but **not every** new binary or external document **must** pass through a single orchestrated “extract then index” step—some paths may skip, duplicate, or call **`indexFile`** with partial metadata. + +**4. Truth for scope and identity lives in interfaces—not in “features”** + +- **`interfaceDbManagement`** (`FileItem`, …) and **`interfaceDbApp`** (mandate, `UserConnections`, user profile fields) define **who may see what**. **`indexFile`** already mirrors **`FileItem`** for scope; that pattern is **good** but **not generalized** to connector-backed items, virtual documents, or curated “account snapshot” chunks. If a connector writes under a different mental model of `mandateId` / `featureInstanceId`, **`interfaceDbKnowledge`** can drift from app/management truth without a systematic reconcile. + +**5. User/mandate/profile deltas are not first-class ingestion events** + +- Changes to membership, preferences, or connections update **`interfaceDbApp`** (and related tables). They affect **searchability and personalization** but are **not** consistently reflected as **versioned, allowlisted** chunks in the knowledge store—unless a feature manually adds text somewhere. That leaves agents either **under-informed** or dependent on **non-RAG** code paths for the same facts. + +**Summary (as-is):** The **engine** for indexing is **`serviceKnowledge.indexFile`**; the **policy graph** for *when* to run it is **implicit** and **spread across** routes, tools, and features. **Connectors** and **account/mandate** data are **not** uniformly treated as **ingestion sources** with connect/sync/revoke semantics. + +### What would make more sense (target) + +**1. One ingestion façade at the service boundary (not inside `agentLoop`)** + +- A small, stable API (conceptually **`requestIngestion` / `getIngestionStatus`**, implemented atop or beside **`KnowledgeService`**) that **every** lane calls: routes, feature hooks, **connector sync workers**, and (if needed) agent tools as **thin** delegates. +- **Idempotency** (content hash, external revision, `eTag`, …) enforced **here**, so routes and tools cannot accidentally **double-embed** the same logical object. + +**2. Lifecycle parity for connectors and “connections”** + +- **Establish** → register datasource + optional short **non-secret** summary chunk + enqueue **backfill**. +- **Delta** → incremental jobs with persisted cursors. +- **Revoke / token invalid / GDPR** → **tombstone or purge** by `connectionId` / `sourceKind`, aligned with RBAC—not ad hoc deletes scattered in UI code. + +**3. Provenance / `sourceKind` (schema or `chunkMetadata`)** + +- Today chunks are **file-anchored**; extended provenance (internal file vs SharePoint item vs mailbox artifact vs `profile_snapshot`, **`connectionId`** for purge, revision keys) should be **consistent**—either **first-class fields** on `ContentChunk` / index rows **or** a **defined convention** inside **`chunkMetadata` / `contextRef`** until a migration is justified. Goal: retrieval, audit, and **connector revoke** cleanup are **data-driven**, not inferred only from call site. + +**4. Curated snapshots for interface-backed facts** + +- **Allowlisted** projections of mandate membership, locale, entitlements (labels), etc., regenerated on **interface-level** events—**not** dumping full user rows or secrets into embeddings. + +**5. Keep retrieval exactly where it is** + +- **`buildAgentContext`** remains the agent’s way to **consume** the corpus; ingestion only ensures that corpus is **complete, scoped, and attributable** when the agent runs. + +**6. Observability and cost in one place** + +- Queue depth, embedding spend, failures, and “skipped duplicate” counts attach to the **ingestion façade**, not to each feature. + +### Business goal + +Whenever **meaningful information** appears—files, bytes from **connectors**, configuration that should shape answers, and **bounded** user/mandate context—the platform should **ingest it once** into a **unified, scoped** knowledge layer so agents see **one coherent corpus** with clear **provenance** and **permissions**. + +### Why this matters now + +Information deltas arrive through **routes**, **features**, **`serviceAgent`** tools, **connectors**, and **`interfaceDbApp`** / **`interfaceDbManagement`** updates. Without **one** ingestion contract and triggers per **source**, you get: **missing** indexes, **duplicate** work, **scope drift** between knowledge rows and app truth, and **repeated** engineering per entry path instead of **once** at the **service** layer. + +### Risk if we do not unify + +Fragmented memory, inconsistent agent answers, compliance gaps (over-indexing sensitive fields or under-indexing allowed context), and duplicated work **per route/feature/tool** instead of at a **single service boundary**. + +--- + +## Teil 1 — Indexing as its own lifecycle (not owned by the agent) + +### 1.1 Current useful core + +*(Same technical point as **“What we have today” §1** above; repeated here for readers who start at Teil 1.)* After structured **content objects** exist, **`KnowledgeService.indexFile`** performs chunking, embedding (via **`AiService`**), neutralization when required, and persistence via **`interfaceDbKnowledge`**. The **gap** is not the lack of a service method but the lack of a **single product-wide contract** for *when* and *what* enters that pipeline. + +### 1.2 Target responsibility split + +| Concern | Owner | Notes | +|--------|--------|--------| +| **Ingestion** (normalize → chunk → embed → store) | **Knowledge ingestion service** (logical module; may remain `KnowledgeService` + new façade) | No dependency on `agentLoop`. | +| **Retrieval** (query → ranked context string) | **Agent** (and similar LLM entry points) | Unchanged by this concept. | +| **Orchestration** (queues, retries, backoff) | **Job runner / worker** (new or existing infra) | Keeps API latency low. | + +### 1.3 Public ingestion contract (conceptual) + +Introduce a small, stable API surface that **all features** call—never “only if an agent runs”: + +- **`requestIngestion(job: IngestionJob) -> IngestionHandle`** + - Idempotent key: `(sourceKind, sourceId, contentVersion | hash)` + - Returns immediately with `queued` / `duplicate` / `skipped` and optional `jobId` for status polling. + +- **`getIngestionStatus(handle)`** + - Surfaces the same states already used on `FileContentIndex` (`pending`, `extracted`, `embedding`, `indexed`, `failed`) plus connection- or source-specific substates if needed. + +The implementation can stay in-process at first (asyncio task queue) and move to Redis/Celery/ARQ later without changing callers. + +### 1.4 Idempotency and versioning + +- **Re-index** when content changes: compare **content hash** or **external revision** (SharePoint `eTag`, email `Message-ID` + folder cursor, file `updatedAt`). +- **Skip** when hash unchanged to control embedding cost. +- **Tombstone** or **scope-disable** when a source is deleted or access revoked (see Teil 2). + +--- + +## Teil 2 — Triggers: not only “file write”, but every information delta + +“Write path” is too narrow if we read it as “HTTP upload only”. The unified model should treat **any authoritative addition or change of platform-visible information** as a potential ingestion trigger. + +### 2.1 Trigger taxonomy + +| Trigger category | Examples | Ingestion behavior (conceptual) | +|------------------|----------|----------------------------------| +| **Artifact persisted** | User uploads PDF; paste text saved as file; export from a feature | Existing pipeline: extract → `indexFile` (or equivalent). | +| **User connection added / re-authorized** | SharePoint OAuth success; Microsoft/Google mail connection; new API credential with data scope | **Register datasource** + enqueue **initial sync** (backfill) + index a **short connection summary document** (site name, root path, principal, *no secrets*). | +| **Sync for an existing connection** | Scheduled delta; webhook (if available); manual “refresh” | Incremental fetch → map to content objects or rows → **upsert** chunks keyed by external id. | +| **Connection revoked / token invalid** | User disconnects; admin removes mandate integration | **Tombstone** or **purge** chunks keyed by **connection / external source** (today: enforce via **`chunkMetadata` / `contextRef`** convention or future columns); ensure retrieval never serves stale data from that connection. | +| **Mandate / membership** | User added to mandate; role change; feature instance attached | Regenerate **mandate-safe summary** documents (see Section 2.3) if policy allows; **re-resolve scope** for existing chunks (may be heavy—often better to store immutable `mandateId` on chunks at write time and rely on retrieval filters). | +| **User profile (bounded)** | Display name, locale, timezone, **non-sensitive** preferences | Optional **UserContextDocument** for personalization—not a dump of the whole `User` row. | +| **Feature configuration** | Instance labels, data source labels, automation descriptions | If they should influence answers, emit structured **FeatureConfigSnapshot** chunks (small, text-first). | +| **Artifact deleted / data subject erasure** | User deletes a stored file; mandate/user erase | Purge or tombstone the corresponding **`FileContentIndex` / `ContentChunk`** rows (by `fileId`); erasure jobs cascade by **`userId`** / mandate policy. **Connection-wide** revoke remains the **connection** row above. | + +### 2.2 User connections (added by the user) as first-class ingestion sources — lifecycle and **what to index per connection type** + +**Conceptual focus:** The trigger is OAuth success, saved credential, or linked account in **`UserConnection`** that grants access to an external system. **Implementation** still flows through provider code under `gateway/modules/connectors/` (e.g. **`providerMsft`**, **`providerGoogle`**, **`providerClickup`**); that mapping is **technical**, not the product wording. + +**Scope — what counts as a user connection here:** `gateway/modules/routes/routeDataConnections.py` only allows **creating** connections with `type` **`msft`**, **`google`**, or **`clickup`** (`create_connection` → OAuth via `connect_service`). The **authorities options** endpoint also lists **`local`**, but that path is **not** wired in `create_connection`. **This subsection only covers those user-connection authorities** (plus the surfaces each OAuth integration can reach, e.g. Graph mail for Microsoft). Other Gateway connector packages (FTP, Jira, preprocessor, outbound-only mail, geo APIs, …) are **out of scope** in §2.2 until they are exposed the same way as **`UserConnection`** rows. + +**Cross-cutting rules (every user-added connection):** + +- **Never index:** OAuth tokens, refresh payloads, raw credentials, webhook signing secrets. +- **Always safe to index (metadata only):** human-readable **connection** label, tenant/site name, root path / mailbox address **as display string**, last sync cursor (store in DB, not necessarily as embedding), **external id** + **revision** for idempotency. +- **Prefer file pipeline for binaries:** download → store as `FileItem` (or equivalent) → reuse existing **extract → `indexFile`** path so neutralization and scope mirror upload behavior. +- **Prefer virtual documents** for small text-native items (mail headers/snippets, issue titles/descriptions) to avoid N binary copies. +- **Quota:** per-mandate max documents, max bytes, and “index only last N days” for mail are **product** knobs, not defaults baked into each adapter. + +**Lifecycle pattern (target) — tied to the connection row, not to “a connector class”:** + +1. **Connection event** (`ConnectionEstablished`) fires when the user **adds** or **re-authorizes** a connection (OAuth / credential storage, **`UserConnection`**, authority **`msft`**, **`google`**, or **`clickup`** per current API). +2. **Ingestion registry** records: `{ connectionId, featureCode, mandateId, userId, scope, externalRoot, adapterKind }` (adapter kind = which integration backs this connection). +3. **Sync planner** enqueues jobs **for that connection**: + - **Bootstrap:** list roots, respect quotas, prioritize recently modified. + - **Delta:** cursor per drive/site/folder/mailbox/label; persist cursor in DB. +4. **Normalizer** maps each external item to either: + - **File-like** → persist bytes + run extraction + **`indexFile`**, or + - **Virtual document** → build `contentObjects` in memory + **`indexFile`** with a synthetic `fileId` / stable external key. + +--- + +#### When the user connects **Microsoft** (Graph — SharePoint, OneDrive, Outlook, Teams) — `providerMsft` + +| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** | +|--------|----------------------------------|------------------------------|-----------| +| **SharePoint** (`SharepointAdapter`) | Document libraries: **PDF, Office, text, markdown**; list **metadata** (library name, path, item name) as `contextRef`. | Huge video blobs, raw executables, duplicates already indexed via another path. | Use **driveItem id + eTag** for revision. Respect **library/folder allowlist** on this **connection**. | +| **OneDrive** (`OneDriveAdapter`) | Same as SharePoint for **personal files** reachable through the user’s connection. | System/temp folders if exposed. | Scope = **personal** unless shared into mandate explicitly. | +| **Outlook** (`OutlookAdapter`) | **Mailbox:** subjects, **from/to/cc**, **received date**, **body** (plain or stripped HTML) per policy; **calendar** titles/locations/descriptions if product enables. | Full MIME raw, embedded images as separate media unless needed; **entire mailbox** without date window in v1. | Strong **retention + PII** policy: optional “headers + snippet only”; strip signatures/quoted threads; **attachments** → child **file-like** jobs (virus/size limits). | +| **Teams** (`TeamsAdapter`) | **Channel messages** (text), **meeting chat** exports if API allows; **files shared in channel** as file-like. | Message reactions, per-user read receipts; continuous full channel history without bounds. | Often **high volume** — default to **recent window** or **keyword/subscription** driven sync. | + +--- + +#### When the user connects **Google** (Drive, Gmail) — `providerGoogle` + +| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** | +|--------|----------------------------------|------------------------------|-----------| +| **Drive** (`DriveAdapter`) | Native Google files after **export** to Office/PDF (existing export MIME map); standard uploaded files **download → extract**. | Trashed items; shared drives the user did not authorize. | Use **file id + modifiedTime**; Google Docs need **export** before text extraction. | +| **Gmail** (`GmailAdapter`) | **Threads:** subject, participants, internalDate, **snippet** or **body** per policy; **attachments** as separate ingest jobs. | Entire “All Mail” unbounded; **labels** that are purely system. | Same mail cautions as Outlook; **Message-ID** + **History-ID**/cursor for delta. | + +--- + +#### When the user connects **ClickUp** — `providerClickup` (`AuthAuthority.CLICKUP`) + +| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** | +|--------|----------------------------------|------------------------------|-----------| +| **ClickUp** (`providerClickup`) | Task **name**, **description**, **comments**; **attachment** content if downloaded. | Activity stream noise, every status micro-change unless text changed. | Rate limits → prioritize **recently updated** tasks. | + +--- + +**Email and messaging (Outlook + Gmail via Microsoft / Google user connections) — shared cautions** + +- Default tiers: **metadata only** → **snippet** → **full body** → **attachments** (most expensive / sensitive). +- Apply **quoted-thread stripping**, **signature removal**, and **max body length** before embed. +- **Legal hold / retention:** ingestion must respect mandate **delete** and **export** rules; **disconnecting** or **revoking** the mail **connection** must **purge** mail-sourced chunks. + +### 2.3 “Account and stuff” — what to index vs. what never to index + +**Goal:** Give agents **useful, permission-safe** context (“who is this user in this mandate”, “which features are on”, “preferred language”) without creating a **second copy of sensitive credentials** in the vector store. + +| Data | Typical treatment | +|------|-------------------| +| Passwords, refresh tokens, API secrets | **Never** index; never pass through embedding pipeline. | +| Email, phone, government IDs | **Default deny**; only if product explicitly enables “index PII” with neutralization and mandate policy. | +| Display name, locale, feature entitlements (labels) | **Allow** as a small structured **UserMandateSnapshot** document regenerated on change. | +| Full `User` or `Mandate` DB row | **Avoid**; generate **curated** JSON/text snapshots with field allowlists. | + +Snapshots should be stored with the same **scope model** as file chunks (`personal`, `featureInstance`, `mandate`, `global`) so `semanticSearch` filters stay consistent. + +### 2.4 Event-driven vs. direct calls + +**Minimum viable:** each feature calls `requestIngestion` at the end of its own transaction (direct call). + +**Scalable target:** emit **domain events** (`FileCommitted`, `UserConnectionReady` / provider-specific ready event, `ProfileUpdated`) and a single **KnowledgeIngestionConsumer** subscribes. Benefits: one place for metrics, retries, and rate limits; features stay thin. + +**Storage (already implemented — not redesigned here):** The platform already uses **one** knowledge persistence stack: **`FileContentIndex`** (incl. `mandateId`, `scope`, status) and **`ContentChunk`** (pgvector embeddings, `fileId`, `userId`, `featureInstanceId`, `contextRef`, optional **`chunkMetadata`**), accessed via **`interfaceDbKnowledge`**. Chunks are **file-anchored** today; **connection- / source-specific** provenance (e.g. `connectionId`, external ids) can ride in **`contextRef` / `chunkMetadata`** until optional schema extensions are justified. **This document targets ingestion triggers and lifecycles**, not a second corpus or a duplicate storage model. + +--- + +## Teil 3 — Feature injection: retrieval vs corpus, agent loop, and real gaps + +“Injection” is ambiguous. This section uses **two** precise meanings: + +| Kind | What happens | Primary implementation today | +|------|----------------|------------------------------| +| **Retrieval injection** | Relevant **existing** chunks and workflow context are **assembled** and **inserted into the LLM prompt** (system message) each agent round. | **`AgentService.runAgent`** → `buildRagContextFn` → **`KnowledgeService.buildAgentContext`** → **`ConversationManager.injectRagContext`**. CommCoach wraps the same **`buildAgentContext`** and adds coaching-specific context. | +| **Corpus injection (indexing)** | **New** text/binary is **chunked and embedded** and written to **`interfaceDbKnowledge`** so it can be retrieved later. | **`KnowledgeService.indexFile`**; callers include **`routeDataFiles._autoIndexFile`**, **`serviceAgent`** tools (**`_documentTools`**, **`_workspaceTools`**), and **CommCoach** **`serviceCommcoachIndexer`**. | + +A feature can **already participate fully in retrieval injection** by using **`AgentService`** without ever calling **`indexFile`** in its own folder. **Corpus** growth can still happen **indirectly** (upload pipeline, agent tools). Planning must **not** label such features as “non-injecting.” + +### 3.1 Features that already use **`AgentService.runAgent`** (retrieval injection is on by default) + +These **`modules/features/*`** entry points resolve **`getService("agent", ctx)`** and stream **`agentService.runAgent(...)`** (code audit): + +- **`workspace`** (`routeFeatureWorkspace.py`) +- **`graphicalEditor`** (`routeFeatureGraphicalEditor.py`) +- **`commcoach`** (`serviceCommcoach.py` — custom **`buildRagContextFn`**, still uses platform **`buildAgentContext`** inside) + +For all three, **every agent round** gets **retrieval injection** unless RAG fails or returns empty. **Corpus** updates for the same sessions still depend on **separate** mechanisms: + +| Corpus path | When it runs | +|-------------|----------------| +| **Upload / `FileItem`** | **`routeDataFiles`** **`_autoIndexFile`** after storage (feature-agnostic). | +| **Agent tools** | If the model invokes tools in **`_documentTools`** / **`_workspaceTools`** that call **`indexFile`**, **corpus** changes **during** that agent run—implemented in **`serviceAgent`**, not in the feature’s route file. | + +So **workspace** and **graphicalEditor** **do** “inject” in the **retrieval** sense today; they **can** “inject” in the **corpus** sense when users **upload** files or when the **agent** runs indexing-capable tools. What they **often lack** is **feature-owned, automatic corpus** logic (e.g. “on every graph publish, index a snapshot”) without an upload or tool call. + +### 3.2 Features that do **not** use **`AgentService`** (no platform RAG prompt injection from this stack) + +These domains **do not** call **`runAgent`** in their **`modules/features/*`** trees (audit). They therefore **do not** receive **`buildAgentContext`** through the **workspace agent** loop: + +| Feature | Notes | +|---------|--------| +| **chatbot** | Uses an **internal** LangGraph-style flow (SQL / Tavily / answer nodes). **No** `getService("knowledge")` / **`buildAgentContext`** usage under **`modules/features/chatbot/`** in the audited tree—**retrieval injection** and the **unified corpus** are **not** wired the same way as the workspace agent. | +| **trustee** | Domain CRUD and quick actions (e.g. **`agentPrompt`** is a **UI hint** to open the workspace with a prefilled prompt—not **`AgentService` inside trustee**). Corpus: **only** via shared **upload** or if the user later uses **workspace agent** with tools. | +| **realEstate** | No **`AgentService`** hook in feature tree; same **upload** story for files. | +| **teamsbot** | Uses **`serviceAi`** (and related) for the meeting pipeline; **`sessionContext`** is **ephemeral** prompt text. **No** **`AgentService`** / **`buildAgentContext`** in the same pattern as workspace. | +| **neutralization** | **Service/pipeline** used **inside** **`indexFile`** when **`FileItem.neutralize`** applies—not a feature that “injects” either kind by itself. | + +### 3.3 Summary matrix (per `modules/features/` domain) + +| Feature | **`AgentService.runAgent`** | **Retrieval injection** (platform RAG prompt) | **Corpus injection** (typical today) | **Likely gap** (this document) | +|---------|----------------------------|-----------------------------------------------|-------------------------------------|--------------------------------| +| **workspace** | Yes | Yes | Upload **`_autoIndexFile`**; optional **`indexFile`** via agent **tools** | **Automatic** corpus for artifacts that never become **`FileItem`** or tool outputs (exports, structured summaries). | +| **graphicalEditor** | Yes | Yes | Same as workspace | **Published graph / metadata** as searchable corpus without manual upload. | +| **commcoach** | Yes | Yes (+ custom RAG layer) | Session **`indexFile`** (**`serviceCommcoachIndexer`**) + upload/tools | Extend only if new artifact types need the same **feature-local** indexer pattern. | +| **chatbot** | No | **No** (unified store) | No feature-local **`indexFile`** | Decide if chatbot should call **`buildAgentContext`** / **`indexFile`** or stay on SQL/Tavily; **FAQ / grounding** text may need **corpus** hooks. | +| **trustee** | No | Only if user works in **workspace** | Upload path; agent tools only in workspace | **Trustee-native** persist events → ingestion when files are not the only representation. | +| **realEstate** | No | Only via workspace | Upload path | Same as trustee for case/property narratives. | +| **teamsbot** | No | No | None from unified store by default | Persisted **transcripts / notes** → **`indexFile`** if they should be mandate-searchable. | +| **neutralization** | N/A | N/A | Preconditions for **`indexFile`** | Ensure all **new** ingest paths honor **`FileItem.neutralize`**. | + +### 3.4 Shared corpus mechanisms (not feature-local, but serve agent features) + +| Mechanism | Role | +|-----------|------| +| **`routeDataFiles` + `_autoIndexFile`** | Indexes **uploaded** `FileItem`s for **any** UI that uses the upload API—including workspace. | +| **`serviceAgent`** **`_documentTools`** / **`_workspaceTools`** | **Corpus** writes when the **model** chooses tools; available to **workspace** and **graphicalEditor** agent sessions (and **CommCoach** when those tools are in the toolset). | +| **CommCoach** **`serviceCommcoachIndexer`** | **Feature-local** corpus: coaching session text → **`indexFile`** without requiring an upload. | + +### 3.5 Where **additional feature-native corpus injection** is still needed + +Use this checklist **only** after accounting for §3.1–3.4: + +1. **Content is authoritative** in the feature DB or blob store **without** a guaranteed **`FileItem`** + **`_autoIndexFile`** path. +2. **Retrieval injection alone** is insufficient because nothing ever **wrote** chunks (e.g. chatbot never hits **`indexFile`**). +3. **Relying on the agent to call tools** is too fragile for compliance or UX (“user must remember to index”). + +Then add **`requestIngestion` / `indexFile`** at the **feature commit point** (or emit a domain event), with **`contextRef` / `chunkMetadata`** for **`feature_code`**, business ids, and **no secrets**. + +### 3.6 Implementation pattern (feature-native corpus only) + +1. **Commit point** — authoritative write in the feature or shared storage. +2. **Scope** — align with **`FileItem`** / **`ServiceCenterContext`** rules already used in **`indexFile`**. +3. **Unified façade** — one ingestion API; avoid a second embedding pipeline. +4. **Purge** — tie to **`fileId`**, business key, or future connector purge keys on revoke/delete. + +### 3.7 Phasing + +- **P0:** For **each** row in §3.3, confirm **retrieval** vs **corpus** paths; document “satisfied by agent+upload+tools” vs “needs feature hook.” +- **P1:** Implement **feature-native corpus** for one domain with a clear §3.5 gap (e.g. **trustee** entity text, **teamsbot** persisted transcript). +- **P2:** **Chatbot** architecture decision: integrate **`serviceKnowledge`** or keep parallel retrieval; if integrate, add explicit **corpus** rules for config/FAQ. + +--- + +## Implementation phases (suggested) + +Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog), and **Teil 3.7** (feature matrix and feature-native corpus pilots). **P0** overlaps **Teil 3.7 P0** (complete the per-feature matrix before large builds). + +| Phase | Outcome | +|-------|---------| +| **P0 — Catalog & façade** | Document all current `indexFile` call sites; wrap them in `requestIngestion`; add metrics and structured logging; complete **Teil 3.3** matrix (retrieval vs corpus vs gap) per feature. | +| **P1 — User-connection hooks** | On connection success/failure/revoke, enqueue bootstrap/delta/purge jobs per **Teil 2.2**; SharePoint and one mail provider as pilots. | +| **P2 — Profile & mandate snapshots** | Allowlisted fields only (**Teil 2.3**); regenerate on events; explicit admin toggle per mandate if needed. | +| **P3 — Event bus** | Move direct calls to async consumer where load requires it (**Teil 2.4** scalable target). | + +--- + +## Ziel und Nicht-Ziele + +**Ziel:** + +- One **ingestion contract** for all features and connector lifecycles. +- Indexing **decoupled** from the agent loop (agents may still *invoke* tools that ultimately call ingestion, but ingestion must not *depend* on an agent run). +- **Explicit** handling of connection establishment, sync, and revocation. +- **Bounded** indexing of user/mandate context with a clear PII policy. + +**Explizit NICHT:** + +- Moving **retrieval** (`buildAgentContext`) out of agents. +- Guaranteeing **real-time** indexing for every byte without async jobs (latency targets are product decisions). +- Indexing **everything** in the database “because we can”—only curated, policy-approved surfaces. + +--- + +## Betroffene Module (erwartet) + +- **Gateway:** `serviceKnowledge`, file upload routes, connector OAuth handlers, sync workers, possibly new `serviceKnowledgeIngest` or package under `modules/serviceCenter/services/`. +- **Interfaces:** `interfaceDbKnowledge` extensions for source metadata if needed. +- **Wiki / Reference:** `b-reference/gateway/ai-agent.md` (ingestion vs. retrieval) after implementation. + +--- + +## Offene Entscheidungen + +| Thema | Optionen | +|-------|----------| +| **Email bodies** | Full text vs. summary-only vs. attachment-only | +| **Multi-tenant isolation audits** | Periodic job to verify chunk `mandateId` matches connection | +| **Cost caps** | Per-mandate embedding budget; defer large backfills | +| **Neutralization** | Mandatory for certain `sourceKind`s even when not file-upload | +| **Provenance shape** | First-class DB columns vs **documented `chunkMetadata` keys** for `connectionId`, external id, revision (must support **Teil 2** purge rules). | + +--- + +## Links + +- **How-to / orientation:** [Unified knowledge & RAG ingestion (guide)](../../d-guides/unified-knowledge-rag.md) +- **Gateway reference (retrieval + knowledge):** `wiki/b-reference/gateway/architecture.md`, `wiki/b-reference/gateway/ai-agent.md` +- **Implementation touchpoints (indicative):** `gateway/modules/serviceCenter/services/serviceKnowledge/mainServiceKnowledge.py`, `gateway/modules/routes/routeDataFiles.py`, `gateway/modules/features/commcoach/serviceCommcoachIndexer.py`, agent `coreTools` `_documentTools` / `_workspaceTools`. + +## Akzeptanzkriterien (Plan-Ebene) + +| # | Kriterium | Prio | +|---|-----------|------| +| 1 | Every new **file** that should be searchable triggers ingestion **without** requiring an agent session. | must | +| 2 | **User connection** connect / disconnect has defined ingestion or purge behavior documented and implementable. | must | +| 3 | **Profile/mandate** snapshots use an explicit allowlist; secrets never enter the embedding pipeline. | must | +| 4 | Ingestion is **idempotent** for unchanged content (no duplicate embedding work). | should | +| 5 | **Teil 3.3** matrix completed: every `modules/features/*` product row has **retrieval** (agent vs none), **corpus** (upload / tools / feature indexer), and **gap** explicitly stated—not “non-injecting” if **`AgentService`** already provides retrieval injection. | should | + +--- + +## Testplan (Konzept-Verifikation) + +| ID | Frage | Methode | +|----|-------|---------| +| T1 | Sind alle bestehenden Index-Entry-Points inventarisiert? | Code-Audit + Tabelle in Build-Phase | +| T2 | Ist klar welche Features **Retrieval** (Agent) vs nur **Corpus** vs beides nutzen? | Review **Teil 3.3** Matrix gegen `runAgent` / `indexFile` Call-Sites | +| T3 | Bleibt **plattform-RAG-Retrieval** (`buildAgentContext` über `AgentService`) unveraendert in seiner Rolle fuer Workspace/Grafikeditor/CommCoach? | Review `agentLoop` + `mainServiceAgent._createBuildRagContextFn` | +| T4 | Ist Revoke/Purge fuer Connector-Chunks ohne **connectionId-Spalte** heute als **Metadata-Konvention** spezifizierbar? | Review **Teil 2.1** + **Offene Entscheidungen** Provenance | +| T5 | Ist Revoke/Purge pro **User-Connection-Authority / Integrationsoberfläche** (Teil 2.2, nur `msft` / `google` / `clickup` laut `routeDataConnections`) in einem Threat-Model abgedeckt? | Datenfluss Connection → `FileItem` / virtuelles Doc → Chunks |