Merge branch 'main' of https://github.com/valueonag/wiki

Commit 9c9b1cdd10: 1 changed file with 409 additions and 0 deletions

c-work/1-plan/2026-04-unified-knowledge-indexing-rag-concept.md (normal file, +409)

@@ -0,0 +1,409 @@
<!-- status: plan -->
<!-- started: 2026-04-16 -->
<!-- component: gateway | platform | frontend-nyla -->

# Unified Knowledge Indexing: One RAG Corpus for All Platform Information

## How to read this document

| Section | Content |
|---------|---------|
| **Description and context** | Scope (**ingestion vs retrieval**), **terminology** (feature / service / connector / interface), **as-is vs target**, business case and risks. |
| **Part 1** | Ingestion as its **own lifecycle**: façade API, idempotency, orchestration; **not** owned by `agentLoop`. |
| **Part 2** | **Triggers** (beyond upload): **user connections**, account snapshots, purge; **per-connection-type** indexing guidance; event-driven option. |
| **Part 3** | **Feature injection** split into **retrieval** (agent + `buildAgentContext`) vs **corpus** (`indexFile`); **matrix** per `modules/features/*` product; real **gaps** vs false “non-injection”. |
| **Implementation phases · Goals · AC · Test plan** | Rollout, explicit non-goals, acceptance criteria, verification. |

**Single-sentence summary:** Keep **retrieval** on **`AgentService`**; unify **when and how** the shared **`interfaceDbKnowledge`** corpus is **filled** (routes, **user connections** / integrations, features, snapshots) behind one **ingestion contract**, without assuming every product uses the workspace agent.

## Description and context

### Scope of this document

We distinguish **ingestion** (chunking, embedding, persisting into **`interfaceDbKnowledge`**) from **retrieval** (semantic search + `buildAgentContext` for the LLM). **Retrieval** from the **unified knowledge store** is consumed primarily through **`serviceAgent`** / `runAgent` (workspace, graphical editor, CommCoach). Other products (e.g. **chatbot**, **teamsbot**) may use **different** LLM stacks; **Part 3** maps which products get platform RAG and which do not. This plan does **not** mandate one global retrieval path for every feature; it **does** mandate a **single ingestion story** into the same corpus wherever that corpus is used. The gap we address is **how and when** the corpus is **filled**, not how every LLM entry point **reads** it.

### Terminology (Gateway; see `wiki/b-reference/gateway/architecture.md`)

This concept separates **feature modules**, **services**, **connectors**, and **interfaces**. Conflating them produces wrong ownership (e.g. treating “SharePoint” as a `modules/features/` product, or treating “mail” as if it were `serviceKnowledge`).

| Term | Where in Gateway | Role for indexing |
|------|------------------|-------------------|
| **Feature** | `modules/features/*` (e.g. `workspace`, `graphicalEditor`, `commcoach`, `trustee`, `chatbot`) | Product domains: UI, feature routers, orchestration. They **trigger** actions (upload, sync UX, feature-specific indexers) but must not be the **only** place that starts embedding work. |
| **Service** | `modules/serviceCenter/services/*` | Cross-cutting facades: **`serviceKnowledge`** (indexing, search, `buildAgentContext`); **`serviceExtraction`** (content objects); **`serviceChat`** (chat/workflow documents); **`serviceMessaging`** (e-mail, notifications); **`serviceAgent`** (tools that may *indirectly* call indexing). **Unified ingestion** is primarily a **service-layer** responsibility. |
| **Connector** | `modules/connectors/*` (Microsoft, Google, …) | Vendor adapters: OAuth, list, download. **SharePoint** and **mailbox** I/O live here; routes/features **call** connectors. Connectors are not interchangeable with a feature or with `serviceKnowledge`. |
| **Interface** | `modules/interfaces/*` | Persistence contracts: **`interfaceDbKnowledge`** (`FileContentIndex`, `ContentChunk`, …), **`interfaceDbManagement`** (`FileItem`, `DataSource`), **`interfaceDbApp`** (User, Mandate, `UserConnections`, Preferences). Profile, mandate, and connection rows are **interface-backed**, not a single “profile feature”. |

### What we have today (as-is)

**1. A strong technical “write” implementation, but no product-wide ingestion contract**

- **`serviceKnowledge`** (`mainServiceKnowledge.py`) already implements the heavy lifting: **`indexFile`** resolves scope from **`FileItem`** (single source of truth) and handles optional neutralization, sentence-aware chunking, **`serviceAi`** embeddings, **`ContentChunk`** persistence, and status on **`FileContentIndex`**. That is the right **unit of work** once **content objects** exist.
- **Retrieval** is also centralized in the same service: **`buildAgentContext`** composes multi-layer context for agents. So **reads** and **writes** to the vector store are **service-owned**; what is **not** unified is **who may call writes**, with which **idempotency**, and on which **lifecycle events**.

**2. Multiple invocation lanes, same underlying method**

Indexing is **operationally** reached from a **mix of layers** (not one façade):

- **HTTP routes**: e.g. the file pipeline in **`routeDataFiles`**: pre-scan / extraction → **`indexFile`**. This is the “happy path” for uploads.
- **Agent tools**: **`serviceAgent`** core tools (e.g. document/workspace helpers) can call **`indexFile`** when the user interacts through the agent. That ties **embedding** to **an agent session** even when the same file could have been indexed on upload.
- **Feature-specific code**: e.g. **CommCoach** indexer paths that call **`indexFile`** for that product’s artifacts. Correct for the feature, but it is **another** ad hoc entry point with its own assumptions.
- **Connectors**: Microsoft/Google (and similar) packages can fetch bytes and ultimately produce files or blobs; **OAuth and delta sync** are not yet modeled everywhere as **first-class ingestion lifecycles** (connect → backfill → incremental → revoke) that all funnel through the same API and metadata.

There is **no single** `requestIngestion(...)`, **no standard job identity** for “this external item revision”, and **no one place** that records “this mandate revoked access → tombstone these chunks”.

**3. Extraction vs indexing: clear in code, not enforced at the platform edge**

- **`serviceExtraction`** (and preprocessing helpers) produce **content objects**; **`indexFile`** consumes them. The boundary is clean **inside** the pipeline, but nothing **forces** every new binary or external document through a single orchestrated “extract then index” step: some paths may skip it, duplicate it, or call **`indexFile`** with partial metadata.

**4. Truth for scope and identity lives in interfaces, not in “features”**

- **`interfaceDbManagement`** (`FileItem`, …) and **`interfaceDbApp`** (mandate, `UserConnections`, user profile fields) define **who may see what**. **`indexFile`** already mirrors **`FileItem`** for scope; that pattern is **good** but **not generalized** to connector-backed items, virtual documents, or curated “account snapshot” chunks. If a connector writes under a different mental model of `mandateId` / `featureInstanceId`, **`interfaceDbKnowledge`** can drift from app/management truth without a systematic reconcile.

**5. User/mandate/profile deltas are not first-class ingestion events**

- Changes to membership, preferences, or connections update **`interfaceDbApp`** (and related tables). They affect **searchability and personalization** but are **not** consistently reflected as **versioned, allowlisted** chunks in the knowledge store, unless a feature manually adds text somewhere. That leaves agents either **under-informed** or dependent on **non-RAG** code paths for the same facts.

**Summary (as-is):** The **engine** for indexing is **`serviceKnowledge.indexFile`**; the **policy graph** for *when* to run it is **implicit** and **spread across** routes, tools, and features. **Connectors** and **account/mandate** data are **not** uniformly treated as **ingestion sources** with connect/sync/revoke semantics.

### What would make more sense (target)

**1. One ingestion façade at the service boundary (not inside `agentLoop`)**

- A small, stable API (conceptually **`requestIngestion` / `getIngestionStatus`**, implemented atop or beside **`KnowledgeService`**) that **every** lane calls: routes, feature hooks, **connector sync workers**, and (if needed) agent tools as **thin** delegates.
- **Idempotency** (content hash, external revision, `eTag`, …) enforced **here**, so routes and tools cannot accidentally **double-embed** the same logical object.

**2. Lifecycle parity for connectors and “connections”**

- **Establish** → register datasource + optional short **non-secret** summary chunk + enqueue **backfill**.
- **Delta** → incremental jobs with persisted cursors.
- **Revoke / token invalid / GDPR** → **tombstone or purge** by `connectionId` / `sourceKind`, aligned with RBAC, not ad hoc deletes scattered in UI code.

**3. Provenance / `sourceKind` (schema or `chunkMetadata`)**

- Today chunks are **file-anchored**; extended provenance (internal file vs SharePoint item vs mailbox artifact vs `profile_snapshot`, **`connectionId`** for purge, revision keys) should be **consistent**: either **first-class fields** on `ContentChunk` / index rows **or** a **defined convention** inside **`chunkMetadata` / `contextRef`** until a migration is justified. Goal: retrieval, audit, and **connector revoke** cleanup are **data-driven**, not inferred only from the call site.
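
The provenance convention above can be illustrated with a minimal in-memory sketch. It assumes provenance lives in a `chunkMetadata` dictionary; the `Chunk` class, the key names (`sourceKind`, `externalId`, `revision`, `connectionId`) and both helper functions are hypothetical stand-ins, not the existing `ContentChunk` schema or any `interfaceDbKnowledge` API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a ContentChunk row (only the fields used here)."""
    chunkId: str
    fileId: str
    chunkMetadata: dict = field(default_factory=dict)
    tombstoned: bool = False


def buildProvenance(sourceKind: str, externalId: str, revision: str,
                    connectionId: Optional[str] = None) -> dict:
    """Build one consistent provenance record for chunkMetadata (assumed key names)."""
    meta = {"sourceKind": sourceKind, "externalId": externalId, "revision": revision}
    if connectionId is not None:
        meta["connectionId"] = connectionId
    return meta


def tombstoneByConnection(chunks: list, connectionId: str) -> int:
    """Mark every live chunk of one connection as tombstoned; return how many were hit."""
    hit = 0
    for chunk in chunks:
        if chunk.chunkMetadata.get("connectionId") == connectionId and not chunk.tombstoned:
            chunk.tombstoned = True
            hit += 1
    return hit
```

Because the purge filters on stored metadata, it does not need to know which route, tool, or feature wrote the chunk; running it twice is harmless.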

**4. Curated snapshots for interface-backed facts**

- **Allowlisted** projections of mandate membership, locale, entitlements (labels), etc., regenerated on **interface-level** events, **not** dumps of full user rows or secrets into embeddings.

**5. Keep retrieval exactly where it is**

- **`buildAgentContext`** remains the agent’s way to **consume** the corpus; ingestion only ensures that corpus is **complete, scoped, and attributable** when the agent runs.

**6. Observability and cost in one place**

- Queue depth, embedding spend, failures, and “skipped duplicate” counts attach to the **ingestion façade**, not to each feature.

### Business goal

Whenever **meaningful information** appears (files, bytes from **connectors**, configuration that should shape answers, and **bounded** user/mandate context), the platform should **ingest it once** into a **unified, scoped** knowledge layer so agents see **one coherent corpus** with clear **provenance** and **permissions**.

### Why this matters now

Information deltas arrive through **routes**, **features**, **`serviceAgent`** tools, **connectors**, and **`interfaceDbApp`** / **`interfaceDbManagement`** updates. Without **one** ingestion contract and triggers per **source**, you get: **missing** indexes, **duplicate** work, **scope drift** between knowledge rows and app truth, and **repeated** engineering per entry path instead of **once** at the **service** layer.

### Risk if we do not unify

Fragmented memory, inconsistent agent answers, compliance gaps (over-indexing sensitive fields or under-indexing allowed context), and duplicated work **per route/feature/tool** instead of at a **single service boundary**.

---

## Part 1: Indexing as its own lifecycle (not owned by the agent)

### 1.1 Current useful core

*(Same technical point as **“What we have today” §1** above; repeated here for readers who start at Part 1.)* After structured **content objects** exist, **`KnowledgeService.indexFile`** performs chunking, embedding (via **`AiService`**), neutralization when required, and persistence via **`interfaceDbKnowledge`**. The **gap** is not the lack of a service method but the lack of a **single product-wide contract** for *when* and *what* enters that pipeline.

### 1.2 Target responsibility split

| Concern | Owner | Notes |
|--------|--------|--------|
| **Ingestion** (normalize → chunk → embed → store) | **Knowledge ingestion service** (logical module; may remain `KnowledgeService` + new façade) | No dependency on `agentLoop`. |
| **Retrieval** (query → ranked context string) | **Agent** (and similar LLM entry points) | Unchanged by this concept. |
| **Orchestration** (queues, retries, backoff) | **Job runner / worker** (new or existing infra) | Keeps API latency low. |

### 1.3 Public ingestion contract (conceptual)

Introduce a small, stable API surface that **all features** call, never “only if an agent runs”:

- **`requestIngestion(job: IngestionJob) -> IngestionHandle`**
  - Idempotency key: `(sourceKind, sourceId, contentVersion | hash)`
  - Returns immediately with `queued` / `duplicate` / `skipped` and an optional `jobId` for status polling.

- **`getIngestionStatus(handle)`**
  - Surfaces the same states already used on `FileContentIndex` (`pending`, `extracted`, `embedding`, `indexed`, `failed`) plus connection- or source-specific substates if needed.

The implementation can stay in-process at first (asyncio task queue) and move to Redis/Celery/ARQ later without changing callers.
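
A minimal in-process sketch of this contract is shown below. It is an illustration under stated assumptions, not the planned implementation: `IngestionFacade`, the in-memory dictionaries, and the `markStatus` worker hook are hypothetical, the `skipped` outcome is not modeled, and no actual queueing or embedding happens. Only the two method names, the idempotency key, and the state names come from the text above.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# States named in the text for FileContentIndex.
STATES = ("pending", "extracted", "embedding", "indexed", "failed")


@dataclass(frozen=True)
class IngestionJob:
    sourceKind: str
    sourceId: str
    contentVersion: str  # external revision or content hash


@dataclass(frozen=True)
class IngestionHandle:
    status: str            # "queued" or "duplicate" in this sketch
    jobId: Optional[str]


class IngestionFacade:
    """Hypothetical facade; keeps jobs in memory instead of a real queue."""

    def __init__(self) -> None:
        self._jobIdByKey = {}    # idempotency key -> jobId
        self._statusByJob = {}   # jobId -> state

    def requestIngestion(self, job: IngestionJob) -> IngestionHandle:
        key = (job.sourceKind, job.sourceId, job.contentVersion)
        existing = self._jobIdByKey.get(key)
        if existing is not None:
            # Same logical object and version: do not embed again.
            return IngestionHandle("duplicate", existing)
        jobId = uuid.uuid4().hex
        self._jobIdByKey[key] = jobId
        self._statusByJob[jobId] = "pending"
        return IngestionHandle("queued", jobId)

    def getIngestionStatus(self, handle: IngestionHandle) -> str:
        return self._statusByJob[handle.jobId]

    def markStatus(self, jobId: str, status: str) -> None:
        """Worker-side hook (assumption): advance a job through the known states."""
        if status not in STATES:
            raise ValueError(f"unknown status: {status}")
        self._statusByJob[jobId] = status
```

Callers only see `requestIngestion` and `getIngestionStatus`, so swapping the dictionaries for a persistent queue would not change their code.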

### 1.4 Idempotency and versioning

- **Re-index** when content changes: compare **content hash** or **external revision** (SharePoint `eTag`, email `Message-ID` + folder cursor, file `updatedAt`).
- **Skip** when the hash is unchanged to control embedding cost.
- **Tombstone** or **scope-disable** when a source is deleted or access revoked (see Part 2).
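
The three rules above reduce to a small decision function. The sketch below assumes a revision is an opaque string (a SHA-256 content hash for internal bytes, or a provider revision such as an `eTag`); the function names and the returned action labels are hypothetical.

```python
import hashlib
from typing import Optional


def contentHash(data: bytes) -> str:
    """Revision key for sources without an external revision: SHA-256 of the bytes."""
    return hashlib.sha256(data).hexdigest()


def decideAction(storedRevision: Optional[str], incomingRevision: Optional[str],
                 sourceAvailable: bool = True) -> str:
    """Return "tombstone", "index", "skip", or "reindex" for one logical source item."""
    if not sourceAvailable:
        return "tombstone"   # deleted upstream or access revoked
    if storedRevision is None:
        return "index"       # never seen before
    if storedRevision == incomingRevision:
        return "skip"        # unchanged: no embedding cost
    return "reindex"         # content or external revision changed
```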

---

## Part 2: Triggers are not only “file write”, but every information delta

“Write path” is too narrow if we read it as “HTTP upload only”. The unified model should treat **any authoritative addition or change of platform-visible information** as a potential ingestion trigger.

### 2.1 Trigger taxonomy

| Trigger category | Examples | Ingestion behavior (conceptual) |
|------------------|----------|----------------------------------|
| **Artifact persisted** | User uploads PDF; pasted text saved as file; export from a feature | Existing pipeline: extract → `indexFile` (or equivalent). |
| **User connection added / re-authorized** | SharePoint OAuth success; Microsoft/Google mail connection; new API credential with data scope | **Register datasource** + enqueue **initial sync** (backfill) + index a **short connection summary document** (site name, root path, principal, *no secrets*). |
| **Sync for an existing connection** | Scheduled delta; webhook (if available); manual “refresh” | Incremental fetch → map to content objects or rows → **upsert** chunks keyed by external id. |
| **Connection revoked / token invalid** | User disconnects; admin removes mandate integration | **Tombstone** or **purge** chunks keyed by **connection / external source** (today: enforce via **`chunkMetadata` / `contextRef`** convention or future columns); ensure retrieval never serves stale data from that connection. |
| **Mandate / membership** | User added to mandate; role change; feature instance attached | Regenerate **mandate-safe summary** documents (see §2.3) if policy allows; **re-resolve scope** for existing chunks (this may be heavy; it is often better to store an immutable `mandateId` on chunks at write time and rely on retrieval filters). |
| **User profile (bounded)** | Display name, locale, timezone, **non-sensitive** preferences | Optional **UserContextDocument** for personalization, not a dump of the whole `User` row. |
| **Feature configuration** | Instance labels, data source labels, automation descriptions | If they should influence answers, emit structured **FeatureConfigSnapshot** chunks (small, text-first). |
| **Artifact deleted / data subject erasure** | User deletes a stored file; mandate/user erase | Purge or tombstone the corresponding **`FileContentIndex` / `ContentChunk`** rows (by `fileId`); erasure jobs cascade by **`userId`** / mandate policy. **Connection-wide** revoke is covered by the **“Connection revoked / token invalid”** row above. |

### 2.2 User connections (added by the user) as first-class ingestion sources: lifecycle and **what to index per connection type**

**Conceptual focus:** The trigger is an OAuth success, a saved credential, or a linked account in **`UserConnection`** that grants access to an external system. **Implementation** still flows through provider code under `gateway/modules/connectors/` (e.g. **`providerMsft`**, **`providerGoogle`**, **`providerClickup`**); that mapping is **technical**, not the product wording.

**Scope (what counts as a user connection here):** `gateway/modules/routes/routeDataConnections.py` only allows **creating** connections with `type` **`msft`**, **`google`**, or **`clickup`** (`create_connection` → OAuth via `connect_service`). The **authorities options** endpoint also lists **`local`**, but that path is **not** wired in `create_connection`. **This subsection only covers those user-connection authorities** (plus the surfaces each OAuth integration can reach, e.g. Graph mail for Microsoft). Other Gateway connector packages (FTP, Jira, preprocessor, outbound-only mail, geo APIs, …) are **out of scope** in §2.2 until they are exposed the same way as **`UserConnection`** rows.

**Cross-cutting rules (every user-added connection):**

- **Never index:** OAuth tokens, refresh payloads, raw credentials, webhook signing secrets.
- **Always safe to index (metadata only):** human-readable **connection** label, tenant/site name, root path / mailbox address **as display string**, last sync cursor (store in DB, not necessarily as embedding), **external id** + **revision** for idempotency.
- **Prefer file pipeline for binaries:** download → store as `FileItem` (or equivalent) → reuse existing **extract → `indexFile`** path so neutralization and scope mirror upload behavior.
- **Prefer virtual documents** for small text-native items (mail headers/snippets, issue titles/descriptions) to avoid N binary copies.
- **Quota:** per-mandate max documents, max bytes, and “index only last N days” for mail are **product** knobs, not defaults baked into each adapter.
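
The quota rule can be kept outside the adapters as one pure function that the sync planner applies to whatever an adapter lists. This is a sketch under assumptions: `ExternalItem`, `applyQuota`, and the three knob names are hypothetical, and the "newest first" ordering is one possible prioritization.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ExternalItem:
    """Hypothetical listing entry returned by a connector adapter."""
    externalId: str
    sizeBytes: int
    modifiedAt: datetime


def applyQuota(items, maxDocuments: int, maxBytes: int, maxAgeDays: int, now: datetime):
    """Pick the newest items that fit the document, byte, and age limits."""
    cutoff = now - timedelta(days=maxAgeDays)
    selected, usedBytes = [], 0
    for item in sorted(items, key=lambda i: i.modifiedAt, reverse=True):
        if item.modifiedAt < cutoff:
            break      # sorted newest first, so everything after this is older
        if len(selected) >= maxDocuments:
            break
        if usedBytes + item.sizeBytes > maxBytes:
            continue   # too large for the remaining budget; a smaller item may still fit
        selected.append(item)
        usedBytes += item.sizeBytes
    return selected
```

Keeping the limits as parameters means the same adapter can serve mandates with different product settings.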

**Lifecycle pattern (target), tied to the connection row, not to “a connector class”:**

1. **Connection event** (`ConnectionEstablished`) fires when the user **adds** or **re-authorizes** a connection (OAuth / credential storage, **`UserConnection`**, authority **`msft`**, **`google`**, or **`clickup`** per current API).
2. **Ingestion registry** records: `{ connectionId, featureCode, mandateId, userId, scope, externalRoot, adapterKind }` (adapter kind = which integration backs this connection).
3. **Sync planner** enqueues jobs **for that connection**:
   - **Bootstrap:** list roots, respect quotas, prioritize recently modified.
   - **Delta:** cursor per drive/site/folder/mailbox/label; persist cursor in DB.
4. **Normalizer** maps each external item to either:
   - **File-like** → persist bytes + run extraction + **`indexFile`**, or
   - **Virtual document** → build `contentObjects` in memory + **`indexFile`** with a synthetic `fileId` / stable external key.
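
Step 4 can be sketched as follows. The important property is that the synthetic `fileId` is derived deterministically from the connection and the external id, so a re-sync of the same item targets the same index entry. Everything named here is an assumption for illustration: the item `kind` values, the namespace URL, the shape of `contentObjects`, and the returned dictionaries are hypothetical and do not reflect the real `indexFile` input.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
INGESTION_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/knowledge-ingestion")

# Assumed item kinds that are small and text-native (mail, tasks, chat messages).
VIRTUAL_KINDS = {"mail_message", "task", "channel_message"}


def syntheticFileId(connectionId: str, externalId: str) -> str:
    """Stable id for a virtual document: same inputs always give the same UUID."""
    return str(uuid.uuid5(INGESTION_NAMESPACE, f"{connectionId}:{externalId}"))


def normalize(connectionId: str, item: dict) -> dict:
    """Route one external item to the virtual-document or the file-like lane."""
    if item["kind"] in VIRTUAL_KINDS:
        return {
            "route": "virtual",
            "fileId": syntheticFileId(connectionId, item["externalId"]),
            # Placeholder shape; the real contentObjects structure is not shown in this document.
            "contentObjects": [{"text": item["text"]}],
        }
    # File-like: bytes are downloaded and stored first, then extraction and indexing run.
    return {"route": "file", "externalId": item["externalId"]}
```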

---

#### When the user connects **Microsoft** (Graph: SharePoint, OneDrive, Outlook, Teams): `providerMsft`

| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
|--------|----------------------------------|------------------------------|-----------|
| **SharePoint** (`SharepointAdapter`) | Document libraries: **PDF, Office, text, markdown**; list **metadata** (library name, path, item name) as `contextRef`. | Huge video blobs, raw executables, duplicates already indexed via another path. | Use **driveItem id + eTag** for revision. Respect **library/folder allowlist** on this **connection**. |
| **OneDrive** (`OneDriveAdapter`) | Same as SharePoint for **personal files** reachable through the user’s connection. | System/temp folders if exposed. | Scope = **personal** unless shared into mandate explicitly. |
| **Outlook** (`OutlookAdapter`) | **Mailbox:** subjects, **from/to/cc**, **received date**, **body** (plain or stripped HTML) per policy; **calendar** titles/locations/descriptions if product enables. | Full MIME raw, embedded images as separate media unless needed; **entire mailbox** without date window in v1. | Strong **retention + PII** policy: optional “headers + snippet only”; strip signatures/quoted threads; **attachments** → child **file-like** jobs (virus/size limits). |
| **Teams** (`TeamsAdapter`) | **Channel messages** (text), **meeting chat** exports if API allows; **files shared in channel** as file-like. | Message reactions, per-user read receipts; continuous full channel history without bounds. | Often **high volume**; default to a **recent window** or **keyword/subscription**-driven sync. |

---

#### When the user connects **Google** (Drive, Gmail): `providerGoogle`

| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
|--------|----------------------------------|------------------------------|-----------|
| **Drive** (`DriveAdapter`) | Native Google files after **export** to Office/PDF (existing export MIME map); standard uploaded files **download → extract**. | Trashed items; shared drives the user did not authorize. | Use **file id + modifiedTime**; Google Docs need **export** before text extraction. |
| **Gmail** (`GmailAdapter`) | **Threads:** subject, participants, internalDate, **snippet** or **body** per policy; **attachments** as separate ingest jobs. | Entire “All Mail” unbounded; **labels** that are purely system. | Same mail cautions as Outlook; **Message-ID** + **History-ID**/cursor for delta. |

---

#### When the user connects **ClickUp**: `providerClickup` (`AuthAuthority.CLICKUP`)

| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
|--------|----------------------------------|------------------------------|-----------|
| **ClickUp** (`providerClickup`) | Task **name**, **description**, **comments**; **attachment** content if downloaded. | Activity stream noise, every status micro-change unless text changed. | Rate limits → prioritize **recently updated** tasks. |

---

**Email and messaging (Outlook + Gmail via Microsoft / Google user connections): shared cautions**

- Default tiers: **metadata only** → **snippet** → **full body** → **attachments** (most expensive / sensitive).
- Apply **quoted-thread stripping**, **signature removal**, and **max body length** before embed.
- **Legal hold / retention:** ingestion must respect mandate **delete** and **export** rules; **disconnecting** or **revoking** the mail **connection** must **purge** mail-sourced chunks.
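
The body-preparation step from the list above could look like the sketch below. It is deliberately simple and heuristic: it assumes plain-text bodies, treats lines starting with `>` as quoted text, recognizes only the English reply header form "On ... wrote:", and uses the conventional `-- ` signature delimiter. The function name and the default length limit are assumptions, and real mail will need provider- and language-specific handling.

```python
import re

# Heuristic: English-style reply header that introduces a quoted thread.
QUOTE_HEADER = re.compile(r"^On .+ wrote:\s*$")


def prepareMailBody(body: str, maxChars: int = 4000) -> str:
    """Strip quoted threads and signature, then cap the length before embedding."""
    kept = []
    for line in body.splitlines():
        if line in ("-- ", "--"):
            break              # signature delimiter: drop everything below
        if QUOTE_HEADER.match(line):
            break              # start of the quoted previous message
        if line.lstrip().startswith(">"):
            continue           # inline quoted line
        kept.append(line)
    return "\n".join(kept).strip()[:maxChars]
```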

### 2.3 “Account and stuff”: what to index vs. what never to index

**Goal:** Give agents **useful, permission-safe** context (“who is this user in this mandate”, “which features are on”, “preferred language”) without creating a **second copy of sensitive credentials** in the vector store.

| Data | Typical treatment |
|------|-------------------|
| Passwords, refresh tokens, API secrets | **Never** index; never pass through embedding pipeline. |
| Email, phone, government IDs | **Default deny**; only if product explicitly enables “index PII” with neutralization and mandate policy. |
| Display name, locale, feature entitlements (labels) | **Allow** as a small structured **UserMandateSnapshot** document regenerated on change. |
| Full `User` or `Mandate` DB row | **Avoid**; generate **curated** JSON/text snapshots with field allowlists. |

Snapshots should be stored with the same **scope model** as file chunks (`personal`, `featureInstance`, `mandate`, `global`) so `semanticSearch` filters stay consistent.
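
A field allowlist is the simplest way to guarantee that a snapshot cannot leak columns added to the row later. The sketch below assumes the row is available as a dictionary; the field names in `SNAPSHOT_ALLOWLIST`, the function name, and the returned structure are hypothetical, while the four scope values are the ones listed above.

```python
import json

SCOPES = ("personal", "featureInstance", "mandate", "global")

# Assumed field names; only these ever reach the snapshot text.
SNAPSHOT_ALLOWLIST = ("displayName", "locale", "entitlements")


def buildUserMandateSnapshot(userRow: dict, mandateId: str, scope: str = "mandate") -> dict:
    """Project a user row onto the allowlist and tag it with the chunk scope model."""
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope}")
    fields = {key: userRow[key] for key in SNAPSHOT_ALLOWLIST if key in userRow}
    return {
        "scope": scope,
        "mandateId": mandateId,
        # Deterministic text so an unchanged row yields an unchanged content hash.
        "text": json.dumps(fields, sort_keys=True, ensure_ascii=False),
    }
```

Because the text is deterministic, hashing it gives a revision key that lets an unchanged profile be skipped instead of re-embedded.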

### 2.4 Event-driven vs. direct calls

**Minimum viable:** each feature calls `requestIngestion` at the end of its own transaction (direct call).

**Scalable target:** emit **domain events** (`FileCommitted`, `UserConnectionReady` / provider-specific ready event, `ProfileUpdated`) and a single **KnowledgeIngestionConsumer** subscribes. Benefits: one place for metrics, retries, and rate limits; features stay thin.
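
The event-driven variant can be sketched with a tiny synchronous bus. This is an illustration only: `EventBus`, the event payload keys (`sourceId`, `version`), the mapping from event type to `sourceKind`, and the three-argument `requestIngestion` callable are all assumptions; a real deployment would use the platform's own eventing and the façade signature from §1.3.

```python
from collections import defaultdict

# Assumed mapping from the domain events named above to a sourceKind label.
EVENT_SOURCE_KIND = {
    "FileCommitted": "internal_file",
    "UserConnectionReady": "connection",
    "ProfileUpdated": "profile_snapshot",
}


class EventBus:
    """Minimal synchronous publish/subscribe helper (hypothetical)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, eventType, handler):
        self._handlers[eventType].append(handler)

    def emit(self, eventType, payload):
        for handler in list(self._handlers[eventType]):
            handler(payload)


class KnowledgeIngestionConsumer:
    """Single subscriber that turns domain events into ingestion requests."""

    def __init__(self, bus, requestIngestion):
        self._requestIngestion = requestIngestion
        self.handled = 0   # one place to hang metrics
        for eventType, sourceKind in EVENT_SOURCE_KIND.items():
            bus.subscribe(eventType, self._makeHandler(sourceKind))

    def _makeHandler(self, sourceKind):
        def handle(payload):
            self._requestIngestion(sourceKind, payload["sourceId"], payload["version"])
            self.handled += 1
        return handle
```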

**Storage (already implemented; not redesigned here):** The platform already uses **one** knowledge persistence stack: **`FileContentIndex`** (incl. `mandateId`, `scope`, status) and **`ContentChunk`** (pgvector embeddings, `fileId`, `userId`, `featureInstanceId`, `contextRef`, optional **`chunkMetadata`**), accessed via **`interfaceDbKnowledge`**. Chunks are **file-anchored** today; **connection- / source-specific** provenance (e.g. `connectionId`, external ids) can ride in **`contextRef` / `chunkMetadata`** until optional schema extensions are justified. **This document targets ingestion triggers and lifecycles**, not a second corpus or a duplicate storage model.

---

## Part 3: Feature injection (retrieval vs corpus, agent loop, and real gaps)

“Injection” is ambiguous. This section uses **two** precise meanings:

| Kind | What happens | Primary implementation today |
|------|----------------|------------------------------|
| **Retrieval injection** | Relevant **existing** chunks and workflow context are **assembled** and **inserted into the LLM prompt** (system message) each agent round. | **`AgentService.runAgent`** → `buildRagContextFn` → **`KnowledgeService.buildAgentContext`** → **`ConversationManager.injectRagContext`**. CommCoach wraps the same **`buildAgentContext`** and adds coaching-specific context. |
| **Corpus injection (indexing)** | **New** text/binary is **chunked and embedded** and written to **`interfaceDbKnowledge`** so it can be retrieved later. | **`KnowledgeService.indexFile`**; callers include **`routeDataFiles._autoIndexFile`**, **`serviceAgent`** tools (**`_documentTools`**, **`_workspaceTools`**), and **CommCoach** **`serviceCommcoachIndexer`**. |

A feature can **already participate fully in retrieval injection** by using **`AgentService`** without ever calling **`indexFile`** in its own folder. **Corpus** growth can still happen **indirectly** (upload pipeline, agent tools). Planning must **not** label such features as “non-injecting.”

### 3.1 Features that already use **`AgentService.runAgent`** (retrieval injection is on by default)

These **`modules/features/*`** entry points resolve **`getService("agent", ctx)`** and stream **`agentService.runAgent(...)`** (code audit):

- **`workspace`** (`routeFeatureWorkspace.py`)
- **`graphicalEditor`** (`routeFeatureGraphicalEditor.py`)
- **`commcoach`** (`serviceCommcoach.py`; custom **`buildRagContextFn`** that still uses the platform **`buildAgentContext`** inside)

For all three, **every agent round** gets **retrieval injection** unless RAG fails or returns empty. **Corpus** updates for the same sessions still depend on **separate** mechanisms:

| Corpus path | When it runs |
|-------------|----------------|
| **Upload / `FileItem`** | **`routeDataFiles`** **`_autoIndexFile`** after storage (feature-agnostic). |
| **Agent tools** | If the model invokes tools in **`_documentTools`** / **`_workspaceTools`** that call **`indexFile`**, the **corpus** changes **during** that agent run; this is implemented in **`serviceAgent`**, not in the feature’s route file. |

So **workspace** and **graphicalEditor** **do** “inject” in the **retrieval** sense today; they **can** “inject” in the **corpus** sense when users **upload** files or when the **agent** runs indexing-capable tools. What they **often lack** is **feature-owned, automatic corpus** logic (e.g. “on every graph publish, index a snapshot”) that works without an upload or tool call.

### 3.2 Features that do **not** use **`AgentService`** (no platform RAG prompt injection from this stack)

These domains **do not** call **`runAgent`** in their **`modules/features/*`** trees (audit). They therefore **do not** receive **`buildAgentContext`** through the **workspace agent** loop:

| Feature | Notes |
|---------|--------|
| **chatbot** | Uses an **internal** LangGraph-style flow (SQL / Tavily / answer nodes). **No** `getService("knowledge")` / **`buildAgentContext`** usage under **`modules/features/chatbot/`** in the audited tree; **retrieval injection** and the **unified corpus** are **not** wired the same way as the workspace agent. |
| **trustee** | Domain CRUD and quick actions (e.g. **`agentPrompt`** is a **UI hint** to open the workspace with a prefilled prompt, not **`AgentService` inside trustee**). Corpus: **only** via shared **upload** or if the user later uses the **workspace agent** with tools. |
| **realEstate** | No **`AgentService`** hook in feature tree; same **upload** story for files. |
| **teamsbot** | Uses **`serviceAi`** (and related) for the meeting pipeline; **`sessionContext`** is **ephemeral** prompt text. **No** **`AgentService`** / **`buildAgentContext`** in the same pattern as workspace. |
| **neutralization** | **Service/pipeline** used **inside** **`indexFile`** when **`FileItem.neutralize`** applies; not a feature that “injects” either kind by itself. |

### 3.3 Summary matrix (per `modules/features/` domain)

| Feature | **`AgentService.runAgent`** | **Retrieval injection** (platform RAG prompt) | **Corpus injection** (typical today) | **Likely gap** (this document) |
|---------|----------------------------|-----------------------------------------------|-------------------------------------|--------------------------------|
| **workspace** | Yes | Yes | Upload **`_autoIndexFile`**; optional **`indexFile`** via agent **tools** | **Automatic** corpus for artifacts that never become **`FileItem`** or tool outputs (exports, structured summaries). |
| **graphicalEditor** | Yes | Yes | Same as workspace | **Published graph / metadata** as searchable corpus without manual upload. |
| **commcoach** | Yes | Yes (+ custom RAG layer) | Session **`indexFile`** (**`serviceCommcoachIndexer`**) + upload/tools | Extend only if new artifact types need the same **feature-local** indexer pattern. |
| **chatbot** | No | **No** (unified store) | No feature-local **`indexFile`** | Decide if chatbot should call **`buildAgentContext`** / **`indexFile`** or stay on SQL/Tavily; **FAQ / grounding** text may need **corpus** hooks. |
| **trustee** | No | Only if user works in **workspace** | Upload path; agent tools only in workspace | **Trustee-native** persist events → ingestion when files are not the only representation. |
| **realEstate** | No | Only via workspace | Upload path | Same as trustee for case/property narratives. |
| **teamsbot** | No | No | None from unified store by default | Persisted **transcripts / notes** → **`indexFile`** if they should be mandate-searchable. |
| **neutralization** | N/A | N/A | Preconditions for **`indexFile`** | Ensure all **new** ingest paths honor **`FileItem.neutralize`**. |
|
||||
|
||||
### 3.4 Shared corpus mechanisms (not feature-local, but serve agent features)

| Mechanism | Role |
|-----------|------|
| **`routeDataFiles` + `_autoIndexFile`** | Indexes **uploaded** `FileItem`s for **any** UI that uses the upload API, including workspace. |
| **`serviceAgent`** **`_documentTools`** / **`_workspaceTools`** | **Corpus** writes when the **model** chooses tools; available to **workspace** and **graphicalEditor** agent sessions (and **CommCoach** when those tools are in the toolset). |
| **CommCoach** **`serviceCommcoachIndexer`** | **Feature-local** corpus: coaching session text → **`indexFile`** without requiring an upload. |
### 3.5 Where **additional feature-native corpus injection** is still needed

Use this checklist **only** after accounting for §3.1–3.4:

1. **Content is authoritative** in the feature DB or blob store **without** a guaranteed **`FileItem`** + **`_autoIndexFile`** path.
2. **Retrieval injection alone** is insufficient because nothing ever **wrote** chunks (e.g. chatbot never hits **`indexFile`**).
3. **Relying on the agent to call tools** is too fragile for compliance or UX (“user must remember to index”).

Then add **`requestIngestion` / `indexFile`** at the **feature commit point** (or emit a domain event), with **`contextRef` / `chunkMetadata`** that carry **`feature_code`** and business ids but **no secrets**.
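The commit-point hook described in this section can be sketched roughly as follows. This is a minimal illustration, not the gateway's implementation: `onFeatureCommit`, `IngestionRequest`, `ALLOWED_METADATA_KEYS`, and the shape of the `requestIngestion` callable are assumptions made for the example; only the names `requestIngestion`, `chunkMetadata`, and `feature_code` come from this plan.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Hypothetical allowlist: only these caller-supplied keys may reach chunkMetadata.
ALLOWED_METADATA_KEYS = frozenset({"title", "entityType", "revision"})


@dataclass
class IngestionRequest:
    """Hypothetical payload handed to the ingestion facade."""
    featureCode: str
    businessId: str
    text: str
    chunkMetadata: Dict[str, Any] = field(default_factory=dict)


def onFeatureCommit(
    featureCode: str,
    businessId: str,
    text: str,
    metadata: Dict[str, Any],
    requestIngestion: Callable[[IngestionRequest], None],
) -> IngestionRequest:
    """Call right after the authoritative write; forwards text plus safe metadata."""
    # Keep allowlisted keys only, so tokens or credentials never enter the pipeline.
    safeMetadata = {k: v for k, v in metadata.items() if k in ALLOWED_METADATA_KEYS}
    safeMetadata["feature_code"] = featureCode
    safeMetadata["business_id"] = businessId
    request = IngestionRequest(featureCode, businessId, text, safeMetadata)
    requestIngestion(request)
    return request
```

Passing the façade in as a callable keeps the feature code free of any embedding logic, so a direct call can later be swapped for an event emitter without touching the commit point.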
### 3.6 Implementation pattern (feature-native corpus only)

1. **Commit point:** authoritative write in the feature or shared storage.
2. **Scope:** align with **`FileItem`** / **`ServiceCenterContext`** rules already used in **`indexFile`**.
3. **Unified façade:** one ingestion API; avoid a second embedding pipeline.
4. **Purge:** tie to **`fileId`**, business key, or future connector purge keys on revoke/delete.
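The purge step of the pattern above depends on every chunk being findable by its purge key. A rough sketch of that bookkeeping, using an in-memory stand-in rather than **`interfaceDbKnowledge`** (the class `InMemoryChunkStore` and its methods are hypothetical and exist only for illustration):

```python
from collections import defaultdict
from typing import Dict, Set


class InMemoryChunkStore:
    """Stand-in for the knowledge store; indexes chunk ids by purge key."""

    def __init__(self) -> None:
        self._chunks: Dict[str, str] = {}                        # chunkId -> text
        self._byPurgeKey: Dict[str, Set[str]] = defaultdict(set)  # purgeKey -> chunkIds

    def add(self, chunkId: str, text: str, purgeKey: str) -> None:
        """Store a chunk and remember which key (fileId, business key, ...) owns it."""
        self._chunks[chunkId] = text
        self._byPurgeKey[purgeKey].add(chunkId)

    def purge(self, purgeKey: str) -> int:
        """Delete every chunk owned by purgeKey; returns the number removed."""
        chunkIds = self._byPurgeKey.pop(purgeKey, set())
        for chunkId in chunkIds:
            self._chunks.pop(chunkId, None)
        return len(chunkIds)

    def __len__(self) -> int:
        return len(self._chunks)
```

Recording the purge key at write time is what makes revoke/delete a lookup instead of a scan; whether that key lives in a column or in `chunkMetadata` is left open by this plan.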
### 3.7 Phasing

- **P0:** For **each** row in §3.3, confirm **retrieval** vs **corpus** paths; document “satisfied by agent+upload+tools” vs “needs feature hook.”
- **P1:** Implement **feature-native corpus** for one domain with a clear §3.5 gap (e.g. **trustee** entity text, **teamsbot** persisted transcript).
- **P2:** **Chatbot** architecture decision: integrate **`serviceKnowledge`** or keep parallel retrieval; if integrating, add explicit **corpus** rules for config/FAQ.

---
## Implementation phases (suggested)

Phases align with **Part 1** (façade), **Part 2** (connector + trigger catalog), and **Part 3.7** (feature matrix and feature-native corpus pilots). **P0** overlaps **Part 3.7 P0** (complete the per-feature matrix before large builds).

| Phase | Outcome |
|-------|---------|
| **P0: Catalog & façade** | Document all current `indexFile` call sites; wrap them in `requestIngestion`; add metrics and structured logging; complete **Part 3.3** matrix (retrieval vs corpus vs gap) per feature. |
| **P1: User-connection hooks** | On connection success/failure/revoke, enqueue bootstrap/delta/purge jobs per **Part 2.2**; SharePoint and one mail provider as pilots. |
| **P2: Profile & mandate snapshots** | Allowlisted fields only (**Part 2.3**); regenerate on events; explicit admin toggle per mandate if needed. |
| **P3: Event bus** | Move direct calls to async consumer where load requires it (**Part 2.4** scalable target). |
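The P1 outcome (connection lifecycle events enqueue ingestion jobs) could look roughly like the sketch below. The event names, the event-to-job mapping, and the `IngestionJob` type are assumptions made for illustration; the per-connection rules in section 2.2 of this plan remain authoritative, and a real queue would replace the `deque`.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional

# Assumed mapping from lifecycle event to job kind; not taken from the codebase.
JOB_FOR_EVENT = {"connected": "bootstrap", "changed": "delta", "revoked": "purge"}


@dataclass(frozen=True)
class IngestionJob:
    kind: str
    connectionId: str
    mandateId: str


def enqueueForConnectionEvent(
    queue: Deque[IngestionJob], event: str, connectionId: str, mandateId: str
) -> Optional[IngestionJob]:
    """Translate a connection event into at most one queued job."""
    kind = JOB_FOR_EVENT.get(event)
    if kind is None:
        # Events without a mapping (e.g. a failed connection attempt) enqueue nothing.
        return None
    job = IngestionJob(kind, connectionId, mandateId)
    queue.append(job)
    return job
```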
---

## Goals and non-goals

**Goals:**

- One **ingestion contract** for all features and connector lifecycles.
- Indexing **decoupled** from the agent loop (agents may still *invoke* tools that ultimately call ingestion, but ingestion must not *depend* on an agent run).
- **Explicit** handling of connection establishment, sync, and revocation.
- **Bounded** indexing of user/mandate context with a clear PII policy.

**Explicitly NOT:**

- Moving **retrieval** (`buildAgentContext`) out of agents.
- Guaranteeing **real-time** indexing for every byte without async jobs (latency targets are product decisions).
- Indexing **everything** in the database “because we can”; only curated, policy-approved surfaces are indexed.
---

## Affected modules (expected)

- **Gateway:** `serviceKnowledge`, file upload routes, connector OAuth handlers, sync workers, possibly a new `serviceKnowledgeIngest` or a package under `modules/serviceCenter/services/`.
- **Interfaces:** `interfaceDbKnowledge` extensions for source metadata if needed.
- **Wiki / Reference:** `b-reference/gateway/ai-agent.md` (ingestion vs. retrieval) after implementation.
---

## Open decisions

| Topic | Options |
|-------|----------|
| **Email bodies** | Full text vs. summary-only vs. attachment-only |
| **Multi-tenant isolation audits** | Periodic job to verify chunk `mandateId` matches connection |
| **Cost caps** | Per-mandate embedding budget; defer large backfills |
| **Neutralization** | Mandatory for certain `sourceKind`s even when not file-upload |
| **Provenance shape** | First-class DB columns vs **documented `chunkMetadata` keys** for `connectionId`, external id, revision (must support **Part 2** purge rules). |
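The multi-tenant isolation audit listed above can be illustrated with a small check. This sketch assumes the `chunkMetadata` variant of the provenance decision, i.e. that each connector-sourced chunk carries a `connectionId` key; the chunk dictionary shape and the function name `findIsolationViolations` are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def findIsolationViolations(
    chunks: Iterable[Dict[str, Any]], connectionMandates: Dict[str, str]
) -> List[str]:
    """Return ids of chunks whose mandateId differs from their connection's mandate.

    Chunks that reference an unknown connection are reported as well, since
    they can no longer be tied to a mandate.
    """
    violations = []
    for chunk in chunks:
        connectionId = chunk["chunkMetadata"].get("connectionId")
        if connectionId is None:
            continue  # not connector-sourced (e.g. a plain upload); out of scope here
        if connectionMandates.get(connectionId) != chunk["mandateId"]:
            violations.append(chunk["id"])
    return violations
```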
---

## Links

- **How-to / orientation:** [Unified knowledge & RAG ingestion (guide)](../../d-guides/unified-knowledge-rag.md)
- **Gateway reference (retrieval + knowledge):** `wiki/b-reference/gateway/architecture.md`, `wiki/b-reference/gateway/ai-agent.md`
- **Implementation touchpoints (indicative):** `gateway/modules/serviceCenter/services/serviceKnowledge/mainServiceKnowledge.py`, `gateway/modules/routes/routeDataFiles.py`, `gateway/modules/features/commcoach/serviceCommcoachIndexer.py`, agent `coreTools` `_documentTools` / `_workspaceTools`.
## Acceptance criteria (plan level)

| # | Criterion | Priority |
|---|-----------|------|
| 1 | Every new **file** that should be searchable triggers ingestion **without** requiring an agent session. | must |
| 2 | **User connection** connect / disconnect has defined ingestion or purge behavior documented and implementable. | must |
| 3 | **Profile/mandate** snapshots use an explicit allowlist; secrets never enter the embedding pipeline. | must |
| 4 | Ingestion is **idempotent** for unchanged content (no duplicate embedding work). | should |
| 5 | **Part 3.3** matrix completed: every `modules/features/*` product row has **retrieval** (agent vs none), **corpus** (upload / tools / feature indexer), and **gap** explicitly stated; a row is not labeled “non-injecting” if **`AgentService`** already provides retrieval injection. | should |
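Criterion 4 (idempotent ingestion) is commonly met by hashing content before embedding. The following is a minimal sketch under that assumption; `IdempotentIngestor`, the `sourceKey` parameter, and the in-memory hash map are hypothetical, and a real implementation would persist the hash next to the chunks.

```python
import hashlib
from typing import Callable, Dict


class IdempotentIngestor:
    """Skips the embedding call when a source's content hash is unchanged."""

    def __init__(self, embed: Callable[[str, str], None]) -> None:
        self._embed = embed
        self._lastDigest: Dict[str, str] = {}  # sourceKey -> sha256 of last ingested text

    def ingest(self, sourceKey: str, text: str) -> bool:
        """Return True if embedding work was done, False if it was skipped."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self._lastDigest.get(sourceKey) == digest:
            return False  # unchanged content: no duplicate embedding work
        self._embed(sourceKey, text)
        # Record the digest only after embedding succeeded, so failures are retried.
        self._lastDigest[sourceKey] = digest
        return True
```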
---

## Test plan (concept verification)

| ID | Question | Method |
|----|-------|---------|
| T1 | Are all existing indexing entry points inventoried? | Code audit + table in the build phase |
| T2 | Is it clear which features use **retrieval** (agent), only **corpus**, or both? | Review the **Part 3.3** matrix against `runAgent` / `indexFile` call sites |
| T3 | Does **platform RAG retrieval** (`buildAgentContext` via `AgentService`) keep its role unchanged for workspace / graphical editor / CommCoach? | Review `agentLoop` + `mainServiceAgent._createBuildRagContextFn` |
| T4 | Can revoke/purge for connector chunks be specified today as a **metadata convention**, without a **`connectionId` column**? | Review **Part 2.1** + **Open decisions** (provenance) |
| T5 | Is revoke/purge per **user-connection authority / integration surface** (Part 2.2; only `msft` / `google` / `clickup` according to `routeDataConnections`) covered in a threat model? | Data flow: connection → `FileItem` / virtual doc → chunks |
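For T1, the code audit can be partly automated. The sketch below uses Python's standard `ast` module to list the lines on which a function named `indexFile` is called in one source string; the helper name `findCallSites` is hypothetical, and walking the repository and building the table are left out.

```python
import ast
from typing import List


def findCallSites(source: str, functionName: str = "indexFile") -> List[int]:
    """Return sorted line numbers where functionName is called.

    Matches both bare calls (indexFile(...)) and attribute calls
    (service.indexFile(...)); it does not resolve aliases or dynamic dispatch.
    """
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):
            name = func.id
        else:
            name = getattr(func, "attr", None)  # ast.Attribute has .attr; others do not
        if name == functionName:
            lines.append(node.lineno)
    return sorted(lines)
```

Because it matches on the name alone, the result is a starting inventory to review by hand, not proof that every entry point was found.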