wiki/c-work/4-done/2026-04-id-unified-knowledge-indexing-rag-concept.md

<!-- status: build -->
<!-- started: 2026-04-16 -->
<!-- lastReviewed: 2026-04-21 -->
<!-- component: gateway | platform | frontend-nyla -->

# Unified Knowledge Indexing — One RAG Corpus for All Platform Information

## How to read this document

| Section | Content |
|---------|---------|
| **Beschreibung und Kontext** | Scope (**ingestion vs retrieval**), **terminology** (feature / service / connector / interface), **as-is vs target**, business case and risks. |
| **Teil 1** | Ingestion as its **own lifecycle**: façade API, idempotency, orchestration—**not** owned by `agentLoop`. |
| **Teil 2** | **Triggers** (beyond upload): **user connections**, account snapshots, purge; **per-connection-type** indexing guidance; event-driven option. |
| **Teil 3** | **Feature injection** split into **retrieval** (agent + `buildAgentContext`) vs **corpus** (`indexFile`); **matrix** per `modules/features/*` product; real **gaps** vs false “non-injection”. |
| **Implementation phases · Ziele · AC · Testplan** | Rollout, explicit non-goals, acceptance criteria, verification. |

**Single sentence summary:** Keep **retrieval** on **`AgentService`**; unify **when and how** the shared **`interfaceDbKnowledge`** corpus is **filled** (routes, **user connections** / integrations, features, snapshots) behind one **ingestion contract**, without assuming every product uses the workspace agent.

## Beschreibung und Kontext

### Scope of this document

We distinguish **ingestion** (chunking, embedding, persisting into **`interfaceDbKnowledge`**) from **retrieval** (semantic search + `buildAgentContext` for the LLM). **Retrieval** for the **unified knowledge store** is consumed primarily through **`serviceAgent`** / `runAgent` (workspace, graphical editor, CommCoach). Other products (e.g. **chatbot**, **teamsbot**) may use **different** LLM stacks—**Teil 3** maps who gets platform RAG vs who does not. This plan does **not** mandate one global retrieval path for every feature; it **does** mandate a **single ingestion story** into the same corpus where that corpus is used. The gap we address is **how and when** the corpus is **filled**, not how every LLM entry point **reads** it.

### Terminology (Gateway — see `wiki/b-reference/gateway/architecture.md`)

This concept separates **feature modules**, **services**, **connectors**, and **interfaces**. Conflating them produces wrong ownership (e.g. treating “SharePoint” as a `modules/features/` product, or treating “mail” as if it were `serviceKnowledge`).

| Term | Where in Gateway | Role for indexing |
|------|------------------|-------------------|
| **Feature** | `modules/features/*` (e.g. `workspace`, `graphicalEditor`, `commcoach`, `trustee`, `chatbot`) | Product domains: UI, feature routers, orchestration. They **trigger** actions (upload, sync UX, feature-specific indexers) but must not be the **only** place that starts embedding work. |
| **Service** | `modules/serviceCenter/services/*` | Cross-cutting facades: **`serviceKnowledge`** (indexing, search, `buildAgentContext`); **`serviceExtraction`** (content objects); **`serviceChat`** (chat/workflow documents); **`serviceMessaging`** (e-mail, notifications); **`serviceAgent`** (tools that may *indirectly* call indexing). **Unified ingestion** is primarily a **service-layer** responsibility. |
| **Connector** | `modules/connectors/*` (Microsoft, Google, …) | Vendor adapters: OAuth, list, download. **SharePoint** and **mailbox** I/O live here; routes/features **call** connectors—they are not interchangeable with a feature or with `serviceKnowledge`. |
| **Interface** | `modules/interfaces/*` | Persistence contracts: **`interfaceDbKnowledge`** (`FileContentIndex`, `ContentChunk`, …), **`interfaceDbManagement`** (`FileItem`, `DataSource`), **`interfaceDbApp`** (User, Mandate, `UserConnections`, Preferences). Profile, mandate, and connection rows are **interface-backed**, not a single “profile feature”. |

### What we have today (as-is)

**1. A strong technical “write” implementation, but no product-wide ingestion contract**

- **`serviceKnowledge`** (`mainServiceKnowledge.py`) already implements the heavy lifting: **`indexFile`** resolves scope from **`FileItem`** (single source of truth), optional neutralization, sentence-aware chunking, **`serviceAi`** embeddings, **`ContentChunk`** persistence, and status on **`FileContentIndex`**. That is the right **unit of work** once **content objects** exist.
- **Retrieval** is also centralized in the same service: **`buildAgentContext`** composes multi-layer context for agents. So **read** and **write** to the vector store are **service-owned**; what is **not** unified is **who may call writes**, with which **idempotency**, and on which **lifecycle events**.

**2. Multiple invocation lanes, same underlying method**

Indexing is **operationally** reached from a **mix of layers** (not one façade):

- **HTTP routes** — e.g. file pipeline in **`routeDataFiles`**: pre-scan / extraction → **`indexFile`**. This is the “happy path” for uploads.
- **Agent tools** — **`serviceAgent`** core tools (e.g. document/workspace helpers) can call **`indexFile`** when the user interacts through the agent. That ties **embedding** to **an agent session** even when the same file could have been indexed on upload.
- **Feature-specific code** — e.g. **CommCoach** indexer paths that call **`indexFile`** for that product’s artifacts. Correct for the feature, but it is **another** ad hoc entry point with its own assumptions.
- **Connectors** — Microsoft/Google (and similar) packages can fetch bytes and ultimately produce files or blobs; **OAuth and delta sync** are not yet modeled everywhere as **first-class ingestion lifecycles** (connect → backfill → incremental → revoke) that all funnel through the same API and metadata.

There is **no single** `requestIngestion(...)`, **no standard job identity** for “this external item revision”, and **no one place** that records “this mandate revoked access → tombstone these chunks”.

**3. Extraction vs indexing: clear in code, not enforced at the platform edge**

- **`serviceExtraction`** (and preprocessing helpers) produce **content objects**; **`indexFile`** consumes them. The boundary is clean **inside** the pipeline, but **not every** new binary or external document **must** pass through a single orchestrated “extract then index” step—some paths may skip, duplicate, or call **`indexFile`** with partial metadata.

**4. Truth for scope and identity lives in interfaces—not in “features”**

- **`interfaceDbManagement`** (`FileItem`, …) and **`interfaceDbApp`** (mandate, `UserConnections`, user profile fields) define **who may see what**. **`indexFile`** already mirrors **`FileItem`** for scope; that pattern is **good** but **not generalized** to connector-backed items, virtual documents, or curated “account snapshot” chunks. If a connector writes under a different mental model of `mandateId` / `featureInstanceId`, **`interfaceDbKnowledge`** can drift from app/management truth without a systematic reconcile.

**5. User/mandate/profile deltas are not first-class ingestion events**

- Changes to membership, preferences, or connections update **`interfaceDbApp`** (and related tables). They affect **searchability and personalization** but are **not** consistently reflected as **versioned, allowlisted** chunks in the knowledge store—unless a feature manually adds text somewhere. That leaves agents either **under-informed** or dependent on **non-RAG** code paths for the same facts.

**Summary (as-is):** The **engine** for indexing is **`serviceKnowledge.indexFile`**; the **policy graph** for *when* to run it is **implicit** and **spread across** routes, tools, and features. **Connectors** and **account/mandate** data are **not** uniformly treated as **ingestion sources** with connect/sync/revoke semantics.

### What would make more sense (target)

**1. One ingestion façade at the service boundary (not inside `agentLoop`)**

- A small, stable API (conceptually **`requestIngestion` / `getIngestionStatus`**, implemented atop or beside **`KnowledgeService`**) that **every** lane calls: routes, feature hooks, **connector sync workers**, and (if needed) agent tools as **thin** delegates.
- **Idempotency** (content hash, external revision, `eTag`, …) enforced **here**, so routes and tools cannot accidentally **double-embed** the same logical object.

**2. Lifecycle parity for connectors and “connections”**

- **Establish** → register datasource + optional short **non-secret** summary chunk + enqueue **backfill**.
- **Delta** → incremental jobs with persisted cursors.
- **Revoke / token invalid / GDPR** → **tombstone or purge** by `connectionId` / `sourceKind`, aligned with RBAC—not ad hoc deletes scattered in UI code.

**3. Provenance / `sourceKind` (schema or `chunkMetadata`)**

- Today chunks are **file-anchored**; extended provenance (internal file vs SharePoint item vs mailbox artifact vs `profile_snapshot`, **`connectionId`** for purge, revision keys) should be **consistent**—either **first-class fields** on `ContentChunk` / index rows **or** a **defined convention** inside **`chunkMetadata` / `contextRef`** until a migration is justified. Goal: retrieval, audit, and **connector revoke** cleanup are **data-driven**, not inferred only from call site.

**4. Curated snapshots for interface-backed facts**

- **Allowlisted** projections of mandate membership, locale, entitlements (labels), etc., regenerated on **interface-level** events—**not** dumping full user rows or secrets into embeddings.

**5. Keep retrieval exactly where it is**

- **`buildAgentContext`** remains the agent’s way to **consume** the corpus; ingestion only ensures that corpus is **complete, scoped, and attributable** when the agent runs.

**6. Observability and cost in one place**

- Queue depth, embedding spend, failures, and “skipped duplicate” counts attach to the **ingestion façade**, not to each feature.

### Business goal

Whenever **meaningful information** appears—files, bytes from **connectors**, configuration that should shape answers, and **bounded** user/mandate context—the platform should **ingest it once** into a **unified, scoped** knowledge layer so agents see **one coherent corpus** with clear **provenance** and **permissions**.

### Why this matters now

Information deltas arrive through **routes**, **features**, **`serviceAgent`** tools, **connectors**, and **`interfaceDbApp`** / **`interfaceDbManagement`** updates. Without **one** ingestion contract and triggers per **source**, you get: **missing** indexes, **duplicate** work, **scope drift** between knowledge rows and app truth, and **repeated** engineering per entry path instead of **once** at the **service** layer.

### Risk if we do not unify

Fragmented memory, inconsistent agent answers, compliance gaps (over-indexing sensitive fields or under-indexing allowed context), and duplicated work **per route/feature/tool** instead of at a **single service boundary**.

---

## Teil 1 — Indexing as its own lifecycle (not owned by the agent)

### 1.1 Current useful core

*(Same technical point as **“What we have today” §1** above; repeated here for readers who start at Teil 1.)* After structured **content objects** exist, **`KnowledgeService.indexFile`** performs chunking, embedding (via **`AiService`**), neutralization when required, and persistence via **`interfaceDbKnowledge`**. The **gap** is not the lack of a service method but the lack of a **single product-wide contract** for *when* and *what* enters that pipeline.

### 1.2 Target responsibility split

| Concern | Owner | Notes |
|--------|--------|--------|
| **Ingestion** (normalize → chunk → embed → store) | **Knowledge ingestion service** (logical module; may remain `KnowledgeService` + new façade) | No dependency on `agentLoop`. |
| **Retrieval** (query → ranked context string) | **Agent** (and similar LLM entry points) | Unchanged by this concept. |
| **Orchestration** (queues, retries, backoff) | **Job runner / worker** (new or existing infra) | Keeps API latency low. |

### 1.3 Public ingestion contract (conceptual)

Introduce a small, stable API surface that **all features** call—never “only if an agent runs”:

- **`requestIngestion(job: IngestionJob) -> IngestionHandle`**
  - Idempotent key: `(sourceKind, sourceId, contentVersion | hash)`
  - Returns immediately with `queued` / `duplicate` / `skipped` and optional `jobId` for status polling.

- **`getIngestionStatus(handle)`**
  - Surfaces the same states already used on `FileContentIndex` (`pending`, `extracted`, `embedding`, `indexed`, `failed`) plus connection- or source-specific substates if needed.

The implementation can stay in-process at first (asyncio task queue) and move to Redis/Celery/ARQ later without changing callers.

### 1.4 Idempotency and versioning

- **Re-index** when content changes: compare **content hash** or **external revision** (SharePoint `eTag`, email `Message-ID` + folder cursor, file `updatedAt`).
- **Skip** when hash unchanged to control embedding cost.
- **Tombstone** or **scope-disable** when a source is deleted or access revoked (see Teil 2).

#### Implementation pitfalls (observed during P0 build, 2026-04-21)

The first end-to-end AC4 test on a 500-page PDF revealed **three** independent bugs that all had to be fixed before `ingestion.skipped.duplicate` could ever fire. Each is a **design rule** that every future ingestion lane must honor:

1. **Hash must derive only from content.** `_computeIngestionHash` initially hashed over `(contentObjectId, contentType, data)`, but `contentObjectId` came from `uuid.uuid4()` inside the extractors and was therefore a fresh value on every run. The hash was effectively a per-run nonce — the duplicate check could never match. **Rule:** hashes MUST be a pure function of payload (`contentType`, `data`, and extractor order); never of caller-supplied per-run identifiers. (Tests: `tests/unit/services/test_ingestion_hash_stability.py`.)
2. **Pre-upserts must preserve `_ingestion` metadata and the `indexed` status.** `routeDataFiles._autoIndexFile` persisted a fresh `FileContentIndex` from the pre-scan **before** calling `requestIngestion`, overwriting `structure._ingestion.hash` and `status="indexed"` from any prior successful run. The duplicate check saw a row with empty metadata and re-ran the whole embedding stage. **Rule:** any upsert on the idempotency row taken outside `requestIngestion` MUST read the existing row first and merge forward both `_ingestion` and (where applicable) the terminal `indexed` status.
3. **Extraction-pipeline defaults must preserve granularity for RAG.** `ExtractionOptions.mergeStrategy` defaulted to concatenating every text `ContentPart` into one blob, collapsing a 500-page PDF into a single chunk whose embedding is a blurred average of the whole document — unusable for targeted retrieval. **Rule:** every ingestion lane passes `mergeStrategy=None` explicitly until the default itself can be safely flipped after auditing non-RAG callers. (Tests: `tests/unit/services/test_extraction_merge_strategy.py`.)

**Deferred to P1** (uncovered during P0, not blocking AC1–AC5):

- **In-flight duplicate detection.** The current duplicate check only matches when `status == "indexed"`, so two nearly-simultaneous calls for the same `sourceId` both run full embedding. Fix candidates: accept `status ∈ {"extracted", "embedding", "indexed"}` with matching hash as "already in progress", or a per-`sourceId` `asyncio.Lock` in `KnowledgeService`.
- **Pre-extraction byte-hash shortcut.** `requestIngestion`'s duplicate check runs **after** extraction, so re-indexing a 1.6 MB PDF still spends ~15 s in `runExtraction` before the content hash is computed. The file-bytes SHA already exists in `interfaceDbManagement` for upload-dedup — a short-circuit in `_autoIndexFile` (and symmetric paths) could skip extraction entirely for an unchanged file.

---

## Teil 2 — Triggers: not only “file write”, but every information delta

“Write path” is too narrow if we read it as “HTTP upload only”. The unified model should treat **any authoritative addition or change of platform-visible information** as a potential ingestion trigger.

### 2.1 Trigger taxonomy

| Trigger category | Examples | Ingestion behavior (conceptual) |
|------------------|----------|----------------------------------|
| **Artifact persisted** | User uploads PDF; paste text saved as file; export from a feature | Existing pipeline: extract → `indexFile` (or equivalent). |
| **User connection added / re-authorized** | SharePoint OAuth success; Microsoft/Google mail connection; new API credential with data scope | **Register datasource** + enqueue **initial sync** (backfill) + index a **short connection summary document** (site name, root path, principal, *no secrets*). |
| **Sync for an existing connection** | Scheduled delta; webhook (if available); manual “refresh” | Incremental fetch → map to content objects or rows → **upsert** chunks keyed by external id. |
| **Connection revoked / token invalid** | User disconnects; admin removes mandate integration | **Tombstone** or **purge** chunks keyed by **connection / external source** (today: enforce via **`chunkMetadata` / `contextRef`** convention or future columns); ensure retrieval never serves stale data from that connection. |
| **Mandate / membership** | User added to mandate; role change; feature instance attached | Regenerate **mandate-safe summary** documents (see Section 2.3) if policy allows; **re-resolve scope** for existing chunks (may be heavy—often better to store immutable `mandateId` on chunks at write time and rely on retrieval filters). |
| **User profile (bounded)** | Display name, locale, timezone, **non-sensitive** preferences | Optional **UserContextDocument** for personalization—not a dump of the whole `User` row. |
| **Feature configuration** | Instance labels, data source labels, automation descriptions | If they should influence answers, emit structured **FeatureConfigSnapshot** chunks (small, text-first). |
| **Artifact deleted / data subject erasure** | User deletes a stored file; mandate/user erase | Purge or tombstone the corresponding **`FileContentIndex` / `ContentChunk`** rows (by `fileId`); erasure jobs cascade by **`userId`** / mandate policy. **Connection-wide** revoke remains the **connection** row above. |

### 2.2 User connections (added by the user) as first-class ingestion sources — lifecycle and **what to index per connection type**

**Conceptual focus:** The trigger is OAuth success, saved credential, or linked account in **`UserConnection`** that grants access to an external system. **Implementation** still flows through provider code under `gateway/modules/connectors/` (e.g. **`providerMsft`**, **`providerGoogle`**, **`providerClickup`**); that mapping is **technical**, not the product wording.

**Scope — what counts as a user connection here:** `gateway/modules/routes/routeDataConnections.py` only allows **creating** connections with `type` **`msft`**, **`google`**, or **`clickup`** (`create_connection` → OAuth via `connect_service`). The **authorities options** endpoint also lists **`local`**, but that path is **not** wired in `create_connection`. **This subsection only covers those user-connection authorities** (plus the surfaces each OAuth integration can reach, e.g. Graph mail for Microsoft). Other Gateway connector packages (FTP, Jira, preprocessor, outbound-only mail, geo APIs, …) are **out of scope** in §2.2 until they are exposed the same way as **`UserConnection`** rows.

**Cross-cutting rules (every user-added connection):**

- **Never index:** OAuth tokens, refresh payloads, raw credentials, webhook signing secrets.
- **Always safe to index (metadata only):** human-readable **connection** label, tenant/site name, root path / mailbox address **as display string**, last sync cursor (store in DB, not necessarily as embedding), **external id** + **revision** for idempotency.
- **Prefer file pipeline for binaries:** download → store as `FileItem` (or equivalent) → reuse existing **extract → `indexFile`** path so neutralization and scope mirror upload behavior.
- **Prefer virtual documents** for small text-native items (mail headers/snippets, issue titles/descriptions) to avoid N binary copies.
- **Quota:** per-mandate max documents, max bytes, and “index only last N days” for mail are **product** knobs, not defaults baked into each adapter.

**Lifecycle pattern (target) — tied to the connection row, not to “a connector class”:**

1. **Connection event** (`ConnectionEstablished`) fires when the user **adds** or **re-authorizes** a connection (OAuth / credential storage, **`UserConnection`**, authority **`msft`**, **`google`**, or **`clickup`** per current API).
2. **Ingestion registry** records: `{ connectionId, featureCode, mandateId, userId, scope, externalRoot, adapterKind }` (adapter kind = which integration backs this connection).
3. **Sync planner** enqueues jobs **for that connection**:
   - **Bootstrap:** list roots, respect quotas, prioritize recently modified.
   - **Delta:** cursor per drive/site/folder/mailbox/label; persist cursor in DB.
4. **Normalizer** maps each external item to either:
   - **File-like** → persist bytes + run extraction + **`indexFile`**, or
   - **Virtual document** → build `contentObjects` in memory + **`indexFile`** with a synthetic `fileId` / stable external key.

---

#### When the user connects **Microsoft** (Graph — SharePoint, OneDrive, Outlook, Teams) — `providerMsft`

| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
|--------|----------------------------------|------------------------------|-----------|
| **SharePoint** (`SharepointAdapter`) | Document libraries: **PDF, Office, text, markdown**; list **metadata** (library name, path, item name) as `contextRef`. | Huge video blobs, raw executables, duplicates already indexed via another path. | Use **driveItem id + eTag** for revision. Respect **library/folder allowlist** on this **connection**. |
| **OneDrive** (`OneDriveAdapter`) | Same as SharePoint for **personal files** reachable through the user’s connection. | System/temp folders if exposed. | Scope = **personal** unless shared into mandate explicitly. |
| **Outlook** (`OutlookAdapter`) | **Mailbox:** subjects, **from/to/cc**, **received date**, **body** (plain or stripped HTML) per policy; **calendar** titles/locations/descriptions if product enables. | Full MIME raw, embedded images as separate media unless needed; **entire mailbox** without date window in v1. | Strong **retention + PII** policy: optional “headers + snippet only”; strip signatures/quoted threads; **attachments** → child **file-like** jobs (virus/size limits). |
| **Teams** (`TeamsAdapter`) | **Channel messages** (text), **meeting chat** exports if API allows; **files shared in channel** as file-like. | Message reactions, per-user read receipts; continuous full channel history without bounds. | Often **high volume** — default to **recent window** or **keyword/subscription** driven sync. |

---

#### When the user connects **Google** (Drive, Gmail) — `providerGoogle`

| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
|--------|----------------------------------|------------------------------|-----------|
| **Drive** (`DriveAdapter`) | Native Google files after **export** to Office/PDF (existing export MIME map); standard uploaded files **download → extract**. | Trashed items; shared drives the user did not authorize. | Use **file id + modifiedTime**; Google Docs need **export** before text extraction. |
| **Gmail** (`GmailAdapter`) | **Threads:** subject, participants, internalDate, **snippet** or **body** per policy; **attachments** as separate ingest jobs. | Entire “All Mail” unbounded; **labels** that are purely system. | Same mail cautions as Outlook; **Message-ID** + **History-ID**/cursor for delta. |

---

#### When the user connects **ClickUp** — `providerClickup` (`AuthAuthority.CLICKUP`)

| **Connection surface** (implementation) | **Should be indexed (typical)** | **Usually skip or optional** | **Notes** |
|--------|----------------------------------|------------------------------|-----------|
| **ClickUp** (`providerClickup`) | Task **name**, **description**, **comments**; **attachment** content if downloaded. | Activity stream noise, every status micro-change unless text changed. | Rate limits → prioritize **recently updated** tasks. |

---

**Email and messaging (Outlook + Gmail via Microsoft / Google user connections) — shared cautions**

- Default tiers: **metadata only** → **snippet** → **full body** → **attachments** (most expensive / sensitive).
- Apply **quoted-thread stripping**, **signature removal**, and **max body length** before embed.
- **Legal hold / retention:** ingestion must respect mandate **delete** and **export** rules; **disconnecting** or **revoking** the mail **connection** must **purge** mail-sourced chunks.

### 2.3 “Account and stuff” — what to index vs. what never to index

**Goal:** Give agents **useful, permission-safe** context (“who is this user in this mandate”, “which features are on”, “preferred language”) without creating a **second copy of sensitive credentials** in the vector store.

| Data | Typical treatment |
|------|-------------------|
| Passwords, refresh tokens, API secrets | **Never** index; never pass through embedding pipeline. |
| Email, phone, government IDs | **Default deny**; only if product explicitly enables “index PII” with neutralization and mandate policy. |
| Display name, locale, feature entitlements (labels) | **Allow** as a small structured **UserMandateSnapshot** document regenerated on change. |
| Full `User` or `Mandate` DB row | **Avoid**; generate **curated** JSON/text snapshots with field allowlists. |

Snapshots should be stored with the same **scope model** as file chunks (`personal`, `featureInstance`, `mandate`, `global`) so `semanticSearch` filters stay consistent.

### 2.4 Event-driven vs. direct calls

**Minimum viable:** each feature calls `requestIngestion` at the end of its own transaction (direct call).

**Scalable target:** emit **domain events** (`FileCommitted`, `UserConnectionReady` / provider-specific ready event, `ProfileUpdated`) and a single **KnowledgeIngestionConsumer** subscribes. Benefits: one place for metrics, retries, and rate limits; features stay thin.

**Storage (already implemented — not redesigned here):** The platform already uses **one** knowledge persistence stack: **`FileContentIndex`** (incl. `mandateId`, `scope`, status) and **`ContentChunk`** (pgvector embeddings, `fileId`, `userId`, `featureInstanceId`, `contextRef`, optional **`chunkMetadata`**), accessed via **`interfaceDbKnowledge`**. Chunks are **file-anchored** today; **connection- / source-specific** provenance (e.g. `connectionId`, external ids) can ride in **`contextRef` / `chunkMetadata`** until optional schema extensions are justified. **This document targets ingestion triggers and lifecycles**, not a second corpus or a duplicate storage model.

---

## Teil 3 — Feature injection: retrieval vs corpus, agent loop, and real gaps

“Injection” is ambiguous. This section uses **two** precise meanings:

| Kind | What happens | Primary implementation today |
|------|----------------|------------------------------|
| **Retrieval injection** | Relevant **existing** chunks and workflow context are **assembled** and **inserted into the LLM prompt** (system message) each agent round. | **`AgentService.runAgent`** → `buildRagContextFn` → **`KnowledgeService.buildAgentContext`** → **`ConversationManager.injectRagContext`**. CommCoach wraps the same **`buildAgentContext`** and adds coaching-specific context. |
| **Corpus injection (indexing)** | **New** text/binary is **chunked and embedded** and written to **`interfaceDbKnowledge`** so it can be retrieved later. | **`KnowledgeService.indexFile`**; callers include **`routeDataFiles._autoIndexFile`**, **`serviceAgent`** tools (**`_documentTools`**, **`_workspaceTools`**), and **CommCoach** **`serviceCommcoachIndexer`**. |

A feature can **already participate fully in retrieval injection** by using **`AgentService`** without ever calling **`indexFile`** in its own folder. **Corpus** growth can still happen **indirectly** (upload pipeline, agent tools). Planning must **not** label such features as “non-injecting.”

### 3.1 Features that already use **`AgentService.runAgent`** (retrieval injection is on by default)

These **`modules/features/*`** entry points resolve **`getService("agent", ctx)`** and stream **`agentService.runAgent(...)`** (code audit):

- **`workspace`** (`routeFeatureWorkspace.py`)
- **`graphicalEditor`** (`routeFeatureGraphicalEditor.py`)
- **`commcoach`** (`serviceCommcoach.py` — custom **`buildRagContextFn`**, still uses platform **`buildAgentContext`** inside)

For all three, **every agent round** gets **retrieval injection** unless RAG fails or returns empty. **Corpus** updates for the same sessions still depend on **separate** mechanisms:

| Corpus path | When it runs |
|-------------|----------------|
| **Upload / `FileItem`** | **`routeDataFiles`** **`_autoIndexFile`** after storage (feature-agnostic). |
| **Agent tools** | If the model invokes tools in **`_documentTools`** / **`_workspaceTools`** that call **`indexFile`**, **corpus** changes **during** that agent run—implemented in **`serviceAgent`**, not in the feature’s route file. |

So **workspace** and **graphicalEditor** **do** “inject” in the **retrieval** sense today; they **can** “inject” in the **corpus** sense when users **upload** files or when the **agent** runs indexing-capable tools. What they **often lack** is **feature-owned, automatic corpus** logic (e.g. “on every graph publish, index a snapshot”) without an upload or tool call.

### 3.2 Features that do **not** use **`AgentService`** (no platform RAG prompt injection from this stack)

These domains **do not** call **`runAgent`** in their **`modules/features/*`** trees (audit). They therefore **do not** receive **`buildAgentContext`** through the **workspace agent** loop:

| Feature | Notes |
|---------|--------|
| **chatbot** | Uses an **internal** LangGraph-style flow (SQL / Tavily / answer nodes). **No** `getService("knowledge")` / **`buildAgentContext`** usage under **`modules/features/chatbot/`** in the audited tree—**retrieval injection** and the **unified corpus** are **not** wired the same way as the workspace agent. |
| **trustee** | Domain CRUD and quick actions (e.g. **`agentPrompt`** is a **UI hint** to open the workspace with a prefilled prompt—not **`AgentService` inside trustee**). Corpus: **only** via shared **upload** or if the user later uses **workspace agent** with tools. |
| **realEstate** | No **`AgentService`** hook in feature tree; same **upload** story for files. |
| **teamsbot** | Uses **`serviceAi`** (and related) for the meeting pipeline; **`sessionContext`** is **ephemeral** prompt text. **No** **`AgentService`** / **`buildAgentContext`** in the same pattern as workspace. |
| **neutralization** | **Service/pipeline** used **inside** **`indexFile`** when **`FileItem.neutralize`** applies—not a feature that “injects” either kind by itself. |

### 3.3 Summary matrix (per `modules/features/` domain)

*Matrix verified by audit on 2026-04-21 (P0):* Under `gateway/modules/features/`, only `workspace`, `graphicalEditor`, and `commcoach` resolve `getService("agent")` / `getService("knowledge")` or call `runAgent`; only `commcoach/serviceCommcoachIndexer.py` and `commcoach/serviceCommcoach.py` touch `indexFile` / `buildAgentContext` inside the feature tree. All other domains (`chatbot`, `trustee`, `realEstate`, `teamsbot`, `neutralization`) match the "No" rows below.

| Feature | **`AgentService.runAgent`** | **Retrieval injection** (platform RAG prompt) | **Corpus injection** (typical today) | **Likely gap** (this document) |
|---------|----------------------------|-----------------------------------------------|-------------------------------------|--------------------------------|
| **workspace** | Yes | Yes | Upload **`_autoIndexFile`**; optional **`indexFile`** via agent **tools** | **Automatic** corpus for artifacts that never become **`FileItem`** or tool outputs (exports, structured summaries). |
| **graphicalEditor** | Yes | Yes | Same as workspace | **Published graph / metadata** as searchable corpus without manual upload. |
| **commcoach** | Yes | Yes (+ custom RAG layer) | Session **`indexFile`** (**`serviceCommcoachIndexer`**) + upload/tools | Extend only if new artifact types need the same **feature-local** indexer pattern. |
| **chatbot** | No | **No** (unified store) | No feature-local **`indexFile`** | Decide if chatbot should call **`buildAgentContext`** / **`indexFile`** or stay on SQL/Tavily; **FAQ / grounding** text may need **corpus** hooks. |
| **trustee** | No | Only if user works in **workspace** | Upload path; agent tools only in workspace | **Trustee-native** persist events → ingestion when files are not the only representation. |
| **realEstate** | No | Only via workspace | Upload path | Same as trustee for case/property narratives. |
| **teamsbot** | No | No | None from unified store by default | Persisted **transcripts / notes** → **`indexFile`** if they should be mandate-searchable. |
| **neutralization** | N/A | N/A | Preconditions for **`indexFile`** | Ensure all **new** ingest paths honor **`FileItem.neutralize`**. |

### 3.4 Shared corpus mechanisms (not feature-local, but serve agent features)

| Mechanism | Role |
|-----------|------|
| **`routeDataFiles` + `_autoIndexFile`** | Indexes **uploaded** `FileItem`s for **any** UI that uses the upload API—including workspace. |
| **`serviceAgent`** **`_documentTools`** / **`_workspaceTools`** | **Corpus** writes when the **model** chooses tools; available to **workspace** and **graphicalEditor** agent sessions (and **CommCoach** when those tools are in the toolset). |
| **CommCoach** **`serviceCommcoachIndexer`** | **Feature-local** corpus: coaching session text → **`indexFile`** without requiring an upload. |

### 3.5 Where **additional feature-native corpus injection** is still needed

Use this checklist **only** after accounting for §3.1–3.4:

1. **Content is authoritative** in the feature DB or blob store **without** a guaranteed **`FileItem`** + **`_autoIndexFile`** path.
2. **Retrieval injection alone** is insufficient because nothing ever **wrote** chunks (e.g. chatbot never hits **`indexFile`**).
3. **Relying on the agent to call tools** is too fragile for compliance or UX (“user must remember to index”).

Then add **`requestIngestion` / `indexFile`** at the **feature commit point** (or emit a domain event), with **`contextRef` / `chunkMetadata`** for **`feature_code`**, business ids, and **no secrets**.

### 3.6 Implementation pattern (feature-native corpus only)

1. **Commit point** — authoritative write in the feature or shared storage.
2. **Scope** — align with **`FileItem`** / **`ServiceCenterContext`** rules already used in **`indexFile`**.
3. **Unified façade** — one ingestion API; avoid a second embedding pipeline.
4. **Purge** — tie to **`fileId`**, business key, or future connector purge keys on revoke/delete.

### 3.7 Phasing

- **P0:** For **each** row in §3.3, confirm **retrieval** vs **corpus** paths; document “satisfied by agent+upload+tools” vs “needs feature hook.”
- **P1:** Implement **feature-native corpus** for one domain with a clear §3.5 gap (e.g. **trustee** entity text, **teamsbot** persisted transcript).
- **P2:** **Chatbot** architecture decision: integrate **`serviceKnowledge`** or keep parallel retrieval; if integrate, add explicit **corpus** rules for config/FAQ.

---

## Implementation phases (suggested)

Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog), and **Teil 3.7** (feature matrix and feature-native corpus pilots). **P0** overlaps **Teil 3.7 P0** (complete the per-feature matrix before large builds).

| Phase | Outcome |
|-------|---------|
| **P0 — Façade + idempotency** *(done, 2026-04-21)* | Single `requestIngestion` / `getIngestionStatus` entry point on `KnowledgeService` with content-hash idempotency, provenance in `structure._ingestion`, and structured logging (`ingestion.queued` / `ingestion.indexed` / `ingestion.skipped.duplicate` / `ingestion.failed`). All prior `indexFile` call sites now route through the façade: `routeDataFiles._autoIndexFile`, `commcoach/serviceCommcoachIndexer.indexSessionData`, `serviceAgent/coreTools/_workspaceTools.readFile`, `serviceAgent/coreTools/_documentTools.describeImage`. Agent tools no longer carry on-demand extraction + ingestion fallbacks — they are pure consumers of the knowledge store. **Teil 3.3** matrix audited. Three implementation bugs fixed during verification: stable content hash, pre-upsert `_ingestion` preservation, `mergeStrategy=None` for per-page granularity (see **§1.4 Implementation pitfalls**). |
| **P1 — User-connection hooks** *(done, 2026-04-21)* | `connection.established` / `connection.revoked` callbacks emitted from every OAuth callback (`routeSecurityMsft`, `routeSecurityGoogle`, `routeSecurityClickup`) and from `routeDataConnections.disconnect_service` / `delete_connection`; the `ConnectionStatus.INACTIVE` enum bug (the value did not exist) was fixed by switching the disconnect path to `ConnectionStatus.REVOKED`. A new central `KnowledgeIngestionConsumer` (`subConnectorIngestConsumer.py`, registered in `app.py` lifespan) maps `established` to a `connection.bootstrap` BackgroundJob and `revoked` to a synchronous purge through `KnowledgeService.purgeConnection` → `interfaceDbKnowledge.deleteFileContentIndexByConnectionId`. `FileContentIndex` gained `connectionId` and `sourceKind` columns (auto-applied by `connectorDbPostgre`); `IngestionJob` carries both end-to-end so every chunk is purgeable by connection. **All three OAuth authorities are wired up** with one bootstrap module per service: `subConnectorSyncSharepoint.py` (`sourceKind="sharepoint_item"`, `eTag` as `contentVersion`, walks sites with the `@odata.nextLink` paginated `SharepointAdapter.browse`), `subConnectorSyncOutlook.py` (virtual `outlook_message` documents — header / snippet / cleaned body via the shared `cleanEmailBody` utility — with `changeKey` revisions and optional `outlook_attachment` child jobs), `subConnectorSyncGdrive.py` (`gdrive_item`, `modifiedTime` revisions, recursive walk from My Drive root with depth/age caps and Google-Doc export support inherited from `DriveAdapter.download`), `subConnectorSyncGmail.py` (virtual `gmail_message` documents with `historyId` revisions, walks `INBOX + SENT` by default, MIME-tree body extraction prefers `text/plain` and falls back to `text/html`, optional `gmail_attachment` child jobs), `subConnectorSyncClickup.py` (virtual `clickup_task` documents with `date_updated` revisions, walks teams → spaces → folder/folderless lists → tasks with workspace and per-workspace list caps, header carries name/status/list/space/assignees/tags/url so search prompts retrieve task context without a live API call). The dispatcher `_bootstrapJobHandler` fans out per authority (msft → sharepoint+outlook in parallel, google → drive+gmail in parallel, clickup → tasks); unsupported authorities log `ingestion.connection.bootstrap.skipped reason=unsupported_authority`. Structured-log schema (started / progress / done / purged) defined in **§ Structured ingestion logs** below. Eight new unit tests (purge, consumer dispatch + per-authority routing, `cleanEmailBody`, bootstrapSharepoint, bootstrapOutlook, bootstrapGmail, bootstrapGdrive, bootstrapClickup) lock the contract. **Retrieval threshold calibration (2026-04-21):** during UI verification `buildAgentContext` returned `instanceChunks=0` despite 640 correctly-indexed rows — root cause was overly aggressive `minScore` thresholds (Layer 1 `0.65`, Layer 1.5 `0.55`, Layer 3 `0.70`) versus realistic `text-embedding-3-small` cosine similarities in the `0.30`–`0.55` range. All three thresholds lowered to `0.35`; agent then correctly synthesized answers from indexed Outlook/SharePoint content without resorting to live tools. |
| **P2 — Profile & mandate snapshots** | Allowlisted fields only (**Teil 2.3**); regenerate on events; explicit admin toggle per mandate if needed. |
| **P3 — Event bus** | Move direct calls to async consumer where load requires it (**Teil 2.4** scalable target). |

---

## Ziel und Nicht-Ziele

**Ziel:**

- One **ingestion contract** for all features and connector lifecycles.
- Indexing **decoupled** from the agent loop (agents may still *invoke* tools that ultimately call ingestion, but ingestion must not *depend* on an agent run).
- **Explicit** handling of connection establishment, sync, and revocation.
- **Bounded** indexing of user/mandate context with a clear PII policy.

**Explizit NICHT:**

- Moving **retrieval** (`buildAgentContext`) out of agents.
- Guaranteeing **real-time** indexing for every byte without async jobs (latency targets are product decisions).
- Indexing **everything** in the database “because we can”—only curated, policy-approved surfaces.

---

## Betroffene Module (erwartet)

- **Gateway:** `serviceKnowledge`, file upload routes, connector OAuth handlers, sync workers, possibly new `serviceKnowledgeIngest` or package under `modules/serviceCenter/services/`.
- **Interfaces:** `interfaceDbKnowledge` extensions for source metadata if needed.
- **Wiki / Reference:** `b-reference/gateway/ai-agent.md` (ingestion vs. retrieval) after implementation.

---

## Offene Entscheidungen

| Thema | Optionen |
|-------|----------|
| **Email bodies** | Full text vs. summary-only vs. attachment-only |
| **Multi-tenant isolation audits** | Periodic job to verify chunk `mandateId` matches connection |
| **Cost caps** | Per-mandate embedding budget; defer large backfills |
| **Neutralization** | Mandatory for certain `sourceKind`s even when not file-upload |
| **Provenance shape** | First-class DB columns vs **documented `chunkMetadata` keys** for `connectionId`, external id, revision (must support **Teil 2** purge rules). |
| **In-flight duplicate handling** | Accept `status ∈ {"extracted","embedding","indexed"}` with matching hash as in-progress (cheap, lossy under failure) **vs** per-`sourceId` `asyncio.Lock` in `KnowledgeService` (strict, requires singleton) — see **§1.4 Deferred to P1**. |
| **Pre-extraction dedup shortcut** | Short-circuit `_autoIndexFile` via the file-bytes SHA in `interfaceDbManagement` before running `runExtraction` (~15 s saved per re-index of a large PDF) — see **§1.4 Deferred to P1**. |

---

## Structured ingestion logs (P1 schema)

The connection-lifecycle lane emits the following structured log events. Each event is a single `logger.info` / `.warning` / `.error` call with a stable `extra={"event": ...}` field so downstream log shippers can route on `event` without parsing the message string.

| `event` | Severity | Emitter | Required `extra` keys | Meaning |
|---------|----------|---------|------------------------|---------|
| `ingestion.connection.bootstrap.queued` | info | `KnowledgeIngestionConsumer._onConnectionEstablished` | `connectionId`, `authority` | A `connection.established` callback was received and a `connection.bootstrap` BackgroundJob is being enqueued. |
| `ingestion.connection.bootstrap.started` | info | `bootstrap{Sharepoint,Outlook,Gdrive,Gmail,Clickup}` | `connectionId`, `part` (`sharepoint` \| `outlook` \| `gdrive` \| `gmail` \| `clickup`) | The per-part bootstrap walker has begun work. |
| `ingestion.connection.bootstrap.progress` | info | bootstrap walkers | `connectionId`, `part`, `processed`, `skippedDup`, `failed` | Heart-beat every ~50 items so long-running runs are observable. |
| `ingestion.connection.bootstrap.done` | info | bootstrap walkers + façade-level totals | `connectionId`, `part`, `indexed`, `skippedDup`, `skippedPolicy`, `failed`, `durationMs` (Outlook/Gmail add `attachmentsIndexed`; SharePoint/Drive add `bytes`; ClickUp adds `workspaces` + `lists`) | Walker finished cleanly. |
| `ingestion.connection.bootstrap.failed` | error | `_bootstrapJobHandler` | `part`, `connectionId`, `error` | One bootstrap part raised — recorded but the other parts still complete. |
| `ingestion.connection.bootstrap.skipped` | info | `_bootstrapJobHandler` | `connectionId`, `authority`, `reason` (`unsupported_authority`) | Authority has no bootstrap module registered (e.g. a future provider). |
| `ingestion.connection.purged` | info | `_onConnectionRevoked` | `connectionId`, `authority`, `reason`, `indexRows`, `chunks` | Knowledge purge for a revoked connection completed; numbers reflect the deleted rows. |
| `ingestion.connection.purged.failed` | error | `_onConnectionRevoked` | `connectionId`, `error` | Purge raised; the revoke event was still acknowledged upstream. |

All events should keep field naming consistent with the existing `ingestion.queued / .indexed / .skipped.duplicate / .failed` family from P0 (camelCase, `connectionId`, `mandateId`, `userId`). Counters are integers, durations are in milliseconds.

## Links

- **How-to / orientation:** [Unified knowledge & RAG ingestion (guide)](../../d-guides/unified-knowledge-rag.md)
- **Gateway reference (retrieval + knowledge):** `wiki/b-reference/gateway/architecture.md`, `wiki/b-reference/gateway/ai-agent.md`
- **Implementation touchpoints (indicative):** `gateway/modules/serviceCenter/services/serviceKnowledge/mainServiceKnowledge.py`, `gateway/modules/routes/routeDataFiles.py`, `gateway/modules/features/commcoach/serviceCommcoachIndexer.py`, agent `coreTools` `_documentTools` / `_workspaceTools`, `gateway/modules/datamodels/datamodelExtraction.py` (`ExtractionOptions.mergeStrategy: Optional[MergeStrategy]`).
- **Unit tests (P0 guardrails):** `gateway/tests/unit/services/test_ingestion_hash_stability.py`, `gateway/tests/unit/services/test_extraction_merge_strategy.py`.
- **Unit tests (P1 guardrails):** `gateway/tests/unit/services/test_connection_purge.py`, `gateway/tests/unit/services/test_knowledge_ingest_consumer.py`, `gateway/tests/unit/services/test_clean_email_body.py`, `gateway/tests/unit/services/test_bootstrap_sharepoint.py`, `gateway/tests/unit/services/test_bootstrap_outlook.py`, `gateway/tests/unit/services/test_bootstrap_gmail.py`, `gateway/tests/unit/services/test_bootstrap_gdrive.py`, `gateway/tests/unit/services/test_bootstrap_clickup.py`.
- **P1 implementation touchpoints:** `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorIngestConsumer.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncSharepoint.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncOutlook.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGdrive.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGmail.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncClickup.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subTextClean.py`, `gateway/modules/interfaces/interfaceDbKnowledge.py` (`deleteFileContentIndexByConnectionId`), `gateway/modules/datamodels/datamodelKnowledge.py` (`FileContentIndex.connectionId` + `sourceKind`), `gateway/modules/connectors/providerMsft/connectorMsft.py` (`@odata.nextLink`-loop in `SharepointAdapter.browse`, `eTag` in `_graphItemToExternalEntry`), `gateway/modules/routes/routeSecurityMsft.py` / `routeSecurityGoogle.py` / `routeSecurityClickup.py` / `routeDataConnections.py` (callback emission + `ConnectionStatus.REVOKED` fix), `gateway/app.py` (consumer registration in lifespan).

## Akzeptanzkriterien (Plan-Ebene)

| # | Kriterium | Prio |
|---|-----------|------|
| 1 | Every new **file** that should be searchable triggers ingestion **without** requiring an agent session. | must |
| 2 | **User connection** connect / disconnect has defined ingestion or purge behavior documented and implementable. | must |
| 3 | **Profile/mandate** snapshots use an explicit allowlist; secrets never enter the embedding pipeline. | must |
| 4 | Ingestion is **idempotent** for unchanged content (no duplicate embedding work). Verified 2026-04-21 on a 500-page PDF: second re-index trigger logs `ingestion.skipped.duplicate` with a stable hash, zero embedding API calls. See **§1.4 pitfalls** for the three bug classes that had to be fixed first. | must |
| 5 | **Teil 3.3** matrix completed: every `modules/features/*` product row has **retrieval** (agent vs none), **corpus** (upload / tools / feature indexer), and **gap** explicitly stated—not “non-injecting” if **`AgentService`** already provides retrieval injection. | should |

---

## Testplan (Konzept-Verifikation)

| ID | Frage | Methode |
|----|-------|---------|
| T1 | Sind alle bestehenden Index-Entry-Points inventarisiert? | Code-Audit + Tabelle in Build-Phase |
| T2 | Ist klar welche Features **Retrieval** (Agent) vs nur **Corpus** vs beides nutzen? | Review **Teil 3.3** Matrix gegen `runAgent` / `indexFile` Call-Sites |
| T3 | Bleibt **plattform-RAG-Retrieval** (`buildAgentContext` über `AgentService`) unveraendert in seiner Rolle fuer Workspace/Grafikeditor/CommCoach? | Review `agentLoop` + `mainServiceAgent._createBuildRagContextFn` |
| T4 | Ist Revoke/Purge fuer Connector-Chunks ohne **connectionId-Spalte** heute als **Metadata-Konvention** spezifizierbar? | Review **Teil 2.1** + **Offene Entscheidungen** Provenance |
| T5 | Ist Revoke/Purge pro **User-Connection-Authority / Integrationsoberfläche** (Teil 2.2, nur `msft` / `google` / `clickup` laut `routeDataConnections`) in einem Threat-Model abgedeckt? | Datenfluss Connection → `FileItem` / virtuelles Doc → Chunks |
| T6 | Ist der Content-Hash stabil zwischen zwei Extraktions-Runs desselben Files (verschiedene `contentObjectId`-UUIDs, identisches Payload)? | Unit: `tests/unit/services/test_ingestion_hash_stability.py` (5 Cases: UUID-Regen, Daten-Delta, Order-Delta, Type-Delta, Empty). Live: zweiter Trigger auf bereits indexiertes File loggt `ingestion.skipped.duplicate` mit identischem Hash (verifiziert 2026-04-21). |
| T7 | Bleiben bei Multi-Page-PDFs die Per-Page-Chunks erhalten (keine `MergeStrategy`-Konkatenation)? | Unit: `tests/unit/services/test_extraction_merge_strategy.py`. Live: 500-Seiten-PDF → 563 ContentObjects, 567 Embedding-Chunks in 24 Batches (verifiziert 2026-04-21). |
| T8 | Überleben `_ingestion.hash` und `status="indexed"` einen Pre-Scan-Re-Upsert in `_autoIndexFile`? | Review `routeDataFiles._autoIndexFile` Zeile ~127: existing row wird vor upsert gelesen und `_ingestion` + `indexed` in frischen `contentIndex` gemerged. Live: zweiter Trigger → `ingestion.skipped.duplicate` statt Re-Embedding. |
| T9 | Räumt ein `connection.revoked` Event **alle** `FileContentIndex`-Rows + `ContentChunk`s einer Connection und **nichts anderes** auf (Uploads ohne `connectionId`, andere Connections bleiben intakt)? | Unit: `tests/unit/services/test_connection_purge.py` (3 Cases: positive purge, leerer connectionId-Noop, unbekannter connectionId). |
| T10 | Dispatcht der `KnowledgeIngestionConsumer` `connection.established` korrekt als asynchroner `connection.bootstrap` Job (msft → SharePoint + Outlook parallel; google → Drive + Gmail parallel; clickup → Tasks; unbekannte Authorities `skipped.reason="unsupported_authority"`) und `connection.revoked` synchron als Purge? | Unit: `tests/unit/services/test_knowledge_ingest_consumer.py` (8 Cases: established enqueue, missing-id ignore, revoked purge, missing-id ignore, skip-unsupported, msft fan-out, google fan-out, clickup dispatch). |
| T11 | Reduziert `cleanEmailBody` ein realistisches Outlook-HTML auf den eigenen Body-Anteil (HTML strip, Quote-Strip EN+DE, Signature-Strip, Whitespace-Collapse, `maxChars`-Truncate)? | Unit: `tests/unit/services/test_clean_email_body.py` (8 Cases). Konsequenz: `bootstrapOutlook` schickt nie HTML/Quoted-Replies/Signaturen in den Embedding-Pipeline-Schritt. |
| T12 | Sind die Bootstrap-Walker für SharePoint und Outlook idempotent gegen ein zweites Run mit unveränderten `eTag` / `changeKey`? | Unit: `tests/unit/services/test_bootstrap_sharepoint.py` + `tests/unit/services/test_bootstrap_outlook.py`. Mock-Adapter liefern stable revisions; KnowledgeService-Fake meldet `duplicate` und das Result-Objekt bilanziert `skippedDuplicate`. |
| T13 | Walked `bootstrapGmail` `INBOX + SENT`, parsed MIME-Bodies (preferring `text/plain`, falling back to `text/html`), folgt `nextPageToken`-Pagination und ist idempotent gegen identische `historyId` Revisions? | Unit: `tests/unit/services/test_bootstrap_gmail.py` (6 Cases: header/snippet/body content-objects, MIME plain-vs-html preference, HTML fallback, multi-label fan-out, `nextPageToken` pagination, duplicate accounting). |
| T14 | Walked `bootstrapGdrive` My Drive rekursiv (Folder-MIME-Erkennung, `maxDepth`), respektiert den `maxAgeDays`-Recency-Filter und ist idempotent gegen identische `modifiedTime` Revisions? | Unit: `tests/unit/services/test_bootstrap_gdrive.py` (4 Cases: site/subfolder walk, duplicate accounting, recency-skip via `skippedPolicy`, provenance carries `authority="google"` + `service="drive"`). |
| T15 | Walked `bootstrapClickup` Workspaces → Spaces → Folder/Folderless Lists → Tasks unter `maxWorkspaces` / `maxListsPerWorkspace` / `maxTasks` Caps, respektiert den `maxAgeDays`-Recency-Filter und ist idempotent gegen identische `date_updated` Revisions? | Unit: `tests/unit/services/test_bootstrap_clickup.py` (4 Cases: hierarchy walk indexes 4 tasks across 2 lists, duplicate accounting, recency-skip via `skippedPolicy`, `maxTasks` cap). |