2026-06-02 09:42:12 +02:00

Unified Knowledge Indexing — One RAG Corpus for All Platform Information

How to read this document

Section	Content
Beschreibung und Kontext	Scope (ingestion vs retrieval), terminology (feature / service / connector / interface), as-is vs target, business case and risks.
Teil 1	Ingestion as its own lifecycle: façade API, idempotency, orchestration—not owned by `agentLoop`.
Teil 2	Triggers (beyond upload): user connections, account snapshots, purge; per-connection-type indexing guidance; event-driven option.
Teil 3	Feature injection split into retrieval (agent + `buildAgentContext`) vs corpus (`indexFile`); matrix per `modules/features/` product; real gaps vs false “non-injection”.
Implementation phases · Ziele · AC · Testplan	Rollout, explicit non-goals, acceptance criteria, verification.

Summary: Keep retrieval on AgentService; unify when and how the shared interfaceDbKnowledge corpus is filled (routes, user connections, feature commit points) behind one ingestion contract. Current roadmap scope: user-connection lifecycle (P1a/P1b); daily refresh to close the post-connect delta gap (P1c); explicit user consent plus per-connection ingestion preferences (incl. optional neutralization) in frontend and API; then a scalable event bus (P3). Out of current roadmap: standalone profile/mandate snapshot ingestion (former roadmap P2; its content remains in Teil 2.3 as a future option only).

Beschreibung und Kontext

Scope of this document

We distinguish ingestion (chunking, embedding, persisting into interfaceDbKnowledge) from retrieval (semantic search + buildAgentContext for the LLM). Retrieval for the unified knowledge store is consumed primarily through serviceAgent / runAgent (workspace, graphical editor, CommCoach). Other products (e.g. chatbot, teamsbot) may use different LLM stacks—Teil 3 maps who gets platform RAG vs who does not. This plan does not mandate one global retrieval path for every feature; it does mandate a single ingestion story into the same corpus where that corpus is used. The gap we address is how and when the corpus is filled, not how every LLM entry point reads it.

Terminology (Gateway — see `wiki/b-reference/platform-core/architecture.md`)

This concept separates feature modules, services, connectors, and interfaces. Conflating them produces wrong ownership (e.g. treating “SharePoint” as a modules/features/ product, or treating “mail” as if it were serviceKnowledge).

Term	Where in Gateway	Role for indexing
Feature	`modules/features/*` (e.g. `workspace`, `graphicalEditor`, `commcoach`, `trustee`, `chatbot`)	Product domains: UI, feature routers, orchestration. They trigger actions (upload, sync UX, feature-specific indexers) but must not be the only place that starts embedding work.
Service	`modules/serviceCenter/services/*`	Cross-cutting facades: `serviceKnowledge` (indexing, search, `buildAgentContext`); `serviceExtraction` (content objects); `serviceChat` (chat/workflow documents); `serviceMessaging` (e-mail, notifications); `serviceAgent` (tools that may indirectly call indexing). Unified ingestion is primarily a service-layer responsibility.
Connector	`modules/connectors/*` (Microsoft, Google, …)	Vendor adapters: OAuth, list, download. SharePoint and mailbox I/O live here; routes/features call connectors—they are not interchangeable with a feature or with `serviceKnowledge`.
Interface	`modules/interfaces/*`	Persistence contracts: `interfaceDbKnowledge` (`FileContentIndex`, `ContentChunk`, …), `interfaceDbManagement` (`FileItem`, `DataSource`), `interfaceDbApp` (User, Mandate, `UserConnections`, Preferences). Profile, mandate, and connection rows are interface-backed, not a single “profile feature”.

What we have today (as-is)

1. A strong technical “write” implementation, but no product-wide ingestion contract

serviceKnowledge (mainServiceKnowledge.py) already implements the heavy lifting: indexFile resolves scope from FileItem (single source of truth), optional neutralization, sentence-aware chunking, serviceAi embeddings, ContentChunk persistence, and status on FileContentIndex. That is the right unit of work once content objects exist.
Retrieval is also centralized in the same service: buildAgentContext composes multi-layer context for agents. So read and write to the vector store are service-owned; what is not unified is who may call writes, with which idempotency, and on which lifecycle events.

2. Multiple invocation lanes, same underlying method

Indexing is operationally reached from a mix of layers (not one façade):

HTTP routes — e.g. file pipeline in routeDataFiles: pre-scan / extraction → indexFile. This is the “happy path” for uploads.
Agent tools — serviceAgent core tools (e.g. document/workspace helpers) can call indexFile when the user interacts through the agent. That ties embedding to an agent session even when the same file could have been indexed on upload.
Feature-specific code — e.g. CommCoach indexer paths that call indexFile for that product’s artifacts. Correct for the feature, but it is another ad hoc entry point with its own assumptions.
Connectors — Microsoft/Google (and similar) packages can fetch bytes and ultimately produce files or blobs; OAuth and delta sync are not yet modeled everywhere as first-class ingestion lifecycles (connect → backfill → incremental → revoke) that all funnel through the same API and metadata.

There is no single requestIngestion(...), no standard job identity for “this external item revision”, and no one place that records “this mandate revoked access → tombstone these chunks”.

3. Extraction vs indexing: clear in code, not enforced at the platform edge

serviceExtraction (and preprocessing helpers) produce content objects; indexFile consumes them. The boundary is clean inside the pipeline, but nothing at the platform edge forces every new binary or external document through a single orchestrated “extract then index” step: some paths may skip it, duplicate it, or call indexFile with partial metadata.

4. Truth for scope and identity lives in interfaces—not in “features”

interfaceDbManagement (FileItem, …) and interfaceDbApp (mandate, UserConnections, user profile fields) define who may see what. indexFile already mirrors FileItem for scope; that pattern is good but not generalized to connector-backed items, virtual documents, or curated “account snapshot” chunks. If a connector writes under a different mental model of mandateId / featureInstanceId, interfaceDbKnowledge can drift from app/management truth without a systematic reconcile.

5. User/mandate/profile deltas are not first-class ingestion events

Changes to membership, preferences, or connections update interfaceDbApp (and related tables). They affect searchability and personalization but are not consistently reflected as versioned, allowlisted chunks in the knowledge store—unless a feature manually adds text somewhere. That leaves agents either under-informed or dependent on non-RAG code paths for the same facts.

Summary (as-is): The engine for indexing is serviceKnowledge.indexFile; the policy graph for when to run it is implicit and spread across routes, tools, and features. Connectors and account/mandate data are not uniformly treated as ingestion sources with connect/sync/revoke semantics.

What would make more sense (target)

1. One ingestion façade at the service boundary (not inside agentLoop)

A small, stable API (conceptually requestIngestion / getIngestionStatus, implemented atop or beside KnowledgeService) that every lane calls: routes, feature hooks, connector sync workers, and (if needed) agent tools as thin delegates.
Idempotency (content hash, external revision, eTag, …) enforced here, so routes and tools cannot accidentally double-embed the same logical object.

2. Lifecycle parity for connectors and “connections”

Establish → register datasource + optional short non-secret summary chunk + enqueue backfill.
Delta → incremental jobs with persisted cursors.
Revoke / token invalid / GDPR → tombstone or purge by connectionId / sourceKind, aligned with RBAC—not ad hoc deletes scattered in UI code.

3. Provenance / sourceKind (schema or chunkMetadata)

Today chunks are file-anchored; extended provenance (internal file vs SharePoint item vs mailbox artifact vs profile_snapshot, connectionId for purge, revision keys) should be consistent—either first-class fields on ContentChunk / index rows or a defined convention inside chunkMetadata / contextRef until a migration is justified. Goal: retrieval, audit, and connector revoke cleanup are data-driven, not inferred only from call site.

4. Curated snapshots for interface-backed facts

Allowlisted projections of mandate membership, locale, entitlements (labels), etc., regenerated on interface-level events—not dumping full user rows or secrets into embeddings.

5. Keep retrieval exactly where it is

buildAgentContext remains the agent’s way to consume the corpus; ingestion only ensures that corpus is complete, scoped, and attributable when the agent runs.

6. Observability and cost in one place

Queue depth, embedding spend, failures, and “skipped duplicate” counts attach to the ingestion façade, not to each feature.

Business goal

Whenever meaningful information appears—files, bytes from connectors, configuration that should shape answers, and bounded user/mandate context—the platform should ingest it once into a unified, scoped knowledge layer so agents see one coherent corpus with clear provenance and permissions.

Why this matters now

Information deltas arrive through routes, features, serviceAgent tools, connectors, and interfaceDbApp / interfaceDbManagement updates. Without one ingestion contract and triggers per source, you get: missing indexes, duplicate work, scope drift between knowledge rows and app truth, and repeated engineering per entry path instead of once at the service layer.

Risk if we do not unify

Fragmented memory, inconsistent agent answers, compliance gaps (over-indexing sensitive fields or under-indexing allowed context), and duplicated work per route/feature/tool instead of at a single service boundary.

Teil 1 — Indexing as its own lifecycle (not owned by the agent)

1.1 Current useful core

(Same technical point as “What we have today” §1 above; repeated here for readers who start at Teil 1.) After structured content objects exist, KnowledgeService.indexFile performs chunking, embedding (via AiService), neutralization when required, and persistence via interfaceDbKnowledge. The gap is not the lack of a service method but the lack of a single product-wide contract for when and what enters that pipeline.

1.2 Target responsibility split

Concern	Owner	Notes
Ingestion (normalize → chunk → embed → store)	Knowledge ingestion service (logical module; may remain `KnowledgeService` + new façade)	No dependency on `agentLoop`.
Retrieval (query → ranked context string)	Agent (and similar LLM entry points)	Unchanged by this concept.
Orchestration (queues, retries, backoff)	Job runner / worker (new or existing infra)	Keeps API latency low.

1.3 Public ingestion contract (conceptual)

Introduce a small, stable API surface that all features call—never “only if an agent runs”:

requestIngestion(job: IngestionJob) -> IngestionHandle
- Idempotent key: (sourceKind, sourceId, contentVersion | hash)
- Returns immediately with queued / duplicate / skipped and optional jobId for status polling.
getIngestionStatus(handle)
- Surfaces the same states already used on FileContentIndex (pending, extracted, embedding, indexed, failed) plus connection- or source-specific substates if needed.

The implementation can stay in-process at first (asyncio task queue) and move to Redis/Celery/ARQ later without changing callers.
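A minimal in-process sketch of this contract, assuming a dict-backed registry in place of the real queue and database. `InMemoryIngestionFacade`, `IngestionState`, and the concrete field layout of `IngestionJob` are illustrative assumptions; only `requestIngestion`, `getIngestionStatus`, and the idempotency key come from the description above.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class IngestionState(str, Enum):
    QUEUED = "queued"
    DUPLICATE = "duplicate"
    SKIPPED = "skipped"


@dataclass(frozen=True)
class IngestionJob:
    sourceKind: str      # illustrative values: "file", "sharepoint_item", ...
    sourceId: str        # stable id of the logical object
    contentVersion: str  # content hash or external revision (eTag, ...)


@dataclass(frozen=True)
class IngestionHandle:
    state: IngestionState
    jobId: Optional[str]


class InMemoryIngestionFacade:
    """Hypothetical stand-in: dicts replace the job queue and the index rows."""

    def __init__(self) -> None:
        self._jobByKey: Dict[Tuple[str, str, str], str] = {}
        self._statusByJob: Dict[str, str] = {}

    def requestIngestion(self, job: IngestionJob) -> IngestionHandle:
        key = (job.sourceKind, job.sourceId, job.contentVersion)
        if key in self._jobByKey:
            # Same logical object and version: report it, do not re-embed.
            return IngestionHandle(IngestionState.DUPLICATE, self._jobByKey[key])
        jobId = str(uuid.uuid4())
        self._jobByKey[key] = jobId
        self._statusByJob[jobId] = "pending"
        return IngestionHandle(IngestionState.QUEUED, jobId)

    def getIngestionStatus(self, handle: IngestionHandle) -> Optional[str]:
        if handle.jobId is None:
            return None
        return self._statusByJob.get(handle.jobId)
```

Because callers only see `IngestionJob` and `IngestionHandle`, the dict could later be swapped for a real queue without touching routes, tools, or connectors.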

1.4 Idempotency and versioning

Re-index when content changes: compare content hash or external revision (SharePoint eTag, email Message-ID + folder cursor, file updatedAt).
Skip when hash unchanged to control embedding cost.
Tombstone or scope-disable when a source is deleted or access revoked (see Teil 2).
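The three rules above can be read as one decision function. The following is a sketch under assumptions: `IndexedRevision`, `decideIngestionAction`, and the action strings are hypothetical names, and the revision value stands for whichever key applies (content hash, eTag, cursor).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexedRevision:
    revision: str  # content hash or external revision stored at last index time
    status: str    # e.g. "indexed", "failed"


def decideIngestionAction(existing: Optional[IndexedRevision],
                          incomingRevision: str,
                          sourceAvailable: bool) -> str:
    """Return "reindex", "skip", or "tombstone" for one logical source item."""
    if not sourceAvailable:
        # Source deleted or access revoked: remove what we hold, if anything.
        return "tombstone" if existing is not None else "skip"
    if (existing is not None
            and existing.status == "indexed"
            and existing.revision == incomingRevision):
        # Unchanged content: skipping saves embedding cost.
        return "skip"
    # New item, changed revision, or an earlier run that did not finish.
    return "reindex"
```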

Implementation pitfalls (observed during P0 build, 2026-04-21)

The first end-to-end AC4 test on a 500-page PDF revealed three independent bugs that all had to be fixed before ingestion.skipped.duplicate could ever fire. Each is a design rule that every future ingestion lane must honor:

Hash must derive only from content. _computeIngestionHash initially hashed over (contentObjectId, contentType, data), but contentObjectId came from uuid.uuid4() inside the extractors and was therefore a fresh value on every run. The hash was effectively a per-run nonce — the duplicate check could never match. Rule: hashes MUST be a pure function of payload (contentType, data, and extractor order); never of caller-supplied per-run identifiers. (Tests: tests/unit/services/test_ingestion_hash_stability.py.)
Pre-upserts must preserve _ingestion metadata and the indexed status. routeDataFiles._autoIndexFile persisted a fresh FileContentIndex from the pre-scan before calling requestIngestion, overwriting structure._ingestion.hash and status="indexed" from any prior successful run. The duplicate check saw a row with empty metadata and re-ran the whole embedding stage. Rule: any upsert on the idempotency row taken outside requestIngestion MUST read the existing row first and merge forward both _ingestion and (where applicable) the terminal indexed status.
Extraction-pipeline defaults must preserve granularity for RAG. ExtractionOptions.mergeStrategy defaulted to concatenating every text ContentPart into one blob, collapsing a 500-page PDF into a single chunk whose embedding is a blurred average of the whole document — unusable for targeted retrieval. Rule: every ingestion lane passes mergeStrategy=None explicitly until the default itself can be safely flipped after auditing non-RAG callers. (Tests: tests/unit/services/test_extraction_merge_strategy.py.)
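To illustrate the first rule, here is a sketch of a content-only hash. It is not the actual `_computeIngestionHash`; `computeContentHash` and `extractOnce` are hypothetical, and the dict layout of a content object is assumed. The point is that the per-run `contentObjectId` never enters the digest, while list order (extractor order) does.

```python
import hashlib
import json
import uuid
from typing import Dict, List


def computeContentHash(contentObjects: List[Dict[str, str]]) -> str:
    """Hash only (contentType, data) in extractor order; ignore per-run ids."""
    digest = hashlib.sha256()
    for obj in contentObjects:
        # JSON encoding keeps the two fields unambiguous inside the byte stream.
        payload = json.dumps([obj["contentType"], obj["data"]], ensure_ascii=False)
        digest.update(payload.encode("utf-8"))
        digest.update(b"\x00")  # separator between content objects
    return digest.hexdigest()


def extractOnce() -> List[Dict[str, str]]:
    """Mimics an extractor that assigns a fresh contentObjectId on every run."""
    return [
        {"contentObjectId": str(uuid.uuid4()), "contentType": "text", "data": "Page 1"},
        {"contentObjectId": str(uuid.uuid4()), "contentType": "text", "data": "Page 2"},
    ]
```

Two calls to `extractOnce()` yield different ids but the same hash, which is the property the duplicate check depends on.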

Deferred: ingestion idempotency hardening (uncovered during P0, not blocking AC1–AC5; this is not the same milestone as the P1 user-connection hooks below):

In-flight duplicate detection. The current duplicate check only matches when status == "indexed", so two nearly-simultaneous calls for the same sourceId both run full embedding. Fix candidates: accept status ∈ {"extracted", "embedding", "indexed"} with matching hash as "already in progress", or a per-sourceId asyncio.Lock in KnowledgeService.
Pre-extraction byte-hash shortcut. requestIngestion's duplicate check runs after extraction, so re-indexing a 1.6 MB PDF still spends ~15 s in runExtraction before the content hash is computed. The file-bytes SHA already exists in interfaceDbManagement for upload-dedup — a short-circuit in _autoIndexFile (and symmetric paths) could skip extraction entirely for an unchanged file.
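The per-sourceId lock candidate for in-flight duplicate detection could look roughly like this. It is a sketch, not the `KnowledgeService` implementation: `InFlightGuard` and `runConcurrentDemo` are hypothetical, and a short `asyncio.sleep` stands in for extraction plus embedding.

```python
import asyncio
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple


class InFlightGuard:
    """Serializes ingestion per sourceId so concurrent calls embed only once."""

    def __init__(self) -> None:
        self._locks: DefaultDict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
        self._indexedHash: Dict[str, str] = {}
        self.embedRuns: List[str] = []

    async def ingest(self, sourceId: str, contentHash: str) -> str:
        async with self._locks[sourceId]:
            # The second caller reaches this check only after the first finished.
            if self._indexedHash.get(sourceId) == contentHash:
                return "duplicate"
            await asyncio.sleep(0.01)  # placeholder for extraction + embedding
            self.embedRuns.append(sourceId)
            self._indexedHash[sourceId] = contentHash
            return "indexed"


async def runConcurrentDemo() -> Tuple[List[str], List[str]]:
    guard = InFlightGuard()
    results = await asyncio.gather(
        guard.ingest("file-1", "abc"),
        guard.ingest("file-1", "abc"),
    )
    return list(results), guard.embedRuns
```

A lock like this only protects a single process; with several workers the same idea would need a database-level or distributed guard.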

Teil 2 — Triggers: not only “file write”, but every information delta

“Write path” is too narrow if we read it as “HTTP upload only”. The unified model should treat any authoritative addition or change of platform-visible information as a potential ingestion trigger.

2.1 Trigger taxonomy

Trigger category	Examples	Ingestion behavior (conceptual)
Artifact persisted	User uploads PDF; paste text saved as file; export from a feature	Existing pipeline: extract → `indexFile` (or equivalent).
User connection added / re-authorized	SharePoint OAuth success; Microsoft/Google mail connection; new API credential with data scope	Register datasource + enqueue initial sync (backfill) + index a short connection summary document (site name, root path, principal, no secrets).
Sync for an existing connection	Scheduled delta; webhook (if available); manual “refresh”	Incremental fetch → map to content objects or rows → upsert chunks keyed by external id.
Connection revoked / token invalid	User disconnects; admin removes mandate integration	Tombstone or purge chunks keyed by connection / external source (today: enforce via `chunkMetadata` / `contextRef` convention or future columns); ensure retrieval never serves stale data from that connection.
Mandate / membership	User added to mandate; role change; feature instance attached	Regenerate mandate-safe summary documents (see Section 2.3) if policy allows; re-resolve scope for existing chunks (may be heavy—often better to store immutable `mandateId` on chunks at write time and rely on retrieval filters).
User profile (bounded)	Display name, locale, timezone, non-sensitive preferences	Optional UserContextDocument for personalization—not a dump of the whole `User` row.
Feature configuration	Instance labels, data source labels, automation descriptions	If they should influence answers, emit structured FeatureConfigSnapshot chunks (small, text-first).
Artifact deleted / data subject erasure	User deletes a stored file; mandate/user erase	Purge or tombstone the corresponding `FileContentIndex` / `ContentChunk` rows (by `fileId`); erasure jobs cascade by `userId` / mandate policy. Connection-wide revoke is covered by the “Connection revoked / token invalid” row above.

2.2 User connections (added by the user) as first-class ingestion sources — lifecycle and what to index per connection type

Conceptual focus: The trigger is OAuth success, saved credential, or linked account in UserConnection that grants access to an external system. Implementation still flows through provider code under platform-core/modules/connectors/ (e.g. providerMsft, providerGoogle, providerClickup); that mapping is technical, not the product wording.

Scope — what counts as a user connection here: platform-core/modules/routes/routeDataConnections.py only allows creating connections with type msft, google, or clickup (create_connection → OAuth via connect_service). The authorities options endpoint also lists local, but that path is not wired in create_connection. This subsection only covers those user-connection authorities (plus the surfaces each OAuth integration can reach, e.g. Graph mail for Microsoft). Other Gateway connector packages (FTP, Jira, preprocessor, outbound-only mail, geo APIs, …) are out of scope in §2.2 until they are exposed the same way as UserConnection rows.

Cross-cutting rules (every user-added connection):

Never index: OAuth tokens, refresh payloads, raw credentials, webhook signing secrets.
Always safe to index (metadata only): human-readable connection label, tenant/site name, root path / mailbox address as display string, last sync cursor (store in DB, not necessarily as embedding), external id + revision for idempotency.
Prefer file pipeline for binaries: download → store as FileItem (or equivalent) → reuse existing extract → indexFile path so neutralization and scope mirror upload behavior.
Prefer virtual documents for small text-native items (mail headers/snippets, issue titles/descriptions) to avoid N binary copies.
Quota: per-mandate max documents, max bytes, and “index only last N days” for mail are product knobs, not defaults baked into each adapter.

Lifecycle pattern (target) — tied to the connection row, not to “a connector class”:

Connection event (ConnectionEstablished) fires when the user adds or re-authorizes a connection (OAuth / credential storage, UserConnection, authority msft, google, or clickup per current API).
Ingestion registry records: { connectionId, featureCode, mandateId, userId, scope, externalRoot, adapterKind } (adapter kind = which integration backs this connection).
Sync planner enqueues jobs for that connection:
- Bootstrap: list roots, respect quotas, prioritize recently modified.
- Delta: cursor per drive/site/folder/mailbox/label; persist cursor in DB.
Normalizer maps each external item to either:
- File-like → persist bytes + run extraction + indexFile, or
- Virtual document → build contentObjects in memory + indexFile with a synthetic fileId / stable external key.
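The normalizer step of this lifecycle can be sketched as a small routing function. All names here (`ExternalItem`, `NormalizedItem`, `normalizeItem`, the `itemKind` values) are assumptions for illustration; the real adapters and `indexFile` call are not shown.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed set of small, text-native item kinds that need no stored binary.
TEXT_NATIVE_KINDS = frozenset({"mail", "task"})


@dataclass(frozen=True)
class ExternalItem:
    connectionId: str
    externalId: str
    revision: str               # eTag, modifiedTime, history cursor, ...
    itemKind: str               # "file", "mail", "task" (illustrative)
    text: Optional[str] = None  # present for text-native items


@dataclass(frozen=True)
class NormalizedItem:
    route: str      # "file_like" or "virtual_document"
    stableKey: str  # synthetic id that stays constant across sync runs
    revision: str


def normalizeItem(item: ExternalItem) -> NormalizedItem:
    """Decide whether an external item becomes a stored file or a virtual document."""
    stableKey = f"{item.connectionId}:{item.externalId}"
    if item.itemKind in TEXT_NATIVE_KINDS and item.text is not None:
        # Build content objects in memory; no binary copy is persisted.
        return NormalizedItem("virtual_document", stableKey, item.revision)
    # Persist bytes, run extraction, then index like an upload.
    return NormalizedItem("file_like", stableKey, item.revision)
```

Deriving `stableKey` from the connection and the external id keeps the key independent of any per-run identifier, in line with the hashing rule in Teil 1.4.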

When the user connects Microsoft (Graph — SharePoint, OneDrive, Outlook, Teams) — `providerMsft`

Connection surface (implementation)	Should be indexed (typical)	Usually skip or optional	Notes
SharePoint (`SharepointAdapter`)	Document libraries: PDF, Office, text, markdown; list metadata (library name, path, item name) as `contextRef`.	Huge video blobs, raw executables, duplicates already indexed via another path.	Use driveItem id + eTag for revision. Respect library/folder allowlist on this connection.
OneDrive (`OneDriveAdapter`)	Same as SharePoint for personal files reachable through the user’s connection.	System/temp folders if exposed.	Scope = personal unless shared into mandate explicitly.
Outlook (`OutlookAdapter`)	Mailbox: subjects, from/to/cc, received date, body (plain or stripped HTML) per policy; calendar titles/locations/descriptions if product enables.	Full MIME raw, embedded images as separate media unless needed; entire mailbox without date window in v1.	Strong retention + PII policy: optional “headers + snippet only”; strip signatures/quoted threads; attachments → child file-like jobs (virus/size limits).
Teams (`TeamsAdapter`)	Channel messages (text), meeting chat exports if API allows; files shared in channel as file-like.	Message reactions, per-user read receipts; continuous full channel history without bounds.	Often high volume — default to recent window or keyword/subscription driven sync.

When the user connects Google (Drive, Gmail) — `providerGoogle`

Connection surface (implementation)	Should be indexed (typical)	Usually skip or optional	Notes
Drive (`DriveAdapter`)	Native Google files after export to Office/PDF (existing export MIME map); standard uploaded files download → extract.	Trashed items; shared drives the user did not authorize.	Use file id + modifiedTime; Google Docs need export before text extraction.
Gmail (`GmailAdapter`)	Threads: subject, participants, internalDate, snippet or body per policy; attachments as separate ingest jobs.	Entire “All Mail” unbounded; labels that are purely system.	Same mail cautions as Outlook; Message-ID + History-ID/cursor for delta.

When the user connects ClickUp — `providerClickup` (`AuthAuthority.CLICKUP`)

Connection surface (implementation)	Should be indexed (typical)	Usually skip or optional	Notes
ClickUp (`providerClickup`)	Task name, description, comments; attachment content if downloaded.	Activity stream noise, every status micro-change unless text changed.	Rate limits → prioritize recently updated tasks.

Email and messaging (Outlook + Gmail via Microsoft / Google user connections) — shared cautions

Default tiers: metadata only → snippet → full body → attachments (most expensive / sensitive). Product default vs user override is defined in §2.6 (per-connection mail depth + attachments).
Apply quoted-thread stripping, signature removal, and max body length before embed.
Legal hold / retention: ingestion must respect mandate delete and export rules; disconnecting or revoking the mail connection must purge mail-sourced chunks.
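The depth tiers can be expressed as an ordered enum that controls how much of a message reaches the embedding stage. This is a sketch with hypothetical names (`MailDepth`, `projectMailForEmbedding`) and an assumed dict shape for a mail; attachments and thread cleaning are left out.

```python
from enum import IntEnum
from typing import Dict


class MailDepth(IntEnum):
    METADATA = 1   # subject, participants, dates; no body
    SNIPPET = 2    # metadata plus a short body excerpt
    FULL_BODY = 3  # metadata plus the cleaned body, capped in length


def projectMailForEmbedding(mail: Dict[str, str],
                            depth: MailDepth,
                            snippetChars: int = 200,
                            maxBodyChars: int = 4000) -> str:
    """Build the text that would be chunked and embedded for one message."""
    lines = [
        f"Subject: {mail['subject']}",
        f"From: {mail['from']}",
        f"Date: {mail['date']}",
    ]
    body = mail.get("body", "")
    if depth >= MailDepth.FULL_BODY:
        lines.append(body[:maxBodyChars])
    elif depth >= MailDepth.SNIPPET:
        lines.append(body[:snippetChars])
    return "\n".join(lines)
```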

2.3 “Account and stuff” — what to index vs. what never to index

Roadmap note: Standalone profile/mandate snapshot ingestion (formerly roadmap P2) is out of current scope; the table below remains the target model when that work is picked up again.

Goal: Give agents useful, permission-safe context (“who is this user in this mandate”, “which features are on”, “preferred language”) without creating a second copy of sensitive credentials in the vector store.

Data	Typical treatment
Passwords, refresh tokens, API secrets	Never index; never pass through embedding pipeline.
Email, phone, government IDs	Default deny; only if product explicitly enables “index PII” with neutralization and mandate policy.
Display name, locale, feature entitlements (labels)	Allow as a small structured UserMandateSnapshot document regenerated on change.
Full `User` or `Mandate` DB row	Avoid; generate curated JSON/text snapshots with field allowlists.

Snapshots should be stored with the same scope model as file chunks (personal, featureInstance, mandate, global) so semanticSearch filters stay consistent.
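A field allowlist is the simplest way to guarantee that a snapshot never carries more than intended. The sketch below assumes a plain dict as the user row; `ALLOWED_SNAPSHOT_FIELDS` and `buildUserMandateSnapshot` are hypothetical, and the returned dict only hints at the scope fields a real snapshot document would carry.

```python
from typing import Any, Dict

# Only these fields may ever reach the embedding pipeline (assumed list).
ALLOWED_SNAPSHOT_FIELDS = ("displayName", "locale", "entitlements")


def buildUserMandateSnapshot(userRow: Dict[str, Any], mandateId: str) -> Dict[str, str]:
    """Project a user row onto the allowlist and render it as small text."""
    lines = []
    for fieldName in ALLOWED_SNAPSHOT_FIELDS:
        if fieldName not in userRow:
            continue
        value = userRow[fieldName]
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        lines.append(f"{fieldName}: {value}")
    return {
        "scope": "mandate",  # same scope model as file chunks
        "mandateId": mandateId,
        "text": "\n".join(lines),
    }
```

Anything not named in the allowlist is dropped silently, so adding a new column to the user table cannot leak into the corpus by accident.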

2.4 Event-driven vs. direct calls

Minimum viable: each feature calls requestIngestion at the end of its own transaction (direct call).

Scalable target: emit domain events (FileCommitted, UserConnectionReady / provider-specific ready event, ProfileUpdated) and a single KnowledgeIngestionConsumer subscribes. Benefits: one place for metrics, retries, and rate limits; features stay thin.
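The event-driven variant can be sketched with a tiny synchronous bus. `EventBus` is a hypothetical in-memory stand-in for whatever bus P3 selects, and the consumer only records what it would hand to `requestIngestion`; the payload key `sourceId` is an assumption.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List, Tuple

Handler = Callable[[str, Dict[str, Any]], None]


class EventBus:
    """Minimal in-process publish/subscribe, for illustration only."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, eventType: str, handler: Handler) -> None:
        self._subscribers[eventType].append(handler)

    def emit(self, eventType: str, payload: Dict[str, Any]) -> None:
        for handler in self._subscribers[eventType]:
            handler(eventType, payload)


class KnowledgeIngestionConsumer:
    """Single subscriber: the one place for metrics, retries, and rate limits."""

    HANDLED_EVENTS = ("FileCommitted", "UserConnectionReady", "ProfileUpdated")

    def __init__(self, bus: EventBus) -> None:
        self.requested: List[Tuple[str, str]] = []
        self.eventCounts: DefaultDict[str, int] = defaultdict(int)
        for eventType in self.HANDLED_EVENTS:
            bus.subscribe(eventType, self.onEvent)

    def onEvent(self, eventType: str, payload: Dict[str, Any]) -> None:
        self.eventCounts[eventType] += 1
        # A real consumer would call requestIngestion(...) here.
        self.requested.append((eventType, payload["sourceId"]))
```

Features then only emit events such as `bus.emit("FileCommitted", {"sourceId": "file-1"})` and carry no embedding logic of their own.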

Storage (already implemented — not redesigned here): The platform already uses one knowledge persistence stack: FileContentIndex (incl. mandateId, scope, status) and ContentChunk (pgvector embeddings, fileId, userId, featureInstanceId, contextRef, optional chunkMetadata), accessed via interfaceDbKnowledge. Chunks are file-anchored today; connection- / source-specific provenance (e.g. connectionId, external ids) can ride in contextRef / chunkMetadata until optional schema extensions are justified. This document targets ingestion triggers and lifecycles, not a second corpus or a duplicate storage model.

2.5 Lifecycle gap and daily refresh (roadmap P1c, v1)

Gap: After a successful connect, bootstrap runs once (initial fill). New mail, files, or tasks that arrive after that run are not indexed automatically until a delta path exists (webhook, historyId / changes cursors, etc. — see Teil 2.1 row “Sync for an existing connection”).

Pragmatic mitigation (deliberately simple): A daily scheduler (e.g. once per night, staggered by tenant/load) re-invokes the same bootstrap walkers for every active UserConnection that has knowledge ingestion enabled (see §2.6). The idempotency check and its fast path skip unchanged items, so only new and changed items are embedded.

Pros: No new external dependencies (Pub/Sub, watch renewal) in v1; fits existing BackgroundJob + cron/feature-flag patterns.
Con: Data can lag up to ~24 h before it appears in RAG — acceptable for v1 product choice.
Later (without replacing P1c): Add per-authority delta APIs (Gmail users.history.list, Drive changes.list, ClickUp tighter polling) to reduce latency and API cost.
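One way to stagger the nightly run is to derive a stable offset from the connection id, so each connection always starts at the same minute within a window. This is a sketch under assumptions: `staggerOffset`, `selectConnectionsForRefresh`, the four-hour window, and the dict shape of a connection row are all hypothetical; the real scheduler would use the existing BackgroundJob patterns.

```python
import hashlib
from datetime import timedelta
from typing import Any, Dict, List


def staggerOffset(connectionId: str, windowMinutes: int = 240) -> timedelta:
    """Deterministic start offset inside the nightly window for one connection."""
    digest = hashlib.sha256(connectionId.encode("utf-8")).digest()
    minute = int.from_bytes(digest[:4], "big") % windowMinutes
    return timedelta(minutes=minute)


def selectConnectionsForRefresh(connections: List[Dict[str, Any]]) -> List[str]:
    """Keep only active connections whose owner enabled knowledge ingestion."""
    return [
        conn["id"]
        for conn in connections
        if conn.get("active") and conn.get("knowledgeIngestionEnabled")
    ]
```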

2.6 User consent and per-connection ingestion preferences

Goal: The user explicitly chooses whether this connection may feed the shared knowledge store used for AI/RAG, and how much. Without consent, no knowledge bootstrap is started for that connection (OAuth may still unlock other product features; that split must be obvious in the UI).

Frontend (ui-nyla): extend the add connection flow (and later connection settings) with the dialog and controls below; persist choices via Gateway API before or when triggering knowledge ingestion.

UX when adding a connection

User starts OAuth as today.
Before or immediately after successful authorization: a dialog that clearly separates “establish connection” from “add to knowledge base”.
No: Connection remains usable for other features; either skip KnowledgeIngestionConsumer.onConnectionEstablished for the knowledge lane or persist knowledgeIngestionEnabled=false and never schedule walkers.
Yes: Show advanced settings (second step or accordion) per settings catalog below; persist per connectionId (or a dedicated preferences row); only then enqueue bootstrap (and later P1c refresh) with allowed surfaces and tiers.

Suggested copy (DE — pick one tone / A-B test):

Formal: „Möchten Sie Inhalte aus dieser Verbindung in Ihre Wissensdatenbank übernehmen? KI-Funktionen können dann passender auf Ihre Dokumente und Nachrichten Bezug nehmen — nur mit Ihrer ausdrücklichen Zustimmung und in dem Umfang, den Sie festlegen.“
Approachable: „Sollen wir aus dieser Verbindung ausgewählte Inhalte sicher in Ihre persönliche Wissensdatenbank legen, damit die KI Ihnen besser helfen kann? Sie entscheiden, was übernommen und wie stark anonymisiert wird, und können das jederzeit in den Einstellungen ändern oder die Daten entfernen.“

Mirror in EN if the UI is bilingual.

Minimum settings catalog (all per connection where technically applicable)

Layer	Setting	Meaning
Master	Knowledge ingestion for this connection	`off` / `on`: gates bootstrap + §2.5 (P1c) refresh for the knowledge store.
Protection	Neutralize / anonymize before embedding	When `on`: apply the same (or stricter) neutralization pipeline as for uploads (`FileItem.neutralize` / platform rules) to connector-sourced text before chunking — names, e-mail addresses, phone-like patterns, IBAN-like patterns, per policy. User-facing label „anonymisiert“ maps to this pipeline (not a cryptographic guarantee).
Mail (Outlook / Gmail)	Content depth	At least: metadata only (subject, participants, dates — no body) / snippet / full cleaned body (after `cleanEmailBody` and caps).
Mail	Index attachments	`off` / `on` (with size/type caps).
Files (Drive / SharePoint / OneDrive)	Index binary files	`off` / `on`; optional MIME allowlist (Office/PDF/text only) as a simplified UX preset.
ClickUp	Scope	`titles only` / `title + description` / `+ comments` / optional `attachments`.
Microsoft	Parity	Same dimensions where Graph surfaces mirror Google (mailbox / drive-like).
General	Time window	“Only index items from the last N days” (aligns with existing walker caps; slider with a sensible max).
General	Help: what RAG is not	Short explainer: not real-time mail; delay until next scheduled run (§2.5).

Optional power-user toggles (same screen, collapsed): per authority which surfaces ingest (e.g. Google: Gmail on/off, Drive on/off; Microsoft: SharePoint on/off, Outlook on/off — when product exposes both). Reduces accidental over-breadth without extra wizard steps.

Backend consequence: Walkers read persisted preferences for connectionId each run and filter surfaces and payload tiers before indexFile. When preferences change, the product must decide whether to trigger a re-sync or to apply the change only to new items; document the chosen rule.
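A walker-side filter over persisted preferences might look like the sketch below. `ConnectionIngestionPrefs` and `shouldIngest` are hypothetical, the field set is a subset of the settings catalog above, and `knowledgeIngestionEnabled` defaults to off so that missing consent never leads to indexing.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ConnectionIngestionPrefs:
    knowledgeIngestionEnabled: bool = False  # master switch; off without consent
    surfaces: FrozenSet[str] = frozenset()   # e.g. {"gmail", "drive"} (illustrative)
    indexAttachments: bool = False
    maxAgeDays: int = 30                     # "only index items from the last N days"


def shouldIngest(prefs: ConnectionIngestionPrefs,
                 surface: str,
                 ageDays: int,
                 isAttachment: bool = False) -> bool:
    """Apply the per-connection gates before anything reaches indexFile."""
    if not prefs.knowledgeIngestionEnabled:
        return False
    if surface not in prefs.surfaces:
        return False
    if ageDays > prefs.maxAgeDays:
        return False
    if isAttachment and not prefs.indexAttachments:
        return False
    return True
```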

Neutralization when the user opts in

Ingestion on + neutralization on: After content is obtained (virtual text or extraction output), apply the neutralization stage before chunking/embedding; that text is what gets embedded.
Neutralization off: Still apply baseline hygiene where already defined (e.g. cleanEmailBody for quotes/signatures) — hygiene ≠ full PII removal.
Compliance copy: If the user chooses full body, state clearly that perfect anonymization is not guaranteed without neutralization.
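The ordering of the stages (hygiene, then optional neutralization, then chunking) can be sketched as follows. Every function here is a deliberately naive placeholder: `applyHygiene` is not `cleanEmailBody`, `neutralizeText` masks only e-mail-like strings with an illustrative regex, and `chunkText` uses fixed-size slices instead of sentence-aware chunking.

```python
import re
from typing import List

# Illustrative pattern only; the platform neutralization pipeline covers far more.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def applyHygiene(text: str) -> str:
    """Baseline cleanup: drop quoted reply lines (placeholder for real hygiene)."""
    kept = [line for line in text.splitlines() if not line.lstrip().startswith(">")]
    return "\n".join(kept).strip()


def neutralizeText(text: str) -> str:
    """Placeholder neutralization: mask e-mail-like strings."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


def chunkText(text: str, maxChars: int = 80) -> List[str]:
    """Naive fixed-size chunking, standing in for sentence-aware chunking."""
    return [text[i:i + maxChars] for i in range(0, len(text), maxChars)]


def prepareChunks(text: str, neutralize: bool) -> List[str]:
    """Hygiene always runs; neutralization runs before chunking when enabled."""
    cleaned = applyHygiene(text)
    if neutralize:
        cleaned = neutralizeText(cleaned)
    return chunkText(cleaned)
```

Running neutralization before chunking means the embedded text and the stored chunk text are the same neutralized string, so retrieval cannot surface the original values.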

Teil 3 — Feature injection: retrieval vs corpus, agent loop, and real gaps

“Injection” is ambiguous. This section uses two precise meanings:

Kind	What happens	Primary implementation today
Retrieval injection	Relevant existing chunks and workflow context are assembled and inserted into the LLM prompt (system message) each agent round.	`AgentService.runAgent` → `buildRagContextFn` → `KnowledgeService.buildAgentContext` → `ConversationManager.injectRagContext`. CommCoach wraps the same `buildAgentContext` and adds coaching-specific context.
Corpus injection (indexing)	New text/binary is chunked and embedded and written to `interfaceDbKnowledge` so it can be retrieved later.	`KnowledgeService.indexFile`; callers include `routeDataFiles._autoIndexFile`, `serviceAgent` tools (`_documentTools`, `_workspaceTools`), and CommCoach `serviceCommcoachIndexer`.

A feature can already participate fully in retrieval injection by using AgentService without ever calling indexFile in its own folder. Corpus growth can still happen indirectly (upload pipeline, agent tools). Planning must not label such features as “non-injecting.”

3.1 Features that already use `AgentService.runAgent` (retrieval injection is on by default)

These modules/features/* entry points resolve getService("agent", ctx) and stream agentService.runAgent(...) (code audit):

workspace (routeFeatureWorkspace.py)
graphicalEditor (routeFeatureGraphicalEditor.py)
commcoach (serviceCommcoach.py — custom buildRagContextFn, still uses platform buildAgentContext inside)

For all three, every agent round gets retrieval injection unless RAG fails or returns empty. Corpus updates for the same sessions still depend on separate mechanisms:

Corpus path	When it runs
Upload / `FileItem`	`routeDataFiles` `_autoIndexFile` after storage (feature-agnostic).
Agent tools	If the model invokes tools in `_documentTools` / `_workspaceTools` that call `indexFile`, corpus changes during that agent run—implemented in `serviceAgent`, not in the feature’s route file.

So workspace and graphicalEditor do “inject” in the retrieval sense today; they can “inject” in the corpus sense when users upload files or when the agent runs indexing-capable tools. What they often lack is feature-owned, automatic corpus logic (e.g. “on every graph publish, index a snapshot”) without an upload or tool call.

3.2 Features that do not use `AgentService` (no platform RAG prompt injection from this stack)

These domains do not call runAgent in their modules/features/* trees (audit). They therefore do not receive buildAgentContext through the workspace agent loop:

Feature	Notes
chatbot	Uses an internal LangGraph-style flow (SQL / Tavily / answer nodes). No `getService("knowledge")` / `buildAgentContext` usage under `modules/features/chatbot/` in the audited tree—retrieval injection and the unified corpus are not wired the same way as the workspace agent.
trustee	Domain CRUD and quick actions. `agentPrompt` is only a UI hint that opens the workspace with a prefilled prompt; trustee itself does not call `AgentService`. Corpus: only via shared upload, or if the user later uses the workspace agent with tools.
realEstate	No `AgentService` hook in feature tree; same upload story for files.
teamsbot	Uses `serviceAi` (and related) for the meeting pipeline; `sessionContext` is ephemeral prompt text. No `AgentService` / `buildAgentContext` in the same pattern as workspace.
neutralization	Service/pipeline used inside `indexFile` when `FileItem.neutralize` applies—not a feature that “injects” either kind by itself.

3.3 Summary matrix (per `modules/features/` domain)

Matrix verified by audit on 2026-04-21 (P0): Under platform-core/modules/features/, only workspace, graphicalEditor, and commcoach resolve getService("agent") / getService("knowledge") or call runAgent; only commcoach/serviceCommcoachIndexer.py and commcoach/serviceCommcoach.py touch indexFile / buildAgentContext inside the feature tree. All other domains (chatbot, trustee, realEstate, teamsbot, neutralization) match the "No" rows below.

Feature	`AgentService.runAgent`	Retrieval injection (platform RAG prompt)	Corpus injection (typical today)	Likely gap (this document)
workspace	Yes	Yes	Upload `_autoIndexFile`; optional `indexFile` via agent tools	Automatic corpus for artifacts that never become `FileItem` or tool outputs (exports, structured summaries).
graphicalEditor	Yes	Yes	Same as workspace	Published graph / metadata as searchable corpus without manual upload.
commcoach	Yes	Yes (+ custom RAG layer)	Session `indexFile` (`serviceCommcoachIndexer`) + upload/tools	Extend only if new artifact types need the same feature-local indexer pattern.
chatbot	No	No (unified store)	No feature-local `indexFile`	Decide if chatbot should call `buildAgentContext` / `indexFile` or stay on SQL/Tavily; FAQ / grounding text may need corpus hooks.
trustee	No	Only if user works in workspace	Upload path; agent tools only in workspace	Trustee-native persist events → ingestion when files are not the only representation.
realEstate	No	Only via workspace	Upload path	Same as trustee for case/property narratives.
teamsbot	No	No	None from unified store by default	Persisted transcripts / notes → `indexFile` if they should be mandate-searchable.
neutralization	N/A	N/A	Preconditions for `indexFile`	Ensure all new ingest paths honor `FileItem.neutralize`.

3.4 Shared corpus mechanisms (not feature-local, but serve agent features)

Mechanism	Role
`routeDataFiles` + `_autoIndexFile`	Indexes uploaded `FileItem`s for any UI that uses the upload API—including workspace.
`serviceAgent` `_documentTools` / `_workspaceTools`	Corpus writes when the model chooses tools; available to workspace and graphicalEditor agent sessions (and CommCoach when those tools are in the toolset).
CommCoach `serviceCommcoachIndexer`	Feature-local corpus: coaching session text → `indexFile` without requiring an upload.

3.5 Where additional feature-native corpus injection is still needed

Use this checklist only after accounting for §3.1–3.4:

Content is authoritative in the feature DB or blob store without a guaranteed FileItem + _autoIndexFile path.
Retrieval injection alone is insufficient because nothing ever wrote chunks (e.g. chatbot never hits indexFile).
Relying on the agent to call tools is too fragile for compliance or UX (“user must remember to index”).

If these conditions hold, add requestIngestion / indexFile at the feature commit point (or emit a domain event), and set contextRef / chunkMetadata to carry feature_code and business ids, but no secrets.

3.6 Implementation pattern (feature-native corpus only)

Commit point — authoritative write in the feature or shared storage.
Scope — align with FileItem / ServiceCenterContext rules already used in indexFile.
Unified façade — one ingestion API; avoid a second embedding pipeline.
Purge — tie to fileId, business key, or future connector purge keys on revoke/delete.
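The four steps above can be sketched as a small commit-point hook. Everything here is an assumption for illustration: `IngestionRequest`, `FakeIngestionFacade`, `purgeBySourceId`, `onEntitySaved`, and the `SECRET_KEYS` list are hypothetical, and the real façade's parameters may differ.

```python
from dataclasses import dataclass, field


@dataclass
class IngestionRequest:
    # Hypothetical request shape; the real façade may take other parameters.
    sourceId: str
    text: str
    chunkMetadata: dict = field(default_factory=dict)


class FakeIngestionFacade:
    """In-memory stand-in for the single ingestion API."""

    def __init__(self) -> None:
        self.rows = {}

    def requestIngestion(self, req: IngestionRequest) -> str:
        self.rows[req.sourceId] = req
        return "indexed"

    def purgeBySourceId(self, sourceId: str) -> int:
        # Returns the number of removed rows (0 or 1 in this toy store).
        return 1 if self.rows.pop(sourceId, None) is not None else 0


# Assumed deny-list; a real policy would more likely be an allowlist.
SECRET_KEYS = {"accessToken", "password", "apiKey"}


def onEntitySaved(facade, featureCode: str, entityId: str, fields: dict) -> str:
    """Call right after the authoritative write (the commit point)."""
    safe = {k: v for k, v in fields.items() if k not in SECRET_KEYS}
    text = "\n".join(f"{k}: {v}" for k, v in sorted(safe.items()))
    req = IngestionRequest(
        sourceId=f"{featureCode}:{entityId}",  # business key doubles as purge key
        text=text,
        chunkMetadata={"feature_code": featureCode, "businessId": entityId},
    )
    return facade.requestIngestion(req)
```

Using the business key as `sourceId` keeps purge simple: deleting the entity maps to one `purgeBySourceId` call.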

3.7 Phasing (feature matrix — not the same numbering as roadmap P1c/P1d/P3 above)

FM0: For each row in §3.3, confirm retrieval vs corpus paths; document “satisfied by agent+upload+tools” vs “needs feature hook.”
FM1: Implement feature-native corpus for one domain with a clear §3.5 gap (e.g. trustee entity text, teamsbot persisted transcript).
FM2: Chatbot architecture decision: integrate serviceKnowledge or keep parallel retrieval; if integrate, add explicit corpus rules for config/FAQ.

Implementation phases (suggested)

Phases align with Teil 1 (façade), Teil 2 (connector + trigger catalog), and Teil 3.7 (feature matrix and feature-native corpus pilots). P0 overlaps Teil 3.7 FM0 (complete the per-feature matrix before large builds).

Authority rollout (2026-04-24): The user-connection ingestion lane (bootstrap + purge tied to UserConnection) is delivered per OAuth authority: msft (P1a), google + clickup (P1b) — same consumer, dispatcher fan-out, purge-by-connectionId, and unit tests for walkers + consumer. Next product slices: P1c (daily refresh, §2.5), consent + per-connection preferences + frontend (§2.6), then P3 (event bus at scale).

Phase	Outcome
P0 — Façade + idempotency (done, 2026-04-21)	Single `requestIngestion` / `getIngestionStatus` entry point on `KnowledgeService` with content-hash idempotency, provenance in `structure._ingestion`, and structured logging (`ingestion.queued` / `ingestion.indexed` / `ingestion.skipped.duplicate` / `ingestion.failed`). All prior `indexFile` call sites now route through the façade: `routeDataFiles._autoIndexFile`, `commcoach/serviceCommcoachIndexer.indexSessionData`, `serviceAgent/coreTools/_workspaceTools.readFile`, `serviceAgent/coreTools/_documentTools.describeImage`. Agent tools no longer carry on-demand extraction + ingestion fallbacks — they are pure consumers of the knowledge store. Teil 3.3 matrix audited. Three implementation bugs fixed during verification: stable content hash, pre-upsert `_ingestion` preservation, `mergeStrategy=None` for per-page granularity (see §1.4 Implementation pitfalls).
P1a — User-connection hooks (Microsoft `msft`) (done, 2026-04-21)	`connection.established` / `connection.revoked` emitted from Microsoft data-OAuth success paths and from disconnect/delete when the row is `msft` (incl. `ConnectionStatus.REVOKED` fix where `INACTIVE` was invalid). Central `KnowledgeIngestionConsumer` (`subConnectorIngestConsumer.py`, `app.py` lifespan) maps `established` → `connection.bootstrap` BackgroundJob and `revoked` → synchronous `KnowledgeService.purgeConnection` → `interfaceDbKnowledge.deleteFileContentIndexByConnectionId`. `FileContentIndex.connectionId` + `sourceKind` (and `IngestionJob` carrying both) make connector-sourced rows purgeable. Bootstrap modules live for Microsoft: `subConnectorSyncSharepoint.py` (`sourceKind="sharepoint_item"`, `eTag` as `contentVersion`, `SharepointAdapter.browse` with `@odata.nextLink` pagination) and `subConnectorSyncOutlook.py` (virtual `outlook_message` docs — header / snippet / cleaned body via `cleanEmailBody`, `changeKey` revisions, optional `outlook_attachment` child jobs). Dispatcher `_bootstrapJobHandler` runs SharePoint + Outlook in parallel for `msft`. Structured logs: § Structured ingestion logs. Retrieval threshold calibration (2026-04-21): `buildAgentContext` `minScore` layers lowered to `0.35` so `text-embedding-3-small` matches real cosine scores; validated on Outlook/SharePoint–indexed content. Tests (P1a): purge, consumer msft dispatch, `cleanEmailBody`, `bootstrapSharepoint`, `bootstrapOutlook`.
P1b — User-connection hooks (Google + ClickUp) (done, 2026-04)	Parity with `msft`: `routeSecurityGoogle` / `routeSecurityClickup` call `KnowledgeIngestionConsumer.onConnectionEstablished` after token save; `routeDataConnections` disconnect/delete call `onConnectionRevoked` for all authorities. `_bootstrapJobHandler` fans out google → `bootstrapGdrive` + `bootstrapGmail` in parallel and clickup → `bootstrapClickup`. Walkers: `subConnectorSyncGdrive.py`, `subConnectorSyncGmail.py`, `subConnectorSyncClickup.py` + `subTextClean.py`. Unit tests: `test_bootstrap_gdrive.py`, `test_bootstrap_gmail.py`, `test_bootstrap_clickup.py`, extended `test_knowledge_ingest_consumer.py`.
P1c — Connection refresh (lifecycle v1) (next)	Daily (or nightly) scheduled re-run of the same bootstrap walkers for connections with knowledge ingestion enabled (§2.6). Reuses idempotency + fast-path; closes the post-connect delta gap without webhooks in v1. Observability: same log family as bootstrap; optional `event` suffix or `reason=scheduled_refresh` for shippers.
P1d — Consent + preferences + UI (implemented, see checklist below)	Persist §2.6 settings per `connectionId`; gate `onConnectionEstablished` / P1c jobs on user choice; `ui-nyla` connection wizard + settings screen; walkers honor mail/file/ClickUp depth and neutralization flag.
~~P2 — Profile & mandate snapshots~~	Removed from active roadmap (focus: connections + feature corpus + scale). Target content remains documented in §2.3 for a future re-entry when needed.
P3 — Event bus	Move direct calls to async consumer where load requires it (Teil 2.4 scalable target). Remains in scope.
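The content-hash idempotency from P0 depends on a hash that ignores identifiers regenerated on every extraction run. The following is a minimal sketch under assumptions: the field names `type` and `data` and the function names are hypothetical, and the real hash input may include more fields.

```python
import hashlib
import json


def stableContentHash(contentObjects: list) -> str:
    """Hash only payload-bearing fields, in document order."""
    h = hashlib.sha256()
    for obj in contentObjects:
        # "contentObjectId" changes on every extraction run, so it is left out.
        payload = {"type": obj.get("type"), "data": obj.get("data")}
        h.update(json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8"))
        h.update(b"\x00")  # separator between consecutive objects
    return h.hexdigest()


def shouldSkipIngestion(existingHash, newHash: str) -> bool:
    """Duplicate means: same hash as the one stored at the last successful index."""
    return existingHash is not None and existingHash == newHash
```

Because objects are hashed in order and with their type, a reordered or retyped payload produces a new hash, while a pure UUID regeneration does not.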

P1b checklist (completed — kept for audit trail)

routeSecurityGoogle: after successful data OAuth, enqueue same ingestion consumer path as Microsoft (pass connectionId, AuthAuthority.google, mandate/user scope).
routeSecurityClickup: after successful OAuth / token persistence, same.
routeDataConnections: verify disconnect_service / delete_connection emit revoke (or call purgeConnection) for google and clickup rows, not only msft.
_bootstrapJobHandler: remove any “unsupported_authority” skip for google / clickup once walkers are registered; keep skip only for future authorities.
Quality bar: T10/T12–T15 in the testplan — extend from Microsoft-only assumptions to all three routeDataConnections OAuth authorities.
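The fan-out behavior described for `_bootstrapJobHandler` (parallel parts per authority, a skip for unregistered authorities, one failing part not blocking the others) can be sketched with `asyncio.gather`. The walkers below are stubs and `bootstrapJobHandlerSketch` is a hypothetical name; this is not the real handler.

```python
import asyncio


def _stubWalker(part: str, fail: bool = False):
    # Stub standing in for a real bootstrap walker.
    async def walker(connectionId: str) -> dict:
        if fail:
            raise RuntimeError(f"{part} unavailable")
        return {"part": part, "connectionId": connectionId}
    return part, walker


# Authority -> list of (part, walker); outlook is forced to fail for the demo.
WALKERS = {
    "msft": [_stubWalker("sharepoint"), _stubWalker("outlook", fail=True)],
    "google": [_stubWalker("gdrive"), _stubWalker("gmail")],
    "clickup": [_stubWalker("clickup")],
}


async def bootstrapJobHandlerSketch(connectionId: str, authority: str) -> dict:
    parts = WALKERS.get(authority)
    if not parts:
        return {"done": [], "failed": [], "skipped": "unsupported_authority"}
    # return_exceptions=True collects failures instead of cancelling siblings.
    results = await asyncio.gather(
        *(walker(connectionId) for _, walker in parts), return_exceptions=True
    )
    done, failed = [], []
    for (part, _), result in zip(parts, results):
        if isinstance(result, Exception):
            failed.append({"part": part, "error": str(result)})
        else:
            done.append(result)
    return {"done": done, "failed": failed, "skipped": None}
```

With `return_exceptions=True`, a failing part surfaces as a value in the result list, which maps naturally onto one `bootstrap.failed` log entry per part.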

P1c / P1d checklist (next engineering slices)

P1c: BackgroundJob or cron entry; feature flag; per-tenant stagger; only connections with knowledge ingestion = on; metrics on indexed vs skippedDup per run.
P1d ✅ — implemented:
- UserConnection extended with knowledgeIngestionEnabled: bool (default False = strict opt-in) and knowledgePreferences: Optional[Dict] (schemaVersion=1); DB auto-migration adds columns on startup.
- routeDataConnections create_connection accepts knowledgeIngestionEnabled + knowledgePreferences in request body and persists them before returning.
- OAuth callbacks (routeSecurityGoogle, routeSecurityMsft, routeSecurityClickup) gate callbackRegistry.trigger("connection.established", …) on connection.knowledgeIngestionEnabled; emit structured log ingestion.connection.bootstrap.skipped reason=consent_disabled when disabled.
- _bootstrapJobHandler defensive re-check: loads connection via getUserConnectionById and no-ops if flag was disabled after OAuth (race protection).
- IngestionJob.neutralize: bool added; requestIngestion + _indexFileInternal thread it through; for sourceKind != "file" the flag drives _shouldNeutralize directly; for sourceKind == "file" the FileItem.neutralize column remains authoritative.
- subConnectorPrefs.py — loadConnectionPrefs(connectionId) helper + ConnectionIngestionPrefs dataclass with safe defaults for all §2.6 keys.
- All five walkers (Gmail, GDrive, ClickUp, Outlook, SharePoint) load prefs at bootstrap start; limits structs gain mailContentDepth + neutralize (mail walkers), filesIndexBinaries (Drive), clickupScope (ClickUp), and neutralize (all).
- Unit tests (test_p1d_consent_prefs.py — 10 tests): consent gate no-op, prefs defaults + full mapping, Gmail depth modes (metadata/snippet/full), ClickUp scope (titles vs description).
- Frontend (ui-nyla): AddConnectionWizard 4-step modal (connector → consent → preferences → summary + OAuth); old three-button row replaced with single „Verbindung hinzufügen“ button; createConnectionAndAuth hook method; KnowledgePreferences type in connectionApi.ts.
Default policy (document for deploy): knowledgeIngestionEnabled defaults to False for all new connections. Existing connections (before P1d deploy) have the column NULL/False — no bootstrap is triggered retroactively. Users must explicitly opt in via the wizard or connection settings. If the team decides to migrate existing connections to True, a one-time migration script must be run and communicated via release note.
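The consent gate and the preference defaults can be illustrated as follows. This is a sketch under assumptions: the default values, the allowed value sets, and the helper names `parsePrefs` and `shouldBootstrap` are hypothetical; only the field names are taken from the checklist above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConnectionIngestionPrefs:
    # Assumed conservative defaults; the real §2.6 defaults may differ.
    mailContentDepth: str = "metadata"
    filesIndexBinaries: bool = False
    clickupScope: str = "titles"
    neutralize: bool = False


_MAIL_DEPTHS = {"metadata", "snippet", "full"}
_CLICKUP_SCOPES = {"titles", "description"}  # assumed value set


def parsePrefs(raw: Optional[dict]) -> ConnectionIngestionPrefs:
    """Map a stored preferences dict to typed prefs, falling back to defaults."""
    defaults = ConnectionIngestionPrefs()
    if not raw or raw.get("schemaVersion") != 1:
        return defaults
    depth = raw.get("mailContentDepth", defaults.mailContentDepth)
    scope = raw.get("clickupScope", defaults.clickupScope)
    return ConnectionIngestionPrefs(
        mailContentDepth=depth if depth in _MAIL_DEPTHS else defaults.mailContentDepth,
        filesIndexBinaries=bool(raw.get("filesIndexBinaries", False)),
        clickupScope=scope if scope in _CLICKUP_SCOPES else defaults.clickupScope,
        neutralize=bool(raw.get("neutralize", False)),
    )


def shouldBootstrap(knowledgeIngestionEnabled: Optional[bool]) -> bool:
    # NULL (rows created before the column existed) and False both mean no consent.
    return knowledgeIngestionEnabled is True
```

Treating `None` the same as `False` matches the stated default policy: no bootstrap runs retroactively for connections created before the consent column existed.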

Goals and non-goals

Goals:

One ingestion contract for all features and connector lifecycles.
Indexing decoupled from the agent loop (agents may still invoke tools that ultimately call ingestion, but ingestion must not depend on an agent run).
Explicit handling of connection establishment, sync, and revocation.
Bounded indexing of user/mandate context with a clear PII policy.
Explicit user consent and per-connection ingestion preferences (incl. optional neutralization) before connector content enters the knowledge store (§2.6).

Explicitly not:

Moving retrieval (buildAgentContext) out of agents.
Guaranteeing real-time indexing for every byte without async jobs (latency targets are product decisions).
Indexing everything in the database “because we can”—only curated, policy-approved surfaces.

Affected modules (expected)

Gateway: serviceKnowledge, file upload routes, connector OAuth handlers, sync workers, possibly new serviceKnowledgeIngest or package under modules/serviceCenter/services/.
Interfaces: interfaceDbKnowledge extensions for source metadata if needed; interfaceDbApp (or adjacent) for per-connectionId ingestion preferences from §2.6.
Frontend: ui-nyla — connection wizard + connection detail settings (consent, depth toggles, neutralization, time window).
Wiki / Reference: b-reference/platform-core/ai-agent.md (ingestion vs. retrieval) after implementation.

Open decisions

Topic	Options
Email bodies	Default product stance is user-configurable per connection (§2.6 table: metadata / snippet / full cleaned body); mandate policy may still cap max tier.
Multi-tenant isolation audits	Periodic job to verify chunk `mandateId` matches connection
Cost caps	Per-mandate embedding budget; defer large backfills
Neutralization	User opt-in per connection (§2.6); optional mandate floor (“never below snippet+neutralize for mail”) remains a separate governance decision.
Provenance shape	First-class DB columns vs documented `chunkMetadata` keys for `connectionId`, external id, revision (must support Teil 2 purge rules).
In-flight duplicate handling	Accept `status ∈ {"extracted","embedding","indexed"}` with matching hash as in-progress (cheap, lossy under failure) vs per-`sourceId` `asyncio.Lock` in `KnowledgeService` (strict, requires singleton) — see §1.4 Deferred (ingestion idempotency hardening).
Pre-extraction dedup shortcut	Short-circuit `_autoIndexFile` via the file-bytes SHA in `interfaceDbManagement` before running `runExtraction` (~15 s saved per re-index of a large PDF) — see §1.4 Deferred (ingestion idempotency hardening).
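The strict option for in-flight duplicates, a per-`sourceId` `asyncio.Lock`, could look roughly like this. `IngestionGate` is a hypothetical class; the sketch assumes a single process-wide instance, which is exactly the singleton requirement named in the table.

```python
import asyncio
from collections import defaultdict


class IngestionGate:
    """Serializes ingestion per sourceId so concurrent triggers embed once."""

    def __init__(self) -> None:
        self._locks = defaultdict(asyncio.Lock)  # one lock per sourceId
        self._hashes = {}                        # sourceId -> last indexed hash
        self.embedCalls = 0

    async def ingest(self, sourceId: str, contentHash: str) -> str:
        async with self._locks[sourceId]:
            if self._hashes.get(sourceId) == contentHash:
                return "duplicate"
            await asyncio.sleep(0)  # stand-in for the embedding work
            self.embedCalls += 1
            self._hashes[sourceId] = contentHash
            return "indexed"
```

The second caller waits on the lock, then sees the stored hash and returns `duplicate` without doing embedding work. The trade-off is that the lock lives in process memory, so it does not protect against duplicates across multiple workers.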

Structured ingestion logs (P1 schema)

The connection-lifecycle lane emits the following structured log events. part values sharepoint, outlook, gdrive, gmail, and clickup are all implemented for bootstrap; P1c may add the same events with a distinguishable reason / jobType for scheduled refresh (exact field TBD in implementation). Each event is a single logger.info / .warning / .error call with a stable extra={"event": ...} field so downstream log shippers can route on event without parsing the message string.

`event`	Severity	Emitter	Required `extra` keys	Meaning
`ingestion.connection.bootstrap.queued`	info	`KnowledgeIngestionConsumer._onConnectionEstablished`	`connectionId`, `authority`	A `connection.established` callback was received and a `connection.bootstrap` BackgroundJob is being enqueued.
`ingestion.connection.bootstrap.started`	info	`bootstrap{Sharepoint,Outlook,Gdrive,Gmail,Clickup}`	`connectionId`, `part` (`sharepoint` \| `outlook` \| `gdrive` \| `gmail` \| `clickup`)	The per-part bootstrap walker has begun work.
`ingestion.connection.bootstrap.progress`	info	bootstrap walkers	`connectionId`, `part`, `processed`, `skippedDup`, `failed`	Heart-beat every ~50 items so long-running runs are observable.
`ingestion.connection.bootstrap.done`	info	bootstrap walkers + façade-level totals	`connectionId`, `part`, `indexed`, `skippedDup`, `skippedPolicy`, `failed`, `durationMs` (Outlook/Gmail add `attachmentsIndexed`; SharePoint/Drive add `bytes`; ClickUp adds `workspaces` + `lists`)	Walker finished cleanly.
`ingestion.connection.bootstrap.failed`	error	`_bootstrapJobHandler`	`part`, `connectionId`, `error`	One bootstrap part raised — recorded but the other parts still complete.
`ingestion.connection.bootstrap.skipped`	info	OAuth callbacks + `_bootstrapJobHandler` (incl. its defensive consent re-check)	`connectionId`, `authority`, `reason` (`unsupported_authority` \| `consent_disabled`)	Authority has no bootstrap module registered (e.g. a future provider), or the user has not consented (`knowledgeIngestionEnabled=False`).
`ingestion.connection.purged`	info	`_onConnectionRevoked`	`connectionId`, `authority`, `reason`, `indexRows`, `chunks`	Knowledge purge for a revoked connection completed; numbers reflect the deleted rows.
`ingestion.connection.purged.failed`	error	`_onConnectionRevoked`	`connectionId`, `error`	Purge raised; the revoke event was still acknowledged upstream.

All events should keep field naming consistent with the existing ingestion.queued / .indexed / .skipped.duplicate / .failed family from P0 (camelCase, connectionId, mandateId, userId). Counters are integers, durations are in milliseconds.
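As an illustration of the "single call with a stable `event` field" rule, the sketch below emits one `bootstrap.done` record with the standard library `logging` module. The logger name and the helper `logBootstrapDone` are assumptions; the `extra` keys follow the table above.

```python
import logging

logger = logging.getLogger("ingestion")  # logger name is an assumption


def logBootstrapDone(connectionId: str, part: str, indexed: int, skippedDup: int,
                     skippedPolicy: int, failed: int, durationMs: int) -> None:
    """Emit one structured record; shippers route on record.event."""
    logger.info(
        "bootstrap done part=%s connectionId=%s", part, connectionId,
        extra={
            "event": "ingestion.connection.bootstrap.done",
            "connectionId": connectionId,
            "part": part,
            "indexed": indexed,          # counters are integers
            "skippedDup": skippedDup,
            "skippedPolicy": skippedPolicy,
            "failed": failed,
            "durationMs": durationMs,    # durations in milliseconds
        },
    )
```

Keys passed via `extra` become attributes on the `LogRecord`, so a JSON formatter or shipper can read `record.event` without parsing the message string. None of the keys used here collide with built-in `LogRecord` attributes.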

Links

How-to / orientation: Unified knowledge & RAG ingestion (guide)
Gateway reference (retrieval + knowledge): wiki/b-reference/platform-core/architecture.md, wiki/b-reference/platform-core/ai-agent.md
Implementation touchpoints (indicative): platform-core/modules/serviceCenter/services/serviceKnowledge/mainServiceKnowledge.py, platform-core/modules/routes/routeDataFiles.py, platform-core/modules/features/commcoach/serviceCommcoachIndexer.py, agent coreTools _documentTools / _workspaceTools, platform-core/modules/datamodels/datamodelExtraction.py (ExtractionOptions.mergeStrategy: Optional[MergeStrategy]).
Unit tests (P0 guardrails): platform-core/tests/unit/services/test_ingestion_hash_stability.py, platform-core/tests/unit/services/test_extraction_merge_strategy.py.
Unit tests (P1a — Microsoft, done): platform-core/tests/unit/services/test_connection_purge.py, platform-core/tests/unit/services/test_knowledge_ingest_consumer.py (incl. msft fan-out), platform-core/tests/unit/services/test_clean_email_body.py, platform-core/tests/unit/services/test_bootstrap_sharepoint.py, platform-core/tests/unit/services/test_bootstrap_outlook.py.
Unit tests (P1b — Google + ClickUp, done): test_knowledge_ingest_consumer (google / clickup fan-out), test_bootstrap_gmail.py, test_bootstrap_gdrive.py, test_bootstrap_clickup.py. P1d (done): test_p1d_consent_prefs.py (10 tests: consent gate, prefs parsing, Gmail depth modes, ClickUp scope). P1c: add scheduler tests when implemented.
P1 implementation touchpoints: platform-core/modules/serviceCenter/services/serviceKnowledge/subConnectorIngestConsumer.py, platform-core/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncSharepoint.py, platform-core/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncOutlook.py, platform-core/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGdrive.py, platform-core/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGmail.py, platform-core/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncClickup.py, platform-core/modules/serviceCenter/services/serviceKnowledge/subTextClean.py, platform-core/modules/interfaces/interfaceDbKnowledge.py (deleteFileContentIndexByConnectionId), platform-core/modules/datamodels/datamodelKnowledge.py (FileContentIndex.connectionId + sourceKind), platform-core/modules/connectors/providerMsft/connectorMsft.py (@odata.nextLink-loop in SharepointAdapter.browse, eTag in _graphItemToExternalEntry), platform-core/modules/connectors/providerGoogle/connectorGoogle.py (P1b: Drive + Gmail revision keys and download/export paths), platform-core/modules/routes/routeSecurityMsft.py (P1a callbacks), platform-core/modules/routes/routeSecurityGoogle.py and platform-core/modules/routes/routeSecurityClickup.py (P1b: parity callbacks), platform-core/modules/routes/routeDataConnections.py (revoke for all authorities), platform-core/app.py (consumer registration in lifespan).

Acceptance criteria (plan level)

#	Criterion	Priority
1	Every new file that should be searchable triggers ingestion without requiring an agent session.	must
2	User connection connect / disconnect has defined ingestion or purge behavior for each OAuth authority `routeDataConnections` supports (P1a `msft`, P1b `google` / `clickup`); plus user-controlled opt-in and preference bundle before ingestion (P1d, §2.6).	must
3	Profile/mandate snapshot ingestion (former roadmap P2) is deferred; when re-opened, snapshots must use an explicit allowlist and never embed secrets. Until then, §2.6 consent + neutralization covers connector-sourced PII risk.	should (reactivated when P2 returns)
4	Ingestion is idempotent for unchanged content (no duplicate embedding work). Verified 2026-04-21 on a 500-page PDF: second re-index trigger logs `ingestion.skipped.duplicate` with a stable hash, zero embedding API calls. See §1.4 pitfalls for the three bug classes that had to be fixed first.	must
5	Teil 3.3 matrix completed: every `modules/features/` product row has retrieval (agent vs none), corpus (upload / tools / feature indexer), and gap explicitly stated; a feature is not “non-injecting” if `AgentService` already provides retrieval injection.	should

Test plan (concept verification)

ID	Question	Method
T1	Are all existing index entry points inventoried?	Code audit + table in the build phase
T2	Is it clear which features use retrieval (agent), corpus only, or both?	Review the Teil 3.3 matrix against `runAgent` / `indexFile` call sites
T3	Does platform RAG retrieval (`buildAgentContext` via `AgentService`) keep its role unchanged for workspace / graphicalEditor / CommCoach?	Review `agentLoop` + `mainServiceAgent._createBuildRagContextFn`
T4	Without a connectionId column, can revoke/purge for connector chunks be specified today as a metadata convention?	Review Teil 2.1 + the Provenance row under Offene Entscheidungen (open decisions)
T5	Is revoke/purge covered in a threat model for each user-connection authority / integration surface (Teil 2.2; only `msft` / `google` / `clickup` per `routeDataConnections`)?	Data flow: connection → `FileItem` / virtual doc → chunks
T6	Is the content hash stable across two extraction runs of the same file (different `contentObjectId` UUIDs, identical payload)?	Unit: `tests/unit/services/test_ingestion_hash_stability.py` (5 cases: UUID regeneration, data delta, order delta, type delta, empty). Live: a second trigger on an already indexed file logs `ingestion.skipped.duplicate` with an identical hash (verified 2026-04-21).
T7	Do multi-page PDFs keep their per-page chunks (no `MergeStrategy` concatenation)?	Unit: `tests/unit/services/test_extraction_merge_strategy.py`. Live: 500-page PDF → 563 ContentObjects, 567 embedding chunks in 24 batches (verified 2026-04-21).
T8	Do `_ingestion.hash` and `status="indexed"` survive a pre-scan re-upsert in `_autoIndexFile`?	Review `routeDataFiles._autoIndexFile` around line 127: the existing row is read before the upsert, and `_ingestion` + `indexed` are merged into the fresh `contentIndex`. Live: second trigger → `ingestion.skipped.duplicate` instead of re-embedding.
T9	Does a `connection.revoked` event remove all `FileContentIndex` rows + `ContentChunk`s of one connection and nothing else (uploads without `connectionId` and other connections stay intact)?	Unit: `tests/unit/services/test_connection_purge.py` (3 cases: positive purge, empty connectionId no-op, unknown connectionId).
T10	Does the `KnowledgeIngestionConsumer` dispatch `connection.established` correctly as an asynchronous `connection.bootstrap` job (P1a: msft → SharePoint + Outlook in parallel; P1b: google → Drive + Gmail in parallel; clickup → tasks) and `connection.revoked` synchronously as a purge, for each of the three `routeDataConnections` authorities?	P1a + P1b (done): `test_knowledge_ingest_consumer.py`, covering all three authorities + revoke; unknown authorities yield `skipped.reason="unsupported_authority"`. P1d: additionally, dispatch only when consent is given.
T11	Does `cleanEmailBody` reduce realistic Outlook HTML to the sender's own body text (HTML strip, quote strip EN+DE, signature strip, whitespace collapse, `maxChars` truncation)?	Unit: `tests/unit/services/test_clean_email_body.py` (8 cases). Consequence: `bootstrapOutlook` never sends HTML, quoted replies, or signatures into the embedding pipeline step.
T12	Are the bootstrap walkers for SharePoint and Outlook idempotent on a second run with unchanged `eTag` / `changeKey`?	Unit: `tests/unit/services/test_bootstrap_sharepoint.py` + `tests/unit/services/test_bootstrap_outlook.py`. Mock adapters return stable revisions; the KnowledgeService fake reports `duplicate`, and the result object tallies `skippedDuplicate`.
T13	Does `bootstrapGmail` walk `INBOX + SENT`, parse MIME bodies (preferring `text/plain`, falling back to `text/html`), follow `nextPageToken` pagination, and stay idempotent against identical `historyId` revisions?	P1b (done): unit `test_bootstrap_gmail.py`. P1d: the walker honors the content depth from §2.6 (metadata / snippet / body).
T14	Does `bootstrapGdrive` walk My Drive recursively (folder MIME detection, `maxDepth`), honor the `maxAgeDays` recency filter, and stay idempotent against identical `modifiedTime` revisions?	P1b (done): unit `test_bootstrap_gdrive.py`. P1d: binary files / MIME allowlist from §2.6.
T15	Does `bootstrapClickup` walk workspaces → spaces → folder / folderless lists → tasks under the `maxWorkspaces` / `maxListsPerWorkspace` / `maxTasks` caps, honor the `maxAgeDays` recency filter, and stay idempotent against identical `date_updated` revisions?	P1b (done): unit `test_bootstrap_clickup.py`. P1d: ClickUp scope (title / description / comments) from §2.6.
T16	Does the P1c daily job run only connections with knowledge ingestion = on, and do costs / API limits stay manageable through idempotency + fast path?	Integration or unit test with a fake clock: second run → mostly `skippedDup`; logs `ingestion.connection.bootstrap.*` with a recognizable scheduled `reason` (if implemented).
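The cleaning steps that T11 checks (HTML strip, EN+DE quote strip, signature strip, whitespace collapse, `maxChars` truncation) can be approximated as below. This is a rough sketch, not the real `cleanEmailBody`: the function name, the marker patterns, and the default `maxChars` are assumptions, and real mail needs more robust handling.

```python
import html
import re

# Lines that usually start quoted history (English and German clients).
_QUOTE_MARKERS = re.compile(
    r"^(On .+ wrote:|Am .+ schrieb .*:|-{2,}\s*Original Message\s*-{2,}"
    r"|-{2,}\s*Ursprüngliche Nachricht\s*-{2,}|>)",
    re.IGNORECASE,
)


def cleanEmailBodySketch(raw: str, maxChars: int = 4000) -> str:
    # 1) HTML strip: turn block ends into newlines, then drop remaining tags.
    text = re.sub(r"(?i)<br\s*/?>|</p>|</div>", "\n", raw)
    text = re.sub(r"<[^>]+>", "", text)
    text = html.unescape(text).replace("\xa0", " ")
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # 2) Quote strip: everything from the first marker on is history.
        if _QUOTE_MARKERS.match(stripped):
            break
        # 3) Signature strip: conventional "--" delimiter line.
        if stripped == "--":
            break
        kept.append(stripped)
    # 4) Whitespace collapse, then 5) truncation.
    body = re.sub(r"[ \t]+", " ", "\n".join(kept))
    body = re.sub(r"\n{3,}", "\n\n", body).strip()
    return body[:maxChars]
```

Cutting at the first quote or signature marker is deliberately aggressive: for embedding purposes, losing a trailing line is cheaper than indexing the same quoted thread in every reply.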
