resumed rag concept
parent
19f28e85d9
commit
8529ceecc1
1 changed files with 11 additions and 8 deletions
@@ -353,7 +353,7 @@ Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog)
| Phase | Outcome |
|-------|---------|
| **P0 — Façade + idempotency** *(done, 2026-04-21)* | Single `requestIngestion` / `getIngestionStatus` entry point on `KnowledgeService` with content-hash idempotency, provenance in `structure._ingestion`, and structured logging (`ingestion.queued` / `ingestion.indexed` / `ingestion.skipped.duplicate` / `ingestion.failed`). All prior `indexFile` call sites now route through the façade: `routeDataFiles._autoIndexFile`, `commcoach/serviceCommcoachIndexer.indexSessionData`, `serviceAgent/coreTools/_workspaceTools.readFile`, `serviceAgent/coreTools/_documentTools.describeImage`. Agent tools no longer carry on-demand extraction + ingestion fallbacks — they are pure consumers of the knowledge store. **Teil 3.3** matrix audited. Three implementation bugs fixed during verification: stable content hash, pre-upsert `_ingestion` preservation, `mergeStrategy=None` for per-page granularity (see **§1.4 Implementation pitfalls**). |
| **P1 — User-connection hooks** *(done, 2026-04-21)* | `connection.established` / `connection.revoked` callbacks emitted from every OAuth callback (`routeSecurityMsft`, `routeSecurityGoogle`, `routeSecurityClickup`) and from `routeDataConnections.disconnect_service` / `delete_connection`; the `ConnectionStatus.INACTIVE` enum bug (the value did not exist) was fixed by switching the disconnect path to `ConnectionStatus.REVOKED`. A new central `KnowledgeIngestionConsumer` (`subConnectorIngestConsumer.py`, registered in `app.py` lifespan) maps `established` to a `connection.bootstrap` BackgroundJob and `revoked` to a synchronous purge through `KnowledgeService.purgeConnection` → `interfaceDbKnowledge.deleteFileContentIndexByConnectionId`. `FileContentIndex` gained `connectionId` and `sourceKind` columns (auto-applied by `connectorDbPostgre`); `IngestionJob` carries both end-to-end so every chunk is purgeable by connection. SharePoint pilot (`subConnectorSyncSharepoint.py`) walks sites with the `@odata.nextLink` paginated `SharepointAdapter.browse`, downloads files, runs the standard extraction pipeline and uses the Graph `eTag` as `contentVersion` so reruns log `ingestion.skipped.duplicate`. Outlook pilot (`subConnectorSyncOutlook.py`) treats messages as virtual documents (header / snippet / cleaned body via the new `cleanEmailBody` utility) with `sourceKind="outlook_message"`; attachments are optional child jobs. Structured-log schema (started / progress / done / purged) defined in **§ Structured ingestion logs** below. Five new unit tests (purge, consumer dispatch, `cleanEmailBody`, bootstrapSharepoint mock, bootstrapOutlook mock) lock the contract. **Retrieval threshold calibration (2026-04-21):** during UI verification `buildAgentContext` returned `instanceChunks=0` despite 640 correctly-indexed rows — root cause was overly aggressive `minScore` thresholds (Layer 1 `0.65`, Layer 1.5 `0.55`, Layer 3 `0.70`) versus realistic `text-embedding-3-small` cosine similarities in the `0.30`–`0.55` range. All three thresholds lowered to `0.35`; agent then correctly synthesized answers from indexed Outlook/SharePoint content without resorting to live tools. |
| **P1 — User-connection hooks** *(done, 2026-04-21)* | `connection.established` / `connection.revoked` callbacks emitted from every OAuth callback (`routeSecurityMsft`, `routeSecurityGoogle`, `routeSecurityClickup`) and from `routeDataConnections.disconnect_service` / `delete_connection`; the `ConnectionStatus.INACTIVE` enum bug (the value did not exist) was fixed by switching the disconnect path to `ConnectionStatus.REVOKED`. A new central `KnowledgeIngestionConsumer` (`subConnectorIngestConsumer.py`, registered in `app.py` lifespan) maps `established` to a `connection.bootstrap` BackgroundJob and `revoked` to a synchronous purge through `KnowledgeService.purgeConnection` → `interfaceDbKnowledge.deleteFileContentIndexByConnectionId`. `FileContentIndex` gained `connectionId` and `sourceKind` columns (auto-applied by `connectorDbPostgre`); `IngestionJob` carries both end-to-end so every chunk is purgeable by connection. **All three OAuth authorities are wired up** with one bootstrap module per service: `subConnectorSyncSharepoint.py` (`sourceKind="sharepoint_item"`, `eTag` as `contentVersion`, walks sites with the `@odata.nextLink` paginated `SharepointAdapter.browse`), `subConnectorSyncOutlook.py` (virtual `outlook_message` documents — header / snippet / cleaned body via the shared `cleanEmailBody` utility — with `changeKey` revisions and optional `outlook_attachment` child jobs), `subConnectorSyncGdrive.py` (`gdrive_item`, `modifiedTime` revisions, recursive walk from My Drive root with depth/age caps and Google-Doc export support inherited from `DriveAdapter.download`), `subConnectorSyncGmail.py` (virtual `gmail_message` documents with `historyId` revisions, walks `INBOX + SENT` by default, MIME-tree body extraction prefers `text/plain` and falls back to `text/html`, optional `gmail_attachment` child jobs), `subConnectorSyncClickup.py` (virtual `clickup_task` documents with `date_updated` revisions, walks teams → spaces → folder/folderless lists → tasks with workspace and per-workspace list caps, header carries name/status/list/space/assignees/tags/url so search prompts retrieve task context without a live API call). The dispatcher `_bootstrapJobHandler` fans out per authority (msft → sharepoint+outlook in parallel, google → drive+gmail in parallel, clickup → tasks); unsupported authorities log `ingestion.connection.bootstrap.skipped reason=unsupported_authority`. Structured-log schema (started / progress / done / purged) defined in **§ Structured ingestion logs** below. Eight new unit tests (purge, consumer dispatch + per-authority routing, `cleanEmailBody`, bootstrapSharepoint, bootstrapOutlook, bootstrapGmail, bootstrapGdrive, bootstrapClickup) lock the contract. **Retrieval threshold calibration (2026-04-21):** during UI verification `buildAgentContext` returned `instanceChunks=0` despite 640 correctly-indexed rows — root cause was overly aggressive `minScore` thresholds (Layer 1 `0.65`, Layer 1.5 `0.55`, Layer 3 `0.70`) versus realistic `text-embedding-3-small` cosine similarities in the `0.30`–`0.55` range. All three thresholds lowered to `0.35`; agent then correctly synthesized answers from indexed Outlook/SharePoint content without resorting to live tools. |
| **P2 — Profile & mandate snapshots** | Allowlisted fields only (**Teil 2.3**); regenerate on events; explicit admin toggle per mandate if needed. |
| **P3 — Event bus** | Move direct calls to async consumer where load requires it (**Teil 2.4** scalable target). |
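
The content-hash idempotency described for P0 can be illustrated with a minimal sketch. It assumes an in-memory store and invented names (`stable_content_hash`, `InMemoryKnowledgeStore`, `request_ingestion`); it is not the actual `KnowledgeService` façade, only an illustration of why a stable hash plus a preserved `_ingestion` record lets a second trigger be skipped.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


def stable_content_hash(content: bytes, content_version: Optional[str] = None) -> str:
    """Hash only the bytes plus an optional revision marker, never timestamps."""
    digest = hashlib.sha256()
    digest.update(content)
    if content_version is not None:
        # Separator byte keeps (content, version) pairs unambiguous.
        digest.update(b"\x00" + content_version.encode("utf-8"))
    return digest.hexdigest()


@dataclass
class InMemoryKnowledgeStore:
    """Hypothetical stand-in for the knowledge store: fileId -> structure dict."""

    rows: Dict[str, dict] = field(default_factory=dict)

    def request_ingestion(
        self, file_id: str, content: bytes, content_version: Optional[str] = None
    ) -> str:
        new_hash = stable_content_hash(content, content_version)
        existing = self.rows.get(file_id, {}).get("_ingestion", {})
        # Same hash and already indexed: skip instead of re-embedding.
        if existing.get("hash") == new_hash and existing.get("status") == "indexed":
            return "ingestion.skipped.duplicate"
        # A real pipeline would extract and embed here; the sketch only records provenance.
        self.rows[file_id] = {"_ingestion": {"hash": new_hash, "status": "indexed"}}
        return "ingestion.indexed"
```

Calling `request_ingestion` twice with identical bytes returns `ingestion.indexed` and then `ingestion.skipped.duplicate`; any change to the bytes or the revision marker produces a new hash and a fresh index run.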

@@ -405,11 +405,11 @@ The connection-lifecycle lane emits the following structured log events. Each ev
| `event` | Severity | Emitter | Required `extra` keys | Meaning |
|---------|----------|---------|------------------------|---------|
| `ingestion.connection.bootstrap.queued` | info | `KnowledgeIngestionConsumer._onConnectionEstablished` | `connectionId`, `authority` | A `connection.established` callback was received and a `connection.bootstrap` BackgroundJob is being enqueued. |
| `ingestion.connection.bootstrap.started` | info | `bootstrapSharepoint` / `bootstrapOutlook` | `connectionId`, `part` (`sharepoint` \| `outlook`) | The per-part bootstrap walker has begun work. |
| `ingestion.connection.bootstrap.started` | info | `bootstrap{Sharepoint,Outlook,Gdrive,Gmail,Clickup}` | `connectionId`, `part` (`sharepoint` \| `outlook` \| `gdrive` \| `gmail` \| `clickup`) | The per-part bootstrap walker has begun work. |
| `ingestion.connection.bootstrap.progress` | info | bootstrap walkers | `connectionId`, `part`, `processed`, `skippedDup`, `failed` | Heartbeat every ~50 items so long-running runs are observable. |
| `ingestion.connection.bootstrap.done` | info | bootstrap walkers + façade-level totals | `connectionId`, `part`, `indexed`, `skippedDup`, `skippedPolicy`, `failed`, `durationMs` (Outlook adds `attachmentsIndexed`) | Walker finished cleanly. |
| `ingestion.connection.bootstrap.failed` | error | `_bootstrapJobHandler` | `part`, `connectionId`, `error` | One bootstrap part raised — recorded but the other part still completes. |
| `ingestion.connection.bootstrap.skipped` | info | `_bootstrapJobHandler` | `connectionId`, `authority`, `reason` (`P1_pilot_scope`) | Authority is not in P1 scope (everything except `msft`). |
| `ingestion.connection.bootstrap.done` | info | bootstrap walkers + façade-level totals | `connectionId`, `part`, `indexed`, `skippedDup`, `skippedPolicy`, `failed`, `durationMs` (Outlook/Gmail add `attachmentsIndexed`; SharePoint/Drive add `bytes`; ClickUp adds `workspaces` + `lists`) | Walker finished cleanly. |
| `ingestion.connection.bootstrap.failed` | error | `_bootstrapJobHandler` | `part`, `connectionId`, `error` | One bootstrap part raised — recorded but the other parts still complete. |
| `ingestion.connection.bootstrap.skipped` | info | `_bootstrapJobHandler` | `connectionId`, `authority`, `reason` (`unsupported_authority`) | Authority has no bootstrap module registered (e.g. a future provider). |
| `ingestion.connection.purged` | info | `_onConnectionRevoked` | `connectionId`, `authority`, `reason`, `indexRows`, `chunks` | Knowledge purge for a revoked connection completed; numbers reflect the deleted rows. |
| `ingestion.connection.purged.failed` | error | `_onConnectionRevoked` | `connectionId`, `error` | Purge raised; the revoke event was still acknowledged upstream. |
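
As a rough illustration of the per-authority fan-out and the `bootstrap.failed` / `bootstrap.skipped` events in the table above, the sketch below runs the parts of one authority concurrently and logs a failure without cancelling the sibling parts. The routing table, the `run_bootstrap` function, and the walker signature are assumptions made for this example; they are not the actual `_bootstrapJobHandler`.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Dict, Optional

logger = logging.getLogger("ingestion")

# Hypothetical routing table mirroring the fan-out described in the plan.
PARTS_BY_AUTHORITY = {
    "msft": ("sharepoint", "outlook"),
    "google": ("gdrive", "gmail"),
    "clickup": ("clickup",),
}

Walker = Callable[[str], Awaitable[dict]]


async def run_bootstrap(
    connection_id: str, authority: str, walkers: Dict[str, Walker]
) -> Dict[str, Optional[dict]]:
    parts = PARTS_BY_AUTHORITY.get(authority)
    if parts is None:
        logger.info(
            "ingestion.connection.bootstrap.skipped",
            extra={"connectionId": connection_id, "authority": authority,
                   "reason": "unsupported_authority"},
        )
        return {}

    # return_exceptions=True keeps one failing part from aborting the others.
    outcomes = await asyncio.gather(
        *(walkers[part](connection_id) for part in parts), return_exceptions=True
    )

    results: Dict[str, Optional[dict]] = {}
    for part, outcome in zip(parts, outcomes):
        if isinstance(outcome, BaseException):
            logger.error(
                "ingestion.connection.bootstrap.failed",
                extra={"part": part, "connectionId": connection_id, "error": str(outcome)},
            )
            results[part] = None
        else:
            results[part] = outcome
    return results
```

The `extra` dictionaries carry exactly the required keys listed per event, so a structured log formatter can pick them up as fields.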

@@ -421,8 +421,8 @@ All events should keep field naming consistent with the existing `ingestion.queu
- **Gateway reference (retrieval + knowledge):** `wiki/b-reference/gateway/architecture.md`, `wiki/b-reference/gateway/ai-agent.md`
- **Implementation touchpoints (indicative):** `gateway/modules/serviceCenter/services/serviceKnowledge/mainServiceKnowledge.py`, `gateway/modules/routes/routeDataFiles.py`, `gateway/modules/features/commcoach/serviceCommcoachIndexer.py`, agent `coreTools` `_documentTools` / `_workspaceTools`, `gateway/modules/datamodels/datamodelExtraction.py` (`ExtractionOptions.mergeStrategy: Optional[MergeStrategy]`).
- **Unit tests (P0 guardrails):** `gateway/tests/unit/services/test_ingestion_hash_stability.py`, `gateway/tests/unit/services/test_extraction_merge_strategy.py`.
- **Unit tests (P1 guardrails):** `gateway/tests/unit/services/test_connection_purge.py`, `gateway/tests/unit/services/test_knowledge_ingest_consumer.py`, `gateway/tests/unit/services/test_clean_email_body.py`, `gateway/tests/unit/services/test_bootstrap_sharepoint.py`, `gateway/tests/unit/services/test_bootstrap_outlook.py`.
- **P1 implementation touchpoints:** `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorIngestConsumer.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncSharepoint.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncOutlook.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subTextClean.py`, `gateway/modules/interfaces/interfaceDbKnowledge.py` (`deleteFileContentIndexByConnectionId`), `gateway/modules/datamodels/datamodelKnowledge.py` (`FileContentIndex.connectionId` + `sourceKind`), `gateway/modules/connectors/providerMsft/connectorMsft.py` (`@odata.nextLink`-loop in `SharepointAdapter.browse`, `eTag` in `_graphItemToExternalEntry`), `gateway/modules/routes/routeSecurityMsft.py` / `routeSecurityGoogle.py` / `routeSecurityClickup.py` / `routeDataConnections.py` (callback emission + `ConnectionStatus.REVOKED` fix), `gateway/app.py` (consumer registration in lifespan).
- **Unit tests (P1 guardrails):** `gateway/tests/unit/services/test_connection_purge.py`, `gateway/tests/unit/services/test_knowledge_ingest_consumer.py`, `gateway/tests/unit/services/test_clean_email_body.py`, `gateway/tests/unit/services/test_bootstrap_sharepoint.py`, `gateway/tests/unit/services/test_bootstrap_outlook.py`, `gateway/tests/unit/services/test_bootstrap_gmail.py`, `gateway/tests/unit/services/test_bootstrap_gdrive.py`, `gateway/tests/unit/services/test_bootstrap_clickup.py`.
- **P1 implementation touchpoints:** `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorIngestConsumer.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncSharepoint.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncOutlook.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGdrive.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGmail.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncClickup.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subTextClean.py`, `gateway/modules/interfaces/interfaceDbKnowledge.py` (`deleteFileContentIndexByConnectionId`), `gateway/modules/datamodels/datamodelKnowledge.py` (`FileContentIndex.connectionId` + `sourceKind`), `gateway/modules/connectors/providerMsft/connectorMsft.py` (`@odata.nextLink`-loop in `SharepointAdapter.browse`, `eTag` in `_graphItemToExternalEntry`), `gateway/modules/routes/routeSecurityMsft.py` / `routeSecurityGoogle.py` / `routeSecurityClickup.py` / `routeDataConnections.py` (callback emission + `ConnectionStatus.REVOKED` fix), `gateway/app.py` (consumer registration in lifespan).
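
The `@odata.nextLink` loop mentioned for `SharepointAdapter.browse` follows the usual Microsoft Graph paging pattern: each page carries its items under `value` and, when more pages exist, a `@odata.nextLink` URL. A minimal sketch, assuming a caller-supplied `fetch_page` callable (hypothetical, standing in for the authenticated HTTP call) and an invented function name:

```python
from typing import Callable, Iterator, Optional


def iter_paged_items(first_url: str, fetch_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every item across all pages, following @odata.nextLink until it is absent."""
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        # Graph collection responses list their items under "value".
        yield from page.get("value", [])
        # A missing nextLink means this was the last page.
        url = page.get("@odata.nextLink")
```

Keeping the HTTP call behind `fetch_page` makes the loop easy to exercise with a dictionary of canned pages in a unit test.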

## Acceptance criteria (plan level)

@@ -449,6 +449,9 @@ All events should keep field naming consistent with the existing `ingestion.queu
| T7 | Do multi-page PDFs keep their per-page chunks (no `MergeStrategy` concatenation)? | Unit: `tests/unit/services/test_extraction_merge_strategy.py`. Live: 500-page PDF → 563 ContentObjects, 567 embedding chunks in 24 batches (verified 2026-04-21). |
| T8 | Do `_ingestion.hash` and `status="indexed"` survive a pre-scan re-upsert in `_autoIndexFile`? | Review `routeDataFiles._autoIndexFile` around line 127: the existing row is read before the upsert, and `_ingestion` + `indexed` are merged into the fresh `contentIndex`. Live: second trigger → `ingestion.skipped.duplicate` instead of re-embedding. |
| T9 | Does a `connection.revoked` event clean up **all** `FileContentIndex` rows + `ContentChunk`s of a connection and **nothing else** (uploads without `connectionId` and other connections stay intact)? | Unit: `tests/unit/services/test_connection_purge.py` (3 cases: positive purge, empty connectionId no-op, unknown connectionId). |
| T10 | Does the `KnowledgeIngestionConsumer` correctly dispatch `connection.established` as an asynchronous `connection.bootstrap` job (Microsoft fan-out: SharePoint + Outlook in parallel; all other authorities `skipped.reason="P1_pilot_scope"`) and `connection.revoked` synchronously as a purge? | Unit: `tests/unit/services/test_knowledge_ingest_consumer.py` (6 cases: established enqueue, missing-id ignore, revoked purge, missing-id ignore, bootstrap-skip-non-msft, bootstrap-msft-fan-out). |
| T10 | Does the `KnowledgeIngestionConsumer` correctly dispatch `connection.established` as an asynchronous `connection.bootstrap` job (msft → SharePoint + Outlook in parallel; google → Drive + Gmail in parallel; clickup → Tasks; unknown authorities `skipped.reason="unsupported_authority"`) and `connection.revoked` synchronously as a purge? | Unit: `tests/unit/services/test_knowledge_ingest_consumer.py` (8 cases: established enqueue, missing-id ignore, revoked purge, missing-id ignore, skip-unsupported, msft fan-out, google fan-out, clickup dispatch). |
| T11 | Does `cleanEmailBody` reduce realistic Outlook HTML to the message's own body portion (HTML strip, quote strip EN+DE, signature strip, whitespace collapse, `maxChars` truncate)? | Unit: `tests/unit/services/test_clean_email_body.py` (8 cases). Consequence: `bootstrapOutlook` never sends HTML, quoted replies, or signatures into the embedding pipeline step. |
| T12 | Are the bootstrap walkers for SharePoint and Outlook idempotent against a second run with unchanged `eTag` / `changeKey`? | Unit: `tests/unit/services/test_bootstrap_sharepoint.py` + `tests/unit/services/test_bootstrap_outlook.py`. Mock adapters return stable revisions; the KnowledgeService fake reports `duplicate`, and the result object accounts for it in `skippedDuplicate`. |
| T13 | Does `bootstrapGmail` walk `INBOX + SENT`, parse MIME bodies (preferring `text/plain`, falling back to `text/html`), follow `nextPageToken` pagination, and stay idempotent against identical `historyId` revisions? | Unit: `tests/unit/services/test_bootstrap_gmail.py` (6 cases: header/snippet/body content-objects, MIME plain-vs-html preference, HTML fallback, multi-label fan-out, `nextPageToken` pagination, duplicate accounting). |
| T14 | Does `bootstrapGdrive` walk My Drive recursively (folder MIME detection, `maxDepth`), respect the `maxAgeDays` recency filter, and stay idempotent against identical `modifiedTime` revisions? | Unit: `tests/unit/services/test_bootstrap_gdrive.py` (4 cases: site/subfolder walk, duplicate accounting, recency-skip via `skippedPolicy`, provenance carries `authority="google"` + `service="drive"`). |
| T15 | Does `bootstrapClickup` walk workspaces → spaces → folder/folderless lists → tasks under the `maxWorkspaces` / `maxListsPerWorkspace` / `maxTasks` caps, respect the `maxAgeDays` recency filter, and stay idempotent against identical `date_updated` revisions? | Unit: `tests/unit/services/test_bootstrap_clickup.py` (4 cases: hierarchy walk indexes 4 tasks across 2 lists, duplicate accounting, recency-skip via `skippedPolicy`, `maxTasks` cap). |
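
The cleaning steps named in T11 (HTML strip, quote strip EN+DE, signature strip, whitespace collapse, truncate) can be sketched as below. This is a simplified illustration under stated assumptions, not the real `cleanEmailBody`: the function name, the marker patterns, and the `max_chars` default are invented here, and the quote and signature heuristics are deliberately crude.

```python
import html
import re

# Heuristic reply markers (English + German); assumptions for this sketch only.
_QUOTE_MARKERS = re.compile(
    r"^(On .+ wrote:|Am .+ schrieb.*:"
    r"|-{2,}\s*(Original Message|Ursprüngliche Nachricht)\s*-{2,}"
    r"|From:\s.+|Von:\s.+)$",
    re.IGNORECASE,
)
# Conventional signature delimiter: a line consisting of "--".
_SIGNATURE_DELIMITER = re.compile(r"^--\s*$")


def clean_email_body_sketch(raw: str, max_chars: int = 4000) -> str:
    # 1. HTML strip: drop script/style blocks, turn block ends into newlines, remove tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", raw)
    text = re.sub(r"(?i)<br\s*/?>|</(p|div|tr|li)>", "\n", text)
    text = re.sub(r"<[^>]+>", "", text)
    text = html.unescape(text)

    # 2 + 3. Cut at the first quote marker or signature delimiter; skip ">" quoted lines.
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if _QUOTE_MARKERS.match(stripped) or _SIGNATURE_DELIMITER.match(stripped):
            break
        if stripped.startswith(">"):
            continue
        kept.append(stripped)

    # 4. Whitespace collapse: runs of spaces/tabs become one space, blank lines vanish.
    collapsed = re.sub(r"[ \t]+", " ", "\n".join(kept))
    collapsed = re.sub(r"\n{2,}", "\n", collapsed).strip()

    # 5. Truncate to the character budget.
    return collapsed[:max_chars]
```

Cutting at the first marker rather than trying to parse the whole thread keeps the sketch short; a production cleaner would likely need more patterns and a real HTML parser.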