diff --git a/c-work/1-plan/2026-04-formgenerator-grouping.md b/c-work/1-plan/2026-04-formgenerator-grouping.md
new file mode 100644
index 0000000..dd88dee
--- /dev/null
+++ b/c-work/1-plan/2026-04-formgenerator-grouping.md
@@ -0,0 +1,460 @@
+
+
+
+
+# FormGenerator: Persistent User Grouping
+
+## Description and Context
+
+`FormGeneratorTable` is used on many pages of the platform. Users should be able to organize entries into named, recursive groups, with persistent storage and full compatibility with pagination, search, filters, and all action buttons.
+
+**Core principle: grouping is a built-in feature of `PaginationParams` and `PaginatedResponse`: no separate call, no dedicated route, no dedicated API module. The existing `refetch()` mechanism is the only transport.**
+
+---
+
+## Architecture Core: How It Works
+
+### Grouping rides on the existing pagination call
+
+`PaginationParams` (the JSON parameter of every list endpoint) gets two new optional fields:
+
+```
+saveGroupTree → if set: the backend stores this tree BEFORE the fetch
+groupId       → if set: the backend restricts the result to the items of this group
+```
+
+`PaginatedResponse` gets one new optional field:
+
+```
+groupTree → the user's current group tree for this endpoint (always included)
+```
+
+**A single call therefore does three things at once:**
+1. Stores the new group tree (if `saveGroupTree` is set)
+2. Filters to one group (if `groupId` is set)
+3. Returns the current items and the current group tree
+
+### End-to-End Flow
+
+```
+Page load (first load):
+  GET /api/connections/?pagination={"page":1,"pageSize":20}
+  ← { items: [...], pagination: {...}, groupTree: [{id, name, itemIds, subGroups}] }
+
+User creates a group (visible locally at once, then debounced save via refetch):
+  GET /api/connections/?pagination={"page":1,"pageSize":20,"saveGroupTree":[{newTree}]}
+  ← { items: [...], pagination: {...}, groupTree: [{newTree, confirmed by the backend}] }
+
+User enters group "Kunden" (id: "g1"):
+  GET /api/connections/?pagination={"page":1,"pageSize":20,"groupId":"g1"}
+  ← { items: [only items of the group], pagination: {totalItems: 3, ...}, groupTree: [...] }
+  → Search, filter, sorting, mode=ids, mode=filterValues: everything works
+    within the group scope because the backend knows the IN list
+```
+
+### Backend: Per Route, Exactly 2 Lines of Overhead
+
+The entire grouping mechanism is implemented as a shared helper in `routeHelpers.py`. Every route that should support grouping calls it:
+
+```python
+# Start of the route function (BEFORE items are built):
+groupCtx = handleGroupingInRequest(paginationParams, interface, "connections")
+# → stores saveGroupTree if present
+# → resolves groupId into groupCtx.itemIds if groupId is set
+
+# Build items (unchanged)...
+
+# If a group scope is active: restrict items to the group
+items = applyGroupScopeFilter(items, groupCtx.itemIds)
+
+# At the end: embed groupTree in the response
+return {**result, "groupTree": groupCtx.groupTree}
+```
+
+**No new route file. No new interface file. No new URL.**
+
+---
+
+## Affected Modules
+
+- **Gateway:**
+  - `modules/datamodels/datamodelPagination.py`: `PaginationParams` + `groupId`, `saveGroupTree`; `PaginatedResponse` + `groupTree`; new classes `TableGroupNode` + `TableGrouping` in **this file**
+  - `modules/interfaces/interfaceDbApp.py`: extend `AppObjects` with `getTableGrouping(contextKey)` + `upsertTableGrouping(contextKey, rootGroups)`; new table `table_groupings` in `poweron_app` (auto-created)
+  - `modules/routes/routeHelpers.py`: add `handleGroupingInRequest(paginationParams, interface, contextKey)` + `applyGroupScopeFilter(items, itemIds)`
+  - Every list route that should support grouping: **2 lines** at the start + **1 field** in the response (`groupTree`)
+
+- **Frontend:**
+  - `FormGeneratorTable.tsx`: `groupingConfig` prop; internal grouping state; uses `hookData.refetch()` as the only transport
+  - `FormGeneratorControls.tsx`: group toolbar button
+  - `FormGenerator/GroupingManager/GroupRow.tsx`: group header row (new component)
+  - `FormGenerator/GroupingManager/GroupingManager.tsx`: side panel (new component)
+  - **No new hook, no new API module, no changes to existing feature hooks**
+
+- **DB migration:** No (auto-create via DatabaseConnector)
+
+---
+
+## Data Model
+
+### Additions in `datamodelPagination.py`
+
+```python
+# --- Grouping models (new, in the same file) ---
+
+class TableGroupNode(BaseModel):
+    id: str
+    name: str
+    itemIds: List[str] = Field(default_factory=list)
+    subGroups: List['TableGroupNode'] = Field(default_factory=list)
+    order: int = 0
+    isExpanded: bool = True
+
+TableGroupNode.model_rebuild()
+
+class TableGrouping(BaseModel):
+    """DB table table_groupings in poweron_app."""
+    id: str
+    userId: str
+    contextKey: str  # derived from the route prefix, e.g. "connections", "prompts", "admin/users"
+    rootGroups: List[TableGroupNode] = Field(default_factory=list)
+    updatedAt: Optional[float] = None
+
+
+# --- Extension of PaginationParams (2 new optional fields) ---
+
+class PaginationParams(BaseModel):
+    page: int = Field(ge=1)
+    pageSize: int = Field(ge=1, le=1000)
+    sort: List[SortField] = Field(default_factory=list)
+    filters: Optional[Dict[str, Any]] = None
+    # NEW:
+    groupId: Optional[str] = None  # Scope: only items of this group
+    saveGroupTree: Optional[List[Dict[str, Any]]] = None  # Persist: store this tree
+
+
+# --- Extension of PaginatedResponse (1 new optional field) ---
+
+class PaginatedResponse(BaseModel, Generic[T]):
+    items: List[T]
+    pagination: Optional[PaginationMetadata]
+    groupTree: Optional[List[TableGroupNode]] = None  # NEW: always included when present
+
+    model_config = ConfigDict(arbitrary_types_allowed=True)
+```
+
+---
+
+## Backend Implementation
+
+### `routeHelpers.py`: new shared helper
+
+```python
+from dataclasses import dataclass
+from typing import Optional, Set
+
+@dataclass
+class GroupingContext:
+    groupTree: Optional[list]  # For the response
+    itemIds: Optional[Set[str]]  # If groupId is set: the IN-filter set
+
+
+def handleGroupingInRequest(
+    paginationParams: Optional[PaginationParams],
+    interface,  # AppObjects
+    contextKey: str,
+) -> GroupingContext:
+    """
+    Central grouping handler, called at the start of every list route.
+
+    1. If paginationParams.saveGroupTree is set:
+       → interface.upsertTableGrouping(contextKey, saveGroupTree)
+       → remove saveGroupTree from params (not processed any further)
+
+    2. If paginationParams.groupId is set:
+       → find the group in the stored tree (recursively, including subgroups)
+       → return the itemIds of the group (+ all subgroups) as a set
+       → remove groupId from params (not processed as a regular filter)
+
+    3. Load the current groupTree and provide it for the response.
+
+    Returns: GroupingContext(groupTree, itemIds)
+    """
+
+
+def applyGroupScopeFilter(items: list, itemIds: Optional[Set[str]]) -> list:
+    """
+    Applies the group scope filter.
+    Returns items unchanged if itemIds is None (no scope active).
+    Otherwise keeps only items where item["id"] is in itemIds.
+    """
+    if itemIds is None:
+        return items
+    return [item for item in items if str(item.get("id", "")) in itemIds]
+```
+
+### Route Extension: Pattern (2 + 1 Lines)
+
+```python
+@router.get("/")
+async def get_connections(request, pagination=None, mode=None, column=None, currentUser=Depends(getCurrentUser)):
+    from modules.routes.routeHelpers import handleGroupingInRequest, applyGroupScopeFilter
+
+    interface = getInterface(currentUser)
+    CONTEXT_KEY = "connections"
+
+    # paginationParams: the PaginationParams object parsed from `pagination`
+    # by the route's existing code (parsing omitted in this pattern)
+
+    # 1. Process grouping (store if needed, resolve scope)
+    groupCtx = handleGroupingInRequest(paginationParams, interface, CONTEXT_KEY)
+
+    # mode=filterValues / mode=ids (unchanged, but groupId has already been removed from params)
+    if mode == "filterValues": ...
+    if mode == "ids":
+        items = _buildEnhancedItems()
+        items = applyGroupScopeFilter(items, groupCtx.itemIds)  # Scope applies to ids too!
+        return handleIdsInMemory(items, pagination)
+
+    # Build items (unchanged)
+    items = _buildEnhancedItems()
+
+    # 2. Apply the group scope filter
+    items = applyGroupScopeFilter(items, groupCtx.itemIds)
+
+    # Pagination (unchanged)
+    result = paginateInMemory(items, paginationParams)
+
+    # 3. Embed groupTree in the response
+    return {**result.model_dump(), "groupTree": groupCtx.groupTree}
+```
+
+**`mode=filterValues` and `mode=ids` work correctly in the group scope automatically** because `groupId` has been removed from `paginationParams` and `applyGroupScopeFilter` is called, so filter dropdowns and "Select All Filtered" refer to the items of the current group.
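
Step 2 of `handleGroupingInRequest` (resolving a `groupId` into the set of item IDs of the group and all of its subgroups) is only described in the docstring. A minimal sketch of that resolution, assuming the stored tree is a list of plain dicts shaped like `TableGroupNode`; the helper names `findGroup`, `collectItemIds`, and `resolveGroupScope` are hypothetical, and returning an empty set for an unknown `groupId` is an assumption rather than a decision taken in this plan:

```python
from typing import Any, Dict, List, Optional, Set


def findGroup(nodes: List[Dict[str, Any]], groupId: str) -> Optional[Dict[str, Any]]:
    """Depth-first search for the node with the given id (hypothetical helper)."""
    for node in nodes:
        if node.get("id") == groupId:
            return node
        found = findGroup(node.get("subGroups", []), groupId)
        if found is not None:
            return found
    return None


def collectItemIds(node: Dict[str, Any]) -> Set[str]:
    """Item ids of the node itself plus those of all nested subgroups."""
    ids = {str(itemId) for itemId in node.get("itemIds", [])}
    for sub in node.get("subGroups", []):
        ids |= collectItemIds(sub)
    return ids


def resolveGroupScope(tree: List[Dict[str, Any]], groupId: Optional[str]) -> Optional[Set[str]]:
    """None means "no scope active"; an unknown groupId yields an empty set (assumption)."""
    if groupId is None:
        return None
    node = findGroup(tree, groupId)
    return collectItemIds(node) if node is not None else set()


# Example tree with one nested subgroup
exampleTree = [
    {"id": "g1", "name": "Kunden", "itemIds": ["a", "b"], "subGroups": [
        {"id": "g2", "name": "Aktive", "itemIds": ["c"], "subGroups": []},
    ]},
    {"id": "g3", "name": "Intern", "itemIds": ["d"], "subGroups": []},
]
scope = resolveGroupScope(exampleTree, "g1")
```

The resulting set is what `applyGroupScopeFilter` expects as `itemIds`. Keeping `None` distinct from the empty set matters: `None` leaves the list untouched, while an empty set filters everything out.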
+
+---
+
+## Frontend Implementation
+
+### `FormGeneratorTable`: internal grouping state
+
+```typescript
+// New prop:
+groupingConfig?: {
+  contextKey: string; // Only for RBAC/logging; the actual key is derived server-side from the endpoint
+  enabled: boolean;
+}
+
+// Internal state (only in FormGeneratorTable, not in the hook):
+const [groupTree, setGroupTree] = useState<TableGroupNode[]>([]);
+const [activeGroupId, setActiveGroupId] = useState<string | null>(null);
+const [pendingGroupTree, setPendingGroupTree] = useState<TableGroupNode[] | null>(null);
+```
+
+### Data Flow: `hookData.refetch()` as the Only Transport
+
+```typescript
+// groupTree comes from the regular refetch response:
+useEffect(() => {
+  if (hookData?.pagination?.groupTree) {
+    setGroupTree(hookData.pagination.groupTree);
+    setPendingGroupTree(null); // Saved: clear pending
+  }
+}, [hookData]);
+
+// When entering a group scope:
+const _enterGroup = (groupId: string) => {
+  setActiveGroupId(groupId);
+  hookData.refetch({ ...currentPaginationParams, page: 1, groupId });
+};
+
+// When leaving the scope:
+const _exitGroup = () => {
+  setActiveGroupId(null);
+  hookData.refetch({ ...currentPaginationParams, page: 1, groupId: undefined });
+};
+
+// On group mutation (create, rename, delete, assign item):
+const _mutateGroupTree = (newTree: TableGroupNode[]) => {
+  setGroupTree(newTree); // Visible locally at once (optimistic)
+  setPendingGroupTree(newTree); // Marked for the next save
+  _debouncedSave(newTree); // Debounced: save via refetch after 500 ms
+};
+
+// Debounced save: regular refetch + saveGroupTree in the pagination param
+const _debouncedSave = useMemo(() => debounce((tree: TableGroupNode[]) => {
+  hookData.refetch({
+    ...currentPaginationParams,
+    saveGroupTree: tree, // Backend stores and confirms
+  });
+}, 500), [hookData, currentPaginationParams]);
+```
+
+**No new API module. No separate fetch call. `hookData.refetch()` is all there is.**
+
+### Render Structure
+
+```
+Root view (no activeGroupId):
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+▼ Kunden (3 Items) [Delete all] [Download all] [+ Subgroup] [Rename] [×]
+  ▶ Aktive (1) [Delete all] [+ Subgroup] [Rename] [×]
+  — Item A [Edit] [Delete] [Download]
+  — Item B [Edit] [Delete] [Download]
+  — Item C (Aktive) [Edit] [Delete] [Download]
+▶ Intern [5 Items — open group →]
+── Not assigned (2)
+  — Item X [Edit] [Delete] [Download]
+
+Group scope (activeGroupId = "Intern"):
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+← Back | Intern (page 1/1 · 5 entries) [Delete all] [Download all]
+  — Item C [Edit] [Delete] [Download]
+  — Item D [Edit] [Delete] [Download]
+  ...
+```
+
+When `activeGroupId` is set, the refetch runs with `groupId` → the backend filters → pagination, search, filter, `mode=ids`: everything is limited to the group.
+
+---
+
+## Pagination, Search, Filter: Fully Correct
+
+| Scenario | What happens |
+|----------|-------------|
+| Root view, paging | Regular refetch. `groupTree` comes along in the response. Group counters from `groupNode.itemIds.length`. |
+| Group scope active | `groupId` in PaginationParams → backend IN filter → total count comes from the backend (correct). |
+| Search in scope | `groupId` + `filters.search` → backend filters the group's items by search text. |
+| "Select All Filtered" (mode=ids) | `groupId` in params → `applyGroupScopeFilter` is applied BEFORE `handleIdsInMemory` → only IDs of the group are returned. |
+| Filter dropdown (mode=filterValues) | `groupId` in params → `applyGroupScopeFilter` before `handleFilterValuesInMemory` → distinct values come only from group items. |
+| Saving the group tree while paging | `saveGroupTree` + page params in the same call → backend stores the tree AND returns the current page. |
+
+---
+
+## Decisions
+
+| Date | Decision | Rationale |
+|-------|-------------|------------|
+| 2026-04-29 | `saveGroupTree` + `groupId` in `PaginationParams` instead of a dedicated endpoint | One call, one transport, no second API path; grouping is an integral part of the data query |
+| 2026-04-29 | `groupTree` in `PaginatedResponse` | Always in sync with the current items; no separate load call needed |
+| 2026-04-29 | Shared helper in `routeHelpers.py`, 2 lines per route | DRY; all complexity in one place; no route-specific logic |
+| 2026-04-29 | Optimistic UI + debounced save via the regular refetch | Immediate UX; no flicker; persistence without an extra call |
+| 2026-04-29 | Models in `datamodelPagination.py` | `PaginatedResponse` lives there; no new file; import graph stays minimal |
+| 2026-04-29 | `applyGroupScopeFilter` also for `mode=ids` and `mode=filterValues` | Filter dropdown and bulk select work correctly in the group scope without special handling |
+| 2026-04-29 | Group state only in `FormGeneratorTable`, not in the feature hook | No changes to existing hooks needed; grouping is transparent to page code |
+
+---
+
+## Implementation Checklist
+
+### Phase 1: Backend Core
+
+- [ ] `datamodelPagination.py`: add the `TableGroupNode` and `TableGrouping` classes
+- [ ] `datamodelPagination.py`: extend `PaginationParams` with `groupId` + `saveGroupTree`
+- [ ] `datamodelPagination.py`: extend `PaginatedResponse` with `groupTree`
+- [ ] `datamodelPagination.py`: extend `normalize_pagination_dict` so that `saveGroupTree` and `groupId` are parsed correctly
+- [ ] `interfaceDbApp.py`: add `getTableGrouping(contextKey)` + `upsertTableGrouping(contextKey, rootGroups)` to `AppObjects`
+- [ ] `routeHelpers.py`: implement the `GroupingContext` dataclass + `handleGroupingInRequest()` + `applyGroupScopeFilter()`
+
+### Phase 2: Route Extensions
+
+Per route: `handleGroupingInRequest` at the start + `applyGroupScopeFilter` before pagination + `groupTree` in the response
+
+- [ ] `routeDataConnections.py` (incl. `mode=ids`, `mode=filterValues`)
+- [ ] `routeDataPrompts.py`
+- [ ] `routeDataUsers.py`
+- [ ] `routeDataMandates.py`
+- [ ] `routeDataFiles.py`
+- [ ] Others as needed (Trustee, RealEstate, Invitations, …)
+
+### Phase 3: Frontend Types + FormGeneratorTable Skeleton
+
+- [ ] TypeScript types `TableGroupNode` directly in `FormGeneratorTable.tsx` or `src/types/tableGrouping.ts`
+- [ ] Add the `groupingConfig` prop to `FormGeneratorTableProps`
+- [ ] Parse `groupTree` from the `hookData` response and keep it in internal state
+- [ ] `activeGroupId` state + `_enterGroup` / `_exitGroup`
+- [ ] `pendingGroupTree` state + `_mutateGroupTree` + debounced save via `hookData.refetch()`
+- [ ] Extend the frontend `PaginatedResponse` type with `groupTree`
+
+### Phase 4: Render Logic
+
+- [ ] `GroupRow` component: `src/components/FormGenerator/GroupingManager/GroupRow.tsx`
+- [ ] Render algorithm: root view with group header rows and DataRows; "Not assigned" section
+- [ ] Group scope view: breadcrumb, "Back" button, regular table layout
+- [ ] Dimmed groups without visible items (root view) with an "Open group" button
+- [ ] Expand/collapse per group (local; isExpanded comes from `groupTree`)
+
+### Phase 5: Actions and Interaction
+
+- [ ] `actionButtons` + `customActions` at `GroupRow` level (batch on all items via `mode=ids` in the scope)
+- [ ] Delete group with `useConfirm`: "Group only" vs. "Group + delete all items"
+- [ ] "Move to group" as a BatchAction for multi-select
+- [ ] Context menu button per DataRow → group dropdown
+- [ ] Drag and drop DataRow → GroupRow
+
+### Phase 6: GroupingManager Panel
+
+- [ ] `GroupingManager` component: group tree panel
+- [ ] "Groups" button in `FormGeneratorControls` (only if `groupingConfig.enabled`)
+- [ ] Create new group (`usePrompt`)
+- [ ] Rename group (inline)
+- [ ] Create subgroup
+- [ ] Delete group (`useConfirm`)
+- [ ] Order up/down
+
+### Wrap-Up
+
+- [ ] i18n: all new UI texts tagged with `t('...')`
+- [ ] CSS Modules for all new components
+- [ ] Update `b-reference/frontend-nyla/formgenerator.md`
+- [ ] `b-reference/gateway/architecture.md`: document the `PaginationParams`/`PaginatedResponse` extension
+
+---
+
+## Acceptance Criteria
+
+| # | Criterion (Given-When-Then) | Prio |
+|---|---------------------------|------|
+| 1 | Given `FormGeneratorTable` with `groupingConfig.enabled`, when the page loads, then `groupTree` arrives in the regular list response; **no second API call** | must |
+| 2 | Given the user creates a group, when 500 ms have passed since the last change, then a single `refetch` with `saveGroupTree` is sent; after a reload the group exists | must |
+| 3 | Given the user clicks group "Kunden", when `_enterGroup` is called, then a refetch with `groupId` runs; the backend filters; pagination and total count refer to the group | must |
+| 4 | Given an active group scope, when the user searches for "Test", then `groupId` + search go in one call; the backend returns only hits within the group | must |
+| 5 | Given an active group scope, when the user clicks "Select all" (mode=ids), then the IDs come only from the group, not from the full list | must |
+| 6 | Given a filter dropdown opened in the group scope, when the values are loaded, then they come only from items of the group (mode=filterValues correct) | should |
+| 7 | Given `FormGeneratorTable` without `groupingConfig`, then behavior is identical to before the feature | must |
+| 8 | Given a group with 5 items, when the user triggers Delete on the group header, then after confirmation all 5 items are deleted; the group is removed from the tree; one `refetch` with the current `saveGroupTree` | must |
+| 9 | Given the user drags a DataRow onto a group header, then the item is assigned to the group; `_mutateGroupTree` + debounced save | should |
+
+---
+
+## Test Plan
+
+| ID | AC | Type | Automated | Repo path | Status |
+|----|----|----|--------------|-----------|--------|
+| T1 | 1, 2 | api | yes | `gateway/tests/test_grouping_helpers.py` | pending |
+| T2 | 3, 4, 5, 6 | api | yes | `gateway/tests/test_grouping_helpers.py` | pending |
+| T3 | 7 | component | no | manual | pending |
+| T4 | 8 | api + component | no | manual | pending |
+| T5 | 9 | component | no | manual | pending |
+
+---
+
+## Open Questions
+
+1. Should the `TableGrouping` model already prepare **`scope: 'user' | 'mandate'`** for later mandate-wide sharing?
+2. Should the **CSV export** contain a group column?
+3. **`activeGroupId` in URL state** (`?group=g1`) for deep links?
+4. **Max depth** configurable (`groupingConfig.maxDepth`) or a fixed warning after 3 levels?
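
For open question 4, either option needs the nesting depth of a group tree. A minimal sketch, assuming the tree is a list of dicts shaped like `TableGroupNode`; `treeDepth` and `exceedsMaxDepth` are hypothetical names, and the default of 3 levels simply mirrors the number mentioned in the question:

```python
from typing import Any, Dict, List


def treeDepth(nodes: List[Dict[str, Any]]) -> int:
    """Number of nesting levels: 0 for an empty list, 1 for flat groups, and so on."""
    if not nodes:
        return 0
    return 1 + max(treeDepth(node.get("subGroups", [])) for node in nodes)


def exceedsMaxDepth(nodes: List[Dict[str, Any]], maxDepth: int = 3) -> bool:
    """True if the tree is nested deeper than maxDepth levels."""
    return treeDepth(nodes) > maxDepth


# Example: two levels of nesting
sampleTree = [
    {"id": "g1", "subGroups": [{"id": "g2", "subGroups": []}]},
    {"id": "g3", "subGroups": []},
]
depth = treeDepth(sampleTree)
```

The same check could run before a `saveGroupTree` is persisted or before the UI offers "create subgroup"; which side enforces it is part of the open question.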
+
+---
+
+## Links
+
+- PR: —
+- Reference FormGenerator: `b-reference/frontend-nyla/formgenerator.md`
+- Reference gateway architecture: `b-reference/gateway/architecture.md`
+- Reference DB architecture: `b-reference/platform/database-architecture.md`
+
+---
+
+## Wrap-Up
+
+- [ ] `b-reference/frontend-nyla/formgenerator.md`: grouping section
+- [ ] `b-reference/gateway/architecture.md`: `PaginationParams`/`PaginatedResponse` extension
+- [ ] TOPICS.md checked
+- [ ] This document moved to `z-archive/`
diff --git a/c-work/4-done/2026-04-id-unified-knowledge-indexing-rag-concept.md b/c-work/4-done/2026-04-id-unified-knowledge-indexing-rag-concept.md
index 138c88f..ecfea24 100644
--- a/c-work/4-done/2026-04-id-unified-knowledge-indexing-rag-concept.md
+++ b/c-work/4-done/2026-04-id-unified-knowledge-indexing-rag-concept.md
@@ -1,6 +1,6 @@ - + # Unified Knowledge Indexing — One RAG Corpus for All Platform Information @@ -15,7 +15,7 @@ | **Teil 3** | **Feature injection** split into **retrieval** (agent + `buildAgentContext`) vs **corpus** (`indexFile`); **matrix** per `modules/features/*` product; real **gaps** vs false “non-injection”. | | **Implementation phases · Ziele · AC · Testplan** | Rollout, explicit non-goals, acceptance criteria, verification. | -**Single sentence summary:** Keep **retrieval** on **`AgentService`**; unify **when and how** the shared **`interfaceDbKnowledge`** corpus is **filled** (routes, **user connections** / integrations, features, snapshots) behind one **ingestion contract**, without assuming every product uses the workspace agent. +**Single sentence summary:** Keep **retrieval** on **`AgentService`**; unify **when and how** the shared **`interfaceDbKnowledge`** corpus is **filled** (routes, **user connections**, **feature commit points**) behind one **ingestion contract**. 
**Current roadmap scope:** user-connection lifecycle (**P1a/P1b**), **daily refresh** to close the post-connect delta gap (**P1c**), **explicit user consent + per-connection ingestion preferences** (incl. optional **neutralization**) in **frontend + API**, then **scalable event bus** (**P3**). **Out of current roadmap:** standalone **profile/mandate snapshot** ingestion (former roadmap **P2** — content remains in Teil 2.3 as future option only). ## Beschreibung und Kontext @@ -150,7 +150,7 @@ The first end-to-end AC4 test on a 500-page PDF revealed **three** independent b 2. **Pre-upserts must preserve `_ingestion` metadata and the `indexed` status.** `routeDataFiles._autoIndexFile` persisted a fresh `FileContentIndex` from the pre-scan **before** calling `requestIngestion`, overwriting `structure._ingestion.hash` and `status="indexed"` from any prior successful run. The duplicate check saw a row with empty metadata and re-ran the whole embedding stage. **Rule:** any upsert on the idempotency row taken outside `requestIngestion` MUST read the existing row first and merge forward both `_ingestion` and (where applicable) the terminal `indexed` status. 3. **Extraction-pipeline defaults must preserve granularity for RAG.** `ExtractionOptions.mergeStrategy` defaulted to concatenating every text `ContentPart` into one blob, collapsing a 500-page PDF into a single chunk whose embedding is a blurred average of the whole document — unusable for targeted retrieval. **Rule:** every ingestion lane passes `mergeStrategy=None` explicitly until the default itself can be safely flipped after auditing non-RAG callers. (Tests: `tests/unit/services/test_extraction_merge_strategy.py`.) 
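
Rule 2 above (read the existing row first, then merge `_ingestion` and the terminal `indexed` status forward) could be sketched as follows. This is a minimal sketch under stated assumptions: index rows are modeled as plain dicts with `status` and `structure` keys, and `mergeForward` is a hypothetical helper name, not the platform's actual `FileContentIndex` API.

```python
from typing import Any, Dict, Optional


def mergeForward(existing: Optional[Dict[str, Any]], fresh: Dict[str, Any]) -> Dict[str, Any]:
    """Build the row to upsert without losing idempotency data from a prior run."""
    merged = dict(fresh)  # shallow copy so the caller's dict is not mutated
    if existing is None:
        return merged

    # Carry the previous _ingestion metadata (e.g. the content hash) forward
    # unless the fresh row already brings its own.
    oldIngestion = (existing.get("structure") or {}).get("_ingestion")
    if oldIngestion is not None:
        structure = dict(merged.get("structure") or {})
        structure.setdefault("_ingestion", oldIngestion)
        merged["structure"] = structure

    # Never downgrade a terminal "indexed" status from a pre-scan upsert.
    if existing.get("status") == "indexed":
        merged["status"] = "indexed"
    return merged


# Example: a pre-scan row arrives after a successful earlier run
existingRow = {"status": "indexed", "structure": {"_ingestion": {"hash": "abc"}, "pages": 500}}
freshRow = {"status": "extracted", "structure": {"pages": 500}}
mergedRow = mergeForward(existingRow, freshRow)
```

With the merged row persisted, the later duplicate check still sees the stored hash and the `indexed` status, so an unchanged file does not re-run the embedding stage.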
-**Deferred to P1** (uncovered during P0, not blocking AC1–AC5): +**Deferred (ingestion idempotency hardening)** (uncovered during P0, not blocking AC1–AC5; naming here is **not** the same milestone as **P1 user-connection hooks** below): - **In-flight duplicate detection.** The current duplicate check only matches when `status == "indexed"`, so two nearly-simultaneous calls for the same `sourceId` both run full embedding. Fix candidates: accept `status ∈ {"extracted", "embedding", "indexed"}` with matching hash as "already in progress", or a per-`sourceId` `asyncio.Lock` in `KnowledgeService`. - **Pre-extraction byte-hash shortcut.** `requestIngestion`'s duplicate check runs **after** extraction, so re-indexing a 1.6 MB PDF still spends ~15 s in `runExtraction` before the content hash is computed. The file-bytes SHA already exists in `interfaceDbManagement` for upload-dedup — a short-circuit in `_autoIndexFile` (and symmetric paths) could skip extraction entirely for an unchanged file. @@ -231,12 +231,14 @@ The first end-to-end AC4 test on a 500-page PDF revealed **three** independent b **Email and messaging (Outlook + Gmail via Microsoft / Google user connections) — shared cautions** -- Default tiers: **metadata only** → **snippet** → **full body** → **attachments** (most expensive / sensitive). +- Default tiers: **metadata only** → **snippet** → **full body** → **attachments** (most expensive / sensitive). **Product default** vs **user override** is defined in **§2.6** (per-connection mail depth + attachments). - Apply **quoted-thread stripping**, **signature removal**, and **max body length** before embed. - **Legal hold / retention:** ingestion must respect mandate **delete** and **export** rules; **disconnecting** or **revoking** the mail **connection** must **purge** mail-sourced chunks. ### 2.3 “Account and stuff” — what to index vs. 
what never to index +**Roadmap note:** Standalone **profile/mandate snapshot** ingestion (formerly roadmap **P2**) is **out of current scope**; the table below remains the **target model** when that work is picked up again. + **Goal:** Give agents **useful, permission-safe** context (“who is this user in this mandate”, “which features are on”, “preferred language”) without creating a **second copy of sensitive credentials** in the vector store. | Data | Typical treatment | @@ -256,6 +258,60 @@ Snapshots should be stored with the same **scope model** as file chunks (`person **Storage (already implemented — not redesigned here):** The platform already uses **one** knowledge persistence stack: **`FileContentIndex`** (incl. `mandateId`, `scope`, status) and **`ContentChunk`** (pgvector embeddings, `fileId`, `userId`, `featureInstanceId`, `contextRef`, optional **`chunkMetadata`**), accessed via **`interfaceDbKnowledge`**. Chunks are **file-anchored** today; **connection- / source-specific** provenance (e.g. `connectionId`, external ids) can ride in **`contextRef` / `chunkMetadata`** until optional schema extensions are justified. **This document targets ingestion triggers and lifecycles**, not a second corpus or a duplicate storage model. +### 2.5 Lifecycle gap and daily refresh (roadmap **P1c**, v1) + +**Gap:** After a successful connect, **bootstrap** runs once (initial fill). **New** mail, files, or tasks that arrive **after** that run are **not** indexed automatically until a **delta** path exists (webhook, `historyId` / `changes` cursors, etc. — see Teil **2.1** row *“Sync for an existing connection”*). + +**Pragmatic mitigation (deliberately simple):** A **daily scheduler** (e.g. once per night, staggered by tenant/load) re-invokes the same **bootstrap walkers** for every **active** `UserConnection` that has **knowledge ingestion enabled** (see **§2.6**). Idempotency + fast-path skips unchanged items; **new** and **changed** items are picked up. 
+ +- **Pros:** No new external dependencies (Pub/Sub, watch renewal) in v1; fits existing BackgroundJob + cron/feature-flag patterns. +- **Con:** Data can lag up to **~24 h** before it appears in RAG — acceptable for v1 product choice. +- **Later (without replacing P1c):** Add per-authority **delta APIs** (Gmail `users.history.list`, Drive `changes.list`, ClickUp tighter polling) to reduce latency and API cost. + +### 2.6 User consent, frontend flow, and per-connection preferences (incl. neutralization) + +**Goal:** The user **explicitly** chooses whether this connection may feed the **shared knowledge store** used for AI/RAG — and **how much**. Without consent, **no** knowledge bootstrap is started for that connection (OAuth may still unlock other product features; that split must be obvious in the UI). + +**Frontend (`frontend_nyla`):** extend the **add connection** flow (and later **connection settings**) with the dialog and controls below; persist choices via Gateway API **before** or **when** triggering knowledge ingestion. + +#### UX when adding a connection + +1. User starts OAuth as today. +2. **Before** or **immediately after** successful authorization: a **dialog** that clearly separates “establish connection” from “add to knowledge base”. +3. **No:** Connection remains usable for other features; either skip `KnowledgeIngestionConsumer.onConnectionEstablished` for the knowledge lane or persist `knowledgeIngestionEnabled=false` and never schedule walkers. +4. **Yes:** Show **advanced settings** (second step or accordion) per **settings catalog** below; persist **per `connectionId`** (or a dedicated preferences row); only then enqueue **bootstrap** (and later **P1c** refresh) with allowed surfaces and tiers. + +**Suggested copy (DE — pick one tone / A-B test):** + +- **Formal:** „Möchten Sie Inhalte aus dieser Verbindung in Ihre **Wissensdatenbank** übernehmen? 
KI-Funktionen können dann passender auf **Ihre** Dokumente und Nachrichten Bezug nehmen — **nur** mit Ihrer ausdrücklichen Zustimmung und in dem Umfang, den Sie festlegen.“ +- **Approachable:** „Sollen wir aus dieser Verbindung ausgewählte Inhalte sicher in Ihre **persönliche Wissensdatenbank** legen, damit die KI für Sie **besser helfen** kann? Sie entscheiden **was** und **wie stark anonymisiert** — und können das jederzeit in den Einstellungen ändern oder die Daten entfernen.“ + +Mirror in EN if the UI is bilingual. + +#### Minimum settings catalog (all **per connection** where technically applicable) + +| Layer | Setting | Meaning | +|--------|-----------|---------| +| **Master** | **Knowledge ingestion for this connection** | `off` / `on`: gates bootstrap + **§2.5** (P1c) refresh for the knowledge store. | +| **Protection** | **Neutralize / anonymize before embedding** | When `on`: apply the same (or stricter) **neutralization** pipeline as for uploads (`FileItem.neutralize` / platform rules) to connector-sourced text **before** chunking — names, e-mail addresses, phone-like patterns, IBAN-like patterns, per policy. User-facing label **„anonymisiert“** maps to this pipeline (not a cryptographic guarantee). | +| **Mail** (Outlook / Gmail) | **Content depth** | At least: **metadata only** (subject, participants, dates — no body) / **snippet** / **full cleaned body** (after `cleanEmailBody` and caps). | +| **Mail** | **Index attachments** | `off` / `on` (with size/type caps). | +| **Files** (Drive / SharePoint / OneDrive) | **Index binary files** | `off` / `on`; optional **MIME allowlist** (Office/PDF/text only) as a simplified UX preset. | +| **ClickUp** | **Scope** | `titles only` / `title + description` / `+ comments` / optional `attachments`. | +| **Microsoft** | **Parity** | Same dimensions where Graph surfaces mirror Google (mailbox / drive-like). 
| +| **General** | **Time window** | “Only index items from the last **N** days” (aligns with existing walker caps; slider with a sensible max). | +| **General** | **Help: what RAG is not** | Short explainer: not real-time mail; delay until next scheduled run (**§2.5**). | + +**Optional power-user toggles (same screen, collapsed):** per authority **which surfaces** ingest (e.g. **Google:** Gmail on/off, Drive on/off; **Microsoft:** SharePoint on/off, Outlook on/off — when product exposes both). Reduces accidental over-breadth without extra wizard steps. + +**Backend consequence:** Walkers read persisted preferences for `connectionId` each run and **filter** surfaces and payload tiers **before** `indexFile`. On preference change, product decision: trigger **re-sync**, or apply only to **new** items — document the chosen rule. + +#### Neutralization when the user opts in + +- **Ingestion on** + **neutralization on:** After content is obtained (virtual text or extraction output), apply the **neutralization stage** **before** chunking/embedding; **that** text is what gets embedded. +- **Neutralization off:** Still apply baseline **hygiene** where already defined (e.g. `cleanEmailBody` for quotes/signatures) — hygiene **≠** full PII removal. +- **Compliance copy:** If the user chooses **full body**, state clearly that **perfect** anonymization is not guaranteed without neutralization. + --- ## Teil 3 — Feature injection: retrieval vs corpus, agent loop, and real gaps @@ -338,11 +394,11 @@ Then add **`requestIngestion` / `indexFile`** at the **feature commit point** (o 3. **Unified façade** — one ingestion API; avoid a second embedding pipeline. 4. **Purge** — tie to **`fileId`**, business key, or future connector purge keys on revoke/delete. 
-### 3.7 Phasing +### 3.7 Phasing (feature matrix; **not** the same numbering as roadmap **P1c/P1d/P3** above) -- **P0:** For **each** row in §3.3, confirm **retrieval** vs **corpus** paths; document “satisfied by agent+upload+tools” vs “needs feature hook.” -- **P1:** Implement **feature-native corpus** for one domain with a clear §3.5 gap (e.g. **trustee** entity text, **teamsbot** persisted transcript). -- **P2:** **Chatbot** architecture decision: integrate **`serviceKnowledge`** or keep parallel retrieval; if integrate, add explicit **corpus** rules for config/FAQ. +- **FM0:** For **each** row in §3.3, confirm **retrieval** vs **corpus** paths; document “satisfied by agent+upload+tools” vs “needs feature hook.” +- **FM1:** Implement **feature-native corpus** for one domain with a clear §3.5 gap (e.g. **trustee** entity text, **teamsbot** persisted transcript). +- **FM2:** **Chatbot** architecture decision: integrate **`serviceKnowledge`** or keep parallel retrieval; if integrated, add explicit **corpus** rules for config/FAQ. --- @@ -350,12 +406,41 @@ Then add **`requestIngestion` / `indexFile`** at the **feature commit point** (o Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog), and **Teil 3.7** (feature matrix and feature-native corpus pilots). **P0** overlaps **Teil 3.7 FM0** (complete the per-feature matrix before large builds). +**Authority rollout (2026-04-24):** The **user-connection ingestion lane** (bootstrap + purge tied to **`UserConnection`**) is delivered **per OAuth authority**: **`msft` (P1a)**, **`google`** + **`clickup` (P1b)**, all sharing the same consumer, dispatcher fan-out, purge-by-`connectionId`, and unit tests for walkers + consumer. **Next product slices:** **P1c** (daily refresh, **§2.5**), **consent + per-connection preferences + frontend** (**§2.6**), then **P3** (event bus at scale).
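
The dispatcher fan-out mentioned in the authority rollout note can be sketched as follows. This is a minimal, hypothetical sketch and not the code in `subConnectorIngestConsumer.py`: `PARTS_BY_AUTHORITY`, `runWalker`, and `bootstrapConnection` are illustrative names, and the walker is a stub. Only the authority-to-part mapping, the two skip reasons (`consent_disabled`, `unsupported_authority`), and the rule that one failing part does not stop the others are taken from this plan.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Tuple

# Authority -> bootstrap parts, as listed in the roadmap table of this plan.
PARTS_BY_AUTHORITY: Dict[str, Tuple[str, ...]] = {
    "msft": ("sharepoint", "outlook"),
    "google": ("gdrive", "gmail"),
    "clickup": ("clickup",),
}

Walker = Callable[[str, str], Awaitable[Dict[str, Any]]]


async def runWalker(part: str, connectionId: str) -> Dict[str, Any]:
    """Hypothetical placeholder for the real per-service walkers."""
    await asyncio.sleep(0)  # stand-in for network I/O
    return {"indexed": 0, "skippedDup": 0, "failed": 0}


async def bootstrapConnection(
    connectionId: str,
    authority: str,
    knowledgeIngestionEnabled: bool,
    walker: Walker = runWalker,
) -> Dict[str, Any]:
    """Run all bootstrap parts of one authority in parallel (illustrative only)."""
    if not knowledgeIngestionEnabled:
        return {"status": "skipped", "reason": "consent_disabled", "parts": {}}
    parts = PARTS_BY_AUTHORITY.get(authority)
    if parts is None:
        return {"status": "skipped", "reason": "unsupported_authority", "parts": {}}

    # return_exceptions=True keeps one failing part from cancelling the others.
    results = await asyncio.gather(
        *(walker(part, connectionId) for part in parts),
        return_exceptions=True,
    )
    summary: Dict[str, Dict[str, Any]] = {}
    for part, result in zip(parts, results):
        if isinstance(result, BaseException):
            summary[part] = {"status": "failed", "error": str(result)}
        else:
            summary[part] = {"status": "done", **result}
    return {"status": "done", "reason": None, "parts": summary}
```

Calling `asyncio.run(bootstrapConnection("c1", "msft", True))` would run the SharePoint and Outlook stubs concurrently. A real handler would presumably also emit the structured `ingestion.connection.bootstrap.*` events at the points where this sketch only returns a status.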
+ | Phase | Outcome | |-------|---------| | **P0 — Façade + idempotency** *(done, 2026-04-21)* | Single `requestIngestion` / `getIngestionStatus` entry point on `KnowledgeService` with content-hash idempotency, provenance in `structure._ingestion`, and structured logging (`ingestion.queued` / `ingestion.indexed` / `ingestion.skipped.duplicate` / `ingestion.failed`). All prior `indexFile` call sites now route through the façade: `routeDataFiles._autoIndexFile`, `commcoach/serviceCommcoachIndexer.indexSessionData`, `serviceAgent/coreTools/_workspaceTools.readFile`, `serviceAgent/coreTools/_documentTools.describeImage`. Agent tools no longer carry on-demand extraction + ingestion fallbacks — they are pure consumers of the knowledge store. **Teil 3.3** matrix audited. Three implementation bugs fixed during verification: stable content hash, pre-upsert `_ingestion` preservation, `mergeStrategy=None` for per-page granularity (see **§1.4 Implementation pitfalls**). | -| **P1 — User-connection hooks** *(done, 2026-04-21)* | `connection.established` / `connection.revoked` callbacks emitted from every OAuth callback (`routeSecurityMsft`, `routeSecurityGoogle`, `routeSecurityClickup`) and from `routeDataConnections.disconnect_service` / `delete_connection`; the `ConnectionStatus.INACTIVE` enum bug (the value did not exist) was fixed by switching the disconnect path to `ConnectionStatus.REVOKED`. A new central `KnowledgeIngestionConsumer` (`subConnectorIngestConsumer.py`, registered in `app.py` lifespan) maps `established` to a `connection.bootstrap` BackgroundJob and `revoked` to a synchronous purge through `KnowledgeService.purgeConnection` → `interfaceDbKnowledge.deleteFileContentIndexByConnectionId`. `FileContentIndex` gained `connectionId` and `sourceKind` columns (auto-applied by `connectorDbPostgre`); `IngestionJob` carries both end-to-end so every chunk is purgeable by connection. 
**All three OAuth authorities are wired up** with one bootstrap module per service: `subConnectorSyncSharepoint.py` (`sourceKind="sharepoint_item"`, `eTag` as `contentVersion`, walks sites with the `@odata.nextLink` paginated `SharepointAdapter.browse`), `subConnectorSyncOutlook.py` (virtual `outlook_message` documents — header / snippet / cleaned body via the shared `cleanEmailBody` utility — with `changeKey` revisions and optional `outlook_attachment` child jobs), `subConnectorSyncGdrive.py` (`gdrive_item`, `modifiedTime` revisions, recursive walk from My Drive root with depth/age caps and Google-Doc export support inherited from `DriveAdapter.download`), `subConnectorSyncGmail.py` (virtual `gmail_message` documents with `historyId` revisions, walks `INBOX + SENT` by default, MIME-tree body extraction prefers `text/plain` and falls back to `text/html`, optional `gmail_attachment` child jobs), `subConnectorSyncClickup.py` (virtual `clickup_task` documents with `date_updated` revisions, walks teams → spaces → folder/folderless lists → tasks with workspace and per-workspace list caps, header carries name/status/list/space/assignees/tags/url so search prompts retrieve task context without a live API call). The dispatcher `_bootstrapJobHandler` fans out per authority (msft → sharepoint+outlook in parallel, google → drive+gmail in parallel, clickup → tasks); unsupported authorities log `ingestion.connection.bootstrap.skipped reason=unsupported_authority`. Structured-log schema (started / progress / done / purged) defined in **§ Structured ingestion logs** below. Eight new unit tests (purge, consumer dispatch + per-authority routing, `cleanEmailBody`, bootstrapSharepoint, bootstrapOutlook, bootstrapGmail, bootstrapGdrive, bootstrapClickup) lock the contract. 
**Retrieval threshold calibration (2026-04-21):** during UI verification `buildAgentContext` returned `instanceChunks=0` despite 640 correctly-indexed rows — root cause was overly aggressive `minScore` thresholds (Layer 1 `0.65`, Layer 1.5 `0.55`, Layer 3 `0.70`) versus realistic `text-embedding-3-small` cosine similarities in the `0.30`–`0.55` range. All three thresholds lowered to `0.35`; agent then correctly synthesized answers from indexed Outlook/SharePoint content without resorting to live tools. | -| **P2 — Profile & mandate snapshots** | Allowlisted fields only (**Teil 2.3**); regenerate on events; explicit admin toggle per mandate if needed. | -| **P3 — Event bus** | Move direct calls to async consumer where load requires it (**Teil 2.4** scalable target). | +| **P1a — User-connection hooks (Microsoft `msft`)** *(done, 2026-04-21)* | **`connection.established`** / **`connection.revoked`** emitted from **Microsoft** data-OAuth success paths and from **disconnect/delete** when the row is **`msft`** (incl. **`ConnectionStatus.REVOKED`** fix where **`INACTIVE`** was invalid). Central **`KnowledgeIngestionConsumer`** (`subConnectorIngestConsumer.py`, **`app.py`** lifespan) maps **`established`** → **`connection.bootstrap`** BackgroundJob and **`revoked`** → synchronous **`KnowledgeService.purgeConnection`** → **`interfaceDbKnowledge.deleteFileContentIndexByConnectionId`**. **`FileContentIndex.connectionId`** + **`sourceKind`** (and **`IngestionJob`** carrying both) make connector-sourced rows purgeable. **Bootstrap modules live for Microsoft:** **`subConnectorSyncSharepoint.py`** (`sourceKind="sharepoint_item"`, **`eTag`** as `contentVersion`, **`SharepointAdapter.browse`** with **`@odata.nextLink`** pagination) and **`subConnectorSyncOutlook.py`** (virtual **`outlook_message`** docs — header / snippet / cleaned body via **`cleanEmailBody`**, **`changeKey`** revisions, optional **`outlook_attachment`** child jobs). 
Dispatcher **`_bootstrapJobHandler`** runs **SharePoint + Outlook in parallel** for **`msft`**. Structured logs: **§ Structured ingestion logs**. **Retrieval threshold calibration (2026-04-21):** **`buildAgentContext`** **`minScore`** layers lowered to **`0.35`** so **`text-embedding-3-small`** matches real cosine scores; validated on **Outlook/SharePoint–indexed** content. **Tests (P1a):** purge, consumer **msft** dispatch, **`cleanEmailBody`**, **`bootstrapSharepoint`**, **`bootstrapOutlook`**. | +| **P1b — User-connection hooks (Google + ClickUp)** *(done, 2026-04)* | Parity with **`msft`**: **`routeSecurityGoogle`** / **`routeSecurityClickup`** call **`KnowledgeIngestionConsumer.onConnectionEstablished`** after token save; **`routeDataConnections`** disconnect/delete call **`onConnectionRevoked`** for **all** authorities. **`_bootstrapJobHandler`** fans out **google → `bootstrapGdrive` + `bootstrapGmail`** in parallel and **clickup → `bootstrapClickup`**. Walkers: `subConnectorSyncGdrive.py`, `subConnectorSyncGmail.py`, `subConnectorSyncClickup.py` + `subTextClean.py`. Unit tests: `test_bootstrap_gdrive.py`, `test_bootstrap_gmail.py`, `test_bootstrap_clickup.py`, extended `test_knowledge_ingest_consumer.py`. | +| **P1c — Connection refresh (lifecycle v1)** *(next)* | **Daily** (or nightly) **scheduled** re-run of the same bootstrap walkers for connections with **knowledge ingestion enabled** (**§2.6**). Reuses idempotency + fast-path; closes the **post-connect delta gap** without webhooks in v1. Observability: same log family as bootstrap; optional `event` suffix or `reason=scheduled_refresh` for shippers. | +| **P1d — Consent + preferences + UI** *(next)* | Persist **§2.6** settings **per `connectionId`**; Gate **`onConnectionEstablished`** / P1c jobs on user choice; **`frontend_nyla`** connection wizard + settings screen; walkers honor mail/file/ClickUp depth and **neutralization** flag. 
| +| **~~P2 — Profile & mandate snapshots~~** | **Removed from active roadmap** (focus: connections + feature corpus + scale). Target content remains documented in **§2.3** for a future re-entry when needed. | +| **P3 — Event bus** | Move direct calls to async consumer where load requires it (**Teil 2.4** scalable target). Remains in scope. | + +### P1b checklist *(completed — kept for audit trail)* + +1. **`routeSecurityGoogle`:** after successful **data** OAuth, enqueue **same** ingestion consumer path as Microsoft (pass **`connectionId`**, **`AuthAuthority.google`**, mandate/user scope). +2. **`routeSecurityClickup`:** after successful OAuth / token persistence, same. +3. **`routeDataConnections`:** verify **disconnect_service** / **delete_connection** emit **revoke** (or call **`purgeConnection`**) for **google** and **clickup** rows, not only **msft**. +4. **`_bootstrapJobHandler`:** remove any **“unsupported_authority”** skip for **`google`** / **`clickup`** once walkers are registered; keep skip only for **future** authorities. +5. **Quality bar:** T10/T12–T15 in the testplan — extend from **Microsoft-only** assumptions to **all three** **`routeDataConnections`** OAuth authorities. + +### P1c / P1d checklist *(next engineering slices)* + +1. **P1c:** BackgroundJob or cron entry; feature flag; per-tenant stagger; only connections with **knowledge ingestion = on**; metrics on `indexed` vs `skippedDup` per run. +2. **P1d ✅ — implemented:** + - [x] **`UserConnection`** extended with `knowledgeIngestionEnabled: bool` (default `False` = strict opt-in) and `knowledgePreferences: Optional[Dict]` (`schemaVersion=1`); DB auto-migration adds columns on startup. + - [x] **`routeDataConnections` `create_connection`** accepts `knowledgeIngestionEnabled` + `knowledgePreferences` in request body and persists them before returning. 
+ - [x] **OAuth callbacks** (`routeSecurityGoogle`, `routeSecurityMsft`, `routeSecurityClickup`) gate `callbackRegistry.trigger("connection.established", …)` on `connection.knowledgeIngestionEnabled`; emit structured log `ingestion.connection.bootstrap.skipped reason=consent_disabled` when disabled. + - [x] **`_bootstrapJobHandler`** defensive re-check: loads connection via `getUserConnectionById` and no-ops if flag was disabled after OAuth (race protection). + - [x] **`IngestionJob.neutralize: bool`** added; `requestIngestion` + `_indexFileInternal` thread it through; for `sourceKind != "file"` the flag drives `_shouldNeutralize` directly; for `sourceKind == "file"` the `FileItem.neutralize` column remains authoritative. + - [x] **`subConnectorPrefs.py`** — `loadConnectionPrefs(connectionId)` helper + `ConnectionIngestionPrefs` dataclass with safe defaults for all §2.6 keys. + - [x] **All five walkers** (Gmail, GDrive, ClickUp, Outlook, SharePoint) load prefs at bootstrap start; limits structs gain `mailContentDepth` + `neutralize` (mail walkers), `filesIndexBinaries` (Drive), `clickupScope` (ClickUp), and `neutralize` (all). + - [x] **Unit tests** (`test_p1d_consent_prefs.py` — 10 tests): consent gate no-op, prefs defaults + full mapping, Gmail depth modes (metadata/snippet/full), ClickUp scope (titles vs description). + - [x] **Frontend** (`frontend_nyla`): `AddConnectionWizard` 4-step modal (connector → consent → preferences → summary + OAuth); old three-button row replaced with single „Verbindung hinzufügen“ button; `createConnectionAndAuth` hook method; `KnowledgePreferences` type in `connectionApi.ts`. + + **Default policy (document for deploy):** `knowledgeIngestionEnabled` defaults to `False` for all new connections. Existing connections (before P1d deploy) have the column `NULL`/`False` — **no bootstrap is triggered retroactively**. Users must explicitly opt in via the wizard or connection settings. 
If the team decides to migrate existing connections to `True`, a one-time migration script must be run and the change announced in a release note. --- @@ -366,7 +451,8 @@ Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog) - One **ingestion contract** for all features and connector lifecycles. - Indexing **decoupled** from the agent loop (agents may still *invoke* tools that ultimately call ingestion, but ingestion must not *depend* on an agent run). - **Explicit** handling of connection establishment, sync, and revocation. -- **Bounded** indexing of user/mandate context with a clear PII policy. +- **Bounded** indexing of user/mandate context with a clear PII policy. +- **Explicit user consent** and **per-connection** ingestion preferences (incl. optional **neutralization**) before connector content enters the knowledge store (**§2.6**). **Explicitly NOT:** @@ -379,7 +465,8 @@ Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog) ## Affected modules (expected) - **Gateway:** `serviceKnowledge`, file upload routes, connector OAuth handlers, sync workers, possibly new `serviceKnowledgeIngest` or package under `modules/serviceCenter/services/`. -- **Interfaces:** `interfaceDbKnowledge` extensions for source metadata if needed. +- **Interfaces:** `interfaceDbKnowledge` extensions for source metadata if needed; **`interfaceDbApp`** (or adjacent) for **per-`connectionId`** ingestion preferences from **§2.6**. +- **Frontend:** `frontend_nyla`: connection wizard + connection detail settings (consent, depth toggles, neutralization, time window). - **Wiki / Reference:** `b-reference/gateway/ai-agent.md` (ingestion vs. retrieval) after implementation. --- @@ -388,19 +475,19 @@ Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog) | Topic | Options | |-------|----------| -| **Email bodies** | Full text vs. summary-only vs. 
attachment-only | +| **Email bodies** | Default product stance is **user-configurable per connection** (**§2.6** table: metadata / snippet / full cleaned body); mandate policy may still cap max tier. | | **Multi-tenant isolation audits** | Periodic job to verify chunk `mandateId` matches connection | | **Cost caps** | Per-mandate embedding budget; defer large backfills | -| **Neutralization** | Mandatory for certain `sourceKind`s even when not file-upload | +| **Neutralization** | **User opt-in** per connection (**§2.6**); optional **mandate floor** (“never below snippet+neutralize for mail”) remains a separate governance decision. | | **Provenance shape** | First-class DB columns vs **documented `chunkMetadata` keys** for `connectionId`, external id, revision (must support **Teil 2** purge rules). | -| **In-flight duplicate handling** | Accept `status ∈ {"extracted","embedding","indexed"}` with matching hash as in-progress (cheap, lossy under failure) **vs** per-`sourceId` `asyncio.Lock` in `KnowledgeService` (strict, requires singleton) — see **§1.4 Deferred to P1**. | -| **Pre-extraction dedup shortcut** | Short-circuit `_autoIndexFile` via the file-bytes SHA in `interfaceDbManagement` before running `runExtraction` (~15 s saved per re-index of a large PDF) — see **§1.4 Deferred to P1**. | +| **In-flight duplicate handling** | Accept `status ∈ {"extracted","embedding","indexed"}` with matching hash as in-progress (cheap, lossy under failure) **vs** per-`sourceId` `asyncio.Lock` in `KnowledgeService` (strict, requires singleton) — see **§1.4 Deferred (ingestion idempotency hardening)**. | +| **Pre-extraction dedup shortcut** | Short-circuit `_autoIndexFile` via the file-bytes SHA in `interfaceDbManagement` before running `runExtraction` (~15 s saved per re-index of a large PDF) — see **§1.4 Deferred (ingestion idempotency hardening)**. | --- ## Structured ingestion logs (P1 schema) -The connection-lifecycle lane emits the following structured log events. 
Each event is a single `logger.info` / `.warning` / `.error` call with a stable `extra={"event": ...}` field so downstream log shippers can route on `event` without parsing the message string. +The connection-lifecycle lane emits the following structured log events. **`part`** values **`sharepoint`**, **`outlook`**, **`gdrive`**, **`gmail`**, and **`clickup`** are all **implemented** for bootstrap; **P1c** may add the same events with a distinguishable `reason` / `jobType` for **scheduled refresh** (exact field TBD in implementation). Each event is a single `logger.info` / `.warning` / `.error` call with a stable `extra={"event": ...}` field so downstream log shippers can route on `event` without parsing the message string. | `event` | Severity | Emitter | Required `extra` keys | Meaning | |---------|----------|---------|------------------------|---------| @@ -409,7 +496,7 @@ The connection-lifecycle lane emits the following structured log events. Each ev | `ingestion.connection.bootstrap.progress` | info | bootstrap walkers | `connectionId`, `part`, `processed`, `skippedDup`, `failed` | Heart-beat every ~50 items so long-running runs are observable. | | `ingestion.connection.bootstrap.done` | info | bootstrap walkers + façade-level totals | `connectionId`, `part`, `indexed`, `skippedDup`, `skippedPolicy`, `failed`, `durationMs` (Outlook/Gmail add `attachmentsIndexed`; SharePoint/Drive add `bytes`; ClickUp adds `workspaces` + `lists`) | Walker finished cleanly. | | `ingestion.connection.bootstrap.failed` | error | `_bootstrapJobHandler` | `part`, `connectionId`, `error` | One bootstrap part raised — recorded but the other parts still complete. | -| `ingestion.connection.bootstrap.skipped` | info | `_bootstrapJobHandler` | `connectionId`, `authority`, `reason` (`unsupported_authority`) | Authority has no bootstrap module registered (e.g. a future provider). 
| +| `ingestion.connection.bootstrap.skipped` | info | `_bootstrapJobHandler` + OAuth callbacks + defensive check in `_bootstrapJobHandler` | `connectionId`, `authority`, `reason` (`unsupported_authority` │ `consent_disabled`) | Authority has no bootstrap module registered (e.g. a future provider) — **or** user has not consented (`knowledgeIngestionEnabled=False`). | | `ingestion.connection.purged` | info | `_onConnectionRevoked` | `connectionId`, `authority`, `reason`, `indexRows`, `chunks` | Knowledge purge for a revoked connection completed; numbers reflect the deleted rows. | | `ingestion.connection.purged.failed` | error | `_onConnectionRevoked` | `connectionId`, `error` | Purge raised; the revoke event was still acknowledged upstream. | @@ -421,16 +508,17 @@ All events should keep field naming consistent with the existing `ingestion.queu - **Gateway reference (retrieval + knowledge):** `wiki/b-reference/gateway/architecture.md`, `wiki/b-reference/gateway/ai-agent.md` - **Implementation touchpoints (indicative):** `gateway/modules/serviceCenter/services/serviceKnowledge/mainServiceKnowledge.py`, `gateway/modules/routes/routeDataFiles.py`, `gateway/modules/features/commcoach/serviceCommcoachIndexer.py`, agent `coreTools` `_documentTools` / `_workspaceTools`, `gateway/modules/datamodels/datamodelExtraction.py` (`ExtractionOptions.mergeStrategy: Optional[MergeStrategy]`). - **Unit tests (P0 guardrails):** `gateway/tests/unit/services/test_ingestion_hash_stability.py`, `gateway/tests/unit/services/test_extraction_merge_strategy.py`. 
-- **Unit tests (P1 guardrails):** `gateway/tests/unit/services/test_connection_purge.py`, `gateway/tests/unit/services/test_knowledge_ingest_consumer.py`, `gateway/tests/unit/services/test_clean_email_body.py`, `gateway/tests/unit/services/test_bootstrap_sharepoint.py`, `gateway/tests/unit/services/test_bootstrap_outlook.py`, `gateway/tests/unit/services/test_bootstrap_gmail.py`, `gateway/tests/unit/services/test_bootstrap_gdrive.py`, `gateway/tests/unit/services/test_bootstrap_clickup.py`. -- **P1 implementation touchpoints:** `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorIngestConsumer.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncSharepoint.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncOutlook.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGdrive.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGmail.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncClickup.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subTextClean.py`, `gateway/modules/interfaces/interfaceDbKnowledge.py` (`deleteFileContentIndexByConnectionId`), `gateway/modules/datamodels/datamodelKnowledge.py` (`FileContentIndex.connectionId` + `sourceKind`), `gateway/modules/connectors/providerMsft/connectorMsft.py` (`@odata.nextLink`-loop in `SharepointAdapter.browse`, `eTag` in `_graphItemToExternalEntry`), `gateway/modules/routes/routeSecurityMsft.py` / `routeSecurityGoogle.py` / `routeSecurityClickup.py` / `routeDataConnections.py` (callback emission + `ConnectionStatus.REVOKED` fix), `gateway/app.py` (consumer registration in lifespan). +- **Unit tests (P1a — Microsoft, done):** `gateway/tests/unit/services/test_connection_purge.py`, `gateway/tests/unit/services/test_knowledge_ingest_consumer.py` (incl. 
**msft** fan-out), `gateway/tests/unit/services/test_clean_email_body.py`, `gateway/tests/unit/services/test_bootstrap_sharepoint.py`, `gateway/tests/unit/services/test_bootstrap_outlook.py`. +- **Unit tests (P1b — Google + ClickUp, done):** **`test_knowledge_ingest_consumer`** (google / clickup fan-out), **`test_bootstrap_gmail.py`**, **`test_bootstrap_gdrive.py`**, **`test_bootstrap_clickup.py`**. **P1d (done):** **`test_p1d_consent_prefs.py`** (10 tests: consent gate, prefs parsing, Gmail depth modes, ClickUp scope). **P1c:** add scheduler tests when implemented. +- **P1 implementation touchpoints:** `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorIngestConsumer.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncSharepoint.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncOutlook.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGdrive.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGmail.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncClickup.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subTextClean.py`, `gateway/modules/interfaces/interfaceDbKnowledge.py` (`deleteFileContentIndexByConnectionId`), `gateway/modules/datamodels/datamodelKnowledge.py` (`FileContentIndex.connectionId` + `sourceKind`), `gateway/modules/connectors/providerMsft/connectorMsft.py` (`@odata.nextLink`-loop in `SharepointAdapter.browse`, `eTag` in `_graphItemToExternalEntry`), `gateway/modules/connectors/providerGoogle/connectorGoogle.py` (P1b: Drive + Gmail revision keys and download/export paths), `gateway/modules/routes/routeSecurityMsft.py` (P1a callbacks), `gateway/modules/routes/routeSecurityGoogle.py` and `gateway/modules/routes/routeSecurityClickup.py` (P1b: parity callbacks), `gateway/modules/routes/routeDataConnections.py` (revoke for **all** authorities), `gateway/app.py` (consumer registration in 
lifespan). ## Acceptance criteria (plan level) | # | Criterion | Prio | |---|-----------|------| | 1 | Every new **file** that should be searchable triggers ingestion **without** requiring an agent session. | must | -| 2 | **User connection** connect / disconnect has defined ingestion or purge behavior documented and implementable. | must | -| 3 | **Profile/mandate** snapshots use an explicit allowlist; secrets never enter the embedding pipeline. | must | +| 2 | **User connection** connect / disconnect has defined ingestion or purge behavior **for each** OAuth authority **`routeDataConnections`** supports (**P1a** **`msft`**, **P1b** **`google`** / **`clickup`**); **plus** user-controlled **opt-in** and **preference bundle** before ingestion (**P1d**, **§2.6**). | must | +| 3 | **Profile/mandate** snapshot ingestion (**former roadmap P2**) is **deferred**; when re-opened, snapshots must use an explicit allowlist and never embed secrets. Until then, **§2.6** consent + neutralization covers connector-sourced PII risk. | should (reactivated when P2 returns) | | 4 | Ingestion is **idempotent** for unchanged content (no duplicate embedding work). Verified 2026-04-21 on a 500-page PDF: second re-index trigger logs `ingestion.skipped.duplicate` with a stable hash, zero embedding API calls. See **§1.4 pitfalls** for the three bug classes that had to be fixed first. | must | | 5 | **Teil 3.3** matrix completed: every `modules/features/*` product row has **retrieval** (agent vs none), **corpus** (upload / tools / feature indexer), and **gap** explicitly stated, and is not labeled “non-injecting” if **`AgentService`** already provides retrieval injection. | should | @@ -449,9 +537,10 @@ All events should keep field naming consistent with the existing `ingestion.queu | T7 | Do the per-page chunks survive for multi-page PDFs (no `MergeStrategy` concatenation)? | Unit: `tests/unit/services/test_extraction_merge_strategy.py`. 
Live: 500-page PDF → 563 ContentObjects, 567 embedding chunks in 24 batches (verified 2026-04-21). | | T8 | Do `_ingestion.hash` and `status="indexed"` survive a pre-scan re-upsert in `_autoIndexFile`? | Review `routeDataFiles._autoIndexFile` line ~127: the existing row is read before the upsert, and `_ingestion` + `indexed` are merged into the fresh `contentIndex`. Live: second trigger → `ingestion.skipped.duplicate` instead of re-embedding. | | T9 | Does a `connection.revoked` event purge **all** `FileContentIndex` rows + `ContentChunk`s of a connection and **nothing else** (uploads without `connectionId` and other connections stay intact)? | Unit: `tests/unit/services/test_connection_purge.py` (3 cases: positive purge, empty `connectionId` no-op, unknown `connectionId`). | -| T10 | Does the `KnowledgeIngestionConsumer` dispatch `connection.established` correctly as an asynchronous `connection.bootstrap` job (msft → SharePoint + Outlook in parallel; google → Drive + Gmail in parallel; clickup → tasks; unknown authorities `skipped.reason="unsupported_authority"`) and `connection.revoked` synchronously as a purge? | Unit: `tests/unit/services/test_knowledge_ingest_consumer.py` (8 cases: established enqueue, missing-id ignore, revoked purge, missing-id ignore, skip-unsupported, msft fan-out, google fan-out, clickup dispatch). | +| T10 | Does the `KnowledgeIngestionConsumer` dispatch `connection.established` correctly as an asynchronous `connection.bootstrap` job (**P1a:** **msft** → SharePoint + Outlook in parallel; **P1b:** **google** → Drive + Gmail in parallel; **clickup** → tasks) and `connection.revoked` synchronously as a purge, **for each** of the three **`routeDataConnections`** authorities? | **P1a + P1b (done):** `test_knowledge_ingest_consumer.py` covers all three authorities + revoke; unknown authorities yield `skipped.reason="unsupported_authority"`. **P1d:** additionally, dispatch only when **consent = yes**. 
| | T11 | Does `cleanEmailBody` reduce realistic Outlook HTML to the message's own body content (HTML strip, quote strip EN+DE, signature strip, whitespace collapse, `maxChars` truncation)? | Unit: `tests/unit/services/test_clean_email_body.py` (8 cases). Consequence: `bootstrapOutlook` never sends HTML, quoted replies, or signatures into the embedding pipeline step. | | T12 | Are the bootstrap walkers for SharePoint and Outlook idempotent against a second run with unchanged `eTag` / `changeKey`? | Unit: `tests/unit/services/test_bootstrap_sharepoint.py` + `tests/unit/services/test_bootstrap_outlook.py`. Mock adapters return stable revisions; the KnowledgeService fake reports `duplicate` and the result object tallies `skippedDuplicate`. | -| T13 | Does `bootstrapGmail` walk `INBOX + SENT`, parse MIME bodies (preferring `text/plain`, falling back to `text/html`), follow `nextPageToken` pagination, and stay idempotent against identical `historyId` revisions? | Unit: `tests/unit/services/test_bootstrap_gmail.py` (6 cases: header/snippet/body content-objects, MIME plain-vs-html preference, HTML fallback, multi-label fan-out, `nextPageToken` pagination, duplicate accounting). | -| T14 | Does `bootstrapGdrive` walk My Drive recursively (folder MIME detection, `maxDepth`), respect the `maxAgeDays` recency filter, and stay idempotent against identical `modifiedTime` revisions? | Unit: `tests/unit/services/test_bootstrap_gdrive.py` (4 cases: site/subfolder walk, duplicate accounting, recency-skip via `skippedPolicy`, provenance carries `authority="google"` + `service="drive"`). | -| T15 | Does `bootstrapClickup` walk workspaces → spaces → folder/folderless lists → tasks under the `maxWorkspaces` / `maxListsPerWorkspace` / `maxTasks` caps, respect the `maxAgeDays` recency filter, and stay idempotent against identical `date_updated` revisions? 
| Unit: `tests/unit/services/test_bootstrap_clickup.py` (4 cases: hierarchy walk indexes 4 tasks across 2 lists, duplicate accounting, recency-skip via `skippedPolicy`, `maxTasks` cap). | +| T13 | Does `bootstrapGmail` walk `INBOX + SENT`, parse MIME bodies (preferring `text/plain`, falling back to `text/html`), follow `nextPageToken` pagination, and stay idempotent against identical `historyId` revisions? | **P1b (done):** Unit `test_bootstrap_gmail.py`. **P1d:** walker respects **Content depth** from **§2.6** (metadata/snippet/body). | +| T14 | Does `bootstrapGdrive` walk My Drive recursively (folder MIME detection, `maxDepth`), respect the `maxAgeDays` recency filter, and stay idempotent against identical `modifiedTime` revisions? | **P1b (done):** Unit `test_bootstrap_gdrive.py`. **P1d:** “binary files” toggle / MIME allowlist from **§2.6**. | +| T15 | Does `bootstrapClickup` walk workspaces → spaces → folder/folderless lists → tasks under the `maxWorkspaces` / `maxListsPerWorkspace` / `maxTasks` caps, respect the `maxAgeDays` recency filter, and stay idempotent against identical `date_updated` revisions? | **P1b (done):** Unit `test_bootstrap_clickup.py`. **P1d:** ClickUp **scope** (title/description/comments) from **§2.6**. | +| T16 | Does the **P1c** daily job run only connections with **knowledge ingestion = on**, and do cost and API limits stay manageable through idempotency + fast-path? | Integration or unit test with a fake clock: second run → mostly `skippedDup`; logs `ingestion.connection.bootstrap.*` with a recognizable scheduled `reason` (if implemented). | diff --git a/d-guides/deployment/poweron-sec.kdbx b/d-guides/deployment/poweron-sec.kdbx index 0daaaed..5c52c62 100644 Binary files a/d-guides/deployment/poweron-sec.kdbx and b/d-guides/deployment/poweron-sec.kdbx differ