ValueOn AG 2026-05-06 23:28:14 +02:00
commit 7af8751aa9
3 changed files with 576 additions and 27 deletions

@ -0,0 +1,460 @@
<!-- status: plan -->
<!-- started: 2026-04-29 -->
<!-- component: frontend-nyla | gateway -->
# FormGenerator: Persistent User Grouping
## Description and Context
`FormGeneratorTable` is used on many pages of the platform. Users should be able to organize entries into named, recursive groups, with persistent storage and full compatibility with pagination, search, filters, and all action buttons.
**Core principle: grouping is a built-in feature of `PaginationParams` and `PaginatedResponse`: no separate call, no dedicated route, no dedicated API module. The existing `refetch()` mechanism is the only transport.**
---
## Architecture Core: How It Works
### Grouping rides on the existing pagination call
`PaginationParams` (the JSON parameter of every list endpoint) gets two new optional fields:
```
saveGroupTree → if set: the backend persists this tree BEFORE the fetch
groupId → if set: the backend restricts the result to the items of this group
```
`PaginatedResponse` gets one new optional field:
```
groupTree → the user's current group tree for this endpoint (always included)
```
**A single call therefore does three things at once:**
1. Saves the new group tree (if `saveGroupTree` is set)
2. Filters to one group (if `groupId` is set)
3. Returns the current items plus the current group tree
### End-to-end flow
```
Page load (first load):
  GET /api/connections/?pagination={"page":1,"pageSize":20}
  ← { items: [...], pagination: {...}, groupTree: [{id, name, itemIds, subGroups}] }
User creates a group (visible locally at once, then a debounced save via refetch):
  GET /api/connections/?pagination={"page":1,"pageSize":20,"saveGroupTree":[{newTree}]}
  ← { items: [...], pagination: {...}, groupTree: [{newTree, confirmed by the backend}] }
User enters group "Kunden" (id: "g1"):
  GET /api/connections/?pagination={"page":1,"pageSize":20,"groupId":"g1"}
  ← { items: [only items of the group], pagination: {totalItems: 3, ...}, groupTree: [...] }
  → Search, filter, sorting, mode=ids, mode=filterValues all work
    inside the group scope, because the backend knows the IN list
```
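A minimal sketch of how a client could assemble the three requests above, assuming the `pagination` query parameter carries URL-encoded JSON as shown in the flow. `buildListUrl` and `readPagination` are hypothetical helper names used only for illustration; they are not part of the platform.

```python
import json
from typing import Any, Dict, List, Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def buildListUrl(
    basePath: str,
    page: int = 1,
    pageSize: int = 20,
    groupId: Optional[str] = None,
    saveGroupTree: Optional[List[Dict[str, Any]]] = None,
) -> str:
    """Hypothetical helper: encode PaginationParams as the `pagination` query parameter."""
    params: Dict[str, Any] = {"page": page, "pageSize": pageSize}
    if groupId is not None:
        params["groupId"] = groupId  # scope the list to one group
    if saveGroupTree is not None:
        params["saveGroupTree"] = saveGroupTree  # ask the backend to persist this tree
    query = urlencode({"pagination": json.dumps(params, separators=(",", ":"))})
    return f"{basePath}?{query}"


def readPagination(url: str) -> Dict[str, Any]:
    """Decode the `pagination` parameter again, as a backend would before validating it."""
    return json.loads(parse_qs(urlsplit(url).query)["pagination"][0])


tree = [{"id": "g1", "name": "Kunden", "itemIds": ["a", "b"], "subGroups": []}]
saveUrl = buildListUrl("/api/connections/", saveGroupTree=tree)
scopeUrl = buildListUrl("/api/connections/", groupId="g1")
```

Both optional fields travel in the same parameter as `page` and `pageSize`, which is what lets one call save the tree and fetch the current page.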
### Backend: exactly 2 lines of overhead per route
The entire grouping mechanism is implemented in `routeHelpers.py` as a shared helper. Every route that should support grouping calls it:
```python
# Start of the route function (BEFORE the items are built):
groupCtx = handleGroupingInRequest(paginationParams, interface, "connections")
# → persists saveGroupTree if present
# → resolves the group's item IDs (groupCtx.itemIds) if groupId is set
# Build items (unchanged)...
# If a group scope is active: restrict the items to the group
items = applyGroupScopeFilter(items, groupCtx.itemIds)
# At the end: embed groupTree in the response
return {**result.model_dump(), "groupTree": groupCtx.groupTree}
```
**No new route file. No new interface file. No new URL.**
---
## Affected Modules
- **Gateway:**
  - `modules/datamodels/datamodelPagination.py`: `PaginationParams` gains `groupId` and `saveGroupTree`; `PaginatedResponse` gains `groupTree`; new classes `TableGroupNode` and `TableGrouping` live in **this file**
  - `modules/interfaces/interfaceDbApp.py`: extend `AppObjects` with `getTableGrouping(contextKey)` and `upsertTableGrouping(contextKey, rootGroups)`; new table `table_groupings` in `poweron_app` (auto-created)
  - `modules/routes/routeHelpers.py`: add `handleGroupingInRequest(paginationParams, interface, contextKey)` and `applyGroupScopeFilter(items, itemIds)`
  - Every list route that should support grouping: **2 lines** at the start + **1 field** in the response (`groupTree`)
- **Frontend:**
  - `FormGeneratorTable.tsx`: `groupingConfig` prop; internal grouping state; uses `hookData.refetch()` as the only transport
  - `FormGeneratorControls.tsx`: group toolbar button
  - `FormGenerator/GroupingManager/GroupRow.tsx`: group header row (new component)
  - `FormGenerator/GroupingManager/GroupingManager.tsx`: side panel (new component)
  - **No new hook, no new API module, no changes to existing feature hooks**
- **DB migration:** none (auto-create via DatabaseConnector)
---
## Data Model
### Additions in `datamodelPagination.py`
```python
# Imports for this excerpt; SortField, PaginationMetadata and T already exist in this file.
from typing import Any, Dict, Generic, List, Optional

from pydantic import BaseModel, ConfigDict, Field


# --- Grouping models (new, in the same file) ---
class TableGroupNode(BaseModel):
    id: str
    name: str
    itemIds: List[str] = Field(default_factory=list)
    subGroups: List['TableGroupNode'] = Field(default_factory=list)
    order: int = 0
    isExpanded: bool = True


TableGroupNode.model_rebuild()  # resolve the self-reference in subGroups


class TableGrouping(BaseModel):
    """DB table table_groupings in poweron_app."""
    id: str
    userId: str
    contextKey: str  # derived from the route prefix, e.g. "connections", "prompts", "admin/users"
    rootGroups: List[TableGroupNode] = Field(default_factory=list)
    updatedAt: Optional[float] = None


# --- Extension of PaginationParams (2 new optional fields) ---
class PaginationParams(BaseModel):
    page: int = Field(ge=1)
    pageSize: int = Field(ge=1, le=1000)
    sort: List[SortField] = Field(default_factory=list)
    filters: Optional[Dict[str, Any]] = None
    # NEW:
    groupId: Optional[str] = None  # scope: only items of this group
    saveGroupTree: Optional[List[Dict[str, Any]]] = None  # persist: save this tree


# --- Extension of PaginatedResponse (1 new optional field) ---
class PaginatedResponse(BaseModel, Generic[T]):
    items: List[T]
    pagination: Optional[PaginationMetadata]
    groupTree: Optional[List[TableGroupNode]] = None  # NEW: always included when present
    model_config = ConfigDict(arbitrary_types_allowed=True)
```
```
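Because `saveGroupTree` arrives as a list of raw dictionaries, it may be worth checking the tree before it is persisted. The following is a sketch under assumptions, not the planned implementation: `validateGroupTree` is a hypothetical name, and the depth limit of 3 is borrowed from the open question on maximum depth rather than from a decision.

```python
from typing import Any, Dict, List, Set


def validateGroupTree(rootGroups: List[Dict[str, Any]], maxDepth: int = 3) -> None:
    """Raise ValueError if a group id repeats or nesting exceeds maxDepth (assumed limit)."""
    seenIds: Set[str] = set()

    def _walk(nodes: List[Dict[str, Any]], depth: int) -> None:
        if nodes and depth > maxDepth:
            raise ValueError(f"group tree deeper than {maxDepth} levels")
        for node in nodes:
            groupId = str(node["id"])
            if groupId in seenIds:
                raise ValueError(f"duplicate group id: {groupId}")
            seenIds.add(groupId)
            _walk(node.get("subGroups", []), depth + 1)

    _walk(rootGroups, 1)


def isValidGroupTree(rootGroups: List[Dict[str, Any]], maxDepth: int = 3) -> bool:
    """Boolean convenience wrapper around validateGroupTree."""
    try:
        validateGroupTree(rootGroups, maxDepth)
    except ValueError:
        return False
    return True
```

Unique ids matter because `groupId` lookups search the whole tree; a duplicate would make the scope ambiguous.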
---
## Backend Implementation
### `routeHelpers.py`: new shared helper
```python
from dataclasses import dataclass
from typing import Optional, Set

from modules.datamodels.datamodelPagination import PaginationParams


@dataclass
class GroupingContext:
    groupTree: Optional[list]  # for the response
    itemIds: Optional[Set[str]]  # set when groupId is given: the IN-filter set


def handleGroupingInRequest(
    paginationParams: Optional[PaginationParams],
    interface,  # AppObjects
    contextKey: str,
) -> GroupingContext:
    """
    Central grouping handler, called at the start of every list route.
    1. If paginationParams.saveGroupTree is set:
       → interface.upsertTableGrouping(contextKey, saveGroupTree)
       → remove saveGroupTree from params (it is not processed any further)
    2. If paginationParams.groupId is set:
       → find the group in the stored tree (recursively, including subgroups)
       → return the itemIds of the group (+ all subgroups) as a set
       → remove groupId from params (it is not processed as a normal filter)
    3. Load the current groupTree and provide it for the response.
    Returns: GroupingContext(groupTree, itemIds)
    """


def applyGroupScopeFilter(items: list, itemIds: Optional[Set[str]]) -> list:
    """
    Applies the group scope filter.
    Returns items unchanged when itemIds is None (no scope active).
    Otherwise keeps only items whose item["id"] is in itemIds.
    """
    if itemIds is None:
        return items
    return [item for item in items if str(item.get("id", "")) in itemIds]
```
```
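Step 2 of the handler (find the group, then gather its item IDs including all subgroups) can be sketched as two small recursive functions. This is an illustration that works on plain dictionaries shaped like `TableGroupNode`; the function names are hypothetical, and returning an empty set for an unknown `groupId` is an assumption (the plan does not say how that case is handled).

```python
from typing import Any, Dict, List, Optional, Set


def findGroup(nodes: List[Dict[str, Any]], groupId: str) -> Optional[Dict[str, Any]]:
    """Depth-first search for the node with the given id."""
    for node in nodes:
        if node.get("id") == groupId:
            return node
        found = findGroup(node.get("subGroups", []), groupId)
        if found is not None:
            return found
    return None


def collectItemIds(node: Dict[str, Any]) -> Set[str]:
    """Item IDs of the node plus those of all its subgroups."""
    ids = {str(itemId) for itemId in node.get("itemIds", [])}
    for sub in node.get("subGroups", []):
        ids |= collectItemIds(sub)
    return ids


def resolveGroupScope(tree: List[Dict[str, Any]], groupId: Optional[str]) -> Optional[Set[str]]:
    """None means no scope; an unknown groupId yields an empty set (assumption)."""
    if groupId is None:
        return None
    node = findGroup(tree, groupId)
    return collectItemIds(node) if node is not None else set()


exampleTree = [
    {"id": "g1", "name": "Kunden", "itemIds": ["a", "b"], "subGroups": [
        {"id": "g2", "name": "Aktive", "itemIds": ["c"], "subGroups": []},
    ]},
    {"id": "g3", "name": "Intern", "itemIds": ["d"], "subGroups": []},
]
```

The returned set is exactly what `applyGroupScopeFilter` expects as `itemIds`, so `None` keeps the list unscoped and a set switches the scope on.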
### Route extension pattern (2 + 1 lines)
```python
@router.get("/")
async def get_connections(request, pagination=None, mode=None, column=None, currentUser=Depends(getCurrentUser)):
    from modules.routes.routeHelpers import handleGroupingInRequest, applyGroupScopeFilter
    interface = getInterface(currentUser)
    CONTEXT_KEY = "connections"
    # paginationParams: the parsed form of the raw `pagination` query parameter
    # (parsing is unchanged from the existing route and omitted here)
    # 1. Handle grouping (save if requested, resolve the scope)
    groupCtx = handleGroupingInRequest(paginationParams, interface, CONTEXT_KEY)
    # mode=filterValues / mode=ids (unchanged, but groupId has already been removed from params)
    if mode == "filterValues": ...
    if mode == "ids":
        items = _buildEnhancedItems()
        items = applyGroupScopeFilter(items, groupCtx.itemIds)  # scope applies to ids as well
        return handleIdsInMemory(items, pagination)
    # Build items (unchanged)
    items = _buildEnhancedItems()
    # 2. Apply the group scope filter
    items = applyGroupScopeFilter(items, groupCtx.itemIds)
    # Pagination (unchanged)
    result = paginateInMemory(items, paginationParams)
    # 3. Embed groupTree in the response
    return {**result.model_dump(), "groupTree": groupCtx.groupTree}
```
**`mode=filterValues` and `mode=ids` work correctly inside the group scope without special handling**: `groupId` has been removed from `paginationParams` and `applyGroupScopeFilter` runs first, so filter dropdowns and "Select All Filtered" refer only to items of the current group.
---
## Frontend Implementation
### `FormGeneratorTable`: internal grouping state
```typescript
// New prop:
groupingConfig?: {
  contextKey: string; // Only for RBAC/logging; the actual key is derived server-side from the endpoint
  enabled: boolean;
}
// Internal state (only in FormGeneratorTable, not in the hook):
const [groupTree, setGroupTree] = useState<TableGroupNode[]>([]);
const [activeGroupId, setActiveGroupId] = useState<string | null>(null);
const [pendingGroupTree, setPendingGroupTree] = useState<TableGroupNode[] | null>(null);
```
### Data flow: `hookData.refetch()` as the only transport
```typescript
// groupTree arrives with the normal refetch response
// (a top-level response field next to `pagination`):
useEffect(() => {
  if (hookData?.groupTree) {
    setGroupTree(hookData.groupTree);
    setPendingGroupTree(null); // Saved: clear pending
  }
}, [hookData]);
// When entering a group scope:
const _enterGroup = (groupId: string) => {
  setActiveGroupId(groupId);
  hookData.refetch({ ...currentPaginationParams, page: 1, groupId });
};
// When leaving the scope:
const _exitGroup = () => {
  setActiveGroupId(null);
  hookData.refetch({ ...currentPaginationParams, page: 1, groupId: undefined });
};
// On a group mutation (create, rename, delete, assign item):
const _mutateGroupTree = (newTree: TableGroupNode[]) => {
  setGroupTree(newTree); // Visible locally at once (optimistic)
  setPendingGroupTree(newTree); // Marked for the next save
  _debouncedSave(newTree); // Debounced: save via refetch after 500ms
};
// Debounced save: a normal refetch plus saveGroupTree in the pagination param
const _debouncedSave = useMemo(() => debounce((tree: TableGroupNode[]) => {
  hookData.refetch({
    ...currentPaginationParams,
    saveGroupTree: tree, // Backend saves and confirms
  });
}, 500), [hookData, currentPaginationParams]);
```
```
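The `newTree` handed to `_mutateGroupTree` is best produced by a pure function that leaves the current tree untouched, so the optimistic state and the saved state never share nodes. The sketch below shows the "assign item to group" mutation in Python, to stay consistent with the backend examples; the frontend would do the equivalent in TypeScript. `moveItemToGroup` is a hypothetical name, and the rule that an item belongs to at most one group is an assumption derived from the "move to group" wording.

```python
import copy
from typing import Any, Dict, List


def moveItemToGroup(tree: List[Dict[str, Any]], itemId: str, targetGroupId: str) -> List[Dict[str, Any]]:
    """Return a new tree in which itemId belongs only to targetGroupId; the input is not mutated."""
    newTree = copy.deepcopy(tree)
    placed = False

    def _walk(nodes: List[Dict[str, Any]]) -> None:
        nonlocal placed
        for node in nodes:
            # Remove the item wherever it currently sits (single-membership assumption).
            node["itemIds"] = [i for i in node.get("itemIds", []) if i != itemId]
            if node["id"] == targetGroupId:
                node["itemIds"].append(itemId)
                placed = True
            _walk(node.get("subGroups", []))

    _walk(newTree)
    if not placed:
        raise KeyError(targetGroupId)
    return newTree


currentTree = [
    {"id": "g1", "name": "Kunden", "itemIds": ["a", "b"], "subGroups": [
        {"id": "g2", "name": "Aktive", "itemIds": ["c"], "subGroups": []},
    ]},
]
movedTree = moveItemToGroup(currentTree, "a", "g2")
```

Since `currentTree` is left as it was, the previous state can be restored if the backend rejects the save.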
**No new API module. No separate fetch call. `hookData.refetch()` is all there is.**
### Render structure
```
Root view (no activeGroupId):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
▼ Kunden (3 items) [Delete all] [Download all] [+ Subgroup] [Rename] [×]
▶ Aktive (1) [Delete all] [+ Subgroup] [Rename] [×]
— Item A [Edit] [Delete] [Download]
— Item B [Edit] [Delete] [Download]
— Item C (Aktive) [Edit] [Delete] [Download]
▶ Intern [5 items, open group →]
── Unassigned (2)
— Item X [Edit] [Delete] [Download]
Group scope (activeGroupId = "Intern"):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
← Back | Intern (page 1/1 · 5 entries) [Delete all] [Download all]
— Item C [Edit] [Delete] [Download]
— Item D [Edit] [Delete] [Download]
...
```
When `activeGroupId` is set, refetch runs with `groupId` and the backend filters, so pagination, search, filters, and `mode=ids` are all limited to the group.
---
## Pagination, Search, and Filters: Fully Correct
| Scenario | What happens |
|----------|-------------|
| Root view, paging | Normal refetch. `groupTree` comes along in the response. Group counters come from `groupNode.itemIds.length`. |
| Group scope active | `groupId` in PaginationParams → backend IN filter → the total count comes from the backend (correct). |
| Search in scope | `groupId` + `filters.search` → the backend filters the group's items by search text. |
| "Select All Filtered" (mode=ids) | `groupId` in params → `applyGroupScopeFilter` is applied BEFORE `handleIdsInMemory` → only IDs of the group are returned. |
| Filter dropdown (mode=filterValues) | `groupId` in params → `applyGroupScopeFilter` before `handleFilterValuesInMemory` → distinct values come only from group items. |
| Saving the group tree while paging | `saveGroupTree` + page params in the same call → the backend saves the tree AND returns the current page. |
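The ordering behind these scenarios (scope first, then search, then pagination) can be checked with a small in-memory sketch. `applyGroupScopeFilter` follows the contract described for the shared helper; `searchByName` and `paginateSketch` are hypothetical stand-ins for the existing search filter and `paginateInMemory`, and the `totalPages` field is an assumption about the pagination metadata.

```python
from math import ceil
from typing import Any, Dict, List, Optional, Set

Item = Dict[str, Any]


def applyGroupScopeFilter(items: List[Item], itemIds: Optional[Set[str]]) -> List[Item]:
    """Same contract as the shared helper: None means no scope is active."""
    if itemIds is None:
        return items
    return [item for item in items if str(item.get("id", "")) in itemIds]


def searchByName(items: List[Item], text: str) -> List[Item]:
    """Hypothetical stand-in for the existing search filter."""
    needle = text.lower()
    return [item for item in items if needle in str(item.get("name", "")).lower()]


def paginateSketch(items: List[Item], page: int, pageSize: int) -> Dict[str, Any]:
    """Hypothetical stand-in for paginateInMemory; counts are taken from the scoped list."""
    start = (page - 1) * pageSize
    return {
        "items": items[start:start + pageSize],
        "pagination": {
            "page": page,
            "pageSize": pageSize,
            "totalItems": len(items),
            "totalPages": max(1, ceil(len(items) / pageSize)),
        },
    }


allItems = [
    {"id": "a", "name": "Alpha"},
    {"id": "b", "name": "Test B"},
    {"id": "c", "name": "Test C"},
    {"id": "d", "name": "Delta"},
    {"id": "e", "name": "Echo"},
]
groupItemIds = {"c", "d", "e"}
scoped = applyGroupScopeFilter(allItems, groupItemIds)
firstPage = paginateSketch(scoped, page=1, pageSize=2)
selectAllIds = [item["id"] for item in searchByName(scoped, "test")]
```

Item `b` matches the search text but is not in the group, so it appears in neither the page nor the "select all" result.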
---
## Decisions
| Date | Decision | Rationale |
|-------|-------------|------------|
| 2026-04-29 | `saveGroupTree` + `groupId` in `PaginationParams` instead of a dedicated endpoint | One call, one transport, no second API path; grouping is an integral part of the data query |
| 2026-04-29 | `groupTree` in `PaginatedResponse` | Always in sync with the current items; no separate load call needed |
| 2026-04-29 | Shared helper in `routeHelpers.py`, 2 lines per route | DRY; all complexity in one place; no route-specific logic |
| 2026-04-29 | Optimistic UI + debounced save via the normal refetch | Immediate UX; no flicker; persistence without an extra call |
| 2026-04-29 | Models in `datamodelPagination.py` | `PaginatedResponse` lives there; no new file; the import graph stays minimal |
| 2026-04-29 | `applyGroupScopeFilter` also for `mode=ids` and `mode=filterValues` | Filter dropdown and bulk select work correctly in the group scope without special handling |
| 2026-04-29 | Group state only in `FormGeneratorTable`, not in the feature hook | No changes to existing hooks needed; grouping is transparent to page code |
---
## Implementation Checklist
### Phase 1: Backend Core
- [ ] `datamodelPagination.py`: add the `TableGroupNode` and `TableGrouping` classes
- [ ] `datamodelPagination.py`: extend `PaginationParams` with `groupId` + `saveGroupTree`
- [ ] `datamodelPagination.py`: extend `PaginatedResponse` with `groupTree`
- [ ] `datamodelPagination.py`: extend `normalize_pagination_dict` so that `saveGroupTree` and `groupId` are parsed correctly
- [ ] `interfaceDbApp.py`: add `getTableGrouping(contextKey)` + `upsertTableGrouping(contextKey, rootGroups)` to `AppObjects`
- [ ] `routeHelpers.py`: implement the `GroupingContext` dataclass + `handleGroupingInRequest()` + `applyGroupScopeFilter()`
### Phase 2: Route Extensions
Per route: `handleGroupingInRequest` at the start + `applyGroupScopeFilter` before pagination + `groupTree` in the response
- [ ] `routeDataConnections.py` (incl. `mode=ids`, `mode=filterValues`)
- [ ] `routeDataPrompts.py`
- [ ] `routeDataUsers.py`
- [ ] `routeDataMandates.py`
- [ ] `routeDataFiles.py`
- [ ] More as needed (Trustee, RealEstate, Invitations, …)
### Phase 3: Frontend Types + FormGeneratorTable Skeleton
- [ ] TypeScript type `TableGroupNode`, either directly in `FormGeneratorTable.tsx` or in `src/types/tableGrouping.ts`
- [ ] Add the `groupingConfig` prop to `FormGeneratorTableProps`
- [ ] Parse `groupTree` from the `hookData` response and keep it in internal state
- [ ] `activeGroupId` state + `_enterGroup` / `_exitGroup`
- [ ] `pendingGroupTree` state + `_mutateGroupTree` + debounced save via `hookData.refetch()`
- [ ] Extend the frontend `PaginatedResponse` type with `groupTree`
### Phase 4: Render Logic
- [ ] `GroupRow` component: `src/components/FormGenerator/GroupingManager/GroupRow.tsx`
- [ ] Render algorithm: root view with group header rows and DataRows; "Unassigned" section
- [ ] Group scope view: breadcrumb, "Back" button, normal table layout
- [ ] Dimmed groups without visible items (root view) with an "Open group" button
- [ ] Expand/collapse per group (local; `isExpanded` comes from `groupTree`)
### Phase 5: Actions and Interaction
- [ ] `actionButtons` + `customActions` at `GroupRow` level (batch on all items via `mode=ids` in the scope)
- [ ] Delete group with `useConfirm`: "Group only" vs. "Group + delete all items"
- [ ] "Move to group" as a BatchAction for multi-select
- [ ] Context menu button per DataRow → group dropdown
- [ ] Drag-and-drop DataRow → GroupRow
### Phase 6: GroupingManager Panel
- [ ] `GroupingManager` component: group tree panel
- [ ] "Groups" button in `FormGeneratorControls` (only if `groupingConfig.enabled`)
- [ ] Create a new group (`usePrompt`)
- [ ] Rename a group (inline)
- [ ] Create a subgroup
- [ ] Delete a group (`useConfirm`)
- [ ] Reorder up/down
### Wrap-up
- [ ] i18n: all new UI texts tagged with `t('...')`
- [ ] CSS Modules for all new components
- [ ] Update `b-reference/frontend-nyla/formgenerator.md`
- [ ] `b-reference/gateway/architecture.md`: document the `PaginationParams`/`PaginatedResponse` extension
---
## Acceptance Criteria
| # | Criterion (Given-When-Then) | Prio |
|---|---------------------------|------|
| 1 | Given `FormGeneratorTable` with `groupingConfig.enabled`, when the page loads, then `groupTree` arrives in the normal list response; **no second API call** | must |
| 2 | Given the user creates a group, when 500ms have passed since the last change, then a single `refetch` with `saveGroupTree` is sent; after a reload the group is present | must |
| 3 | Given the user clicks group "Kunden", when `_enterGroup` is called, then a refetch with `groupId` runs; the backend filters; pagination and total count refer to the group | must |
| 4 | Given an active group scope, when the user searches for "Test", then `groupId` + search go in one call; the backend returns only hits inside the group | must |
| 5 | Given an active group scope, when the user clicks "Select all" (mode=ids), then the IDs come only from the group, not from the full list | must |
| 6 | Given a filter dropdown opened in the group scope, when the values are loaded, then they come only from items of the group (mode=filterValues correct) | should |
| 7 | Given `FormGeneratorTable` without `groupingConfig`, then behavior is identical to before the feature | must |
| 8 | Given a group with 5 items, when the user triggers Delete on the group header, then after the confirm all 5 items are deleted; the group is removed from the tree; one `refetch` with the current `saveGroupTree` | must |
| 9 | Given the user drags a DataRow onto a group header, then the item is assigned to the group; `_mutateGroupTree` + debounced save | should |
---
## Test Plan
| ID | AC | Type | Automated | Repo path | Status |
|----|----|----|--------------|-----------|--------|
| T1 | 1, 2 | api | yes | `gateway/tests/test_grouping_helpers.py` | pending |
| T2 | 3, 4, 5, 6 | api | yes | `gateway/tests/test_grouping_helpers.py` | pending |
| T3 | 7 | component | no | manual | pending |
| T4 | 8 | api + component | no | manual | pending |
| T5 | 9 | component | no | manual | pending |
---
## Open Questions
1. Prepare **`scope: 'user' | 'mandate'`** in the `TableGrouping` model now, for later mandate-wide sharing?
2. Should the **CSV export** include a group column?
3. **`activeGroupId` in URL state** (`?group=g1`) for deep links?
4. **Max depth**: configurable (`groupingConfig.maxDepth`), or a fixed warning after 3 levels?
---
## Links
- PR: —
- FormGenerator reference: `b-reference/frontend-nyla/formgenerator.md`
- Gateway architecture reference: `b-reference/gateway/architecture.md`
- DB architecture reference: `b-reference/platform/database-architecture.md`
---
## Wrap-up
- [ ] `b-reference/frontend-nyla/formgenerator.md`: grouping section
- [ ] `b-reference/gateway/architecture.md`: `PaginationParams`/`PaginatedResponse` extension
- [ ] TOPICS.md checked
- [ ] This document moved to `z-archive/`

@ -1,6 +1,6 @@
<!-- status: build -->
<!-- started: 2026-04-16 -->
<!-- lastReviewed: 2026-04-21 -->
<!-- lastReviewed: 2026-04-24 -->
<!-- component: gateway | platform | frontend-nyla -->
# Unified Knowledge Indexing — One RAG Corpus for All Platform Information
@ -15,7 +15,7 @@
| **Teil 3** | **Feature injection** split into **retrieval** (agent + `buildAgentContext`) vs **corpus** (`indexFile`); **matrix** per `modules/features/*` product; real **gaps** vs false “non-injection”. |
| **Implementation phases · Ziele · AC · Testplan** | Rollout, explicit non-goals, acceptance criteria, verification. |
**Single sentence summary:** Keep **retrieval** on **`AgentService`**; unify **when and how** the shared **`interfaceDbKnowledge`** corpus is **filled** (routes, **user connections** / integrations, features, snapshots) behind one **ingestion contract**, without assuming every product uses the workspace agent.
**Single sentence summary:** Keep **retrieval** on **`AgentService`**; unify **when and how** the shared **`interfaceDbKnowledge`** corpus is **filled** (routes, **user connections**, **feature commit points**) behind one **ingestion contract**. **Current roadmap scope:** user-connection lifecycle (**P1a/P1b**), **daily refresh** to close the post-connect delta gap (**P1c**), **explicit user consent + per-connection ingestion preferences** (incl. optional **neutralization**) in **frontend + API**, then **scalable event bus** (**P3**). **Out of current roadmap:** standalone **profile/mandate snapshot** ingestion (former roadmap **P2** — content remains in Teil 2.3 as future option only).
## Beschreibung und Kontext
@ -150,7 +150,7 @@ The first end-to-end AC4 test on a 500-page PDF revealed **three** independent b
2. **Pre-upserts must preserve `_ingestion` metadata and the `indexed` status.** `routeDataFiles._autoIndexFile` persisted a fresh `FileContentIndex` from the pre-scan **before** calling `requestIngestion`, overwriting `structure._ingestion.hash` and `status="indexed"` from any prior successful run. The duplicate check saw a row with empty metadata and re-ran the whole embedding stage. **Rule:** any upsert on the idempotency row taken outside `requestIngestion` MUST read the existing row first and merge forward both `_ingestion` and (where applicable) the terminal `indexed` status.
3. **Extraction-pipeline defaults must preserve granularity for RAG.** `ExtractionOptions.mergeStrategy` defaulted to concatenating every text `ContentPart` into one blob, collapsing a 500-page PDF into a single chunk whose embedding is a blurred average of the whole document — unusable for targeted retrieval. **Rule:** every ingestion lane passes `mergeStrategy=None` explicitly until the default itself can be safely flipped after auditing non-RAG callers. (Tests: `tests/unit/services/test_extraction_merge_strategy.py`.)
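Rule 2 above (read the existing row and merge forward before any pre-upsert) can be sketched as a pure merge function. This is an illustration on plain dictionaries, not the actual `FileContentIndex` model; `mergeForwardIndexRow` is a hypothetical name, and keeping `indexed` unconditionally is a simplification of the "where applicable" clause.

```python
from typing import Any, Dict, Optional


def mergeForwardIndexRow(existing: Optional[Dict[str, Any]], fresh: Dict[str, Any]) -> Dict[str, Any]:
    """Build the row to upsert: fresh pre-scan data plus preserved idempotency state."""
    merged = dict(fresh)
    merged["structure"] = dict(fresh.get("structure") or {})  # copy, so fresh stays untouched
    if existing is None:
        return merged
    priorIngestion = (existing.get("structure") or {}).get("_ingestion")
    if priorIngestion is not None:
        # Keep hash and provenance from the last successful run.
        merged["structure"].setdefault("_ingestion", priorIngestion)
    if existing.get("status") == "indexed":
        # Do not downgrade a terminal status with a pre-scan status.
        merged["status"] = "indexed"
    return merged


existingRow = {"status": "indexed", "structure": {"pages": 500, "_ingestion": {"hash": "abc"}}}
freshRow = {"status": "extracted", "structure": {"pages": 500}}
mergedRow = mergeForwardIndexRow(existingRow, freshRow)
```

With the merged row in place, the later duplicate check still sees the stored hash and can skip the embedding stage.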
**Deferred to P1** (uncovered during P0, not blocking AC1-AC5):
**Deferred (ingestion idempotency hardening)** (uncovered during P0, not blocking AC1-AC5; naming here is **not** the same milestone as **P1 user-connection hooks** below):
- **In-flight duplicate detection.** The current duplicate check only matches when `status == "indexed"`, so two nearly-simultaneous calls for the same `sourceId` both run full embedding. Fix candidates: accept `status ∈ {"extracted", "embedding", "indexed"}` with matching hash as "already in progress", or a per-`sourceId` `asyncio.Lock` in `KnowledgeService`.
- **Pre-extraction byte-hash shortcut.** `requestIngestion`'s duplicate check runs **after** extraction, so re-indexing a 1.6 MB PDF still spends ~15 s in `runExtraction` before the content hash is computed. The file-bytes SHA already exists in `interfaceDbManagement` for upload-dedup — a short-circuit in `_autoIndexFile` (and symmetric paths) could skip extraction entirely for an unchanged file.
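The second fix candidate for in-flight duplicates, a per-`sourceId` `asyncio.Lock`, could look roughly like the sketch below. It is a single-process illustration under assumptions: `IngestionGate` and `ingestOnce` are hypothetical names, the in-memory hash map stands in for the persisted `_ingestion.hash`, and the status strings only echo the log events named in this document.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, DefaultDict, Dict, List


class IngestionGate:
    """Serializes ingestion per sourceId so a second caller sees the first caller's result."""

    def __init__(self) -> None:
        self._locks: DefaultDict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
        self._indexedHash: Dict[str, str] = {}  # stand-in for the persisted content hash

    async def ingestOnce(self, sourceId: str, contentHash: str, embed: Callable[[], Awaitable[None]]) -> str:
        async with self._locks[sourceId]:
            if self._indexedHash.get(sourceId) == contentHash:
                return "skipped.duplicate"
            await embed()  # the expensive embedding stage runs at most once per hash
            self._indexedHash[sourceId] = contentHash
            return "indexed"


async def _demo() -> List[str]:
    gate = IngestionGate()
    embedCalls: List[str] = []

    async def fakeEmbed() -> None:
        embedCalls.append("run")
        await asyncio.sleep(0.01)

    results = await asyncio.gather(
        gate.ingestOnce("file-1", "hash-a", fakeEmbed),
        gate.ingestOnce("file-1", "hash-a", fakeEmbed),
    )
    return list(results) + [f"embedCalls={len(embedCalls)}"]


demoResults = asyncio.run(_demo())
```

A lock like this only helps inside one process; with several workers, the status-based candidate (treating `extracted` or `embedding` with a matching hash as "in progress") would still be needed.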
@ -231,12 +231,14 @@ The first end-to-end AC4 test on a 500-page PDF revealed **three** independent b
**Email and messaging (Outlook + Gmail via Microsoft / Google user connections) — shared cautions**
- Default tiers: **metadata only** → **snippet** → **full body** → **attachments** (most expensive / sensitive).
- Default tiers: **metadata only** → **snippet** → **full body** → **attachments** (most expensive / sensitive). **Product default** vs **user override** is defined in **§2.6** (per-connection mail depth + attachments).
- Apply **quoted-thread stripping**, **signature removal**, and **max body length** before embed.
- **Legal hold / retention:** ingestion must respect mandate **delete** and **export** rules; **disconnecting** or **revoking** the mail **connection** must **purge** mail-sourced chunks.
### 2.3 “Account and stuff” — what to index vs. what never to index
**Roadmap note:** Standalone **profile/mandate snapshot** ingestion (formerly roadmap **P2**) is **out of current scope**; the table below remains the **target model** when that work is picked up again.
**Goal:** Give agents **useful, permission-safe** context (“who is this user in this mandate”, “which features are on”, “preferred language”) without creating a **second copy of sensitive credentials** in the vector store.
| Data | Typical treatment |
@ -256,6 +258,60 @@ Snapshots should be stored with the same **scope model** as file chunks (`person
**Storage (already implemented — not redesigned here):** The platform already uses **one** knowledge persistence stack: **`FileContentIndex`** (incl. `mandateId`, `scope`, status) and **`ContentChunk`** (pgvector embeddings, `fileId`, `userId`, `featureInstanceId`, `contextRef`, optional **`chunkMetadata`**), accessed via **`interfaceDbKnowledge`**. Chunks are **file-anchored** today; **connection- / source-specific** provenance (e.g. `connectionId`, external ids) can ride in **`contextRef` / `chunkMetadata`** until optional schema extensions are justified. **This document targets ingestion triggers and lifecycles**, not a second corpus or a duplicate storage model.
### 2.5 Lifecycle gap and daily refresh (roadmap **P1c**, v1)
**Gap:** After a successful connect, **bootstrap** runs once (initial fill). **New** mail, files, or tasks that arrive **after** that run are **not** indexed automatically until a **delta** path exists (webhook, `historyId` / `changes` cursors, etc. — see Teil **2.1** row *“Sync for an existing connection”*).
**Pragmatic mitigation (deliberately simple):** A **daily scheduler** (e.g. once per night, staggered by tenant/load) re-invokes the same **bootstrap walkers** for every **active** `UserConnection` that has **knowledge ingestion enabled** (see **§2.6**). Idempotency + fast-path skips unchanged items; **new** and **changed** items are picked up.
- **Pros:** No new external dependencies (Pub/Sub, watch renewal) in v1; fits existing BackgroundJob + cron/feature-flag patterns.
- **Con:** Data can lag up to **~24 h** before it appears in RAG — acceptable for v1 product choice.
- **Later (without replacing P1c):** Add per-authority **delta APIs** (Gmail `users.history.list`, Drive `changes.list`, ClickUp tighter polling) to reduce latency and API cost.
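One way to stagger the nightly run is a deterministic offset per connection, so each connection refreshes at the same minute every night without a central schedule table. This is a sketch under assumptions: hashing the `connectionId` is an illustrative choice (the text only says "staggered by tenant/load"), the four-hour window is invented, and the dictionary keys other than `knowledgeIngestionEnabled` are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta
from typing import Any, Dict, List


def refreshOffset(connectionId: str, windowMinutes: int = 240) -> timedelta:
    """Stable pseudo-random offset inside the nightly window, derived from the id."""
    digest = hashlib.sha256(connectionId.encode("utf-8")).digest()
    minute = int.from_bytes(digest[:4], "big") % windowMinutes
    return timedelta(minutes=minute)


def nextRefresh(connectionId: str, windowStart: datetime, windowMinutes: int = 240) -> datetime:
    """Time at which this connection's bootstrap walkers are re-invoked tonight."""
    return windowStart + refreshOffset(connectionId, windowMinutes)


def selectForRefresh(connections: List[Dict[str, Any]]) -> List[str]:
    """Only active connections with knowledge ingestion enabled are refreshed."""
    return [
        c["id"]
        for c in connections
        if c.get("active") and c.get("knowledgeIngestionEnabled")
    ]


windowStart = datetime(2026, 4, 25, 1, 0)
candidates = [
    {"id": "c1", "active": True, "knowledgeIngestionEnabled": True},
    {"id": "c2", "active": True, "knowledgeIngestionEnabled": False},
    {"id": "c3", "active": False, "knowledgeIngestionEnabled": True},
]
dueIds = selectForRefresh(candidates)
```

Because the walkers are idempotent, a run that fires twice or late only costs API calls, not duplicate chunks.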
### 2.6 User consent, frontend flow, and per-connection preferences (incl. neutralization)
**Goal:** The user **explicitly** chooses whether this connection may feed the **shared knowledge store** used for AI/RAG — and **how much**. Without consent, **no** knowledge bootstrap is started for that connection (OAuth may still unlock other product features; that split must be obvious in the UI).
**Frontend (`frontend_nyla`):** extend the **add connection** flow (and later **connection settings**) with the dialog and controls below; persist choices via Gateway API **before** or **when** triggering knowledge ingestion.
#### UX when adding a connection
1. User starts OAuth as today.
2. **Before** or **immediately after** successful authorization: a **dialog** that clearly separates “establish connection” from “add to knowledge base”.
3. **No:** Connection remains usable for other features; either skip `KnowledgeIngestionConsumer.onConnectionEstablished` for the knowledge lane or persist `knowledgeIngestionEnabled=false` and never schedule walkers.
4. **Yes:** Show **advanced settings** (second step or accordion) per **settings catalog** below; persist **per `connectionId`** (or a dedicated preferences row); only then enqueue **bootstrap** (and later **P1c** refresh) with allowed surfaces and tiers.
**Suggested copy (DE — pick one tone / A-B test):**
- **Formal:** „Möchten Sie Inhalte aus dieser Verbindung in Ihre **Wissensdatenbank** übernehmen? KI-Funktionen können dann passender auf **Ihre** Dokumente und Nachrichten Bezug nehmen — **nur** mit Ihrer ausdrücklichen Zustimmung und in dem Umfang, den Sie festlegen.“
- **Approachable:** „Sollen wir aus dieser Verbindung ausgewählte Inhalte sicher in Ihre **persönliche Wissensdatenbank** legen, damit die KI für Sie **besser helfen** kann? Sie entscheiden **was** und **wie stark anonymisiert** — und können das jederzeit in den Einstellungen ändern oder die Daten entfernen.“
Mirror in EN if the UI is bilingual.
#### Minimum settings catalog (all **per connection** where technically applicable)
| Layer | Setting | Meaning |
|--------|-----------|---------|
| **Master** | **Knowledge ingestion for this connection** | `off` / `on`: gates bootstrap + **§2.5** (P1c) refresh for the knowledge store. |
| **Protection** | **Neutralize / anonymize before embedding** | When `on`: apply the same (or stricter) **neutralization** pipeline as for uploads (`FileItem.neutralize` / platform rules) to connector-sourced text **before** chunking — names, e-mail addresses, phone-like patterns, IBAN-like patterns, per policy. User-facing label **„anonymisiert“** maps to this pipeline (not a cryptographic guarantee). |
| **Mail** (Outlook / Gmail) | **Content depth** | At least: **metadata only** (subject, participants, dates — no body) / **snippet** / **full cleaned body** (after `cleanEmailBody` and caps). |
| **Mail** | **Index attachments** | `off` / `on` (with size/type caps). |
| **Files** (Drive / SharePoint / OneDrive) | **Index binary files** | `off` / `on`; optional **MIME allowlist** (Office/PDF/text only) as a simplified UX preset. |
| **ClickUp** | **Scope** | `titles only` / `title + description` / `+ comments` / optional `attachments`. |
| **Microsoft** | **Parity** | Same dimensions where Graph surfaces mirror Google (mailbox / drive-like). |
| **General** | **Time window** | “Only index items from the last **N** days” (aligns with existing walker caps; slider with a sensible max). |
| **General** | **Help: what RAG is not** | Short explainer: not real-time mail; delay until next scheduled run (**§2.5**). |
**Optional power-user toggles (same screen, collapsed):** per authority **which surfaces** ingest (e.g. **Google:** Gmail on/off, Drive on/off; **Microsoft:** SharePoint on/off, Outlook on/off — when product exposes both). Reduces accidental over-breadth without extra wizard steps.
**Backend consequence:** Walkers read persisted preferences for `connectionId` each run and **filter** surfaces and payload tiers **before** `indexFile`. On preference change, product decision: trigger **re-sync**, or apply only to **new** items — document the chosen rule.
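The filtering step described in the backend consequence could be sketched as follows: the walker loads the preferences for a `connectionId` and reduces each item to the allowed payload tier before anything reaches `indexFile`. Apart from `knowledgeIngestionEnabled`, every field and function name here is a hypothetical illustration of the settings catalog, not an existing schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Set


@dataclass
class IngestionPreferences:
    """Hypothetical per-connection preferences row mirroring the settings catalog."""
    knowledgeIngestionEnabled: bool = False
    neutralize: bool = False
    mailDepth: str = "metadata"  # "metadata" | "snippet" | "fullBody"
    indexAttachments: bool = False
    surfaces: Set[str] = field(default_factory=set)  # e.g. {"mail", "files"}


def mailPayload(message: Dict[str, Any], prefs: IngestionPreferences) -> Optional[Dict[str, Any]]:
    """Return what may be indexed for one mail, or None if nothing may be indexed."""
    if not prefs.knowledgeIngestionEnabled or "mail" not in prefs.surfaces:
        return None
    payload: Dict[str, Any] = {
        "subject": message.get("subject", ""),
        "participants": message.get("participants", []),
        "date": message.get("date"),
    }
    if prefs.mailDepth == "snippet":
        payload["text"] = message.get("snippet", "")
    elif prefs.mailDepth == "fullBody":
        payload["text"] = message.get("body", "")
    if prefs.indexAttachments:
        payload["attachments"] = message.get("attachments", [])
    return payload


sampleMail = {
    "subject": "Offer",
    "participants": ["a@x.ch"],
    "date": "2026-04-24",
    "snippet": "Hi",
    "body": "Hi there, long text",
    "attachments": ["offer.pdf"],
}
snippetPrefs = IngestionPreferences(knowledgeIngestionEnabled=True, mailDepth="snippet", surfaces={"mail"})
snippetPayload = mailPayload(sampleMail, snippetPrefs)
```

Defaulting every field to the most restrictive value means a missing preferences row behaves like "no consent".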
#### Neutralization when the user opts in
- **Ingestion on** + **neutralization on:** After content is obtained (virtual text or extraction output), apply the **neutralization stage** **before** chunking/embedding; **that** text is what gets embedded.
- **Neutralization off:** Still apply baseline **hygiene** where already defined (e.g. `cleanEmailBody` for quotes/signatures) — hygiene **≠** full PII removal.
- **Compliance copy:** If the user chooses **full body**, state clearly that **perfect** anonymization is not guaranteed without neutralization.
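To make the ordering concrete (neutralize first, then chunk and embed), here is a deliberately simple regex-based stand-in. It is not the platform's neutralization pipeline: the three patterns and the placeholder tokens are assumptions chosen for illustration, and real PII detection needs considerably more than this.

```python
import re

# Illustrative patterns only; each is intentionally loose.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_IBAN = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")
_PHONE = re.compile(r"(?<!\w)\+?\d[\d ()/-]{6,}\d(?!\w)")


def neutralizeText(text: str) -> str:
    """Replace e-mail, IBAN-like and phone-like spans with placeholders."""
    text = _EMAIL.sub("[EMAIL]", text)
    text = _IBAN.sub("[IBAN]", text)  # before phones, so IBAN digits are not split up
    text = _PHONE.sub("[PHONE]", text)
    return text


def prepareForEmbedding(text: str, neutralize: bool) -> str:
    """The returned text is what would be chunked and embedded."""
    return neutralizeText(text) if neutralize else text


sample = "Contact anna@example.com or +41 44 123 45 67, IBAN CH93 0076 2011 6238 5295 7."
neutralized = prepareForEmbedding(sample, neutralize=True)
untouched = prepareForEmbedding(sample, neutralize=False)
```

The phone pattern also matches other long digit runs such as ISO dates, which illustrates why the user-facing label should not be read as a guarantee.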
---
## Teil 3 — Feature injection: retrieval vs corpus, agent loop, and real gaps
@ -338,11 +394,11 @@ Then add **`requestIngestion` / `indexFile`** at the **feature commit point** (o
3. **Unified façade** — one ingestion API; avoid a second embedding pipeline.
4. **Purge** — tie to **`fileId`**, business key, or future connector purge keys on revoke/delete.
### 3.7 Phasing
### 3.7 Phasing (feature matrix — **not** the same numbering as roadmap **P1c/P1d/P3** above)
- **P0:** For **each** row in §3.3, confirm **retrieval** vs **corpus** paths; document “satisfied by agent+upload+tools” vs “needs feature hook.”
- **P1:** Implement **feature-native corpus** for one domain with a clear §3.5 gap (e.g. **trustee** entity text, **teamsbot** persisted transcript).
- **P2:** **Chatbot** architecture decision: integrate **`serviceKnowledge`** or keep parallel retrieval; if integrate, add explicit **corpus** rules for config/FAQ.
- **FM0:** For **each** row in §3.3, confirm **retrieval** vs **corpus** paths; document “satisfied by agent+upload+tools” vs “needs feature hook.”
- **FM1:** Implement **feature-native corpus** for one domain with a clear §3.5 gap (e.g. **trustee** entity text, **teamsbot** persisted transcript).
- **FM2:** **Chatbot** architecture decision: integrate **`serviceKnowledge`** or keep parallel retrieval; if integrate, add explicit **corpus** rules for config/FAQ.
---
@ -350,12 +406,41 @@ Then add **`requestIngestion` / `indexFile`** at the **feature commit point** (o
Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog), and **Teil 3.7** (feature matrix and feature-native corpus pilots). **P0** overlaps **Teil 3.7 FM0** (complete the per-feature matrix before large builds).
**Authority rollout (2026-04-24):** The **user-connection ingestion lane** (bootstrap + purge tied to **`UserConnection`**) is delivered **per OAuth authority**: **`msft` (P1a)**, **`google`** + **`clickup` (P1b)** — same consumer, dispatcher fan-out, purge-by-`connectionId`, and unit tests for walkers + consumer. **Next product slices:** **P1c** (daily refresh, **§2.5**), **consent + per-connection preferences + frontend** (**§2.6**), then **P3** (event bus at scale).
| Phase | Outcome |
|-------|---------|
| **P0 — Façade + idempotency** *(done, 2026-04-21)* | Single `requestIngestion` / `getIngestionStatus` entry point on `KnowledgeService` with content-hash idempotency, provenance in `structure._ingestion`, and structured logging (`ingestion.queued` / `ingestion.indexed` / `ingestion.skipped.duplicate` / `ingestion.failed`). All prior `indexFile` call sites now route through the façade: `routeDataFiles._autoIndexFile`, `commcoach/serviceCommcoachIndexer.indexSessionData`, `serviceAgent/coreTools/_workspaceTools.readFile`, `serviceAgent/coreTools/_documentTools.describeImage`. Agent tools no longer carry on-demand extraction + ingestion fallbacks — they are pure consumers of the knowledge store. **Teil 3.3** matrix audited. Three implementation bugs fixed during verification: stable content hash, pre-upsert `_ingestion` preservation, `mergeStrategy=None` for per-page granularity (see **§1.4 Implementation pitfalls**). |
| **P1 — User-connection hooks** *(done, 2026-04-21)* | `connection.established` / `connection.revoked` callbacks emitted from every OAuth callback (`routeSecurityMsft`, `routeSecurityGoogle`, `routeSecurityClickup`) and from `routeDataConnections.disconnect_service` / `delete_connection`; the `ConnectionStatus.INACTIVE` enum bug (the value did not exist) was fixed by switching the disconnect path to `ConnectionStatus.REVOKED`. A new central `KnowledgeIngestionConsumer` (`subConnectorIngestConsumer.py`, registered in `app.py` lifespan) maps `established` to a `connection.bootstrap` BackgroundJob and `revoked` to a synchronous purge through `KnowledgeService.purgeConnection` → `interfaceDbKnowledge.deleteFileContentIndexByConnectionId`. `FileContentIndex` gained `connectionId` and `sourceKind` columns (auto-applied by `connectorDbPostgre`); `IngestionJob` carries both end-to-end so every chunk is purgeable by connection. **All three OAuth authorities are wired up** with one bootstrap module per service: `subConnectorSyncSharepoint.py` (`sourceKind="sharepoint_item"`, `eTag` as `contentVersion`, walks sites with the `@odata.nextLink` paginated `SharepointAdapter.browse`), `subConnectorSyncOutlook.py` (virtual `outlook_message` documents — header / snippet / cleaned body via the shared `cleanEmailBody` utility — with `changeKey` revisions and optional `outlook_attachment` child jobs), `subConnectorSyncGdrive.py` (`gdrive_item`, `modifiedTime` revisions, recursive walk from My Drive root with depth/age caps and Google-Doc export support inherited from `DriveAdapter.download`), `subConnectorSyncGmail.py` (virtual `gmail_message` documents with `historyId` revisions, walks `INBOX + SENT` by default, MIME-tree body extraction prefers `text/plain` and falls back to `text/html`, optional `gmail_attachment` child jobs), `subConnectorSyncClickup.py` (virtual `clickup_task` documents with `date_updated` revisions, walks teams → spaces → folder/folderless lists → tasks with workspace and per-workspace list caps, header carries name/status/list/space/assignees/tags/url so search prompts retrieve task context without a live API call). The dispatcher `_bootstrapJobHandler` fans out per authority (msft → sharepoint+outlook in parallel, google → drive+gmail in parallel, clickup → tasks); unsupported authorities log `ingestion.connection.bootstrap.skipped reason=unsupported_authority`. Structured-log schema (started / progress / done / purged) defined in **§ Structured ingestion logs** below. Eight new unit tests (purge, consumer dispatch + per-authority routing, `cleanEmailBody`, bootstrapSharepoint, bootstrapOutlook, bootstrapGmail, bootstrapGdrive, bootstrapClickup) lock the contract. **Retrieval threshold calibration (2026-04-21):** during UI verification `buildAgentContext` returned `instanceChunks=0` despite 640 correctly-indexed rows — root cause was overly aggressive `minScore` thresholds (Layer 1 `0.65`, Layer 1.5 `0.55`, Layer 3 `0.70`) versus realistic `text-embedding-3-small` cosine similarities in the `0.30`–`0.55` range. All three thresholds lowered to `0.35`; agent then correctly synthesized answers from indexed Outlook/SharePoint content without resorting to live tools. |
| **P2 — Profile & mandate snapshots** | Allowlisted fields only (**Teil 2.3**); regenerate on events; explicit admin toggle per mandate if needed. |
| **P3 — Event bus** | Move direct calls to async consumer where load requires it (**Teil 2.4** scalable target). |
| **P1a — User-connection hooks (Microsoft `msft`)** *(done, 2026-04-21)* | **`connection.established`** / **`connection.revoked`** emitted from **Microsoft** data-OAuth success paths and from **disconnect/delete** when the row is **`msft`** (incl. **`ConnectionStatus.REVOKED`** fix where **`INACTIVE`** was invalid). Central **`KnowledgeIngestionConsumer`** (`subConnectorIngestConsumer.py`, **`app.py`** lifespan) maps **`established`** → **`connection.bootstrap`** BackgroundJob and **`revoked`** → synchronous **`KnowledgeService.purgeConnection`** → **`interfaceDbKnowledge.deleteFileContentIndexByConnectionId`**. **`FileContentIndex.connectionId`** + **`sourceKind`** (and **`IngestionJob`** carrying both) make connector-sourced rows purgeable. **Bootstrap modules live for Microsoft:** **`subConnectorSyncSharepoint.py`** (`sourceKind="sharepoint_item"`, **`eTag`** as `contentVersion`, **`SharepointAdapter.browse`** with **`@odata.nextLink`** pagination) and **`subConnectorSyncOutlook.py`** (virtual **`outlook_message`** docs — header / snippet / cleaned body via **`cleanEmailBody`**, **`changeKey`** revisions, optional **`outlook_attachment`** child jobs). Dispatcher **`_bootstrapJobHandler`** runs **SharePoint + Outlook in parallel** for **`msft`**. Structured logs: **§ Structured ingestion logs**. **Retrieval threshold calibration (2026-04-21):** **`buildAgentContext`** **`minScore`** layers lowered to **`0.35`** so **`text-embedding-3-small`** matches real cosine scores; validated on **Outlook/SharePoint-indexed** content. **Tests (P1a):** purge, consumer **msft** dispatch, **`cleanEmailBody`**, **`bootstrapSharepoint`**, **`bootstrapOutlook`**. |
| **P1b — User-connection hooks (Google + ClickUp)** *(done, 2026-04)* | Parity with **`msft`**: **`routeSecurityGoogle`** / **`routeSecurityClickup`** call **`KnowledgeIngestionConsumer.onConnectionEstablished`** after token save; **`routeDataConnections`** disconnect/delete call **`onConnectionRevoked`** for **all** authorities. **`_bootstrapJobHandler`** fans out **google → `bootstrapGdrive` + `bootstrapGmail`** in parallel and **clickup → `bootstrapClickup`**. Walkers: `subConnectorSyncGdrive.py`, `subConnectorSyncGmail.py`, `subConnectorSyncClickup.py` + `subTextClean.py`. Unit tests: `test_bootstrap_gdrive.py`, `test_bootstrap_gmail.py`, `test_bootstrap_clickup.py`, extended `test_knowledge_ingest_consumer.py`. |
| **P1c — Connection refresh (lifecycle v1)** *(next)* | **Daily** (or nightly) **scheduled** re-run of the same bootstrap walkers for connections with **knowledge ingestion enabled** (**§2.6**). Reuses idempotency + fast-path; closes the **post-connect delta gap** without webhooks in v1. Observability: same log family as bootstrap; optional `event` suffix or `reason=scheduled_refresh` for shippers. |
| **P1d — Consent + preferences + UI** *(implemented; see P1c / P1d checklist)* | Persist **§2.6** settings **per `connectionId`**; gate **`onConnectionEstablished`** / P1c jobs on user choice; **`frontend_nyla`** connection wizard + settings screen; walkers honor mail/file/ClickUp depth and **neutralization** flag. |
| **~~P2 — Profile & mandate snapshots~~** | **Removed from active roadmap** (focus: connections + feature corpus + scale). Target content remains documented in **§2.3** for a future re-entry when needed. |
| **P3 — Event bus** | Move direct calls to async consumer where load requires it (**Teil 2.4** scalable target). Remains in scope. |
### P1b checklist *(completed — kept for audit trail)*
1. **`routeSecurityGoogle`:** after successful **data** OAuth, enqueue **same** ingestion consumer path as Microsoft (pass **`connectionId`**, **`AuthAuthority.google`**, mandate/user scope).
2. **`routeSecurityClickup`:** after successful OAuth / token persistence, same.
3. **`routeDataConnections`:** verify **disconnect_service** / **delete_connection** emit **revoke** (or call **`purgeConnection`**) for **google** and **clickup** rows, not only **msft**.
4. **`_bootstrapJobHandler`:** remove any **“unsupported_authority”** skip for **`google`** / **`clickup`** once walkers are registered; keep skip only for **future** authorities.
5. **Quality bar:** T10/T12–T15 in the testplan — extend from **Microsoft-only** assumptions to **all three** **`routeDataConnections`** OAuth authorities.
### P1c / P1d checklist *(next engineering slices)*
1. **P1c:** BackgroundJob or cron entry; feature flag; per-tenant stagger; only connections with **knowledge ingestion = on**; metrics on `indexed` vs `skippedDup` per run.
2. **P1d ✅ — implemented:**
- [x] **`UserConnection`** extended with `knowledgeIngestionEnabled: bool` (default `False` = strict opt-in) and `knowledgePreferences: Optional[Dict]` (`schemaVersion=1`); DB auto-migration adds columns on startup.
- [x] **`routeDataConnections` `create_connection`** accepts `knowledgeIngestionEnabled` + `knowledgePreferences` in request body and persists them before returning.
- [x] **OAuth callbacks** (`routeSecurityGoogle`, `routeSecurityMsft`, `routeSecurityClickup`) gate `callbackRegistry.trigger("connection.established", …)` on `connection.knowledgeIngestionEnabled`; emit structured log `ingestion.connection.bootstrap.skipped reason=consent_disabled` when disabled.
- [x] **`_bootstrapJobHandler`** defensive re-check: loads connection via `getUserConnectionById` and no-ops if flag was disabled after OAuth (race protection).
- [x] **`IngestionJob.neutralize: bool`** added; `requestIngestion` + `_indexFileInternal` thread it through; for `sourceKind != "file"` the flag drives `_shouldNeutralize` directly; for `sourceKind == "file"` the `FileItem.neutralize` column remains authoritative.
- [x] **`subConnectorPrefs.py`** — `loadConnectionPrefs(connectionId)` helper + `ConnectionIngestionPrefs` dataclass with safe defaults for all §2.6 keys.
- [x] **All five walkers** (Gmail, GDrive, ClickUp, Outlook, SharePoint) load prefs at bootstrap start; limits structs gain `mailContentDepth` + `neutralize` (mail walkers), `filesIndexBinaries` (Drive), `clickupScope` (ClickUp), and `neutralize` (all).
- [x] **Unit tests** (`test_p1d_consent_prefs.py` — 10 tests): consent gate no-op, prefs defaults + full mapping, Gmail depth modes (metadata/snippet/full), ClickUp scope (titles vs description).
- [x] **Frontend** (`frontend_nyla`): `AddConnectionWizard` 4-step modal (connector → consent → preferences → summary + OAuth); old three-button row replaced with single „Verbindung hinzufügen“ button; `createConnectionAndAuth` hook method; `KnowledgePreferences` type in `connectionApi.ts`.
**Default policy (document for deploy):** `knowledgeIngestionEnabled` defaults to `False` for all new connections. Existing connections (before P1d deploy) have the column `NULL`/`False` — **no bootstrap is triggered retroactively**. Users must explicitly opt in via the wizard or connection settings. If the team decides to migrate existing connections to `True`, a one-time migration script must be run and communicated via release note.
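
A minimal sketch of the consent gate implied by this policy, assuming the flag is read twice (once in the OAuth callback, once defensively when the bootstrap job starts). The function names are hypothetical; only `knowledgeIngestionEnabled` and the `NULL`/`False` semantics come from the text above.

```python
from typing import Optional


def is_ingestion_enabled(flag: Optional[bool]) -> bool:
    """NULL (None) and False both mean 'no consent'; only an explicit True opts in."""
    return flag is True


def should_run_bootstrap(flag_at_oauth: Optional[bool], flag_at_job_start: Optional[bool]) -> bool:
    """Both checks must pass: the callback gate and the re-check inside the job.

    The second check covers the race where consent is withdrawn after the job
    was queued but before it started.
    """
    return is_ingestion_enabled(flag_at_oauth) and is_ingestion_enabled(flag_at_job_start)
```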
---
@ -366,7 +451,8 @@ Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog)
- One **ingestion contract** for all features and connector lifecycles.
- Indexing **decoupled** from the agent loop (agents may still *invoke* tools that ultimately call ingestion, but ingestion must not *depend* on an agent run).
- **Explicit** handling of connection establishment, sync, and revocation.
- **Bounded** indexing of user/mandate context with a clear PII policy.
- **Bounded** indexing of user/mandate context with a clear PII policy.
- **Explicit user consent** and **per-connection** ingestion preferences (incl. optional **neutralization**) before connector content enters the knowledge store (**§2.6**).
**Explicitly NOT:**
@ -379,7 +465,8 @@ Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog)
## Affected modules (expected)
- **Gateway:** `serviceKnowledge`, file upload routes, connector OAuth handlers, sync workers, possibly new `serviceKnowledgeIngest` or package under `modules/serviceCenter/services/`.
- **Interfaces:** `interfaceDbKnowledge` extensions for source metadata if needed.
- **Interfaces:** `interfaceDbKnowledge` extensions for source metadata if needed; **`interfaceDbApp`** (or adjacent) for **per-`connectionId`** ingestion preferences from **§2.6**.
- **Frontend:** `frontend_nyla` — connection wizard + connection detail settings (consent, depth toggles, neutralization, time window).
- **Wiki / Reference:** `b-reference/gateway/ai-agent.md` (ingestion vs. retrieval) after implementation.
---
@ -388,19 +475,19 @@ Phases align with **Teil 1** (façade), **Teil 2** (connector + trigger catalog)
| Topic | Options |
|-------|----------|
| **Email bodies** | Full text vs. summary-only vs. attachment-only |
| **Email bodies** | Default product stance is **user-configurable per connection** (**§2.6** table: metadata / snippet / full cleaned body); mandate policy may still cap max tier. |
| **Multi-tenant isolation audits** | Periodic job to verify chunk `mandateId` matches connection |
| **Cost caps** | Per-mandate embedding budget; defer large backfills |
| **Neutralization** | Mandatory for certain `sourceKind`s even when not file-upload |
| **Neutralization** | **User opt-in** per connection (**§2.6**); optional **mandate floor** (“never below snippet+neutralize for mail”) remains a separate governance decision. |
| **Provenance shape** | First-class DB columns vs **documented `chunkMetadata` keys** for `connectionId`, external id, revision (must support **Teil 2** purge rules). |
| **In-flight duplicate handling** | Accept `status ∈ {"extracted","embedding","indexed"}` with matching hash as in-progress (cheap, lossy under failure) **vs** per-`sourceId` `asyncio.Lock` in `KnowledgeService` (strict, requires singleton) — see **§1.4 Deferred to P1**. |
| **Pre-extraction dedup shortcut** | Short-circuit `_autoIndexFile` via the file-bytes SHA in `interfaceDbManagement` before running `runExtraction` (~15 s saved per re-index of a large PDF) — see **§1.4 Deferred to P1**. |
| **In-flight duplicate handling** | Accept `status ∈ {"extracted","embedding","indexed"}` with matching hash as in-progress (cheap, lossy under failure) **vs** per-`sourceId` `asyncio.Lock` in `KnowledgeService` (strict, requires singleton) — see **§1.4 Deferred (ingestion idempotency hardening)**. |
| **Pre-extraction dedup shortcut** | Short-circuit `_autoIndexFile` via the file-bytes SHA in `interfaceDbManagement` before running `runExtraction` (~15 s saved per re-index of a large PDF) — see **§1.4 Deferred (ingestion idempotency hardening)**. |
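
The first option in the in-flight duplicate row (treat a matching hash in any of the three listed states as already handled) could be sketched like this. It is an illustration under assumptions: SHA-256 is chosen here only as an example hash, and the `existing` record shape with `hash` and `status` keys is hypothetical.

```python
import hashlib
from typing import Optional

# States that count as "already handled or in progress" for a matching hash.
IN_PROGRESS_OR_DONE = {"extracted", "embedding", "indexed"}


def content_hash(content: bytes) -> str:
    """Stable digest of the content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(content).hexdigest()


def is_duplicate(existing: Optional[dict], new_hash: str) -> bool:
    """True when an existing record has the same hash and an accepted status.

    Cheap but lossy: if a previous run crashed while in 'embedding', the
    record still looks in progress and the new request is skipped.
    """
    if existing is None:
        return False
    return existing.get("hash") == new_hash and existing.get("status") in IN_PROGRESS_OR_DONE
```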
---
## Structured ingestion logs (P1 schema)
The connection-lifecycle lane emits the following structured log events. Each event is a single `logger.info` / `.warning` / `.error` call with a stable `extra={"event": ...}` field so downstream log shippers can route on `event` without parsing the message string.
The connection-lifecycle lane emits the following structured log events. **`part`** values **`sharepoint`**, **`outlook`**, **`gdrive`**, **`gmail`**, and **`clickup`** are all **implemented** for bootstrap; **P1c** may add the same events with a distinguishable `reason` / `jobType` for **scheduled refresh** (exact field TBD in implementation). Each event is a single `logger.info` / `.warning` / `.error` call with a stable `extra={"event": ...}` field so downstream log shippers can route on `event` without parsing the message string.
| `event` | Severity | Emitter | Required `extra` keys | Meaning |
|---------|----------|---------|------------------------|---------|
@ -409,7 +496,7 @@ The connection-lifecycle lane emits the following structured log events. Each ev
| `ingestion.connection.bootstrap.progress` | info | bootstrap walkers | `connectionId`, `part`, `processed`, `skippedDup`, `failed` | Heart-beat every ~50 items so long-running runs are observable. |
| `ingestion.connection.bootstrap.done` | info | bootstrap walkers + façade-level totals | `connectionId`, `part`, `indexed`, `skippedDup`, `skippedPolicy`, `failed`, `durationMs` (Outlook/Gmail add `attachmentsIndexed`; SharePoint/Drive add `bytes`; ClickUp adds `workspaces` + `lists`) | Walker finished cleanly. |
| `ingestion.connection.bootstrap.failed` | error | `_bootstrapJobHandler` | `part`, `connectionId`, `error` | One bootstrap part raised — recorded but the other parts still complete. |
| `ingestion.connection.bootstrap.skipped` | info | `_bootstrapJobHandler` | `connectionId`, `authority`, `reason` (`unsupported_authority`) | Authority has no bootstrap module registered (e.g. a future provider). |
| `ingestion.connection.bootstrap.skipped` | info | `_bootstrapJobHandler` (incl. its defensive consent re-check) + OAuth callbacks | `connectionId`, `authority`, `reason` (`unsupported_authority` or `consent_disabled`) | Authority has no bootstrap module registered (e.g. a future provider) **or** user has not consented (`knowledgeIngestionEnabled=False`). |
| `ingestion.connection.purged` | info | `_onConnectionRevoked` | `connectionId`, `authority`, `reason`, `indexRows`, `chunks` | Knowledge purge for a revoked connection completed; numbers reflect the deleted rows. |
| `ingestion.connection.purged.failed` | error | `_onConnectionRevoked` | `connectionId`, `error` | Purge raised; the revoke event was still acknowledged upstream. |
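
As an illustration of the "stable `extra={"event": ...}` field" convention, a progress heartbeat might be emitted like this with Python's standard `logging` module. The logger name and both helper functions are hypothetical; the event name and keys are taken from the table above.

```python
import logging

logger = logging.getLogger("ingestion.sketch")  # hypothetical logger name


def build_progress_extra(connection_id: str, part: str, processed: int,
                         skipped_dup: int, failed: int) -> dict:
    """Assemble the required keys for the bootstrap progress event."""
    return {
        "event": "ingestion.connection.bootstrap.progress",
        "connectionId": connection_id,
        "part": part,
        "processed": processed,
        "skippedDup": skipped_dup,
        "failed": failed,
    }


def log_bootstrap_progress(connection_id: str, part: str, processed: int,
                           skipped_dup: int, failed: int) -> None:
    """One logger.info call; shippers route on the `event` attribute, not the message."""
    logger.info(
        "bootstrap progress connection=%s part=%s processed=%d",
        connection_id, part, processed,
        extra=build_progress_extra(connection_id, part, processed, skipped_dup, failed),
    )
```

Keys passed through `extra` become attributes on the `LogRecord`, so a JSON formatter or shipper can read `record.event` directly.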
@ -421,16 +508,17 @@ All events should keep field naming consistent with the existing `ingestion.queu
- **Gateway reference (retrieval + knowledge):** `wiki/b-reference/gateway/architecture.md`, `wiki/b-reference/gateway/ai-agent.md`
- **Implementation touchpoints (indicative):** `gateway/modules/serviceCenter/services/serviceKnowledge/mainServiceKnowledge.py`, `gateway/modules/routes/routeDataFiles.py`, `gateway/modules/features/commcoach/serviceCommcoachIndexer.py`, agent `coreTools` `_documentTools` / `_workspaceTools`, `gateway/modules/datamodels/datamodelExtraction.py` (`ExtractionOptions.mergeStrategy: Optional[MergeStrategy]`).
- **Unit tests (P0 guardrails):** `gateway/tests/unit/services/test_ingestion_hash_stability.py`, `gateway/tests/unit/services/test_extraction_merge_strategy.py`.
- **Unit tests (P1 guardrails):** `gateway/tests/unit/services/test_connection_purge.py`, `gateway/tests/unit/services/test_knowledge_ingest_consumer.py`, `gateway/tests/unit/services/test_clean_email_body.py`, `gateway/tests/unit/services/test_bootstrap_sharepoint.py`, `gateway/tests/unit/services/test_bootstrap_outlook.py`, `gateway/tests/unit/services/test_bootstrap_gmail.py`, `gateway/tests/unit/services/test_bootstrap_gdrive.py`, `gateway/tests/unit/services/test_bootstrap_clickup.py`.
- **P1 implementation touchpoints:** `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorIngestConsumer.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncSharepoint.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncOutlook.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGdrive.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGmail.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncClickup.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subTextClean.py`, `gateway/modules/interfaces/interfaceDbKnowledge.py` (`deleteFileContentIndexByConnectionId`), `gateway/modules/datamodels/datamodelKnowledge.py` (`FileContentIndex.connectionId` + `sourceKind`), `gateway/modules/connectors/providerMsft/connectorMsft.py` (`@odata.nextLink`-loop in `SharepointAdapter.browse`, `eTag` in `_graphItemToExternalEntry`), `gateway/modules/routes/routeSecurityMsft.py` / `routeSecurityGoogle.py` / `routeSecurityClickup.py` / `routeDataConnections.py` (callback emission + `ConnectionStatus.REVOKED` fix), `gateway/app.py` (consumer registration in lifespan).
- **Unit tests (P1a — Microsoft, done):** `gateway/tests/unit/services/test_connection_purge.py`, `gateway/tests/unit/services/test_knowledge_ingest_consumer.py` (incl. **msft** fan-out), `gateway/tests/unit/services/test_clean_email_body.py`, `gateway/tests/unit/services/test_bootstrap_sharepoint.py`, `gateway/tests/unit/services/test_bootstrap_outlook.py`.
- **Unit tests (P1b — Google + ClickUp, done):** **`test_knowledge_ingest_consumer`** (google / clickup fan-out), **`test_bootstrap_gmail.py`**, **`test_bootstrap_gdrive.py`**, **`test_bootstrap_clickup.py`**. **P1d (done):** **`test_p1d_consent_prefs.py`** (10 tests: consent gate, prefs parsing, Gmail depth modes, ClickUp scope). **P1c:** add scheduler tests when implemented.
- **P1 implementation touchpoints:** `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorIngestConsumer.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncSharepoint.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncOutlook.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGdrive.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncGmail.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subConnectorSyncClickup.py`, `gateway/modules/serviceCenter/services/serviceKnowledge/subTextClean.py`, `gateway/modules/interfaces/interfaceDbKnowledge.py` (`deleteFileContentIndexByConnectionId`), `gateway/modules/datamodels/datamodelKnowledge.py` (`FileContentIndex.connectionId` + `sourceKind`), `gateway/modules/connectors/providerMsft/connectorMsft.py` (`@odata.nextLink`-loop in `SharepointAdapter.browse`, `eTag` in `_graphItemToExternalEntry`), `gateway/modules/connectors/providerGoogle/connectorGoogle.py` (P1b: Drive + Gmail revision keys and download/export paths), `gateway/modules/routes/routeSecurityMsft.py` (P1a callbacks), `gateway/modules/routes/routeSecurityGoogle.py` and `gateway/modules/routes/routeSecurityClickup.py` (P1b: parity callbacks), `gateway/modules/routes/routeDataConnections.py` (revoke for **all** authorities), `gateway/app.py` (consumer registration in lifespan).
## Acceptance criteria (plan level)
| # | Criterion | Priority |
|---|-----------|------|
| 1 | Every new **file** that should be searchable triggers ingestion **without** requiring an agent session. | must |
| 2 | **User connection** connect / disconnect has defined ingestion or purge behavior documented and implementable. | must |
| 3 | **Profile/mandate** snapshots use an explicit allowlist; secrets never enter the embedding pipeline. | must |
| 2 | **User connection** connect / disconnect has defined ingestion or purge behavior **for each** OAuth authority **`routeDataConnections`** supports (**P1a** **`msft`**, **P1b** **`google`** / **`clickup`**); **plus** user-controlled **opt-in** and **preference bundle** before ingestion (**P1d**, **§2.6**). | must |
| 3 | **Profile/mandate** snapshot ingestion (**former roadmap P2**) is **deferred**; when re-opened, snapshots must use an explicit allowlist and never embed secrets. Until then, **§2.6** consent + neutralization covers connector-sourced PII risk. | should (reactivated when P2 returns) |
| 4 | Ingestion is **idempotent** for unchanged content (no duplicate embedding work). Verified 2026-04-21 on a 500-page PDF: second re-index trigger logs `ingestion.skipped.duplicate` with a stable hash, zero embedding API calls. See **§1.4 pitfalls** for the three bug classes that had to be fixed first. | must |
| 5 | **Teil 3.3** matrix completed: every `modules/features/*` product row has **retrieval** (agent vs none), **corpus** (upload / tools / feature indexer), and **gap** explicitly stated—not “non-injecting” if **`AgentService`** already provides retrieval injection. | should |
@ -449,9 +537,10 @@ All events should keep field naming consistent with the existing `ingestion.queu
| T7 | Are per-page chunks preserved for multi-page PDFs (no `MergeStrategy` concatenation)? | Unit: `tests/unit/services/test_extraction_merge_strategy.py`. Live: 500-page PDF → 563 ContentObjects, 567 embedding chunks in 24 batches (verified 2026-04-21). |
| T8 | Do `_ingestion.hash` and `status="indexed"` survive a pre-scan re-upsert in `_autoIndexFile`? | Review `routeDataFiles._autoIndexFile`, line ~127: the existing row is read before the upsert, and `_ingestion` + `indexed` are merged into the fresh `contentIndex`. Live: second trigger → `ingestion.skipped.duplicate` instead of re-embedding. |
| T9 | Does a `connection.revoked` event remove **all** `FileContentIndex` rows + `ContentChunk`s of a connection and **nothing else** (uploads without `connectionId` and other connections stay intact)? | Unit: `tests/unit/services/test_connection_purge.py` (3 cases: positive purge, empty-connectionId no-op, unknown connectionId). |
| T10 | Does the `KnowledgeIngestionConsumer` correctly dispatch `connection.established` as an asynchronous `connection.bootstrap` job (msft → SharePoint + Outlook in parallel; google → Drive + Gmail in parallel; clickup → Tasks; unknown authorities `skipped.reason="unsupported_authority"`) and `connection.revoked` synchronously as a purge? | Unit: `tests/unit/services/test_knowledge_ingest_consumer.py` (8 cases: established enqueue, missing-id ignore, revoked purge, missing-id ignore, skip-unsupported, msft fan-out, google fan-out, clickup dispatch). |
| T10 | Does the `KnowledgeIngestionConsumer` correctly dispatch `connection.established` as an asynchronous `connection.bootstrap` job (**P1a:** **msft** → SharePoint + Outlook in parallel; **P1b:** **google** → Drive + Gmail in parallel; **clickup** → Tasks) and `connection.revoked` synchronously as a purge, **for each** of the three **`routeDataConnections`** authorities? | **P1a + P1b (done):** `test_knowledge_ingest_consumer.py` covers all three authorities + revoke; unknown authorities `skipped.reason="unsupported_authority"`. **P1d:** additionally, dispatch only when **consent = yes**. |
| T11 | Does `cleanEmailBody` reduce a realistic Outlook HTML message to the sender's own body text (HTML strip, quote strip EN+DE, signature strip, whitespace collapse, `maxChars` truncation)? | Unit: `tests/unit/services/test_clean_email_body.py` (8 cases). Consequence: `bootstrapOutlook` never sends HTML, quoted replies, or signatures into the embedding pipeline step. |
| T12 | Are the bootstrap walkers for SharePoint and Outlook idempotent against a second run with unchanged `eTag` / `changeKey`? | Unit: `tests/unit/services/test_bootstrap_sharepoint.py` + `tests/unit/services/test_bootstrap_outlook.py`. Mock adapters return stable revisions; the KnowledgeService fake reports `duplicate` and the result object tallies `skippedDuplicate`. |
| T13 | Does `bootstrapGmail` walk `INBOX + SENT`, parse MIME bodies (preferring `text/plain`, falling back to `text/html`), follow `nextPageToken` pagination, and stay idempotent against identical `historyId` revisions? | Unit: `tests/unit/services/test_bootstrap_gmail.py` (6 cases: header/snippet/body content-objects, MIME plain-vs-html preference, HTML fallback, multi-label fan-out, `nextPageToken` pagination, duplicate accounting). |
| T14 | Does `bootstrapGdrive` walk My Drive recursively (folder MIME detection, `maxDepth`), respect the `maxAgeDays` recency filter, and stay idempotent against identical `modifiedTime` revisions? | Unit: `tests/unit/services/test_bootstrap_gdrive.py` (4 cases: site/subfolder walk, duplicate accounting, recency-skip via `skippedPolicy`, provenance carries `authority="google"` + `service="drive"`). |
| T15 | Does `bootstrapClickup` walk Workspaces → Spaces → Folder/Folderless Lists → Tasks under the `maxWorkspaces` / `maxListsPerWorkspace` / `maxTasks` caps, respect the `maxAgeDays` recency filter, and stay idempotent against identical `date_updated` revisions? | Unit: `tests/unit/services/test_bootstrap_clickup.py` (4 cases: hierarchy walk indexes 4 tasks across 2 lists, duplicate accounting, recency-skip via `skippedPolicy`, `maxTasks` cap). |
| T13 | Does `bootstrapGmail` walk `INBOX + SENT`, parse MIME bodies (preferring `text/plain`, falling back to `text/html`), follow `nextPageToken` pagination, and stay idempotent against identical `historyId` revisions? | **P1b (done):** Unit `test_bootstrap_gmail.py`. **P1d:** Walker respects **content depth** from **§2.6** (metadata/snippet/body). |
| T14 | Does `bootstrapGdrive` walk My Drive recursively (folder MIME detection, `maxDepth`), respect the `maxAgeDays` recency filter, and stay idempotent against identical `modifiedTime` revisions? | **P1b (done):** Unit `test_bootstrap_gdrive.py`. **P1d:** "binary files" / MIME allowlist from **§2.6**. |
| T15 | Does `bootstrapClickup` walk Workspaces → Spaces → Folder/Folderless Lists → Tasks under the `maxWorkspaces` / `maxListsPerWorkspace` / `maxTasks` caps, respect the `maxAgeDays` recency filter, and stay idempotent against identical `date_updated` revisions? | **P1b (done):** Unit `test_bootstrap_clickup.py`. **P1d:** ClickUp **scope** (titles/description/comments) from **§2.6**. |
| T16 | Does the **P1c** daily job run only for connections with **knowledge ingestion = on**, and do cost/API limits stay manageable thanks to idempotency + fast-path? | Integration or unit test with a fake clock: second run → predominantly `skippedDup`; logs `ingestion.connection.bootstrap.*` with a recognizable scheduled `reason` (if implemented). |

Binary file not shown.