# Unified Document Model (UDM)

## Concept and Goals

The Unified Document Model defines a **generic, format-independent tree structure** into which an extractor converts every document type (PDF, DOCX, PPTX, XLSX, HTML, ZIP). AI workflows, nodes, and tools can therefore work with a single object model, regardless of the source format.

### Design Principles

- **Three-level guarantee**: Every document has exactly three nesting levels (except ZIP, which acts as a meta-container).
- **Uniform leaf nodes**: All atomic content is represented as `ContentBlock` objects with an identical attribute structure.
- **Generic traversal**: Workflow nodes (Loop, Filter, Transform, Map) operate on the same tree structure regardless of format.
- **No format-specific intermediate layers**: Concepts such as "Paragraph", "Row", or "Cell" are absorbed into the `ContentBlock` rather than modeled as separate levels.

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────┐
│  Level 1 — Document                                 │
│  ┌───────────────────────────────────────────────┐   │
│  │  Level 2 — StructuralNode                     │   │
│  │  ┌─────────────────────────────────────────┐   │   │
│  │  │  Level 3 — ContentBlock (leaf node)     │   │   │
│  │  └─────────────────────────────────────────┘   │   │
│  └───────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘
```

### Levels in Detail

| Level | Type | Description |
|-------|------|-------------|
| **Level 1** | `Document` | Root node. Represents a single source document. |
| **Level 2** | `StructuralNode` | Structural unit: page, section, slide, or sheet. |
| **Level 3** | `ContentBlock` | Atomic content: text, image, table, code, media, link, or formula. |

---

## Data Model

### Document (Level 1)

The root node, one per source file.

```typescript
interface Document {
  id:           string;
  role:         "document";
  source_type:  "pdf" | "docx" | "pptx" | "xlsx" | "html";
  source_path:  string;
  metadata:     Metadata;
  children:     StructuralNode[];
}
```

| Field | Type | Description |
|-------|------|-------------|
| `id` | `string` | Unique ID (UUID) |
| `role` | `"document"` | Always `"document"` at Level 1 |
| `source_type` | `string` | Original format of the source file (one of `pdf`, `docx`, `pptx`, `xlsx`, `html`) |
| `source_path` | `string` | Path of the original file (relative to the workspace or archive) |
| `metadata` | `Metadata` | Document metadata |
| `children` | `StructuralNode[]` | List of structural units |

---

### StructuralNode (Level 2)

The structural unit within a document.

```typescript
interface StructuralNode {
  id:        string;
  role:      "page" | "section" | "slide" | "sheet";
  index:     number;
  label:     string | null;
  metadata:  Metadata;
  children:  ContentBlock[];
}
```

| Field | Type | Description |
|-------|------|-------------|
| `id` | `string` | Unique ID |
| `role` | `string` | Kind of structural unit (depends on the format, but drawn from a fixed set) |
| `index` | `number` | 0-based position within the document |
| `label` | `string?` | Optional name (sheet name, section heading, slide title) |
| `metadata` | `Metadata` | Additional information about the structural unit |
| `children` | `ContentBlock[]` | List of atomic content blocks |

#### Role Mapping per Format

| Source format | `role` | Corresponds to |
|---------------|--------|----------------|
| PDF | `page` | Page |
| DOCX | `section` | Section (heading-based structure) |
| PPTX | `slide` | Slide |
| XLSX | `sheet` | Worksheet |
| HTML | `section` | Semantic region (`<header>`, `<main>`, `<nav>`, `<footer>`, `<aside>`) |
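
The mapping in this table can be captured as a small lookup. The following is a minimal sketch; the names `ROLE_BY_SOURCE_TYPE` and `structural_role` are illustrative and not part of the model itself.

```python
# Structural role per source format, taken from the role-mapping table.
ROLE_BY_SOURCE_TYPE = {
    "pdf": "page",
    "docx": "section",
    "pptx": "slide",
    "xlsx": "sheet",
    "html": "section",
}


def structural_role(source_type: str) -> str:
    """Return the StructuralNode role an extractor emits for a source format."""
    try:
        return ROLE_BY_SOURCE_TYPE[source_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported source_type: {source_type!r}") from None


print(structural_role("pptx"))  # prints: slide
```

Raising on unknown formats, rather than falling back to a default role, keeps the fixed role set from the table intact.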

---

### ContentBlock (Level 3)

The atomic content node. **All formats produce identically structured ContentBlocks.**

```typescript
interface ContentBlock {
  id:            string;
  content_type:  "text" | "image" | "table" | "code" | "media" | "link" | "formula";
  raw:           string;
  mime_type:     string | null;
  language:      string | null;
  attributes:    Record<string, any>;
  position:      Position;
  metadata:      Metadata;
}
```

| Field | Type | Description |
|-------|------|-------------|
| `id` | `string` | Unique ID |
| `content_type` | `string` | Kind of content (from a fixed set) |
| `raw` | `string` | Raw content: plain text, Base64-encoded binary data, or a JSON-serialized structure |
| `mime_type` | `string?` | MIME type (`text/plain`, `image/png`, `text/html`, `application/json`, …) |
| `language` | `string?` | Programming language for `code` blocks (`python`, `sql`, `javascript`, …) |
| `attributes` | `Record` | Additional properties (styling, alt text, size, …) |
| `position` | `Position` | Location within the structural unit |
| `metadata` | `Metadata` | Block-specific metadata |

#### Content Types in Detail

| `content_type` | `raw` contains | Example `mime_type` | Typical `attributes` |
|-----------------|--------------|---------------------|----------------------|
| `text` | Plain-text content | `text/plain` | `{ style, heading_level, list_type, bold, italic }` |
| `image` | Base64-encoded image data | `image/png`, `image/jpeg` | `{ width, height, alt_text, caption }` |
| `table` | JSON matrix `{ headers: [...], rows: [[...], ...] }` | `application/json` | `{ row_count, col_count, has_header, name }` |
| `code` | Source code as text | `text/plain` | `{ language, line_count, executable }` |
| `media` | Base64-encoded data or a URI | `audio/mpeg`, `video/mp4` | `{ duration, format, embedded }` |
| `link` | URL as text | `text/uri-list` | `{ display_text, target, rel }` |
| `formula` | Formula expression (LaTeX, Excel syntax) | `text/plain` | `{ notation, result_value, result_type }` |
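
Because `raw` is always a string, a consumer has to decode it according to `content_type`. The sketch below illustrates one way to do that; `decode_raw` is a hypothetical helper, and leaving `media` undecoded is an assumption that follows from the table allowing either Base64 data or a URI there.

```python
import base64
import json
from typing import Any


def decode_raw(content_type: str, raw: str) -> Any:
    """Decode the `raw` field according to the content-type table."""
    if content_type == "image":
        return base64.b64decode(raw)   # binary image data
    if content_type == "table":
        return json.loads(raw)         # {"headers": [...], "rows": [[...], ...]}
    # text, code, link and formula are plain strings; media is returned
    # unchanged because it may hold either Base64 data or a URI.
    return raw


table = decode_raw("table", '{"headers": ["Metric", "Q3"], "rows": [["Revenue", "1.5M"]]}')
print(table["rows"][0][1])  # prints: 1.5M
```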

---

### Position

Locates a ContentBlock within its structural unit.

```typescript
interface Position {
  index:   number;
  page:    number | null;
  row:     number | null;
  col:     number | null;
  bbox:    BoundingBox | null;
}

interface BoundingBox {
  x:      number;
  y:      number;
  width:  number;
  height: number;
  unit:   "px" | "pt" | "mm";
}
```

| Field | Type | Description |
|-------|------|-------------|
| `index` | `number` | 0-based order within the structural unit |
| `page` | `number?` | Page number (relevant for PDF) |
| `row` | `number?` | Row position (relevant for tabular data) |
| `col` | `number?` | Column position (relevant for tabular data) |
| `bbox` | `BoundingBox?` | Bounding box for visually positioned content (PDF, PPTX) |
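
Since `BoundingBox.unit` can be `px`, `pt`, or `mm`, code that compares boxes from different extractors may want to normalize them first. The following sketch converts a box to points. The millimetre factor is exact (25.4 mm per inch, 72 pt per inch); the pixel factor assumes a 96 dpi raster, which the model does not specify, and `bbox_to_points` is an illustrative name.

```python
# Conversion factors to points (1 pt = 1/72 inch).
# "px" ASSUMES 96 dpi; the model itself does not define a pixel density.
POINTS_PER_UNIT = {"pt": 1.0, "mm": 72.0 / 25.4, "px": 72.0 / 96.0}


def bbox_to_points(bbox: dict) -> dict:
    """Return a copy of a BoundingBox dict with all lengths expressed in pt."""
    factor = POINTS_PER_UNIT[bbox["unit"]]
    return {
        "x": bbox["x"] * factor,
        "y": bbox["y"] * factor,
        "width": bbox["width"] * factor,
        "height": bbox["height"] * factor,
        "unit": "pt",
    }


converted = bbox_to_points({"x": 96, "y": 0, "width": 192, "height": 48, "unit": "px"})
print(converted["x"], converted["width"])  # prints: 72.0 144.0
```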

---

### Metadata

Uniform metadata object, usable on all three levels.

```typescript
interface Metadata {
  title:        string | null;
  author:       string | null;
  created_at:   string | null;   // ISO 8601
  modified_at:  string | null;   // ISO 8601
  source_path:  string;
  tags:         string[];
  custom:       Record<string, any>;
}
```
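
The Python snippets later on this page access nodes through attributes such as `document.children` and `block.content_type`. One possible Python mirror of the interfaces in this data model is a set of dataclasses, sketched below. This is an assumption about how the types could look in Python, not a prescribed implementation; fields with defaults are listed last because dataclasses require that ordering, so the field order differs slightly from the TypeScript definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Metadata:
    source_path: str
    title: Optional[str] = None
    author: Optional[str] = None
    created_at: Optional[str] = None    # ISO 8601
    modified_at: Optional[str] = None   # ISO 8601
    tags: list[str] = field(default_factory=list)
    custom: dict[str, Any] = field(default_factory=dict)


@dataclass
class BoundingBox:
    x: float
    y: float
    width: float
    height: float
    unit: str   # "px" | "pt" | "mm"


@dataclass
class Position:
    index: int
    page: Optional[int] = None
    row: Optional[int] = None
    col: Optional[int] = None
    bbox: Optional[BoundingBox] = None


@dataclass
class ContentBlock:
    id: str
    content_type: str
    raw: str
    position: Position
    metadata: Metadata
    mime_type: Optional[str] = None
    language: Optional[str] = None
    attributes: dict[str, Any] = field(default_factory=dict)


@dataclass
class StructuralNode:
    id: str
    role: str   # "page" | "section" | "slide" | "sheet"
    index: int
    metadata: Metadata
    label: Optional[str] = None
    children: list[ContentBlock] = field(default_factory=list)


@dataclass
class Document:
    id: str
    source_type: str
    source_path: str
    metadata: Metadata
    role: str = "document"
    children: list[StructuralNode] = field(default_factory=list)


# Build a tiny one-page document to show the three levels.
meta = Metadata(source_path="reports/example.pdf")
block = ContentBlock(id="cb-1", content_type="text", raw="Hello",
                     position=Position(index=0), metadata=meta, mime_type="text/plain")
page = StructuralNode(id="sn-0", role="page", index=0, metadata=meta, children=[block])
document = Document(id="doc-1", source_type="pdf", source_path="reports/example.pdf",
                    metadata=meta, children=[page])
print(document.children[0].children[0].raw)  # prints: Hello
```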

---

### Archive (Special Case: ZIP)

ZIP files act as meta-containers and add an extra wrapper level. Each contained file is extracted as a standalone `Document`. The `Archive` interface also allows `tar` and `gz` as `source_type`.

```typescript
interface Archive {
  id:           string;
  role:         "archive";
  source_type:  "zip" | "tar" | "gz";
  source_path:  string;
  metadata:     Metadata;
  children:     (Archive | Document)[];
}
```

The three-level guarantee applies **per document inside** the archive. The archive itself is only a wrapper.

```
Archive (ZIP)
├── Document (PDF)
│   ├── StructuralNode (page 0)
│   │   ├── ContentBlock (text)
│   │   └── ContentBlock (image)
│   └── StructuralNode (page 1)
│       └── ContentBlock (text)
├── Document (DOCX)
│   └── StructuralNode (section 0)
│       ├── ContentBlock (text)
│       └── ContentBlock (table)
└── Archive (nested ZIP)
    └── Document (…)
```

---

## Format Mapping Reference

Overview of how each source format maps onto the three-level model.

### PDF → UDM

```
PDF file
├── Document (source_type: "pdf")
│   ├── StructuralNode (role: "page", index: 0)
│   │   ├── ContentBlock (text)      ← Text blocks on the page
│   │   ├── ContentBlock (image)     ← Embedded images
│   │   └── ContentBlock (table)     ← Detected tables
│   ├── StructuralNode (role: "page", index: 1)
│   │   └── ...
```

- Each PDF page becomes a `StructuralNode(page)`.
- Text is extracted as contiguous blocks, not line by line.
- Images are stored as Base64 in `raw`.
- Tables are serialized as a JSON matrix.
- `bbox` in `Position` holds the visual position on the page.

### DOCX → UDM

```
DOCX file
├── Document (source_type: "docx")
│   ├── StructuralNode (role: "section", label: "Introduction")
│   │   ├── ContentBlock (text)      ← Paragraphs
│   │   ├── ContentBlock (image)     ← Inline images
│   │   └── ContentBlock (table)     ← Word tables
│   ├── StructuralNode (role: "section", label: "Methodology")
│   │   └── ...
```

- Sections are split by heading level (H1 → new section).
- Paragraphs map directly to `ContentBlock(text)`, with no intermediate level.
- Styling information (bold, italic, heading_level) goes into `attributes`.
- Tables are serialized as a JSON matrix in `raw`.
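
The heading-based split can be sketched as a single pass over a flat paragraph list. The input shape (dicts with `text` and an optional `heading_level`) and the name `group_into_sections` are assumptions for illustration. Keeping the heading itself as the first block of its section, and putting content before the first H1 into an unlabeled section, are also choices made here rather than rules stated by the model.

```python
def group_into_sections(paragraphs: list[dict]) -> list[dict]:
    """Group a flat paragraph list into section nodes, splitting at H1 headings."""
    sections: list[dict] = []
    for para in paragraphs:
        is_h1 = para.get("heading_level") == 1
        if is_h1 or not sections:
            # An H1 opens a new section and supplies its label; content that
            # precedes the first H1 lands in an unlabeled leading section.
            sections.append({
                "role": "section",
                "index": len(sections),
                "label": para["text"] if is_h1 else None,
                "children": [],
            })
        # Every paragraph becomes a text block directly under its section.
        sections[-1]["children"].append({
            "content_type": "text",
            "raw": para["text"],
            "attributes": {"heading_level": para.get("heading_level")},
        })
    return sections


paragraphs = [
    {"text": "Preface text"},
    {"text": "Introduction", "heading_level": 1},
    {"text": "First paragraph."},
    {"text": "Methodology", "heading_level": 1},
    {"text": "Details", "heading_level": 2},
]
sections = group_into_sections(paragraphs)
print([s["label"] for s in sections])  # prints: [None, 'Introduction', 'Methodology']
```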

### PPTX → UDM

```
PPTX file
├── Document (source_type: "pptx")
│   ├── StructuralNode (role: "slide", index: 0, label: "Title slide")
│   │   ├── ContentBlock (text)      ← Text boxes
│   │   ├── ContentBlock (image)     ← Images/graphics
│   │   └── ContentBlock (table)     ← Slide tables
│   ├── StructuralNode (role: "slide", index: 1)
│   │   └── ...
```

- Each slide becomes a `StructuralNode(slide)`.
- Text boxes become `ContentBlock(text)` with a `bbox` for their position.
- Speaker notes become `ContentBlock(text)` with `attributes.note: true`.
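
Because speaker notes share the `text` content type with visible text boxes, consumers that only want on-slide content need to check `attributes.note`. A small sketch, using plain dicts and the hypothetical helper name `split_notes`:

```python
def split_notes(slide: dict) -> tuple[list[dict], list[dict]]:
    """Separate visible slide blocks from speaker notes via attributes.note."""
    visible, notes = [], []
    for block in slide["children"]:
        is_note = block.get("attributes", {}).get("note") is True
        (notes if is_note else visible).append(block)
    return visible, notes


slide = {
    "role": "slide",
    "index": 0,
    "children": [
        {"content_type": "text", "raw": "Agenda", "attributes": {}},
        {"content_type": "text", "raw": "Mention the deadline", "attributes": {"note": True}},
    ],
}
visible, notes = split_notes(slide)
print(len(visible), len(notes))  # prints: 1 1
```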

### XLSX → UDM

```
XLSX file
├── Document (source_type: "xlsx")
│   ├── StructuralNode (role: "sheet", index: 0, label: "Revenue 2024")
│   │   ├── ContentBlock (table)     ← Table region A
│   │   └── ContentBlock (table)     ← Table region B (if disjoint)
│   ├── StructuralNode (role: "sheet", index: 1, label: "Costs")
│   │   └── ContentBlock (table)
```

- Each sheet becomes a `StructuralNode(sheet)`.
- The entire data grid of a sheet is serialized as a single `ContentBlock(table)`.
- If a sheet contains several disjoint table regions, each region becomes its own ContentBlock.
- Formulas are extracted as `ContentBlock(formula)` on request.
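
How disjoint regions are detected is not specified here. As one simple possibility, the sketch below splits a sheet grid at fully empty rows and serializes each region in the JSON-matrix form used for `table` blocks. It assumes regions are stacked vertically and that the first row of each region is its header; side-by-side regions are not detected. `split_table_regions` is an illustrative name.

```python
import json


def split_table_regions(grid: list[list]) -> list[dict]:
    """Split a sheet grid into table blocks at fully empty rows (sketch)."""
    def is_blank(row: list) -> bool:
        return all(cell is None or cell == "" for cell in row)

    # Collect runs of consecutive non-blank rows.
    regions, current = [], []
    for row in grid:
        if is_blank(row):
            if current:
                regions.append(current)
                current = []
        else:
            current.append(row)
    if current:
        regions.append(current)

    blocks = []
    for index, region in enumerate(regions):
        headers, rows = region[0], region[1:]   # ASSUMPTION: first row is the header
        blocks.append({
            "content_type": "table",
            "raw": json.dumps({"headers": headers, "rows": rows}),
            "mime_type": "application/json",
            "attributes": {
                "row_count": len(rows),
                "col_count": len(headers),
                "has_header": True,
            },
            "position": {"index": index},
        })
    return blocks


grid = [
    ["Region", "Revenue"],
    ["North", 120],
    [None, None],
    ["Item", "Cost"],
    ["Rent", 40],
    ["Power", 15],
]
blocks = split_table_regions(grid)
print([b["attributes"]["row_count"] for b in blocks])  # prints: [1, 2]
```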

### HTML → UDM

```
HTML file
├── Document (source_type: "html")
│   ├── StructuralNode (role: "section", label: "header")
│   │   └── ContentBlock (text)
│   ├── StructuralNode (role: "section", label: "nav")
│   │   └── ContentBlock (link)
│   ├── StructuralNode (role: "section", label: "main")
│   │   ├── ContentBlock (text)
│   │   ├── ContentBlock (image)
│   │   └── ContentBlock (table)
│   ├── StructuralNode (role: "section", label: "footer")
│   │   └── ContentBlock (text)
```

- Semantic HTML5 elements (`<header>`, `<main>`, `<nav>`, `<footer>`, `<aside>`) become sections.
- If no semantic elements are present, the entire `<body>` becomes a single section.
- HTML content is converted to plain text, not stored as HTML markup.
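
The sectioning rules for HTML can be sketched with Python's standard `html.parser`. The example below only collects plain text per semantic element and applies the single-section fallback; a real extractor would go on to emit `ContentBlock`s for text, links, images, and tables. `SectionParser` and `html_to_sections` are illustrative names, the `None` label for the fallback section is an assumption, and text that sits outside every semantic element is dropped when at least one semantic element exists.

```python
from html.parser import HTMLParser

SEMANTIC_TAGS = {"header", "main", "nav", "footer", "aside"}
SKIPPED_TAGS = {"script", "style", "title"}   # their text is not page content


class SectionParser(HTMLParser):
    """Collect plain text per semantic HTML5 element."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: list[dict] = []   # one entry per semantic element
        self.all_text: list[str] = []    # fallback when no semantic element exists
        self._open: list[dict] = []      # stack of open semantic elements
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in SEMANTIC_TAGS:
            section = {"label": tag, "text": []}
            self.sections.append(section)
            self._open.append(section)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif self._open and self._open[-1]["label"] == tag:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        self.all_text.append(text)
        if self._open:
            self._open[-1]["text"].append(text)   # innermost semantic element wins


def html_to_sections(html: str) -> list[dict]:
    """Return section dicts with plain text, one per semantic element."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    if not parser.sections:
        # No semantic elements: the whole body becomes a single section.
        return [{"role": "section", "index": 0, "label": None,
                 "text": " ".join(parser.all_text)}]
    return [
        {"role": "section", "index": i, "label": s["label"], "text": " ".join(s["text"])}
        for i, s in enumerate(parser.sections)
    ]


page = ("<html><head><title>Demo</title></head><body>"
        "<header>Site</header><main><p>Hello</p><p>World</p></main></body></html>")
result = html_to_sections(page)
print([(s["label"], s["text"]) for s in result])
# prints: [('header', 'Site'), ('main', 'Hello World')]
```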

---

## Workflow Integration

### Generic Traversal

Because all formats share the same structure, a single generic walker works for every document type:

```python
from collections.abc import Iterator

# Document and ContentBlock refer to the UDM types from the data model above.
def walk_content_blocks(document: Document) -> Iterator[ContentBlock]:
    """Iterate over all ContentBlocks of a document, regardless of format."""
    for structural_node in document.children:
        for block in structural_node.children:
            yield block
```

### Filter-Node

```python
def filter_by_type(document: Document, content_type: str) -> list[ContentBlock]:
    """Filter all ContentBlocks by type (e.g. 'image', 'table')."""
    return [
        block for block in walk_content_blocks(document)
        if block.content_type == content_type
    ]
```

### Loop-Node

```python
def process_all_documents(archive: Archive) -> None:
    """Process every document in an archive with identical logic."""
    for child in archive.children:
        if isinstance(child, Archive):
            process_all_documents(child)   # recurse into nested archives
        else:
            for block in walk_content_blocks(child):
                # Same processing whether the source was PDF, DOCX, PPTX, ...
                transform(block)   # placeholder for the per-block step
```

### Map-Node

```python
from typing import Callable, TypeVar
T = TypeVar("T")   # result type of fn

def map_blocks(document: Document, fn: Callable[[ContentBlock], T]) -> list[T]:
    """Apply a function to every ContentBlock."""
    return [fn(block) for block in walk_content_blocks(document)]
```

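### Example: Extracting All Images from Any Document

```python
import base64

images = filter_by_type(document, "image")
for img in images:
    # mime_type may be null; fall back to a generic extension in that case
    extension = img.mime_type.split("/")[1] if img.mime_type else "bin"
    save_image(   # save_image is a placeholder for the caller's own writer
        data=base64.b64decode(img.raw),
        filename=f"{img.id}.{extension}",
        alt_text=img.attributes.get("alt_text", ""),
    )
```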

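### Example: Exporting All Tables as CSV

```python
import csv
import json

tables = filter_by_type(document, "table")
for table in tables:
    data = json.loads(table.raw)   # {"headers": [...], "rows": [[...], ...]}
    with open(f"{table.id}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(data["headers"])
        writer.writerows(data["rows"])
```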

---

## JSON Example

Complete example of an extracted PDF document:

```json
{
  "id": "doc-a1b2c3",
  "role": "document",
  "source_type": "pdf",
  "source_path": "reports/quarterly-report-q3.pdf",
  "metadata": {
    "title": "Quarterly Report Q3 2025",
    "author": "Finance Team",
    "created_at": "2025-10-01T08:00:00Z",
    "modified_at": "2025-10-15T14:30:00Z",
    "source_path": "reports/quarterly-report-q3.pdf",
    "tags": ["finance", "quarterly"],
    "custom": {}
  },
  "children": [
    {
      "id": "sn-page-0",
      "role": "page",
      "index": 0,
      "label": null,
      "metadata": {
        "title": null,
        "author": null,
        "created_at": null,
        "modified_at": null,
        "source_path": "reports/quarterly-report-q3.pdf#page=1",
        "tags": [],
        "custom": {}
      },
      "children": [
        {
          "id": "cb-001",
          "content_type": "text",
          "raw": "Quarterly Report Q3 2025\n\nThis report summarizes the financial performance...",
          "mime_type": "text/plain",
          "language": null,
          "attributes": {
            "heading_level": 1,
            "style": "title"
          },
          "position": {
            "index": 0,
            "page": 1,
            "row": null,
            "col": null,
            "bbox": { "x": 50, "y": 30, "width": 500, "height": 40, "unit": "pt" }
          },
          "metadata": {
            "title": null,
            "author": null,
            "created_at": null,
            "modified_at": null,
            "source_path": "reports/quarterly-report-q3.pdf#page=1",
            "tags": [],
            "custom": {}
          }
        },
        {
          "id": "cb-002",
          "content_type": "table",
          "raw": "{\"headers\":[\"Metric\",\"Q2\",\"Q3\",\"Delta\"],\"rows\":[[\"Revenue\",\"1.2M\",\"1.5M\",\"+25%\"],[\"Costs\",\"800K\",\"850K\",\"+6%\"]]}",
          "mime_type": "application/json",
          "language": null,
          "attributes": {
            "row_count": 2,
            "col_count": 4,
            "has_header": true,
            "name": "Financial Overview"
          },
          "position": {
            "index": 1,
            "page": 1,
            "row": null,
            "col": null,
            "bbox": { "x": 50, "y": 200, "width": 500, "height": 120, "unit": "pt" }
          },
          "metadata": {
            "title": null,
            "author": null,
            "created_at": null,
            "modified_at": null,
            "source_path": "reports/quarterly-report-q3.pdf#page=1",
            "tags": [],
            "custom": {}
          }
        }
      ]
    }
  ]
}
```
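
A parsed JSON document like this one can be checked against the three-level structure before it enters a workflow. The validator below is a minimal sketch, not part of the model: `validate_document` is a hypothetical helper, and it checks only roles, content types, and that ContentBlocks are leaves.

```python
STRUCTURAL_ROLES = {"page", "section", "slide", "sheet"}
CONTENT_TYPES = {"text", "image", "table", "code", "media", "link", "formula"}


def validate_document(doc: dict) -> list[str]:
    """Return a list of violations of the three-level structure (empty if valid)."""
    errors = []
    if doc.get("role") != "document":
        errors.append("root: role must be 'document'")
    for i, node in enumerate(doc.get("children", [])):
        where = f"children[{i}]"
        if node.get("role") not in STRUCTURAL_ROLES:
            errors.append(f"{where}: unknown structural role {node.get('role')!r}")
        for j, block in enumerate(node.get("children", [])):
            here = f"{where}.children[{j}]"
            if block.get("content_type") not in CONTENT_TYPES:
                errors.append(f"{here}: unknown content_type {block.get('content_type')!r}")
            if "children" in block:
                # Level 3 must be a leaf; nested children would add a fourth level.
                errors.append(f"{here}: ContentBlock must be a leaf")
    return errors


doc = {
    "role": "document",
    "children": [
        {"role": "page", "children": [
            {"content_type": "text"},
            {"content_type": "chart"},
        ]},
    ],
}
print(validate_document(doc))
# prints: ["children[0].children[1]: unknown content_type 'chart'"]
```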

---

## Summary

| Property | Value |
|----------|-------|
| **Levels** | Exactly 3 per document (Archive as an optional wrapper) |
| **Structural units** | `page`, `section`, `slide`, `sheet` |
| **Content types** | `text`, `image`, `table`, `code`, `media`, `link`, `formula` |
| **Formats** | PDF, DOCX, PPTX, XLSX, HTML (extensible) |
| **Traversal** | One generic walker for all formats |
| **Serialization** | JSON-compatible, ready for use in workflow engines |