ValueOn AG c9454a618f feat db-clean-ui and unified content udm

2026-04-16 23:12:56 +02:00

Unified Document Model (UDM)

Concept and Goals

The Unified Document Model defines a generic, format-independent tree structure into which an extractor converts every document type (PDF, DOCX, PPTX, XLSX, HTML, ZIP). AI workflows, nodes, and tools can therefore work with a single object model, regardless of the source format.

Design Principles

Three-level guarantee: Every document has exactly three nesting levels (ZIP is the exception, acting as a meta-container).
Uniform leaf nodes: All atomic content is represented as ContentBlock objects with an identical attribute structure.
Generic traversal: Workflow nodes (Loop, Filter, Transform, Map) operate on the same tree structure, independent of format.
No format-specific intermediate layers: Concepts such as "Paragraph", "Row", or "Cell" are absorbed into the ContentBlock rather than modeled as separate levels.

Architecture Overview

┌─────────────────────────────────────────────────────┐
│  Level 1: Document                                  │
│  ┌───────────────────────────────────────────────┐  │
│  │  Level 2: StructuralNode                      │  │
│  │  ┌─────────────────────────────────────────┐  │  │
│  │  │  Level 3: ContentBlock (leaf node)      │  │  │
│  │  └─────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘

Levels in Detail

Level	Type	Description
Level 1	`Document`	Root node. Represents a single source document.
Level 2	`StructuralNode`	Structural unit: page, section, slide, or sheet.
Level 3	`ContentBlock`	Atomic content: text, image, table, code, media, link, or formula.
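
The three-level shape can be checked mechanically. The following is a minimal sketch, assuming the tree is available as plain dictionaries (for example after loading the serialized JSON). The function name `check_three_levels` and the wording of the messages are hypothetical and not part of the UDM itself.

```python
from typing import Any

# Roles allowed on level 2, taken from the StructuralNode definition.
STRUCTURAL_ROLES = {"page", "section", "slide", "sheet"}


def check_three_levels(document: dict[str, Any]) -> list[str]:
    """Return violations of the Document -> StructuralNode -> ContentBlock shape."""
    problems: list[str] = []
    if document.get("role") != "document":
        problems.append("root node must have role 'document'")
    for node in document.get("children", []):
        if node.get("role") not in STRUCTURAL_ROLES:
            problems.append(f"level 2 node {node.get('id')} has unexpected role {node.get('role')!r}")
        for block in node.get("children", []):
            # Leaf nodes carry a content_type and must not nest any further.
            if "content_type" not in block:
                problems.append(f"level 3 node {block.get('id')} is missing content_type")
            if "children" in block:
                problems.append(f"level 3 node {block.get('id')} must not have children")
    return problems
```

An empty result means the tree has the expected shape; each entry otherwise names one offending node.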

Data Model

Document (Level 1)

The root node, one per source file.

interface Document {
  id:           string;
  role:         "document";
  source_type:  "pdf" | "docx" | "pptx" | "xlsx" | "html";
  source_path:  string;
  metadata:     Metadata;
  children:     StructuralNode[];
}

Field	Type	Description
`id`	`string`	Unique ID (UUID)
`role`	`"document"`	Always `"document"` on level 1
`source_type`	`string`	Original format of the source file
`source_path`	`string`	Path of the original file (relative to the workspace or archive)
`metadata`	`Metadata`	Document metadata
`children`	`StructuralNode[]`	List of structural units

StructuralNode (Level 2)

The structural unit within a document.

interface StructuralNode {
  id:        string;
  role:      "page" | "section" | "slide" | "sheet";
  index:     number;
  label:     string | null;
  metadata:  Metadata;
  children:  ContentBlock[];
}

Field	Type	Description
`id`	`string`	Unique ID
`role`	`string`	Kind of structural unit (format-dependent, but drawn from a fixed set)
`index`	`number`	0-based position within the document
`label`	`string?`	Optional name (sheet name, section heading, slide title)
`metadata`	`Metadata`	Additional information about the structural unit
`children`	`ContentBlock[]`	List of atomic content blocks

Role Mapping per Format

Source format	`role`	Corresponds to in the original
PDF	`page`	Page
DOCX	`section`	Section (heading-based structure)
PPTX	`slide`	Slide
XLSX	`sheet`	Worksheet
HTML	`section`	Semantic region (`<header>`, `<main>`, `<nav>`, `<footer>`, `<aside>`)
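
The mapping in the table above can be expressed as a lookup and used to spot structural nodes whose role does not fit the document's source format. This is a sketch under the assumption that documents are handled as plain dictionaries; `ROLE_BY_SOURCE_TYPE` and `unexpected_roles` are illustrative names.

```python
# Expected level 2 role per source format, copied from the mapping table.
ROLE_BY_SOURCE_TYPE = {
    "pdf": "page",
    "docx": "section",
    "pptx": "slide",
    "xlsx": "sheet",
    "html": "section",
}


def unexpected_roles(document: dict) -> list[str]:
    """Return the ids of structural nodes whose role does not match source_type."""
    expected = ROLE_BY_SOURCE_TYPE[document["source_type"]]
    return [node["id"] for node in document["children"] if node["role"] != expected]
```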

ContentBlock (Level 3)

The atomic content node. All formats produce identically structured ContentBlocks.

interface ContentBlock {
  id:            string;
  content_type:  "text" | "image" | "table" | "code" | "media" | "link" | "formula";
  raw:           string;
  mime_type:     string | null;
  language:      string | null;
  attributes:    Record<string, any>;
  position:      Position;
  metadata:      Metadata;
}

Field	Type	Description
`id`	`string`	Unique ID
`content_type`	`string`	Kind of content (drawn from a fixed set)
`raw`	`string`	Raw content: plain text, Base64-encoded binary data, or a JSON-serialized structure
`mime_type`	`string?`	MIME type (`text/plain`, `image/png`, `text/html`, `application/json`, …)
`language`	`string?`	Programming language for `code` blocks (`python`, `sql`, `javascript`, …)
`attributes`	`Record`	Additional properties (styling, alt text, size, …)
`position`	`Position`	Location within the structural unit
`metadata`	`Metadata`	Block-specific metadata

Content Types in Detail

`content_type`	`raw` contains	`mime_type` example	Typical `attributes`
`text`	Plain-text content	`text/plain`	`{ style, heading_level, list_type, bold, italic }`
`image`	Base64-encoded image data	`image/png`, `image/jpeg`	`{ width, height, alt_text, caption }`
`table`	JSON matrix `{ headers: [...], rows: [[...], ...] }`	`application/json`	`{ row_count, col_count, has_header, name }`
`code`	Source code as text	`text/plain`	`{ language, line_count, executable }`
`media`	Base64-encoded data or URI	`audio/mpeg`, `video/mp4`	`{ duration, format, embedded }`
`link`	URL as text	`text/uri-list`	`{ display_text, target, rel }`
`formula`	Formula expression (LaTeX, Excel syntax)	`text/plain`	`{ notation, result_value, result_type }`
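
Since `raw` is always a string, consumers need to decode it according to `content_type`. The sketch below shows one way to do that for dictionary-shaped blocks; `decode_raw` is a hypothetical helper, and the handling of `media` is an assumption because the table does not say how Base64 data and URIs are told apart.

```python
import base64
import json
from typing import Any


def decode_raw(block: dict[str, Any]) -> Any:
    """Turn the raw string of a ContentBlock into a Python value."""
    content_type = block["content_type"]
    raw = block["raw"]
    if content_type == "image":
        # Images are stored as Base64-encoded binary data.
        return base64.b64decode(raw)
    if content_type == "table":
        # Tables are stored as a JSON object with "headers" and "rows".
        return json.loads(raw)
    # text, code, link and formula are plain strings.
    # media may be Base64 data or a URI; it is returned unchanged here (assumption).
    return raw
```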

Position

Locates a ContentBlock within its structural unit.

interface Position {
  index:   number;
  page:    number | null;
  row:     number | null;
  col:     number | null;
  bbox:    BoundingBox | null;
}

interface BoundingBox {
  x:      number;
  y:      number;
  width:  number;
  height: number;
  unit:   "px" | "pt" | "mm";
}

Field	Type	Description
`index`	`number`	0-based order within the structural unit
`page`	`number?`	Page number (relevant for PDF)
`row`	`number?`	Row position (relevant for tabular data)
`col`	`number?`	Column position (relevant for tabular data)
`bbox`	`BoundingBox?`	Bounding box for visually positioned content (PDF, PPTX)
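
Because `BoundingBox.unit` can be `px`, `pt`, or `mm`, comparing boxes from different extractors requires a common unit. The following sketch converts a dictionary-shaped bounding box to points. The helper name `bbox_to_points` is hypothetical, and the default of 96 pixels per inch is an assumption, since pixels have no fixed physical size.

```python
PT_PER_INCH = 72.0
MM_PER_INCH = 25.4


def bbox_to_points(bbox: dict, px_per_inch: float = 96.0) -> dict:
    """Return a copy of a BoundingBox with all lengths converted to points."""
    factors = {
        "pt": 1.0,
        "mm": PT_PER_INCH / MM_PER_INCH,
        # The pixel density is an assumption and can be overridden by the caller.
        "px": PT_PER_INCH / px_per_inch,
    }
    factor = factors[bbox["unit"]]
    converted = {key: bbox[key] * factor for key in ("x", "y", "width", "height")}
    converted["unit"] = "pt"
    return converted
```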

Metadata

Uniform metadata object, usable on all three levels.

interface Metadata {
  title:        string | null;
  author:       string | null;
  created_at:   string | null;   // ISO 8601
  modified_at:  string | null;   // ISO 8601
  source_path:  string;
  tags:         string[];
  custom:       Record<string, any>;
}
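
The timestamps `created_at` and `modified_at` are ISO 8601 strings or null. A small sketch of how a consumer might turn them into timezone-aware values follows; `parse_timestamp` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp from Metadata, passing null values through."""
    if value is None:
        return None
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on,
    # so it is rewritten to an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)
```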

Archive (Special Case: ZIP)

ZIP files act as meta-containers and add an extra wrapper level. Each contained file is extracted as an independent Document.

interface Archive {
  id:           string;
  role:         "archive";
  source_type:  "zip" | "tar" | "gz";
  source_path:  string;
  metadata:     Metadata;
  children:     (Archive | Document)[];
}

The three-level guarantee applies to each document inside the archive. The archive itself is only a wrapper.

Archive (ZIP)
├── Document (PDF)
│   ├── StructuralNode (page 0)
│   │   ├── ContentBlock (text)
│   │   └── ContentBlock (image)
│   └── StructuralNode (page 1)
│       └── ContentBlock (text)
├── Document (DOCX)
│   └── StructuralNode (section 0)
│       ├── ContentBlock (text)
│       └── ContentBlock (table)
└── Archive (nested ZIP)
    └── Document (…)
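
A tree like the one above can be flattened into its documents with a short recursive generator. This is a sketch for dictionary-shaped nodes that relies only on the `role` and `children` fields; the name `iter_documents` is an assumption.

```python
from typing import Any, Iterator


def iter_documents(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield every Document below a node, descending into nested archives."""
    if node["role"] == "archive":
        for child in node["children"]:
            yield from iter_documents(child)
    else:
        # Anything that is not an archive is treated as a Document.
        yield node
```

Passing a single Document instead of an Archive yields just that document, so callers do not need to distinguish the two cases.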

Format Mapping Reference

Overview of how each source format maps onto the three-level model.

PDF → UDM

PDF file
├── Document (source_type: "pdf")
│   ├── StructuralNode (role: "page", index: 0)
│   │   ├── ContentBlock (text)      ← text blocks of the page
│   │   ├── ContentBlock (image)     ← embedded images
│   │   └── ContentBlock (table)     ← detected tables
│   ├── StructuralNode (role: "page", index: 1)
│   │   └── ...

Each PDF page becomes a StructuralNode(page).
Text is extracted as contiguous blocks, not line by line.
Images are stored Base64-encoded in raw.
Tables are serialized as a JSON matrix.
The bbox in Position holds the block's visual position on the page.

DOCX → UDM

DOCX file
├── Document (source_type: "docx")
│   ├── StructuralNode (role: "section", label: "Introduction")
│   │   ├── ContentBlock (text)      ← paragraphs
│   │   ├── ContentBlock (image)     ← inline images
│   │   └── ContentBlock (table)     ← Word tables
│   ├── StructuralNode (role: "section", label: "Methodology")
│   │   └── ...

Sections are split by heading level (H1 → new section).
Paragraphs map directly to ContentBlock(text), with no intermediate level.
Styling information (bold, italic, heading_level) goes into attributes.
Tables are serialized as a JSON matrix in raw.
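
The heading-based split can be illustrated on a simplified input. The sketch below assumes paragraphs arrive as `(heading_level, text)` pairs with `None` for body text. That input shape, the name `group_into_sections`, and the decisions to keep the heading itself as a text block and to put content before the first H1 into an unlabeled section are assumptions, not documented extractor behavior.

```python
from typing import Optional


def group_into_sections(paragraphs: list[tuple[Optional[int], str]]) -> list[dict]:
    """Group paragraphs into section nodes, opening a new section at every H1."""
    sections: list[dict] = []
    for heading_level, text in paragraphs:
        if heading_level == 1 or not sections:
            sections.append({
                "role": "section",
                "index": len(sections),
                # Content before the first H1 lands in an unlabeled section.
                "label": text if heading_level == 1 else None,
                "children": [],
            })
        attributes = {} if heading_level is None else {"heading_level": heading_level}
        sections[-1]["children"].append(
            {"content_type": "text", "raw": text, "attributes": attributes}
        )
    return sections
```

Lower heading levels (H2 and below) stay inside the current section and are only recorded in `attributes.heading_level`.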

PPTX → UDM

PPTX file
├── Document (source_type: "pptx")
│   ├── StructuralNode (role: "slide", index: 0, label: "Title slide")
│   │   ├── ContentBlock (text)      ← text boxes
│   │   ├── ContentBlock (image)     ← images/graphics
│   │   └── ContentBlock (table)     ← slide tables
│   ├── StructuralNode (role: "slide", index: 1)
│   │   └── ...

Each slide becomes a StructuralNode(slide).
Text boxes become ContentBlock(text), with bbox carrying the position.
Speaker notes are stored as ContentBlock(text) with attributes.note: true.
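
Since speaker notes share the `text` content type with visible text boxes, consumers have to separate them through the `note` attribute. A minimal sketch for dictionary-shaped slides follows; `split_notes` is a hypothetical name.

```python
def split_notes(slide: dict) -> tuple[list[dict], list[dict]]:
    """Split the blocks of a slide into (visible blocks, speaker notes)."""
    visible: list[dict] = []
    notes: list[dict] = []
    for block in slide["children"]:
        # A speaker note is a text block flagged with attributes.note == true.
        is_note = block["content_type"] == "text" and block["attributes"].get("note") is True
        (notes if is_note else visible).append(block)
    return visible, notes
```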

XLSX → UDM

XLSX file
├── Document (source_type: "xlsx")
│   ├── StructuralNode (role: "sheet", index: 0, label: "Revenue 2024")
│   │   ├── ContentBlock (table)     ← table region A
│   │   └── ContentBlock (table)     ← table region B (if disjoint)
│   ├── StructuralNode (role: "sheet", index: 1, label: "Costs")
│   │   └── ContentBlock (table)

Each sheet becomes a StructuralNode(sheet).
By default, the entire data grid of a sheet is serialized as a single ContentBlock(table).
If a sheet contains several disjoint table regions, each region becomes its own ContentBlock.
Formulas are optionally extracted as ContentBlock(formula).
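
How disjoint regions are detected is not specified here. As an illustration only, the sketch below uses a deliberately simple heuristic: it splits a grid of cell strings wherever a fully empty row occurs and treats the first row of each region as its header. Both rules and the name `split_on_blank_rows` are assumptions, and regions placed side by side would not be separated by this approach.

```python
def split_on_blank_rows(grid: list[list[str]]) -> list[dict]:
    """Split a sheet grid into table payloads at fully empty rows."""
    tables: list[dict] = []
    current: list[list[str]] = []
    # The trailing empty row acts as a sentinel that flushes the last region.
    for row in grid + [[]]:
        if any(cell != "" for cell in row):
            current.append(row)
        elif current:
            # First row of a region is assumed to be the header row.
            tables.append({"headers": current[0], "rows": current[1:]})
            current = []
    return tables
```

Each returned dictionary has the `{ headers, rows }` shape that a `table` block stores in `raw` after JSON serialization.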

HTML → UDM

HTML file
├── Document (source_type: "html")
│   ├── StructuralNode (role: "section", label: "header")
│   │   └── ContentBlock (text)
│   ├── StructuralNode (role: "section", label: "nav")
│   │   └── ContentBlock (link)
│   ├── StructuralNode (role: "section", label: "main")
│   │   ├── ContentBlock (text)
│   │   ├── ContentBlock (image)
│   │   └── ContentBlock (table)
│   ├── StructuralNode (role: "section", label: "footer")
│   │   └── ContentBlock (text)

Semantic HTML5 elements (<header>, <main>, <nav>, <footer>, <aside>) become sections.
If no semantic elements are present, the entire <body> becomes a single section.
HTML content is converted to plain text, not stored as HTML markup.
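
The three rules above can be sketched with Python's standard `html.parser` module. This is an illustration under simplifying assumptions, not the extractor's actual implementation: only top-level semantic elements open a section, text from elements such as `<title>` or `<script>` is not filtered out, and the fallback label `"body"` as well as the names `SectionTextParser` and `html_to_sections` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional

SEMANTIC_TAGS = {"header", "main", "nav", "footer", "aside"}


class SectionTextParser(HTMLParser):
    """Collect plain text per top-level semantic element."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: list[dict] = []
        self.all_text: list[str] = []
        self._open_tag: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only open a section when no other semantic element is currently open.
        if tag in SEMANTIC_TAGS and self._open_tag is None:
            self._open_tag = tag
            self.sections.append({"role": "section", "label": tag, "text": []})

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.all_text.append(text)
        if self._open_tag is not None:
            self.sections[-1]["text"].append(text)


def html_to_sections(html: str) -> list[dict]:
    """Return one plain-text section per semantic element, or one for the whole input."""
    parser = SectionTextParser()
    parser.feed(html)
    parser.close()
    if not parser.sections:
        # No semantic elements: everything becomes a single section.
        return [{"role": "section", "label": "body", "text": " ".join(parser.all_text)}]
    return [{**section, "text": " ".join(section["text"])} for section in parser.sections]
```

In a full extractor each section's text would then be wrapped in ContentBlock objects; the sketch stops at the section level to stay short.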

Workflow Integration

Generic Traversal

Because all formats share the same structure, a single generic walker works for every document type:

from typing import Iterator

# Document and ContentBlock are assumed to be Python classes mirroring the interfaces above.
def walk_content_blocks(document: Document) -> Iterator[ContentBlock]:
    """Iterate over all ContentBlocks of a document, independent of its source format."""
    for structural_node in document.children:
        for block in structural_node.children:
            yield block

Filter Node

def filter_by_type(document: Document, content_type: str) -> list[ContentBlock]:
    """Return all ContentBlocks of the given type (e.g. 'image', 'table')."""
    return [
        block for block in walk_content_blocks(document)
        if block.content_type == content_type
    ]

Loop Node

def process_all_documents(archive: Archive) -> None:
    """Process every document in an archive with identical logic."""
    for child in archive.children:
        if isinstance(child, Archive):
            process_all_documents(child)   # recurse into nested archives
        else:
            for block in walk_content_blocks(child):
                # Identical processing, whether PDF, DOCX, PPTX, ...
                transform(block)   # transform is a placeholder for the actual processing step

Map Node

from typing import Callable, TypeVar

T = TypeVar("T")

def map_blocks(document: Document, fn: Callable[[ContentBlock], T]) -> list[T]:
    """Apply a function to every ContentBlock and collect the results."""
    return [fn(block) for block in walk_content_blocks(document)]

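Example: Extract all images from any document

import base64

images = filter_by_type(document, "image")
for img in images:
    # Derive the file extension from the MIME type; fall back to "bin" if it is missing.
    extension = img.mime_type.split("/")[1] if img.mime_type else "bin"
    save_image(  # save_image is a placeholder for the caller's own storage helper
        data=base64.b64decode(img.raw),
        filename=f"{img.id}.{extension}",
        alt_text=img.attributes.get("alt_text", ""),
    )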

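Example: Export all tables as CSV

import json

tables = filter_by_type(document, "table")
for table in tables:
    data = json.loads(table.raw)  # raw holds {"headers": [...], "rows": [[...], ...]}
    write_csv(  # write_csv is a placeholder for the caller's own export helper
        headers=data["headers"],
        rows=data["rows"],
        filename=f"{table.id}.csv",
    )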
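
The CSV export step itself can be done with the standard library. The following sketch converts the `raw` payload of a table block into CSV text; `table_raw_to_csv` is a hypothetical name, and writing the result to a file is left to the caller.

```python
import csv
import io
import json


def table_raw_to_csv(raw: str) -> str:
    """Convert the JSON payload of a table block into CSV text."""
    data = json.loads(raw)
    buffer = io.StringIO()
    # Use "\n" line endings instead of the csv module's default "\r\n".
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(data["headers"])
    writer.writerows(data["rows"])
    return buffer.getvalue()
```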

JSON Example

Complete example of an extracted PDF document:

{
  "id": "doc-a1b2c3",
  "role": "document",
  "source_type": "pdf",
  "source_path": "reports/quarterly-report-q3.pdf",
  "metadata": {
    "title": "Quarterly Report Q3 2025",
    "author": "Finance Team",
    "created_at": "2025-10-01T08:00:00Z",
    "modified_at": "2025-10-15T14:30:00Z",
    "source_path": "reports/quarterly-report-q3.pdf",
    "tags": ["finance", "quarterly"],
    "custom": {}
  },
  "children": [
    {
      "id": "sn-page-0",
      "role": "page",
      "index": 0,
      "label": null,
      "metadata": {
        "title": null,
        "author": null,
        "created_at": null,
        "modified_at": null,
        "source_path": "reports/quarterly-report-q3.pdf#page=1",
        "tags": [],
        "custom": {}
      },
      "children": [
        {
          "id": "cb-001",
          "content_type": "text",
          "raw": "Quarterly Report Q3 2025\n\nThis report summarizes the financial performance...",
          "mime_type": "text/plain",
          "language": null,
          "attributes": {
            "heading_level": 1,
            "style": "title"
          },
          "position": {
            "index": 0,
            "page": 1,
            "row": null,
            "col": null,
            "bbox": { "x": 50, "y": 30, "width": 500, "height": 40, "unit": "pt" }
          },
          "metadata": {
            "title": null,
            "author": null,
            "created_at": null,
            "modified_at": null,
            "source_path": "reports/quarterly-report-q3.pdf#page=1",
            "tags": [],
            "custom": {}
          }
        },
        {
          "id": "cb-002",
          "content_type": "table",
          "raw": "{\"headers\":[\"Metric\",\"Q2\",\"Q3\",\"Delta\"],\"rows\":[[\"Revenue\",\"1.2M\",\"1.5M\",\"+25%\"],[\"Costs\",\"800K\",\"850K\",\"+6%\"]]}",
          "mime_type": "application/json",
          "language": null,
          "attributes": {
            "row_count": 2,
            "col_count": 4,
            "has_header": true,
            "name": "Financial Overview"
          },
          "position": {
            "index": 1,
            "page": 1,
            "row": null,
            "col": null,
            "bbox": { "x": 50, "y": 200, "width": 500, "height": 120, "unit": "pt" }
          },
          "metadata": {
            "title": null,
            "author": null,
            "created_at": null,
            "modified_at": null,
            "source_path": "reports/quarterly-report-q3.pdf#page=1",
            "tags": [],
            "custom": {}
          }
        }
      ]
    }
  ]
}
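
Because the serialized form is plain JSON, it can be inspected without any UDM-specific classes. As a sketch, the function below loads such a document and counts its blocks per content type; `count_content_types` is a hypothetical name.

```python
import json
from collections import Counter


def count_content_types(document_json: str) -> Counter:
    """Count the ContentBlocks of a serialized Document per content_type."""
    document = json.loads(document_json)
    return Counter(
        block["content_type"]
        for node in document["children"]
        for block in node["children"]
    )
```

Applied to the example above, the result would contain one `text` and one `table` entry.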

Summary

Property	Value
Levels	Exactly 3 per document (Archive as an optional wrapper)
Structural units	`page`, `section`, `slide`, `sheet`
Content types	`text`, `image`, `table`, `code`, `media`, `link`, `formula`
Formats	PDF, DOCX, PPTX, XLSX, HTML (extensible)
Traversal	One generic walker for all formats
Serialization	JSON-compatible, directly usable in workflow engines
