gateway/modules/workflows/processing/shared/CONCEPT_HIERARCHICAL_DOCUMENT_GENERATION.md
2025-12-23 00:34:15 +01:00

Concept: Hierarchical Document Generation with Image Integration

Executive Summary

This concept proposes a three-phase hierarchical approach to document generation that enables proper image integration and handles complex documents efficiently.

Key Decisions:

  • Performance: Parallel processing with ChatLog progress messages
  • Error Handling: Skip failed sections, show error messages
  • Image Storage: Store as base64 in JSON (renderers need direct access)
  • Backward Compatibility: Not needed - implement as new default

Renderer Status:

  • ✅ Ready: Text, Markdown, DOCX renderers
  • ⚠️ Needs Update: HTML (create separate image files), PDF (embed images)
  • ⚠️ Needs Implementation: XLSX, PPTX (add image support)

Problem Statement

Currently, the document generation system has the following limitations:

  1. No Image Integration: Images are generated separately but cannot be embedded into document structures
  2. Single-Pass Generation: Documents are generated in one AI call, making it difficult to handle complex sections (long text, images, chapters)
  3. Repeated Extraction: Content extraction may happen multiple times unnecessarily
  4. No Structured Approach: No mechanism to first define document structure, then populate sections

Current Architecture Analysis

Current Flow:

User Request → ai.generateDocument → ai.process → AI JSON Generation → Renderer → Final Document

Issues:

  • AI generates complete JSON structure in one pass
  • Images are generated separately via ai.generate action
  • No mechanism to integrate generated images into document structure
  • JSON schema supports image content_type, but AI rarely generates it
  • Content extraction happens per action, not cached/reused

Current Image Handling:

  • Images can be rendered IF they exist in JSON structure (content_type: "image")
  • Image data expected as base64Data in elements
  • Renderers support image rendering (Docx, PDF, HTML, etc.)
  • But images are never generated WITHIN document generation

Proposed Solution: Hierarchical Document Generation

Core Concept

Three-Phase Approach:

  1. Structure Generation Phase: Generate document skeleton with section placeholders
  2. Content Generation Phase: Generate content for each section (text or image) via sub-prompts
  3. Integration Phase: Merge all generated content into final document structure

Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│ Phase 1: Structure Generation                                │
│ - Generate document skeleton                                 │
│ - Identify sections (text, image, complex)                   │
│ - Create section placeholders with metadata                  │
└──────────────────────────────────────────────────────────────┘
                        ↓
┌──────────────────────────────────────────────────────────────┐
│ Phase 2: Content Generation (Tree-like)                      │
│                                                              │
│  ┌──────────────────────────────────────────────┐            │
│  │ Section 1: Heading (simple)                  │            │
│  │ → Generate directly                          │            │
│  └──────────────────────────────────────────────┘            │
│                                                              │
│  ┌──────────────────────────────────────────────┐            │
│  │ Section 2: Paragraph (simple)                │            │
│  │ → Generate directly                          │            │
│  └──────────────────────────────────────────────┘            │
│                                                              │
│  ┌──────────────────────────────────────────────┐            │
│  │ Section 3: Image (complex)                   │            │
│  │ → Sub-prompt: Generate image                 │            │
│  │ → Store image data                           │            │
│  │ → Create image section with base64Data       │            │
│  └──────────────────────────────────────────────┘            │
│                                                              │
│  ┌──────────────────────────────────────────────┐            │
│  │ Section 4: Long Chapter (complex)            │            │
│  │ → Sub-prompt: Generate chapter content       │            │
│  │ → Split into subsections if needed           │            │
│  └──────────────────────────────────────────────┘            │
└──────────────────────────────────────────────────────────────┘
                        ↓
┌──────────────────────────────────────────────────────────────┐
│ Phase 3: Integration                                         │
│ - Merge all generated content                                │
│ - Replace placeholders with actual data                      │
│ - Validate structure completeness                            │
│ - Render to final format                                     │
└──────────────────────────────────────────────────────────────┘

Detailed Design

Phase 1: Structure Generation

Purpose: Create document skeleton with section metadata

Process:

  1. AI generates document structure with sections
  2. Each section includes:
    • id: Unique identifier
    • content_type: Type (heading, paragraph, image, table, etc.)
    • complexity: "simple" or "complex"
    • generation_hint: Instructions for content generation
    • order: Section order
    • elements: Empty or placeholder

Example Structure:

{
  "metadata": {
    "title": "Children's Bedtime Story",
    "split_strategy": "single_document"
  },
  "documents": [{
    "id": "doc_1",
    "sections": [
      {
        "id": "section_title",
        "content_type": "heading",
        "complexity": "simple",
        "generation_hint": "Story title",
        "order": 1,
        "elements": []
      },
      {
        "id": "section_intro",
        "content_type": "paragraph",
        "complexity": "simple",
        "generation_hint": "Introduction paragraph",
        "order": 2,
        "elements": []
      },
      {
        "id": "section_image_1",
        "content_type": "image",
        "complexity": "complex",
        "generation_hint": "Illustration: Rabbit meeting owl in moonlit forest",
        "image_prompt": "A small brown rabbit sitting in a peaceful forest clearing under moonlight with stars, meeting a wise owl perched on a branch",
        "order": 3,
        "elements": []
      },
      {
        "id": "section_chapter_1",
        "content_type": "paragraph",
        "complexity": "complex",
        "generation_hint": "First chapter: Rabbit's adventure begins",
        "order": 4,
        "elements": []
      }
    ]
  }]
}

Phase 2: Content Generation

Purpose: Generate actual content for each section

Process:

  1. Iterate through sections in order
  2. For each section:
    • Simple sections (heading, short paragraph):
      • Generate content directly via AI
      • Populate elements array
    • Complex sections (image, long chapter):
      • Create sub-prompt based on generation_hint and image_prompt
      • Generate content via specialized action:
        • Images: ai.generate with image generation
        • Long text: ai.process with focused prompt
      • Store generated content
      • Populate elements array
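
The dispatch in step 2 can be sketched as a small router. The three generator callables are hypothetical stand-ins for the real AI actions (direct generation, ai.generate for images, ai.process for long text):

```python
def generate_section_content(section, cached_content,
                             generate_simple, generate_image, generate_complex):
    """Route a section to the appropriate generator based on its metadata.
    The three callables are placeholders for the real AI actions."""
    if section.get("complexity") == "simple":
        # Headings and short paragraphs: generate directly
        return generate_simple(section["generation_hint"], cached_content)
    if section["content_type"] == "image":
        # Image sections carry their own image_prompt from Phase 1
        return generate_image(section["image_prompt"])
    # Remaining complex sections (long chapters) get a focused sub-prompt
    return generate_complex(section["generation_hint"], cached_content)
```
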

Content Caching:

  • Extract content from source documents ONCE at the start
  • Cache extracted content for reuse across all sections
  • Pass cached content to sub-prompts to avoid re-extraction
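
A minimal sketch of such a cache, assuming a single extract_fn entry point that performs the (expensive) extraction:

```python
class ContentCache:
    """Extract each source document at most once and reuse the result
    across all section sub-prompts. extract_fn stands in for the real
    extraction call."""

    def __init__(self, extract_fn):
        self._extract = extract_fn
        self._cache = {}

    def get(self, doc_id):
        # Extraction runs only on the first request for a given document
        if doc_id not in self._cache:
            self._cache[doc_id] = self._extract(doc_id)
        return self._cache[doc_id]
```
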

Image Generation:

  • For content_type: "image" sections:
    • Use image_prompt from structure
    • Call ai.generate action with image generation
    • Receive base64 image data
    • Create image element:
      {
        "url": "data:image/png;base64,<base64_data>",
        "base64Data": "<base64_data>",
        "altText": "<alt_text>",
        "caption": "<caption>"
      }
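
A small helper (hypothetical name make_image_element) could build that element shape from the base64 data returned by ai.generate; PNG is assumed here:

```python
def make_image_element(base64_data, alt_text, caption=None):
    """Build an image element in the shape shown above (PNG assumed)."""
    return {
        "url": f"data:image/png;base64,{base64_data}",
        "base64Data": base64_data,
        "altText": alt_text,
        "caption": caption or "",
    }
```
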
      

Phase 3: Integration

Purpose: Merge all content into final document structure

Process:

  1. Validate all sections have content
  2. Merge generated content into structure
  3. Replace placeholders with actual data
  4. Finalize JSON structure
  5. Render to target format (docx, pdf, html, etc.)

Implementation Strategy

New Components Needed

  1. Structure Generator (structureGenerator.py)

    • Generates document skeleton
    • Identifies section complexity
    • Creates generation hints
  2. Content Generator (contentGenerator.py)

    • Generates content for each section
    • Handles simple vs complex sections
    • Manages sub-prompts and image generation
    • Caches extracted content
  3. Content Integrator (contentIntegrator.py)

    • Merges generated content
    • Validates completeness
    • Finalizes document structure

Modified Components

  1. generateDocument action

    • Implement hierarchical generation as default
    • Orchestrate three phases
    • Add progress logging for each phase
  2. process action

    • Support content caching (extract once, reuse)
    • Support sub-prompt generation for sections
  3. Prompt Builder (subPromptBuilderGeneration.py)

    • Add structure generation prompt
    • Add section-specific content prompts
    • Add image generation prompt templates
  4. Renderers (Update required):

    • HTML Renderer: Create separate image files and link them
    • PDF Renderer: Embed images using reportlab
    • XLSX Renderer: Add image embedding support
    • PPTX Renderer: Add image embedding support

New Action Parameters

For generateDocument:

  • enableImageIntegration: boolean (default: true)
  • maxSectionLength: int (threshold for "complex" sections, default: 500 words)
  • parallelGeneration: boolean (default: true) - enable parallel section generation
  • progressLogging: boolean (default: true) - send ChatLog progress updates
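
Classification against maxSectionLength could be sketched as below; expected_words is an assumed metadata field the structure phase would have to emit:

```python
def classify_section(section, max_section_length=500):
    """Classify a planned section as 'simple' or 'complex'.
    Images are always complex; text is complex when its expected length
    (expected_words, an assumed field) exceeds the threshold."""
    if section["content_type"] == "image":
        return "complex"
    if section.get("expected_words", 0) > max_section_length:
        return "complex"
    return "simple"
```
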

For sub-prompts:

  • sectionContext: Previous sections for context
  • cachedContent: Extracted content cache (to avoid re-extraction)
  • targetSection: Section metadata
  • previousSections: Array of already-generated sections for continuity
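
A sub-prompt builder taking these parameters might be sketched as follows; the helper name and prompt wording are illustrative only:

```python
def build_section_prompt(target_section, cached_content, previous_sections):
    """Assemble a focused sub-prompt for one section, passing along the
    cached extraction and prior sections for continuity."""
    context = "\n".join(s.get("summary", "") for s in previous_sections)
    return (
        f"Generate content for section '{target_section['id']}' "
        f"({target_section['content_type']}).\n"
        f"Instructions: {target_section['generation_hint']}\n"
        f"Source material:\n{cached_content}\n"
        f"Previously generated sections:\n{context}"
    )
```
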

Benefits

  1. Image Integration: Images can be generated and embedded into documents
  2. Structured Approach: Clear separation of structure and content
  3. Efficiency: Content extracted once, reused across sections
  4. Scalability: Can handle very long documents by splitting into sections
  5. Quality: Better control over complex sections (images, long chapters)
  6. Flexibility: Can generate different content types per section

Migration Strategy

Note: No backward compatibility is needed; hierarchical generation can be implemented directly as the new default.

  1. Phase 1: Implement hierarchical generation as new default
  2. Phase 2: Update renderers (HTML, PDF, XLSX, PPTX) for image support
  3. Phase 3: Testing and refinement
  4. Phase 4: Remove old single-pass mode (or keep as internal fallback only)

Example Workflow

User Request: "Create a children's bedtime story with 5 illustrations"

Phase 1 Output:

{
  "metadata": {"title": "Flöckchen's Adventure"},
  "documents": [{
    "sections": [
      {"id": "title", "content_type": "heading", "complexity": "simple", ...},
      {"id": "intro", "content_type": "paragraph", "complexity": "simple", ...},
      {"id": "img1", "content_type": "image", "complexity": "complex", 
       "image_prompt": "Rabbit meeting owl", ...},
      {"id": "chapter1", "content_type": "paragraph", "complexity": "complex", ...},
      {"id": "img2", "content_type": "image", "complexity": "complex", ...},
      ...
    ]
  }]
}

Phase 2 Process:

  • Generate title → populate elements
  • Generate intro → populate elements
  • Generate image 1 → call ai.generate, store base64 → populate elements
  • Generate chapter 1 → sub-prompt → populate elements
  • Generate image 2 → call ai.generate, store base64 → populate elements
  • ...

Phase 3 Output: Complete document with all sections populated, ready for rendering

Renderer Readiness Assessment

Current Renderer Status for Image Handling:

  1. Text Renderer (rendererText.py): READY

    • Skips images, shows placeholder: [Image: altText]
    • No changes needed
  2. Markdown Renderer (rendererMarkdown.py): READY

    • Shows placeholder with truncated base64: ![altText](data:image/png;base64,...)
    • No changes needed (markdown limitation)
  3. HTML Renderer (rendererHtml.py): ⚠️ NEEDS UPDATE

    • Currently: Embeds base64 directly in <img> tag as data URI
    • Required Change: Create separate image files and link to them
    • Implementation: Generate image files (e.g., image_1.png, image_2.png) alongside HTML
    • Update <img> tags to use relative paths: <img src="image_1.png" alt="...">
    • Return multiple files: HTML file + image files
  4. PDF Renderer (rendererPdf.py): ⚠️ NEEDS UPDATE

    • Currently: Shows placeholder [Image: altText]
    • Required Change: Embed images directly in PDF using reportlab
    • Implementation: Use reportlab.platypus.Image() with base64 decoded bytes
  5. DOCX Renderer (rendererDocx.py): READY

    • Embeds images directly using doc.add_picture()
    • Adds captions below images
    • No changes needed
  6. XLSX Renderer (rendererXlsx.py): ⚠️ NEEDS IMPLEMENTATION

    • Currently: No image handling found
    • Required Change: Add image support using openpyxl
    • Implementation: Use openpyxl.drawing.image.Image() to embed images in cells
    • Store images in worksheet cells or as floating images
  7. PPTX Renderer (rendererPptx.py): ⚠️ NEEDS IMPLEMENTATION

    • Currently: No image handling found
    • Required Change: Add image support using python-pptx
    • Implementation: Use slide.shapes.add_picture() to add images to slides

Renderer Update Requirements:

Priority 1 (Critical for HTML output):

  • HTML Renderer: Create separate image files and link them
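
The HTML change can be sketched with the standard library alone; the image_N.png naming follows the implementation note above, and the regex assumes PNG data URIs:

```python
import base64
import os
import re


def extract_images_to_files(html, output_dir):
    """Replace base64 data URIs in <img> tags with relative file links,
    writing each image as image_N.png alongside the HTML file."""
    os.makedirs(output_dir, exist_ok=True)
    counter = 0

    def _replace(match):
        nonlocal counter
        counter += 1
        filename = f"image_{counter}.png"
        with open(os.path.join(output_dir, filename), "wb") as f:
            f.write(base64.b64decode(match.group(1)))
        return f'src="{filename}"'

    return re.sub(r'src="data:image/png;base64,([^"]+)"', _replace, html)
```
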

Priority 2 (Important for document formats):

  • PDF Renderer: Embed images using reportlab
  • XLSX Renderer: Add image embedding support
  • PPTX Renderer: Add image embedding support

Answers to Open Questions

1. Performance: How to handle very large documents (100+ sections)?

Answer: Use parallel processing where possible, with progress ChatLog messages.

Implementation Strategy:

  • Parallel Section Generation: Generate independent sections in parallel using asyncio
  • Batch Processing: Process sections in batches (e.g., 10 sections at a time)
  • Progress Tracking: Send ChatLog progress updates:
    • "Generating structure..." (Phase 1)
    • "Generating content for section X/Y..." (Phase 2)
    • "Generating image for section X..." (Phase 2 - images)
    • "Merging content..." (Phase 3)
    • "Rendering final document..." (Phase 3)
  • Streaming: For very large documents, consider streaming partial results
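
The batching strategy above can be sketched with asyncio; generate_one and the progress callable stand in for the real section generator and ChatLog sender:

```python
import asyncio


async def generate_sections_in_batches(sections, generate_one,
                                       batch_size=10, progress=print):
    """Generate sections concurrently in fixed-size batches, emitting a
    progress message per batch (a stand-in for ChatLog updates)."""
    results = []
    total = len(sections)
    for start in range(0, total, batch_size):
        batch = sections[start:start + batch_size]
        progress(f"Generating content for sections "
                 f"{start + 1}-{start + len(batch)}/{total}...")
        # Sections within a batch run concurrently; batches run in order
        results.extend(await asyncio.gather(*(generate_one(s) for s in batch)))
    return results
```
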

Example Progress Messages:

Phase 1: Structure Generation (0% → 33%)
Phase 2: Content Generation (33% → 90%)
  - Section 1/10: Heading (34%)
  - Section 2/10: Paragraph (40%)
  - Section 3/10: Image generation (50%)
  - Section 4/10: Chapter (60%)
  ...
Phase 3: Integration & Rendering (90% → 100%)

2. Error Handling: What if one section fails?

Answer: Skip failed sections, keep the section title and type, and show an error message in the section.

Implementation Strategy:

  • Graceful Degradation: Continue processing remaining sections
  • Error Section: Create error placeholder section:
    {
      "id": "section_failed_3",
      "content_type": "paragraph",
      "elements": [{
        "text": "[ERROR: Failed to generate content for this section. Error: <error_message>]"
      }],
      "order": 3,
      "error": true,
      "errorMessage": "<detailed_error>"
    }
    
  • Logging: Log errors for debugging but don't fail entire document
  • User Notification: Include error count in final progress message
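
A wrapper implementing this fallback might look like the following sketch (generate_fn stands in for the real section generator):

```python
def generate_with_fallback(section, generate_fn):
    """Run a section generator; on failure, return the error placeholder
    described above instead of aborting the whole document."""
    try:
        section["elements"] = generate_fn(section)
        return section
    except Exception as exc:
        return {
            "id": f"section_failed_{section['order']}",
            "content_type": "paragraph",
            "elements": [{
                "text": f"[ERROR: Failed to generate content for this "
                        f"section. Error: {exc}]"
            }],
            "order": section["order"],
            "error": True,
            "errorMessage": str(exc),
        }
```
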

3. Image Storage: Where to store generated images?

Answer: Store images in JSON as base64, as renderers need them afterwards.

Implementation Strategy:

  • In-Memory Storage: Keep base64 strings in JSON structure during generation
  • JSON Structure: Store in section elements:
    {
      "url": "data:image/png;base64,<base64_data>",
      "base64Data": "<full_base64_string>",
      "altText": "Image description",
      "caption": "Optional caption"
    }
    
  • Memory Management: For very large images, consider compression or chunking
  • Renderer Access: All renderers can access base64Data directly from JSON
  • HTML Special Case: HTML renderer will extract base64, decode, and save as separate files during rendering

4. Backward Compatibility: How to ensure existing workflows still work?

Answer: No backward compatibility is needed.

Implementation Strategy:

  • New Default: Hierarchical generation becomes the default mode
  • Clean Migration: All document generation uses hierarchical approach
  • No Fallback: Remove single-pass mode (or keep as internal fallback only)
  • Breaking Change: Acceptable since this is a new feature/enhancement

Next Steps

  1. Review and Approval: Get feedback on concept
  2. Detailed Design: Design API and data structures
  3. Prototype: Implement Phase 1 (structure generation)
  4. Testing: Test with real use cases
  5. Full Implementation: Implement all phases
  6. Migration: Migrate existing workflows