Concept: Hierarchical Document Generation with Image Integration
Executive Summary
This concept proposes a three-phase hierarchical approach to document generation that enables proper image integration and handles complex documents efficiently.
Key Decisions:
- ✅ Performance: Parallel processing with ChatLog progress messages
- ✅ Error Handling: Skip failed sections, show error messages
- ✅ Image Storage: Store as base64 in JSON (renderers need direct access)
- ✅ Backward Compatibility: Not needed - implement as new default
Renderer Status:
- ✅ Ready: Text, Markdown, DOCX renderers
- ⚠️ Needs Update: HTML (create separate image files), PDF (embed images)
- ⚠️ Needs Implementation: XLSX, PPTX (add image support)
Problem Statement
Currently, the document generation system has the following limitations:
- No Image Integration: Images are generated separately but cannot be embedded into document structures
- Single-Pass Generation: Documents are generated in one AI call, making it difficult to handle complex sections (long text, images, chapters)
- Repeated Extraction: Content extraction may happen multiple times unnecessarily
- No Structured Approach: No mechanism to first define document structure, then populate sections
Current Architecture Analysis
Current Flow:
User Request → ai.generateDocument → ai.process → AI JSON Generation → Renderer → Final Document
Issues:
- AI generates complete JSON structure in one pass
- Images are generated separately via the ai.generate action
- No mechanism exists to integrate generated images into the document structure
- The JSON schema supports the image content_type, but the AI rarely generates it
- Content extraction happens per action, not cached/reused
Current Image Handling:
- Images can be rendered IF they exist in the JSON structure (content_type: "image")
- Image data is expected as base64Data in elements
- Renderers support image rendering (DOCX, PDF, HTML, etc.)
- But images are never generated WITHIN document generation
Proposed Solution: Hierarchical Document Generation
Core Concept
Three-Phase Approach:
- Structure Generation Phase: Generate document skeleton with section placeholders
- Content Generation Phase: Generate content for each section (text or image) via sub-prompts
- Integration Phase: Merge all generated content into final document structure
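The three phases above can be sketched as a single orchestration function. This is a minimal sketch: the function names, field layout, and stubbed return values are illustrative placeholders, not the final API.

```python
# Minimal sketch of the three-phase pipeline. All function names and the
# stubbed return values are illustrative, not the final API.

def generate_structure(request: str) -> dict:
    """Phase 1: produce a skeleton with empty section placeholders (stubbed)."""
    return {"documents": [{"sections": [
        {"id": "title", "content_type": "heading", "complexity": "simple", "elements": []},
        {"id": "img1", "content_type": "image", "complexity": "complex", "elements": []},
    ]}]}

def generate_section_content(section: dict) -> list:
    """Phase 2: fill one section (stubbed; real code calls the AI or image action)."""
    if section["content_type"] == "image":
        return [{"base64Data": "<base64>", "altText": section["id"]}]
    return [{"text": f"Generated text for {section['id']}"}]

def generate_document(request: str) -> dict:
    structure = generate_structure(request)            # Phase 1: skeleton
    for doc in structure["documents"]:                 # Phase 2: per-section content
        for section in doc["sections"]:
            section["elements"] = generate_section_content(section)
    # Phase 3: completeness check before handing off to a renderer
    assert all(s["elements"] for d in structure["documents"] for s in d["sections"])
    return structure
```

The key design point is that Phase 2 operates on one section at a time, so each section can be dispatched to a different specialized action.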
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ Phase 1: Structure Generation │
│ - Generate document skeleton │
│ - Identify sections (text, image, complex) │
│ - Create section placeholders with metadata │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Phase 2: Content Generation (Tree-like) │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Section 1: Heading (simple) │ │
│ │ → Generate directly │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Section 2: Paragraph (simple) │ │
│ │ → Generate directly │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Section 3: Image (complex) │ │
│ │ → Sub-prompt: Generate image │ │
│ │ → Store image data │ │
│ │ → Create image section with base64Data │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Section 4: Long Chapter (complex) │ │
│ │ → Sub-prompt: Generate chapter content │ │
│ │ → Split into subsections if needed │ │
│ └──────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Phase 3: Integration │
│ - Merge all generated content │
│ - Replace placeholders with actual data │
│ - Validate structure completeness │
│ - Render to final format │
└─────────────────────────────────────────────────────────────┘
Detailed Design
Phase 1: Structure Generation
Purpose: Create document skeleton with section metadata
Process:
- AI generates document structure with sections
- Each section includes:
  - id: Unique identifier
  - content_type: Type (heading, paragraph, image, table, etc.)
  - complexity: "simple" or "complex"
  - generation_hint: Instructions for content generation
  - order: Section order
  - elements: Empty or placeholder
Example Structure:
{
"metadata": {
"title": "Children's Bedtime Story",
"split_strategy": "single_document"
},
"documents": [{
"id": "doc_1",
"sections": [
{
"id": "section_title",
"content_type": "heading",
"complexity": "simple",
"generation_hint": "Story title",
"order": 1,
"elements": []
},
{
"id": "section_intro",
"content_type": "paragraph",
"complexity": "simple",
"generation_hint": "Introduction paragraph",
"order": 2,
"elements": []
},
{
"id": "section_image_1",
"content_type": "image",
"complexity": "complex",
"generation_hint": "Illustration: Rabbit meeting owl in moonlit forest",
"image_prompt": "A small brown rabbit sitting in a peaceful forest clearing under moonlight with stars, meeting a wise owl perched on a branch",
"order": 3,
"elements": []
},
{
"id": "section_chapter_1",
"content_type": "paragraph",
"complexity": "complex",
"generation_hint": "First chapter: Rabbit's adventure begins",
"order": 4,
"elements": []
}
]
}]
}
Phase 2: Content Generation
Purpose: Generate actual content for each section
Process:
- Iterate through sections in order
- For each section:
  - Simple sections (heading, short paragraph):
    - Generate content directly via AI
    - Populate the elements array
  - Complex sections (image, long chapter):
    - Create a sub-prompt based on generation_hint and image_prompt
    - Generate content via a specialized action:
      - Images: ai.generate with image generation
      - Long text: ai.process with a focused prompt
    - Store generated content
    - Populate the elements array
Content Caching:
- Extract content from source documents ONCE at the start
- Cache extracted content for reuse across all sections
- Pass cached content to sub-prompts to avoid re-extraction
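The extract-once behavior can be captured in a small cache wrapper. This is a sketch: `ContentCache` and `extractor` are hypothetical names standing in for the real extraction component.

```python
class ContentCache:
    """Extract source content once and reuse it for every section sub-prompt.

    `extractor` stands in for the real extraction step; the class itself is
    a sketch, not an existing component.
    """

    def __init__(self, extractor):
        self._extractor = extractor
        self._cache: dict = {}

    def get(self, source_id: str) -> str:
        if source_id not in self._cache:  # extract only on the first request
            self._cache[source_id] = self._extractor(source_id)
        return self._cache[source_id]
```

Every sub-prompt then calls `cache.get(...)` instead of re-running extraction, so a 50-section document still extracts each source exactly once.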
Image Generation:
- For content_type: "image" sections:
  - Use the image_prompt from the structure
  - Call the ai.generate action with image generation
  - Receive base64 image data
  - Create an image element:
    { "url": "data:image/png;base64,<base64_data>", "base64Data": "<base64_data>", "altText": "<alt_text>", "caption": "<caption>" }
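Building that element from raw image bytes is mechanical. A sketch, with a hypothetical helper name (the real code would live in the content generator):

```python
import base64

def make_image_element(image_bytes: bytes, alt_text: str, caption: str = "") -> dict:
    """Build the image element stored in a section's elements array.

    The payload is stored both as a data URI (url) and as raw base64
    (base64Data) so each renderer can pick whichever form it needs.
    Hypothetical helper; PNG is assumed here.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "url": f"data:image/png;base64,{b64}",
        "base64Data": b64,
        "altText": alt_text,
        "caption": caption,
    }
```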
Phase 3: Integration
Purpose: Merge all content into final document structure
Process:
- Validate all sections have content
- Merge generated content into structure
- Replace placeholders with actual data
- Finalize JSON structure
- Render to target format (docx, pdf, html, etc.)
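The merge-and-validate steps can be sketched as one function. Names are illustrative; the key point is that content is matched back to placeholders by section id, and an incomplete document fails loudly before rendering.

```python
def integrate(document: dict, generated: dict) -> dict:
    """Phase 3 sketch: merge generated content into the skeleton by section
    id, then fail on any section left unfilled. Names are illustrative."""
    for section in document["sections"]:
        section["elements"] = generated.get(section["id"], section["elements"])
    missing = [s["id"] for s in document["sections"] if not s["elements"]]
    if missing:
        raise ValueError(f"sections without content: {missing}")
    return document
```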
Implementation Strategy
New Components Needed
- Structure Generator (structureGenerator.py)
  - Generates document skeleton
  - Identifies section complexity
  - Creates generation hints
- Content Generator (contentGenerator.py)
  - Generates content for each section
  - Handles simple vs. complex sections
  - Manages sub-prompts and image generation
  - Caches extracted content
- Content Integrator (contentIntegrator.py)
  - Merges generated content
  - Validates completeness
  - Finalizes document structure
Modified Components
- generateDocument action
  - Implement hierarchical generation as the default
  - Orchestrate the three phases
  - Add progress logging for each phase
- process action
  - Support content caching (extract once, reuse)
  - Support sub-prompt generation for sections
- Prompt Builder (subPromptBuilderGeneration.py)
  - Add structure generation prompt
  - Add section-specific content prompts
  - Add image generation prompt templates
- Renderers (update required):
- HTML Renderer: Create separate image files and link them
- PDF Renderer: Embed images using reportlab
- XLSX Renderer: Add image embedding support
- PPTX Renderer: Add image embedding support
New Action Parameters
For generateDocument:
- enableImageIntegration: boolean (default: true)
- maxSectionLength: int (threshold for "complex" sections, default: 500 words)
- parallelGeneration: boolean (default: true) - enable parallel section generation
- progressLogging: boolean (default: true) - send ChatLog progress updates
For sub-prompts:
- sectionContext: Previous sections for context
- cachedContent: Extracted content cache (to avoid re-extraction)
- targetSection: Section metadata
- previousSections: Array of already-generated sections for continuity
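The generateDocument parameter set above maps naturally onto a small dataclass. A sketch only; the class name is hypothetical and the defaults are those listed above.

```python
from dataclasses import dataclass

@dataclass
class GenerateDocumentParams:
    """Proposed generateDocument parameters (hypothetical container class)."""
    enableImageIntegration: bool = True
    maxSectionLength: int = 500   # word-count threshold for "complex" sections
    parallelGeneration: bool = True
    progressLogging: bool = True
```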
Benefits
- Image Integration: Images can be generated and embedded into documents
- Structured Approach: Clear separation of structure and content
- Efficiency: Content extracted once, reused across sections
- Scalability: Can handle very long documents by splitting into sections
- Quality: Better control over complex sections (images, long chapters)
- Flexibility: Can generate different content types per section
Migration Strategy
Note: No backwards compatibility needed - can implement directly as new default.
- Phase 1: Implement hierarchical generation as new default
- Phase 2: Update renderers (HTML, PDF, XLSX, PPTX) for image support
- Phase 3: Testing and refinement
- Phase 4: Remove old single-pass mode (or keep as internal fallback only)
Example Workflow
User Request: "Create a children's bedtime story with 5 illustrations"
Phase 1 Output:
{
"metadata": {"title": "Flöckchen's Adventure"},
"documents": [{
"sections": [
{"id": "title", "content_type": "heading", "complexity": "simple", ...},
{"id": "intro", "content_type": "paragraph", "complexity": "simple", ...},
{"id": "img1", "content_type": "image", "complexity": "complex",
"image_prompt": "Rabbit meeting owl", ...},
{"id": "chapter1", "content_type": "paragraph", "complexity": "complex", ...},
{"id": "img2", "content_type": "image", "complexity": "complex", ...},
...
]
}]
}
Phase 2 Process:
- Generate title → populate elements
- Generate intro → populate elements
- Generate image 1 → call ai.generate, store base64 → populate elements
- Generate chapter 1 → sub-prompt → populate elements
- Generate image 2 → call ai.generate, store base64 → populate elements
- ...
Phase 3 Output: Complete document with all sections populated, ready for rendering
Renderer Readiness Assessment
Current Renderer Status for Image Handling:
- Text Renderer (rendererText.py): ✅ READY
  - Skips images, shows placeholder: [Image: altText]
  - No changes needed
- Markdown Renderer (rendererMarkdown.py): ✅ READY
  - Shows a placeholder with truncated base64 data
  - No changes needed (Markdown limitation)
- HTML Renderer (rendererHtml.py): ⚠️ NEEDS UPDATE
  - Currently: Embeds base64 directly in the <img> tag as a data URI
  - Required Change: Create separate image files and link to them
  - Implementation: Generate image files (e.g., image_1.png, image_2.png) alongside the HTML
  - Update <img> tags to use relative paths: <img src="image_1.png" alt="...">
  - Return multiple files: the HTML file plus its image files
- PDF Renderer (rendererPdf.py): ⚠️ NEEDS UPDATE
  - Currently: Shows placeholder [Image: altText]
  - Required Change: Embed images directly in the PDF using reportlab
  - Implementation: Use reportlab.platypus.Image() with base64-decoded bytes
- DOCX Renderer (rendererDocx.py): ✅ READY
  - Embeds images directly using doc.add_picture()
  - Adds captions below images
  - No changes needed
- XLSX Renderer (rendererXlsx.py): ⚠️ NEEDS IMPLEMENTATION
  - Currently: No image handling found
  - Required Change: Add image support using openpyxl
  - Implementation: Use openpyxl.drawing.image.Image() to embed images, anchored to worksheet cells or as floating images
- PPTX Renderer (rendererPptx.py): ⚠️ NEEDS IMPLEMENTATION
  - Currently: No image handling found
  - Required Change: Add image support using python-pptx
  - Implementation: Use slide.shapes.add_picture() to add images to slides
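The HTML renderer change (decode base64 elements to files, rewrite urls to relative paths) can be sketched with the standard library alone. The function name and file-naming scheme are illustrative, and PNG output is assumed.

```python
import base64
from pathlib import Path

def export_images_for_html(sections: list, out_dir: str) -> list:
    """Decode base64Data elements to image files and rewrite their urls to
    relative paths, as the HTML renderer update requires. Sketch only;
    assumes PNG and a sequential image_<n>.png naming scheme."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    counter = 1
    for section in sections:
        for element in section.get("elements", []):
            if "base64Data" in element:
                name = f"image_{counter}.png"
                (out / name).write_bytes(base64.b64decode(element["base64Data"]))
                element["url"] = name   # rendered as <img src="image_1.png">
                written.append(name)
                counter += 1
    return written
```

The renderer would then return the HTML file together with the `written` list, satisfying the "multiple files" requirement.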
Renderer Update Requirements:
Priority 1 (Critical for HTML output):
- HTML Renderer: Create separate image files and link them
Priority 2 (Important for document formats):
- PDF Renderer: Embed images using reportlab
- XLSX Renderer: Add image embedding support
- PPTX Renderer: Add image embedding support
Answers to Open Questions
1. Performance: How to handle very large documents (100+ sections)?
Answer: Use parallel processing where possible, with progress ChatLog messages.
Implementation Strategy:
- Parallel Section Generation: Generate independent sections in parallel using asyncio
- Batch Processing: Process sections in batches (e.g., 10 sections at a time)
- Progress Tracking: Send ChatLog progress updates:
- "Generating structure..." (Phase 1)
- "Generating content for section X/Y..." (Phase 2)
- "Generating image for section X..." (Phase 2 - images)
- "Merging content..." (Phase 3)
- "Rendering final document..." (Phase 3)
- Streaming: For very large documents, consider streaming partial results
Example Progress Messages:
Phase 1: Structure Generation (0% → 33%)
Phase 2: Content Generation (33% → 90%)
- Section 1/10: Heading (34%)
- Section 2/10: Paragraph (40%)
- Section 3/10: Image generation (50%)
- Section 4/10: Chapter (60%)
...
Phase 3: Integration & Rendering (90% → 100%)
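The parallel-with-progress strategy can be sketched with asyncio and a semaphore as the batching cap. `worker` stands in for the real per-section AI call, and the print is a stand-in for the ChatLog update.

```python
import asyncio

async def generate_sections(sections, worker, max_parallel=10):
    """Generate independent sections concurrently with a concurrency cap,
    reporting progress per finished section. Sketch: `worker` stands in for
    the real per-section AI call; print() stands in for ChatLog."""
    sem = asyncio.Semaphore(max_parallel)
    done = 0

    async def run_one(section):
        nonlocal done
        async with sem:                 # at most max_parallel in flight
            result = await worker(section)
        done += 1
        print(f"Generating content for section {done}/{len(sections)}...")
        return result

    # gather() preserves input order, so results line up with sections
    return await asyncio.gather(*(run_one(s) for s in sections))
```

With a cap of 10 this behaves like the batch-of-10 strategy, but without idle waiting at batch boundaries.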
2. Error Handling: What if one section fails?
Answer: Skip failed sections, keep section title and type, show error message in the section.
Implementation Strategy:
- Graceful Degradation: Continue processing remaining sections
- Error Section: Create an error placeholder section:
  {
    "id": "section_failed_3",
    "content_type": "paragraph",
    "elements": [{ "text": "[ERROR: Failed to generate content for this section. Error: <error_message>]" }],
    "order": 3,
    "error": true,
    "errorMessage": "<detailed_error>"
  }
- Logging: Log errors for debugging but don't fail the entire document
- User Notification: Include error count in final progress message
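The skip-and-placeholder behavior amounts to a try/except around each section's generator. A sketch with hypothetical names:

```python
def generate_with_fallback(section: dict, generator) -> dict:
    """Run a section generator; on failure keep the section's id and type
    but swap in an error placeholder so the rest of the document still
    renders. Sketch: `generator` stands in for the real per-section call."""
    try:
        section["elements"] = generator(section)
    except Exception as exc:
        section["elements"] = [{
            "text": f"[ERROR: Failed to generate content for this section. Error: {exc}]"
        }]
        section["error"] = True
        section["errorMessage"] = str(exc)
    return section
```

Counting sections where `section.get("error")` is true after Phase 2 gives the error count for the final progress message.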
3. Image Storage: Where to store generated images?
Answer: Store images in JSON as base64, as renderers need them afterwards.
Implementation Strategy:
- In-Memory Storage: Keep base64 strings in JSON structure during generation
- JSON Structure: Store in section elements:
  {
    "url": "data:image/png;base64,<base64_data>",
    "base64Data": "<full_base64_string>",
    "altText": "Image description",
    "caption": "Optional caption"
  }
- Memory Management: For very large images, consider compression or chunking
- Renderer Access: All renderers can access base64Data directly from the JSON
- HTML Special Case: The HTML renderer will extract the base64, decode it, and save it as separate files during rendering
4. Backward Compatibility: How to ensure existing workflows still work?
Answer: No backwards compatibility needed.
Implementation Strategy:
- New Default: Hierarchical generation becomes the default mode
- Clean Migration: All document generation uses hierarchical approach
- No Fallback: Remove single-pass mode (or keep as internal fallback only)
- Breaking Change: Acceptable since this is a new feature/enhancement
Next Steps
- Review and Approval: Get feedback on concept
- Detailed Design: Design API and data structures
- Prototype: Implement Phase 1 (structure generation)
- Testing: Test with real use cases
- Full Implementation: Implement all phases
- Migration: Migrate existing workflows