first version of service center

implemented on chatbot
This commit is contained in:
Ida Dittrich 2026-03-05 16:23:01 +01:00
parent 6dc2afafb9
commit 53d2d9d873
111 changed files with 37504 additions and 49 deletions

app.py
View file

@ -324,6 +324,14 @@ async def lifespan(app: FastAPI):
except Exception as e:
logger.error(f"Feature catalog registration failed: {e}")
# Pre-warm service center modules (avoids first-request import latency)
try:
from modules.serviceCenter import preWarm
preWarm()
logger.info("Service center pre-warm completed")
except Exception as e:
logger.warning(f"Service center pre-warm failed (non-critical): {e}")
# Get event user for feature lifecycle (system-level user for background operations)
rootInterface = getRootInterface()
eventUser = rootInterface.getUserByUsername("event")

View file

@ -0,0 +1,318 @@
# Gateway Service Architecture Documentation
This document describes the structure, design patterns, and key components of the two service architectures in the gateway:
1. **`modules/serviceCenter`** — the new service center (context-based DI, RBAC-aware)
2. **`modules/services`** — the legacy services hub (monolithic hub, eager loading)
---
## 1. `modules/serviceCenter` — New Service Center
### 1.1 File/Folder Structure
```
modules/serviceCenter/
├── __init__.py # Public API: getService, preWarm, registerServiceObjects, can_access_service
├── context.py # ServiceCenterContext dataclass
├── registry.py # Service definitions (CORE_SERVICES, IMPORTABLE_SERVICES, RBAC)
├── resolver.py # Resolution logic, dependency injection, legacy fallback
├── core/ # Core services (internal building blocks, no RBAC)
│ ├── __init__.py
│ ├── serviceUtils/
│ │ └── mainServiceUtils.py
│ ├── serviceSecurity/
│ │ └── mainServiceSecurity.py
│ └── serviceStreaming/
│ └── mainServiceStreaming.py
└── services/ # Feature-facing importable services (RBAC-protected)
├── __init__.py
├── serviceAi/
├── serviceBilling/
├── serviceChat/
├── serviceExtraction/
├── serviceGeneration/
├── serviceMessaging/
├── serviceNeutralization/
├── serviceSharepoint/
├── serviceTicket/
└── serviceWeb/
```
**Design distinction:**
- **Core services** — Internal utilities (utils, security, streaming). Never directly requested by features. No RBAC. Security and streaming are fully migrated; legacy hub delegates to service center.
- **Importable services** — Feature-facing. Have `objectKey` for RBAC (e.g. `service.web`, `service.extraction`).
---
### 1.2 Service Registration and Discovery
**Registration:** Services are declared statically in `registry.py`.
- **CORE_SERVICES**: Internal services with `module`, `class`, `dependencies`.
- **IMPORTABLE_SERVICES**: Feature-facing services with `module`, `class`, `dependencies`, `objectKey`, `label`.
- **SERVICE_RBAC_OBJECTS**: Derived from IMPORTABLE_SERVICES for catalog registration.
**Discovery:** No dynamic discovery. All services are explicitly listed in the registry. Adding a service requires editing `registry.py`.
```python
# Example from registry.py
IMPORTABLE_SERVICES = {
"ai": {
"module": "modules.serviceCenter.services.serviceAi.mainServiceAi",
"class": "AiService",
"dependencies": ["chat", "utils", "extraction", "billing"],
"objectKey": "service.ai",
"label": {"en": "AI", "de": "KI", "fr": "IA"},
},
# ...
}
```
---
### 1.3 Dependency Injection and Factory Patterns
**Constructor pattern:** Services receive two arguments from the resolver:
1. `context: ServiceCenterContext` — user, mandate_id, feature_instance_id, workflow
2. `get_service: Callable[[str], Any]` — function to resolve other services by key
```python
# Service Center service constructor
def __init__(self, context, get_service: Callable[[str], Any]):
self._context = context
self._get_service = get_service
```
**Dependency resolution:**
- The resolver (`resolver.py`) builds a `get_service` callable that recursively resolves dependencies.
- Dependencies are declared in the registry (e.g. `"dependencies": ["chat", "utils", "extraction", "billing"]`).
- Circular dependencies are detected and raise `RuntimeError`.
- Resolution is cached per `(user, mandate_id, feature_instance_id)` + `key`.
**Legacy fallback:** If a service fails to load from the service center, the resolver falls back to the legacy `Services` hub when `legacy_hub` is provided.
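A minimal sketch of this resolution logic, assuming a simplified registry dict (`REGISTRY` here is illustrative; the real definitions live in `registry.py`, and `resolver.py` may differ in detail):
```python
import importlib

# Illustrative registry excerpt (real entries live in registry.py)
REGISTRY = {
    "ai": {
        "module": "modules.serviceCenter.services.serviceAi.mainServiceAi",
        "class": "AiService",
    },
}

def resolve(key, context, cache, resolving, legacy_hub=None):
    # Cache per (user, mandate_id, feature_instance_id) + key
    cache_key = f"{getattr(context.user, 'id', None)}_{context.mandate_id}_{context.feature_instance_id}_{key}"
    if cache_key in cache:
        return cache[cache_key]
    if key in resolving:
        raise RuntimeError(f"Circular service dependency detected: {key}")
    resolving.add(key)
    try:
        spec = REGISTRY[key]
        module = importlib.import_module(spec["module"])
        cls = getattr(module, spec["class"])
        # Each service receives a get_service callable that re-enters resolve()
        instance = cls(context, lambda k: resolve(k, context, cache, resolving, legacy_hub))
        cache[cache_key] = instance
        return instance
    except (ImportError, ModuleNotFoundError):
        if legacy_hub is not None:
            return getattr(legacy_hub, key)  # legacy fallback
        raise
    finally:
        resolving.discard(key)
```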
---
### 1.4 Main Entry Points and Usage Patterns
| Entry Point | Purpose |
|-------------|---------|
| `getService(key, context, legacy_hub=None)` | Resolve a service by key for the given context |
| `preWarm(service_keys=None)` | Pre-load service modules at startup (called in `app.py` lifespan) |
| `registerServiceObjects(catalogService)` | Register service RBAC objects (called via `registerAllFeaturesInCatalog`) |
| `can_access_service(user, rbac, service_key, ...)` | RBAC check for service access |
| `ServiceCenterContext(user, mandate_id, feature_instance_id, workflow)` | Context dataclass |
**Typical usage (chatbot feature):**
```python
from modules.serviceCenter import getService
from modules.serviceCenter.context import ServiceCenterContext
ctx = ServiceCenterContext(user=user, mandate_id=mandateId, feature_instance_id=featureInstanceId, workflow=workflow)
ai_service = getService("ai", ctx, legacy_hub=None)
chat_service = getService("chat", ctx, legacy_hub=None)
```
**Feature-level hub (e.g. chatbot):**
- `getChatbotServices()` builds a lightweight hub with only required services.
- Uses `REQUIRED_SERVICES` to resolve only `chat`, `ai`, `billing`, `streaming`.
- Returns a `_ChatbotServiceHub` object with `chat`, `ai`, `billing`, `streaming`, `interfaceDbComponent`, etc.
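A hedged usage sketch of this hub (`getFileInfo` and `getEventManager` are both referenced elsewhere in this commit; the surrounding variables are assumed to be in scope):
```python
from modules.features.chatbot.mainChatbot import getChatbotServices

# user, mandateId, instanceId, fileId assumed to be in scope
services = getChatbotServices(user, mandateId=mandateId, featureInstanceId=instanceId)
file_info = services.chat.getFileInfo(fileId)          # chat service via service center
event_manager = services.streaming.getEventManager()   # streaming core service
```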
---
### 1.5 Initialization and Bootstrapping
1. **`app.py` lifespan:**
- `registerAllFeaturesInCatalog(catalogService)` → calls `registerServiceObjects(catalogService)` for service RBAC objects
- `preWarm()` — imports all service modules to avoid first-request latency
2. **`registerAllFeaturesInCatalog` (modules/system/registry.py):**
- Registers system RBAC objects
- Registers service center RBAC objects via `registerServiceObjects`
- Registers feature RBAC objects
3. **First request:**
- `getService(key, ctx)` → `resolve()` loads module, instantiates class with `(context, get_service)`, caches instance
---
## 2. `modules/services` — Legacy Services Hub
### 2.1 File/Folder Structure
```
modules/services/
├── __init__.py # Services class, getInterface(), PublicService, _loadFeatureInterfaces, _loadFeatureServices
├── serviceAi/
│ └── mainServiceAi.py
├── serviceBilling/
│ └── mainServiceBilling.py
├── serviceChat/
│ └── mainServiceChat.py
├── serviceExtraction/
│ ├── extractors/
│ ├── chunking/
│ ├── merging/
│ ├── subRegistry.py
│ ├── subPipeline.py
│ └── mainServiceExtraction.py
├── serviceGeneration/
│ ├── paths/
│ ├── renderers/
│ └── mainServiceGeneration.py
├── serviceMessaging/
│ └── mainServiceMessaging.py
├── serviceNormalization/
│ └── mainServiceNormalization.py
├── serviceSharepoint/
│ └── mainServiceSharepoint.py
├── serviceStreaming/
│ ├── eventManager.py
│ ├── helpers.py
│ └── mainServiceStreaming.py
├── serviceTicket/
│ └── mainServiceTicket.py
├── serviceUtils/
│ └── mainServiceUtils.py
├── serviceWeb/
│ └── mainServiceWeb.py
└── serviceSecurity/
└── mainServiceSecurity.py
```
**No core vs. services split.** All services live in flat `serviceX/` directories.
---
### 2.2 Service Registration and Discovery
**Registration:** Services are **eagerly loaded** in `Services.__init__()` via hardcoded imports. No registry file.
**Discovery:**
- **Shared services:** Loaded explicitly in `__init__` from `modules/services/serviceX/mainServiceX.py`.
- **Feature services:** Discovered dynamically via `_loadFeatureServices()` — scans `modules/features/*/service*/mainService*.py` and instantiates classes ending with `"Service"`.
```python
# Shared services — hardcoded in Services.__init__
from .serviceSharepoint.mainServiceSharepoint import SharepointService
self.sharepoint = PublicService(SharepointService(self))
from .serviceChat.mainServiceChat import ChatService
self.chat = PublicService(ChatService(self))
# ... etc.
```
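A hedged sketch of what the dynamic feature-service discovery could look like as a `Services` method (the attribute-naming rule and error handling are assumptions; `PublicService` from section 2.3 is assumed to be in scope):
```python
import glob
import importlib
import inspect

def _loadFeatureServices(self):
    """Scan modules/features/*/service*/mainService*.py and instantiate *Service classes."""
    for path in glob.glob("modules/features/*/service*/mainService*.py"):
        module_name = path[:-3].replace("\\", "/").replace("/", ".")
        module = importlib.import_module(module_name)
        for name, cls in inspect.getmembers(module, inspect.isclass):
            if name.endswith("Service") and cls.__module__ == module_name:
                # e.g. AiService -> services.ai (feature override of the shared service)
                attr = name[: -len("Service")]
                attr = attr[0].lower() + attr[1:]
                setattr(self, attr, PublicService(cls(self)))
```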
---
### 2.3 Dependency Injection / Factory Patterns
**Constructor pattern:** Services receive the entire `Services` hub as their single dependency.
```python
# Legacy service constructor
def __init__(self, services):
self.services = services
```
**No explicit dependency graph.** Services access other services via `self.services.<attr>` (e.g. `self.services.interfaceDbComponent`, `self.services.extraction`). All services are loaded at construction time.
**PublicService proxy:** Services are wrapped in `PublicService(target, functionsOnly=True)` to expose only callable, non-private attributes. Reduces accidental access to internal state.
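A minimal sketch of such a proxy, assuming attribute interception via `__getattr__` (the real implementation in `modules/services/__init__.py` may differ):
```python
class PublicService:
    """Proxy exposing only public, callable attributes of the wrapped service."""

    def __init__(self, target, functionsOnly=True):
        self._target = target
        self._functionsOnly = functionsOnly

    def __getattr__(self, name):
        # Only reached for attributes not defined on the proxy itself
        if name.startswith("_"):
            raise AttributeError(f"'{name}' is private and not exposed")
        attr = getattr(self._target, name)
        if self._functionsOnly and not callable(attr):
            raise AttributeError(f"'{name}' is not callable and not exposed")
        return attr
```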
**BillingService:** Uses a separate factory `getService(currentUser, mandateId, featureInstanceId, featureCode)` and a module-level cache. Not integrated with the hub's constructor pattern.
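A hedged usage sketch of that factory (the signature is from above; `checkBalance(0.01)` and `balanceCheck.allowed` appear in the chatbot diff later in this commit):
```python
from modules.services.serviceBilling.mainServiceBilling import getService as getBillingService

# currentUser, mandateId, featureInstanceId assumed to be in scope
billingService = getBillingService(currentUser, mandateId, featureInstanceId, "chatbot")
balanceCheck = billingService.checkBalance(0.01)  # pre-flight balance probe
if not balanceCheck.allowed:
    raise RuntimeError("Insufficient balance")
```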
---
### 2.4 Main Entry Points and Usage Patterns
| Entry Point | Purpose |
|-------------|---------|
| `getInterface(user, workflow, mandateId, featureInstanceId)` | Returns a `Services` instance |
| `Services` | Central hub with all services and interfaces |
**Typical usage:**
```python
from modules.services import getInterface as getServices
services = getServices(user, workflow, mandateId=mandateId, featureInstanceId=featureInstanceId)
ai = services.ai
extraction = services.extraction
```
**Interfaces loaded at construction:**
- `interfaceDbApp`, `interfaceDbComponent`, `interfaceDbChat`, `rbac`
- Plus dynamically loaded `interfaceFeature*` from feature containers
---
### 2.5 Initialization and Bootstrapping
1. **No startup bootstrap** — services load on first `getInterface()` call.
2. **Construction flow:**
- `getInterface(user, ...)` → `Services(user, ...)`
- `Services.__init__`:
- Loads DB interfaces (`interfaceDbApp`, `interfaceDbComponent`, `interfaceDbChat`)
- Instantiates all shared services (sharepoint, ticket, chat, utils, security, streaming, ai, extraction, generation, web)
- Calls `_loadFeatureInterfaces()` — discovers `interfaceFeature*.py` in features
- Calls `_loadFeatureServices()` — discovers `service*/mainService*.py` in features, overrides hub attributes
3. **Feature services:** If a feature defines `serviceAi/mainServiceAi.py`, it overrides `services.ai`. Shared `serviceAi` is only used when no feature override exists.
---
## 3. Side-by-Side Comparison
| Aspect | Service Center | Legacy Services |
|--------|----------------|-----------------|
| **Location** | `modules/serviceCenter/` | `modules/services/` |
| **Entry point** | `getService(key, context, legacy_hub)` | `getInterface(user, ...)` → `Services` |
| **Constructor** | `(context, get_service)` | `(services)` — full hub |
| **Context** | `ServiceCenterContext` (user, mandate_id, feature_instance_id, workflow) | Full `Services` with interfaces |
| **Dependencies** | Declared in registry, resolved lazily via `get_service("key")` | Via `self.services.<attr>` |
| **Loading** | Lazy (on first `getService`), cached per context | Eager (all at construction) |
| **RBAC** | Per-service `objectKey` in registry, `can_access_service()` | Shared via hub `.rbac` |
| **Feature services** | Not applicable — features use `getService(key, ctx)` | Discovered via `_loadFeatureServices()` |
| **Pre-warm** | `preWarm()` in app lifespan | None |
| **Bootstrap** | `registerServiceObjects()` via `registerAllFeaturesInCatalog` | None |
---
## 4. Coexistence and Migration
- **Service center** can fall back to **legacy hub** when `legacy_hub` is passed to `getService`.
- **Chatbot** uses service center via `getChatbotServices()` and does not use the legacy hub.
- **Billing, routes, teamsbot, commcoach, etc.** still use `modules.services` (e.g. `getInterface`, `getService` from `serviceBilling`).
- **`ServiceCenterContext`** is used when calling `getService`. Features that pass `workflow=None` use a placeholder workflow for billing/featureCode.
- Migration plan: `docs/SERVICE_CENTER_MIGRATION_PLAN.md`.
---
## 5. Service Center Resolver Flow
```
getService("ai", ctx, legacy_hub)
→ resolve("ai", ctx, cache, resolving, legacy_hub)
→ Check cache (cache_key = user_mandate_feature_ai)
→ If cache hit: return cached instance
→ If miss:
→ _load_service_class("modules.serviceCenter.services.serviceAi.mainServiceAi", "AiService")
→ Resolve dependencies: chat, utils, extraction, billing (recursive resolve)
→ instance = AiService(ctx, get_service)
→ cache[cache_key] = instance
→ return instance
→ On ImportError/ModuleNotFoundError: _get_from_legacy(legacy_hub, "ai") if legacy_hub
```
---
## 6. Key Files Reference
| File | Purpose |
|------|---------|
| `serviceCenter/registry.py` | Service definitions, dependency graph, RBAC keys |
| `serviceCenter/resolver.py` | Resolution logic, caching, legacy fallback |
| `serviceCenter/context.py` | `ServiceCenterContext` dataclass |
| `serviceCenter/__init__.py` | `getService`, `preWarm`, `registerServiceObjects`, `can_access_service` |
| `services/__init__.py` | `Services` class, `getInterface`, `PublicService`, feature discovery |
| `system/registry.py` | `registerAllFeaturesInCatalog` (calls `registerServiceObjects`) |
| `app.py` | Lifespan: `preWarm()`, `registerAllFeaturesInCatalog()` |
| `features/chatbot/mainChatbot.py` | Example: `getChatbotServices()` using service center |

View file

@ -0,0 +1,217 @@
# Service Center Migration Plan
## Overview
This document describes a **step-by-step plan** to migrate from the old `modules/services` (Services hub) to the new `modules/serviceCenter`. The migration is **incremental**—one feature at a time—with UI-driven testing after each step.
**Recommended first feature: Chatbot** — it has a clear UI, limited service dependencies, and is already partially using the service center (AI, generation, billing).
---
## Architecture Summary
### Current State
| Component | Location | Notes |
|-----------|----------|-------|
| **Service Center** | `modules/serviceCenter/` | New: registry, resolver, context-based DI |
| **Services Hub** | `modules/services/` | Legacy: `getInterface()` → `Services` instance |
| **Chatbot** | `modules/features/chatbot/` | Uses `getServices()` → `.chat`, `.ai` |
### Service Center vs Legacy Services
| Aspect | Service Center | Legacy Services |
|--------|----------------|-----------------|
| **Constructor** | `(context: ServiceCenterContext, get_service)` | `(services: Services)` — receives hub |
| **Context** | Minimal: user, mandate_id, feature_instance_id, workflow | Full hub with all interfaces |
| **Dependencies** | Injected via `get_service("key")` | Via `self.services.<attr>` |
| **RBAC** | Per-service `objectKey` in registry | Shared via hub |
| **Pre-warm** | `preWarm()` at app startup | Loaded on first use |
### Services Already Using Service Center (in Services class)
The `Services` class in `modules/services/__init__.py` already uses `getService()` for:
- `messaging`
- `ai`
- `generation`
- `billing`
### Services Still Using Legacy Direct Imports
- `chat` ← **Target for Phase 1**
- `sharepoint`
- `ticket`
- `utils`
- `security`
- `streaming`
- `extraction`
- `web`
---
## Phase 1: Migrate Chatbot to Use Service Center for Chat
**Goal:** Switch the Chatbot feature to get the Chat service from Service Center instead of the legacy hub. This validates the full flow with minimal risk.
### Step 1.1: Switch Services Class to Use Service Center for Chat
**File:** `modules/services/__init__.py`
**Change:** Replace the direct ChatService import with `getService("chat", ...)`.
```python
# BEFORE (line ~126-127):
from .serviceChat.mainServiceChat import ChatService
self.chat = PublicService(ChatService(self))
# AFTER:
self.chat = PublicService(getService("chat", _ctx, legacy_hub=self))
```
The `_ctx` (ServiceCenterContext) is already created for messaging/ai/generation. Add `workflow=self.workflow` to the context if not already present (it should be; check the existing `_ctx` creation around lines 109-116).
**Verification:**
1. Ensure `ServiceCenterContext` includes `workflow` when Services has one (chatbot often passes `workflow=None` initially).
2. The service center ChatService gets `interfaceDbComponent` from `getInterface(context.user, mandateId=context.mandate_id)` — same as legacy. The chatbot calls `getFileInfo(fileId)` which only needs `interfaceDbComponent`, not workflow.
### Step 1.2: Ensure Service Center Context Has Workflow
**File:** `modules/services/__init__.py`
Verify the existing context creation:
```python
_ctx = ServiceCenterContext(
user=self.user,
mandate_id=self.mandateId,
feature_instance_id=self.featureInstanceId,
workflow=self.workflow,
)
```
If `workflow` is missing, add it. The ChatService uses `_context.workflow` for methods like `getChatDocumentsFromDocumentList`; for `getFileInfo` it is not needed.
### Step 1.3: Run Unit Tests
```powershell
cd c:\Users\IdaDittrich\Documents\01_Code\gateway
pytest tests/unit/serviceCenter/test_service_center_imports.py -v
python tests/scripts/smoke_test_service_center.py
```
### Step 1.4: Manual UI Test — Chatbot with File Upload
1. **Start the gateway:**
```powershell
cd c:\Users\IdaDittrich\Documents\01_Code\gateway
uvicorn app:app --reload --host 0.0.0.0 --port 8000
```
2. **Start the frontend** (if using frontend_nyla):
```powershell
cd c:\Users\IdaDittrich\Documents\01_Code\frontend_nyla
npm run dev
```
3. **Log in** as a user with access to the Chatbot feature.
4. **Open a Chatbot instance** (navigate to the chatbot feature, select or create an instance).
5. **Create a new conversation** — click "New conversation" or equivalent.
6. **Attach a file** — upload a PDF or document before sending.
7. **Send a message** — e.g. "Summarize this document."
8. **Verify:**
- No 500 errors in gateway logs
- File is processed (the chat service's `getFileInfo` is used when creating `ChatbotDocument`s)
- AI response streams back correctly (AI service already from service center)
### Step 1.5: Rollback if Needed
If something breaks, revert the change in `modules/services/__init__.py`:
```python
from .serviceChat.mainServiceChat import ChatService
self.chat = PublicService(ChatService(self))
```
---
## Phase 2 (Future): Migrate Extraction for Chatbot
The chatbot may use extraction when processing documents. After Phase 1 is stable:
1. Switch `Services` to use `getService("extraction", _ctx, legacy_hub=self)` instead of direct import.
2. Ensure `ExtractionService` in service center has the same interface as the legacy one.
3. Re-test chatbot with document-heavy prompts.
---
## Phase 3 (Future): Migrate Remaining Services
| Service | Used By | Priority |
|---------|---------|----------|
| utils | Chat, Extraction, AI, Web, Generation | High (core) |
| security | Sharepoint | Medium |
| streaming | Workflows, Chatbot SSE | Medium |
| sharepoint | Sharepoint workflows | Medium |
| ticket | Ticket system | Low |
| web | Web research workflows | Medium |
---
## Service Center Bootstrap (Already Done)
The app already:
- Calls `preWarm()` at startup (`app.py` lifespan)
- Has `registerServiceObjects()` available for RBAC catalog (call from bootstrap if needed)
### Optional: Register Service RBAC Objects
If you want service-level RBAC (e.g. `can_access_service()`), call during bootstrap:
```python
# In app.py lifespan or interfaceBootstrap
from modules.serviceCenter import registerServiceObjects
from modules.security.rbacCatalog import getCatalogService
catalogService = getCatalogService()
registerServiceObjects(catalogService)
```
---
## Testing Checklist (Chatbot Phase 1)
- [ ] Unit tests pass: `pytest tests/unit/serviceCenter/ -v`
- [ ] Smoke test passes: `python tests/scripts/smoke_test_service_center.py`
- [ ] Gateway starts without import errors
- [ ] Chatbot UI loads
- [ ] New conversation creates successfully
- [ ] Message without file sends and gets AI response
- [ ] Message with file attachment sends and gets AI response
- [ ] No errors in gateway logs during the above flows
---
## File Summary for Phase 1
| File | Action |
|------|--------|
| `modules/services/__init__.py` | Replace `ChatService` import with `getService("chat", _ctx, legacy_hub=self)` |
| (No other changes) | Service center ChatService and resolver already support legacy fallback |
---
## FAQ
**Q: Why start with Chat instead of Utils?**
A: Chat has a clear UI path (chatbot) and only a few call sites. Utils is used everywhere; migrating it later reduces risk.
**Q: What if `getService("chat", ctx)` fails?**
A: The hub calls `getService` with `legacy_hub=self`, so the resolver falls back to the legacy `Services.chat` if the service center module fails to load. You get graceful degradation.
**Q: Can I test without the frontend?**
A: Yes. Use the API directly, e.g. `POST /api/chatbot/{instanceId}/start/stream` with a valid `UserInputRequest` (with `listFileId` for file upload).
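A hedged sketch of such a direct test (endpoint and `listFileId` come from the answer above; the other payload fields, port, and auth header are assumptions):
```python
import requests

url = "http://localhost:8000/api/chatbot/<instanceId>/start/stream"
payload = {"message": "Summarize this document.", "listFileId": ["<fileId>"]}  # UserInputRequest shape assumed
headers = {"Authorization": "Bearer <token>"}

with requests.post(url, json=payload, headers=headers, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(line.decode())  # SSE events from the chatbot stream
```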

View file

@ -0,0 +1,92 @@
# Service Center vs Legacy Services Hub — Comparison & Assessment
## Executive Summary
The **Service Center** (`modules/serviceCenter`) is a superior architecture to the legacy **Services Hub** (`modules/services`), and building it was worthwhile. The main benefits are: **explicit dependency graph**, **lazy loading**, **per-service RBAC**, and **context-scoped resolution** without carrying the entire hub. The legacy hub remains valid for incremental migration and backward compatibility.
---
## 1. Architecture Comparison
| Aspect | Service Center | Legacy Services Hub |
|--------|----------------|---------------------|
| **Location** | `modules/serviceCenter/` | `modules/services/` |
| **Entry point** | `getService(key, context, legacy_hub)` | `getInterface(user, ...)` → `Services` |
| **Constructor** | `(context, get_service)` | `(services)` — full hub |
| **Dependencies** | Declared in registry, resolved lazily via `get_service("key")` | Via `self.services.<attr>` — all services always present |
| **Loading** | **Lazy** — only requested services + deps | **Eager** — everything at construction |
| **RBAC** | Per-service `objectKey`, `can_access_service()` | Shared via hub `.rbac` |
| **Caching** | Per-context cache (user + mandate + featureInstance) | No instance cache — new `Services` each call |
| **Feature override** | N/A — features use `getService` directly | Feature services override hub attributes |
| **Pre-warm** | `preWarm()` at app startup | None |
| **Structure** | Core vs importable split; explicit registry | Flat `serviceX/` dirs; discovery via glob |
---
## 2. Which Setup is Better?
**Service Center is better** for these reasons:
### 2.1 Explicit Dependency Graph
- Dependencies are declared in `registry.py` (e.g. `"ai": {"dependencies": ["chat", "utils", "extraction", "billing"]}`).
- Circular dependencies are detected and raise `RuntimeError`.
- Easier to reason about and refactor.
### 2.2 Lazy Loading & Resource Efficiency
- Only requested services (and their transitive deps) are loaded.
- A feature like chatbot needs `chat`, `ai`, `billing`, `streaming` — not `sharepoint`, `ticket`, `neutralization`, etc.
- Legacy hub loads **everything** on first `getInterface()`.
### 2.3 Context-Scoped Resolution
- Each request gets a `ServiceCenterContext` (user, mandate_id, feature_instance_id, workflow).
- Resolution is cached per context. Same user+mandate+feature → same instances.
- No need to pass or construct a full hub.
### 2.4 Per-Service RBAC
- Services have `objectKey` (e.g. `service.ai`, `service.extraction`).
- `can_access_service(user, rbac, service_key)` checks before resolving.
- Finer-grained control than a single hub-level RBAC.
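A hedged usage sketch, assuming `rbac` comes from `interfaceDbApp.rbac` as noted in the service center docs:
```python
from modules.serviceCenter import getService, can_access_service
from modules.serviceCenter.context import ServiceCenterContext

# user, rbac (e.g. from interfaceDbApp.rbac), mandateId assumed to be in scope
if not can_access_service(user, rbac, "extraction", mandate_id=mandateId):
    raise PermissionError("No view permission on service.extraction")
ctx = ServiceCenterContext(user=user, mandate_id=mandateId)
extraction = getService("extraction", ctx)
```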
### 2.5 Separation of Concerns
- **Core services** (utils, security, streaming): internal, no RBAC.
- **Importable services** (ai, billing, extraction, etc.): feature-facing, RBAC-protected.
- Clear distinction vs. flat structure in legacy.
### 2.6 Pre-warm for Cold Start
- `preWarm()` imports all service modules at startup.
- First request avoids import latency.
- Legacy has no equivalent.
---
## 3. When Legacy Still Makes Sense
- **Migration**: Features that haven't moved yet still use `getInterface()`.
- **Feature overrides**: Feature-specific services (e.g. `serviceAi/mainServiceAi.py` in a feature) that override hub attributes.
- **Backward compatibility**: `legacy_hub` fallback in Service Center allows gradual migration.
---
## 4. Did It Make Sense to Create the Service Center?
**Yes.** The legacy hub has inherent limitations:
1. **Monolithic hub** — every `getInterface()` constructs a full `Services` object with all services, interfaces, and feature discovery.
2. **Implicit dependencies** — services grab what they need via `self.services.<attr>`, leading to hidden coupling.
3. **No explicit RBAC per service** — access control is at the hub level.
4. **Eager loading** — every request pays for all services even when only a few are used.
Service Center addresses these while keeping a migration path via `legacy_hub` fallback. The Chatbot feature already uses it successfully.
---
## 5. Benchmark Script
Run the comparison script to measure runtime and memory:
```bash
# From gateway root
python tests/benchmarks/benchmark_service_center_vs_legacy.py
```
See `tests/benchmarks/benchmark_service_center_vs_legacy.py` for details on metrics and methodology.

View file

@ -34,7 +34,6 @@ from modules.features.chatbot.bridges.tools import (
create_tavily_search_tool,
create_send_streaming_message_tool,
)
from modules.services.serviceStreaming import ChatStreamingHelper
from modules.datamodels.datamodelUam import User
if TYPE_CHECKING:
@ -585,6 +584,7 @@ class Chatbot:
workflow_id: str = "default"
config: Optional["ChatbotConfig"] = None
_event_manager: Any = None
_chat_streaming_helper: Any = None # From service center streaming service
@classmethod
async def create(
@ -596,6 +596,7 @@ class Chatbot:
config: Optional["ChatbotConfig"] = None,
event_manager=None,
planner_model: Optional[AICenterChatModel] = None,
chat_streaming_helper=None,
) -> "Chatbot":
"""Factory method to create and configure a Chatbot instance.
@ -607,6 +608,7 @@ class Chatbot:
config: Optional chatbot configuration for dynamic tool enablement.
event_manager: Optional event manager for streaming (passed from route).
planner_model: Optional fast model for planner/routing (default: same as model).
chat_streaming_helper: ChatStreamingHelper from service center streaming service.
Returns:
A configured Chatbot instance.
@ -619,6 +621,7 @@ class Chatbot:
config=config,
_event_manager=event_manager,
planner_model=planner_model,
_chat_streaming_helper=chat_streaming_helper,
)
configured_tools = await instance._configure_tools()
instance._tools = configured_tools
@ -1244,10 +1247,11 @@ class Chatbot:
if etype == "on_chain_end" and _is_root(event):
output_obj = edata.get("output")
# Extract message list from the graph's final output
final_msgs = ChatStreamingHelper.extract_messages_from_output(
output_obj=output_obj
)
# Extract message list from the graph's final output (ChatStreamingHelper from service center)
helper = self._chat_streaming_helper
if not helper:
raise RuntimeError("ChatStreamingHelper required; pass chat_streaming_helper to Chatbot.create()")
final_msgs = helper.extract_messages_from_output(output_obj=output_obj)
# Normalize for the frontend (only user/assistant with text content)
# Exclude planner-only and SQL-path intermediate messages from chat display
@ -1255,9 +1259,9 @@ class Chatbot:
chat_history_payload: List[dict] = []
for m in final_msgs:
if isinstance(m, BaseMessage):
d = ChatStreamingHelper.message_to_dict(msg=m)
d = helper.message_to_dict(msg=m)
elif isinstance(m, dict):
d = ChatStreamingHelper.dict_message_to_dict(obj=m)
d = helper.dict_message_to_dict(obj=m)
else:
continue
if d.get("role") not in ("user", "assistant") or not d.get("content"):

View file

@ -48,12 +48,25 @@ RESOURCE_OBJECTS = [
},
]
# Service requirements for chatbot — resolved via service center
# Service requirements - services this feature needs from the service center
# Format: [{serviceKey, meta}]. Used by getChatbotServices() to resolve only needed services.
REQUIRED_SERVICES = [
{"serviceKey": "chat", "meta": {"usage": "File info, document handling"}},
{"serviceKey": "ai", "meta": {"usage": "AI calls, conversation name generation"}},
{"serviceKey": "billing", "meta": {"usage": "Usage tracking, balance checks"}},
{"serviceKey": "streaming", "meta": {"usage": "Event manager, ChatStreamingHelper"}},
{
"serviceKey": "chat",
"meta": {"usage": "File info, document handling"}
},
{
"serviceKey": "ai",
"meta": {"usage": "AI calls, conversation name generation"}
},
{
"serviceKey": "billing",
"meta": {"usage": "Usage tracking, balance checks"}
},
{
"serviceKey": "streaming",
"meta": {"usage": "Event manager, ChatStreamingHelper"}
},
]
# Template roles for this feature
@ -123,6 +136,108 @@ def getFeatureDefinition() -> Dict[str, Any]:
}
def getRequiredServiceKeys() -> List[str]:
"""Return list of service keys this feature requires."""
return [s["serviceKey"] for s in REQUIRED_SERVICES]
def getChatbotServices(
user,
mandateId: Optional[str] = None,
featureInstanceId: Optional[str] = None,
workflow=None,
) -> Any:
"""
Get a service hub for the chatbot feature using the service center.
Resolves only the services declared in REQUIRED_SERVICES.
Returns a hub-like object with: chat, ai, billing, streaming,
plus interfaceDbComponent, user, mandateId, featureInstanceId.
"""
from modules.serviceCenter import getService
from modules.serviceCenter.context import ServiceCenterContext
from modules.interfaces.interfaceDbManagement import getInterface as getComponentInterface
# Provide workflow or a placeholder so billing and other services get a featureCode
_workflow = workflow
if _workflow is None:
_workflow = type("_Placeholder", (), {"featureCode": FEATURE_CODE})()
ctx = ServiceCenterContext(
user=user,
mandate_id=mandateId,
feature_instance_id=featureInstanceId,
workflow=_workflow,
)
hub = _ChatbotServiceHub()
hub.user = user
hub.mandateId = mandateId
hub.featureInstanceId = featureInstanceId
hub.workflow = workflow
hub.interfaceDbComponent = getComponentInterface(user, mandateId=mandateId, featureInstanceId=featureInstanceId)
for spec in REQUIRED_SERVICES:
key = spec["serviceKey"]
try:
svc = getService(key, ctx, legacy_hub=None)
setattr(hub, key, svc)
except Exception as e:
logger.warning(f"Could not resolve service '{key}' for chatbot: {e}")
setattr(hub, key, None)
return hub
def getChatStreamingHelper():
"""
Get ChatStreamingHelper utility class (used by chatbot for message normalization).
Resolves via service center streaming service.
"""
from modules.serviceCenter import getService
from modules.serviceCenter.context import ServiceCenterContext
# Minimal context - the streaming service only needs it for the resolver
ctx = ServiceCenterContext(user=__get_placeholder_user(), mandate_id=None, feature_instance_id=None)
streaming = getService("streaming", ctx, legacy_hub=None)
return streaming.getChatStreamingHelper() if streaming else None
def __get_placeholder_user():
"""Placeholder user for contexts that only need service resolution (e.g. ChatStreamingHelper)."""
from modules.datamodels.datamodelUam import User
return User(id="system", email="system@placeholder", firstName="System", lastName="Placeholder")
def getEventManager(user, mandateId: Optional[str] = None, featureInstanceId: Optional[str] = None):
"""
Get the global event manager for SSE streaming (used by chatbot routes).
"""
from modules.serviceCenter import getService
from modules.serviceCenter.context import ServiceCenterContext
ctx = ServiceCenterContext(
user=user,
mandate_id=mandateId,
feature_instance_id=featureInstanceId,
)
streaming = getService("streaming", ctx, legacy_hub=None)
return streaming.getEventManager()
class _ChatbotServiceHub:
"""Lightweight hub exposing only services required by the chatbot feature."""
user = None
mandateId = None
featureInstanceId = None
workflow = None
interfaceDbComponent = None
chat = None
ai = None
billing = None
streaming = None
featureCode = "chatbot"
allowedProviders = None
def getUiObjects() -> List[Dict[str, Any]]:
"""Return UI objects for RBAC catalog registration."""
return UI_OBJECTS

View file

@ -31,7 +31,7 @@ from modules.features.chatbot.interfaceFeatureChatbot import ChatbotConversation
# Import chatbot feature
from modules.features.chatbot import chatProcess
from modules.services.serviceStreaming import get_event_manager
from modules.features.chatbot.mainChatbot import getEventManager
# Pre-warm AI connectors when this router loads (before first request).
# Ensures connectors are ready; avoids a 4-8 s delay on the first chatbot message.
@ -250,7 +250,7 @@ async def stream_chatbot_start(
# Validate instance access
mandateId = _validateInstanceAccess(instanceId, context)
event_manager = get_event_manager()
event_manager = getEventManager(context.user, mandateId=mandateId, featureInstanceId=instanceId)
try:
# Use workflowId from query parameter if provided, otherwise from request body
@ -462,7 +462,7 @@ async def stop_chatbot(
) -> ChatbotConversation:
"""Stops a running chatbot workflow."""
# Validate instance access
_validateInstanceAccess(instanceId, context)
mandateId = _validateInstanceAccess(instanceId, context)
try:
# Get chatbot interface with instance context
@ -489,7 +489,7 @@ async def stop_chatbot(
"lastActivity": getUtcTimestamp()
})
event_manager = get_event_manager()
event_manager = getEventManager(context.user, mandateId=mandateId, featureInstanceId=instanceId)
# Store log entry (createLog emits when event_manager is provided)
interfaceDbChat.createLog({
"id": f"log_{uuid.uuid4()}",

View file

@ -91,15 +91,17 @@ async def chatProcess(
ChatbotConversation instance
"""
try:
# Get services from service center (only chat, ai, billing, streaming — avoids ~90ms legacy hub)
# Get services from service center (only services declared in mainChatbot.REQUIRED_SERVICES)
services = getChatbotServices(currentUser, mandateId=mandateId, featureInstanceId=featureInstanceId)
services.featureCode = 'chatbot'
# Config and model warm-up run in a background task; returns the stream ~2-3 s faster for a more responsive feel
chatbot_config = None
# Load instance config and apply allowedProviders for AI calls (conversation name + main chat)
chatbot_config = await _load_chatbot_config(featureInstanceId)
if chatbot_config.model.allowedProviders:
services.allowedProviders = chatbot_config.model.allowedProviders
logger.info(f"Chatbot instance {featureInstanceId}: restricting to providers {chatbot_config.model.allowedProviders}")
# Reuse hub's interfaceDbChat (ChatObjects) - avoids duplicate DB init
interfaceDbChat = services.interfaceDbChat
from modules.features.chatbot.interfaceFeatureChatbot import getInterface as getChatbotInterface
interfaceDbChat = getChatbotInterface(currentUser, mandateId=mandateId, featureInstanceId=featureInstanceId)
# Create or load workflow (event_manager passed from route)
if workflowId:
@ -162,6 +164,10 @@ async def chatProcess(
# Create event queue for new workflow (for streaming)
event_manager.create_queue(workflow.id)
# Reload workflow to get current message count
workflow = interfaceDbChat.getWorkflow(workflow.id)
services.workflow = workflow # Required for chat service document resolution
# Process uploaded files and create ChatbotDocuments
user_documents = []
if userInput.listFileId and len(userInput.listFileId) > 0:
@ -1204,49 +1210,45 @@ def _preflight_billing_check(services, mandateId: str, featureInstanceId: Option
"""
Pre-flight billing check before starting chatbot AI processing.
Raises if mandate has insufficient balance or no providers allowed.
Uses services.billing from service center (REQUIRED_SERVICES).
Exception types from BillingService class (service center billing API).
"""
from modules.services.serviceBilling.mainServiceBilling import (
getService as getBillingService,
InsufficientBalanceException,
ProviderNotAllowedException,
BillingContextError,
)
user = services.user
featureCode = "chatbot"
from modules.serviceCenter.services.serviceBilling import BillingService
billingService = services.billing
if not billingService:
raise BillingService.BillingContextError("Billing service not available for chatbot")
try:
billingService = getBillingService(user, mandateId, featureInstanceId, featureCode)
balanceCheck = billingService.checkBalance(0.01)
if not balanceCheck.allowed:
raise InsufficientBalanceException(
raise BillingService.InsufficientBalanceException(
currentBalance=balanceCheck.currentBalance or 0.0,
requiredAmount=0.01,
message=f"Ungenuegendes Guthaben. Aktuell: CHF {balanceCheck.currentBalance:.2f}"
)
rbacAllowedProviders = billingService.getallowedProviders()
if not rbacAllowedProviders:
raise ProviderNotAllowedException(
raise BillingService.ProviderNotAllowedException(
provider="any",
message="Keine AI-Provider fuer Ihre Rolle freigegeben. Kontaktieren Sie Ihren Administrator."
)
except (InsufficientBalanceException, ProviderNotAllowedException):
except (BillingService.InsufficientBalanceException, BillingService.ProviderNotAllowedException):
raise
except Exception as e:
logger.error(f"Billing pre-flight failed: {e}")
raise BillingContextError(f"Billing check failed: {e}")
raise BillingService.BillingContextError(f"Billing check failed: {e}")
def _create_chatbot_billing_callback(services, workflow_id: str):
"""
Create billing callback for AICenterChatModel. Records each AI call to poweron_billing.
Uses services.billing from service center (REQUIRED_SERVICES).
"""
from modules.services.serviceBilling.mainServiceBilling import getService as getBillingService
from modules.datamodels.datamodelAi import AiCallResponse
user = services.user
mandateId = services.mandateId
featureInstanceId = getattr(services, "featureInstanceId", None)
featureCode = "chatbot"
billingService = getBillingService(user, mandateId, featureInstanceId, featureCode)
billingService = services.billing
if not billingService:
return lambda _: None # No-op callback if billing unavailable
def _billing_callback(response: AiCallResponse) -> None:
if not response or getattr(response, "errorCount", 0) > 0:
@ -1389,6 +1391,11 @@ async def _processChatbotMessageLangGraph(
)
# Create chatbot instance with config for dynamic tool configuration
chat_streaming_helper = None
if services.streaming:
chat_streaming_helper = services.streaming.getChatStreamingHelper()
if not chat_streaming_helper:
logger.warning("ChatStreamingHelper not available from streaming service; message normalization may fail")
chatbot = await Chatbot.create(
model=model,
memory=memory,
@ -1397,6 +1404,7 @@ async def _processChatbotMessageLangGraph(
config=config,
event_manager=event_manager,
planner_model=planner_model,
chat_streaming_helper=chat_streaming_helper,
)
# Emit synthetic status for real-time UI feedback

View file

@ -0,0 +1,136 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Service Center.
Central registry for core and importable services with per-feature resolution.
"""
import logging
from typing import Any, List, Optional
from modules.serviceCenter.context import ServiceCenterContext
from modules.serviceCenter.registry import (
CORE_SERVICES,
IMPORTABLE_SERVICES,
SERVICE_RBAC_OBJECTS,
)
from modules.serviceCenter.resolver import (
resolve,
get_resolution_cache,
clear_cache,
)
logger = logging.getLogger(__name__)
def getService(
key: str,
context: ServiceCenterContext,
legacy_hub: Optional[Any] = None,
) -> Any:
"""
Get a service instance by key for the given context.
Args:
key: Service key (e.g., "web", "extraction", "utils")
context: ServiceCenterContext with user, mandate_id, feature_instance_id, workflow
legacy_hub: Optional legacy Services instance for fallback when service not yet migrated
Returns:
Service instance
"""
cache = get_resolution_cache()
resolving = set()
return resolve(key, context, cache, resolving, legacy_hub=legacy_hub)
def preWarm(service_keys: Optional[List[str]] = None) -> None:
"""
Pre-load service modules at startup to avoid first-request latency.
Args:
service_keys: Optional list of keys to preload. If None, preloads all registered services.
"""
import importlib
keys = service_keys or list(CORE_SERVICES.keys()) + list(IMPORTABLE_SERVICES.keys())
for key in keys:
spec = CORE_SERVICES.get(key) or IMPORTABLE_SERVICES.get(key)
if not spec:
continue
try:
importlib.import_module(spec["module"])
logger.debug(f"Pre-warmed service module: {key}")
except (ImportError, ModuleNotFoundError) as e:
logger.debug(f"Could not pre-warm {key}: {e}")
def registerServiceObjects(catalogService) -> bool:
"""Register service RBAC objects in the catalog. Call at startup."""
try:
for obj in SERVICE_RBAC_OBJECTS:
catalogService.registerResourceObject(
featureCode="system",
objectKey=obj["objectKey"],
label=obj["label"],
meta=obj.get("meta"),
)
logger.info(f"Registered {len(SERVICE_RBAC_OBJECTS)} service RBAC objects")
return True
except Exception as e:
logger.error(f"Failed to register service RBAC objects: {e}")
return False
def can_access_service(
user,
rbac,
service_key: str,
mandate_id: Optional[str] = None,
feature_instance_id: Optional[str] = None,
allow_when_no_rbac: bool = True,
) -> bool:
"""
Check if user has permission to access the given service.
Args:
user: User object
rbac: RbacClass instance (e.g. from interfaceDbApp.rbac)
service_key: Service key (e.g., "web", "extraction")
mandate_id: Optional mandate context
feature_instance_id: Optional feature instance context
allow_when_no_rbac: If True, allow when rbac is None (migration/default)
Returns:
True if user has view permission on the service
"""
if not rbac:
return allow_when_no_rbac
if service_key not in IMPORTABLE_SERVICES:
return False
obj = IMPORTABLE_SERVICES[service_key]
object_key = obj.get("objectKey")
if not object_key:
return False
from modules.datamodels.datamodelRbac import AccessRuleContext
permissions = rbac.getUserPermissions(
user,
AccessRuleContext.RESOURCE,
object_key,
mandateId=mandate_id,
featureInstanceId=feature_instance_id,
)
return permissions.view if permissions else False
__all__ = [
"ServiceCenterContext",
"getService",
"preWarm",
"clear_cache",
"registerServiceObjects",
"can_access_service",
"SERVICE_RBAC_OBJECTS",
"CORE_SERVICES",
"IMPORTABLE_SERVICES",
]

View file

@ -0,0 +1,32 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Service Center Context.
Minimal context passed to services: user, mandate, feature instance, workflow.
"""
from dataclasses import dataclass
from typing import Any, Optional
from modules.datamodels.datamodelUam import User
@dataclass
class ServiceCenterContext:
"""Context for service resolution: user, mandate, feature instance, optional workflow."""
user: User
mandate_id: Optional[str] = None
feature_instance_id: Optional[str] = None
workflow_id: Optional[str] = None
workflow: Any = None
@property
def mandateId(self) -> Optional[str]:
"""Alias for mandate_id (backward compatibility)."""
return self.mandate_id
@property
def featureInstanceId(self) -> Optional[str]:
"""Alias for feature_instance_id (backward compatibility)."""
return self.feature_instance_id

View file

@ -0,0 +1,3 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Core services - internal building blocks, not requested by features."""

View file

@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Security core service."""
from .mainServiceSecurity import SecurityService
__all__ = ["SecurityService"]

View file

@ -0,0 +1,81 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Security service for token management operations.
Core service - not requested by features directly.
"""
import logging
from typing import Optional, Callable, Any
from modules.datamodels.datamodelSecurity import Token
from modules.auth import TokenManager
logger = logging.getLogger(__name__)
class SecurityService:
"""Security service providing token management operations."""
def __init__(self, context: Any, get_service: Callable[[str], Any]):
"""Initialize with service center context and resolver."""
self._context = context
self._get_service = get_service
self._tokenManager = TokenManager()
from modules.interfaces.interfaceDbApp import getInterface as getAppInterface
self._interfaceDbApp = getAppInterface(
context.user,
mandateId=context.mandate_id,
)
def getFreshToken(self, connectionId: str, secondsBeforeExpiry: int = 30 * 60) -> Optional[Token]:
"""Get a fresh token for a connection, refreshing when expiring soon."""
try:
token = self._interfaceDbApp.getConnectionToken(connectionId)
if not token:
return None
return self._tokenManager.ensureFreshToken(
token,
secondsBeforeExpiry=secondsBeforeExpiry,
saveCallback=lambda t: self._interfaceDbApp.saveConnectionToken(t)
)
except Exception as e:
logger.error(f"getFreshToken: Error fetching or refreshing token for connection {connectionId}: {e}")
return None
def refreshToken(self, oldToken: Token) -> Optional[Token]:
"""Refresh an expired token using the appropriate OAuth service."""
try:
return self._tokenManager.refreshToken(oldToken)
except Exception as e:
logger.error(f"refreshToken: Error refreshing token: {e}")
return None
def ensureFreshToken(self, token: Token, *, secondsBeforeExpiry: int = 30 * 60,
saveCallback: Optional[Callable[[Token], None]] = None) -> Optional[Token]:
"""Ensure a token is fresh; refresh if expiring within threshold."""
try:
return self._tokenManager.ensureFreshToken(
token,
secondsBeforeExpiry=secondsBeforeExpiry,
saveCallback=saveCallback
)
except Exception as e:
logger.error(f"ensureFreshToken: Error ensuring fresh token: {e}")
return None
def refreshMicrosoftToken(self, refreshToken: str, userId: str, oldToken: Token) -> Optional[Token]:
"""Refresh Microsoft OAuth token using refresh token."""
try:
return self._tokenManager.refreshMicrosoftToken(refreshToken, userId, oldToken)
except Exception as e:
logger.error(f"refreshMicrosoftToken: Error refreshing Microsoft token: {e}")
return None
def refreshGoogleToken(self, refreshToken: str, userId: str, oldToken: Token) -> Optional[Token]:
"""Refresh Google OAuth token using refresh token."""
try:
return self._tokenManager.refreshGoogleToken(refreshToken, userId, oldToken)
except Exception as e:
logger.error(f"refreshGoogleToken: Error refreshing Google token: {e}")
return None

View file

@ -0,0 +1,9 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Streaming core service for SSE event management."""
from .eventManager import EventManager, get_event_manager
from .helpers import ChatStreamingHelper
from .mainServiceStreaming import StreamingService
__all__ = ["EventManager", "get_event_manager", "ChatStreamingHelper", "StreamingService"]

View file

@ -0,0 +1,158 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Event manager for SSE streaming.
Manages event queues for Server-Sent Events (SSE) streaming across features.
"""
import logging
import asyncio
from typing import Dict, Optional, Any
logger = logging.getLogger(__name__)
class EventManager:
"""
Manages event queues for SSE streaming.
Each workflow has its own async queue for events.
"""
def __init__(self):
"""Initialize the event manager."""
self._queues: Dict[str, asyncio.Queue] = {}
self._cleanup_tasks: Dict[str, asyncio.Task] = {}
def create_queue(self, workflow_id: str) -> asyncio.Queue:
"""
Create an event queue for a workflow.
Args:
workflow_id: Workflow ID
Returns:
Async queue for events
"""
if workflow_id not in self._queues:
self._queues[workflow_id] = asyncio.Queue()
logger.debug(f"Created event queue for workflow {workflow_id}")
return self._queues[workflow_id]
def get_queue(self, workflow_id: str) -> Optional[asyncio.Queue]:
"""
Get the event queue for a workflow.
Args:
workflow_id: Workflow ID
Returns:
Async queue if exists, None otherwise
"""
return self._queues.get(workflow_id)
def has_queue(self, workflow_id: str) -> bool:
"""
Check if a queue exists for a workflow.
Args:
workflow_id: Workflow ID
Returns:
True if queue exists, False otherwise
"""
return workflow_id in self._queues
async def emit_event(
self,
context_id: str,
event_type: str,
data: Dict[str, Any],
event_category: str = "chat",
message: Optional[str] = None,
step: Optional[str] = None
) -> None:
"""
Emit an event to the queue for a workflow.
Args:
context_id: Workflow ID (context)
event_type: Type of event (e.g., "chatdata", "complete", "error")
data: Event data dictionary
event_category: Category of event (e.g., "chat", "workflow")
message: Optional message string
step: Optional step identifier
"""
queue = self._queues.get(context_id)
if not queue:
# No queue registered: normal for background workflows without an active SSE listener
return
event = {
"type": event_type,
"data": data,
"category": event_category,
"message": message,
"step": step
}
try:
await queue.put(event)
logger.debug(f"Emitted {event_type} event for workflow {context_id}")
except Exception as e:
logger.error(f"Error emitting event for workflow {context_id}: {e}", exc_info=True)
async def cleanup(self, workflow_id: str, delay: float = 60.0) -> None:
"""
Schedule cleanup of a queue after a delay.
Args:
workflow_id: Workflow ID
delay: Delay in seconds before cleanup
"""
# Cancel existing cleanup task if any
if workflow_id in self._cleanup_tasks:
self._cleanup_tasks[workflow_id].cancel()
async def _cleanup():
try:
await asyncio.sleep(delay)
if workflow_id in self._queues:
# Drain remaining events
queue = self._queues[workflow_id]
while not queue.empty():
try:
queue.get_nowait()
except asyncio.QueueEmpty:
break
# Remove queue
del self._queues[workflow_id]
logger.info(f"Cleaned up event queue for workflow {workflow_id}")
except asyncio.CancelledError:
logger.debug(f"Cleanup cancelled for workflow {workflow_id}")
except Exception as e:
logger.error(f"Error during cleanup for workflow {workflow_id}: {e}", exc_info=True)
finally:
if workflow_id in self._cleanup_tasks:
del self._cleanup_tasks[workflow_id]
# Schedule cleanup
task = asyncio.create_task(_cleanup())
self._cleanup_tasks[workflow_id] = task
# Global event manager instance
_event_manager: Optional[EventManager] = None
def get_event_manager() -> EventManager:
"""
Get the global event manager instance.
Returns:
EventManager instance
"""
global _event_manager
if _event_manager is None:
_event_manager = EventManager()
return _event_manager
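# --- Hedged usage sketch (illustration only, not part of the committed module) ---
# A producer creates a queue and emits an event; an SSE consumer awaits it.
if __name__ == "__main__":
    async def _demo() -> None:
        em = get_event_manager()
        queue = em.create_queue("wf-demo")
        await em.emit_event("wf-demo", "chatdata", {"delta": "Hello"})
        event = await queue.get()
        print(event["type"], event["data"])  # chatdata {'delta': 'Hello'}

    asyncio.run(_demo())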

View file

@ -0,0 +1,242 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Streaming helper utilities for chat message processing and normalization."""
from __future__ import annotations
from typing import Any, Dict, List, Literal, Mapping, Optional
from langchain_core.messages import (
AIMessage,
BaseMessage,
HumanMessage,
SystemMessage,
ToolMessage,
)
Role = Literal["user", "assistant", "system", "tool"]
class ChatStreamingHelper:
"""Pure helper methods for streaming and message normalization.
This class provides static utility methods for converting between different
message formats, extracting content, and normalizing message structures
for streaming chat applications.
"""
@staticmethod
def role_from_message(*, msg: BaseMessage) -> Role:
"""Extract the role from a BaseMessage instance.
Args:
msg: The BaseMessage instance to extract the role from.
Returns:
The role as a string literal: "user", "assistant", "system", or "tool".
Defaults to "assistant" if the message type is not recognized.
Examples:
>>> from langchain_core.messages import HumanMessage
>>> msg = HumanMessage(content="Hello")
>>> ChatStreamingHelper.role_from_message(msg=msg)
'user'
"""
if isinstance(msg, HumanMessage):
return "user"
if isinstance(msg, AIMessage):
return "assistant"
if isinstance(msg, SystemMessage):
return "system"
if isinstance(msg, ToolMessage):
return "tool"
return getattr(msg, "role", "assistant")
@staticmethod
def flatten_content(*, content: Any) -> str:
"""Convert complex content structures to plain text.
This method handles various content formats including strings, lists of
content parts, and dictionaries with text fields. It's designed to
normalize content from different message sources into a consistent
plain text format.
Args:
content: The content to flatten. Can be:
- str: Returned as-is after stripping whitespace
- list: Each item processed and joined with newlines
- dict: Text extracted from "text" or "content" fields
- None: Returns empty string
- Any other type: Converted to string
Returns:
The flattened content as a plain text string with whitespace stripped.
Examples:
>>> content = [{"type": "text", "text": "Hello"}, {"type": "text", "text": "world"}]
>>> ChatStreamingHelper.flatten_content(content=content)
'Hello\nworld'
>>> content = {"text": "Simple message"}
>>> ChatStreamingHelper.flatten_content(content=content)
'Simple message'
"""
if content is None:
return ""
if isinstance(content, str):
return content.strip()
if isinstance(content, list):
parts: List[str] = []
for part in content:
if isinstance(part, dict):
if "text" in part and isinstance(part["text"], str):
parts.append(part["text"])
elif part.get("type") == "text" and isinstance(
part.get("text"), str
):
parts.append(part["text"])
elif "content" in part and isinstance(part["content"], str):
parts.append(part["content"])
else:
# Fallback for unknown dictionary structures
val = part.get("value")
if isinstance(val, str):
parts.append(val)
else:
parts.append(str(part))
return "\n".join(p.strip() for p in parts if p is not None)
if isinstance(content, dict):
if "text" in content and isinstance(content["text"], str):
return content["text"].strip()
if "content" in content and isinstance(content["content"], str):
return content["content"].strip()
return str(content).strip()
@staticmethod
def message_to_dict(*, msg: BaseMessage) -> Dict[str, Any]:
"""Convert a BaseMessage instance to a dictionary for streaming output.
This method normalizes BaseMessage instances into a consistent dictionary
format suitable for JSON serialization and streaming to clients.
Args:
msg: The BaseMessage instance to convert.
Returns:
A dictionary containing:
- "role": The message role (user, assistant, system, tool)
- "content": The flattened message content as plain text
- "tool_calls": Tool calls if present (optional)
- "name": Message name if present (optional)
Examples:
>>> from langchain_core.messages import HumanMessage
>>> msg = HumanMessage(content="Hello there")
>>> result = ChatStreamingHelper.message_to_dict(msg=msg)
>>> result["role"]
'user'
>>> result["content"]
'Hello there'
"""
payload: Dict[str, Any] = {
"role": ChatStreamingHelper.role_from_message(msg=msg),
"content": ChatStreamingHelper.flatten_content(
content=getattr(msg, "content", "")
),
}
tool_calls = getattr(msg, "tool_calls", None)
if tool_calls:
payload["tool_calls"] = tool_calls
name = getattr(msg, "name", None)
if name:
payload["name"] = name
return payload
@staticmethod
def dict_message_to_dict(*, obj: Mapping[str, Any]) -> Dict[str, Any]:
"""Convert a dictionary-shaped message to a normalized dictionary.
This method handles messages that come from serialized state and are
represented as dictionaries rather than BaseMessage instances. It
normalizes various dictionary formats into a consistent structure.
Args:
obj: The dictionary-shaped message to convert. Expected to contain
fields like "role", "type", "content", "text", etc.
Returns:
A normalized dictionary containing:
- "role": The message role (user, assistant, system, tool)
- "content": The flattened message content as plain text
- "tool_calls": Tool calls if present (optional)
- "name": Message name if present (optional)
Examples:
>>> obj = {"type": "human", "content": "Hello"}
>>> result = ChatStreamingHelper.dict_message_to_dict(obj=obj)
>>> result["role"]
'user'
>>> result["content"]
'Hello'
"""
role: Optional[str] = obj.get("role")
if not role:
# Handle alternative type field mappings
typ = obj.get("type")
if typ in ("human", "user"):
role = "user"
elif typ in ("ai", "assistant"):
role = "assistant"
elif typ in ("system",):
role = "system"
elif typ in ("tool", "function"):
role = "tool"
content = obj.get("content")
if content is None and "text" in obj:
content = obj["text"]
out: Dict[str, Any] = {
"role": role or "assistant",
"content": ChatStreamingHelper.flatten_content(content=content),
}
if "tool_calls" in obj:
out["tool_calls"] = obj["tool_calls"]
if obj.get("name"):
out["name"] = obj["name"]
return out
@staticmethod
def extract_messages_from_output(*, output_obj: Any) -> List[Any]:
"""Extract messages from LangGraph output objects.
This method handles various output formats from LangGraph execution,
extracting the messages list from different possible structures.
Args:
output_obj: The output object from LangGraph execution. Can be:
- An object with a "messages" attribute
- A dictionary with a "messages" key
- Any other object (returns empty list)
Returns:
A list of extracted messages, or an empty list if no messages
are found or if the output object is None.
Examples:
>>> output = {"messages": [{"role": "user", "content": "Hello"}]}
>>> messages = ChatStreamingHelper.extract_messages_from_output(output_obj=output)
>>> len(messages)
1
"""
if output_obj is None:
return []
# Try to parse dicts first
if isinstance(output_obj, dict):
msgs = output_obj.get("messages")
return msgs if isinstance(msgs, list) else []
# Then try to get messages attribute
msgs = getattr(output_obj, "messages", None)
return msgs if isinstance(msgs, list) else []
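
Taken together, these helpers normalize heterogeneous LangGraph output (live BaseMessage instances and dict-shaped messages from serialized state) into one streamable shape. A minimal sketch of the combined flow, assuming `ChatStreamingHelper` is imported from this module; the sample messages are illustrative:

```python
from langchain_core.messages import HumanMessage

output = {"messages": [HumanMessage(content="Hi"), {"type": "ai", "text": "Hello!"}]}
normalized = []
for m in ChatStreamingHelper.extract_messages_from_output(output_obj=output):
    if isinstance(m, dict):
        # Dict-shaped message from serialized state
        normalized.append(ChatStreamingHelper.dict_message_to_dict(obj=m))
    else:
        # Live BaseMessage instance
        normalized.append(ChatStreamingHelper.message_to_dict(msg=m))
# normalized == [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]
```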

View file

@@ -0,0 +1,31 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Streaming service for SSE event management.
Core service - not requested by features directly.
"""
import logging
from typing import Any, Callable
from modules.serviceCenter.core.serviceStreaming.eventManager import EventManager, get_event_manager
from modules.serviceCenter.core.serviceStreaming.helpers import ChatStreamingHelper
logger = logging.getLogger(__name__)
class StreamingService:
"""Streaming service providing access to SSE event infrastructure."""
def __init__(self, context: Any, get_service: Callable[[str], Any]):
"""Initialize with service center context and resolver."""
self._context = context
self._get_service = get_service
def getEventManager(self) -> EventManager:
"""Get the global event manager instance for SSE streaming."""
return get_event_manager()
def getChatStreamingHelper(self):
"""Get ChatStreamingHelper utility for message normalization (no legacy import at call site)."""
return ChatStreamingHelper
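
A minimal usage sketch, assuming `get_service` is the resolver callable injected by the service center (per the constructor pattern):

```python
streaming = get_service("streaming")          # StreamingService instance
event_manager = streaming.getEventManager()   # global SSE event manager
Helper = streaming.getChatStreamingHelper()   # ChatStreamingHelper class (static methods)
text = Helper.flatten_content(content={"type": "text", "text": "chunk"})  # -> "chunk"
```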

View file

@@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Utils core service."""
from .mainServiceUtils import UtilsService
__all__ = ["UtilsService"]

View file

@@ -0,0 +1,185 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Utility service for common operations across the gateway.
Provides centralized access to configuration, events, and other utilities.
Core service - not requested by features directly.
"""
import logging
from typing import Any, Optional, Dict, Callable, List
from modules.shared.configuration import APP_CONFIG
from modules.shared.eventManagement import eventManager
from modules.shared.timeUtils import getUtcTimestamp
from modules.shared import jsonUtils
logger = logging.getLogger(__name__)
class UtilsService:
"""Utility service providing common operations."""
def __init__(self, context, get_service: Callable[[str], Any]):
"""Initialize with service center context and resolver."""
self._context = context
self._get_service = get_service
# ===== Event handling =====
def eventRegisterCron(self, job_id: str, func: Callable, cron_kwargs: Dict[str, Any],
replace_existing: bool = True, coalesce: bool = True,
max_instances: int = 1, misfire_grace_time: int = 1800):
"""Register a cron job with the event manager."""
try:
eventManager.registerCron(
jobId=job_id,
func=func,
cronKwargs=cron_kwargs,
replaceExisting=replace_existing,
coalesce=coalesce,
maxInstances=max_instances,
misfireGraceTime=misfire_grace_time
)
logger.info(f"Registered cron job '{job_id}' with schedule: {cron_kwargs}")
except Exception as e:
logger.error(f"Error registering cron job '{job_id}': {str(e)}")
def eventRegisterInterval(self, job_id: str, func: Callable, seconds: Optional[int] = None,
minutes: Optional[int] = None, hours: Optional[int] = None,
replace_existing: bool = True, coalesce: bool = True,
max_instances: int = 1, misfire_grace_time: int = 1800):
"""Register an interval job with the event manager."""
try:
eventManager.registerInterval(
jobId=job_id,
func=func,
seconds=seconds,
minutes=minutes,
hours=hours,
replaceExisting=replace_existing,
coalesce=coalesce,
maxInstances=max_instances,
misfireGraceTime=misfire_grace_time
)
logger.info(f"Registered interval job '{job_id}' (h={hours}, m={minutes}, s={seconds})")
except Exception as e:
logger.error(f"Error registering interval job '{job_id}': {str(e)}")
def eventRemove(self, job_id: str):
"""Remove a scheduled job from the event manager."""
try:
eventManager.remove(job_id)
logger.info(f"Removed job '{job_id}'")
except Exception as e:
logger.error(f"Error removing job '{job_id}': {str(e)}")
def configGet(self, key: str, default: Any = None, user_id: str = "system") -> Any:
"""Get a configuration value with optional default."""
try:
return APP_CONFIG.get(key, default, user_id)
except Exception as e:
logger.error(f"Error getting config '{key}': {str(e)}")
return default
def timestampGetUtc(self) -> float:
"""Get current UTC timestamp."""
try:
return getUtcTimestamp()
except Exception as e:
logger.error(f"Error getting UTC timestamp: {str(e)}")
return 0.0
# ===== Debug Tools =====
def writeDebugFile(self, content: str, fileType: str, documents: Optional[List] = None) -> None:
"""Wrapper to write debug files via shared debugLogger."""
try:
from modules.shared.debugLogger import writeDebugFile as _writeDebugFile
_writeDebugFile(content, fileType, documents)
except Exception:
pass
def debugLogToFile(self, message: str, context: str = "DEBUG"):
"""Wrapper to log debug messages via shared debugLogger."""
try:
from modules.shared.debugLogger import debugLogToFile as _debugLogToFile
_debugLogToFile(message, context)
except Exception:
pass
def storeDebugMessageAndDocuments(self, message, currentUser, mandateId=None, featureInstanceId=None):
"""Wrapper to store debug messages and documents via interfaceDbChat."""
try:
from modules.interfaces.interfaceDbChat import storeDebugMessageAndDocuments as _storeDebugMessageAndDocuments
_storeDebugMessageAndDocuments(message, currentUser, mandateId=mandateId, featureInstanceId=featureInstanceId)
except Exception:
pass
def writeDebugArtifact(self, fileName: str, obj: Any):
"""Backward-compatible wrapper that now writes via writeDebugFile."""
try:
import json
if isinstance(obj, (dict, list)):
content = json.dumps(obj, ensure_ascii=False, indent=2)
else:
content = str(obj)
from modules.shared.debugLogger import writeDebugFile as _writeDebugFile
_writeDebugFile(content, fileName)
except Exception:
pass
# ===== Prompt sanitization =====
def sanitizePromptContent(self, content: str, contentType: str = "text") -> str:
"""Centralized prompt content sanitization."""
if not content:
return ""
try:
import re
content_str = str(content)
sanitized = re.sub(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]', '', content_str)
if contentType == "userinput":
sanitized = sanitized.replace('{', '{{').replace('}', '}}')
sanitized = sanitized.replace('"', '\\"').replace("'", "\\'")
return f"'{sanitized}'"
elif contentType == "json":
sanitized = sanitized.replace('\\', '\\\\').replace('"', '\\"')
sanitized = sanitized.replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t')
elif contentType == "document":
sanitized = sanitized.replace('\\', '\\\\').replace('"', '\\"').replace("'", "\\'")
sanitized = sanitized.replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t')
else:
sanitized = sanitized.replace('\\', '\\\\').replace('"', '\\"').replace("'", "\\'")
sanitized = sanitized.replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t')
return sanitized
except Exception as e:
logger.error(f"Error sanitizing prompt content: {str(e)}")
return "[ERROR: Content could not be safely sanitized]"
# ===== JSON utility wrappers =====
def jsonStripCodeFences(self, text: str) -> str:
return jsonUtils.stripCodeFences(text)
def jsonExtractFirstBalanced(self, text: str) -> str:
return jsonUtils.extractFirstBalancedJson(text)
def jsonNormalizeText(self, text: str) -> str:
return jsonUtils.normalizeJsonText(text)
def jsonExtractString(self, text: str) -> str:
return jsonUtils.extractJsonString(text)
def jsonTryParse(self, text) -> tuple:
return jsonUtils.tryParseJson(text)
# ===== Enum utility functions =====
def mapToEnum(self, enum_class, value_str, default_value):
"""Map string value to enum."""
if not value_str:
return default_value
for enum_item in enum_class:
if enum_item.value.lower() == value_str.lower():
return enum_item
return default_value
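
A short sketch of typical `UtilsService` calls; the enum and the config key are illustrative placeholders, not part of the gateway:

```python
from enum import Enum

class Mode(Enum):
    FAST = "fast"
    FULL = "full"

utils = get_service("utils")                         # resolved via the service center
timeout = utils.configGet("extraction.timeout", 30)  # assumed key; returns 30 if unset
mode = utils.mapToEnum(Mode, "Fast", Mode.FULL)      # case-insensitive match -> Mode.FAST
safe = utils.sanitizePromptContent('He said "hi"', contentType="userinput")
# "userinput" output is wrapped in single quotes with braces and quotes escaped
```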

View file

@@ -0,0 +1,108 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Service Center Registry.
Service definitions, dependency graph, and RBAC object keys.
"""
from typing import Dict, List, Any
# Core services: internal building blocks, no RBAC, never requested by features
CORE_SERVICES: Dict[str, Dict[str, Any]] = {
"utils": {
"module": "modules.serviceCenter.core.serviceUtils.mainServiceUtils",
"class": "UtilsService",
"dependencies": [],
},
"security": {
"module": "modules.serviceCenter.core.serviceSecurity.mainServiceSecurity",
"class": "SecurityService",
"dependencies": [],
},
"streaming": {
"module": "modules.serviceCenter.core.serviceStreaming.mainServiceStreaming",
"class": "StreamingService",
"dependencies": [],
},
}
# Importable services: feature-facing, RBAC-protected
IMPORTABLE_SERVICES: Dict[str, Dict[str, Any]] = {
"ticket": {
"module": "modules.serviceCenter.services.serviceTicket.mainServiceTicket",
"class": "TicketService",
"dependencies": [],
"objectKey": "service.ticket",
"label": {"en": "Ticket System", "de": "Ticket-System", "fr": "Système de tickets"},
},
"messaging": {
"module": "modules.serviceCenter.services.serviceMessaging.mainServiceMessaging",
"class": "MessagingService",
"dependencies": [],
"objectKey": "service.messaging",
"label": {"en": "Messaging", "de": "Nachrichten", "fr": "Messagerie"},
},
"billing": {
"module": "modules.serviceCenter.services.serviceBilling.mainServiceBilling",
"class": "BillingService",
"dependencies": [],
"objectKey": "service.billing",
"label": {"en": "Billing", "de": "Abrechnung", "fr": "Facturation"},
},
"sharepoint": {
"module": "modules.serviceCenter.services.serviceSharepoint.mainServiceSharepoint",
"class": "SharepointService",
"dependencies": ["security"],
"objectKey": "service.sharepoint",
"label": {"en": "SharePoint", "de": "SharePoint", "fr": "SharePoint"},
},
"chat": {
"module": "modules.serviceCenter.services.serviceChat.mainServiceChat",
"class": "ChatService",
"dependencies": ["utils"],
"objectKey": "service.chat",
"label": {"en": "Chat", "de": "Chat", "fr": "Chat"},
},
"extraction": {
"module": "modules.serviceCenter.services.serviceExtraction.mainServiceExtraction",
"class": "ExtractionService",
"dependencies": ["chat", "utils"],
"objectKey": "service.extraction",
"label": {"en": "Extraction", "de": "Extraktion", "fr": "Extraction"},
},
"generation": {
"module": "modules.serviceCenter.services.serviceGeneration.mainServiceGeneration",
"class": "GenerationService",
"dependencies": ["utils", "chat"],
"objectKey": "service.generation",
"label": {"en": "Generation", "de": "Generierung", "fr": "Génération"},
},
"ai": {
"module": "modules.serviceCenter.services.serviceAi.mainServiceAi",
"class": "AiService",
"dependencies": ["chat", "utils", "extraction", "billing"],
"objectKey": "service.ai",
"label": {"en": "AI", "de": "KI", "fr": "IA"},
},
"web": {
"module": "modules.serviceCenter.services.serviceWeb.mainServiceWeb",
"class": "WebService",
"dependencies": ["ai", "chat", "utils"],
"objectKey": "service.web",
"label": {"en": "Web Research", "de": "Web-Recherche", "fr": "Recherche Web"},
},
"neutralization": {
"module": "modules.serviceCenter.services.serviceNeutralization.mainServiceNeutralization",
"class": "NeutralizationService",
"dependencies": ["extraction", "generation"],
"objectKey": "service.neutralization",
"label": {"en": "Neutralization", "de": "Neutralisierung", "fr": "Neutralisation"},
},
}
# RBAC objects for service-level access control (for catalog registration)
SERVICE_RBAC_OBJECTS: List[Dict[str, Any]] = [
{"objectKey": s["objectKey"], "label": s["label"]}
for s in IMPORTABLE_SERVICES.values()
if "objectKey" in s
]
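
Because the registry is the single source of truth and there is no dynamic discovery, a typo in a `dependencies` list only surfaces at resolve time. A minimal sanity-check sketch (not part of `registry.py`) that asserts every dependency exists and the graph is acyclic:

```python
ALL_SERVICES: Dict[str, Dict[str, Any]] = {**CORE_SERVICES, **IMPORTABLE_SERVICES}

def check_registry() -> None:
    """Assert that all declared dependencies exist and form an acyclic graph."""
    for key, spec in ALL_SERVICES.items():
        for dep in spec.get("dependencies", []):
            assert dep in ALL_SERVICES, f"'{key}' depends on unknown service '{dep}'"

    def visit(key: str, stack: tuple) -> None:
        assert key not in stack, f"dependency cycle: {' -> '.join(stack + (key,))}"
        for dep in ALL_SERVICES[key].get("dependencies", []):
            visit(dep, stack + (key,))

    for key in ALL_SERVICES:
        visit(key, ())
```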

View file

@@ -0,0 +1,170 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Service Center Resolver.
Resolution logic, dependency injection, and optional legacy fallback.
"""
import importlib
import logging
from threading import Lock
from typing import Any, Callable, Dict, Optional, Set
from modules.serviceCenter.context import ServiceCenterContext
from modules.serviceCenter.registry import CORE_SERVICES, IMPORTABLE_SERVICES
logger = logging.getLogger(__name__)
# Type for get_service callable passed to services
GetServiceFunc = Callable[[str], Any]
def _make_context_id(ctx: ServiceCenterContext) -> str:
"""Create a stable cache key from context."""
return f"{id(ctx.user)}_{ctx.mandate_id or ''}_{ctx.feature_instance_id or ''}"
def _load_service_class(module_path: str, class_name: str):
"""Load service class from module."""
module = importlib.import_module(module_path)
return getattr(module, class_name)
def _create_legacy_hub(ctx: ServiceCenterContext) -> Any:
"""Create legacy Services instance for fallback when service not yet migrated."""
from modules.services import getInterface
return getInterface(
ctx.user,
workflow=ctx.workflow,
mandateId=ctx.mandate_id,
featureInstanceId=ctx.feature_instance_id,
)
def _get_from_legacy(legacy_hub: Any, key: str) -> Any:
"""Map service key to legacy hub attribute (for fallback when service center module fails)."""
key_to_attr = {
"utils": "utils",
"security": "security",
"streaming": "streaming",
"ticket": "ticket",
"messaging": "messaging",
"billing": "billing",
"sharepoint": "sharepoint",
"chat": "chat",
"extraction": "extraction",
"generation": "generation",
"ai": "ai",
"web": "web",
"neutralization": "neutralization",
}
attr = key_to_attr.get(key)
if attr and hasattr(legacy_hub, attr):
return getattr(legacy_hub, attr)
return None
def resolve(
key: str,
context: ServiceCenterContext,
cache: Dict[str, Any],
resolving: Set[str],
legacy_hub: Optional[Any] = None,
) -> Any:
"""
Resolve a service by key. Uses cache, resolves dependencies recursively.
Falls back to legacy_hub if service module cannot be loaded.
"""
cache_key = f"{_make_context_id(context)}_{key}"
if cache_key in cache:
return cache[cache_key]
if key in resolving:
raise RuntimeError(f"Circular dependency detected for service: {key}")
def get_service(dep_key: str) -> Any:
return resolve(dep_key, context, cache, resolving, legacy_hub)
# Try core first
if key in CORE_SERVICES:
spec = CORE_SERVICES[key]
try:
cls = _load_service_class(spec["module"], spec["class"])
resolving.add(key)
try:
for dep in spec.get("dependencies", []):
get_service(dep)
finally:
resolving.discard(key)
instance = cls(context, get_service)
cache[cache_key] = instance
return instance
except (ImportError, ModuleNotFoundError, AttributeError) as e:
logger.debug(f"Could not load core service '{key}' from service center: {e}")
if legacy_hub:
fallback = _get_from_legacy(legacy_hub, key)
if fallback is not None:
cache[cache_key] = fallback
return fallback
raise
# Try importable
if key in IMPORTABLE_SERVICES:
spec = IMPORTABLE_SERVICES[key]
try:
cls = _load_service_class(spec["module"], spec["class"])
resolving.add(key)
try:
for dep in spec.get("dependencies", []):
get_service(dep)
finally:
resolving.discard(key)
instance = cls(context, get_service)
cache[cache_key] = instance
return instance
except (ImportError, ModuleNotFoundError, AttributeError) as e:
logger.debug(f"Could not load importable service '{key}' from service center: {e}")
if legacy_hub:
fallback = _get_from_legacy(legacy_hub, key)
if fallback is not None:
cache[cache_key] = fallback
return fallback
raise
if legacy_hub:
fallback = _get_from_legacy(legacy_hub, key)
if fallback is not None:
cache[cache_key] = fallback
return fallback
raise KeyError(f"Unknown service: {key}")
# Module-level cache for service instances (per context)
_resolution_cache: Dict[str, Any] = {}
_cache_lock = Lock()
def get_resolution_cache() -> Dict[str, Any]:
    """Get the module-level resolution cache (for preWarm/clear)."""
    return _resolution_cache
def clear_cache() -> None:
    """Clear the resolution cache."""
    # threading.Lock is always available; the previous _DummyLock fallback was dead code
    with _cache_lock:
        _resolution_cache.clear()
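
A minimal sketch of how a caller drives `resolve()`; `current_user` is a placeholder for a real user object, and the `ServiceCenterContext` fields follow `context.py`:

```python
from modules.serviceCenter.context import ServiceCenterContext

ctx = ServiceCenterContext(
    user=current_user, mandate_id="m-1",
    feature_instance_id=None, workflow=None,
)
cache: Dict[str, Any] = {}
web = resolve("web", ctx, cache, resolving=set())
# "web" transitively resolves "ai", "chat", "utils", ...; every instance lands in `cache`
```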

View file

@@ -0,0 +1,3 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Importable services - feature-facing, RBAC-protected."""

View file

@@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""AI service."""
from .mainServiceAi import AiService
__all__ = ["AiService"]

File diff suppressed because it is too large

View file

@@ -0,0 +1,665 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
AI Call Looping Module
Handles AI calls with looping and repair logic, including:
- Looping with JSON repair and continuation
- KPI definition and tracking
- Progress tracking and iteration management
FLOW LOGIC
VARIABLES:
- jsonBase: str (merged JSON so far, starts empty)
- lastValidCompletePart: str (fallback for failures)
- mergeFailCount: int = 0 (max 3)
FLOW:
1. BUILD PROMPT
- First: original prompt
- Next: buildContinuationContext(lastRawResponse)
2. CALL AI -> response fragment
3. VALIDATE response (error or empty -> stop)
4. MERGE jsonBase + response
FAILS: repeat prompt, fails++ (if >=3 return fallback)
SUCCEEDS: try parse
SUCCEEDS: FINISHED
FAILS: step 5
5. GET CONTEXTS (merge OK, parse failed)
getContexts(mergedJson)
- If no cut point: overlapContext = ""
- Store contexts for next iteration
6. DECIDE
jsonParsingSuccess=true AND overlapContext="":
FINISHED. return completePart
jsonParsingSuccess=true AND overlapContext!="":
CONTINUE, fails=0
ELSE: repeat prompt, fails++
"""
import json
import logging
from typing import Dict, Any, List, Optional, Callable
from modules.datamodels.datamodelAi import (
AiCallRequest, AiCallOptions
)
from modules.datamodels.datamodelExtraction import ContentPart
from .subJsonResponseHandling import JsonResponseHandler
from .subLoopingUseCases import LoopingUseCaseRegistry
from modules.workflows.processing.shared.stateTools import checkWorkflowStopped
from modules.shared.jsonContinuation import getContexts
from modules.shared.jsonUtils import (
    buildContinuationContext,
    closeJsonStructures,
    extractJsonString,
    normalizeJsonText,
    stripCodeFences,
    tryParseJson,
)
logger = logging.getLogger(__name__)
class AiCallLooper:
"""Handles AI calls with looping and repair logic."""
def __init__(self, services, aiService, responseParser):
"""Initialize AiCallLooper with service center, AI service, and response parser access."""
self.services = services
self.aiService = aiService
self.responseParser = responseParser
self.useCaseRegistry = LoopingUseCaseRegistry() # Initialize use case registry
async def callAiWithLooping(
self,
prompt: str,
options: AiCallOptions,
debugPrefix: str = "ai_call",
promptBuilder: Optional[Callable] = None,
promptArgs: Optional[Dict[str, Any]] = None,
operationId: Optional[str] = None,
userPrompt: Optional[str] = None,
contentParts: Optional[List[ContentPart]] = None, # ARCHITECTURE: Support ContentParts for large content
        useCaseId: Optional[str] = None  # REQUIRED: explicit use case ID - no auto-detection, no fallback
) -> str:
"""
Shared core function for AI calls with repair-based looping system.
Automatically repairs broken JSON and continues generation seamlessly.
Args:
prompt: The prompt to send to AI
options: AI call configuration options
debugPrefix: Prefix for debug file names
promptBuilder: Optional function to rebuild prompts for continuation
promptArgs: Optional arguments for prompt builder
operationId: Optional operation ID for progress tracking
userPrompt: Optional user prompt for KPI definition
contentParts: Optional content parts for first iteration
useCaseId: REQUIRED: Explicit use case ID - no auto-detection, no fallback
Returns:
Complete AI response after all iterations
"""
# REQUIRED: useCaseId must be provided - no auto-detection, no fallback
if not useCaseId:
errorMsg = (
"useCaseId is REQUIRED for callAiWithLooping. "
"No auto-detection - must explicitly specify use case ID. "
f"Available use cases: {list(self.useCaseRegistry.useCases.keys())}"
)
logger.error(errorMsg)
raise ValueError(errorMsg)
# Validate use case exists
useCase = self.useCaseRegistry.get(useCaseId)
if not useCase:
errorMsg = (
f"Use case '{useCaseId}' not found in registry. "
f"Available use cases: {list(self.useCaseRegistry.useCases.keys())}"
)
logger.error(errorMsg)
raise ValueError(errorMsg)
maxIterations = 50 # Prevent infinite loops
iteration = 0
allSections = [] # Accumulate all sections across iterations
lastRawResponse = None # Store last raw JSON response for continuation
# JSON Base Iteration System:
# - jsonBase: the merged JSON string (replaces accumulatedDirectJson array)
# - After each iteration, new response is merged with jsonBase
# - On merge success: check if complete, store contexts for next iteration
# - On merge fail: retry with same prompt, increment fails
jsonBase = None # Merged JSON string (starts None, set on first response)
# Merge fail tracking - stop after 3 consecutive merge failures
MAX_MERGE_FAILS = 3
mergeFailCount = 0 # Global counter for merge failures across entire loop
lastValidCompletePart = None # Store last successfully parsed completePart for fallback
# Get parent operation ID for iteration operations (parentId should be operationId, not log entry ID)
parentOperationId = operationId # Use the parent's operationId directly
while iteration < maxIterations:
iteration += 1
# Create separate operation for each iteration with parent reference
iterationOperationId = None
if operationId:
iterationOperationId = f"{operationId}_iter_{iteration}"
self.services.chat.progressLogStart(
iterationOperationId,
"AI Call",
f"Iteration {iteration}",
"",
parentOperationId=parentOperationId
)
# Build iteration prompt
# CRITICAL: Build continuation prompt if we have sections OR if we have a previous response (even if broken)
# This ensures continuation prompts are built even when JSON is so broken that no sections can be extracted
if (len(allSections) > 0 or lastRawResponse) and promptBuilder and promptArgs:
# Extract templateStructure and basePrompt from promptArgs (REQUIRED)
templateStructure = promptArgs.get("templateStructure")
if not templateStructure:
raise ValueError(
f"templateStructure is REQUIRED in promptArgs for use case '{useCaseId}'. "
"Prompt creation functions must return (prompt, templateStructure) tuple."
)
basePrompt = promptArgs.get("basePrompt")
if not basePrompt:
# Fallback: use prompt parameter (should be the same)
basePrompt = prompt
logger.warning(
f"basePrompt not found in promptArgs for use case '{useCaseId}', "
"using prompt parameter instead. This may indicate a bug."
)
# This is a continuation - build continuation context with raw JSON and rebuild prompt
continuationContext = buildContinuationContext(
allSections, lastRawResponse, useCaseId, templateStructure
)
if not lastRawResponse:
logger.warning(f"Iteration {iteration}: No previous response available for continuation!")
# Store valid completePart from continuation context for fallback on merge failures
# Use getContexts to check if completePart is parseable and store it
if lastRawResponse and not lastValidCompletePart:
try:
contexts = getContexts(lastRawResponse)
if contexts.jsonParsingSuccess and contexts.completePart:
lastValidCompletePart = contexts.completePart
logger.debug(f"Iteration {iteration}: Stored initial valid completePart ({len(lastValidCompletePart)} chars)")
except Exception as e:
logger.debug(f"Iteration {iteration}: Failed to extract completePart: {e}")
# Unified prompt builder call: Continuation builders only need continuationContext, templateStructure, and basePrompt
# All initial context (section, userPrompt, etc.) is already in basePrompt, so promptArgs is not needed
# Extract templateStructure and basePrompt from promptArgs (they're explicit parameters)
iterationPrompt = await promptBuilder(
continuationContext=continuationContext,
templateStructure=templateStructure,
basePrompt=basePrompt
)
else:
# First iteration - use original prompt
iterationPrompt = prompt
# Make AI call
try:
checkWorkflowStopped(self.services)
if iterationOperationId:
self.services.chat.progressLogUpdate(iterationOperationId, 0.3, "Calling AI model")
# ARCHITECTURE: Pass ContentParts directly to AiCallRequest
# This allows model-aware chunking to handle large content properly
# ContentParts are only passed in first iteration (continuations don't need them)
request = AiCallRequest(
prompt=iterationPrompt,
context="",
options=options,
contentParts=contentParts if iteration == 1 else None # Only pass ContentParts in first iteration
)
                # Write the ACTUAL prompt sent to the AI: the base name for the
                # first iteration, a per-iteration debug file for continuations
                # (section content and document generation behave identically here)
                if iteration == 1:
                    self.services.utils.writeDebugFile(iterationPrompt, f"{debugPrefix}_prompt")
                else:
                    self.services.utils.writeDebugFile(iterationPrompt, f"{debugPrefix}_prompt_iteration_{iteration}")
response = await self.aiService.callAi(request)
result = response.content
# Track bytes for progress reporting
bytesReceived = len(result.encode('utf-8')) if result else 0
                sectionBytes = sum(
                    len(section.get('content', '').encode('utf-8'))
                    for section in allSections
                    if isinstance(section.get('content'), str)
                )
                totalBytesSoFar = sectionBytes + bytesReceived
# Update progress after AI call with byte information
if iterationOperationId:
# Format bytes for display (kB or MB)
if totalBytesSoFar < 1024:
bytesDisplay = f"{totalBytesSoFar}B"
elif totalBytesSoFar < 1024 * 1024:
bytesDisplay = f"{totalBytesSoFar / 1024:.1f}kB"
else:
bytesDisplay = f"{totalBytesSoFar / (1024 * 1024):.1f}MB"
self.services.chat.progressLogUpdate(iterationOperationId, 0.6, f"AI response received ({bytesDisplay})")
                # Write the raw AI response to a debug file: the base name for the
                # first iteration, a per-iteration file for continuations
                if iteration == 1:
                    self.services.utils.writeDebugFile(result, f"{debugPrefix}_response")
                else:
                    self.services.utils.writeDebugFile(result, f"{debugPrefix}_response_iteration_{iteration}")
# Note: Stats are now stored centrally in callAi() - no need to duplicate here
# Check for error response using generic error detection (errorCount > 0 or modelName == "error")
if hasattr(response, 'errorCount') and response.errorCount > 0:
errorMsg = f"Iteration {iteration}: Error response detected (errorCount={response.errorCount}), stopping loop: {result[:200] if result else 'empty'}"
logger.error(errorMsg)
break
if hasattr(response, 'modelName') and response.modelName == "error":
errorMsg = f"Iteration {iteration}: Error response detected (modelName=error), stopping loop: {result[:200] if result else 'empty'}"
logger.error(errorMsg)
break
if not result or not result.strip():
logger.warning(f"Iteration {iteration}: Empty response, stopping")
break
# Check if this is a text response (not document generation)
# Text responses don't need JSON parsing - return immediately after first successful response
isTextResponse = (promptBuilder is None and promptArgs is None) or debugPrefix == "text"
if isTextResponse:
# For text responses, return the text immediately - no JSON parsing needed
logger.info(f"Iteration {iteration}: Text response received, returning immediately")
if iterationOperationId:
self.services.chat.progressLogFinish(iterationOperationId, True)
return result
# NOTE: Do NOT update lastRawResponse here!
# lastRawResponse should only be updated after successful merge
# This ensures retry iterations use the correct base context
# Handle use cases that return JSON directly (no section extraction needed)
# Check if use case supports direct return (all registered use cases do)
if useCase and not useCase.requiresExtraction:
# =====================================================================
# ITERATION FLOW (Simplified)
# =====================================================================
# Step 4: MERGE jsonBase + new response
                    # - FAILS: repeat prompt, increment fail count (if >=3 return fallback)
# - SUCCEEDS: try parse
# - SUCCEEDS: FINISHED
# - FAILS: proceed to Step 5
# Step 5: GET CONTEXTS (merge OK, parse failed)
# - getContexts() with repair
# - If no cut point: overlapContext = ""
# Step 6: DECIDE
# - jsonParsingSuccess=true AND overlapContext="": FINISHED
# - jsonParsingSuccess=true AND overlapContext!="": continue, fails=0
                    # - ELSE: repeat prompt, increment fail count
# =====================================================================
# STEP 4: MERGE jsonBase + new response
# Use candidateJson to hold merged result until we confirm it's valid
candidateJson = None
if jsonBase is None:
# First iteration - candidate is the current result
candidateJson = result
logger.debug(f"Iteration {iteration}: First response, candidateJson ({len(candidateJson)} chars)")
else:
# Merge jsonBase with new response
logger.info(f"Iteration {iteration}: Merging jsonBase ({len(jsonBase)} chars) with new response ({len(result)} chars)")
mergedJsonString, hasOverlap = JsonResponseHandler.mergeJsonStringsWithOverlap(jsonBase, result)
if not hasOverlap:
# MERGE FAILED - repeat prompt with unchanged jsonBase
mergeFailCount += 1
logger.warning(
f"Iteration {iteration}: Merge failed, no overlap found "
f"(fail {mergeFailCount}/{MAX_MERGE_FAILS})"
)
if mergeFailCount >= MAX_MERGE_FAILS:
# Max failures reached - return last valid completePart
logger.error(
f"Iteration {iteration}: Max merge failures ({MAX_MERGE_FAILS}) reached, "
"returning last valid completePart"
)
if iterationOperationId:
self.services.chat.progressLogFinish(iterationOperationId, False)
if lastValidCompletePart:
try:
extracted = extractJsonString(lastValidCompletePart)
parsed, parseErr, _ = tryParseJson(extracted)
if parseErr is None and parsed:
normalized = self._normalizeJsonStructure(parsed, useCase)
return json.dumps(normalized, indent=2, ensure_ascii=False)
except Exception:
pass
return lastValidCompletePart
else:
# No valid fallback - return whatever we have
return jsonBase if jsonBase else ""
# Not at max failures - retry with same prompt (jsonBase unchanged)
if iterationOperationId:
self.services.chat.progressLogUpdate(
iterationOperationId, 0.7,
f"Merge failed ({mergeFailCount}/{MAX_MERGE_FAILS}), retrying"
)
self.services.chat.progressLogFinish(iterationOperationId, True)
continue
# MERGE SUCCEEDED - set candidate (don't update jsonBase yet!)
candidateJson = mergedJsonString
logger.debug(f"Iteration {iteration}: Merge succeeded, candidateJson ({len(candidateJson)} chars)")
# Update lastRawResponse ONLY after we have a valid candidateJson
# (first iteration or successful merge - NOT on merge failure!)
# This ensures retry iterations use the correct base context
lastRawResponse = candidateJson
# Try direct parse of candidate
try:
extracted = extractJsonString(candidateJson)
parsed, parseErr, _ = tryParseJson(extracted)
if parseErr is None and parsed:
# Direct parse succeeded - FINISHED
# Commit candidate to jsonBase
jsonBase = candidateJson
logger.info(f"Iteration {iteration}: Direct parse succeeded, JSON is complete")
normalized = self._normalizeJsonStructure(parsed, useCase)
result = json.dumps(normalized, indent=2, ensure_ascii=False)
if iterationOperationId:
self.services.chat.progressLogFinish(iterationOperationId, True)
if not useCase.finalResultHandler:
raise ValueError(
f"Use case '{useCaseId}' is missing required 'finalResultHandler' callback."
)
return useCase.finalResultHandler(
result, normalized, extracted, debugPrefix, self.services
)
except Exception as e:
logger.debug(f"Iteration {iteration}: Direct parse failed: {e}")
# STEP 5: GET CONTEXTS (merge OK, parse failed = cut JSON)
# Use candidateJson for context extraction
contexts = getContexts(candidateJson)
overlapInfo = "(empty=complete)" if contexts.overlapContext == "" else f"({len(contexts.overlapContext)} chars)"
logger.debug(
f"Iteration {iteration}: getContexts() -> "
f"jsonParsingSuccess={contexts.jsonParsingSuccess}, "
f"overlapContext={overlapInfo}"
)
# STEP 6: DECIDE based on jsonParsingSuccess and overlapContext
if contexts.jsonParsingSuccess and contexts.overlapContext == "":
# JSON is complete (no cut point) - FINISHED
# Use completePart for final result (closed, repaired JSON)
# No more merging needed, so we don't need the cut version
jsonBase = contexts.completePart
logger.info(f"Iteration {iteration}: jsonParsingSuccess=true, overlapContext='', JSON complete")
# Store and parse completePart
lastValidCompletePart = contexts.completePart
try:
extracted = extractJsonString(contexts.completePart)
parsed, parseErr, _ = tryParseJson(extracted)
if parseErr is None and parsed:
normalized = self._normalizeJsonStructure(parsed, useCase)
result = json.dumps(normalized, indent=2, ensure_ascii=False)
if iterationOperationId:
self.services.chat.progressLogFinish(iterationOperationId, True)
if not useCase.finalResultHandler:
raise ValueError(
f"Use case '{useCaseId}' is missing required 'finalResultHandler' callback."
)
return useCase.finalResultHandler(
result, normalized, extracted, debugPrefix, self.services
)
except Exception as e:
logger.warning(f"Iteration {iteration}: Failed to parse completePart: {e}")
# Fallback: return completePart as-is
if iterationOperationId:
self.services.chat.progressLogFinish(iterationOperationId, True)
return contexts.completePart
elif contexts.jsonParsingSuccess and contexts.overlapContext != "":
# JSON parseable but has cut point - CONTINUE to next iteration
# CRITICAL: Use hierarchyContext (CUT json) as jsonBase for next merge!
# - hierarchyContext = the truncated JSON at cut point (needed for overlap matching)
# - completePart = closed JSON (for validation/fallback only)
# The next AI fragment's overlap must match the CUT point, not closed structures
jsonBase = contexts.hierarchyContext
logger.info(
f"Iteration {iteration}: jsonParsingSuccess=true, overlapContext not empty, "
f"continuing iteration (jsonBase updated to hierarchyContext: {len(jsonBase)} chars)"
)
# Store valid completePart as fallback (different from jsonBase!)
lastValidCompletePart = contexts.completePart
# Reset fail counter on successful progress
mergeFailCount = 0
# Update lastRawResponse for continuation prompt building
# Use the CUT version for prompt context as well
lastRawResponse = jsonBase
if iterationOperationId:
self.services.chat.progressLogUpdate(iterationOperationId, 0.7, "JSON incomplete, requesting continuation")
self.services.chat.progressLogFinish(iterationOperationId, True)
continue
else:
# JSON not parseable after repair - repeat prompt, increment fails
# Do NOT update jsonBase - keep previous valid state
mergeFailCount += 1
logger.warning(
f"Iteration {iteration}: jsonParsingSuccess=false, "
f"repeat prompt (fail {mergeFailCount}/{MAX_MERGE_FAILS})"
)
if mergeFailCount >= MAX_MERGE_FAILS:
# Max failures reached - return last valid completePart
logger.error(
f"Iteration {iteration}: Max failures ({MAX_MERGE_FAILS}) reached, "
"returning last valid completePart"
)
if iterationOperationId:
self.services.chat.progressLogFinish(iterationOperationId, False)
if lastValidCompletePart:
try:
extracted = extractJsonString(lastValidCompletePart)
parsed, parseErr, _ = tryParseJson(extracted)
if parseErr is None and parsed:
normalized = self._normalizeJsonStructure(parsed, useCase)
return json.dumps(normalized, indent=2, ensure_ascii=False)
except Exception:
pass
return lastValidCompletePart
else:
return jsonBase if jsonBase else ""
# Not at max - retry with same prompt
# Do NOT update jsonBase or lastRawResponse - keep previous for retry
if iterationOperationId:
self.services.chat.progressLogUpdate(
iterationOperationId, 0.7,
f"Parse failed ({mergeFailCount}/{MAX_MERGE_FAILS}), retrying"
)
self.services.chat.progressLogFinish(iterationOperationId, True)
continue
except Exception as e:
logger.error(f"Error in AI call iteration {iteration}: {str(e)}")
if iterationOperationId:
self.services.chat.progressLogFinish(iterationOperationId, False)
break
if iteration >= maxIterations:
logger.warning(f"AI call stopped after maximum iterations ({maxIterations})")
# This code path should never be reached because all registered use cases
# return early when JSON is complete. This would only execute for use cases that
# require section extraction, but no such use cases are currently registered.
logger.error(f"Unexpected code path: reached end of loop without return for use case '{useCaseId}'")
return result if result else ""
def _isJsonStringIncomplete(self, jsonString: str) -> bool:
"""
Check if JSON string is incomplete (truncated) BEFORE closing/parsing.
This is critical because if JSON is truncated, closing it makes it appear complete,
but we need to detect the truncation to continue iteration.
Args:
jsonString: JSON string to check
Returns:
True if JSON string appears incomplete/truncated, False otherwise
"""
if not jsonString or not jsonString.strip():
return False
# Normalize JSON string
normalized = stripCodeFences(normalizeJsonText(jsonString)).strip()
if not normalized:
return False
# Find first '{' or '[' to start
startIdx = -1
for i, char in enumerate(normalized):
if char in '{[':
startIdx = i
break
if startIdx == -1:
return False
jsonContent = normalized[startIdx:]
# Check if structures are balanced (all opened structures are closed)
braceCount = 0
bracketCount = 0
inString = False
escapeNext = False
for char in jsonContent:
if escapeNext:
escapeNext = False
continue
if char == '\\':
escapeNext = True
continue
if char == '"':
inString = not inString
continue
if not inString:
if char == '{':
braceCount += 1
elif char == '}':
braceCount -= 1
elif char == '[':
bracketCount += 1
elif char == ']':
bracketCount -= 1
# If structures are unbalanced, JSON is incomplete
if braceCount > 0 or bracketCount > 0:
return True
# Check if JSON ends with incomplete value (e.g., unclosed string, incomplete number, trailing comma)
trimmed = jsonContent.rstrip()
if not trimmed:
return False
        # A trailing comma is ambiguous: it may simply precede more elements,
        # so on its own it is not treated as proof of truncation
        if trimmed.endswith(','):
            return False  # trailing comma alone does not mean incomplete
        # Check whether the text ends inside an unclosed string literal.
        # A raw quote count would miscount escaped quotes, so reuse the
        # escape-aware scan above, which tracked this in `inString`.
        if inString:
            return True
        # Check if the text ends mid-value. Patterns that suggest truncation:
        # - ends with an incomplete number (e.g., "417 cut off from "4170")
        # - ends with an incomplete array element (e.g., ["417)
        # - ends with an incomplete object property (e.g., {"key": "val)
# If JSON parses successfully without closing, it's complete
parsed, parseErr, _ = tryParseJson(jsonContent)
if parseErr is None:
# Parses successfully - it's complete
return False
# If it doesn't parse, try closing it and see if that helps
closed = closeJsonStructures(jsonContent)
parsedClosed, parseErrClosed, _ = tryParseJson(closed)
if parseErrClosed is None:
# Only parses after closing - it was incomplete
return True
# Doesn't parse even after closing - might be malformed, but assume incomplete to be safe
return True
def _normalizeJsonStructure(self, parsed: Any, useCase) -> Any:
"""
Normalize JSON structure to ensure consistent format before merging.
Handles different response formats and converts them to expected structure.
Args:
parsed: Parsed JSON object (can be dict, list, or primitive)
useCase: LoopingUseCase instance with jsonNormalizer callback
Returns:
Normalized JSON structure
"""
# Use callback to normalize JSON structure (REQUIRED - no fallback)
if not useCase or not useCase.jsonNormalizer:
raise ValueError(
f"Use case '{useCase.useCaseId if useCase else 'unknown'}' is missing required 'jsonNormalizer' callback. "
"All use cases must provide a jsonNormalizer function."
)
return useCase.jsonNormalizer(parsed, useCase.useCaseId)
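
For orientation, the merge step the loop depends on can be pictured as "longest suffix of the base that is a prefix of the new fragment". A conceptual sketch only; the real `JsonResponseHandler.mergeJsonStringsWithOverlap` is more robust:

```python
def merge_with_overlap(base: str, fragment: str) -> tuple:
    """Append fragment to base, deduplicating the longest shared overlap."""
    for size in range(min(len(base), len(fragment)), 0, -1):
        if base.endswith(fragment[:size]):
            return base + fragment[size:], True
    return base, False  # no overlap -> the loop above repeats the prompt

merged, ok = merge_with_overlap('{"items": ["a", "b', '"b", "c"]}')
# ok is True and merged == '{"items": ["a", "b", "c"]}'
```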

View file

@@ -0,0 +1,721 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Content Extraction Module
Handles content extraction and preparation, including:
- Extracting content from documents based on intents
- Processing pre-extracted documents
- Vision AI for image text extraction
- AI processing of text content
"""
import json
import logging
import base64
from typing import Dict, Any, List, Optional
from modules.datamodels.datamodelChat import ChatDocument
from modules.datamodels.datamodelExtraction import ContentPart, DocumentIntent, ExtractionOptions, MergeStrategy
from modules.workflows.processing.shared.stateTools import checkWorkflowStopped
logger = logging.getLogger(__name__)
class ContentExtractor:
"""Handles content extraction and preparation."""
def __init__(self, services, aiService, intentAnalyzer):
"""Initialize ContentExtractor with service center, AI service, and intent analyzer access."""
self.services = services
self.aiService = aiService
self.intentAnalyzer = intentAnalyzer
async def extractAndPrepareContent(
self,
documents: List[ChatDocument],
documentIntents: List[DocumentIntent],
parentOperationId: str,
getIntentForDocument: callable
) -> List[ContentPart]:
"""
        Phase 5B: Extracts content based on intents and prepares ContentParts with metadata.
        Returns a list of ContentParts in the appropriate format.
        IMPORTANT: One document can produce multiple ContentParts when multiple intents are present.
        Example: an image with intents=["extract", "render"] produces:
        - ContentPart(contentFormat="object", ...) for rendering
        - ContentPart(contentFormat="extracted", ...) for text analysis
        Args:
            documents: List of documents to process
            documentIntents: List of DocumentIntent objects
            parentOperationId: Parent operation ID for the ChatLog hierarchy
            getIntentForDocument: Callable to get the intent for a document ID
        Returns:
            List of ContentParts with complete metadata
"""
        # Create an operation ID for the extraction
extractionOperationId = f"{parentOperationId}_content_extraction"
        # Start the ChatLog with a parent reference
self.services.chat.progressLogStart(
extractionOperationId,
"Content Extraction",
"Extraction",
f"Extracting from {len(documents)} documents",
parentOperationId=parentOperationId
)
try:
allContentParts = []
for document in documents:
checkWorkflowStopped(self.services)
# Check if document is already a ContentExtracted document (pre-extracted JSON)
logger.debug(f"Checking document {document.id} ({document.fileName}, mimeType={document.mimeType}) for pre-extracted content")
preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(document)
if preExtracted:
logger.info(f"✅ Found pre-extracted document: {document.fileName} -> Original: {preExtracted['originalDocument']['fileName']}")
logger.info(f" Pre-extracted document ID: {document.id}, Original document ID: {preExtracted['originalDocument']['id']}")
logger.info(f" ContentParts count: {len(preExtracted['contentExtracted'].parts) if preExtracted['contentExtracted'].parts else 0}")
                # Use the already-extracted ContentParts directly
                contentExtracted = preExtracted["contentExtracted"]
                # IMPORTANT: the intent must be resolved for the JSON document, not the original
                # (intent analysis already maps back to the JSON document ID)
intent = getIntentForDocument(document.id, documentIntents)
logger.info(f" Intent lookup for document {document.id}: found={intent is not None}")
if intent:
logger.info(f" Intent: {intent.intents}, extractionPrompt: {intent.extractionPrompt[:100] if intent.extractionPrompt else None}...")
else:
logger.warning(f" ⚠️ No intent found for pre-extracted document {document.id}! Available intent documentIds: {[i.documentId for i in documentIntents]}")
if contentExtracted.parts:
# CRITICAL: Process pre-extracted parts - analyze structure parts for nested content
processedParts = []
for part in contentExtracted.parts:
                            # Skip empty parts (containers without data)
if not part.data or (isinstance(part.data, str) and len(part.data.strip()) == 0):
if part.typeGroup == "container":
                                    continue  # skip empty containers
# CRITICAL: Check if structure part contains nested parts (e.g., JSON with documentData.parts)
if part.typeGroup == "structure" and part.mimeType == "application/json" and part.data:
nestedParts = self._extractNestedPartsFromStructure(part, document, preExtracted, intent)
if nestedParts:
# Replace structure part with extracted nested parts
processedParts.extend(nestedParts)
logger.info(f"✅ Extracted {len(nestedParts)} nested parts from structure part {part.id}")
continue # Skip original structure part
# Keep original part if no nested parts found
processedParts.append(part)
# Use processed parts (with nested parts extracted)
for part in processedParts:
if not part.metadata:
part.metadata = {}
# Ensure metadata is complete
if "documentId" not in part.metadata:
part.metadata["documentId"] = document.id
                            # IMPORTANT: check the intent for this part
                            partIntent = intent.intents if intent else ["extract"]
                            # Debug logging for intent processing
                            logger.debug(f"Processing part {part.id}: typeGroup={part.typeGroup}, intents={partIntent}, hasData={bool(part.data)}, dataLength={len(str(part.data)) if part.data else 0}")
                            # IMPORTANT: a part can carry multiple intents - create one ContentPart per intent
                            # Generic intent handling applies to ALL content types
hasReferenceIntent = "reference" in partIntent
hasRenderIntent = "render" in partIntent
hasExtractIntent = "extract" in partIntent
hasPartData = bool(part.data) and (not isinstance(part.data, str) or len(part.data.strip()) > 0)
logger.debug(f"Part {part.id}: reference={hasReferenceIntent}, render={hasRenderIntent}, extract={hasExtractIntent}, hasData={hasPartData}")
# SAFETY: For images with any intent, always ensure render is included
# This ensures the image object part is always available for later rendering
isImage = part.typeGroup == "image" or (part.mimeType and part.mimeType.startswith("image/"))
if isImage and hasPartData and not hasRenderIntent:
logger.info(f"🖼️ Auto-adding render intent for image {part.id} (original intents: {partIntent})")
hasRenderIntent = True
                            # Track whether the original part has already been added
originalPartAdded = False
                            # 1. Reference intent: create a reference ContentPart
if hasReferenceIntent:
referencePart = ContentPart(
id=f"ref_{document.id}_{part.id}",
label=f"Reference: {part.label or 'Content'}",
typeGroup="reference",
mimeType=part.mimeType or "application/octet-stream",
data="", # Leer - nur Referenz
metadata={
"contentFormat": "reference",
"documentId": document.id,
"documentReference": f"docItem:{document.id}:{preExtracted['originalDocument']['fileName']}",
"intent": "reference",
"usageHint": f"Reference: {preExtracted['originalDocument']['fileName']}",
"originalFileName": preExtracted["originalDocument"]["fileName"]
}
)
allContentParts.append(referencePart)
logger.debug(f"✅ Created reference ContentPart for {part.id}")
                            # 2. Render intent: create an object ContentPart (for binary/image rendering)
if hasRenderIntent and hasPartData:
                                # Check whether this is binary/image content (renderable)
isRenderable = (
part.typeGroup == "image" or
part.typeGroup == "binary" or
(part.mimeType and (
part.mimeType.startswith("image/") or
part.mimeType.startswith("video/") or
part.mimeType.startswith("audio/") or
self._isBinary(part.mimeType)
))
)
if isRenderable:
objectPart = ContentPart(
id=f"obj_{document.id}_{part.id}",
label=f"Object: {part.label or 'Content'}",
typeGroup=part.typeGroup,
mimeType=part.mimeType or "application/octet-stream",
                                        data=part.data,  # base64/binary data is already present
metadata={
"contentFormat": "object",
"documentId": document.id,
"intent": "render",
"usageHint": f"Render as visual element: {preExtracted['originalDocument']['fileName']}",
"originalFileName": preExtracted["originalDocument"]["fileName"],
"relatedExtractedPartId": f"extracted_{document.id}_{part.id}" if hasExtractIntent else None
}
)
allContentParts.append(objectPart)
logger.debug(f"✅ Created object ContentPart for {part.id} (render intent)")
else:
logger.warning(f"⚠️ Part {part.id} has render intent but is not renderable (typeGroup={part.typeGroup}, mimeType={part.mimeType})")
elif hasRenderIntent and not hasPartData:
logger.warning(f"⚠️ Part {part.id} has render intent but no data, skipping render part")
                            # 3. Extract intent: create an extracted ContentPart (NO AI processing here - happens during section generation)
if hasExtractIntent:
# For images: Keep as image part with extract intent - Vision AI extraction happens during section generation
if part.typeGroup == "image" and hasPartData:
logger.info(f"📷 Image {part.id} with extract intent - will be processed with Vision AI during section generation")
# Keep image part as-is, mark with extract intent
part.metadata.update({
"contentFormat": "extracted", # Marked for extraction, but not yet extracted
"intent": "extract",
"originalFileName": preExtracted["originalDocument"]["fileName"],
"relatedObjectPartId": f"obj_{document.id}_{part.id}" if hasRenderIntent else None,
"extractionPrompt": intent.extractionPrompt if intent and intent.extractionPrompt else "Extract all text content from this image.",
"needsVisionExtraction": True # Flag to indicate Vision AI extraction needed
})
allContentParts.append(part)
originalPartAdded = True
else:
# For text/table content: Use directly as extracted (no AI processing here)
# AI processing with extractionPrompt happens during section generation
if not originalPartAdded:
part.metadata.update({
"contentFormat": "extracted",
"intent": "extract",
"fromExtractContent": True,
"skipExtraction": True, # Already extracted (raw extraction)
"originalFileName": preExtracted["originalDocument"]["fileName"],
"relatedObjectPartId": f"obj_{document.id}_{part.id}" if hasRenderIntent else None,
"extractionPrompt": intent.extractionPrompt if intent and intent.extractionPrompt else None
})
                                        # Make sure contentFormat is set
if "contentFormat" not in part.metadata:
part.metadata["contentFormat"] = "extracted"
allContentParts.append(part)
originalPartAdded = True
logger.debug(f"✅ Using pre-extracted ContentPart {part.id} as extracted (no AI processing needed)")
                            # 4. Fallback: no intent present, or the part has not been added yet
                            # (normally should not happen, since the default intent is "extract")
if not hasReferenceIntent and not hasRenderIntent and not hasExtractIntent and not originalPartAdded:
logger.warning(f"⚠️ Part {part.id} has no recognized intents, adding as extracted by default")
part.metadata.update({
"contentFormat": "extracted",
"intent": "extract",
"fromExtractContent": True,
"skipExtraction": True,
"originalFileName": preExtracted["originalDocument"]["fileName"]
})
allContentParts.append(part)
originalPartAdded = True
logger.info(f"✅ Using {len([p for p in contentExtracted.parts if p.data and len(str(p.data)) > 0])} pre-extracted ContentParts from ContentExtracted document {document.fileName}")
logger.info(f" Original document: {preExtracted['originalDocument']['fileName']}")
continue # Skip normal extraction for this document
# Check if it's standardized JSON format (has "documents" or "sections")
if document.mimeType == "application/json":
try:
docBytes = self.services.interfaceDbComponent.getFileData(document.fileId)
if docBytes:
docData = docBytes.decode('utf-8')
jsonData = json.loads(docData)
if isinstance(jsonData, dict) and ("documents" in jsonData or "sections" in jsonData):
logger.info(f"Document is already in standardized JSON format, using as reference")
# Create reference ContentPart for structured JSON
contentPart = ContentPart(
id=f"ref_{document.id}",
label=f"Reference: {document.fileName}",
typeGroup="structure",
mimeType="application/json",
data=docData,
metadata={
"contentFormat": "reference",
"documentId": document.id,
"documentReference": f"docItem:{document.id}:{document.fileName}",
"skipExtraction": True,
"intent": "reference"
}
)
allContentParts.append(contentPart)
logger.info(f"✅ Using JSON document directly without extraction")
continue # Skip normal extraction for this document
except Exception as e:
logger.warning(f"Could not parse JSON document {document.fileName}, will extract normally: {str(e)}")
# Continue with normal extraction
# Normal extraction path
intent = getIntentForDocument(document.id, documentIntents)
if not intent:
# Try to find intent by similar UUID (fix for AI UUID hallucination)
correctedIntent = self._findIntentBySimilarId(document.id, documentIntents)
if correctedIntent:
logger.warning(f"Found intent for document {document.id} using UUID correction (original: {correctedIntent.documentId})")
# Create new intent with correct document ID
intent = DocumentIntent(
documentId=document.id,
intents=correctedIntent.intents,
extractionPrompt=correctedIntent.extractionPrompt,
reasoning=f"Intent matched by UUID similarity (original: {correctedIntent.documentId})"
)
else:
                        # Default: extract for all documents without an intent
logger.warning(f"No intent found for document {document.id}, using default 'extract'")
intent = DocumentIntent(
documentId=document.id,
intents=["extract"],
extractionPrompt="Extract all content from the document",
reasoning="Default intent: no specific intent found"
)
                # IMPORTANT: check all intents - one document can produce multiple ContentParts
                if "reference" in intent.intents:
                    # Create a reference ContentPart
contentPart = ContentPart(
id=f"ref_{document.id}",
label=f"Reference: {document.fileName}",
typeGroup="reference",
mimeType=document.mimeType,
data="",
metadata={
"contentFormat": "reference",
"documentId": document.id,
"documentReference": f"docItem:{document.id}:{document.fileName}",
"intent": "reference",
"usageHint": f"Reference document: {document.fileName}"
}
)
allContentParts.append(contentPart)
# WICHTIG: "render" und "extract" können beide vorhanden sein!
# In diesem Fall erzeugen wir BEIDE ContentParts
# SAFETY: For images with any intent, always create object part for later rendering
isImageDocument = document.mimeType and document.mimeType.startswith("image/")
shouldAutoRender = isImageDocument and "render" not in intent.intents and ("extract" in intent.intents or "reference" in intent.intents)
if shouldAutoRender:
logger.info(f"🖼️ Auto-adding render for image document {document.id} (original intents: {intent.intents})")
if "render" in intent.intents or shouldAutoRender:
                    # For images/binary: extract as an object
if document.mimeType.startswith("image/") or self._isBinary(document.mimeType):
try:
                            # Load binary data (getFileData is not async - no await needed)
binaryData = self.services.interfaceDbComponent.getFileData(document.fileId)
if not binaryData:
logger.warning(f"No binary data found for document {document.id}")
continue
base64Data = base64.b64encode(binaryData).decode('utf-8')
contentPart = ContentPart(
id=f"obj_{document.id}",
label=f"Object: {document.fileName}",
typeGroup="image" if document.mimeType.startswith("image/") else "binary",
mimeType=document.mimeType,
data=base64Data,
metadata={
"contentFormat": "object",
"documentId": document.id,
"intent": "render",
"usageHint": f"Render as visual element: {document.fileName}",
"originalFileName": document.fileName,
                                    # Link to the extracted part (if present)
"relatedExtractedPartId": f"ext_{document.id}" if "extract" in intent.intents else None
}
)
allContentParts.append(contentPart)
except Exception as e:
logger.error(f"Failed to load binary data for document {document.id}: {str(e)}")
if "extract" in intent.intents:
                    # Extract content via the extraction service
                    extractionPrompt = intent.extractionPrompt or "Extract all content from the document"
                    # Debug log (harmonized)
                    self.services.utils.writeDebugFile(
                        extractionPrompt,
                        f"content_extraction_prompt_{document.id}"
                    )
                    # Run the extraction
extractionOptions = ExtractionOptions(
prompt=extractionPrompt,
mergeStrategy=MergeStrategy()
)
                    # extractContent is not async - no await needed
checkWorkflowStopped(self.services)
extractedResults = self.services.extraction.extractContent(
[document],
extractionOptions,
operationId=extractionOperationId,
parentOperationId=extractionOperationId
)
                    # Convert extracted results into ContentParts with metadata
# Check if object part exists (either explicit render or auto-render for images)
hasObjectPart = "render" in intent.intents or shouldAutoRender
for extracted in extractedResults:
for part in extracted.parts:
                            # Mark as extracted format
part.metadata.update({
"contentFormat": "extracted",
"documentId": document.id,
"extractionPrompt": extractionPrompt,
"intent": "extract",
"usageHint": f"Use extracted content from {document.fileName}",
# Link to the object part (if present - including auto-render for images)
"relatedObjectPartId": f"obj_{document.id}" if hasObjectPart else None
})
# For images: Mark that Vision AI extraction is needed during section generation
if part.typeGroup == "image":
part.metadata["needsVisionExtraction"] = True
logger.info(f"📷 Image part {part.id} marked for Vision AI extraction during section generation")
# Ensure the ID is unique (if an object part exists)
if hasObjectPart:
part.id = f"ext_{document.id}_{part.id}"
allContentParts.append(part)
# Debug log (harmonized)
self.services.utils.writeDebugFile(
json.dumps([part.dict() for part in allContentParts], indent=2, default=str),
"content_extraction_result"
)
# State 2 Validation: Validate and auto-fix ContentParts
validatedParts = []
for part in allContentParts:
# Validation 2.1: Skip ContentParts without documentId
if not part.metadata.get("documentId"):
logger.warning(f"Skipping ContentPart {part.id} - missing documentId in metadata")
continue
# Validation 2.2: Skip ContentParts with invalid contentFormat
contentFormat = part.metadata.get("contentFormat")
if contentFormat not in ["extracted", "object", "reference"]:
logger.warning(
f"Skipping ContentPart {part.id} - invalid contentFormat: {contentFormat}"
)
continue
validatedParts.append(part)
# Finish the chat log
self.services.chat.progressLogFinish(extractionOperationId, True)
return validatedParts
except Exception as e:
self.services.chat.progressLogFinish(extractionOperationId, False)
logger.error(f"Error in extractAndPrepareContent: {str(e)}")
raise
async def extractTextFromImage(self, imagePart: ContentPart, extractionPrompt: str) -> Optional[str]:
"""
Extract text from an image part using Vision AI.
Args:
imagePart: ContentPart with typeGroup="image"
extractionPrompt: Prompt for the text extraction
Returns:
Extracted text or None on error
"""
try:
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
# Final extraction prompt
finalPrompt = extractionPrompt or "Extract all text content from this image. Return only the extracted text, no additional formatting."
# Debug log (harmonized)
self.services.utils.writeDebugFile(
finalPrompt,
f"content_extraction_prompt_image_{imagePart.id}"
)
# Build the AI call request with the image part
request = AiCallRequest(
prompt=finalPrompt,
context="",
options=AiCallOptions(operationType=OperationTypeEnum.IMAGE_ANALYSE),
contentParts=[imagePart]
)
# Use the AI service for Vision AI processing
checkWorkflowStopped(self.services)
response = await self.aiService.callAi(request)
# Debug log for the response (harmonized)
if response and response.content:
self.services.utils.writeDebugFile(
response.content,
f"content_extraction_response_image_{imagePart.id}"
)
if response and response.content:
return response.content.strip()
# No content returned - return an error message for debugging
errorMsg = f"Vision AI extraction failed: No content returned for image {imagePart.id}"
logger.warning(errorMsg)
return f"[ERROR: {errorMsg}]"
except Exception as e:
errorMsg = f"Vision AI extraction failed for image {imagePart.id}: {str(e)}"
logger.error(errorMsg)
import traceback
logger.debug(f"Traceback: {traceback.format_exc()}")
# Return an error message instead of None for debugging
return f"[ERROR: {errorMsg}]"
async def processTextContentWithAi(self, textPart: ContentPart, extractionPrompt: str) -> Optional[str]:
"""
Process text content with AI based on extractionPrompt.
IMPORTANT: Pre-extracted ContentParts from context.extractContent contain RAW extracted text
(e.g. from a PDF text layer). If the "extract" intent is present, this text must be
processed with AI (transformation, structuring, etc.) based on extractionPrompt.
Args:
textPart: ContentPart with typeGroup="text" (or another text-based type)
extractionPrompt: Prompt for the AI processing of the text
Returns:
AI-processed text or None on error
"""
try:
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
# Final extraction prompt
finalPrompt = extractionPrompt or "Process and extract the key information from the following text content."
# Debug log (harmonized) - log prompt with text preview
textPreview = textPart.data[:500] + "..." if textPart.data and len(textPart.data) > 500 else (textPart.data or "")
promptWithContext = f"{finalPrompt}\n\n--- Text Content (preview) ---\n{textPreview}"
self.services.utils.writeDebugFile(
promptWithContext,
f"content_extraction_prompt_text_{textPart.id}"
)
# Build a text ContentPart for AI processing
# Use the existing text as input
textContentPart = ContentPart(
id=textPart.id,
label=textPart.label,
typeGroup="text",
mimeType="text/plain",
data=textPart.data if textPart.data else "",
metadata=textPart.metadata.copy() if textPart.metadata else {}
)
# Build the AI call request with the text part
request = AiCallRequest(
prompt=finalPrompt,
context="",
options=AiCallOptions(operationType=OperationTypeEnum.DATA_EXTRACT),
contentParts=[textContentPart]
)
# Use the AI service for text processing
checkWorkflowStopped(self.services)
response = await self.aiService.callAi(request)
# Debug log for the response (harmonized)
if response and response.content:
self.services.utils.writeDebugFile(
response.content,
f"content_extraction_response_text_{textPart.id}"
)
if response and response.content:
return response.content.strip()
# No content returned - return an error message for debugging
errorMsg = f"AI text processing failed: No content returned for text part {textPart.id}"
logger.warning(errorMsg)
return f"[ERROR: {errorMsg}]"
except Exception as e:
errorMsg = f"AI text processing failed for text part {textPart.id}: {str(e)}"
logger.error(errorMsg)
import traceback
logger.debug(f"Traceback: {traceback.format_exc()}")
# Return an error message instead of None for debugging
return f"[ERROR: {errorMsg}]"
def _isBinary(self, mimeType: str) -> bool:
"""Prüfe ob MIME-Type binary ist."""
binaryTypes = [
"application/octet-stream",
"application/pdf",
"application/zip",
"application/x-zip-compressed"
]
return mimeType in binaryTypes or mimeType.startswith("image/") or mimeType.startswith("video/") or mimeType.startswith("audio/")
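# Illustrative behavior (not part of the module; values chosen for demonstration):
#   _isBinary("application/pdf") -> True   (listed in binaryTypes)
#   _isBinary("image/png")       -> True   (image/ prefix)
#   _isBinary("text/plain")      -> False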
def _extractNestedPartsFromStructure(
self,
structurePart: ContentPart,
document: ChatDocument,
preExtracted: Dict[str, Any],
intent: Optional[Any]
) -> List[ContentPart]:
"""
Extract nested parts from a structure ContentPart (e.g., JSON with documentData.parts).
This is a generic function that analyzes pre-processed ContentParts and extracts
any nested parts that are embedded in structure data (typically JSON).
Works with standard ContentExtracted format: documentData.parts array.
Each nested part is extracted as a separate ContentPart with proper metadata.
Args:
structurePart: ContentPart with typeGroup="structure" containing nested parts
document: The document this part belongs to
preExtracted: Pre-extracted document metadata
intent: Document intent for nested parts
Returns:
List of extracted ContentParts, empty if no nested parts found
"""
nestedParts = []
try:
# Parse JSON structure
jsonData = json.loads(structurePart.data)
# Check for standard ContentExtracted format: documentData.parts
if isinstance(jsonData, dict):
documentData = jsonData.get("documentData")
if isinstance(documentData, dict):
parts = documentData.get("parts", [])
if isinstance(parts, list) and len(parts) > 0:
# Extract each nested part
for nestedPartData in parts:
if not isinstance(nestedPartData, dict):
continue
nestedPartId = nestedPartData.get("id") or f"nested_{len(nestedParts)}"
nestedTypeGroup = nestedPartData.get("typeGroup", "text")
nestedMimeType = nestedPartData.get("mimeType", "text/plain")
nestedLabel = nestedPartData.get("label", structurePart.label)
nestedData = nestedPartData.get("data", "")
nestedMetadata = nestedPartData.get("metadata", {})
# Create ContentPart for nested part
nestedPart = ContentPart(
id=f"{structurePart.id}_{nestedPartId}",
parentId=structurePart.id,
label=nestedLabel,
typeGroup=nestedTypeGroup,
mimeType=nestedMimeType,
data=nestedData,
metadata={
**nestedMetadata,
"documentId": document.id,
"fromNestedStructure": True,
"parentStructurePartId": structurePart.id,
"originalFileName": preExtracted["originalDocument"]["fileName"]
}
)
nestedParts.append(nestedPart)
logger.debug(f"✅ Extracted nested part: {nestedPart.id} (typeGroup={nestedTypeGroup}, mimeType={nestedMimeType})")
# If no nested parts found, return empty list (original part will be kept)
if not nestedParts:
logger.debug(f"No nested parts found in structure part {structurePart.id}")
except json.JSONDecodeError as e:
logger.warning(f"Could not parse structure part {structurePart.id} as JSON: {str(e)}")
except Exception as e:
logger.error(f"Error extracting nested parts from structure part {structurePart.id}: {str(e)}")
return nestedParts
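# Illustrative example (not part of the module) of the nested-structure format
# this method parses - values are invented; the standard ContentExtracted shape
# is documentData.parts:
# {
#   "documentData": {
#     "parts": [
#       {"id": "p1", "typeGroup": "text", "mimeType": "text/plain",
#        "data": "First chapter text", "metadata": {"pageNumber": 1}}
#     ]
#   }
# }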
def _findIntentBySimilarId(self, documentId: str, documentIntents: List[DocumentIntent]) -> Optional[DocumentIntent]:
"""
Tries to find an intent whose UUID is similar to the given document ID.
This helps with AI UUID hallucinations (e.g. 4451 -> 4551).
Args:
documentId: The document ID to find an intent for
documentIntents: List of all available DocumentIntents
Returns:
DocumentIntent with a similar UUID if found, otherwise None
"""
if not documentId or len(documentId) != 36: # UUID Format: 8-4-4-4-12
return None
# Check that it is a UUID (format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
if documentId.count('-') != 4:
return None
for intent in documentIntents:
intentId = intent.documentId
if len(intentId) != 36:
continue
# Count differing characters
differences = sum(c1 != c2 for c1, c2 in zip(documentId, intentId))
# If only 1-2 characters differ, it is probably a typo
if differences <= 2:
# Check whether the structure is similar (same hyphen positions)
if documentId.count('-') == intentId.count('-'):
return intent
return None
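# --- Illustrative sketch (not part of the module) ---
# Demonstrates the 1-2 character tolerance used by _findIntentBySimilarId to
# recover from AI UUID hallucinations. Both UUIDs below are invented.
if __name__ == "__main__":
    realId = "a3f1c2d4-4451-4e8b-9c2a-0f1e2d3c4b5a"  # ID of an actual intent
    aiId = "a3f1c2d4-4551-4e8b-9c2a-0f1e2d3c4b5a"    # AI response: one digit flipped
    differences = sum(c1 != c2 for c1, c2 in zip(aiId, realId))
    print(differences)  # 1 -> within the <= 2 tolerance, treated as a typo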

View file

@ -0,0 +1,369 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Document Intent Analysis Module
Handles analysis of document intents, including:
- Clarifying which documents need extraction vs reference
- Resolving pre-extracted documents
- Building intent analysis prompts
"""
import json
import logging
from typing import Dict, Any, List, Optional
from modules.datamodels.datamodelChat import ChatDocument
from modules.datamodels.datamodelExtraction import DocumentIntent, ContentExtracted
from modules.workflows.processing.shared.stateTools import checkWorkflowStopped
logger = logging.getLogger(__name__)
class DocumentIntentAnalyzer:
"""Handles document intent analysis and resolution."""
def __init__(self, services, aiService):
"""Initialize DocumentIntentAnalyzer with service center and AI service access."""
self.services = services
self.aiService = aiService
async def clarifyDocumentIntents(
self,
documents: List[ChatDocument],
userPrompt: str,
actionParameters: Dict[str, Any],
parentOperationId: str
) -> List[DocumentIntent]:
"""
Phase 5A: Analyzes which documents need extraction vs reference.
Returns a DocumentIntent for each document.
Args:
documents: List of documents to process
userPrompt: User request
actionParameters: Action-specific parameters (e.g. resultType, outputFormat)
parentOperationId: Parent operation ID for the chat log hierarchy
Returns:
List of DocumentIntent objects
"""
# Create an operation ID for the intent analysis
intentOperationId = f"{parentOperationId}_intent_analysis"
# Start the chat log with a parent reference
self.services.chat.progressLogStart(
intentOperationId,
"Document Intent Analysis",
"Intent Analysis",
f"Analyzing {len(documents)} documents",
parentOperationId=parentOperationId
)
try:
# Map pre-extracted JSONs to their original document IDs for the intent analysis
documentMapping = {} # Maps original doc ID -> JSON doc ID
resolvedDocuments = []
for doc in documents:
preExtracted = self.resolvePreExtractedDocument(doc)
if preExtracted:
originalDocId = preExtracted["originalDocument"]["id"]
documentMapping[originalDocId] = doc.id
# Create a temporary ChatDocument for the original document
originalDoc = ChatDocument(
id=originalDocId,
fileName=preExtracted["originalDocument"]["fileName"],
mimeType=preExtracted["originalDocument"]["mimeType"],
fileSize=preExtracted["originalDocument"].get("fileSize", doc.fileSize),
fileId=doc.fileId, # Keep the fileId from the JSON
messageId=doc.messageId if hasattr(doc, 'messageId') else None # Keep messageId if present
)
resolvedDocuments.append(originalDoc)
else:
resolvedDocuments.append(doc)
# Build the intent analysis prompt with the original documents
intentPrompt = self._buildIntentAnalysisPrompt(userPrompt, resolvedDocuments, actionParameters)
# AI call (use callAiPlanning for simple JSON responses)
# Debug logs are already written by callAiPlanning
checkWorkflowStopped(self.services)
aiResponse = await self.aiService.callAiPlanning(
prompt=intentPrompt,
debugType="document_intent_analysis"
)
# Parse the result and map back to JSON document IDs if needed
intentsData = json.loads(self.services.utils.jsonExtractString(aiResponse))
documentIntents = []
for intent in intentsData.get("intents", []):
docId = intent.get("documentId")
# If the intent refers to an original document, map back to the JSON document ID
if docId in documentMapping:
intent["documentId"] = documentMapping[docId]
documentIntents.append(DocumentIntent(**intent))
# Debug log (harmonized)
self.services.utils.writeDebugFile(
json.dumps([intent.dict() for intent in documentIntents], indent=2),
"document_intent_analysis_result"
)
# State 1 Validation: Validate and auto-fix document intents
documentIds = {d.id for d in documents}
validatedIntents = []
for intent in documentIntents:
# Validation 1.2: Skip intents for unknown documents
if intent.documentId not in documentIds:
# Try to find similar UUID (fix AI hallucination/typo)
correctedDocId = self._findSimilarDocumentId(intent.documentId, documentIds)
if correctedDocId:
logger.warning(f"Corrected UUID typo in AI response: {intent.documentId} -> {correctedDocId}")
intent.documentId = correctedDocId
else:
logger.warning(f"Skipping intent for unknown document: {intent.documentId}")
continue
validatedIntents.append(intent)
# Validation 1.1: Documents without intents are OK (not needed)
# Intents for non-existing documents are already filtered above
documentIntents = validatedIntents
# Finish the chat log
self.services.chat.progressLogFinish(intentOperationId, True)
return documentIntents
except Exception as e:
self.services.chat.progressLogFinish(intentOperationId, False)
logger.error(f"Error in clarifyDocumentIntents: {str(e)}")
raise
def resolvePreExtractedDocument(self, document: ChatDocument) -> Optional[Dict[str, Any]]:
"""
Checks whether a JSON document already contains extracted ContentParts.
Returns a dict with:
- originalDocument: ChatDocument info of the original document
- contentExtracted: ContentExtracted object with parts
- parts: List of ContentParts
Returns None if no pre-extracted format is detected.
"""
if document.mimeType != "application/json":
logger.debug(f"Document {document.id} is not JSON (mimeType={document.mimeType}), skipping pre-extracted check")
return None
try:
docBytes = self.services.interfaceDbComponent.getFileData(document.fileId)
if not docBytes:
return None
docData = docBytes.decode('utf-8')
jsonData = json.loads(docData)
if not isinstance(jsonData, dict):
return None
# Check for ContentExtracted format
# Only format 1 (ActionDocument format with validationMetadata) is supported
documentData = None
validationMetadata = jsonData.get("validationMetadata", {})
actionType = validationMetadata.get("actionType")
logger.debug(f"JSON document {document.id}: validationMetadata.actionType={actionType}, keys={list(jsonData.keys())}")
if actionType == "context.extractContent":
# Format: {"validationMetadata": {"actionType": "context.extractContent"}, "documentData": {...}}
documentData = jsonData.get("documentData")
logger.debug(f"Found ContentExtracted via validationMetadata for {document.fileName}, documentData keys: {list(documentData.keys()) if documentData else None}")
else:
logger.debug(f"JSON document {document.id} does not have actionType='context.extractContent' (got: {actionType})")
if documentData:
try:
# Stelle sicher, dass "id" vorhanden ist
if "id" not in documentData:
documentData["id"] = document.id
contentExtracted = ContentExtracted(**documentData)
if contentExtracted.parts:
# Extract the original document info from the parts
originalDocId = None
originalFileName = None
originalMimeType = None
for part in contentExtracted.parts:
if part.metadata:
# Try to find the original document info
if not originalDocId and part.metadata.get("documentId"):
originalDocId = part.metadata.get("documentId")
if not originalFileName and part.metadata.get("originalFileName"):
originalFileName = part.metadata.get("originalFileName")
if not originalMimeType and part.metadata.get("documentMimeType"):
originalMimeType = part.metadata.get("documentMimeType")
# If not found, try to derive it from the document name
if not originalFileName:
# Try to derive it from the document name (e.g. "B2025-02c_28_extracted_...json" -> "B2025-02c_28.pdf")
if document.fileName and "_extracted_" in document.fileName:
originalFileName = document.fileName.split("_extracted_")[0] + ".pdf"
return {
"originalDocument": {
"id": originalDocId or document.id,
"fileName": originalFileName or document.fileName,
"mimeType": originalMimeType or "application/pdf",
"fileSize": document.fileSize
},
"contentExtracted": contentExtracted,
"parts": contentExtracted.parts
}
except Exception as parseError:
logger.warning(f"Could not parse ContentExtracted format from {document.fileName}: {str(parseError)}")
logger.debug(f"JSON keys: {list(jsonData.keys())}, has parts: {'parts' in jsonData}")
import traceback
logger.debug(f"Parse error traceback: {traceback.format_exc()}")
return None
else:
logger.debug(f"JSON document {document.id} has no documentData (actionType={actionType})")
return None
except Exception as e:
logger.debug(f"Error resolving pre-extracted document {document.fileName}: {str(e)}")
return None
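# Illustrative example (not part of the module) of the pre-extracted JSON wrapper
# this method recognizes - the IDs and file names are invented:
# {
#   "validationMetadata": {"actionType": "context.extractContent"},
#   "documentData": {
#     "id": "doc_abc",
#     "parts": [
#       {"typeGroup": "text", "data": "...",
#        "metadata": {"documentId": "doc_abc", "originalFileName": "report.pdf",
#                     "documentMimeType": "application/pdf"}}
#     ]
#   }
# }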
def _buildIntentAnalysisPrompt(
self,
userPrompt: str,
documents: List[ChatDocument],
actionParameters: Dict[str, Any]
) -> str:
"""Baue Prompt für Intent-Analyse."""
# Baue Dokument-Liste - zeige ursprüngliche Dokumente für pre-extracted JSONs
docListText = ""
for i, doc in enumerate(documents, 1):
# Check whether it is a pre-extracted JSON
preExtracted = self.resolvePreExtractedDocument(doc)
if preExtracted:
# Show the original document instead of the JSON
originalDoc = preExtracted["originalDocument"]
partsInfo = f" (contains {len(preExtracted['parts'])} pre-extracted parts: {', '.join([p.typeGroup for p in preExtracted['parts'] if p.data and len(str(p.data)) > 0])})"
docListText += f"\n{i}. Document ID: {originalDoc['id']}\n"
docListText += f" File Name: {originalDoc['fileName']}{partsInfo}\n"
docListText += f" MIME Type: {originalDoc['mimeType']}\n"
docListText += f" File Size: {originalDoc.get('fileSize', doc.fileSize)} bytes\n"
else:
# Normal document
docListText += f"\n{i}. Document ID: {doc.id}\n"
docListText += f" File Name: {doc.fileName}\n"
docListText += f" MIME Type: {doc.mimeType}\n"
docListText += f" File Size: {doc.fileSize} bytes\n"
outputFormat = actionParameters.get("outputFormat", "txt")
# FENCE user input to prevent prompt injection
fencedUserPrompt = f"""```user_request
{userPrompt}
```"""
prompt = f"""USER REQUEST:
{fencedUserPrompt}
DOCUMENTS TO ANALYZE:
{docListText}
TASK: For each document, determine its intents (can be multiple):
- "extract": Content extraction needed (text, structure, OCR, etc.)
- "render": Image/binary should be rendered as-is (visual element)
- "reference": Document reference/attachment (no extraction, just reference)
Note: Output format and language are NOT determined here - they will be
determined during structure generation (Phase 3) in the chapter structure JSON
OUTPUT FORMAT: {outputFormat} (global fallback - for reference only)
RETURN JSON:
{{
"intents": [
{{
"documentId": "doc_1",
"intents": ["extract"],
"extractionPrompt": "Extract all text content, preserving structure",
"reasoning": "User needs text content for document generation"
}},
{{
"documentId": "doc_2",
"intents": ["extract", "render"],
"extractionPrompt": "Extract text content from image using vision AI",
"reasoning": "Image contains text that needs extraction, but also should be rendered visually"
}},
{{
"documentId": "doc_3",
"intents": ["reference"],
"extractionPrompt": null,
"reasoning": "Document is only used as reference, no extraction needed"
}}
]
}}
CRITICAL RULES:
1. For images (mimeType starts with "image/"):
- If user wants to "include" or "show" images add "render"
- If user wants to "analyze", "read text", or "extract text" from images add "extract"
- Can have BOTH "extract" and "render" if image needs both text extraction and visual rendering
2. For text documents:
- If user mentions "template" or "structure" "reference" or "extract" based on context
- If user mentions "reference" or "context" "reference"
- Default "extract"
3. Consider output format:
- For formats like PDF, DOCX, PPTX: images usually need "render"
- For formats like CSV, JSON: usually "extract" only
- For HTML: can have both "extract" and "render"
Return ONLY valid JSON following the structure above.
"""
return prompt
def _findSimilarDocumentId(self, incorrectId: str, validIds: set) -> Optional[str]:
"""
Tries to find a similar document ID in case the AI altered the UUID.
Checks for UUID typos (e.g. 4451 -> 4551).
Args:
incorrectId: The incorrect UUID from the AI response
validIds: Set of valid document IDs
Returns:
Corrected UUID if found, otherwise None
"""
if not incorrectId or len(incorrectId) != 36: # UUID Format: 8-4-4-4-12
return None
# Check that it is a UUID (format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
if incorrectId.count('-') != 4:
return None
# Levenshtein-like search: check whether only 1-2 characters differ
for validId in validIds:
if len(validId) != 36:
continue
# Count differing characters
differences = sum(c1 != c2 for c1, c2 in zip(incorrectId, validId))
# If only 1-2 characters differ, it is probably a typo
if differences <= 2:
# Check whether the structure is similar (same hyphen positions)
if incorrectId.count('-') == validId.count('-'):
return validId
return None
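# --- Illustrative sketch (not part of the module) ---
# Shape of the AI response that clarifyDocumentIntents parses; the document IDs
# and prompts below are invented for demonstration.
if __name__ == "__main__":
    exampleResponse = """{"intents": [
        {"documentId": "doc_1", "intents": ["extract"],
         "extractionPrompt": "Extract all text content", "reasoning": "Text needed"},
        {"documentId": "doc_2", "intents": ["reference"],
         "extractionPrompt": null, "reasoning": "Attachment only"}
    ]}"""
    for intent in json.loads(exampleResponse)["intents"]:
        print(intent["documentId"], intent["intents"])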

File diff suppressed because it is too large

File diff suppressed because it is too large

View file

@ -0,0 +1,293 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Generic Looping Use Case System
Provides parametrized looping infrastructure supporting different JSON formats and use cases.
"""
import logging
from dataclasses import dataclass, field
from typing import Dict, Any, List, Optional, Callable
logger = logging.getLogger(__name__)
# Callback functions for use-case-specific logic
def _handleSectionContentFinalResult(result: str, parsedJsonForUseCase: Any, extractedJsonForUseCase: str,
debugPrefix: str, services: Any) -> str:
"""Handle final result for section_content: return raw result to preserve all JSON blocks."""
final_json = result # Return raw response to preserve all JSON blocks
# Write final merged result for section_content (overwrites iteration 1 response with complete merged result)
if services and hasattr(services, 'utils') and hasattr(services.utils, 'writeDebugFile'):
services.utils.writeDebugFile(final_json, f"{debugPrefix}_response")
return final_json
def _handleChapterStructureFinalResult(result: str, parsedJsonForUseCase: Any, extractedJsonForUseCase: str,
debugPrefix: str, services: Any) -> str:
"""Handle final result for chapter_structure: format JSON and write debug file."""
import json
final_json = json.dumps(parsedJsonForUseCase, indent=2, ensure_ascii=False) if parsedJsonForUseCase else (extractedJsonForUseCase or result)
# Write final result for chapter structure
if services and hasattr(services, 'utils') and hasattr(services.utils, 'writeDebugFile'):
services.utils.writeDebugFile(final_json, f"{debugPrefix}_final_result")
return final_json
def _handleCodeStructureFinalResult(result: str, parsedJsonForUseCase: Any, extractedJsonForUseCase: str,
debugPrefix: str, services: Any) -> str:
"""Handle final result for code_structure: format JSON and write debug file."""
import json
final_json = json.dumps(parsedJsonForUseCase, indent=2, ensure_ascii=False) if parsedJsonForUseCase else (extractedJsonForUseCase or result)
# Write final result for code structure
if services and hasattr(services, 'utils') and hasattr(services.utils, 'writeDebugFile'):
services.utils.writeDebugFile(final_json, f"{debugPrefix}_final_result")
return final_json
def _handleCodeContentFinalResult(result: str, parsedJsonForUseCase: Any, extractedJsonForUseCase: str,
debugPrefix: str, services: Any) -> str:
"""Handle final result for code_content: format JSON."""
import json
final_json = json.dumps(parsedJsonForUseCase, indent=2, ensure_ascii=False) if parsedJsonForUseCase else (extractedJsonForUseCase or result)
return final_json
def _normalizeSectionContentJson(parsed: Any, useCaseId: str) -> Any:
"""Normalize JSON structure for section_content use case."""
# For section_content, expect {"elements": [...]} structure
if isinstance(parsed, list):
# Check if list contains strings (invalid format) or element objects
if parsed and isinstance(parsed[0], str):
# Invalid format - list of strings instead of elements
# Try to convert strings to paragraph elements as fallback
logger.debug(f"Received list of strings instead of elements array, converting to paragraph elements")
elements = []
for text in parsed:
if isinstance(text, str) and text.strip():
elements.append({
"type": "paragraph",
"content": {
"text": text.strip()
}
})
return {"elements": elements} if elements else {"elements": []}
else:
# Convert plain list of elements to elements structure
return {"elements": parsed}
elif isinstance(parsed, dict):
# If it already has "elements", return as-is
if "elements" in parsed:
return parsed
# If it has "type" and looks like an element, wrap in elements array
elif parsed.get("type"):
return {"elements": [parsed]}
# Otherwise, assume it's already in correct format
else:
return parsed
# For other use cases, return as-is (they have their own structures)
return parsed
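# Illustrative normalization examples (not part of the module; payloads invented):
#   ["Intro text"]              -> {"elements": [{"type": "paragraph", "content": {"text": "Intro text"}}]}
#   [{"type": "heading", ...}]  -> {"elements": [{"type": "heading", ...}]}
#   {"type": "paragraph", ...}  -> {"elements": [{"type": "paragraph", ...}]}
#   {"elements": [...]}         -> returned unchanged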
def _normalizeDefaultJson(parsed: Any, useCaseId: str) -> Any:
"""Default normalizer: return as-is."""
return parsed
@dataclass
class LoopingUseCase:
"""Configuration for a specific looping use case."""
# Identification
useCaseId: str # "section_content", "chapter_structure", "code_structure", "code_content"
# JSON Format Detection
jsonTemplate: Dict[str, Any] # Expected JSON structure template
detectionKeys: List[str] # Keys to check for format detection (e.g., ["elements"], ["chapters"], ["files"])
detectionPath: str # JSONPath to check (e.g., "documents[0].chapters", "files[0].content")
# Prompt Building
initialPromptBuilder: Optional[Callable] = None # Function to build initial prompt
continuationPromptBuilder: Optional[Callable] = None # Function to build continuation prompt
# Accumulation & Merging
accumulator: Optional[Callable] = None # Function to accumulate fragments
merger: Optional[Callable] = None # Function to merge accumulated data
# Continuation Context
continuationContextBuilder: Optional[Callable] = None # Build continuation context for this format
# Result Building
resultBuilder: Optional[Callable] = None # Build final result from accumulated data
# Use-case-specific handlers (callbacks to avoid if/elif chains in generic code)
finalResultHandler: Optional[Callable] = None # Handle final result formatting and debug file writing
jsonNormalizer: Optional[Callable] = None # Normalize JSON structure for this use case
# Metadata
supportsAccumulation: bool = True # Whether this use case supports accumulation
requiresExtraction: bool = False # Whether this requires extraction (like sections)
class LoopingUseCaseRegistry:
"""Registry of all looping use cases."""
def __init__(self):
self.useCases: Dict[str, LoopingUseCase] = {}
self._registerDefaultUseCases()
def register(self, useCase: LoopingUseCase):
"""Register a new use case."""
self.useCases[useCase.useCaseId] = useCase
logger.debug(f"Registered looping use case: {useCase.useCaseId}")
def get(self, useCaseId: str) -> Optional[LoopingUseCase]:
"""Get use case by ID."""
return self.useCases.get(useCaseId)
def detectUseCase(self, parsedJson: Dict[str, Any]) -> Optional[str]:
"""Detect which use case matches the JSON structure."""
for useCaseId, useCase in self.useCases.items():
if self._matchesFormat(parsedJson, useCase):
return useCaseId
return None
def _matchesFormat(self, jsonData: Dict[str, Any], useCase: LoopingUseCase) -> bool:
"""Check if the JSON matches the use case format."""
# Check top-level keys
for key in useCase.detectionKeys:
if key in jsonData:
return True
# Check nested path using simple dictionary traversal (no jsonpath_ng needed)
if useCase.detectionPath:
try:
# Simple path matching without jsonpath_ng
# Format: "documents[0].chapters" or "files[0].content"
pathParts = useCase.detectionPath.split(".")
current = jsonData
for part in pathParts:
# Handle array indices like "documents[0]"
if "[" in part and "]" in part:
key = part.split("[")[0]
index = int(part.split("[")[1].split("]")[0])
if isinstance(current, dict) and key in current:
if isinstance(current[key], list) and 0 <= index < len(current[key]):
current = current[key][index]
else:
return False
else:
return False
else:
# Regular key access
if isinstance(current, dict) and part in current:
current = current[part]
else:
return False
# If we successfully traversed the path, it matches
return True
except Exception as e:
logger.debug(f"Path matching failed for {useCase.useCaseId}: {e}")
return False
def _registerDefaultUseCases(self):
"""Register default use cases."""
# Use Case 1: Section Content Generation
# Returns JSON with "elements" array directly
self.register(LoopingUseCase(
useCaseId="section_content",
jsonTemplate={"elements": []},
detectionKeys=["elements"],
detectionPath="",
initialPromptBuilder=None, # Will use default prompt builder
continuationPromptBuilder=None, # Will use default continuation builder
accumulator=None, # Direct return, no accumulation
merger=None,
continuationContextBuilder=None, # Will use default continuation context
resultBuilder=None, # Return JSON directly
finalResultHandler=_handleSectionContentFinalResult,
jsonNormalizer=_normalizeSectionContentJson,
supportsAccumulation=False,
requiresExtraction=False
))
# Use Case 2: Chapter Structure Generation
# Returns JSON with "documents[0].chapters" structure
self.register(LoopingUseCase(
useCaseId="chapter_structure",
jsonTemplate={"documents": [{"chapters": []}]},
detectionKeys=["chapters"],
detectionPath="documents[0].chapters",
initialPromptBuilder=None,
continuationPromptBuilder=None,
accumulator=None, # Direct return, no accumulation
merger=None,
continuationContextBuilder=None,
resultBuilder=None, # Return JSON directly
finalResultHandler=_handleChapterStructureFinalResult,
jsonNormalizer=_normalizeDefaultJson,
supportsAccumulation=False,
requiresExtraction=False
))
# Use Case 3: Code Structure Generation
self.register(LoopingUseCase(
useCaseId="code_structure",
jsonTemplate={
"metadata": {
"language": "",
"projectType": "single_file|multi_file",
"projectName": ""
},
"files": [
{
"id": "",
"filename": "",
"fileType": "",
"dependencies": [],
"imports": [],
"functions": [],
"classes": []
}
]
},
detectionKeys=["files"],
detectionPath="files",
initialPromptBuilder=None,
continuationPromptBuilder=None,
accumulator=None, # Direct return
merger=None,
continuationContextBuilder=None,
resultBuilder=None,
finalResultHandler=_handleCodeStructureFinalResult,
jsonNormalizer=_normalizeDefaultJson,
supportsAccumulation=False,
requiresExtraction=False
))
# Use Case 4: Code Content Generation (NEW)
self.register(LoopingUseCase(
useCaseId="code_content",
jsonTemplate={"files": [{"content": "", "functions": []}]},
detectionKeys=["content", "functions"],
detectionPath="files[0].content",
initialPromptBuilder=None,
continuationPromptBuilder=None,
accumulator=None, # Will use default accumulator
merger=None, # Will use default merger
continuationContextBuilder=None,
resultBuilder=None, # Will use default result builder
finalResultHandler=_handleCodeContentFinalResult,
jsonNormalizer=_normalizeDefaultJson,
supportsAccumulation=True,
requiresExtraction=False
))
logger.info(f"Registered {len(self.useCases)} default looping use cases")

View file

@ -0,0 +1,275 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Response Parsing Module
Handles parsing of AI responses, including:
- Section extraction from responses
- JSON completeness detection
- Loop detection
- Document metadata extraction
- Final result building
"""
import json
import logging
from typing import Dict, Any, List, Optional, Tuple
from modules.shared.jsonUtils import extractJsonString, repairBrokenJson, extractSectionsFromDocument
from .subJsonResponseHandling import JsonResponseHandler
from modules.datamodels.datamodelAi import JsonAccumulationState
logger = logging.getLogger(__name__)
class ResponseParser:
"""Handles parsing of AI responses and completion detection."""
def __init__(self, services):
"""Initialize ResponseParser with service center access."""
self.services = services
def extractSectionsFromResponse(
self,
result: str,
iteration: int,
debugPrefix: str,
allSections: List[Dict[str, Any]] = None,
accumulationState: Optional[JsonAccumulationState] = None
) -> Tuple[List[Dict[str, Any]], bool, Optional[Dict[str, Any]], Optional[JsonAccumulationState]]:
"""
Extract sections from AI response, handling both valid and broken JSON.
NEW BEHAVIOR:
- First iteration: Check if complete, if not start accumulation
- Subsequent iterations: Accumulate strings, parse when complete
Returns:
Tuple of:
- sections: Extracted sections
- wasJsonComplete: True if JSON is complete
- parsedResult: Parsed JSON object
- updatedAccumulationState: Updated accumulation state (None if not in accumulation mode)
"""
if allSections is None:
allSections = []
if iteration == 1:
# First iteration - check if complete
parsed = None
try:
extracted = extractJsonString(result)
parsed = json.loads(extracted)
# Check completeness
if JsonResponseHandler.isJsonComplete(parsed):
# Complete JSON - no accumulation needed
sections = extractSectionsFromDocument(parsed)
logger.info(f"Iteration 1: Complete JSON detected, no accumulation needed")
return sections, True, parsed, None # No accumulation
except Exception:
pass
# Incomplete - try to extract partial sections from broken JSON
logger.info(f"Iteration 1: Incomplete JSON detected, attempting to extract partial sections")
partialSections = []
if parsed:
# Try to extract sections from parsed (even if incomplete)
partialSections = extractSectionsFromDocument(parsed)
else:
# Try to repair broken JSON and extract sections
try:
repaired = repairBrokenJson(result)
if repaired:
partialSections = extractSectionsFromDocument(repaired)
parsed = repaired # Use repaired version for accumulation state
except Exception:
pass # If repair fails, continue with empty sections
# Define KPIs (async call - need to handle this)
# For now, create accumulation state without KPIs, will be updated after async call
accumulationState = JsonAccumulationState(
accumulatedJsonString=result,
isAccumulationMode=True,
lastParsedResult=parsed,
allSections=partialSections,
kpis=[]
)
# Note: KPI definition will be done in the caller (async context)
return partialSections, False, parsed, accumulationState
else:
# Subsequent iterations - accumulate
if accumulationState and accumulationState.isAccumulationMode:
accumulated, sections, isComplete, parsedResult = \
JsonResponseHandler.accumulateAndParseJsonFragments(
accumulationState.accumulatedJsonString,
result,
allSections,
iteration
)
# Update accumulation state
accumulationState.accumulatedJsonString = accumulated
accumulationState.lastParsedResult = parsedResult
accumulationState.allSections = allSections + sections if sections else allSections
accumulationState.isAccumulationMode = not isComplete
# Log accumulated JSON for debugging
if parsedResult:
accumulated_json_str = json.dumps(parsedResult, indent=2, ensure_ascii=False)
self.services.utils.writeDebugFile(accumulated_json_str, f"{debugPrefix}_accumulated_json_iteration_{iteration}.json")
return sections, isComplete, parsedResult, accumulationState
else:
# No accumulation mode - process normally (shouldn't happen)
logger.warning(f"Iteration {iteration}: No accumulation state but iteration > 1")
return [], False, None, None
def shouldContinueGeneration(
self,
allSections: List[Dict[str, Any]],
iteration: int,
wasJsonComplete: bool,
rawResponse: str = None
) -> bool:
"""
Determine if AI generation loop should continue.
CRITICAL: This is ONLY about AI Loop Completion, NOT Action DoD!
Action DoD is checked AFTER the AI Loop completes in _refineDecide.
Simple logic:
- If JSON parsing failed or incomplete continue (needs more content)
- If JSON parses successfully and is complete stop (all content delivered)
- Loop detection prevents infinite loops
CRITICAL: JSON completeness is determined by parsing, NOT by last character check!
Returns True if we should continue, False if AI Loop is done.
"""
if len(allSections) == 0:
return True # No sections yet, continue
# CRITERION 1: If JSON was incomplete/broken (parsing failed or incomplete) - continue to repair/complete
if not wasJsonComplete:
logger.info(f"Iteration {iteration}: JSON incomplete/broken - continuing to complete")
return True
# CRITERION 2: JSON is complete (parsed successfully) - check for loop detection
if self._isStuckInLoop(allSections, iteration):
logger.warning(f"Iteration {iteration}: Detected potential infinite loop - stopping AI loop")
return False
# JSON is complete and not stuck in loop - done
logger.info(f"Iteration {iteration}: JSON complete - AI loop done")
return False
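# Decision summary (illustrative restatement of the logic above):
#   no sections yet              -> continue
#   JSON incomplete/broken       -> continue
#   JSON complete, loop detected -> stop
#   JSON complete, no loop       -> stop (AI loop done)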
def _isStuckInLoop(
self,
allSections: List[Dict[str, Any]],
iteration: int
) -> bool:
"""
Detect if we're stuck in a loop (same content being repeated).
Generic approach: Check if recent iterations are adding minimal or duplicate content.
"""
if iteration < 3:
return False # Need at least 3 iterations to detect a loop
if len(allSections) == 0:
return False
# Check if last section is very small (might be stuck)
lastSection = allSections[-1]
elements = lastSection.get("elements", [])
if isinstance(elements, list) and elements:
lastElem = elements[-1] if elements else {}
else:
lastElem = elements if isinstance(elements, dict) else {}
# Check content size of last section
lastSectionSize = 0
if isinstance(lastElem, dict):
for key, value in lastElem.items():
if isinstance(value, str):
lastSectionSize += len(value)
elif isinstance(value, list):
lastSectionSize += len(str(value))
# If last section is very small and we've done many iterations, might be stuck
if lastSectionSize < 100 and iteration > 10:
logger.warning(f"Potential loop detected: iteration {iteration}, last section size {lastSectionSize}")
return True
return False
def extractDocumentMetadata(
self,
parsedResult: Dict[str, Any]
) -> Optional[Dict[str, Any]]:
"""
Extract document metadata (title, filename) from parsed AI response.
Returns dict with 'title' and 'filename' keys if found, None otherwise.
"""
if not isinstance(parsedResult, dict):
return None
# Try to get from documents array (preferred structure)
if "documents" in parsedResult and isinstance(parsedResult["documents"], list) and len(parsedResult["documents"]) > 0:
firstDoc = parsedResult["documents"][0]
if isinstance(firstDoc, dict):
title = firstDoc.get("title")
filename = firstDoc.get("filename")
if title or filename:
return {
"title": title,
"filename": filename
}
return None
def buildFinalResultFromSections(
self,
allSections: List[Dict[str, Any]],
documentMetadata: Optional[Dict[str, Any]] = None
) -> str:
"""
Build final JSON result from accumulated sections.
Uses AI-provided metadata (title, filename) if available.
"""
if not allSections:
return ""
# Extract metadata from AI response if available
title = "Generated Document"
filename = "document.json"
if documentMetadata:
if documentMetadata.get("title"):
title = documentMetadata["title"]
if documentMetadata.get("filename"):
filename = documentMetadata["filename"]
# Build documents structure
# Assuming single document for now
documents = [{
"id": "doc_1",
"title": title,
"filename": filename,
"sections": allSections
}]
result = {
"metadata": {
"split_strategy": "single_document",
"source_documents": [],
"extraction_method": "ai_generation"
},
"documents": documents
}
return json.dumps(result, indent=2)
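# --- Illustrative sketch (not part of the module) ---
# What buildFinalResultFromSections returns for two accumulated sections when the
# AI supplied document metadata; section payloads and names are invented.
if __name__ == "__main__":
    parser = ResponseParser(services=None)
    out = parser.buildFinalResultFromSections(
        [{"id": "s1", "elements": []}, {"id": "s2", "elements": []}],
        documentMetadata={"title": "Quarterly Report", "filename": "report.json"},
    )
    print(out)  # {"metadata": {...}, "documents": [{"id": "doc_1", "title": "Quarterly Report", ...}]}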

File diff suppressed because it is too large

View file

@ -0,0 +1,508 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Structure Generation Module
Handles document structure generation, including:
- Generating document structure with sections
- Building structure prompts
"""
import json
import logging
from typing import Dict, Any, List, Optional
from modules.datamodels.datamodelExtraction import ContentPart
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum, PriorityEnum, ProcessingModeEnum
from modules.workflows.processing.shared.stateTools import checkWorkflowStopped
logger = logging.getLogger(__name__)
class StructureGenerator:
"""Handles document structure generation."""
def __init__(self, services, aiService):
"""Initialize StructureGenerator with service center and AI service access."""
self.services = services
self.aiService = aiService
def _getUserLanguage(self) -> str:
"""Get user language for document generation"""
try:
if self.services:
# Prefer detected language if available (from user intention analysis)
if hasattr(self.services, 'currentUserLanguage') and self.services.currentUserLanguage:
return self.services.currentUserLanguage
# Fallback to user's preferred language
elif hasattr(self.services, 'user') and self.services.user and hasattr(self.services.user, 'language'):
return self.services.user.language
except Exception:
pass
return 'en' # Default fallback
async def generateStructure(
self,
userPrompt: str,
contentParts: List[ContentPart],
outputFormat: Optional[str] = None,
parentOperationId: str = None
) -> Dict[str, Any]:
"""
Phase 5C: Generates the chapter structure (table of contents).
Defines for each chapter:
- Level, Title
- contentParts (unified object with instruction and/or caption per part)
- generationHint
Generate document structure with per-document format determination.
Multiple documents can be produced with different formats (e.g., one PDF, one HTML).
AI determines formats per-document from user prompt. The outputFormat parameter is
only a validation fallback - used if AI doesn't return format per document.
Args:
userPrompt: User request
contentParts: All prepared ContentParts with metadata
outputFormat: Optional global format fallback. If omitted, formats are determined
from user prompt by AI. Used as validation fallback if AI doesn't
return format per document. Defaults to "txt" if not provided.
parentOperationId: Parent operation ID for the chat log hierarchy
Returns:
Structure dict with documents and chapters (not sections!)
"""
# If outputFormat not provided, use "txt" as fallback for validation
# AI will determine formats per document from user prompt
if not outputFormat:
outputFormat = "txt"
logger.debug("outputFormat not provided - using 'txt' as validation fallback, formats determined from prompt")
# Create an operation ID for the structure generation
structureOperationId = f"{parentOperationId}_structure_generation"
# Start the chat log with a parent reference
formatDisplay = outputFormat if outputFormat else "auto-determined"
self.services.chat.progressLogStart(
structureOperationId,
"Chapter Structure Generation",
"Structure",
f"Generating chapter structure (format: {formatDisplay})",
parentOperationId=parentOperationId
)
try:
# AI call for chapter structure generation with looping support
# Use callAiWithLooping instead of callAiPlanning to support continuation if the response is cut off
options = AiCallOptions(
operationType=OperationTypeEnum.DATA_GENERATE,
priority=PriorityEnum.QUALITY,
processingMode=ProcessingModeEnum.DETAILED,
compressPrompt=False,
compressContext=False,
resultFormat="json"
)
# Build the chapter structure prompt with content index (returns prompt and template structure)
structurePrompt, templateStructure = self._buildChapterStructurePrompt(
userPrompt=userPrompt,
contentParts=contentParts,
outputFormat=outputFormat
)
# Create prompt builder for continuation support
async def buildChapterStructurePromptWithContinuation(
continuationContext: Any,
templateStructure: str,
basePrompt: str
) -> str:
"""Build chapter structure prompt with continuation context. Uses unified signature.
Note: All initial context (userPrompt, contentParts, outputFormat, etc.) is already
contained in basePrompt. This function only adds continuation-specific instructions.
"""
# Extract continuation context fields (only what's needed for continuation)
incompletePart = continuationContext.incomplete_part
lastRawJson = continuationContext.last_raw_json
# Generate both overlap context and hierarchy context using jsonContinuation
overlapContext = ""
unifiedContext = ""
if lastRawJson:
# Get contexts directly from jsonContinuation
from modules.shared.jsonContinuation import getContexts
contexts = getContexts(lastRawJson)
overlapContext = contexts.overlapContext
unifiedContext = contexts.hierarchyContextForPrompt
elif incompletePart:
unifiedContext = incompletePart
else:
unifiedContext = "Unable to extract context - response was completely broken"
# Build unified continuation prompt format
continuationPrompt = f"""{basePrompt}
--- CONTINUATION REQUEST ---
The previous JSON response was incomplete. Continue from where it stopped.
Context showing structure hierarchy with cut point:
```
{unifiedContext}
```
Overlap Requirement:
To ensure proper merging, your response MUST start EXACTLY with the overlap context shown below, then continue with new content.
Overlap context (start your response with this exact text):
```json
{overlapContext if overlapContext else "No overlap context available"}
```
TASK:
1. Start your response EXACTLY with the overlap context shown above (character by character)
2. Continue seamlessly from where the overlap context ends
3. Complete the remaining content following the JSON structure template above
4. Return ONLY valid JSON following the structure template - no overlap/continuation wrapper objects
CRITICAL:
- Your response MUST begin with the exact overlap context text (this enables automatic merging)
- Continue seamlessly after the overlap context with new content
- Your response must be valid JSON matching the structure template above"""
return continuationPrompt
# Call AI with looping support
# NOTE: Do NOT pass contentParts here - we only need metadata for structure generation
# The contentParts metadata is already included in the prompt (contentPartsIndex)
# Actual content extraction happens later during section generation
checkWorkflowStopped(self.services)
aiResponseJson = await self.aiService.callAiWithLooping(
prompt=structurePrompt,
options=options,
debugPrefix="chapter_structure_generation",
promptBuilder=buildChapterStructurePromptWithContinuation,
promptArgs={
"userPrompt": userPrompt,
"outputFormat": outputFormat,
"templateStructure": templateStructure,
"basePrompt": structurePrompt
},
useCaseId="chapter_structure", # REQUIRED: Explicit use case ID
operationId=structureOperationId,
userPrompt=userPrompt,
contentParts=None # Do not pass ContentParts - only metadata needed, not content extraction
)
# Parse the complete JSON response (looping system already handles completion)
extractedJson = self.services.utils.jsonExtractString(aiResponseJson)
parsedJson, parseError, cleanedJson = self.services.utils.jsonTryParse(extractedJson)
if parseError is not None:
# Even with looping, try repair as fallback
logger.warning(f"JSON parsing failed after looping: {str(parseError)}. Attempting repair...")
from modules.shared import jsonUtils
repairedJson = jsonUtils.repairBrokenJson(extractedJson)
if repairedJson:
parsedJson, parseError, _ = self.services.utils.jsonTryParse(json.dumps(repairedJson))
if parseError is None:
logger.info("Successfully repaired and parsed JSON structure after looping")
structure = parsedJson
else:
logger.error(f"Failed to parse repaired JSON: {str(parseError)}")
raise ValueError(f"Failed to parse JSON structure after repair: {str(parseError)}")
else:
logger.error(f"Failed to repair JSON. Parse error: {str(parseError)}")
logger.error(f"Cleaned JSON preview (first 500 chars): {cleanedJson[:500]}")
raise ValueError(f"Failed to parse JSON structure: {str(parseError)}")
else:
structure = parsedJson
# State 3 Validation: Validate and auto-fix structure
# Validation 3.1: Structure missing 'documents' field
if "documents" not in structure:
raise ValueError("Structure missing 'documents' field - cannot auto-fix")
documents = structure["documents"]
# Validation 3.2: Structure has no documents
if not isinstance(documents, list) or len(documents) == 0:
raise ValueError("Structure has no documents - cannot generate without documents")
# Import renderer registry for format validation (existing infrastructure)
from modules.serviceCenter.services.serviceGeneration.renderers.registry import getRenderer
# Validate and fix each document
for doc in documents:
# Validation 3.3 & 3.4: Document outputFormat
# outputFormat parameter is optional - if omitted, formats determined from prompt by AI
# Use as fallback only if AI doesn't return format per document
# Multiple documents can have different formats (e.g., one PDF, one HTML)
globalFormatFallback = outputFormat or "txt" # Fallback for validation
if "outputFormat" not in doc or not doc["outputFormat"]:
# AI didn't return format or returned empty - use global fallback
doc["outputFormat"] = globalFormatFallback
logger.warning(f"Document {doc.get('id')} missing outputFormat - using fallback: {doc['outputFormat']}")
else:
# AI returned format - validate using existing renderer registry
formatName = str(doc["outputFormat"]).lower().strip()
renderer = getRenderer(formatName) # Uses existing infrastructure
if not renderer:
# Format doesn't match any renderer - use txt (simple approach)
logger.warning(f"Document {doc.get('id')} has format without renderer: {formatName}, using 'txt'")
doc["outputFormat"] = "txt"
else:
# Valid format with renderer - normalize and keep AI result
doc["outputFormat"] = formatName
logger.debug(f"Document {doc.get('id')} using AI-determined format: {formatName}")
# Validation 3.5 & 3.6: Document language
# Use validated currentUserLanguage (always valid, validated during user intention analysis)
# Access via _getUserLanguage() which uses self.services.currentUserLanguage
userPromptLanguage = self._getUserLanguage() # Uses validated currentUserLanguage infrastructure
if "language" not in doc or not isinstance(doc["language"], str) or len(doc["language"]) != 2:
# AI didn't return language or invalid format - use validated currentUserLanguage
doc["language"] = userPromptLanguage
if "language" not in doc:
logger.warning(f"Document {doc.get('id')} missing language - using currentUserLanguage: {userPromptLanguage}")
else:
logger.warning(f"Document {doc.get('id')} has invalid language format from AI: {doc['language']}, using currentUserLanguage")
else:
# AI returned valid language format - normalize
doc["language"] = doc["language"].lower().strip()[:2]
logger.debug(f"Document {doc.get('id')} using AI-determined language: {doc['language']}")
# Validation 3.7: Document missing 'chapters' field
if "chapters" not in doc:
raise ValueError(f"Document {doc.get('id')} missing 'chapters' field - cannot auto-fix")
# Validation 3.8: Chapter missing 'contentParts' field
for chapter in doc["chapters"]:
if "contentParts" not in chapter:
raise ValueError(f"Chapter {chapter.get('id')} missing 'contentParts' field - cannot auto-fix")
# Finish the chat log
self.services.chat.progressLogFinish(structureOperationId, True)
return structure
except Exception as e:
self.services.chat.progressLogFinish(structureOperationId, False)
logger.error(f"Error in generateStructure: {str(e)}")
raise
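# Illustrative shape of the returned structure (values invented; mirrors the
# template used in _buildChapterStructurePrompt below):
# {"metadata": {"title": "...", "language": "en"},
#  "documents": [{"id": "doc_1", "title": "...", "filename": "report.pdf",
#                 "outputFormat": "pdf", "language": "en",
#                 "chapters": [{"id": "chapter_1", "level": 1, "title": "...",
#                               "contentParts": {"ext_123": {"instruction": "..."}},
#                               "generationHint": "...", "sections": []}]}]}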
def _buildChapterStructurePrompt(
self,
userPrompt: str,
contentParts: List[ContentPart],
outputFormat: str
) -> tuple[str, str]:
"""Baue Prompt für Chapter-Struktur-Generierung."""
# Baue ContentParts-Index - filtere leere Parts heraus
contentPartsIndex = ""
validParts = []
filteredParts = []
for part in contentParts:
contentFormat = part.metadata.get("contentFormat", "unknown")
# IMPORTANT: reference parts intentionally have empty data - always include them
if contentFormat == "reference":
validParts.append(part)
logger.debug(f"Including reference ContentPart {part.id} (intentionally empty data)")
continue
# Skip empty parts (no data, or containers without content)
# BUT: reference parts were already handled above
if not part.data or (isinstance(part.data, str) and len(part.data.strip()) == 0):
# Skip container parts without data
if part.typeGroup == "container" and not part.data:
filteredParts.append((part.id, "container without data"))
continue
# Skip other empty parts (reference parts were already handled above)
if not part.data:
filteredParts.append((part.id, f"no data (format: {contentFormat})"))
continue
validParts.append(part)
logger.debug(f"Including ContentPart {part.id}: format={contentFormat}, type={part.typeGroup}, dataLength={len(str(part.data)) if part.data else 0}")
if filteredParts:
logger.debug(f"Filtered out {len(filteredParts)} empty ContentParts: {filteredParts}")
logger.info(f"Building structure prompt with {len(validParts)} valid ContentParts (from {len(contentParts)} total)")
# Build the index only for valid parts
for i, part in enumerate(validParts, 1):
contentFormat = part.metadata.get("contentFormat", "unknown")
originalFileName = part.metadata.get('originalFileName', 'N/A')
contentPartsIndex += f"\n{i}. ContentPart ID: {part.id}\n"
contentPartsIndex += f" Format: {contentFormat}\n"
contentPartsIndex += f" Type: {part.typeGroup}\n"
contentPartsIndex += f" MIME Type: {part.mimeType or 'N/A'}\n"
contentPartsIndex += f" Source: {part.metadata.get('documentId', 'unknown')}\n"
contentPartsIndex += f" Original file name: {originalFileName}\n"
contentPartsIndex += f" Usage hint: {part.metadata.get('usageHint', 'N/A')}\n"
if not contentPartsIndex:
contentPartsIndex = "\n(No content parts available)"
# Get language from services (user intention analysis)
language = self._getUserLanguage()
logger.debug(f"Using language from services (user intention analysis) for structure generation: {language}")
# Create template structure explicitly (not extracted from prompt)
# This ensures exact identity between initial and continuation prompts
templateStructure = f"""{{
"metadata": {{
"title": "Document Title",
"language": "{language}"
}},
"documents": [{{
"id": "doc_1",
"title": "Document Title",
"filename": "document.{outputFormat}",
"outputFormat": "{outputFormat}",
"language": "{language}",
"chapters": [
{{
"id": "chapter_1",
"level": 1,
"title": "Chapter Title",
"contentParts": {{
"extracted_part_id": {{
"instruction": "Use extracted content with ALL relevant details from user request"
}}
}},
"generationHint": "Detailed description including ALL relevant details from user request for this chapter",
"sections": []
}}
]
}}]
}}"""
prompt = f"""# TASK: Plan Document Structure (Documents + Chapters)
This is a STRUCTURE PLANNING task. You define which documents to create and which chapters each document will have.
Chapter CONTENT will be generated in a later step - here you only plan the STRUCTURE and assign content references.
Return EXACTLY ONE complete JSON object. Do not generate multiple JSON objects, alternatives, or variations. Do not use separators like "---" between JSON objects.
## USER REQUEST (for context)
```
{userPrompt}
```
## AVAILABLE CONTENT PARTS
{contentPartsIndex}
## CONTENT ASSIGNMENT RULE
CRITICAL: Every chapter MUST have contentParts assigned if it relates to documents/images/data from the user request.
If the user request mentions documents/images/data, then EVERY chapter that generates content related to those references MUST assign the relevant ContentParts explicitly.
Assignment logic:
- If chapter DISPLAYS a document/image → assign "object" format ContentPart with "caption"
- If chapter generates text content ABOUT a document/image/data → assign ContentPart with "instruction":
- Prefer "extracted" format if available (contains analyzed/extracted content)
- If only "object" format is available, use "object" format with "instruction" (to write ABOUT the image/document)
- If chapter's generationHint or purpose relates to a document/image/data mentioned in user request → it MUST have ContentParts assigned
- Multiple chapters might assign the same ContentPart (e.g., one chapter displays image, another writes about it)
- Use ContentPart IDs exactly as listed in AVAILABLE CONTENT PARTS above
- Empty contentParts are only allowed if chapter generates content WITHOUT referencing any documents/images/data from the user request
CRITICAL RULE: If the user request mentions BOTH:
a) Documents/images/data (listed in AVAILABLE CONTENT PARTS above), AND
b) Generic content types (article text, main content, body text, etc.)
Then chapters that generate those generic content types MUST assign the relevant ContentParts, because the content should relate to or be based on the provided documents/images/data.
## CONTENT EFFICIENCY PRINCIPLES
- Generate COMPACT content: Focus on essential information only
- AVOID verbose, lengthy, or repetitive text - be concise and direct
- Prioritize FACTS over filler text - no introductions like "In this chapter..."
- Minimize system resources: shorter content = faster processing
- Quality over quantity: precise, meaningful content rather than padding
## CHAPTER STRUCTURE REQUIREMENTS
- Generate chapters based on USER REQUEST - analyze what structure the user wants
- Create ONLY the minimum chapters needed to cover the user's request - avoid over-structuring
- IMPORTANT: Each chapter MUST have ALL these fields:
- id: Unique identifier (e.g., "chapter_1")
- level: Heading level (1, 2, 3, etc.)
- title: Chapter title
- contentParts: Object mapping ContentPart IDs to usage instructions (MUST assign if chapter relates to documents/data from user request)
- generationHint: Description of what content to generate (including formatting/styling requirements)
- sections: Empty array [] (REQUIRED - sections are generated in next phase)
- contentParts: {{"partId": {{"instruction": "..."}} or {{"caption": "..."}} or both}} - Assign ContentParts as required by CONTENT ASSIGNMENT RULE above
- The "instruction" field for each ContentPart MUST contain ALL relevant details from the USER REQUEST that apply to content extraction for this specific chapter. Include all formatting rules, data requirements, constraints, and specifications mentioned in the user request that are relevant for processing this ContentPart in this chapter.
- generationHint: Keep CONCISE but include relevant details from the USER REQUEST. Focus on WHAT to generate, not HOW to phrase it verbosely.
- The number of chapters depends on the user request - create only what is requested. Do NOT create chapters for topics without available data.
CRITICAL: Only create chapters for CONTENT sections, not for formatting/styling requirements. Formatting/styling requirements belong in each chapter's generationHint instead.
## DOCUMENT STRUCTURE
For each document, determine:
- outputFormat: From USER REQUEST (explicit mention or infer from purpose/content type). Default: "{outputFormat}". Multiple documents can have different formats.
- language: From USER REQUEST (map to ISO 639-1: de, en, fr, it...). Default: "{language}". Multiple documents can have different languages.
- chapters: Structure appropriately for the format (e.g., pptx=slides, docx=sections, xlsx=worksheets). Match format capabilities and constraints.
Required JSON fields:
- metadata: {{"title": "...", "language": "..."}}
- documents: Array with id, title, filename, outputFormat, language, chapters[]
- chapters: Array with id, level, title, contentParts, generationHint, sections[]
EXAMPLE STRUCTURE (for reference only - adapt to user request):
{{
"metadata": {{
"title": "Document Title",
"language": "{language}"
}},
"documents": [{{
"id": "doc_1",
"title": "Document Title",
"filename": "document.{outputFormat}",
"outputFormat": "{outputFormat}",
"language": "{language}",
"chapters": [
{{
"id": "chapter_1",
"level": 1,
"title": "Chapter Title",
"contentParts": {{
"extracted_part_id": {{
"instruction": "Use extracted content with ALL relevant details from user request"
}}
}},
"generationHint": "Detailed description including ALL relevant details from user request for this chapter",
"sections": []
}}
]
}}]
}}
CRITICAL INSTRUCTIONS:
- Generate chapters based on USER REQUEST, NOT based on the example above
- The example shows the JSON structure format, NOT the required chapters
- Create only the chapters that match the user's request
- Adapt chapter titles and structure to match the user's specific request
- Determine outputFormat and language for each document by analyzing the USER REQUEST above
- The example uses the default values "{outputFormat}" and "{language}" as placeholders - YOU MUST REPLACE THESE with actual values determined from the USER REQUEST
MANDATORY CONTENT ASSIGNMENT CHECK:
For each chapter, verify:
1. Does the user request mention documents/images/data? (e.g., "photo", "image", "document", "data", "based on", "about")
2. Does this chapter's generationHint, title, or purpose relate to those documents/images/data mentioned in step 1?
- Examples: "article about the photo", "text describing the image", "analysis of the document", "content based on the data"
- Even if chapter doesn't explicitly say "about the image", if user request mentions both the image AND this chapter's content type → relate them
3. If YES to both → chapter MUST have contentParts assigned (cannot be empty {{}})
4. If ContentPart is "object" format and chapter needs to write ABOUT it → assign with "instruction" field, not just "caption"
OUTPUT FORMAT: Start with {{ and end with }}. Do NOT use markdown code fences (```json). Do NOT add explanatory text before or after the JSON. Return ONLY the JSON object itself.
"""
return prompt, templateStructure
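# Illustrative consumer sketch (an assumption for illustration; the real caller and parser live elsewhere):
#   prompt, template = buildStructurePrompt(...)              # hypothetical caller name
#   structure = json.loads(modelReply)                        # reply must be bare JSON, no code fences
#   assert "metadata" in structure and "documents" in structure
#   for doc in structure["documents"]:
#       assert doc["chapters"], "every document needs at least one chapter"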

View file

@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Billing service."""
from .mainServiceBilling import BillingService, getService, InsufficientBalanceException, ProviderNotAllowedException, BillingContextError
__all__ = ["BillingService", "getService", "InsufficientBalanceException", "ProviderNotAllowedException", "BillingContextError"]

View file

@ -0,0 +1,436 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Billing Service - Central service for billing operations.
Handles:
- Balance checks before AI operations
- Cost recording after AI operations
- Provider permission checks via RBAC
- Price calculation with markup
"""
import logging
from typing import Dict, Any, List, Optional
from datetime import datetime
from modules.datamodels.datamodelUam import User
from modules.datamodels.datamodelBilling import (
BillingModelEnum,
BillingCheckResult,
TransactionTypeEnum,
ReferenceTypeEnum,
BillingTransaction,
BillingBalanceResponse,
)
from modules.interfaces.interfaceDbBilling import getInterface as getBillingInterface
logger = logging.getLogger(__name__)
# Markup percentage for internal pricing (+50% for infrastructure and platform services + 50% for currency risk ==> factor 2.0)
BILLING_MARKUP_PERCENT = 100
# Singleton cache
_billingServices: Dict[str, "BillingService"] = {}
def getService(currentUser: User, mandateId: str, featureInstanceId: str = None, featureCode: str = None) -> "BillingService":
"""
Factory function to get or create a BillingService instance.
Args:
currentUser: Current user object
mandateId: Mandate ID for context
featureInstanceId: Optional feature instance ID
featureCode: Optional feature code (e.g., 'chatplayground', 'automation')
Returns:
BillingService instance
"""
cacheKey = f"{currentUser.id}_{mandateId}_{featureInstanceId}"
if cacheKey not in _billingServices:
_billingServices[cacheKey] = BillingService(currentUser, mandateId, featureInstanceId, featureCode)
else:
_billingServices[cacheKey].setContext(currentUser, mandateId, featureInstanceId, featureCode)
return _billingServices[cacheKey]
def _get_feature_code_from_context(context) -> Optional[str]:
"""Extract featureCode from ServiceCenterContext."""
if context.workflow and hasattr(context.workflow, "feature") and context.workflow.feature:
return getattr(context.workflow.feature, "code", None)
return getattr(context.workflow, "featureCode", None) if context.workflow else None
class BillingService:
"""
Central billing service for AI operations.
Responsibilities:
- Check balance before operations
- Record usage costs
- Apply pricing markup
- Check provider permissions via RBAC
Supports both service center (context, get_service) and legacy (user, mandateId, ...) initialization.
"""
def __init__(self, context_or_user, mandateId=None, featureInstanceId=None, featureCode=None, get_service=None):
"""
Initialize the billing service.
Service center: (context, get_service) - resolver passes exactly these two args
Legacy: (currentUser, mandateId, featureInstanceId, featureCode) from getService() factory
"""
# Detect service center: second arg is callable (get_service)
if mandateId is not None and callable(mandateId):
ctx = context_or_user
get_service = mandateId
self.currentUser = ctx.user
self.mandateId = ctx.mandate_id or ""
self.featureInstanceId = ctx.feature_instance_id
self.featureCode = _get_feature_code_from_context(ctx)
elif get_service is not None and hasattr(context_or_user, "user"):
ctx = context_or_user
self.currentUser = ctx.user
self.mandateId = ctx.mandate_id or ""
self.featureInstanceId = ctx.feature_instance_id
self.featureCode = _get_feature_code_from_context(ctx)
else:
self.currentUser = context_or_user
self.mandateId = mandateId or ""
self.featureInstanceId = featureInstanceId
self.featureCode = featureCode
self._billingInterface = getBillingInterface(self.currentUser, self.mandateId)
self._settingsCache = None
def setContext(
self,
currentUser: User,
mandateId: str,
featureInstanceId: str = None,
featureCode: str = None
):
"""Update service context."""
self.currentUser = currentUser
self.mandateId = mandateId
self.featureInstanceId = featureInstanceId
self.featureCode = featureCode
self._billingInterface = getBillingInterface(currentUser, mandateId)
self._settingsCache = None
def _getSettings(self) -> Optional[Dict[str, Any]]:
"""Get billing settings with caching."""
if self._settingsCache is None:
self._settingsCache = self._billingInterface.getSettings(self.mandateId)
return self._settingsCache
# =========================================================================
# Price Calculation
# =========================================================================
def calculatePriceWithMarkup(self, basePriceCHF: float) -> float:
"""
Calculate final price with markup.
The AICore plugins return prices in their original currency (USD).
This method applies the configured markup percentage.
Args:
basePriceCHF: Base price from AI model (actually USD from provider)
Returns:
Final price in CHF with markup applied
"""
if basePriceCHF <= 0:
return 0.0
# Apply markup (with BILLING_MARKUP_PERCENT = 100 this multiplies the base price by 2.0)
markup_multiplier = 1 + (BILLING_MARKUP_PERCENT / 100)
return round(basePriceCHF * markup_multiplier, 6)
# =========================================================================
# Balance Operations
# =========================================================================
def checkBalance(self, estimatedCost: float = 0.0) -> BillingCheckResult:
"""
Check if the current user/mandate has sufficient balance.
Args:
estimatedCost: Estimated cost of the operation (with markup applied)
Returns:
BillingCheckResult indicating if operation is allowed
"""
return self._billingInterface.checkBalance(
self.mandateId,
self.currentUser.id,
estimatedCost
)
def hasBalance(self, estimatedCost: float = 0.0) -> bool:
"""
Quick check if balance is sufficient.
Args:
estimatedCost: Estimated cost with markup
Returns:
True if operation is allowed
"""
result = self.checkBalance(estimatedCost)
return result.allowed
def getCurrentBalance(self) -> float:
"""
Get current balance for the user/mandate.
Returns:
Current balance in CHF
"""
result = self.checkBalance(0.0)
return result.currentBalance or 0.0
# =========================================================================
# Usage Recording
# =========================================================================
def recordUsage(
self,
priceCHF: float,
workflowId: str = None,
aicoreProvider: str = None,
aicoreModel: str = None,
description: str = None
) -> Optional[Dict[str, Any]]:
"""
Record AI usage cost as a billing transaction.
This method:
1. Applies the pricing markup
2. Creates a DEBIT transaction
3. Updates the account balance
Args:
priceCHF: Base price from AI model (before markup)
workflowId: Optional workflow ID
aicoreProvider: AICore provider name (e.g., 'anthropic', 'openai')
aicoreModel: AICore model name (e.g., 'claude-4-sonnet', 'gpt-4o')
description: Optional description
Returns:
Created transaction dict or None if not recorded
"""
if priceCHF <= 0:
return None
# Apply markup
finalPrice = self.calculatePriceWithMarkup(priceCHF)
if finalPrice <= 0:
return None
# Build description
if not description:
description = f"AI Usage: {aicoreModel or aicoreProvider or 'unknown'}"
return self._billingInterface.recordUsage(
mandateId=self.mandateId,
userId=self.currentUser.id,
priceCHF=finalPrice,
workflowId=workflowId,
featureInstanceId=self.featureInstanceId,
featureCode=self.featureCode,
aicoreProvider=aicoreProvider,
aicoreModel=aicoreModel,
description=description
)
# =========================================================================
# Provider Permission Check (via RBAC)
# =========================================================================
def isProviderAllowed(self, provider: str) -> bool:
"""
Check if the user has permission to use an AICore provider.
Uses RBAC to check for resource permission:
resource.aicore.{provider}
Args:
provider: Provider name (e.g., 'anthropic', 'openai')
Returns:
True if provider is allowed
"""
try:
from modules.security.rbac import RbacClass
from modules.datamodels.datamodelRbac import AccessRuleContext
from modules.security.rootAccess import getRootDbAppConnector
# Get database connector via established pattern
dbApp = getRootDbAppConnector()
rbac = RbacClass(dbApp, dbApp)
resourceKey = f"resource.aicore.{provider}"
# Check if user has view permission for this resource (view = use for RESOURCE context)
permissions = rbac.getUserPermissions(
self.currentUser,
AccessRuleContext.RESOURCE,
resourceKey,
mandateId=self.mandateId
)
return permissions.view
except Exception as e:
logger.warning(f"Error checking provider permission: {e}")
# Default to allowed if RBAC check fails
return True
def getallowedProviders(self) -> List[str]:
"""
Get list of AICore providers the user is allowed to use.
Returns:
List of allowed provider names
"""
try:
from modules.aicore.aicoreModelRegistry import modelRegistry
# Get all available providers
connectors = modelRegistry.discoverConnectors()
allProviders = [c.getConnectorType() for c in connectors]
# Filter by RBAC permissions
return [p for p in allProviders if self.isProviderAllowed(p)]
except Exception as e:
logger.warning(f"Error getting allowed providers: {e}")
return []
# =========================================================================
# Admin Operations
# =========================================================================
def addCredit(
self,
amount: float,
description: str = "Manual credit",
referenceType: ReferenceTypeEnum = ReferenceTypeEnum.ADMIN
) -> Optional[Dict[str, Any]]:
"""
Add credit to the account (admin operation).
Args:
amount: Amount to credit (positive)
description: Transaction description
referenceType: Reference type (ADMIN, PAYMENT, SYSTEM)
Returns:
Created transaction dict or None
"""
if amount <= 0:
return None
settings = self._getSettings()
if not settings:
logger.warning(f"No billing settings for mandate {self.mandateId}")
return None
billingModel = BillingModelEnum(settings.get("billingModel", BillingModelEnum.UNLIMITED.value))
# Get or create account
if billingModel == BillingModelEnum.PREPAY_USER:
account = self._billingInterface.getOrCreateUserAccount(
self.mandateId,
self.currentUser.id,
initialBalance=0.0
)
else:
account = self._billingInterface.getOrCreateMandateAccount(
self.mandateId,
initialBalance=0.0
)
# Create credit transaction
transaction = BillingTransaction(
accountId=account["id"],
transactionType=TransactionTypeEnum.CREDIT,
amount=amount,
description=description,
referenceType=referenceType
)
return self._billingInterface.createTransaction(transaction)
# =========================================================================
# Statistics & Reporting
# =========================================================================
def getBalancesForUser(self) -> List[BillingBalanceResponse]:
"""
Get all billing balances for the current user.
Returns:
List of balance responses for each mandate
"""
return self._billingInterface.getBalancesForUser(self.currentUser.id)
def getTransactionHistory(self, limit: int = 100) -> List[Dict[str, Any]]:
"""
Get transaction history for the user across all mandates.
Args:
limit: Maximum number of transactions
Returns:
List of transactions
"""
return self._billingInterface.getTransactionsForUser(self.currentUser.id, limit=limit)
# ============================================================================
# Exception Classes
# ============================================================================
class InsufficientBalanceException(Exception):
"""Raised when there's insufficient balance for an operation."""
def __init__(self, currentBalance: float, requiredAmount: float, message: str = None):
self.currentBalance = currentBalance
self.requiredAmount = requiredAmount
self.message = message or f"Insufficient balance. Current: {currentBalance:.2f} CHF, Required: {requiredAmount:.2f} CHF"
super().__init__(self.message)
class ProviderNotAllowedException(Exception):
"""Raised when a user doesn't have permission to use an AI provider."""
def __init__(self, provider: str, message: str = None):
self.provider = provider
self.message = message or f"Provider '{provider}' is not allowed for your role"
super().__init__(self.message)
class BillingContextError(Exception):
"""Raised when billing context is incomplete (missing mandateId, user, etc.).
This is a FAIL-SAFE error: AI calls MUST NOT proceed without valid billing context.
Acts like a 0 CHF credit card pre-authorization check - validates that billing
CAN be recorded before any expensive AI operation starts.
"""
def __init__(self, message: str = None):
self.message = message or "Billing context incomplete - AI call blocked"
super().__init__(self.message)
# Expose exception classes on BillingService so consumers can use service.InsufficientBalanceException
# instead of importing from this module
BillingService.InsufficientBalanceException = InsufficientBalanceException
BillingService.ProviderNotAllowedException = ProviderNotAllowedException
BillingService.BillingContextError = BillingContextError
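# Illustrative usage sketch (values are examples; the feature code shown is an assumption):
#   service = getService(currentUser, mandateId, featureCode="chatplayground")
#   if not service.hasBalance(estimatedCost=0.05):
#       raise InsufficientBalanceException(service.getCurrentBalance(), 0.05)
#   # ... run the AI call, obtain the provider's base price ...
#   service.recordUsage(priceCHF=0.03, aicoreProvider="anthropic", aicoreModel="claude-4-sonnet")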

View file

@ -0,0 +1,104 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Stripe Checkout service for billing credit top-ups.
Creates Checkout Sessions for redirect-based payment flow.
"""
import logging
from typing import Optional
from modules.shared.configuration import APP_CONFIG
logger = logging.getLogger(__name__)
# Server-side allowed amounts in CHF - never trust client
ALLOWED_AMOUNTS_CHF = [10, 25, 50, 100, 250, 500]
def create_checkout_session(
mandate_id: str,
user_id: Optional[str],
amount_chf: float
) -> str:
"""
Create a Stripe Checkout Session for credit top-up.
Amount and currency are validated server-side. The client-provided amount
must match an allowed preset.
Args:
mandate_id: Target mandate ID
user_id: Target user ID (for PREPAY_USER) or None (for mandate pool)
amount_chf: Amount in CHF (must be in ALLOWED_AMOUNTS_CHF)
Returns:
Stripe Checkout Session URL for redirect
Raises:
ValueError: If amount is invalid
"""
import stripe
# Validate amount server-side
if amount_chf not in ALLOWED_AMOUNTS_CHF:
raise ValueError(
f"Invalid amount {amount_chf} CHF. Allowed: {ALLOWED_AMOUNTS_CHF}"
)
# Pin API version from config (match Stripe Dashboard)
api_version = APP_CONFIG.get("STRIPE_API_VERSION")
if api_version:
stripe.api_version = api_version
# Get secrets
secret_key = APP_CONFIG.get("STRIPE_SECRET_KEY_SECRET") or APP_CONFIG.get("STRIPE_SECRET_KEY")
if not secret_key:
raise ValueError("STRIPE_SECRET_KEY_SECRET not configured")
stripe.api_key = secret_key
frontend_url = APP_CONFIG.get("APP_FRONTEND_URL", "https://nyla-int.poweron-center.net")
base_path = "/admin/billing"
success_url = f"{frontend_url.rstrip('/')}{base_path}?success=true&session_id={{CHECKOUT_SESSION_ID}}"
cancel_url = f"{frontend_url.rstrip('/')}{base_path}?canceled=true"
# Amount in cents for Stripe (CHF uses 2 decimal places)
amount_cents = int(round(amount_chf * 100))
metadata = {
"mandateId": mandate_id,
"amountChf": str(amount_chf),
}
if user_id:
metadata["userId"] = user_id
session = stripe.checkout.Session.create(
mode="payment",
line_items=[
{
"price_data": {
"currency": "chf",
"unit_amount": amount_cents,
"product_data": {
"name": "Guthaben aufladen",
"description": "AI Service Guthaben (CHF)",
},
},
"quantity": 1,
}
],
success_url=success_url,
cancel_url=cancel_url,
metadata=metadata,
)
if not session or not session.url:
raise ValueError("Stripe Checkout Session creation failed")
logger.info(
f"Created Stripe Checkout Session {session.id} for mandate {mandate_id}, "
f"amount {amount_chf} CHF"
)
return session.url
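# Illustrative call (a sketch; IDs are placeholders and Stripe credentials must be configured):
#   url = create_checkout_session(mandate_id="m_123", user_id=None, amount_chf=25)
#   # redirect the browser to `url`; Stripe returns to /admin/billing with success/canceled flags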

View file

@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Chat service."""
from .mainServiceChat import ChatService
__all__ = ["ChatService"]

File diff suppressed because it is too large

View file

@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from .mainServiceExtraction import ExtractionService
__all__ = ["ExtractionService"]

View file

@ -0,0 +1,4 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

View file

@ -0,0 +1,184 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import base64
import io
from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Chunker
class ImageChunker(Chunker):
"""Chunker for reducing image size through resizing, compression, and tiling."""
def chunk(self, part: ContentPart, options: Dict[str, Any]) -> list[Dict[str, Any]]:
"""
Chunk an image by reducing its size through various strategies.
Args:
part: ContentPart containing image data (base64 encoded)
options: Chunking options including:
- imageChunkSize: Maximum size in bytes for each chunk
- imageMaxPixels: Maximum pixels (width*height) for the image
- imageQuality: JPEG quality (0-100, default 85)
- imageTileSize: Size for tiling if image is still too large
Returns:
List of image chunks with reduced size
"""
maxBytes = int(options.get("imageChunkSize", 1000000)) # 1MB default
maxPixels = int(options.get("imageMaxPixels", 1024 * 1024)) # 1MP default
quality = int(options.get("imageQuality", 85))
tileSize = int(options.get("imageTileSize", 512)) # 512x512 tiles
chunks: List[Dict[str, Any]] = []
try:
# Lazy import PIL to avoid hanging during module import
from PIL import Image
# Decode base64 image data
imageData = base64.b64decode(part.data)
image = Image.open(io.BytesIO(imageData))
# Get original dimensions
originalWidth, originalHeight = image.size
originalPixels = originalWidth * originalHeight
# Strategy 1: If image is small enough, return as-is
if len(part.data) <= maxBytes and originalPixels <= maxPixels:
chunks.append({
"data": part.data,
"size": len(part.data),
"order": 0,
"metadata": {
"originalSize": len(part.data),
"originalPixels": originalPixels,
"strategy": "original"
}
})
return chunks
# Strategy 2: Resize to fit within pixel limit
if originalPixels > maxPixels:
# Calculate new dimensions maintaining aspect ratio
scale = (maxPixels / originalPixels) ** 0.5
newWidth = int(originalWidth * scale)
newHeight = int(originalHeight * scale)
# Ensure minimum size
newWidth = max(newWidth, 64)
newHeight = max(newHeight, 64)
image = image.resize((newWidth, newHeight), Image.Resampling.LANCZOS)
# Strategy 3: Compress with quality reduction
currentSize = len(part.data)
currentQuality = quality
while currentSize > maxBytes and currentQuality > 10:
# Compress image
output = io.BytesIO()
image.save(output, format='JPEG', quality=currentQuality, optimize=True)
compressedData = output.getvalue()
compressedB64 = base64.b64encode(compressedData).decode('utf-8')
currentSize = len(compressedB64)
if currentSize <= maxBytes:
chunks.append({
"data": compressedB64,
"size": currentSize,
"order": 0,
"metadata": {
"originalSize": len(part.data),
"originalPixels": originalPixels,
"compressedSize": currentSize,
"quality": currentQuality,
"strategy": "compressed"
}
})
return chunks
currentQuality -= 10
# Strategy 4: Tile the image if still too large
if currentSize > maxBytes:
chunks = self._tileImage(image, maxBytes, tileSize, quality, originalPixels)
return chunks
# Fallback: Return compressed version even if over limit
output = io.BytesIO()
image.save(output, format='JPEG', quality=10, optimize=True)
compressedData = output.getvalue()
compressedB64 = base64.b64encode(compressedData).decode('utf-8')
chunks.append({
"data": compressedB64,
"size": len(compressedB64),
"order": 0,
"metadata": {
"originalSize": len(part.data),
"originalPixels": originalPixels,
"compressedSize": len(compressedB64),
"quality": 10,
"strategy": "fallback_compressed"
}
})
except Exception as e:
# Fallback: Return original data with error metadata
chunks.append({
"data": part.data,
"size": len(part.data),
"order": 0,
"metadata": {
"originalSize": len(part.data),
"strategy": "error_fallback",
"error": str(e)
}
})
return chunks
def _tileImage(self, image: Any, maxBytes: int, tileSize: int, quality: int, originalPixels: int) -> List[Dict[str, Any]]:
"""Split image into tiles if it's still too large after compression."""
chunks = []
width, height = image.size
# Calculate tile grid
tilesX = (width + tileSize - 1) // tileSize
tilesY = (height + tileSize - 1) // tileSize
for y in range(tilesY):
for x in range(tilesX):
# Calculate tile boundaries
left = x * tileSize
top = y * tileSize
right = min(left + tileSize, width)
bottom = min(top + tileSize, height)
# Extract tile
tile = image.crop((left, top, right, bottom))
# Compress tile
output = io.BytesIO()
tile.save(output, format='JPEG', quality=quality, optimize=True)
tileData = output.getvalue()
tileB64 = base64.b64encode(tileData).decode('utf-8')
chunks.append({
"data": tileB64,
"size": len(tileB64),
"order": y * tilesX + x,
"metadata": {
"originalSize": len(image.tobytes()),
"originalPixels": originalPixels,
"tileSize": tileSize,
"tilePosition": f"{x},{y}",
"tileBounds": f"{left},{top},{right},{bottom}",
"quality": quality,
"strategy": "tiled"
}
})
return chunks
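if __name__ == "__main__":
    # Minimal self-test sketch (assumes ContentPart and the Chunker base construct as used above)
    from PIL import Image
    buf = io.BytesIO()
    Image.new("RGB", (2048, 2048), "white").save(buf, format="JPEG")
    demoPart = ContentPart(id="demo", parentId=None, label="image", typeGroup="image",
                           mimeType="image/jpeg",
                           data=base64.b64encode(buf.getvalue()).decode("utf-8"), metadata={})
    for c in ImageChunker().chunk(demoPart, {"imageChunkSize": 50000, "imageMaxPixels": 512 * 512}):
        print(c["order"], c["size"], c["metadata"]["strategy"])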

View file

@ -0,0 +1,91 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import json
from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Chunker
class StructureChunker(Chunker):
def chunk(self, part: ContentPart, options: Dict[str, Any]) -> list[Dict[str, Any]]:
maxBytes = int(options.get("structureChunkSize", 40000))
data = part.data or ""
# best-effort: try JSON list/object bucketing; else fallback to line-based
chunks: List[Dict[str, Any]] = []
try:
obj = json.loads(data)
def emit(bucket: Any):
text = json.dumps(bucket, ensure_ascii=False)
chunks.append({"data": text, "size": len(text.encode('utf-8')), "order": len(chunks)})
if isinstance(obj, list):
bucket: list[Any] = []
size = 0
for item in obj:
text = json.dumps(item, ensure_ascii=False)
s = len(text.encode('utf-8'))
if size + s > maxBytes and bucket:
emit(bucket)
bucket = [item]
size = s
else:
bucket.append(item)
size += s
if bucket:
emit(bucket)
else:
# JSON object (dict) - check if it fits
text = json.dumps(obj, ensure_ascii=False)
textSize = len(text.encode('utf-8'))
if textSize <= maxBytes:
emit(obj)
else:
# Object too large - try to split by keys if possible
# For large objects, we need to chunk by character boundaries
# since we can't split JSON objects arbitrarily
if isinstance(obj, dict) and len(obj) > 1:
# Try to split object into multiple chunks by keys
# This preserves JSON structure better than line-based chunking
currentChunk: Dict[str, Any] = {}
currentSize = 2 # Start with "{}" overhead
for key, value in obj.items():
itemText = json.dumps({key: value}, ensure_ascii=False)
itemSize = len(itemText.encode('utf-8'))
# Account for comma and spacing between items
if currentChunk:
itemSize += 2 # ", " separator
if currentSize + itemSize > maxBytes and currentChunk:
# Current chunk is full, emit it
emit(currentChunk)
currentChunk = {key: value}
currentSize = len(itemText.encode('utf-8'))
else:
currentChunk[key] = value
currentSize += itemSize
# Emit remaining chunk
if currentChunk:
emit(currentChunk)
else:
# Single large value or can't split - fallback to line chunking
raise ValueError("too large")
except Exception:
current: List[str] = []
size = 0
for line in data.split('\n'):
s = len(line.encode('utf-8')) + 1
if size + s > maxBytes and current:
text = '\n'.join(current)
chunks.append({"data": text, "size": len(text.encode('utf-8')), "order": len(chunks)})
current = [line]
size = s
else:
current.append(line)
size += s
if current:
text = '\n'.join(current)
chunks.append({"data": text, "size": len(text.encode('utf-8')), "order": len(chunks)})
return chunks
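if __name__ == "__main__":
    # Minimal self-test sketch: a JSON list is bucketed into size-bounded JSON chunks
    # (assumes ContentPart constructs with the fields used above)
    demoData = json.dumps([{"k": i, "v": "x" * 20} for i in range(10)])
    demoPart = ContentPart(id="demo", parentId=None, label="main", typeGroup="structure",
                           mimeType="application/json", data=demoData, metadata={})
    for c in StructureChunker().chunk(demoPart, {"structureChunkSize": 120}):
        print(c["order"], c["size"], c["data"])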

View file

@ -0,0 +1,30 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Chunker
class TableChunker(Chunker):
def chunk(self, part: ContentPart, options: Dict[str, Any]) -> list[Dict[str, Any]]:
maxBytes = int(options.get("tableChunkSize", 40000))
chunks: List[Dict[str, Any]] = []
current: List[str] = []
size = 0
for line in part.data.split('\n'):
lineSize = len(line.encode('utf-8')) + 1
if size + lineSize > maxBytes and current:
data = '\n'.join(current)
chunks.append({"data": data, "size": len(data.encode('utf-8')), "order": len(chunks)})
current = [line]
size = lineSize
else:
current.append(line)
size += lineSize
if current:
data = '\n'.join(current)
chunks.append({"data": data, "size": len(data.encode('utf-8')), "order": len(chunks)})
return chunks

View file

@ -0,0 +1,58 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import logging
from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Chunker
logger = logging.getLogger(__name__)
class TextChunker(Chunker):
def chunk(self, part: ContentPart, options: Dict[str, Any]) -> list[Dict[str, Any]]:
maxBytes = int(options.get("textChunkSize", 40000))
logger.debug(f"TextChunker: textChunkSize from options: {options.get('textChunkSize', 'NOT_FOUND')}")
logger.debug(f"TextChunker: using maxBytes: {maxBytes}")
chunks: List[Dict[str, Any]] = []
# Split by lines first (preferred method for text)
lines = part.data.split('\n')
current: List[str] = []
size = 0
for line in lines:
lineSize = len(line.encode('utf-8')) + 1 # +1 for newline character
if size + lineSize > maxBytes and current:
# Current chunk is full, save it and start new one
data = '\n'.join(current)
chunks.append({"data": data, "size": len(data.encode('utf-8')), "order": len(chunks)})
current = []
size = 0
# If a single line is larger than maxBytes, split it by character boundaries
if lineSize > maxBytes:
# Split the long line into chunks
lineBytes = line.encode('utf-8')
lineStart = 0
while lineStart < len(lineBytes):
chunkBytes = lineBytes[lineStart:lineStart + maxBytes]
chunkText = chunkBytes.decode('utf-8', errors='ignore')
chunks.append({"data": chunkText, "size": len(chunkBytes), "order": len(chunks)})
lineStart += maxBytes
# Don't add this line to current, it's already chunked
continue
# Add line to current chunk
current.append(line)
size += lineSize
# Add remaining lines as final chunk
if current:
data = '\n'.join(current)
chunks.append({"data": data, "size": len(data.encode('utf-8')), "order": len(chunks)})
logger.debug(f"TextChunker: Created {len(chunks)} chunks, total input size: {len(part.data.encode('utf-8'))} bytes")
return chunks
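if __name__ == "__main__":
    # Minimal self-test sketch: line-based chunking with a tiny byte budget
    # (assumes ContentPart constructs with the fields used above)
    demoPart = ContentPart(id="demo", parentId=None, label="main", typeGroup="text",
                           mimeType="text/plain", data="line one\nline two\nline three", metadata={})
    for c in TextChunker().chunk(demoPart, {"textChunkSize": 12}):
        print(c["order"], c["size"], repr(c["data"]))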

View file

@ -0,0 +1,4 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

View file

@ -0,0 +1,47 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import base64
from ..subUtils import makeId
from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Extractor
class BinaryExtractor(Extractor):
"""
Fallback extractor for unsupported file types.
This extractor handles any file type that doesn't match other extractors.
It encodes the file as base64 and marks it as binary data.
Supported formats:
- All file types (fallback)
- MIME types: application/octet-stream (default)
- File extensions: All (fallback)
"""
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
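# Always matches; as the documented fallback this extractor is assumed to be tried after all others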
return True
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions (all)."""
return [] # Accepts all extensions as fallback
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types (all)."""
return [] # Accepts all MIME types as fallback
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
mimeType = context.get("mimeType") or "application/octet-stream"
return [ContentPart(
id=makeId(),
parentId=None,
label="binary",
typeGroup="binary",
mimeType=mimeType,
data=base64.b64encode(fileBytes).decode("utf-8"),
metadata={"size": len(fileBytes), "warning": "Unsupported file type"}
)]

View file

@ -0,0 +1,45 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
from modules.datamodels.datamodelExtraction import ContentPart
from ..subUtils import makeId
from ..subRegistry import Extractor
class CsvExtractor(Extractor):
"""
Extractor for CSV files.
Supported formats:
- MIME types: text/csv
- File extensions: .csv
- Special handling: Treats as table data
"""
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
return mimeType == "text/csv" or (fileName or "").lower().endswith(".csv")
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions."""
return [".csv"]
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types."""
return ["text/csv"]
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
fileName = context.get("fileName")
mimeType = context.get("mimeType") or "text/csv"
data = fileBytes.decode("utf-8", errors="replace")
return [ContentPart(
id=makeId(),
parentId=None,
label="main",
typeGroup="table",
mimeType=mimeType,
data=data,
metadata={"size": len(fileBytes)}
)]

View file

@ -0,0 +1,109 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import io
from ..subUtils import makeId
from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Extractor
class DocxExtractor(Extractor):
"""
Extractor for Microsoft Word documents.
Supported formats:
- MIME types: application/vnd.openxmlformats-officedocument.wordprocessingml.document
- File extensions: .docx
- Special handling: Extracts paragraphs and tables (converts tables to CSV)
- Dependencies: python-docx
"""
def __init__(self):
self._loaded = False
self._haveLibs = False
def _load(self):
if self._loaded:
return
self._loaded = True
try:
global docx
import docx # python-docx
self._haveLibs = True
except Exception:
self._haveLibs = False
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
return mimeType == "application/vnd.openxmlformats-officedocument.wordprocessingml.document" or (fileName or "").lower().endswith(".docx")
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions."""
return [".docx"]
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types."""
return ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"]
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
self._load()
parts: List[ContentPart] = []
rootId = makeId()
parts.append(ContentPart(
id=rootId,
parentId=None,
label="docx",
typeGroup="container",
mimeType="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
data="",
metadata={"size": len(fileBytes)}
))
if not self._haveLibs:
parts.append(ContentPart(
id=makeId(),
parentId=rootId,
label="binary",
typeGroup="binary",
mimeType="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
data="",
metadata={"size": len(fileBytes), "warning": "DOCX lib not available"}
))
return parts
with io.BytesIO(fileBytes) as buf:
d = docx.Document(buf)
# paragraphs
for i, para in enumerate(d.paragraphs):
text = para.text or ""
if text.strip():
parts.append(ContentPart(
id=makeId(),
parentId=rootId,
label=f"p_{i+1}",
typeGroup="text",
mimeType="text/plain",
data=text,
metadata={"size": len(text.encode('utf-8'))}
))
# tables → CSV rows
for ti, table in enumerate(d.tables):
rows: list[str] = []
for row in table.rows:
cells = [ (cell.text or "").replace('"', '""') for cell in row.cells ]
rows.append(",".join([f'"{c}"' for c in cells]))
csvData = "\n".join(rows)
if csvData:
parts.append(ContentPart(
id=makeId(),
parentId=rootId,
label=f"table_{ti+1}",
typeGroup="table",
mimeType="text/csv",
data=csvData,
metadata={"size": len(csvData.encode('utf-8'))}
))
return parts
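# Illustrative call (a sketch; the path is a placeholder):
#   with open("report.docx", "rb") as f:
#       parts = DocxExtractor().extract(f.read(), {"fileName": "report.docx"})
#   # yields a container part, a text part per non-empty paragraph, and a CSV table part per table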

View file

@ -0,0 +1,50 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
from bs4 import BeautifulSoup
from modules.datamodels.datamodelExtraction import ContentPart
from ..subUtils import makeId
from ..subRegistry import Extractor
class HtmlExtractor(Extractor):
"""
Extractor for HTML files.
Supported formats:
- MIME types: text/html
- File extensions: .html, .htm
- Special handling: Uses BeautifulSoup for parsing
- Dependencies: beautifulsoup4
"""
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
return mimeType == "text/html" or (fileName or "").lower().endswith((".html", ".htm"))
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions."""
return [".html", ".htm"]
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types."""
return ["text/html"]
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
mimeType = context.get("mimeType") or "text/html"
text = fileBytes.decode("utf-8", errors="replace")
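# Parse once as a well-formedness probe; the raw markup is returned either way
# (assumption: downstream consumers work with the full HTML structure)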
try:
BeautifulSoup(text, "html.parser")
except Exception:
pass
return [ContentPart(
id=makeId(),
parentId=None,
label="main",
typeGroup="structure",
mimeType=mimeType,
data=text,
metadata={"size": len(fileBytes)}
)]

View file

@ -0,0 +1,77 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import base64
import logging
from ..subUtils import makeId
from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Extractor
logger = logging.getLogger(__name__)
class ImageExtractor(Extractor):
"""
Extractor for image files.
Supported formats:
- MIME types: image/jpeg, image/png, image/gif, image/webp, image/bmp, image/tiff
- File extensions: .jpg, .jpeg, .png, .gif, .webp, .bmp, .tiff
- Special handling: GIF files are converted to PNG during extraction
"""
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
return ((mimeType or "").startswith("image/") or
(fileName or "").lower().endswith((".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp", ".tiff")))
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions."""
return [".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp", ".tiff"]
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types."""
return ["image/jpeg", "image/png", "image/gif", "image/webp", "image/bmp", "image/tiff"]
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
mimeType = context.get("mimeType") or "image/unknown"
fileName = context.get("fileName", "")
# Convert GIF to PNG during extraction
if mimeType.lower() == "image/gif":
try:
from PIL import Image
import io
# Open GIF and convert to PNG
with Image.open(io.BytesIO(fileBytes)) as img:
# Convert to RGB (removes animation)
if img.mode in ('RGBA', 'LA', 'P'):
img = img.convert('RGB')
# Save as PNG in memory
png_buffer = io.BytesIO()
img.save(png_buffer, format='PNG')
png_data = png_buffer.getvalue()
# Update mimeType and fileBytes (capture the original size before it is overwritten)
originalSize = len(fileBytes)
mimeType = "image/png"
fileBytes = png_data
logger.info(f"GIF converted to PNG during extraction: {fileName}, original={originalSize} bytes, converted={len(png_data)} bytes")
except Exception as e:
logger.warning(f"GIF conversion failed during extraction for {fileName}: {str(e)}, using original")
# Keep original GIF data if conversion fails
return [ContentPart(
id=makeId(),
parentId=None,
label="image",
typeGroup="image",
mimeType=mimeType,
data=base64.b64encode(fileBytes).decode("utf-8"),
metadata={"size": len(fileBytes)}
)]

View file

@ -0,0 +1,50 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import json
from modules.datamodels.datamodelExtraction import ContentPart
from ..subUtils import makeId
from ..subRegistry import Extractor
class JsonExtractor(Extractor):
"""
Extractor for JSON files.
Supported formats:
- MIME types: application/json
- File extensions: .json
- Special handling: Validates JSON format, falls back to text if invalid
"""
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
return mimeType == "application/json" or (fileName or "").lower().endswith(".json")
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions."""
return [".json"]
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types."""
return ["application/json"]
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
mimeType = context.get("mimeType") or "application/json"
text = fileBytes.decode("utf-8", errors="replace")
# verify JSON is well-formed; fall back to plain text if not (as documented above)
typeGroup = "structure"
try:
json.loads(text)
except Exception:
typeGroup = "text"
return [ContentPart(
id=makeId(),
parentId=None,
label="main",
typeGroup=typeGroup,
mimeType=mimeType,
data=text,
metadata={"size": len(fileBytes)}
)]

View file

@ -0,0 +1,156 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import base64
import io
from ..subUtils import makeId
from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Extractor
class PdfExtractor(Extractor):
"""
Extractor for PDF files.
Supported formats:
- MIME types: application/pdf
- File extensions: .pdf
- Special handling: Extracts text per page and embedded images
- Dependencies: PyPDF2, PyMuPDF (fitz)
"""
def __init__(self):
self._loaded = False
self._haveLibs = False
def _load(self):
if self._loaded:
return
self._loaded = True
try:
global PyPDF2, fitz
import PyPDF2
import fitz # PyMuPDF
self._haveLibs = True
except Exception:
self._haveLibs = False
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
return mimeType == "application/pdf" or (fileName or "").lower().endswith(".pdf")
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions."""
return [".pdf"]
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types."""
return ["application/pdf"]
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
self._load()
parts: List[ContentPart] = []
rootId = makeId()
parts.append(ContentPart(
id=rootId,
parentId=None,
label="pdf",
typeGroup="container",
mimeType="application/pdf",
data="",
metadata={"size": len(fileBytes)}
))
if not self._haveLibs:
parts.append(ContentPart(
id=makeId(),
parentId=rootId,
label="binary",
typeGroup="binary",
mimeType="application/pdf",
data=base64.b64encode(fileBytes).decode("utf-8"),
metadata={"size": len(fileBytes), "warning": "PDF libs not available"}
))
return parts
# Extract text per page with PyMuPDF (same lib as in-place search - ensures extraction matches PDF text layer)
try:
with io.BytesIO(fileBytes) as buf:
doc = fitz.open(stream=buf.getvalue(), filetype="pdf")
for i in range(len(doc)):
try:
page = doc[i]
text = page.get_text() or ""
if text.strip():
parts.append(ContentPart(
id=makeId(),
parentId=rootId,
label=f"page_{i+1}",
typeGroup="text",
mimeType="text/plain",
data=text,
metadata={"pages": 1, "pageIndex": i, "size": len(text.encode('utf-8'))}
))
except Exception:
continue
doc.close()
except Exception:
pass
# Fallback to PyPDF2 if PyMuPDF text extraction failed or returned nothing
has_text = any(getattr(p, 'typeGroup', '') == "text" for p in parts)
if not has_text:
try:
with io.BytesIO(fileBytes) as buf:
reader = PyPDF2.PdfReader(buf)
for i, page in enumerate(reader.pages):
try:
text = page.extract_text() or ""
if text.strip():
parts.append(ContentPart(
id=makeId(),
parentId=rootId,
label=f"page_{i+1}",
typeGroup="text",
mimeType="text/plain",
data=text,
metadata={"pages": 1, "pageIndex": i, "size": len(text.encode('utf-8'))}
))
except Exception:
continue
except Exception:
pass
# Extract images with PyMuPDF
try:
with io.BytesIO(fileBytes) as buf2:
doc = fitz.open(stream=buf2.getvalue(), filetype="pdf")
for i in range(len(doc)):
page = doc[i]
images = page.get_images(full=True)
for j, img in enumerate(images):
try:
xref = img[0]
baseImage = doc.extract_image(xref)
if baseImage:
imgBytes = baseImage.get("image", b"")
ext = baseImage.get("ext", "png")
if imgBytes:
parts.append(ContentPart(
id=makeId(),
parentId=rootId,
label=f"image_{i+1}_{j}",
typeGroup="image",
mimeType=f"image/{ext}",
data=base64.b64encode(imgBytes).decode("utf-8"),
metadata={"pageIndex": i, "size": len(imgBytes)}
))
except Exception:
continue
doc.close()
except Exception:
pass
return parts
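# Illustrative call (a sketch; the path is a placeholder):
#   with open("sample.pdf", "rb") as f:
#       parts = PdfExtractor().extract(f.read(), {"fileName": "sample.pdf", "mimeType": "application/pdf"})
#   # parts[0] is the container; per-page text parts and embedded images follow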

View file

@ -0,0 +1,227 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
import logging
import base64
from typing import List, Dict, Any, Optional
from modules.datamodels.datamodelExtraction import ContentPart, ContentExtracted
from ..subRegistry import Extractor
logger = logging.getLogger(__name__)
class PptxExtractor(Extractor):
"""
Extractor for PowerPoint files.
Supported formats:
- MIME types: application/vnd.openxmlformats-officedocument.presentationml.presentation, application/vnd.ms-powerpoint
- File extensions: .pptx, .ppt
- Special handling: Extracts slide content, tables, and images
- Dependencies: python-pptx
"""
def __init__(self):
self._loaded = False
self._haveLibs = False
def _load(self):
if self._loaded:
return
self._loaded = True
try:
global Presentation
from pptx import Presentation
self._haveLibs = True
except Exception:
self._haveLibs = False
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
return (mimeType in [
"application/vnd.openxmlformats-officedocument.presentationml.presentation",
"application/vnd.ms-powerpoint"
]) or (fileName or "").lower().endswith((".pptx", ".ppt"))
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions."""
return [".pptx", ".ppt"]
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types."""
return [
"application/vnd.openxmlformats-officedocument.presentationml.presentation",
"application/vnd.ms-powerpoint"
]
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
"""
Extract content from PowerPoint files.
Args:
fileBytes: Raw file data as bytes
context: Context dictionary with file information
Returns:
List of ContentPart objects with extracted content
"""
self._load()
if not self._haveLibs:
logger.error("python-pptx library not installed. Install with: pip install python-pptx")
return [ContentPart(
id="error",
label="PowerPoint Extraction Error",
typeGroup="text",
mimeType="text/plain",
data="Error: python-pptx library not installed",
metadata={"error": True, "error_message": "python-pptx library not installed"}
)]
try:
import io
# Load presentation from bytes
presentation = Presentation(io.BytesIO(fileBytes))
parts = []
slide_index = 0
# Extract content from each slide
for slide in presentation.slides:
slide_index += 1
slide_content = []
# Extract text from slide
for shape in slide.shapes:
if hasattr(shape, "text") and shape.text.strip():
slide_content.append(shape.text.strip())
# Extract table data
for shape in slide.shapes:
if shape.has_table:
table = shape.table
table_data = []
for row in table.rows:
row_data = []
for cell in row.cells:
row_data.append(cell.text.strip())
table_data.append(row_data)
if table_data:
# Convert table to markdown format
table_md = self._table_to_markdown(table_data)
slide_content.append(table_md)
# Extract images
for shape in slide.shapes:
if shape.shape_type == 13: # MSO_SHAPE_TYPE.PICTURE
try:
image = shape.image
image_bytes = image.blob
image_b64 = base64.b64encode(image_bytes).decode('utf-8')
# Create image part
image_part = ContentPart(
id=f"slide_{slide_index}_image_{len(parts)}",
label=f"Slide {slide_index} Image",
typeGroup="image",
mimeType="image/png", # Default to PNG
data=image_b64,
metadata={
"slide_number": slide_index,
"shape_type": "image",
"extracted_from": "powerpoint"
}
)
parts.append(image_part)
except Exception as e:
logger.warning(f"Failed to extract image from slide {slide_index}: {str(e)}")
# Create slide content part
if slide_content:
slide_text = f"# Slide {slide_index}\n\n" + "\n\n".join(slide_content)
slide_part = ContentPart(
id=f"slide_{slide_index}",
label=f"Slide {slide_index} Content",
typeGroup="structure",
mimeType="text/plain",
data=slide_text,
metadata={
"slide_number": slide_index,
"content_type": "slide",
"extracted_from": "powerpoint",
"text_length": len(slide_text)
}
)
parts.append(slide_part)
# Create presentation overview
file_name = context.get("fileName", "presentation.pptx")
overview_text = f"# PowerPoint Presentation: {file_name}\n\n"
overview_text += f"**Total Slides:** {len(presentation.slides)}\n\n"
overview_text += f"**Content Parts:** {len(parts)}\n\n"
# Add slide summaries
for i, slide in enumerate(presentation.slides, 1):
slide_text_parts = []
for shape in slide.shapes:
if hasattr(shape, "text") and shape.text.strip():
slide_text_parts.append(shape.text.strip())
if slide_text_parts:
overview_text += f"## Slide {i}\n"
overview_text += "\n".join(slide_text_parts[:3]) # First 3 text elements
overview_text += "\n\n"
# Create overview part
overview_part = ContentPart(
id="presentation_overview",
label="Presentation Overview",
typeGroup="text",
mimeType="text/plain",
data=overview_text,
metadata={
"content_type": "overview",
"extracted_from": "powerpoint",
"total_slides": len(presentation.slides),
"text_length": len(overview_text)
}
)
parts.insert(0, overview_part) # Insert at beginning
return parts
except Exception as e:
logger.error(f"Error extracting PowerPoint content: {str(e)}")
return [ContentPart(
id="error",
label="PowerPoint Extraction Error",
typeGroup="text",
mimeType="text/plain",
data=f"Error extracting PowerPoint content: {str(e)}",
metadata={"error": True, "error_message": str(e)}
)]
def _table_to_markdown(self, table_data: List[List[str]]) -> str:
"""Convert table data to markdown format."""
if not table_data:
return ""
markdown_lines = []
# Header row
if table_data:
header = "| " + " | ".join(table_data[0]) + " |"
markdown_lines.append(header)
# Separator row
separator = "| " + " | ".join(["---"] * len(table_data[0])) + " |"
markdown_lines.append(separator)
# Data rows
for row in table_data[1:]:
data_row = "| " + " | ".join(row) + " |"
markdown_lines.append(data_row)
return "\n".join(markdown_lines)

View file

@ -0,0 +1,58 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
from modules.datamodels.datamodelExtraction import ContentPart
from ..subUtils import makeId
from ..subRegistry import Extractor
class SqlExtractor(Extractor):
"""
Extractor for SQL files.
Supported formats:
- MIME types: text/x-sql, application/sql
- File extensions: .sql, .ddl, .dml, .dcl, .tcl
- Special handling: Treats as structured text with SQL syntax
"""
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
return (mimeType in ("text/x-sql", "application/sql") or
(fileName or "").lower().endswith((".sql", ".ddl", ".dml", ".dcl", ".tcl")))
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions."""
return [".sql", ".ddl", ".dml", ".dcl", ".tcl"]
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types."""
return ["text/x-sql", "application/sql"]
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
fileName = context.get("fileName")
mimeType = context.get("mimeType") or "text/x-sql"
data = fileBytes.decode("utf-8", errors="replace")
# Add SQL-specific metadata
metadata = {
"size": len(fileBytes),
"file_type": "sql",
"line_count": len(data.splitlines()),
"has_select": "SELECT" in data.upper(),
"has_insert": "INSERT" in data.upper(),
"has_update": "UPDATE" in data.upper(),
"has_delete": "DELETE" in data.upper(),
"has_create": "CREATE" in data.upper(),
"has_drop": "DROP" in data.upper()
}
return [ContentPart(
id=makeId(),
parentId=None,
label="main",
typeGroup="structure",
mimeType=mimeType,
data=data,
metadata=metadata
)]

View file

@ -0,0 +1,105 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
from modules.datamodels.datamodelExtraction import ContentPart
from ..subUtils import makeId
from ..subRegistry import Extractor
class TextExtractor(Extractor):
"""
Extractor for plain text files and code files.
Supported formats:
- MIME types: text/plain, text/markdown, text/x-python, text/x-java-source, text/javascript, etc.
- File extensions: .txt, .md, .log, .java, .js, .jsx, .ts, .tsx, .py, .config, .ini, .cfg, .conf, .properties, .yaml, .yml, .toml, .sh, .bat, .ps1, .sql, .css, .scss, .sass, .less, .xml, .json, .csv, .tsv, .rtf, .tex, .rst, .adoc, .org, .pod, .man, .1, .2, .3, .4, .5, .6, .7, .8, .9, .n, .l, .m, .r, .t, .x, .y, .z
"""
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
# Check MIME types
if mimeType and mimeType.startswith("text/"):
return True
# Check file extensions
if fileName:
ext = fileName.lower()
return ext.endswith((
# Basic text files
".txt", ".md", ".log", ".rtf", ".tex", ".rst", ".adoc", ".org", ".pod",
# Programming languages
".java", ".js", ".jsx", ".ts", ".tsx", ".py", ".rb", ".go", ".rs", ".cpp", ".c", ".h", ".hpp", ".cc", ".cxx",
".cs", ".php", ".swift", ".kt", ".scala", ".clj", ".hs", ".ml", ".fs", ".vb", ".dart", ".r", ".m", ".pl", ".sh",
# Web technologies
".html", ".htm", ".css", ".scss", ".sass", ".less", ".vue", ".svelte",
# Configuration files
".config", ".ini", ".cfg", ".conf", ".properties", ".yaml", ".yml", ".toml", ".json", ".xml",
# Scripts and automation
".bat", ".ps1", ".psm1", ".psd1", ".vbs", ".wsf", ".cmd", ".com",
# Data files
".csv", ".tsv", ".tab", ".dat", ".data",
# Documentation
".man", ".1", ".2", ".3", ".4", ".5", ".6", ".7", ".8", ".9", ".n", ".l", ".m", ".r", ".t", ".x", ".y", ".z",
# Other text formats
".diff", ".patch", ".gitignore", ".dockerignore", ".editorconfig", ".gitattributes",
".env", ".env.local", ".env.development", ".env.production", ".env.test",
".lock", ".lockb", ".lockfile", ".pkg-lock", ".yarn-lock"
))
return False
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions."""
return [
# Basic text files
".txt", ".md", ".log", ".rtf", ".tex", ".rst", ".adoc", ".org", ".pod",
# Programming languages
".java", ".js", ".jsx", ".ts", ".tsx", ".py", ".rb", ".go", ".rs", ".cpp", ".c", ".h", ".hpp", ".cc", ".cxx",
".cs", ".php", ".swift", ".kt", ".scala", ".clj", ".hs", ".ml", ".fs", ".vb", ".dart", ".r", ".m", ".pl", ".sh",
# Web technologies
".html", ".htm", ".css", ".scss", ".sass", ".less", ".vue", ".svelte",
# Configuration files
".config", ".ini", ".cfg", ".conf", ".properties", ".yaml", ".yml", ".toml", ".json", ".xml",
# Scripts and automation
".bat", ".ps1", ".psm1", ".psd1", ".vbs", ".wsf", ".cmd", ".com",
# Data files
".csv", ".tsv", ".tab", ".dat", ".data",
# Documentation
".man", ".1", ".2", ".3", ".4", ".5", ".6", ".7", ".8", ".9", ".n", ".l", ".m", ".r", ".t", ".x", ".y", ".z",
# Other text formats
".diff", ".patch", ".gitignore", ".dockerignore", ".editorconfig", ".gitattributes",
".env", ".env.local", ".env.development", ".env.production", ".env.test",
".lock", ".lockb", ".lockfile", ".pkg-lock", ".yarn-lock"
]
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types."""
return [
"text/plain", "text/markdown", "text/x-python", "text/x-java-source",
"text/javascript", "text/x-javascript", "text/typescript", "text/x-typescript",
"text/x-c", "text/x-c++", "text/x-csharp", "text/x-php", "text/x-ruby",
"text/x-go", "text/x-rust", "text/x-scala", "text/x-swift", "text/x-kotlin",
"text/x-sql", "text/x-sh", "text/x-shellscript", "text/x-yaml", "text/x-toml",
"text/x-ini", "text/x-config", "text/x-properties", "text/x-log",
"text/html", "text/css", "text/x-scss", "text/x-sass", "text/x-less",
"text/xml", "text/csv", "text/tab-separated-values", "text/rtf",
"text/x-tex", "text/x-rst", "text/x-asciidoc", "text/x-org",
"application/x-yaml", "application/x-toml", "application/x-ini",
"application/x-config", "application/x-properties", "application/x-log"
]
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
fileName = context.get("fileName")
mimeType = context.get("mimeType") or "text/plain"
data = fileBytes.decode("utf-8", errors="replace")
return [ContentPart(
id=makeId(),
parentId=None,
label="main",
typeGroup="text",
mimeType=mimeType,
data=data,
metadata={"size": len(fileBytes)}
)]
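Detection order matters here: any `text/*` MIME type short-circuits before the extension list is consulted, and the extension match runs against the full lowercased file name. A minimal sketch, assuming the module file is `extractorText.py`:

```python
from modules.serviceCenter.services.serviceExtraction.extractors.extractorText import TextExtractor  # path assumed

t = TextExtractor()
assert t.detect("notes", "text/markdown", b"")  # MIME wins, no extension needed
assert t.detect("README.MD", "", b"")           # case-insensitive suffix match
assert not t.detect("binary.exe", "", b"")      # not in the extension list
```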

View file

@ -0,0 +1,114 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import io
from datetime import datetime
from ..subUtils import makeId
from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Extractor
class XlsxExtractor(Extractor):
"""
Extractor for Microsoft Excel spreadsheets.
Supported formats:
- MIME types: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
- File extensions: .xlsx, .xlsm
- Special handling: Extracts all sheets as CSV data
- Dependencies: openpyxl
"""
def __init__(self):
self._loaded = False
self._haveLibs = False
def _load(self):
if self._loaded:
return
self._loaded = True
try:
global openpyxl
import openpyxl
self._haveLibs = True
except Exception:
self._haveLibs = False
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
mt = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
return mimeType == mt or (fileName or "").lower().endswith((".xlsx", ".xlsm"))
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions."""
return [".xlsx", ".xlsm"]
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types."""
return ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"]
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
self._load()
parts: List[ContentPart] = []
rootId = makeId()
parts.append(ContentPart(
id=rootId,
parentId=None,
label="xlsx",
typeGroup="container",
mimeType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
data="",
metadata={"size": len(fileBytes)}
))
if not self._haveLibs:
parts.append(ContentPart(
id=makeId(),
parentId=rootId,
label="binary",
typeGroup="binary",
mimeType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
data="",
metadata={"size": len(fileBytes), "warning": "openpyxl not available"}
))
return parts
with io.BytesIO(fileBytes) as buf:
wb = openpyxl.load_workbook(buf, data_only=True)
for sheetName in wb.sheetnames:
ws = wb[sheetName]
# extract rectangular data region by min/max
min_row = ws.min_row
max_row = ws.max_row
min_col = ws.min_column
max_col = ws.max_column
lines: list[str] = []
for r in range(min_row, max_row + 1):
cells: list[str] = []
for c in range(min_col, max_col + 1):
cell = ws.cell(row=r, column=c)
v = cell.value
if v is None:
cells.append("")
elif isinstance(v, (int, float)):
cells.append(str(v))
elif isinstance(v, datetime):
cells.append(v.strftime("%Y-%m-%d %H:%M:%S"))
else:
escaped_value = str(v).replace('"', '""')
cells.append(f'"{escaped_value}"')
lines.append(",".join(cells))
csvData = "\n".join(lines)
parts.append(ContentPart(
id=makeId(),
parentId=rootId,
label=f"sheet_{sheetName}",
typeGroup="table",
mimeType="text/csv",
data=csvData,
metadata={"sheet": sheetName, "size": len(csvData.encode('utf-8'))}
))
return parts
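The per-sheet CSV quotes strings (doubling embedded quotes) and leaves numbers bare. A round-trip sketch, assuming the module file is `extractorXlsx.py` and openpyxl is installed:

```python
import io
import openpyxl
from modules.serviceCenter.services.serviceExtraction.extractors.extractorXlsx import XlsxExtractor  # path assumed

wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Data"
ws.append(["name", "qty"])
ws.append(['He said "hi"', 3])
buf = io.BytesIO()
wb.save(buf)

parts = XlsxExtractor().extract(buf.getvalue(), {})
sheet = parts[1]                                  # parts[0] is the container part
assert sheet.label == "sheet_Data"
assert '"He said ""hi"""' in sheet.data           # strings quoted, quotes doubled
assert sheet.data.splitlines()[1].endswith(",3")  # numbers stay unquoted
```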

View file

@ -0,0 +1,49 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import xml.etree.ElementTree as ET
from modules.datamodels.datamodelExtraction import ContentPart
from ..subUtils import makeId
from ..subRegistry import Extractor
class XmlExtractor(Extractor):
"""
Extractor for XML files.
Supported formats:
- MIME types: application/xml
- File extensions: .xml, .rss, .atom
- Special handling: Uses ElementTree for parsing
"""
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
return mimeType == "application/xml" or (fileName or "").lower().endswith((".xml", ".rss", ".atom"))
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions."""
return [".xml", ".rss", ".atom"]
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types."""
return ["application/xml"]
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
mimeType = context.get("mimeType") or "application/xml"
text = fileBytes.decode("utf-8", errors="replace")
try:
    # Best-effort well-formedness check; parse failures are deliberately
    # ignored so malformed XML is still returned as raw text below.
    ET.fromstring(text)
except Exception:
    pass
return [ContentPart(
id=makeId(),
parentId=None,
label="main",
typeGroup="structure",
mimeType=mimeType,
data=text,
metadata={"size": len(fileBytes)}
)]
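Note that the extractor is deliberately tolerant: `ET.fromstring` is only a well-formedness probe, so malformed XML still comes back as raw text. A minimal sketch, assuming the module file is `extractorXml.py`:

```python
from modules.serviceCenter.services.serviceExtraction.extractors.extractorXml import XmlExtractor  # path assumed

parts = XmlExtractor().extract(b"<feed><item>unclosed", {"fileName": "feed.rss"})
assert parts[0].typeGroup == "structure"
assert parts[0].data == "<feed><item>unclosed"  # returned verbatim despite the parse error
```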

File diff suppressed because it is too large

View file

@ -0,0 +1,2 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

View file

@ -0,0 +1,13 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
from modules.datamodels.datamodelExtraction import ContentPart, MergeStrategy
class DefaultMerger:
def merge(self, parts: List[ContentPart], strategy: MergeStrategy) -> List[ContentPart]:
"""
Default merger that passes through parts unchanged.
Used for image, binary, metadata, container typeGroups.
"""
return parts

View file

@ -0,0 +1,154 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
from modules.datamodels.datamodelExtraction import ContentPart, MergeStrategy
from ..subUtils import makeId
class TableMerger:
def merge(self, parts: List[ContentPart], strategy: MergeStrategy) -> List[ContentPart]:
"""
Merge table parts based on strategy.
Strategy options:
- groupBy: "parentId" (default), "documentId", "sheet", "none"
- maxSize: maximum size per merged part
- combineSheets: bool - whether to combine multiple sheets into one table
"""
if not parts:
return parts
groupBy = strategy.groupBy
maxSize = strategy.maxSize or 0
combineSheets = strategy.tableMerge.get("combineSheets", False) if strategy.tableMerge else False
# Group parts
groups = self._groupParts(parts, groupBy, combineSheets)
merged: List[ContentPart] = []
for groupKey, groupParts in groups.items():
if maxSize > 0:
merged.extend(self._mergeWithSizeLimit(groupParts, maxSize, groupKey))
else:
merged.extend(self._mergeGroup(groupParts, groupKey))
return merged
def _groupParts(self, parts: List[ContentPart], groupBy: str, combineSheets: bool) -> Dict[str, List[ContentPart]]:
groups: Dict[str, List[ContentPart]] = {}
for part in parts:
if part.typeGroup != "table":
# Non-table parts go in their own group
key = f"nontable_{part.id}"
if key not in groups:
groups[key] = []
groups[key].append(part)
continue
if groupBy == "parentId":
key = part.parentId or "root"
elif groupBy == "documentId":
key = part.metadata.get("documentId", "unknown")
elif groupBy == "sheet" and not combineSheets:
key = part.metadata.get("sheet", "unknown")
else: # "none" or combineSheets=True
key = "all_tables"
if key not in groups:
groups[key] = []
groups[key].append(part)
return groups
def _mergeGroup(self, parts: List[ContentPart], groupKey: str) -> List[ContentPart]:
if not parts:
return []
if len(parts) == 1:
return parts
# For tables, we typically keep them separate unless explicitly combining
# But we can add metadata about the group
for i, part in enumerate(parts):
part.metadata["groupKey"] = groupKey
part.metadata["groupIndex"] = i
part.metadata["groupSize"] = len(parts)
return parts
def _mergeWithSizeLimit(self, parts: List[ContentPart], maxSize: int, groupKey: str) -> List[ContentPart]:
if not parts:
return []
# For tables, we typically don't merge across different tables
# Instead, we chunk individual large tables
merged: List[ContentPart] = []
for part in parts:
partSize = part.metadata.get("size", 0)
if partSize <= maxSize:
# Part fits within limit
part.metadata["groupKey"] = groupKey
merged.append(part)
else:
# Chunk the large table
chunks = self._chunkTable(part, maxSize)
merged.extend(chunks)
return merged
def _chunkTable(self, part: ContentPart, maxSize: int) -> List[ContentPart]:
"""Chunk a large table by rows while preserving CSV structure."""
lines = part.data.split('\n')
if not lines:
return [part]
chunks: List[ContentPart] = []
currentChunk: List[str] = []
currentSize = 0
for line in lines:
lineSize = len(line.encode('utf-8')) + 1 # +1 for newline
if currentSize + lineSize > maxSize and currentChunk:
# Flush current chunk
chunkData = '\n'.join(currentChunk)
chunks.append(ContentPart(
id=makeId(),
parentId=part.parentId,
label=f"{part.label}_chunk_{len(chunks)}",
typeGroup="table",
mimeType=part.mimeType,
data=chunkData,
metadata={
"size": len(chunkData.encode('utf-8')),
"chunk": True,
"originalPart": part.id,
"chunkIndex": len(chunks)
}
))
currentChunk = [line]
currentSize = lineSize
else:
currentChunk.append(line)
currentSize += lineSize
# Flush remaining chunk
if currentChunk:
chunkData = '\n'.join(currentChunk)
chunks.append(ContentPart(
id=makeId(),
parentId=part.parentId,
label=f"{part.label}_chunk_{len(chunks)}",
typeGroup="table",
mimeType=part.mimeType,
data=chunkData,
metadata={
"size": len(chunkData.encode('utf-8')),
"chunk": True,
"originalPart": part.id,
"chunkIndex": len(chunks)
}
))
return chunks
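Because the merger only reads `groupBy`, `maxSize`, and `tableMerge` off the strategy, a duck-typed stand-in is enough to sketch the row-chunking path; the `mergers.mergerTable` module path is an assumption:

```python
from types import SimpleNamespace
from modules.datamodels.datamodelExtraction import ContentPart
from modules.serviceCenter.services.serviceExtraction.mergers.mergerTable import TableMerger  # path assumed

strategy = SimpleNamespace(groupBy="parentId", maxSize=40, tableMerge=None)
rows = "\n".join(f"r{i},v{i}" for i in range(20))
table = ContentPart(id="t1", parentId=None, label="sheet_Data", typeGroup="table",
                    mimeType="text/csv", data=rows,
                    metadata={"size": len(rows.encode("utf-8"))})

chunks = TableMerger().merge([table], strategy)
# Each emitted chunk respects the byte budget and points back to the source part.
assert all(c.metadata["size"] <= 40 for c in chunks if c.metadata.get("chunk"))
assert all(c.metadata["originalPart"] == "t1" for c in chunks if c.metadata.get("chunk"))
```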

View file

@ -0,0 +1,138 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
from modules.datamodels.datamodelExtraction import ContentPart, MergeStrategy
from ..subUtils import makeId
class TextMerger:
def merge(self, parts: List[ContentPart], strategy: MergeStrategy) -> List[ContentPart]:
"""
Merge text parts based on strategy.
Strategy options:
- groupBy: "parentId" (default), "documentId", "none"
- orderBy: "label", "pageIndex", "sheetIndex", "none"
- maxSize: maximum size per merged part
"""
if not parts:
return parts
groupBy = strategy.groupBy
orderBy = strategy.orderBy
maxSize = strategy.maxSize or 0
# Group parts
groups = self._groupParts(parts, groupBy)
merged: List[ContentPart] = []
for groupKey, groupParts in groups.items():
# Sort within group
sortedParts = self._sortParts(groupParts, orderBy)
# Merge respecting maxSize
if maxSize > 0:
merged.extend(self._mergeWithSizeLimit(sortedParts, maxSize))
else:
merged.extend(self._mergeGroup(sortedParts, groupKey))
return merged
def _groupParts(self, parts: List[ContentPart], groupBy: str) -> Dict[str, List[ContentPart]]:
groups: Dict[str, List[ContentPart]] = {}
for part in parts:
if part.typeGroup != "text":
# Non-text parts go in their own group
key = f"nontext_{part.id}"
if key not in groups:
groups[key] = []
groups[key].append(part)
continue
if groupBy == "parentId":
key = part.parentId or "root"
elif groupBy == "documentId":
key = part.metadata.get("documentId", "unknown")
else: # "none"
key = "all"
if key not in groups:
groups[key] = []
groups[key].append(part)
return groups
def _sortParts(self, parts: List[ContentPart], orderBy: str) -> List[ContentPart]:
if orderBy == "pageIndex":
return sorted(parts, key=lambda p: p.metadata.get("pageIndex", 0))
elif orderBy == "sheetIndex":
return sorted(parts, key=lambda p: p.metadata.get("sheetIndex", 0))
elif orderBy == "label":
return sorted(parts, key=lambda p: p.label)
else: # "none"
return parts
def _mergeGroup(self, parts: List[ContentPart], groupKey: str) -> List[ContentPart]:
if not parts:
return []
if len(parts) == 1:
return parts
# Merge all text parts in group
textParts = [p for p in parts if p.typeGroup == "text"]
nonTextParts = [p for p in parts if p.typeGroup != "text"]
if not textParts:
return nonTextParts
# Combine text data
combinedData = "\n".join([p.data for p in textParts])
totalSize = sum(p.metadata.get("size", 0) for p in textParts)
mergedPart = ContentPart(
id=makeId(),
parentId=textParts[0].parentId,
label=f"merged_{groupKey}",
typeGroup="text",
mimeType="text/plain",
data=combinedData,
metadata={
"size": totalSize,
"merged": len(textParts),
"originalParts": [p.id for p in textParts]
}
)
return [mergedPart] + nonTextParts
def _mergeWithSizeLimit(self, parts: List[ContentPart], maxSize: int) -> List[ContentPart]:
if not parts:
return []
textParts = [p for p in parts if p.typeGroup == "text"]
nonTextParts = [p for p in parts if p.typeGroup != "text"]
if not textParts:
return nonTextParts
merged: List[ContentPart] = []
currentGroup: List[ContentPart] = []
currentSize = 0
for part in textParts:
partSize = part.metadata.get("size", 0)
if currentSize + partSize > maxSize and currentGroup:
# Flush current group
merged.extend(self._mergeGroup(currentGroup, f"chunk_{len(merged)}"))
currentGroup = [part]
currentSize = partSize
else:
currentGroup.append(part)
currentSize += partSize
# Flush remaining group
if currentGroup:
merged.extend(self._mergeGroup(currentGroup, f"chunk_{len(merged)}"))
return merged + nonTextParts
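The text path re-orders within each group before concatenating. A sketch with two out-of-order pages; the `mergers.mergerText` module path is an assumption:

```python
from types import SimpleNamespace
from modules.datamodels.datamodelExtraction import ContentPart
from modules.serviceCenter.services.serviceExtraction.mergers.mergerText import TextMerger  # path assumed

def page(pid, idx, body):
    return ContentPart(id=pid, parentId="doc1", label=f"page_{idx}", typeGroup="text",
                       mimeType="text/plain", data=body,
                       metadata={"size": len(body), "pageIndex": idx})

strategy = SimpleNamespace(groupBy="parentId", orderBy="pageIndex", maxSize=0)
merged = TextMerger().merge([page("b", 2, "second"), page("a", 1, "first")], strategy)
assert len(merged) == 1
assert merged[0].data == "first\nsecond"  # sorted by pageIndex, then joined
```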

View file

@ -0,0 +1,211 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Intelligent Token-Aware Merger for optimizing AI calls based on LLM token limits.
"""
from typing import List, Dict, Any
import logging
from modules.datamodels.datamodelExtraction import ContentPart
from .subUtils import makeId
logger = logging.getLogger(__name__)
class IntelligentTokenAwareMerger:
"""
Intelligent merger that groups chunks based on LLM token limits to minimize AI calls.
Strategy:
1. Calculate token count for each chunk
2. Group chunks to maximize token usage without exceeding limits
3. Preserve document structure and semantic boundaries
4. Minimize total number of AI calls
"""
def __init__(self, modelCapabilities: Dict[str, Any]):
self.maxTokens = modelCapabilities.get("maxTokens", 4000)
self.safetyMargin = modelCapabilities.get("safetyMargin", 0.1)
self.effectiveMaxTokens = int(self.maxTokens * (1 - self.safetyMargin))
self.charsPerToken = modelCapabilities.get("charsPerToken", 4) # Rough estimation
def mergeChunksIntelligently(self, chunks: List[ContentPart], prompt: str = "") -> List[ContentPart]:
"""
Merge chunks intelligently based on token limits.
Args:
chunks: List of ContentPart chunks to merge
prompt: AI prompt to account for in token calculation
Returns:
List of optimally merged ContentPart objects
"""
if not chunks:
return chunks
logger.info(f"🧠 Intelligent merging: {len(chunks)} chunks, maxTokens={self.effectiveMaxTokens}")
# Calculate tokens for prompt
promptTokens = self._estimateTokens(prompt)
availableTokens = self.effectiveMaxTokens - promptTokens
logger.info(f"📊 Prompt tokens: {promptTokens}, Available for content: {availableTokens}")
# Group chunks by document and type for semantic coherence
groupedChunks = self._groupChunksByDocumentAndType(chunks)
mergedParts = []
for groupKey, groupChunks in groupedChunks.items():
logger.info(f"📁 Processing group: {groupKey} ({len(groupChunks)} chunks)")
# Merge chunks within this group optimally
groupMerged = self._mergeGroupOptimally(groupChunks, availableTokens)
mergedParts.extend(groupMerged)
logger.info(f"✅ Intelligent merging complete: {len(chunks)}{len(mergedParts)} parts")
return mergedParts
def _groupChunksByDocumentAndType(self, chunks: List[ContentPart]) -> Dict[str, List[ContentPart]]:
"""Group chunks by document and type for semantic coherence."""
groups = {}
for chunk in chunks:
# Create group key: document_id + type_group
docId = chunk.metadata.get("documentId", "unknown")
typeGroup = chunk.typeGroup
groupKey = f"{docId}_{typeGroup}"
if groupKey not in groups:
groups[groupKey] = []
groups[groupKey].append(chunk)
return groups
def _mergeGroupOptimally(self, chunks: List[ContentPart], availableTokens: int) -> List[ContentPart]:
"""Merge chunks within a group optimally to minimize AI calls."""
if not chunks:
return []
# Sort chunks by size (smallest first for better packing)
sortedChunks = sorted(chunks, key=lambda c: self._estimateTokens(c.data))
mergedParts = []
currentGroup = []
currentTokens = 0
for chunk in sortedChunks:
chunkTokens = self._estimateTokens(chunk.data)
# Special case: If single chunk is already at max size, process it alone
if chunkTokens >= availableTokens * 0.9: # 90% of available tokens
# Finalize current group if it exists
if currentGroup:
mergedPart = self._createMergedPart(currentGroup, currentTokens)
mergedParts.append(mergedPart)
currentGroup = []
currentTokens = 0
# Process large chunk individually
mergedParts.append(chunk)
logger.debug(f"🔍 Large chunk processed individually: {chunkTokens} tokens")
continue
# If adding this chunk would exceed limit, finalize current group
if currentTokens + chunkTokens > availableTokens and currentGroup:
mergedPart = self._createMergedPart(currentGroup, currentTokens)
mergedParts.append(mergedPart)
currentGroup = [chunk]
currentTokens = chunkTokens
else:
currentGroup.append(chunk)
currentTokens += chunkTokens
# Finalize remaining group
if currentGroup:
mergedPart = self._createMergedPart(currentGroup, currentTokens)
mergedParts.append(mergedPart)
logger.info(f"📦 Group merged: {len(chunks)}{len(mergedParts)} parts")
return mergedParts
def _createMergedPart(self, chunks: List[ContentPart], totalTokens: int) -> ContentPart:
"""Create a merged ContentPart from multiple chunks."""
if len(chunks) == 1:
return chunks[0] # No need to merge single chunk
# Combine data with semantic separators
combinedData = self._combineChunkData(chunks)
# Use metadata from first chunk as base
baseChunk = chunks[0]
mergedMetadata = baseChunk.metadata.copy()
mergedMetadata.update({
"merged": True,
"originalChunkCount": len(chunks),
"totalTokens": totalTokens,
"originalChunkIds": [c.id for c in chunks],
"size": len(combinedData.encode('utf-8'))
})
mergedPart = ContentPart(
id=makeId(),
parentId=baseChunk.parentId,
label=f"merged_{len(chunks)}_chunks",
typeGroup=baseChunk.typeGroup,
mimeType=baseChunk.mimeType,
data=combinedData,
metadata=mergedMetadata
)
logger.debug(f"🔗 Created merged part: {len(chunks)} chunks, {totalTokens} tokens")
return mergedPart
def _combineChunkData(self, chunks: List[ContentPart]) -> str:
"""Combine chunk data with appropriate separators."""
if not chunks:
return ""
# Use different separators based on content type
if chunks[0].typeGroup == "text":
separator = "\n\n---\n\n" # Clear text separation
elif chunks[0].typeGroup == "table":
separator = "\n\n[TABLE BREAK]\n\n" # Table separation
else:
separator = "\n\n---\n\n" # Default separation
return separator.join([chunk.data for chunk in chunks])
def _estimateTokens(self, text: str) -> int:
"""Estimate token count for text."""
if not text:
return 0
return len(text) // self.charsPerToken
def calculateOptimizationStats(self, originalChunks: List[ContentPart], mergedParts: List[ContentPart]) -> Dict[str, Any]:
"""Calculate optimization statistics with detailed analysis."""
originalCalls = len(originalChunks)
optimizedCalls = len(mergedParts)
reductionPercent = ((originalCalls - optimizedCalls) / originalCalls * 100) if originalCalls > 0 else 0
# Analyze chunk sizes
largeChunks = [c for c in originalChunks if self._estimateTokens(c.data) >= self.effectiveMaxTokens * 0.9]
smallChunks = [c for c in originalChunks if self._estimateTokens(c.data) < self.effectiveMaxTokens * 0.9]
# Calculate theoretical maximum optimization (if all small chunks could be merged)
theoreticalMinCalls = len(largeChunks) + max(1, len(smallChunks) // 3) # Assume 3 small chunks per call
theoreticalReduction = ((originalCalls - theoreticalMinCalls) / originalCalls * 100) if originalCalls > 0 else 0
return {
"original_ai_calls": originalCalls,
"optimized_ai_calls": optimizedCalls,
"reduction_percent": round(reductionPercent, 1),
"cost_savings": f"{reductionPercent:.1f}%",
"efficiency_gain": f"{originalCalls / optimizedCalls:.1f}x" if optimizedCalls > 0 else "",
"analysis": {
"large_chunks": len(largeChunks),
"small_chunks": len(smallChunks),
"theoretical_min_calls": theoreticalMinCalls,
"theoretical_reduction": round(theoreticalReduction, 1),
"optimization_potential": "high" if reductionPercent > 50 else "moderate" if reductionPercent > 20 else "low"
}
}
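The token arithmetic is easy to check by hand: 4000 tokens with a 10% safety margin leaves 3600 effective tokens, and at roughly 4 characters per token a 2000-character chunk is estimated at 500 tokens. A minimal sketch; the `subIntelligentMerger` module name is an assumption:

```python
from modules.serviceCenter.services.serviceExtraction.subIntelligentMerger import IntelligentTokenAwareMerger  # module name assumed

merger = IntelligentTokenAwareMerger({"maxTokens": 4000, "safetyMargin": 0.1, "charsPerToken": 4})
assert merger.effectiveMaxTokens == 3600
assert merger._estimateTokens("x" * 2000) == 500
```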

View file

@ -0,0 +1,48 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import List
import logging
from modules.datamodels.datamodelExtraction import ContentExtracted, ContentPart, ExtractionOptions, MergeStrategy
from .subUtils import makeId
from .subRegistry import ExtractorRegistry, ChunkerRegistry
logger = logging.getLogger(__name__)
# REMOVED: _mergeParts function - unused, functionality replaced by applyMerging in interfaceAiObjects.py
def runExtraction(extractorRegistry: ExtractorRegistry, chunkerRegistry: ChunkerRegistry, documentBytes: bytes, fileName: str, mimeType: str, options: ExtractionOptions) -> ContentExtracted:
extractor = extractorRegistry.resolve(mimeType, fileName)
if extractor is None:
# fallback: single binary part
part = ContentPart(
id=makeId(),
parentId=None,
label="file",
typeGroup="binary",
mimeType=mimeType or "application/octet-stream",
data="",
metadata={"warning": "No extractor registered"}
)
return ContentExtracted(id=makeId(), parts=[part])
parts = extractor.extract(documentBytes, {"fileName": fileName, "mimeType": mimeType})
# REMOVED: poolAndLimit(parts, chunkerRegistry, options)
# REMOVED: Chunking logic - now handled in AI call phase
# Apply merging strategy if provided (preserve existing logic)
if options.mergeStrategy:
# Use module-level applyMerging function
from .mainServiceExtraction import applyMerging
parts = applyMerging(parts, options.mergeStrategy)
return ContentExtracted(id=makeId(), parts=parts)
# REMOVED: poolAndLimit function - chunking now handled in AI call phase
# REMOVED: applyMerging function - moved to interfaceAiObjects.py for proper interface-level access
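Since `runExtraction` only reads `options.mergeStrategy`, a duck-typed options object is enough for a smoke test; the `subExtraction` module name is an assumption:

```python
from types import SimpleNamespace
from modules.serviceCenter.services.serviceExtraction.subRegistry import ExtractorRegistry, ChunkerRegistry
from modules.serviceCenter.services.serviceExtraction.subExtraction import runExtraction  # module name assumed

options = SimpleNamespace(mergeStrategy=None)
result = runExtraction(ExtractorRegistry(), ChunkerRegistry(),
                       b"hello world", "notes.txt", "text/plain", options)
assert result.parts[0].typeGroup == "text"  # resolved to the TextExtractor
```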

View file

@ -0,0 +1,214 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Prompt builder for document extraction.
This module builds prompts for extracting content from documents.
"""
import json
import logging
from typing import Dict, Any, Optional
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
# Type hint for renderer parameter
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from modules.serviceCenter.services.serviceGeneration.renderers.documentRendererBaseTemplate import BaseRenderer
_RendererLike = BaseRenderer
else:
_RendererLike = Any
logger = logging.getLogger(__name__)
async def buildExtractionPrompt(
outputFormat: str,
userPrompt: str,
title: str,
aiService=None,
services=None,
renderer: Optional[_RendererLike] = None
) -> str:
"""
Build unified extraction prompt for extracting content from documents.
Always uses multi-file format (single doc = multi with n=1).
Args:
outputFormat: Target output format
userPrompt: User's prompt describing what to extract
title: Document title
aiService: Optional AI service for intent parsing
services: Services instance
renderer: Optional renderer for format-specific guidelines
Returns:
Complete extraction prompt string
"""
# Flat extraction format - returns extracted content as structured data, not documents/sections
# This format allows merging multiple contentParts into one response
json_example = {
"extracted_content": {
"text": "Extracted text content from the document...",
"tables": [
{
"headers": ["Column 1", "Column 2"],
"rows": [
["Value 1", "Value 2"],
["Value 3", "Value 4"]
]
}
],
"headings": [
{
"level": 1,
"text": "Main Heading"
},
{
"level": 2,
"text": "Subheading"
}
],
"lists": [
{
"type": "bullet",
"items": ["Item 1", "Item 2", "Item 3"]
}
],
"images": [
{
"description": "Description of image content, including all visible text, tables, and visual elements"
}
]
}
}
structure_instruction = """CRITICAL EXTRACTION REQUIREMENTS:
1. Extract content from the provided ContentPart(s) - process what is provided in this call
2. If this ContentPart contains tables, extract them with proper structure (headers and rows)
3. If this ContentPart contains text, extract it as structured text
4. Return ONE JSON object with extracted content from this ContentPart
5. Preserve all original data - do not summarize or interpret
6. The system will merge results from multiple ContentParts automatically - focus on extracting this ContentPart's content accurately"""
# Parse extraction intent if AI service is available
extraction_intent = await _parseExtractionIntent(userPrompt, outputFormat, aiService, services) if aiService else userPrompt
# Extract user language for document language instruction
userLanguage = 'en' # Default fallback
if services:
try:
# Prefer detected language if available
if hasattr(services, 'currentUserLanguage') and services.currentUserLanguage:
userLanguage = services.currentUserLanguage
elif hasattr(services, 'user') and services.user and hasattr(services.user, 'language'):
userLanguage = services.user.language
except Exception:
pass
# Build base prompt with clear user prompt markers
sanitized_user_prompt = services.utils.sanitizePromptContent(userPrompt, 'userinput') if services else userPrompt
adaptive_prompt = f"""
{'='*80}
USER REQUEST / USER PROMPT:
{'='*80}
{sanitized_user_prompt}
{'='*80}
END OF USER REQUEST / USER PROMPT
{'='*80}
You are a document processing assistant that extracts content from documents. Your task is to analyze the provided ContentPart(s) and extract their content into a structured JSON format.
TASK: Extract content from the provided ContentPart(s). Extract all tables, text, headings, lists, and other content types accurately. The system processes ContentParts individually and merges results automatically.
LANGUAGE REQUIREMENT: All extracted content must be in the language '{userLanguage}'. Extract and preserve content in this language.
{extraction_intent}
{structure_instruction}
OUTPUT FORMAT: Return only valid JSON in this exact structure:
{json.dumps(json_example, indent=2)}
CRITICAL EXTRACTION RULES:
- Extract only content that is ACTUALLY PRESENT in the ContentPart - never create fake or placeholder data
- Return empty arrays [] or empty strings "" when content is missing - this is normal and expected
- Extract all tables, text, headings, lists accurately with proper structure
- Preserve all original data - do not summarize or interpret
- Return ONE JSON object per ContentPart (the system merges multiple ContentParts automatically)
Content Types to Extract:
1. Tables: Extract all rows and columns with proper headers
2. Lists: Extract all items with proper nesting
3. Headings: Extract with appropriate levels
4. Paragraphs: Extract as structured text
5. Code: Extract code blocks with language identification
6. Images: Analyze images and describe all visible content including text, tables, logos, graphics, layout, and visual elements
Image Analysis Requirements:
- If you cannot analyze an image for any reason, explain why in the JSON response
- Describe everything you see in the image
- Include all text content, tables, logos, graphics, layout, and visual elements
- If the image is too small, corrupted, or unclear, explain this
- Always provide feedback - never return empty responses
Return only the JSON structure with actual data from the documents. Do not include any text before or after the JSON.
Extract only actual content from the ContentPart. Return empty arrays/strings when content is missing - never create fake data.
""".strip()
# Add renderer-specific guidelines if provided
if renderer:
try:
if hasattr(renderer, 'getExtractionGuidelines'):
formatGuidelines = renderer.getExtractionGuidelines()
adaptive_prompt = f"{adaptive_prompt}\n\n{formatGuidelines}".strip()
except Exception:
pass
# Save extraction prompt to debug file - only if debug enabled
from modules.shared.debugLogger import writeDebugFile
writeDebugFile(adaptive_prompt, "extraction_prompt")
return adaptive_prompt
async def _parseExtractionIntent(userPrompt: str, outputFormat: str, aiService=None, services=None) -> str:
"""
Parse user prompt to extract the core extraction intent.
"""
if not aiService:
return f"Extract content from the provided documents and create a {outputFormat} report."
try:
analysis_prompt = f"""
Analyze this user request and extract the core extraction intent:
User request: "{userPrompt}"
Target format: {outputFormat}
Extract the main intent and requirements for document processing. Focus on:
1. What content needs to be extracted
2. How it should be organized
3. Any specific requirements or preferences
Respond with a clear, concise statement of the extraction intent.
"""
request_options = AiCallOptions()
request_options.operationType = OperationTypeEnum.DATA_GENERATE
request = AiCallRequest(prompt=analysis_prompt, context="", options=request_options)
response = await aiService.aiObjects.call(request)
if response and response.content:
return response.content.strip()
else:
return f"Extract content from the provided documents and create a {outputFormat} report."
except Exception as e:
    if services:
        services.utils.debugLogToFile(f"Extraction intent analysis failed: {str(e)}", "PROMPT_BUILDER")
    else:
        logger.warning(f"Extraction intent analysis failed: {str(e)}")
    return f"Extract content from the provided documents and create a {outputFormat} report."

View file

@ -0,0 +1,208 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, Optional
import logging
from modules.datamodels.datamodelExtraction import ContentPart
logger = logging.getLogger(__name__)
class Extractor:
"""
Base class for all document extractors.
Each extractor should implement:
- detect(): Check if this extractor can handle the given file
- extract(): Extract content from the file
- getSupportedExtensions(): Return supported file extensions
- getSupportedMimeTypes(): Return supported MIME types
"""
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
"""Check if this extractor can handle the given file."""
return False
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> list[ContentPart]:
"""Extract content from the file bytes."""
raise NotImplementedError
def getSupportedExtensions(self) -> list[str]:
"""Return list of supported file extensions (including dots)."""
return []
def getSupportedMimeTypes(self) -> list[str]:
"""Return list of supported MIME types."""
return []
class Chunker:
def chunk(self, part: ContentPart, options: Dict[str, Any]) -> list[Dict[str, Any]]:
return []
class ExtractorRegistry:
def __init__(self):
self._map: Dict[str, Extractor] = {}
self._fallback: Optional[Extractor] = None
self._auto_discover_extractors()
def _auto_discover_extractors(self):
"""Auto-discover and register all extractors from the extractors directory."""
try:
import os
import importlib
from pathlib import Path
# Get the extractors directory
current_dir = Path(__file__).parent
extractors_dir = current_dir / "extractors"
if not extractors_dir.exists():
logger.error(f"Extractors directory not found: {extractors_dir}")
return
# Import all extractor modules
extractor_modules = []
for file_path in extractors_dir.glob("extractor*.py"):
if file_path.name == "__init__.py":
continue
module_name = file_path.stem
try:
# Import the module
module = importlib.import_module(f".{module_name}", package="modules.serviceCenter.services.serviceExtraction.extractors")
# Find all extractor classes in the module
for attr_name in dir(module):
attr = getattr(module, attr_name)
if (isinstance(attr, type) and
issubclass(attr, Extractor) and
attr != Extractor and
not attr_name.startswith('_')):
# Create instance and auto-register
extractor_instance = attr()
self._auto_register_extractor(extractor_instance)
extractor_modules.append(attr_name)
except Exception as e:
logger.warning(f"Failed to import {module_name}: {str(e)}")
continue
# Set fallback extractor
try:
from .extractors.extractorBinary import BinaryExtractor
self.setFallback(BinaryExtractor())
except Exception as e:
logger.warning(f"Failed to set fallback extractor: {str(e)}")
logger.info(f"ExtractorRegistry: Auto-discovered and registered {len(extractor_modules)} extractor classes: {', '.join(extractor_modules)}")
logger.info(f"ExtractorRegistry: Total registered formats: {len(self._map)}")
except Exception as e:
logger.error(f"ExtractorRegistry: Failed to auto-discover extractors: {str(e)}")
import traceback
traceback.print_exc()
def _auto_register_extractor(self, extractor: Extractor):
"""Auto-register an extractor based on its declared supported formats."""
try:
# Register MIME types
mime_types = extractor.getSupportedMimeTypes()
for mime_type in mime_types:
self.register(mime_type, extractor)
# Register file extensions
extensions = extractor.getSupportedExtensions()
for ext in extensions:
# Remove leading dot for registry key
ext_key = ext.lstrip('.')
self.register(ext_key, extractor)
except Exception as e:
logger.error(f"Failed to auto-register {extractor.__class__.__name__}: {str(e)}")
def register(self, key: str, extractor: Extractor):
self._map[key] = extractor
def setFallback(self, extractor: Extractor):
self._fallback = extractor
def resolve(self, mimeType: str, fileName: str) -> Optional[Extractor]:
if mimeType in self._map:
return self._map[mimeType]
# simple extension fallback (guard against a None fileName)
if fileName and "." in fileName:
ext = fileName.lower().rsplit(".", 1)[-1]
if ext in self._map:
return self._map[ext]
return self._fallback
def getAllSupportedFormats(self) -> Dict[str, Dict[str, list[str]]]:
"""
Get all supported formats from all registered extractors.
Returns:
Dictionary with format information:
{
"extensions": {
"extractor_name": [".ext1", ".ext2", ...]
},
"mime_types": {
"extractor_name": ["mime/type1", "mime/type2", ...]
}
}
"""
formats = {"extensions": {}, "mime_types": {}}
# Get formats from registered extractors
for key, extractor in self._map.items():
if hasattr(extractor, 'getSupportedExtensions'):
extensions = extractor.getSupportedExtensions()
if extensions:
formats["extensions"][key] = extensions
if hasattr(extractor, 'getSupportedMimeTypes'):
mime_types = extractor.getSupportedMimeTypes()
if mime_types:
formats["mime_types"][key] = mime_types
# Add fallback extractor info
if self._fallback and hasattr(self._fallback, 'getSupportedExtensions'):
formats["extensions"]["fallback"] = self._fallback.getSupportedExtensions()
if self._fallback and hasattr(self._fallback, 'getSupportedMimeTypes'):
formats["mime_types"]["fallback"] = self._fallback.getSupportedMimeTypes()
return formats
class ChunkerRegistry:
def __init__(self):
self._map: Dict[str, Chunker] = {}
self._noop = Chunker()
# Register default chunkers
try:
from .chunking.chunkerText import TextChunker
from .chunking.chunkerTable import TableChunker
from .chunking.chunkerStructure import StructureChunker
from .chunking.chunkerImage import ImageChunker
self.register("text", TextChunker())
self.register("table", TableChunker())
self.register("structure", StructureChunker())
self.register("image", ImageChunker())
# Use text chunker for container and binary content
self.register("container", TextChunker())
self.register("binary", TextChunker())
except Exception as e:
logger.error(f"ChunkerRegistry: Failed to register chunkers: {str(e)}")
import traceback
traceback.print_exc()
def register(self, typeGroup: str, chunker: Chunker):
self._map[typeGroup] = chunker
def resolve(self, typeGroup: str) -> Chunker:
return self._map.get(typeGroup, self._noop)
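Resolution order in `ExtractorRegistry.resolve` is: exact MIME-type key, then lowercased file extension, then the binary fallback. A sketch, assuming the SQL extractor registered `text/x-sql` and `sql` as shown earlier in this commit:

```python
from modules.serviceCenter.services.serviceExtraction.subRegistry import ExtractorRegistry

reg = ExtractorRegistry()
byMime = reg.resolve("text/x-sql", "ignored.bin")
byExt = reg.resolve("application/unknown", "report.SQL")
assert byMime is byExt                                                # same registered instance
assert reg.resolve("application/unknown", "noext") is reg._fallback   # BinaryExtractor
```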

View file

@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
import uuid
def makeId() -> str:
return str(uuid.uuid4())

View file

@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Generation service."""
from .mainServiceGeneration import GenerationService
__all__ = ["GenerationService"]

View file

@ -0,0 +1,616 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
import logging
import uuid
import base64
import traceback
from typing import Any, Dict, List, Optional, Callable
from modules.datamodels.datamodelDocument import RenderedDocument
from modules.datamodels.datamodelChat import ChatDocument
from .subDocumentUtility import (
getFileExtension,
getMimeTypeFromExtension,
detectMimeTypeFromContent,
detectMimeTypeFromData,
convertDocumentDataToString
)
logger = logging.getLogger(__name__)
class _ServicesAdapter:
"""Adapter providing Services-like interface from (context, get_service)."""
def __init__(self, context, get_service: Callable[[str], Any]):
self._context = context
self._get_service = get_service
self.user = context.user
self.mandateId = context.mandate_id
self.featureInstanceId = context.feature_instance_id
self.workflow = context.workflow
chat = get_service("chat")
self.interfaceDbComponent = chat.interfaceDbComponent
self.interfaceDbChat = chat.interfaceDbChat
@property
def chat(self):
return self._get_service("chat")
@property
def utils(self):
return self._get_service("utils")
@property
def ai(self):
return self._get_service("ai")
class GenerationService:
def __init__(self, context, get_service: Callable[[str], Any]):
"""Initialize with ServiceCenterContext and service resolver."""
self.services = _ServicesAdapter(context, get_service)
self._get_service = get_service
self.interfaceDbComponent = self.services.interfaceDbComponent
self.interfaceDbChat = self.services.interfaceDbChat
def processActionResultDocuments(self, actionResult, action) -> List[Dict[str, Any]]:
"""
Process documents produced by AI actions and convert them to ChatDocument format.
This function handles AI-generated document data, not document references.
Returns a list of processed document dictionaries.
"""
try:
# Read documents from the standard documents field (not data.documents)
documents = actionResult.documents if actionResult and hasattr(actionResult, 'documents') else []
if not documents:
return []
# Process each document from the AI action result
processedDocuments = []
for doc in documents:
processedDoc = self.processSingleDocument(doc, action)
if processedDoc:
processedDocuments.append(processedDoc)
return processedDocuments
except Exception as e:
logger.error(f"Error processing action result documents: {str(e)}")
return []
def processSingleDocument(self, doc: Any, action) -> Optional[Dict[str, Any]]:
"""Process a single document from action result with simplified logic"""
try:
# ActionDocument objects have documentName, documentData, and mimeType
mime_type = doc.mimeType
if mime_type == "application/octet-stream":
content = doc.documentData
# Detect the MIME type from file content without going through the service center
mime_type = detectMimeTypeFromContent(content, doc.documentName)
# IMPORTANT: For ActionDocuments with validationMetadata (e.g. context.extractContent)
# we must serialize the whole ActionDocument, not just documentData
document_data = doc.documentData
if hasattr(doc, 'validationMetadata') and doc.validationMetadata:
# If validationMetadata is present, serialize the full ActionDocument format
if mime_type == "application/json":
# Build the ActionDocument format with validationMetadata and documentData
if hasattr(document_data, 'model_dump'):
# Pydantic v2
document_data_dict = document_data.model_dump()
elif hasattr(document_data, 'dict'):
# Pydantic v1
document_data_dict = document_data.dict()
elif isinstance(document_data, dict):
document_data_dict = document_data
elif isinstance(document_data, str):
# JSON string: parse and store as a dict (e.g. from outlook.composeAndDraftEmailWithContext)
import json
try:
document_data_dict = json.loads(document_data)
except json.JSONDecodeError:
# Not valid JSON - store as plain text
document_data_dict = {"data": document_data}
else:
document_data_dict = {"data": str(document_data)}
# Build the ActionDocument format
document_data = {
"validationMetadata": doc.validationMetadata,
"documentData": document_data_dict
}
return {
'fileName': doc.documentName,
'fileSize': len(str(document_data)),
'mimeType': mime_type,
'content': document_data,
'document': doc
}
except Exception as e:
logger.error(f"Error processing single document: {str(e)}")
return None
def createDocumentsFromActionResult(self, actionResult, action, workflow, message_id=None) -> List[Any]:
"""
Create actual document objects from action result and store them in the system.
Returns a list of created document objects with proper workflow context.
"""
try:
processed_docs = self.processActionResultDocuments(actionResult, action)
createdDocuments = []
for i, doc_data in enumerate(processed_docs):
try:
documentName = doc_data['fileName']
documentData = doc_data['content']
mimeType = doc_data['mimeType']
# Handle binary data (images, PDFs, Office docs) differently from text
# Check if this is a binary MIME type
binaryMimeTypes = {
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"application/vnd.openxmlformats-officedocument.presentationml.presentation",
"application/pdf",
"image/png", "image/jpeg", "image/jpg", "image/gif", "image/webp", "image/bmp", "image/svg+xml",
}
isBinaryMimeType = mimeType in binaryMimeTypes
base64encoded = False
content = None
if isBinaryMimeType:
# For binary data, handle bytes vs base64 string vs regular string
if isinstance(documentData, bytes):
# Already bytes - encode to base64 string for storage
# base64 is already imported at module level
content = base64.b64encode(documentData).decode('utf-8')
base64encoded = True
elif isinstance(documentData, str):
# Check if it's already valid base64
# base64 is already imported at module level
try:
# Try to decode to verify it's base64
base64.b64decode(documentData, validate=True)
# Valid base64 - use as is
content = documentData
base64encoded = True
except Exception:
# Not valid base64 - might be raw string, try encoding
try:
content = base64.b64encode(documentData.encode('utf-8')).decode('utf-8')
base64encoded = True
except Exception:
logger.warning(f"Could not process binary data for {documentName}, skipping")
continue
else:
# Other types - convert to string then base64
# base64 is already imported at module level
try:
content = base64.b64encode(str(documentData).encode('utf-8')).decode('utf-8')
base64encoded = True
except Exception:
logger.warning(f"Could not encode binary data for {documentName}, skipping")
continue
else:
# Text data - convert to string
content = convertDocumentDataToString(documentData, getFileExtension(documentName))
# Skip empty or minimal content
minimalContentPatterns = ['{}', '[]', 'null', '""', "''"]
if not content or content.strip() == "" or content.strip() in minimalContentPatterns:
logger.warning(f"Empty or minimal content for document {documentName}, skipping")
continue
# Normalize file extension based on mime type if missing or incorrect
try:
mime_to_ext = {
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
"application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
"application/pdf": ".pdf",
"text/html": ".html",
"text/markdown": ".md",
"text/plain": ".txt",
"application/json": ".json",
"image/png": ".png",
"image/jpeg": ".jpg",
"image/jpg": ".jpg",
"image/gif": ".gif",
"image/webp": ".webp",
"image/bmp": ".bmp",
"image/svg+xml": ".svg",
}
expectedExt = mime_to_ext.get(mimeType)
if expectedExt:
if not documentName.lower().endswith(expectedExt):
# Append/replace extension to match mime type
if "." in documentName:
documentName = documentName.rsplit(".", 1)[0] + expectedExt
else:
documentName = documentName + expectedExt
except Exception:
pass
# Create document with file in one step using interfaces directly
document = self._createDocument(
fileName=documentName,
mimeType=mimeType,
content=content,
base64encoded=base64encoded,
messageId=message_id
)
if document:
# Set workflow context on the document if possible
self._setDocumentWorkflowContext(document, action, workflow)
createdDocuments.append(document)
else:
logger.error(f"Failed to create ChatDocument object for {documentName}")
except Exception as e:
logger.error(f"Error creating document {doc_data.get('fileName', 'unknown')}: {str(e)}")
continue
return createdDocuments
except Exception as e:
logger.error(f"Error creating documents from action result: {str(e)}")
return []
def _setDocumentWorkflowContext(self, document, action, workflow):
"""Set workflow context on a document for proper routing and labeling"""
try:
# Get current workflow context directly from workflow object
workflowContext = self._getWorkflowContext(workflow)
workflowStats = self._getWorkflowStats(workflow)
currentRound = workflowContext.get('currentRound', 0)
currentTask = workflowContext.get('currentTask', 0)
currentAction = workflowContext.get('currentAction', 0)
# Try to set workflow context attributes if they exist
if hasattr(document, 'roundNumber'):
document.roundNumber = currentRound
if hasattr(document, 'taskNumber'):
document.taskNumber = currentTask
if hasattr(document, 'actionNumber'):
document.actionNumber = currentAction
if hasattr(document, 'actionId'):
document.actionId = action.id if hasattr(action, 'id') else None
# Set additional workflow metadata if available
if hasattr(document, 'workflowId'):
document.workflowId = workflowStats.get('workflowId', workflow.id if hasattr(workflow, 'id') else None)
if hasattr(document, 'workflowStatus'):
document.workflowStatus = workflowStats.get('workflowStatus', workflow.status if hasattr(workflow, 'status') else 'unknown')
except Exception as e:
logger.warning(f"Could not set workflow context on document: {str(e)}")
def _createDocument(self, fileName: str, mimeType: str, content: str, base64encoded: bool = True, messageId: Optional[str] = None) -> Optional[ChatDocument]:
"""Create file and ChatDocument using interfaces without service indirection."""
try:
if not self.interfaceDbComponent:
logger.error("Component interface not available for document creation")
return None
# Convert content to bytes
if base64encoded:
# base64 is already imported at module level
content_bytes = base64.b64decode(content)
else:
content_bytes = content.encode('utf-8')
# Create file and store data
file_item = self.interfaceDbComponent.createFile(
name=fileName,
mimeType=mimeType,
content=content_bytes
)
self.interfaceDbComponent.createFileData(file_item.id, content_bytes)
# Collect file info
file_info = self._getFileInfo(file_item.id)
if not file_info:
logger.error(f"Could not get file info for fileId: {file_item.id}")
return None
# Build ChatDocument
document = ChatDocument(
id=str(uuid.uuid4()),
messageId=messageId or "",
fileId=file_item.id,
fileName=file_info.get("fileName", fileName),
fileSize=file_info.get("size", 0),
mimeType=file_info.get("mimeType", mimeType)
)
# Ensure document can access component interface later
if hasattr(document, 'setComponentInterface') and self.interfaceDbComponent:
try:
document.setComponentInterface(self.interfaceDbComponent)
except Exception:
pass
return document
except Exception as e:
logger.error(f"Error creating document: {str(e)}")
return None
def _getFileInfo(self, fileId: str) -> Optional[Dict[str, Any]]:
try:
if not self.interfaceDbComponent:
return None
file_item = self.interfaceDbComponent.getFile(fileId)
if file_item:
return {
"id": file_item.id,
"fileName": file_item.fileName,
"size": file_item.fileSize,
"mimeType": file_item.mimeType,
"fileHash": getattr(file_item, 'fileHash', None),
"creationDate": getattr(file_item, 'creationDate', None)
}
return None
except Exception as e:
logger.error(f"Error getting file info for {fileId}: {str(e)}")
return None
def _getWorkflowContext(self, workflow) -> Dict[str, int]:
try:
return {
'currentRound': getattr(workflow, 'currentRound', 0),
'currentTask': getattr(workflow, 'currentTask', 0),
'currentAction': getattr(workflow, 'currentAction', 0)
}
except Exception:
return {'currentRound': 0, 'currentTask': 0, 'currentAction': 0}
def _getWorkflowStats(self, workflow) -> Dict[str, Any]:
try:
context = self._getWorkflowContext(workflow)
return {
'currentRound': context['currentRound'],
'currentTask': context['currentTask'],
'currentAction': context['currentAction'],
'totalTasks': getattr(workflow, 'totalTasks', 0),
'totalActions': getattr(workflow, 'totalActions', 0),
'workflowStatus': getattr(workflow, 'status', 'unknown'),
'workflowId': getattr(workflow, 'id', 'unknown')
}
except Exception:
return {
'currentRound': 0,
'currentTask': 0,
'currentAction': 0,
'totalTasks': 0,
'totalActions': 0,
'workflowStatus': 'unknown',
'workflowId': 'unknown'
}
async def renderReport(self, extractedContent: Dict[str, Any], outputFormat: str, language: str, title: str, userPrompt: str = None, aiService=None, parentOperationId: Optional[str] = None) -> List[RenderedDocument]:
"""
Render extracted JSON content to the specified output format.
Processes EACH document separately and calls renderer for each.
Each renderer can return 1..n documents (e.g., HTML + images).
Per-document format and language are extracted from structure (validated in State 3).
Multiple documents can have different formats and languages.
Args:
extractedContent: Structured JSON document with documents array
outputFormat: Target format (html, pdf, docx, txt, md, json, csv, xlsx) - Global fallback
language: Language (global fallback) - Per-document language extracted from structure
title: Report title
userPrompt: User's original prompt for report generation
aiService: AI service instance for generation prompt creation
parentOperationId: Optional parent operation ID for hierarchical logging
Returns:
List of RenderedDocument objects.
Each RenderedDocument represents one rendered file (main document or supporting file)
"""
try:
# Validate JSON input
if not isinstance(extractedContent, dict):
raise ValueError("extractedContent must be a JSON dictionary")
# Unified approach: Always expect "documents" array
if "documents" not in extractedContent:
raise ValueError("extractedContent must contain 'documents' array")
documents = extractedContent["documents"]
if len(documents) == 0:
raise ValueError("No documents found in 'documents' array")
metadata = extractedContent.get("metadata", {})
allRenderedDocuments = []
# Process EACH document separately
for docIndex, doc in enumerate(documents):
if not isinstance(doc, dict):
logger.warning(f"Skipping invalid document at index {docIndex}")
continue
if "sections" not in doc:
logger.warning(f"Document {doc.get('id', docIndex)} has no sections, skipping")
continue
# Determine format for this document
# Check outputFormat field first (per-document), then format field (legacy), then global fallback
docFormat = doc.get("outputFormat") or doc.get("format") or outputFormat
# Determine language for this document
# Extract per-document language from structure (validated in State 3), fallback to global
docLanguage = doc.get("language") or language
# Validate language format (should be 2-character ISO code, validated in State 3)
if not isinstance(docLanguage, str) or len(docLanguage) != 2:
logger.warning(f"Document {doc.get('id')} has invalid language format: {docLanguage}, using fallback")
docLanguage = language # Use global fallback
# Get renderer for this document's format
renderer = self._getFormatRenderer(docFormat)
if not renderer:
logger.warning(f"Unsupported format '{docFormat}' for document {doc.get('id', docIndex)}, skipping")
continue
# Check output style classification (code/document/image/etc.) from renderer
from .renderers.registry import getOutputStyle
outputStyle = getOutputStyle(docFormat)
if outputStyle:
logger.debug(f"Document {doc.get('id', docIndex)} format '{docFormat}' classified as '{outputStyle}' style")
# Store style in document metadata for potential use in processing paths
if "metadata" not in doc:
doc["metadata"] = {}
doc["metadata"]["outputStyle"] = outputStyle
# Create JSON structure with single document (preserving metadata)
singleDocContent = {
"metadata": {**metadata, "language": docLanguage}, # Add per-document language to metadata
"documents": [doc] # Only this document
}
# Use document title or fallback to provided title
docTitle = doc.get("title", title)
# Render this document (can return multiple files, e.g., HTML + images)
renderedDocs = await renderer.render(singleDocContent, docTitle, userPrompt, aiService)
allRenderedDocuments.extend(renderedDocs)
logger.info(f"Rendered {len(documents)} document(s) into {len(allRenderedDocuments)} file(s)")
return allRenderedDocuments
except Exception as e:
logger.error(f"Error rendering JSON report to {outputFormat}: {str(e)}")
raise
async def generateDocumentWithTwoPhases(
self,
userPrompt: str,
cachedContent: Optional[Dict[str, Any]] = None,
contentParts: Optional[List[Any]] = None,
maxSectionLength: int = 500,
parallelGeneration: bool = True,
progressCallback: Optional[Callable] = None
) -> Dict[str, Any]:
"""
Generate document using two-phase approach:
1. Generate structure skeleton with empty sections
2. Generate content for each section iteratively
This is the core logic for document generation in AI calls.
Args:
userPrompt: User's original prompt
cachedContent: Optional extracted content cache (from extraction phase)
contentParts: Optional list of ContentParts to use for structure generation
maxSectionLength: Maximum words for simple sections
parallelGeneration: Enable parallel section generation
progressCallback: Optional callback function(progress, total, message) for progress updates
Returns:
Complete document structure with populated elements ready for rendering
"""
try:
from .subStructureGenerator import StructureGenerator
from .subContentGenerator import ContentGenerator
# Phase 1: Generate structure skeleton
if progressCallback:
progressCallback(0, 100, "Generating document structure...")
structureGenerator = StructureGenerator(self.services)
# Extract imageDocuments from cachedContent if available
existingImages = None
if cachedContent and cachedContent.get("imageDocuments"):
existingImages = cachedContent.get("imageDocuments")
structure = await structureGenerator.generateStructure(
userPrompt=userPrompt,
documentList=None, # Not used in current implementation
cachedContent=cachedContent,
contentParts=contentParts, # Pass ContentParts for structure generation
maxSectionLength=maxSectionLength,
existingImages=existingImages
)
if progressCallback:
progressCallback(30, 100, "Structure generated, starting content generation...")
# Phase 2: Generate content for each section
contentGenerator = ContentGenerator(self.services)
# Create progress callback wrapper for content generation phase (30-90%)
def contentProgressCallback(sectionIndex: int, totalSections: int, message: str):
if progressCallback:
# Map section progress to overall progress (30% to 90%)
if totalSections > 0:
overallProgress = 30 + int(60 * (sectionIndex / totalSections))
else:
overallProgress = 30
progressCallback(overallProgress, 100, f"Section {sectionIndex}/{totalSections}: {message}")
completeStructure = await contentGenerator.generateContent(
structure=structure,
cachedContent=cachedContent,
userPrompt=userPrompt,
contentParts=contentParts, # Pass ContentParts for content generation
progressCallback=contentProgressCallback,
parallelGeneration=parallelGeneration
)
if progressCallback:
progressCallback(100, 100, "Document generation complete")
return completeStructure
except Exception as e:
logger.error(f"Error in two-phase document generation: {str(e)}")
logger.debug(traceback.format_exc())
raise
async def getAdaptiveExtractionPrompt(
self,
outputFormat: str,
userPrompt: str,
title: str,
aiService=None
) -> str:
"""Get adaptive extraction prompt."""
from modules.serviceCenter.services.serviceExtraction.subPromptBuilderExtraction import buildExtractionPrompt
return await buildExtractionPrompt(
outputFormat=outputFormat,
userPrompt=userPrompt,
title=title,
aiService=aiService,
services=self.services
)
def _getFormatRenderer(self, output_format: str):
"""Get the appropriate document renderer for the specified format."""
try:
from .renderers.registry import getRenderer, getSupportedFormats
renderer = getRenderer(output_format, services=self.services, outputStyle='document')
if renderer:
return renderer
# Log available formats for debugging
availableFormats = getSupportedFormats()
logger.error(
f"No renderer found for format '{output_format}'. "
f"Available formats: {availableFormats}"
)
# Fallback to text renderer if no specific renderer found
logger.warning(f"Falling back to text renderer for format {output_format}")
fallbackRenderer = getRenderer('text', services=self.services, outputStyle='document')
if fallbackRenderer:
return fallbackRenderer
logger.error("Even text renderer fallback failed")
return None
except Exception as e:
logger.error(f"Error getting renderer for {output_format}: {str(e)}")
# traceback is already imported at module level
logger.debug(traceback.format_exc())
return None


@ -0,0 +1,939 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Code Generation Path
Handles code generation with multi-file project support, dependency handling,
and proper cross-file references.
"""
import json
import logging
import time
import re
from typing import Dict, Any, List, Optional
from modules.datamodels.datamodelWorkflow import AiResponse, AiResponseMetadata, DocumentData
from modules.datamodels.datamodelExtraction import ContentPart
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum
from modules.shared.jsonUtils import extractJsonString
logger = logging.getLogger(__name__)
class CodeGenerationPath:
"""Code generation path."""
def __init__(self, services):
self.services = services
async def generateCode(
self,
userPrompt: str,
outputFormat: str = None,
contentParts: Optional[List[ContentPart]] = None,
title: str = "Generated Code",
parentOperationId: Optional[str] = None
) -> AiResponse:
"""
Generate code files with multi-file project support.
Returns: AiResponse with code files as documents
"""
# Create operation ID
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
codeOperationId = f"code_gen_{workflowId}_{int(time.time())}"
# Start progress tracking
self.services.chat.progressLogStart(
codeOperationId,
"Code Generation",
"Code Generation",
f"Format: {outputFormat or 'txt'}",
parentOperationId=parentOperationId
)
try:
# Detect language and project type from prompt or outputFormat
language, projectType = self._detectLanguageAndProjectType(userPrompt, outputFormat)
# Phase 1: Code structure generation (with looping)
self.services.chat.progressLogUpdate(codeOperationId, 0.2, "Generating code structure")
codeStructure = await self._generateCodeStructure(
userPrompt=userPrompt,
language=language,
outputFormat=outputFormat,
contentParts=contentParts
)
# Phase 2: Code content generation (with dependency handling)
self.services.chat.progressLogUpdate(codeOperationId, 0.5, "Generating code content")
codeFiles = await self._generateCodeContent(
codeStructure,
codeOperationId,
userPrompt=userPrompt,
contentParts=contentParts
)
# Phase 3: Code formatting & validation
self.services.chat.progressLogUpdate(codeOperationId, 0.8, "Formatting code files")
formattedFiles = await self._formatAndValidateCode(codeFiles)
# Phase 4: Code Rendering (Renderer-Based)
self.services.chat.progressLogUpdate(codeOperationId, 0.9, "Rendering code files")
# Group files by format
filesByFormat = {}
for file in formattedFiles:
fileType = file.get("fileType", outputFormat or "txt")
if fileType not in filesByFormat:
filesByFormat[fileType] = []
filesByFormat[fileType].append(file)
# Render each format group using appropriate renderer
allRenderedDocuments = []
for fileType, files in filesByFormat.items():
# Get renderer for this format
renderer = self._getCodeRenderer(fileType)
if renderer:
# Use code renderer
renderedDocs = await renderer.renderCodeFiles(
codeFiles=files,
metadata=codeStructure.get("metadata", {}),
userPrompt=userPrompt
)
allRenderedDocuments.extend(renderedDocs)
else:
# Fallback: output directly (for formats without renderers)
for file in files:
mimeType = self._getMimeType(file.get("fileType", "txt"))
content = file.get("content", "")
contentBytes = content.encode('utf-8') if isinstance(content, str) else content
from modules.datamodels.datamodelDocument import RenderedDocument
allRenderedDocuments.append(
RenderedDocument(
documentData=contentBytes,
mimeType=mimeType,
filename=file.get("filename", "generated.txt"),
metadata=codeStructure.get("metadata", {})
)
)
# Convert RenderedDocument to DocumentData
documents = []
for renderedDoc in allRenderedDocuments:
documents.append(DocumentData(
documentName=renderedDoc.filename,
documentData=renderedDoc.documentData,
mimeType=renderedDoc.mimeType,
sourceJson=renderedDoc.metadata if hasattr(renderedDoc, 'metadata') else None
))
metadata = AiResponseMetadata(
title=title,
operationType=OperationTypeEnum.DATA_GENERATE.value
)
# Create summary JSON for content field
summaryContent = {
"type": "code_generation",
"metadata": codeStructure.get("metadata", {}),
"files": [
{
"filename": doc.documentName,
"mimeType": doc.mimeType
}
for doc in documents
],
"fileCount": len(documents)
}
self.services.chat.progressLogFinish(codeOperationId, True)
return AiResponse(
documents=documents,
content=json.dumps(summaryContent, ensure_ascii=False),
metadata=metadata
)
except Exception as e:
logger.error(f"Error in code generation: {str(e)}")
self.services.chat.progressLogFinish(codeOperationId, False)
raise
def _detectLanguageAndProjectType(self, userPrompt: str, outputFormat: Optional[str]) -> tuple:
"""Detect programming language and project type from prompt or format."""
promptLower = userPrompt.lower()
# Detect language
language = None
if outputFormat:
if outputFormat == "py":
language = "python"
elif outputFormat in ["js", "ts"]:
language = outputFormat
elif outputFormat == "html":
language = "html"
if not language:
if "python" in promptLower or ".py" in promptLower:
language = "python"
elif "javascript" in promptLower or ".js" in promptLower:
language = "javascript"
elif "typescript" in promptLower or ".ts" in promptLower:
language = "typescript"
elif "html" in promptLower:
language = "html"
else:
language = "python" # Default
# Detect project type
projectType = "single_file"
if "multi" in promptLower or "multiple files" in promptLower or "project" in promptLower:
projectType = "multi_file"
return language, projectType
async def _generateCodeStructure(
self,
userPrompt: str,
language: str,
outputFormat: Optional[str],
contentParts: Optional[List[ContentPart]]
) -> Dict[str, Any]:
"""Generate code structure using looping system."""
# Build content parts index (similar to document generation)
contentPartsIndex = ""
if contentParts:
validParts = []
for part in contentParts:
contentFormat = part.metadata.get("contentFormat", "unknown")
# Include reference parts and parts with data
if contentFormat == "reference" or (part.data and len(str(part.data).strip()) > 0):
validParts.append(part)
if validParts:
contentPartsIndex = "\n## AVAILABLE CONTENT PARTS\n"
for i, part in enumerate(validParts, 1):
contentFormat = part.metadata.get("contentFormat", "unknown")
originalFileName = part.metadata.get('originalFileName', 'N/A')
contentPartsIndex += f"\n{i}. ContentPart ID: {part.id}\n"
contentPartsIndex += f" Format: {contentFormat}\n"
contentPartsIndex += f" Type: {part.typeGroup}\n"
contentPartsIndex += f" MIME Type: {part.mimeType or 'N/A'}\n"
contentPartsIndex += f" Source: {part.metadata.get('documentId', 'unknown')}\n"
contentPartsIndex += f" Original file name: {originalFileName}\n"
contentPartsIndex += f" Usage hint: {part.metadata.get('usageHint', 'N/A')}\n"
if not contentPartsIndex:
contentPartsIndex = "\n(No content parts available)"
# Create template structure explicitly (not extracted from prompt)
templateStructure = f"""{{
"metadata": {{
"language": "{language}",
"projectType": "single_file|multi_file",
"projectName": ""
}},
"files": [
{{
"id": "",
"filename": "",
"fileType": "",
"dependencies": [],
"imports": [],
"functions": [],
"classes": []
}}
]
}}"""
# Build structure generation prompt
structurePrompt = f"""# TASK: Generate Code Project Structure
This is a PLANNING task. Return EXACTLY ONE complete JSON object. Do not generate multiple JSON objects, alternatives, or variations. Do not use separators like "---" between JSON objects.
## USER REQUEST (for context)
```
{userPrompt}
```
{contentPartsIndex}
## LANGUAGE
{language}
## TASK DESCRIPTION
Analyze the USER REQUEST above and create a project structure that fulfills ALL requirements mentioned in the request.
IMPORTANT: If the request mentions multiple files (e.g., "3 files", "config.json and customers.json", etc.), you MUST include ALL requested files in the files array. Set projectType to "multi_file" when multiple files are requested.
## CONTENT PARTS USAGE (if available)
If AVAILABLE CONTENT PARTS are listed above, use them to inform the file structure:
**Analyzing Content Parts:**
- Review each ContentPart's format, type, original file name, and usage hint
- Content parts with "reference" format = documents/images that will be processed/extracted
- Content parts with "extracted" format = pre-processed data ready to use
- Content parts with "object" format = images/documents to be displayed or processed
**Mapping Content Parts to Files:**
- If content parts contain data (e.g., expense receipts, customer lists), create data files (JSON/CSV) that will store/represent that data
- If content parts are documents to be processed (e.g., PDFs), you may need code files that parse/process them
- Use the original file names and usage hints to determine appropriate filenames and file types
**Populating File Structure Fields:**
- **dependencies**: List file IDs that this file depends on (e.g., if a Python script reads a JSON config file, the script depends on the config file)
- **imports**: For code files, list imports needed based on content parts (e.g., if processing PDFs: ["import PyPDF2"], if processing CSV: ["import csv"], if processing JSON: ["import json"])
- **functions**: For CODE files only - list function signatures if the USER REQUEST specifies functionality (e.g., {{"name": "parseReceipt", "signature": "def parseReceipt(pdf_path: str) -> dict"}})
- **classes**: For CODE files only - list class definitions if the USER REQUEST specifies OOP structure
- **functions/classes for DATA files**: Leave as empty arrays [] - data files (JSON/CSV/XML) don't contain executable code
## FILE STRUCTURE REQUIREMENTS
Create a JSON structure with:
1. metadata: {{"language": "{language}", "projectType": "single_file|multi_file", "projectName": "..."}}
- projectName: Derive from USER REQUEST or content parts (e.g., "expense-tracker", "customer-manager")
2. files: Array of file structures, each with:
- id: Unique identifier (e.g., "file_1", "file_2")
- filename: File name matching USER REQUEST requirements (e.g., "config.json", "customers.json", "expenses.csv")
- fileType: File extension matching the requested format (e.g., "json", "py", "js", "csv", "xml")
- dependencies: List of file IDs this file depends on (for multi-file projects where files reference each other)
- imports: List of import statements that this file will need (e.g., ["import json", "import csv"] for Python files processing JSON/CSV)
- functions: Array of function signatures {{"name": "...", "signature": "..."}} - ONLY if the file will contain executable code (not for pure data files like JSON/CSV)
- classes: Array of class definitions {{"name": "...", "signature": "..."}} - ONLY if the file will contain executable code (not for pure data files like JSON/CSV)
IMPORTANT FOR DATA FILES (JSON, CSV, XML):
- For pure data files (config.json, customers.json, expenses.csv), leave functions and classes as empty arrays []
- These files contain structured data, not executable code
- Use imports only if the file will be processed by code (e.g., a Python script that reads the CSV)
IMPORTANT FOR CODE FILES (Python, JavaScript, etc.):
- Include functions/classes if the USER REQUEST specifies functionality
- Use dependencies to indicate which data files this code file reads/processes
- Use imports to specify what libraries/modules are needed
For single-file projects, return one file. For multi-file projects, include ALL requested files in the files array.
Return ONLY valid JSON matching the request above.
"""
# Build continuation prompt builder
async def buildCodeStructurePromptWithContinuation(
continuationContext: Any,
templateStructure: str,
basePrompt: str
) -> str:
"""Build code structure prompt with continuation context. Uses unified signature.
Note: All initial context (userPrompt, contentParts, etc.) is already
contained in basePrompt. This function only adds continuation-specific instructions.
"""
# Extract continuation context fields (only what's needed for continuation)
incompletePart = continuationContext.incomplete_part
lastRawJson = continuationContext.last_raw_json
# Generate both overlap context and hierarchy context using jsonContinuation
overlapContext = ""
unifiedContext = ""
if lastRawJson:
# Get contexts directly from jsonContinuation
from modules.shared.jsonContinuation import getContexts
contexts = getContexts(lastRawJson)
overlapContext = contexts.overlapContext
unifiedContext = contexts.hierarchyContextForPrompt
elif incompletePart:
unifiedContext = incompletePart
else:
unifiedContext = "Unable to extract context - response was completely broken"
# Build unified continuation prompt format
continuationPrompt = f"""{basePrompt}
--- CONTINUATION REQUEST ---
The previous JSON response was incomplete. Continue from where it stopped.
Context showing structure hierarchy with cut point:
```
{unifiedContext}
```
Overlap Requirement:
To ensure proper merging, your response MUST start EXACTLY with the overlap context shown below, then continue with new content.
Overlap context (start your response with this exact text):
```json
{overlapContext if overlapContext else "No overlap context available"}
```
TASK:
1. Start your response EXACTLY with the overlap context shown above (character by character)
2. Continue seamlessly from where the overlap context ends
3. Complete the remaining content following the JSON structure template above
4. Return ONLY valid JSON following the structure template - no overlap/continuation wrapper objects
CRITICAL:
- Your response MUST begin with the exact overlap context text (this enables automatic merging)
- Continue seamlessly after the overlap context with new content
- Your response must be valid JSON matching the structure template above"""
return continuationPrompt
# Use generic looping system with code_structure use case
options = AiCallOptions(
operationType=OperationTypeEnum.DATA_GENERATE,
resultFormat="json"
)
structureJson = await self.services.ai.callAiWithLooping(
prompt=structurePrompt,
options=options,
promptBuilder=buildCodeStructurePromptWithContinuation,
promptArgs={
"userPrompt": userPrompt,
"contentParts": contentParts,
"templateStructure": templateStructure,
"basePrompt": structurePrompt
},
useCaseId="code_structure",
debugPrefix="code_structure_generation",
contentParts=contentParts
)
# Extract JSON from markdown fences if present
extractedJson = extractJsonString(structureJson)
parsed = json.loads(extractedJson)
return parsed
async def _generateCodeContent(
self,
codeStructure: Dict[str, Any],
parentOperationId: str,
userPrompt: str = None,
contentParts: Optional[List[ContentPart]] = None
) -> List[Dict[str, Any]]:
"""Generate code content for each file with dependency handling."""
files = codeStructure.get("files", [])
metadata = codeStructure.get("metadata", {})
if not files:
raise ValueError("No files found in code structure")
# Step 1: Resolve dependency order
orderedFiles = self._resolveDependencyOrder(files)
# Step 2: Generate dependency files first (requirements.txt, package.json, etc.)
dependencyFiles = await self._generateDependencyFiles(metadata, orderedFiles)
# Step 3: Generate code files in dependency order (not fully parallel)
codeFiles = []
generatedFileContext = {} # Track what's been generated for cross-file references
for idx, fileStructure in enumerate(orderedFiles):
# Update progress
progress = 0.5 + (0.4 * (idx / len(orderedFiles)))
self.services.chat.progressLogUpdate(
parentOperationId,
progress,
f"Generating {fileStructure.get('filename', 'file')}"
)
# Provide context about already-generated files for proper imports
fileContext = self._buildFileContext(generatedFileContext, fileStructure)
# Generate this file with context
fileContent = await self._generateSingleFileContent(
fileStructure,
fileContext=fileContext,
allFilesStructure=orderedFiles,
metadata=metadata,
userPrompt=userPrompt,
contentParts=contentParts
)
codeFiles.append(fileContent)
# Update context with generated file info (for next files)
generatedFileContext[fileStructure["id"]] = {
"filename": fileContent.get("filename", fileStructure.get("filename")),
"functions": fileContent.get("functions", []),
"classes": fileContent.get("classes", []),
"exports": fileContent.get("exports", [])
}
# Combine dependency files and code files
return dependencyFiles + codeFiles
def _resolveDependencyOrder(self, files: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""Resolve file generation order based on dependencies using topological sort."""
# Build dependency graph
fileMap = {f["id"]: f for f in files}
dependencies = {}
for file in files:
fileId = file["id"]
deps = file.get("dependencies", []) # List of file IDs this file depends on
dependencies[fileId] = deps
# Topological sort
ordered = []
visited = set()
tempMark = set()
def visit(fileId: str):
if fileId in tempMark:
# Circular dependency detected - break it
logger.warning(f"Circular dependency detected involving {fileId}")
return
if fileId in visited:
return
tempMark.add(fileId)
for depId in dependencies.get(fileId, []):
if depId in fileMap:
visit(depId)
tempMark.remove(fileId)
visited.add(fileId)
ordered.append(fileMap[fileId])
for file in files:
if file["id"] not in visited:
visit(file["id"])
return ordered
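# Worked example (illustrative): given files
#     {"id": "config", "dependencies": []}
#     {"id": "main", "dependencies": ["config"]}
# the sort yields [config, main], so main.py is generated after (and with
# knowledge of) config.json. A cycle (a -> b -> a) is logged and broken
# rather than raising, so generation still completes.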
async def _generateDependencyFiles(
self,
metadata: Dict[str, Any],
files: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
"""Generate dependency files (requirements.txt, package.json, etc.)."""
language = metadata.get("language", "").lower()
dependencyFiles = []
# Generate requirements.txt for Python
if language in ["python", "py"]:
requirementsContent = await self._generateRequirementsTxt(files)
if requirementsContent:
dependencyFiles.append({
"filename": "requirements.txt",
"content": requirementsContent,
"fileType": "txt",
"id": "requirements_txt"
})
# Generate package.json for JavaScript/TypeScript
elif language in ["javascript", "typescript", "js", "ts"]:
packageJson = await self._generatePackageJson(files, metadata)
if packageJson:
dependencyFiles.append({
"filename": "package.json",
"content": json.dumps(packageJson, indent=2),
"fileType": "json",
"id": "package_json"
})
return dependencyFiles
async def _generateRequirementsTxt(
self,
files: List[Dict[str, Any]]
) -> Optional[str]:
"""Generate requirements.txt content from Python imports."""
pythonPackages = set()
for file in files:
imports = file.get("imports", [])
if isinstance(imports, list):
for imp in imports:
if isinstance(imp, str):
# Extract package name from import
# Handle: "from flask import", "import flask", "from flask import Flask"
imp = imp.strip()
if "import" in imp:
if "from" in imp:
# "from package import ..."
parts = imp.split("from")
if len(parts) > 1:
package = parts[1].split("import")[0].strip()
if package and not package.startswith("."):
pythonPackages.add(package.split(".")[0]) # Get root package
else:
# "import package" or "import package.module"
parts = imp.split("import")
if len(parts) > 1:
package = parts[1].strip().split(".")[0].strip()
if package and not package.startswith("."):
pythonPackages.add(package)
if pythonPackages:
return "\n".join(sorted(pythonPackages))
return None
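# Example (illustrative): imports ["import flask", "from requests import get",
# "from .local import helper"] yield {"flask", "requests"} - relative imports
# (leading ".") are skipped - producing the requirements.txt content:
#     flask
#     requests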
async def _generatePackageJson(
self,
files: List[Dict[str, Any]],
metadata: Dict[str, Any]
) -> Optional[Dict[str, Any]]:
"""Generate package.json content from JavaScript/TypeScript imports."""
npmPackages = {}
for file in files:
imports = file.get("imports", [])
if isinstance(imports, list):
for imp in imports:
if isinstance(imp, str):
# Extract npm package from import
# Handle: "import express from 'express'", "const express = require('express')"
imp = imp.strip()
if "from" in imp:
# ES6 import: "import ... from 'package'"
parts = imp.split("from")
if len(parts) > 1:
package = parts[1].strip().strip("'\"")
if package and not package.startswith(".") and not package.startswith("/"):
npmPackages[package] = "*"
elif "require" in imp:
# CommonJS: "require('package')"
match = re.search(r"require\(['\"]([^'\"]+)['\"]\)", imp)
if match:
package = match.group(1)
if not package.startswith(".") and not package.startswith("/"):
npmPackages[package] = "*"
if npmPackages:
return {
"name": metadata.get("projectName", "generated-project"),
"version": "1.0.0",
"dependencies": npmPackages
}
return None
def _buildFileContext(
self,
generatedFileContext: Dict[str, Dict[str, Any]],
currentFile: Dict[str, Any]
) -> Dict[str, Any]:
"""Build context about other files for proper imports/references."""
context = {
"availableFiles": [],
"availableFunctions": {},
"availableClasses": {}
}
# Add info about already-generated files
for fileId, fileInfo in generatedFileContext.items():
context["availableFiles"].append({
"id": fileId,
"filename": fileInfo["filename"],
"functions": fileInfo.get("functions", []),
"classes": fileInfo.get("classes", []),
"exports": fileInfo.get("exports", [])
})
# Build function/class maps for easy lookup
for func in fileInfo.get("functions", []):
funcName = func.get("name", "")
if funcName:
context["availableFunctions"][funcName] = {
"file": fileInfo["filename"],
"signature": func.get("signature", "")
}
for cls in fileInfo.get("classes", []):
className = cls.get("name", "")
if className:
context["availableClasses"][className] = {
"file": fileInfo["filename"]
}
return context
async def _generateSingleFileContent(
self,
fileStructure: Dict[str, Any],
fileContext: Dict[str, Any] = None,
allFilesStructure: List[Dict[str, Any]] = None,
metadata: Dict[str, Any] = None,
userPrompt: str = None,
contentParts: Optional[List[ContentPart]] = None
) -> Dict[str, Any]:
"""Generate code content for a single file with context about other files."""
# Build prompt with context about other files for proper imports
filename = fileStructure.get("filename", "generated.py")
fileType = fileStructure.get("fileType", "py")
dependencies = fileStructure.get("dependencies", [])
functions = fileStructure.get("functions", [])
classes = fileStructure.get("classes", [])
contextInfo = ""
if fileContext and fileContext.get("availableFiles"):
contextInfo = "\n\nAvailable files and their exports:\n"
for fileInfo in fileContext["availableFiles"]:
contextInfo += f"- {fileInfo['filename']}: "
funcs = [f.get("name", "") for f in fileInfo.get("functions", [])]
cls = [c.get("name", "") for c in fileInfo.get("classes", [])]
exports = []
if funcs:
exports.extend(funcs)
if cls:
exports.extend(cls)
if exports:
contextInfo += ", ".join(exports)
contextInfo += "\n"
# Build content parts section if available
contentPartsSection = ""
if contentParts:
relevantParts = []
for part in contentParts:
# Include parts that might be relevant to this file
usageHint = part.metadata.get('usageHint', '').lower()
originalFileName = part.metadata.get('originalFileName', '').lower()
filenameLower = filename.lower()
# Check if this content part is relevant to this file
if (filenameLower in usageHint or
filenameLower in originalFileName or
part.metadata.get('contentFormat') == 'reference' or
(part.data and len(str(part.data).strip()) > 0)):
relevantParts.append(part)
if relevantParts:
contentPartsSection = "\n## AVAILABLE CONTENT PARTS\n"
for i, part in enumerate(relevantParts, 1):
contentFormat = part.metadata.get("contentFormat", "unknown")
originalFileName = part.metadata.get('originalFileName', 'N/A')
contentPartsSection += f"\n{i}. ContentPart ID: {part.id}\n"
contentPartsSection += f" Format: {contentFormat}\n"
contentPartsSection += f" Type: {part.typeGroup}\n"
contentPartsSection += f" Original file name: {originalFileName}\n"
contentPartsSection += f" Usage hint: {part.metadata.get('usageHint', 'N/A')}\n"
# Include actual content if it's small enough (for data files like CSV, JSON)
if part.data and isinstance(part.data, str) and len(part.data) < 2000:
preview = part.data[:500]
contentPartsSection += f" Content preview: {preview}{'...' if len(part.data) > 500 else ''}\n"
# Build user request section
userRequestSection = ""
if userPrompt:
userRequestSection = f"""
## ORIGINAL USER REQUEST
```
{userPrompt}
```
"""
# Create template structure explicitly (not extracted from prompt)
templateStructure = f"""{{
"files": [
{{
"filename": "{filename}",
"content": "// Complete code here",
"functions": {json.dumps(functions, indent=2) if functions else '[]'},
"classes": {json.dumps(classes, indent=2) if classes else '[]'}
}}
]
}}"""
# Build base prompt
contentPrompt = f"""# TASK: Generate Code File Content
Generate complete, executable code for the file: {filename}
{userRequestSection}## FILE SPECIFICATIONS
File Type: {fileType}
Language: {metadata.get('language', 'python') if metadata else 'python'}
{contentPartsSection}
Required functions:
{json.dumps(functions, indent=2) if functions else 'None specified'}
Required classes:
{json.dumps(classes, indent=2) if classes else 'None specified'}
Dependencies on other files: {', '.join(dependencies) if dependencies else 'None'}
{contextInfo}
Generate complete, production-ready code with:
1. Proper imports (including imports from other files in the project if dependencies exist)
2. All required functions and classes
3. Error handling
4. Documentation/docstrings
5. Type hints where appropriate
Return ONLY valid JSON in this format:
{templateStructure}
"""
# Build continuation prompt builder
async def buildCodeContentPromptWithContinuation(
continuationContext: Any,
templateStructure: str,
basePrompt: str
) -> str:
"""Build code content prompt with continuation context. Uses unified signature.
Note: All initial context (filename, fileType, functions, etc.) is already
contained in basePrompt. This function only adds continuation-specific instructions.
"""
# Extract continuation context fields (only what's needed for continuation)
incompletePart = continuationContext.incomplete_part
lastRawJson = continuationContext.last_raw_json
# Generate both overlap context and hierarchy context using jsonContinuation
overlapContext = ""
unifiedContext = ""
if lastRawJson:
# Get contexts directly from jsonContinuation
from modules.shared.jsonContinuation import getContexts
contexts = getContexts(lastRawJson)
overlapContext = contexts.overlapContext
unifiedContext = contexts.hierarchyContextForPrompt
elif incompletePart:
unifiedContext = incompletePart
else:
unifiedContext = "Unable to extract context - response was completely broken"
# Build unified continuation prompt format
continuationPrompt = f"""{basePrompt}
--- CONTINUATION REQUEST ---
The previous JSON response was incomplete. Continue from where it stopped.
Context showing structure hierarchy with cut point:
```
{unifiedContext}
```
Overlap Requirement:
To ensure proper merging, your response MUST start EXACTLY with the overlap context shown below, then continue with new content.
Overlap context (start your response with this exact text):
```json
{overlapContext if overlapContext else "No overlap context available"}
```
TASK:
1. Start your response EXACTLY with the overlap context shown above (character by character)
2. Continue seamlessly from where the overlap context ends
3. Complete the remaining content following the JSON structure template above
4. Return ONLY valid JSON following the structure template - no overlap/continuation wrapper objects
CRITICAL:
- Your response MUST begin with the exact overlap context text (this enables automatic merging)
- Continue seamlessly after the overlap context with new content
- Your response must be valid JSON matching the structure template above"""
return continuationPrompt
# Use generic looping system with code_content use case
options = AiCallOptions(
operationType=OperationTypeEnum.DATA_GENERATE,
resultFormat="json"
)
contentJson = await self.services.ai.callAiWithLooping(
prompt=contentPrompt,
options=options,
promptBuilder=buildCodeContentPromptWithContinuation,
promptArgs={
"filename": filename,
"fileType": fileType,
"functions": functions,
"classes": classes,
"dependencies": dependencies,
"metadata": metadata,
"userPrompt": userPrompt,
"contentParts": contentParts,
"contextInfo": contextInfo,
"templateStructure": templateStructure,
"basePrompt": contentPrompt
},
useCaseId="code_content",
debugPrefix=f"code_content_{fileStructure.get('id', 'file')}",
)
# Extract JSON from markdown fences if present
extractedJson = extractJsonString(contentJson)
parsed = json.loads(extractedJson)
# Extract file content and metadata
files = parsed.get("files", [])
if files and len(files) > 0:
fileData = files[0]
return {
"filename": fileData.get("filename", filename),
"content": fileData.get("content", ""),
"fileType": fileType,
"functions": fileData.get("functions", functions),
"classes": fileData.get("classes", classes),
"id": fileStructure.get("id")
}
# Fallback if structure is different
return {
"filename": filename,
"content": parsed.get("content", ""),
"fileType": fileType,
"functions": functions,
"classes": classes,
"id": fileStructure.get("id")
}
async def _formatAndValidateCode(self, codeFiles: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""Format and validate generated code files."""
# For now, just return files as-is
# TODO: Add code formatting (black, prettier, etc.) and validation
formatted = []
for file in codeFiles:
content = file.get("content", "")
# Basic cleanup: remove markdown code fences if present
if isinstance(content, str):
content = re.sub(r'^```[\w]*\n', '', content, flags=re.MULTILINE)
content = re.sub(r'\n```$', '', content, flags=re.MULTILINE)
file["content"] = content.strip()
formatted.append(file)
return formatted
def _getMimeType(self, fileType: str) -> str:
"""Get MIME type for file type."""
mimeTypes = {
"py": "text/x-python",
"js": "application/javascript",
"ts": "application/typescript",
"html": "text/html",
"css": "text/css",
"json": "application/json",
"txt": "text/plain",
"md": "text/markdown",
"java": "text/x-java-source",
"cpp": "text/x-c++src",
"c": "text/x-csrc",
"csv": "text/csv",
"xml": "application/xml"
}
return mimeTypes.get(fileType.lower(), "text/plain")
def _getCodeRenderer(self, fileType: str):
"""Get code renderer for file type."""
from ..renderers.registry import getRenderer
# Map file types to renderer formats (code path)
formatMap = {
'json': 'json',
'csv': 'csv',
'xml': 'xml'
}
rendererFormat = formatMap.get(fileType.lower())
if rendererFormat:
renderer = getRenderer(rendererFormat, self.services, outputStyle='code')
# Check if renderer supports code rendering
if renderer and hasattr(renderer, 'renderCodeFiles'):
return renderer
return None
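
A minimal call sketch for this path, assuming a resolved `services` container (with `ai`, `chat`, and `workflow` attached) as the constructor expects; the prompt and variable names are illustrative:

```python
# Illustrative sketch - must run inside an async context.
codePath = CodeGenerationPath(services)

response = await codePath.generateCode(
    userPrompt="Create a Python script that reads config.json and prints each key",
    outputFormat="py",
    title="Config Reader",
)

# The AiResponse carries one DocumentData per generated file, plus a JSON
# summary in `content` with filenames, MIME types, and the file count.
for doc in response.documents:
    print(doc.documentName, doc.mimeType, len(doc.documentData), "bytes")
```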


@ -0,0 +1,214 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Document Generation Path
Handles document generation using existing chapter/section model.
"""
import json
import logging
import time
import copy
from typing import Dict, Any, List, Optional
from modules.datamodels.datamodelWorkflow import AiResponse, AiResponseMetadata, DocumentData
from modules.datamodels.datamodelExtraction import ContentPart, DocumentIntent
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum
from modules.datamodels.datamodelDocument import RenderedDocument
from modules.workflows.processing.shared.stateTools import checkWorkflowStopped
logger = logging.getLogger(__name__)
class DocumentGenerationPath:
"""Document generation path (existing functionality, refactored)."""
def __init__(self, services):
self.services = services
async def generateDocument(
self,
userPrompt: str,
documentList: Optional[Any] = None, # DocumentReferenceList
documentIntents: Optional[List[DocumentIntent]] = None,
contentParts: Optional[List[ContentPart]] = None,
outputFormat: str = "txt",
title: Optional[str] = None,
parentOperationId: Optional[str] = None
) -> AiResponse:
"""
Generate document using existing chapter/section model.
Returns: AiResponse with documents list
"""
# Create operation ID
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
docOperationId = f"doc_gen_{workflowId}_{int(time.time())}"
# Start progress tracking
self.services.chat.progressLogStart(
docOperationId,
"Document Generation",
"Document Generation",
f"Format: {outputFormat}",
parentOperationId=parentOperationId
)
try:
# Step 5A: Clarify document intents
documents = []
if documentList:
documents = self.services.chat.getChatDocumentsFromDocumentList(documentList)
# Filter: remove original documents when pre-extracted JSONs already exist
# (to avoid duplicates - the pre-extracted JSONs already contain the ContentParts)
# Step 1: Identify all original document IDs that are covered by pre-extracted JSONs
originalDocIdsCoveredByPreExtracted = set()
for doc in documents:
preExtracted = self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc)
if preExtracted:
originalDocId = preExtracted["originalDocument"]["id"]
originalDocIdsCoveredByPreExtracted.add(originalDocId)
logger.debug(f"Found pre-extracted JSON {doc.id} covering original document {originalDocId}")
# Step 2: Filter documents - remove original documents that are already covered by pre-extracted JSONs
filteredDocuments = []
for doc in documents:
preExtracted = self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc)
if preExtracted:
# Keep the pre-extracted JSON
filteredDocuments.append(doc)
elif doc.id in originalDocIdsCoveredByPreExtracted:
# Original document already covered by a pre-extracted JSON - remove it
logger.info(f"Skipping original document {doc.id} ({doc.fileName}) - already covered by pre-extracted JSON")
else:
# Regular document without a pre-extracted JSON - keep it
filteredDocuments.append(doc)
documents = filteredDocuments
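# Illustrative example: with documents [report.pdf, report_extracted.json],
# where the JSON resolves report.pdf as its original, step 1 collects
# report.pdf's ID and step 2 keeps only report_extracted.json, so the same
# ContentParts are not processed twice.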
checkWorkflowStopped(self.services)
if not documentIntents and documents:
documentIntents = await self.services.ai.clarifyDocumentIntents(
documents,
userPrompt,
{"outputFormat": outputFormat},
docOperationId
)
checkWorkflowStopped(self.services)
# Step 5B: Extract and prepare content
if documents:
preparedContentParts = await self.services.ai.extractAndPrepareContent(
documents,
documentIntents or [],
docOperationId
)
# Merge with provided contentParts (if any)
if contentParts:
# Check for pre-extracted content
for part in contentParts:
if part.metadata.get("skipExtraction", False):
# Already extracted - use as-is and make sure the metadata is complete
part.metadata.setdefault("contentFormat", "extracted")
part.metadata.setdefault("isPreExtracted", True)
preparedContentParts.extend(contentParts)
contentParts = preparedContentParts
# Step 5B.5: Documents are converted to contentParts (like pre-processed JSON files)
# No AI extraction here - AI extraction happens during section generation
if contentParts:
logger.info(f"Using {len(contentParts)} content parts for generation (no AI extraction at this stage)")
checkWorkflowStopped(self.services)
# Step 5C: Generate structure
structure = await self.services.ai.generateStructure(
userPrompt,
contentParts or [],
outputFormat,
docOperationId
)
checkWorkflowStopped(self.services)
# Step 5D: Fill structure
# Language will be extracted from services (user intention analysis) in fillStructure
filledStructure = await self.services.ai.fillStructure(
structure,
contentParts or [],
userPrompt,
docOperationId
)
checkWorkflowStopped(self.services)
# Step 5E: Render result
# Each document is rendered individually and may return 1..n files (e.g. HTML + images)
# Language is already validated in structure (State 3) and preserved in filled structure (State 4)
# Per-document language will be extracted in renderReport() from filledStructure
# Use validated currentUserLanguage as global fallback (always valid infrastructure)
language = self.services.currentUserLanguage if hasattr(self.services, 'currentUserLanguage') and self.services.currentUserLanguage else "en"
# IMPORTANT: Create deep copy BEFORE renderResult to preserve filledStructure with elements
# renderResult might modify the structure, so we need to preserve the original for sourceJson
# This ensures sourceJson contains the complete structure with elements for validation
filledStructureForSourceJson = copy.deepcopy(filledStructure) if filledStructure else None
renderedDocuments = await self.services.ai.renderResult(
filledStructure,
outputFormat,
language, # Global fallback (per-document language extracted from structure in renderReport)
title or "Generated Document",
userPrompt,
docOperationId
)
# Build response: convert all rendered documents to DocumentData
documentDataList = []
for renderedDoc in renderedDocuments:
try:
# Create DocumentData for each rendered document
# Use the preserved filledStructureForSourceJson (with elements) for sourceJson
docDataObj = DocumentData(
documentName=renderedDoc.filename,
documentData=renderedDoc.documentData,
mimeType=renderedDoc.mimeType,
sourceJson=filledStructureForSourceJson if len(documentDataList) == 0 else None  # Only for the first document
)
documentDataList.append(docDataObj)
logger.debug(f"Added rendered document: {renderedDoc.filename} ({len(renderedDoc.documentData)} bytes, {renderedDoc.mimeType})")
except Exception as e:
logger.warning(f"Error creating document {renderedDoc.filename}: {str(e)}")
if not documentDataList:
raise ValueError("No documents were rendered")
metadata = AiResponseMetadata(
title=title or filledStructure.get("metadata", {}).get("title", "Generated Document"),
operationType=OperationTypeEnum.DATA_GENERATE.value
)
# Debug log (harmonized)
self.services.utils.writeDebugFile(
json.dumps(filledStructure, indent=2, ensure_ascii=False, default=str),
"document_generation_response"
)
self.services.chat.progressLogFinish(docOperationId, True)
return AiResponse(
content=json.dumps(filledStructure),
metadata=metadata,
documents=documentDataList
)
except Exception as e:
logger.error(f"Error in document generation: {str(e)}")
self.services.chat.progressLogFinish(docOperationId, False)
raise
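
For orientation, a minimal call sketch (illustrative; `documentList` and `documentIntents` default to None, in which case generation runs purely from the prompt and any provided content parts):

```python
docPath = DocumentGenerationPath(services)

response = await docPath.generateDocument(
    userPrompt="Summarize the attached contracts as an HTML report",
    outputFormat="html",
    title="Contract Summary",
)

# The first DocumentData carries the filled structure as sourceJson;
# supporting files (e.g. images referenced by the HTML) follow it.
mainDocument = response.documents[0]
```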


@ -0,0 +1,128 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Image Generation Path
Handles image generation with support for single and batch generation.
"""
import logging
import time
from typing import List, Optional
from modules.datamodels.datamodelWorkflow import AiResponse, AiResponseMetadata, DocumentData
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum, AiCallRequest
logger = logging.getLogger(__name__)
class ImageGenerationPath:
"""Image generation path."""
def __init__(self, services):
self.services = services
async def generateImages(
self,
userPrompt: str,
count: int = 1,
style: Optional[str] = None,
format: str = "png",
title: Optional[str] = None,
parentOperationId: Optional[str] = None
) -> AiResponse:
"""
Generate image files.
Returns: AiResponse with image files as documents
"""
# Create operation ID
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
imageOperationId = f"image_gen_{workflowId}_{int(time.time())}"
# Start progress tracking
self.services.chat.progressLogStart(
imageOperationId,
"Image Generation",
"Image Generation",
f"Format: {format}",
parentOperationId=parentOperationId
)
try:
self.services.chat.progressLogUpdate(imageOperationId, 0.4, "Calling AI for image generation")
# Build prompt with style if provided
imagePrompt = userPrompt
if style:
imagePrompt = f"{userPrompt}\n\nStyle: {style}"
# Use IMAGE_GENERATE operation
options = AiCallOptions(
operationType=OperationTypeEnum.IMAGE_GENERATE,
resultFormat=format
)
request = AiCallRequest(
prompt=imagePrompt,
context="",
options=options
)
response = await self.services.ai.callAi(request)
if not response.content:
errorMsg = f"No image data returned: {response.content}"
logger.error(f"Error in AI image generation: {errorMsg}")
self.services.chat.progressLogFinish(imageOperationId, False)
raise ValueError(errorMsg)
# Handle response content (could be base64 string or bytes)
imageData = response.content
if isinstance(imageData, str):
# Assume base64 encoded string
import base64
try:
imageData = base64.b64decode(imageData)
except Exception:
# If not base64, try encoding as bytes
imageData = imageData.encode('utf-8')
elif not isinstance(imageData, bytes):
imageData = bytes(imageData)
# Create document
imageDoc = DocumentData(
documentName=f"generated_image.{format}",
documentData=imageData,
mimeType=f"image/{format}"
)
metadata = AiResponseMetadata(
title=title or "Generated Image",
operationType=OperationTypeEnum.IMAGE_GENERATE.value
)
# Note: Stats are now stored centrally in callAi() - no need to duplicate here
self.services.chat.progressLogUpdate(imageOperationId, 0.9, "Image generated")
self.services.chat.progressLogFinish(imageOperationId, True)
# Create content string describing the image generation
import json
contentJson = json.dumps({
"type": "image",
"format": format,
"prompt": userPrompt,
"filename": imageDoc.documentName
}, ensure_ascii=False)
return AiResponse(
content=contentJson, # JSON string describing the image generation
metadata=metadata,
documents=[imageDoc]
)
except Exception as e:
logger.error(f"Error in image generation: {str(e)}")
self.services.chat.progressLogFinish(imageOperationId, False)
raise
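
A minimal call sketch (illustrative), showing where the binary image ends up:

```python
imagePath = ImageGenerationPath(services)

response = await imagePath.generateImages(
    userPrompt="A watercolor illustration of a lighthouse at dusk",
    style="watercolor",
    format="png",
)

# `content` is a JSON description of the request; the image bytes live in
# documents[0].documentData (base64-decoded if the model returned a string).
imageBytes = response.documents[0].documentData
```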


@ -0,0 +1,45 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Base renderer class for code format renderers.
"""
from abc import abstractmethod
from .documentRendererBaseTemplate import BaseRenderer
from modules.datamodels.datamodelDocument import RenderedDocument
from typing import Dict, Any, List, Optional
import logging
logger = logging.getLogger(__name__)
class BaseCodeRenderer(BaseRenderer):
"""Base class for code format renderers."""
@abstractmethod
async def renderCodeFiles(
self,
codeFiles: List[Dict[str, Any]],
metadata: Dict[str, Any],
userPrompt: str = None
) -> List[RenderedDocument]:
"""
Render code files to format-specific output.
Args:
codeFiles: List of file dictionaries with:
- filename: str
- fileType: str (json, csv, xml, etc.)
- content: str (generated code)
- id: str (optional)
metadata: Project metadata (language, projectType, etc.)
userPrompt: Original user prompt
Returns:
List of RenderedDocument objects (can be 1..n files)
"""
pass
def _validateCodeFile(self, codeFile: Dict[str, Any]) -> bool:
"""Validate code file structure."""
required = ['filename', 'fileType', 'content']
return all(key in codeFile for key in required)
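
To make the abstract contract concrete, a minimal subclass sketch (illustrative - `PlainTextCodeRenderer` is not one of the shipped renderers):

```python
class PlainTextCodeRenderer(BaseCodeRenderer):
    """Illustrative sketch: emits each code file verbatim as text/plain."""

    async def renderCodeFiles(self, codeFiles, metadata, userPrompt=None):
        rendered = []
        for codeFile in codeFiles:
            # Skip entries missing filename/fileType/content
            if not self._validateCodeFile(codeFile):
                logger.warning(f"Skipping invalid code file: {codeFile.get('filename')}")
                continue
            rendered.append(RenderedDocument(
                documentData=codeFile["content"].encode("utf-8"),
                mimeType="text/plain",
                filename=codeFile["filename"],
                metadata=metadata,
            ))
        return rendered
```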


@ -0,0 +1,484 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Base renderer class for all format renderers.
"""
from abc import ABC, abstractmethod
from typing import Dict, Any, List, Tuple, Optional
from modules.datamodels.datamodelJson import supportedSectionTypes
from modules.datamodels.datamodelDocument import RenderedDocument
import json
import logging
import re
from datetime import datetime, UTC
import base64
import io
from PIL import Image
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
logger = logging.getLogger(__name__)
class BaseRenderer(ABC):
"""Base class for all format renderers."""
def __init__(self, services=None):
self.logger = logger
self.services = services # Add services attribute
@classmethod
def getSupportedFormats(cls) -> List[str]:
"""
Return list of supported format names for this renderer.
Override this method in subclasses to specify supported formats.
"""
return []
@classmethod
def getFormatAliases(cls) -> List[str]:
"""
Return list of format aliases for this renderer.
Override this method in subclasses to specify format aliases.
"""
return []
@classmethod
def getPriority(cls) -> int:
"""
Return priority for this renderer (higher number = higher priority).
Used when multiple renderers support the same format.
"""
return 0
@classmethod
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
"""
Return the output style classification for this renderer.
Returns: 'code', 'document', 'image', or other (e.g., 'video' for future use)
Override this method in subclasses to specify the output style.
Args:
formatName: Optional format name (e.g., 'txt', 'js', 'csv') - useful for renderers
that handle multiple formats with different styles (e.g., RendererText)
"""
return 'document' # Default to document style
@classmethod
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
"""
Return list of section content types that this renderer accepts.
This allows renderers to declare which section types they can process.
Default implementation returns all supported section types.
Override this method in subclasses to restrict accepted types.
Args:
formatName: Optional format name (e.g., 'txt', 'js', 'csv') - useful for renderers
that handle multiple formats with different accepted types (e.g., RendererText)
Returns:
List of accepted section content types (e.g., ["table", "paragraph", "heading"])
Valid types: "table", "bullet_list", "heading", "paragraph", "code_block", "image"
"""
# Default: accept all section types
return list(supportedSectionTypes)
@abstractmethod
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
"""
Render extracted JSON content to multiple documents.
Each renderer must implement this method.
Can return 1..n documents (e.g., HTML + images).
Args:
extractedContent: Structured JSON content with sections and metadata (contains single document)
title: Report title
userPrompt: Original user prompt for context
aiService: AI service instance for additional processing
Returns:
List of RenderedDocument objects.
First document is the main document, additional documents are supporting files (e.g., images).
Even if only one document is returned, it must be wrapped in a list.
"""
pass
def _determineFilename(self, title: str, mimeType: str) -> str:
"""Determine filename from title and mimeType."""
# Get extension from mimeType
extensionMap = {
"text/html": "html",
"application/pdf": "pdf",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
"text/plain": "txt",
"text/markdown": "md",
"application/json": "json",
"text/csv": "csv"
}
extension = extensionMap.get(mimeType, "txt")
# Sanitize title for filename
sanitized = re.sub(r"[^a-zA-Z0-9._-]", "_", title)
sanitized = re.sub(r"_+", "_", sanitized).strip("_")
if not sanitized:
sanitized = "document"
return f"{sanitized}.{extension}"
def _extractSections(self, reportData: Dict[str, Any]) -> List[Dict[str, Any]]:
"""
Extract sections from standardized schema: {metadata: {...}, documents: [{sections: [...]}]}
Phase 5: Supports multiple documents - extracts all sections from all documents.
"""
if "documents" not in reportData:
raise ValueError("Report data must follow standardized schema with 'documents' array")
documents = reportData.get("documents", [])
if not isinstance(documents, list) or len(documents) == 0:
raise ValueError("Standardized schema must contain at least one document in 'documents' array")
# Phase 5: Extract sections from ALL documents
all_sections = []
for doc in documents:
if isinstance(doc, dict) and "sections" in doc:
sections = doc.get("sections", [])
if isinstance(sections, list):
all_sections.extend(sections)
if not all_sections:
raise ValueError("No sections found in any document")
return all_sections
def _extractMetadata(self, reportData: Dict[str, Any]) -> Dict[str, Any]:
"""
Extract metadata from standardized schema: {metadata: {...}, documents: [{sections: [...]}]}
"""
if "metadata" not in reportData:
raise ValueError("Report data must follow standardized schema with 'metadata' field")
metadata = reportData.get("metadata", {})
if not isinstance(metadata, dict):
raise ValueError("Metadata in standardized schema must be a dictionary")
return metadata
def _getTitle(self, reportData: Dict[str, Any], fallbackTitle: str) -> str:
"""Get title from report data or use fallback."""
metadata = reportData.get('metadata', {})
return metadata.get('title', fallbackTitle)
def _validateJsonStructure(self, jsonContent: Dict[str, Any]) -> bool:
"""
Validate that JSON content follows standardized schema: {metadata: {...}, documents: [{sections: [...]}]}
"""
if not isinstance(jsonContent, dict):
return False
# Validate metadata field exists
if "metadata" not in jsonContent:
return False
if not isinstance(jsonContent.get("metadata"), dict):
return False
# Validate documents array exists and is not empty
if "documents" not in jsonContent:
return False
documents = jsonContent.get("documents", [])
if not isinstance(documents, list) or len(documents) == 0:
return False
# Validate first document has sections
firstDoc = documents[0]
if not isinstance(firstDoc, dict) or "sections" not in firstDoc:
return False
sections = firstDoc.get("sections", [])
if not isinstance(sections, list):
return False
# Validate each section has content_type and elements
for section in sections:
if not isinstance(section, dict):
return False
if "content_type" not in section or "elements" not in section:
return False
return True
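# Illustrative minimal payload that passes this validation (the element
# payload shape is an assumption - only the presence of content_type and
# elements is checked here):
#     {
#         "metadata": {"title": "Report"},
#         "documents": [
#             {"sections": [
#                 {"content_type": "paragraph", "elements": [{"text": "Hello"}]}
#             ]}
#         ]
#     }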
def _getSectionType(self, section: Dict[str, Any]) -> str:
"""Get the type of a section; default to 'paragraph' for non-dict inputs."""
if isinstance(section, dict):
return section.get("content_type", "paragraph")
# If section is a list or any other type, treat as paragraph elements
return "paragraph"
def _getSectionData(self, section: Dict[str, Any]) -> List[Dict[str, Any]]:
"""Get the elements of a section; if a list is provided directly, return it."""
if isinstance(section, dict):
return section.get("elements", [])
if isinstance(section, list):
return section
return []
def _getSectionId(self, section: Dict[str, Any]) -> str:
"""Get the ID of a section (if available)."""
if isinstance(section, dict):
return section.get("id", "unknown")
return "unknown"
def _validateImageData(self, base64Data: str, altText: str) -> bool:
"""Validate image data."""
if not base64Data:
self.logger.warning("Image section has no base64 data")
return False
if not altText:
self.logger.warning("Image section has no alt text")
return False
# Basic base64 validation
try:
base64.b64decode(base64Data, validate=True)
return True
except Exception as e:
self.logger.warning(f"Invalid base64 image data: {str(e)}")
return False
def _getImageDimensions(self, base64Data: str) -> Tuple[int, int]:
"""
Get image dimensions from base64 data.
This is a helper method that format-specific renderers can use.
"""
try:
# Decode base64 data
imageData = base64.b64decode(base64Data)
image = Image.open(io.BytesIO(imageData))
return image.size # Returns (width, height)
except Exception as e:
self.logger.warning(f"Could not determine image dimensions: {str(e)}")
return (0, 0)
def _resizeImageIfNeeded(self, base64Data: str, maxWidth: int = 800, maxHeight: int = 600) -> str:
"""
Resize image if it exceeds maximum dimensions.
Returns the resized image as base64 string.
"""
try:
# Decode base64 data
imageData = base64.b64decode(base64Data)
image = Image.open(io.BytesIO(imageData))
# Check if resizing is needed
width, height = image.size
if width <= maxWidth and height <= maxHeight:
return base64Data # No resizing needed
# Calculate new dimensions maintaining aspect ratio
ratio = min(maxWidth / width, maxHeight / height)
newWidth = int(width * ratio)
newHeight = int(height * ratio)
# Resize image
resizedImage = image.resize((newWidth, newHeight), Image.Resampling.LANCZOS)
# Convert back to base64
buffer = io.BytesIO()
resizedImage.save(buffer, format=image.format or 'PNG')
resizedData = buffer.getvalue()
return base64.b64encode(resizedData).decode('utf-8')
except Exception as e:
self.logger.warning(f"Could not resize image: {str(e)}")
return base64Data # Return original if resize fails
def _getSupportedSectionTypes(self) -> List[str]:
"""Return list of supported section types (from unified schema)."""
return supportedSectionTypes
def _isValidSectionType(self, sectionType: str) -> bool:
"""Check if a section type is valid."""
return sectionType in self._getSupportedSectionTypes()
def _formatTimestamp(self, timestamp: str = None) -> str:
"""Format timestamp for display."""
if timestamp:
return timestamp
return datetime.now(UTC).strftime("%Y-%m-%d %H:%M:%S UTC")
# ===== GENERIC AI STYLING HELPERS =====
async def _getAiStyles(self, aiService, styleTemplate: str, defaultStyles: Dict[str, Any]) -> Dict[str, Any]:
"""
Generic AI styling method that can be used by all renderers.
Args:
aiService: AI service instance
styleTemplate: Format-specific style template
defaultStyles: Default styles to fall back to
Returns:
Dict with styling definitions
"""
if not aiService:
return defaultStyles
try:
requestOptions = AiCallOptions()
requestOptions.operationType = OperationTypeEnum.DATA_GENERATE
request = AiCallRequest(prompt=styleTemplate, context="", options=requestOptions)
# DEBUG: Show the actual prompt being sent to AI
self.logger.debug(f"AI Style Template Prompt:")
self.logger.debug(f"{styleTemplate}")
response = await aiService.callAi(request)
# Save styling prompt and response to debug (fire and forget - don't block on slow file I/O)
# The writeDebugFile calls os.listdir() which can be slow with many files
# Run in background thread to avoid blocking rendering
import threading
def _writeDebugFiles():
try:
self.services.utils.writeDebugFile(styleTemplate, "renderer_styling_prompt")
self.services.utils.writeDebugFile(response.content or '', "renderer_styling_response")
except Exception:
pass # Silently fail - debug writing should never block rendering
threading.Thread(target=_writeDebugFiles, daemon=True).start()
# Clean and parse JSON
result = response.content.strip() if response and response.content else ""
# Check if result is empty
if not result:
self.logger.warning("AI styling returned empty response, using defaults")
return defaultStyles
# Extract JSON from markdown if present
jsonMatch = re.search(r'```json\s*\n(.*?)\n```', result, re.DOTALL)
if jsonMatch:
result = jsonMatch.group(1).strip()
elif result.startswith('```json'):
result = re.sub(r'^```json\s*', '', result)
result = re.sub(r'\s*```$', '', result)
elif result.startswith('```'):
result = re.sub(r'^```\s*', '', result)
result = re.sub(r'\s*```$', '', result)
# Try to parse JSON
try:
styles = json.loads(result)
except json.JSONDecodeError as jsonError:
self.logger.warning(f"AI styling returned invalid JSON: {jsonError}")
# Write the full response to a debug file (logger output may be truncated)
self.services.utils.debugLogToFile(f"FULL AI RESPONSE THAT FAILED TO PARSE: {result}", "RENDERER")
self.services.utils.debugLogToFile(f"RESPONSE LENGTH: {len(result)} characters", "RENDERER")
self.logger.warning(f"Raw content that failed to parse: {result}")
# Try to fix incomplete JSON by adding missing closing braces
openBraces = result.count('{')
closeBraces = result.count('}')
if openBraces > closeBraces:
# JSON is incomplete, add missing closing braces
missingBraces = openBraces - closeBraces
result = result + '}' * missingBraces
self.logger.info(f"Added {missingBraces} missing closing brace(s)")
self.logger.debug(f"Fixed JSON: {result}")
# Try parsing the fixed JSON
try:
styles = json.loads(result)
self.logger.info("Successfully fixed incomplete JSON")
except json.JSONDecodeError as fixError:
self.logger.warning(f"Fixed JSON still invalid: {fixError}")
self.logger.warning(f"Fixed JSON content: {result}")
# Try to extract just the JSON part if it's embedded in text
jsonStart = result.find('{')
jsonEnd = result.rfind('}')
if jsonStart != -1 and jsonEnd != -1 and jsonEnd > jsonStart:
jsonPart = result[jsonStart:jsonEnd+1]
try:
styles = json.loads(jsonPart)
self.logger.info("Successfully extracted JSON from explanatory text")
except json.JSONDecodeError:
self.logger.warning("Could not extract valid JSON from response, using defaults")
return defaultStyles
else:
return defaultStyles
else:
# Try to extract just the JSON part if it's embedded in text
jsonStart = result.find('{')
jsonEnd = result.rfind('}')
if jsonStart != -1 and jsonEnd != -1 and jsonEnd > jsonStart:
jsonPart = result[jsonStart:jsonEnd+1]
try:
styles = json.loads(jsonPart)
self.logger.info("Successfully extracted JSON from explanatory text")
except json.JSONDecodeError:
self.logger.warning("Could not extract valid JSON from response, using defaults")
return defaultStyles
else:
return defaultStyles
# Convert colors to appropriate format
styles = self._convertColorsFormat(styles)
return styles
except Exception as e:
self.logger.warning(f"AI styling failed: {str(e)}, using defaults")
return defaultStyles
def _convertColorsFormat(self, styles: Dict[str, Any]) -> Dict[str, Any]:
"""
Convert colors to appropriate format based on renderer type.
Override this method in subclasses for format-specific color handling.
"""
return styles
def _createAiStyleTemplate(self, formatName: str, userPrompt: str, styleSchema: Dict[str, Any]) -> str:
"""
Create a standardized AI style template for any format.
Args:
formatName: Name of the format (e.g., "docx", "xlsx", "pptx")
userPrompt: User's original prompt
styleSchema: Format-specific style schema
Returns:
Formatted prompt string
"""
schemaJson = json.dumps(styleSchema, indent=4)
# The schema is embedded verbatim in the prompt below
return f"""You are a professional document styling expert. Generate a complete JSON styling configuration for {formatName.upper()} documents.
User request: {userPrompt}
Use this schema as a template:
{schemaJson}
Requirements:
- Return ONLY the complete JSON object (no markdown, no explanations)
- If the user request contains style/formatting/design instructions (in any language), customize the styling accordingly (adapt styles and add styles if needed)
- If the user request has NO style instructions, return the default schema values unchanged
- Ensure all objects are properly closed with closing braces
- Only modify styles if style instructions are present in the user request
Return the complete JSON:"""

View file

@ -0,0 +1,238 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Renderer registry for automatic discovery and registration of renderers.
Renderers are indexed by (format, outputStyle) so that document generation
and code generation each get the correct renderer for the same format.
"""
import logging
import importlib
from typing import Dict, Type, List, Optional, Tuple
from .documentRendererBaseTemplate import BaseRenderer
logger = logging.getLogger(__name__)
class RendererRegistry:
"""Registry for automatic renderer discovery and management.
Maintains separate renderer mappings per outputStyle ('document', 'code', etc.)
so that document-generation and code-generation paths each resolve to the
correct renderer, even when both support the same format (e.g. 'csv').
"""
def __init__(self):
# Key: (formatName, outputStyle) -> rendererClass
self._renderers: Dict[Tuple[str, str], Type[BaseRenderer]] = {}
self._format_mappings: Dict[str, str] = {}
self._discovered = False
def discoverRenderers(self) -> None:
"""Automatically discover and register all renderers by scanning files."""
if self._discovered:
return
try:
from pathlib import Path
currentDir = Path(__file__).parent
packageName = __name__.rsplit('.', 1)[0]
for filePath in currentDir.glob("*.py"):
if filePath.name in ['registry.py', 'documentRendererBaseTemplate.py', 'codeRendererBaseTemplate.py', '__init__.py']:
continue
moduleName = filePath.stem
try:
fullModuleName = f"{packageName}.{moduleName}"
module = importlib.import_module(fullModuleName)
for attrName in dir(module):
attr = getattr(module, attrName)
if (isinstance(attr, type) and
issubclass(attr, BaseRenderer) and
attr != BaseRenderer and
hasattr(attr, 'getSupportedFormats')):
self._registerRendererClass(attr)
except Exception as e:
logger.warning(f"Could not load renderer from {moduleName}: {str(e)}")
continue
self._discovered = True
except Exception as e:
logger.error(f"Error during renderer discovery: {str(e)}")
self._discovered = True
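# A renderer becomes discoverable simply by living in this package and
# subclassing BaseRenderer with a getSupportedFormats classmethod, e.g.
# (illustrative sketch, not an actual renderer in this commit):
#
#     class RendererTxt(BaseRenderer):
#         @classmethod
#         def getSupportedFormats(cls) -> List[str]:
#             return ['txt']
#         @classmethod
#         def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
#             return 'document'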
def _registerRendererClass(self, rendererClass: Type[BaseRenderer]) -> None:
"""Register a renderer class keyed by (format, outputStyle)."""
try:
supportedFormats = rendererClass.getSupportedFormats()
outputStyle = rendererClass.getOutputStyle() if hasattr(rendererClass, 'getOutputStyle') else 'document'
priority = rendererClass.getPriority() if hasattr(rendererClass, 'getPriority') else 0
for formatName in supportedFormats:
formatKey = formatName.lower()
registryKey = (formatKey, outputStyle)
if registryKey in self._renderers:
existingRenderer = self._renderers[registryKey]
existingPriority = existingRenderer.getPriority() if hasattr(existingRenderer, 'getPriority') else 0
if priority > existingPriority:
logger.debug(f"Replacing {existingRenderer.__name__} with {rendererClass.__name__} for ({formatKey}, {outputStyle}) (priority {priority} > {existingPriority})")
self._renderers[registryKey] = rendererClass
else:
logger.debug(f"Keeping {existingRenderer.__name__} for ({formatKey}, {outputStyle}) (priority {existingPriority} >= {priority})")
else:
self._renderers[registryKey] = rendererClass
# Register aliases
if hasattr(rendererClass, 'getFormatAliases'):
aliases = rendererClass.getFormatAliases()
for alias in aliases:
self._format_mappings[alias.lower()] = formatKey
logger.debug(f"Registered {rendererClass.__name__} for formats={supportedFormats}, style={outputStyle}, priority={priority}")
except Exception as e:
logger.error(f"Error registering renderer {rendererClass.__name__}: {str(e)}")
def getRenderer(self, outputFormat: str, services=None, outputStyle: Optional[str] = None) -> Optional[BaseRenderer]:
"""Get a renderer instance for the specified format and style.
Args:
outputFormat: Format name (e.g. 'csv', 'json', 'pdf')
services: Services instance passed to renderer constructor
outputStyle: 'document' or 'code'. If None, returns the first match
with preference: document > code (most callers are document path).
"""
if not self._discovered:
self.discoverRenderers()
formatName = outputFormat.lower().strip()
if formatName in self._format_mappings:
formatName = self._format_mappings[formatName]
rendererClass = None
if outputStyle:
# Exact match by style
rendererClass = self._renderers.get((formatName, outputStyle))
else:
# No style specified — prefer 'document', then 'code', then any
for style in ['document', 'code']:
rendererClass = self._renderers.get((formatName, style))
if rendererClass:
break
# Fallback: check any registered style
if not rendererClass:
for key, cls in self._renderers.items():
if key[0] == formatName:
rendererClass = cls
break
if rendererClass:
try:
return rendererClass(services=services)
except Exception as e:
logger.error(f"Error creating renderer instance for {formatName}: {str(e)}")
return None
logger.warning(f"No renderer found for format={outputFormat}, style={outputStyle}")
return None
def getSupportedFormats(self) -> List[str]:
"""Get list of all supported formats."""
if not self._discovered:
self.discoverRenderers()
formats = set()
for (fmt, _style) in self._renderers.keys():
formats.add(fmt)
formats.update(self._format_mappings.keys())
return sorted(formats)
def getRendererInfo(self) -> Dict[str, Dict[str, str]]:
"""Get information about all registered renderers."""
if not self._discovered:
self.discoverRenderers()
info = {}
for (formatName, style), rendererClass in self._renderers.items():
key = f"{formatName}:{style}"
info[key] = {
'class_name': rendererClass.__name__,
'module': rendererClass.__module__,
'outputStyle': style,
'description': getattr(rendererClass, '__doc__', 'No description').strip().split('\n')[0] if rendererClass.__doc__ else 'No description'
}
return info
def getOutputStyle(self, outputFormat: str) -> Optional[str]:
"""
Get the output style classification for a given format.
When both 'document' and 'code' renderers exist for a format,
returns the default ('document') since this is called during document generation.
"""
if not self._discovered:
self.discoverRenderers()
formatName = outputFormat.lower().strip()
if formatName in self._format_mappings:
formatName = self._format_mappings[formatName]
# Check document first, then code
for style in ['document', 'code']:
rendererClass = self._renderers.get((formatName, style))
if rendererClass:
try:
return rendererClass.getOutputStyle(formatName)
except Exception:
pass
# Fallback: any style
for key, rendererClass in self._renderers.items():
if key[0] == formatName:
try:
return rendererClass.getOutputStyle(formatName)
except Exception:
pass
logger.warning(f"No renderer found for format: {outputFormat}, cannot determine output style")
return None
# Global registry instance
_registry = RendererRegistry()
def getRenderer(outputFormat: str, services=None, outputStyle: Optional[str] = None) -> Optional[BaseRenderer]:
"""Get a renderer instance for the specified format and style.
Args:
outputFormat: Format name (e.g. 'csv', 'json', 'pdf')
services: Services instance
outputStyle: 'document' or 'code'. If None, prefers document renderer.
"""
return _registry.getRenderer(outputFormat, services, outputStyle=outputStyle)
def getSupportedFormats() -> List[str]:
"""Get list of all supported formats."""
return _registry.getSupportedFormats()
def getRendererInfo() -> Dict[str, Dict[str, str]]:
"""Get information about all registered renderers."""
return _registry.getRendererInfo()
def getOutputStyle(outputFormat: str) -> Optional[str]:
"""Get the output style classification for a given format."""
return _registry.getOutputStyle(outputFormat)
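Because renderers are keyed by `(format, outputStyle)`, one format can resolve to two different classes. A hypothetical caller is sketched below; the import path is an assumption for illustration, so substitute the actual package path of this module:

```python
# Illustrative usage only — the package path below is an assumption.
from modules.features.generation.renderers import registry

# services=None for brevity; real callers pass their Services instance
docRenderer = registry.getRenderer('csv', services=None, outputStyle='document')
codeRenderer = registry.getRenderer('csv', services=None, outputStyle='code')

# With no outputStyle, 'document' wins when both styles are registered:
defaultRenderer = registry.getRenderer('csv', services=None)

# Aliases resolve too: 'spreadsheet' and 'table' map back to 'csv'
print(registry.getSupportedFormats())
```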

View file

@ -0,0 +1,159 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
CSV code renderer for code generation.
"""
from .codeRendererBaseTemplate import BaseCodeRenderer
from modules.datamodels.datamodelDocument import RenderedDocument
from typing import Dict, Any, List, Optional
import csv
import io
class RendererCodeCsv(BaseCodeRenderer):
"""Renders CSV code files."""
@classmethod
def getSupportedFormats(cls) -> List[str]:
"""Return supported CSV formats."""
return ['csv']
@classmethod
def getFormatAliases(cls) -> List[str]:
"""Return format aliases."""
return []
@classmethod
def getPriority(cls) -> int:
"""Return priority for CSV code renderer."""
return 75 # Higher than document renderer (70) for code generation
@classmethod
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
"""Return output style classification: CSV requires specific structure."""
return 'code'
async def renderCodeFiles(
self,
codeFiles: List[Dict[str, Any]],
metadata: Dict[str, Any],
userPrompt: str = None
) -> List[RenderedDocument]:
"""
Render CSV code files.
For single file: output as-is (validate structure)
For multiple files: output separately (each is independent CSV)
"""
renderedDocs = []
for codeFile in codeFiles:
if not self._validateCodeFile(codeFile):
self.logger.warning(f"Invalid code file: {codeFile.get('filename', 'unknown')}")
continue
filename = codeFile['filename']
content = codeFile['content']
# Validate CSV structure (header row, consistent columns)
validatedContent = self._validateAndFixCsv(content)
# Extract CSV statistics for validation
csvStats = self._extractCsvStatistics(validatedContent)
# Merge file-specific metadata with project metadata
fileMetadata = dict(metadata) if metadata else {}
fileMetadata.update({
"filename": filename,
"fileType": "csv",
"statistics": csvStats
})
renderedDocs.append(
RenderedDocument(
documentData=validatedContent.encode('utf-8'),
mimeType="text/csv",
filename=filename,
metadata=fileMetadata
)
)
return renderedDocs
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
"""
Render method for document generation compatibility.
Delegates to document renderer if needed, or handles code files directly.
"""
# Check if this is code generation (has files array) or document generation (has documents array)
if "files" in extractedContent:
# Code generation path - use renderCodeFiles
files = extractedContent.get("files", [])
metadata = extractedContent.get("metadata", {})
return await self.renderCodeFiles(files, metadata, userPrompt)
else:
# Document generation path - delegate to document renderer
from .rendererCsv import RendererCsv
documentRenderer = RendererCsv(self.services)
return await documentRenderer.render(extractedContent, title, userPrompt, aiService)
def _validateAndFixCsv(self, content: str) -> str:
"""Validate CSV structure and fix common issues."""
try:
# Parse CSV to validate structure
reader = csv.reader(io.StringIO(content))
rows = list(reader)
if not rows:
return content # Empty CSV
# Check header row exists
headerRow = rows[0]
headerCount = len(headerRow)
# Validate all rows have same column count
fixedRows = [headerRow] # Start with header
for i, row in enumerate(rows[1:], 1):
if len(row) != headerCount:
self.logger.debug(f"Row {i} has {len(row)} columns, expected {headerCount}. Auto-fixing...")
# Pad or truncate to match header
if len(row) < headerCount:
row.extend([''] * (headerCount - len(row)))
else:
row = row[:headerCount]
fixedRows.append(row)
# Convert back to CSV string
output = io.StringIO()
writer = csv.writer(output)
for row in fixedRows:
writer.writerow(row)
return output.getvalue()
except Exception as e:
self.logger.warning(f"CSV validation failed: {e}, returning original content")
return content
def _extractCsvStatistics(self, content: str) -> Dict[str, Any]:
"""Extract CSV statistics for validation (row count, column count, headers)."""
try:
reader = csv.reader(io.StringIO(content))
rows = list(reader)
if not rows:
return {"rowCount": 0, "columnCount": 0, "headerRow": []}
headerRow = rows[0]
columnCount = len(headerRow)
rowCount = len(rows) - 1 # Exclude header
return {
"rowCount": rowCount,
"columnCount": columnCount,
"headerRow": headerRow,
"dataRowCount": rowCount
}
except Exception as e:
self.logger.warning(f"CSV statistics extraction failed: {e}")
return {}
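The pad/truncate rule in `_validateAndFixCsv` is the whole trick. Restated as a standalone sketch (the function name is illustrative):

```python
import csv
import io


def fix_rows(content: str) -> str:
    """Pad short rows / truncate long rows to the header's column count."""
    rows = list(csv.reader(io.StringIO(content)))
    if not rows:
        return content
    width = len(rows[0])
    fixed = [rows[0]] + [row[:width] + [''] * (width - len(row)) for row in rows[1:]]
    out = io.StringIO()
    csv.writer(out).writerows(fixed)
    return out.getvalue()


print(fix_rows("name,age\nalice,30,extra\nbob\n"))
# name,age
# alice,30   <- extra cell truncated
# bob,       <- short row padded with an empty cell
```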

View file

@ -0,0 +1,141 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
JSON code renderer for code generation.
"""
from .codeRendererBaseTemplate import BaseCodeRenderer
from modules.datamodels.datamodelDocument import RenderedDocument
from typing import Dict, Any, List, Optional
import json
class RendererCodeJson(BaseCodeRenderer):
"""Renders JSON code files."""
@classmethod
def getSupportedFormats(cls) -> List[str]:
"""Return supported JSON formats."""
return ['json']
@classmethod
def getFormatAliases(cls) -> List[str]:
"""Return format aliases."""
return []
@classmethod
def getPriority(cls) -> int:
"""Return priority for JSON code renderer."""
return 85 # Higher than document renderer (80) for code generation
@classmethod
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
"""Return output style classification: JSON is structured data format."""
return 'code'
async def renderCodeFiles(
self,
codeFiles: List[Dict[str, Any]],
metadata: Dict[str, Any],
userPrompt: str = None
) -> List[RenderedDocument]:
"""
Render JSON code files.
For single file: output as-is
For multiple files: output separately (each file is independent JSON)
"""
renderedDocs = []
for codeFile in codeFiles:
if not self._validateCodeFile(codeFile):
self.logger.warning(f"Invalid code file: {codeFile.get('filename', 'unknown')}")
continue
filename = codeFile['filename']
content = codeFile['content']
# Validate JSON syntax and extract statistics
parsed = None
try:
parsed = json.loads(content) # Validate JSON
except json.JSONDecodeError as e:
self.logger.warning(f"Invalid JSON in {filename}: {e}")
# Could fix/format JSON here if needed
# Format JSON (pretty print)
try:
if parsed is None:
parsed = json.loads(content)
formattedContent = json.dumps(parsed, indent=2, ensure_ascii=False)
except Exception:
formattedContent = content # Use original if formatting fails
# Extract JSON statistics for validation
jsonStats = self._extractJsonStatistics(parsed) if parsed is not None else {}
# Merge file-specific metadata with project metadata
fileMetadata = dict(metadata) if metadata else {}
fileMetadata.update({
"filename": filename,
"fileType": "json",
"statistics": jsonStats
})
renderedDocs.append(
RenderedDocument(
documentData=formattedContent.encode('utf-8'),
mimeType="application/json",
filename=filename,
metadata=fileMetadata
)
)
return renderedDocs
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
"""
Render method for document generation compatibility.
Delegates to document renderer if needed, or handles code files directly.
"""
# Check if this is code generation (has files array) or document generation (has documents array)
if "files" in extractedContent:
# Code generation path - use renderCodeFiles
files = extractedContent.get("files", [])
metadata = extractedContent.get("metadata", {})
return await self.renderCodeFiles(files, metadata, userPrompt)
else:
# Document generation path - delegate to document renderer
# Import here to avoid circular dependency
from .rendererJson import RendererJson
documentRenderer = RendererJson(self.services)
return await documentRenderer.render(extractedContent, title, userPrompt, aiService)
def _extractJsonStatistics(self, parsed: Any) -> Dict[str, Any]:
"""Extract JSON statistics for validation (object count, array count, key count)."""
try:
stats = {
"isArray": isinstance(parsed, list),
"isObject": isinstance(parsed, dict),
"itemCount": 0,
"keyCount": 0
}
if isinstance(parsed, list):
stats["itemCount"] = len(parsed)
# Count nested objects/arrays
objectCount = sum(1 for item in parsed if isinstance(item, dict))
arrayCount = sum(1 for item in parsed if isinstance(item, list))
stats["objectCount"] = objectCount
stats["arrayCount"] = arrayCount
elif isinstance(parsed, dict):
stats["keyCount"] = len(parsed)
stats["keys"] = list(parsed.keys())
# Count nested objects/arrays
objectCount = sum(1 for v in parsed.values() if isinstance(v, dict))
arrayCount = sum(1 for v in parsed.values() if isinstance(v, list))
stats["objectCount"] = objectCount
stats["arrayCount"] = arrayCount
return stats
except Exception as e:
self.logger.warning(f"JSON statistics extraction failed: {e}")
return {}
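A quick illustration of what `_extractJsonStatistics` reports for a small object; the expected values below are derived by hand from the logic above:

```python
import json

payload = json.loads('{"users": [{"id": 1}, {"id": 2}], "active": true}')
# For this dict the method returns:
# {
#     "isArray": False, "isObject": True,
#     "itemCount": 0, "keyCount": 2,
#     "keys": ["users", "active"],
#     "objectCount": 0,   # no top-level values are dicts
#     "arrayCount": 1     # "users" is a list
# }
```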

View file

@ -0,0 +1,148 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
XML code renderer for code generation.
"""
from .codeRendererBaseTemplate import BaseCodeRenderer
from modules.datamodels.datamodelDocument import RenderedDocument
from typing import Dict, Any, List, Optional
import xml.etree.ElementTree as ET
from xml.dom import minidom
class RendererCodeXml(BaseCodeRenderer):
"""Renders XML code files."""
@classmethod
def getSupportedFormats(cls) -> List[str]:
"""Return supported XML formats."""
return ['xml']
@classmethod
def getFormatAliases(cls) -> List[str]:
"""Return format aliases."""
return []
@classmethod
def getPriority(cls) -> int:
"""Return priority for XML code renderer."""
return 80
@classmethod
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
"""Return output style classification: XML is structured data format."""
return 'code'
async def renderCodeFiles(
self,
codeFiles: List[Dict[str, Any]],
metadata: Dict[str, Any],
userPrompt: str = None
) -> List[RenderedDocument]:
"""
Render XML code files.
Validates XML syntax and formats (pretty print).
"""
renderedDocs = []
for codeFile in codeFiles:
if not self._validateCodeFile(codeFile):
self.logger.warning(f"Invalid code file: {codeFile.get('filename', 'unknown')}")
continue
filename = codeFile['filename']
content = codeFile['content']
# Validate and format XML
formattedContent = self._validateAndFormatXml(content)
# Extract XML statistics for validation
xmlStats = self._extractXmlStatistics(formattedContent)
# Merge file-specific metadata with project metadata
fileMetadata = dict(metadata) if metadata else {}
fileMetadata.update({
"filename": filename,
"fileType": "xml",
"statistics": xmlStats
})
renderedDocs.append(
RenderedDocument(
documentData=formattedContent.encode('utf-8'),
mimeType="application/xml",
filename=filename,
metadata=fileMetadata
)
)
return renderedDocs
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
"""
Render method for document generation compatibility.
For XML, we only support code generation (no document renderer exists yet).
"""
# Check if this is code generation (has files array)
if "files" in extractedContent:
# Code generation path - use renderCodeFiles
files = extractedContent.get("files", [])
metadata = extractedContent.get("metadata", {})
return await self.renderCodeFiles(files, metadata, userPrompt)
else:
# Document generation path - not supported yet, return error
self.logger.warning("XML document generation not supported, only code generation")
return [
RenderedDocument(
documentData=f"XML document generation not yet supported".encode('utf-8'),
mimeType="text/plain",
filename="error.txt",
metadata={}
)
]
def _validateAndFormatXml(self, content: str) -> str:
"""Validate XML syntax and format (pretty print)."""
try:
# Parse XML to validate
root = ET.fromstring(content)
# Format XML (pretty print)
rough_string = ET.tostring(root, encoding='unicode')
reparsed = minidom.parseString(rough_string)
formatted = reparsed.toprettyxml(indent=" ")
# Remove extra blank lines
lines = [line for line in formatted.split('\n') if line.strip()]
return '\n'.join(lines)
except ET.ParseError as e:
self.logger.warning(f"Invalid XML: {e}, returning original content")
return content
except Exception as e:
self.logger.warning(f"XML formatting failed: {e}, returning original content")
return content
def _extractXmlStatistics(self, content: str) -> Dict[str, Any]:
"""Extract XML statistics for validation (element count, attribute count, root element)."""
try:
root = ET.fromstring(content)
# Count all elements recursively
elementCount = len(list(root.iter()))
# Count attributes
attributeCount = sum(len(elem.attrib) for elem in root.iter())
# Get root element name
rootElement = root.tag
return {
"elementCount": elementCount,
"attributeCount": attributeCount,
"rootElement": rootElement,
"hasRoot": True
}
except Exception as e:
self.logger.warning(f"XML statistics extraction failed: {e}")
return {}
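The validate-and-pretty-print step, shown standalone (the blank-line filtering mirrors `_validateAndFormatXml` above):

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

raw = '<config><item key="a">1</item><item key="b">2</item></config>'
root = ET.fromstring(raw)  # raises ET.ParseError on invalid XML
pretty = minidom.parseString(ET.tostring(root, encoding='unicode')).toprettyxml(indent="  ")
# minidom inserts blank lines between elements; drop them
print('\n'.join(line for line in pretty.split('\n') if line.strip()))
```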

View file

@ -0,0 +1,415 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
CSV renderer for report generation.
"""
from .documentRendererBaseTemplate import BaseRenderer
from modules.datamodels.datamodelDocument import RenderedDocument
from typing import Dict, Any, List, Optional
class RendererCsv(BaseRenderer):
"""Renders content to CSV format with format-specific extraction."""
@classmethod
def getSupportedFormats(cls) -> List[str]:
"""Return supported CSV formats."""
return ['csv']
@classmethod
def getFormatAliases(cls) -> List[str]:
"""Return format aliases."""
return ['spreadsheet', 'table']
@classmethod
def getPriority(cls) -> int:
"""Return priority for CSV renderer."""
return 70
@classmethod
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
"""Return output style classification: CSV document renderer converts structured document content to CSV."""
return 'document'
@classmethod
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
"""
Return list of section content types that CSV renderer accepts.
CSV renderer accepts table sections and code_block sections (for raw CSV content).
"""
return ["table", "code_block"]
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
"""Render extracted JSON content to CSV format. Produces one CSV file per table section."""
try:
# Validate JSON structure
if not self._validateJsonStructure(extractedContent):
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
# Extract sections and metadata
sections = self._extractSections(extractedContent)
metadata = self._extractMetadata(extractedContent)
# Determine base filename from document or title
documents = extractedContent.get("documents", [])
baseFilename = None
if documents and isinstance(documents[0], dict):
baseFilename = documents[0].get("filename")
if not baseFilename:
baseFilename = self._determineFilename(title, "text/csv")
# Remove extension from base filename if present
if baseFilename.endswith('.csv'):
baseFilename = baseFilename[:-4]
# Collect CSV-producing sections: table sections AND code_block sections with CSV language
tableSections = []
codeBlockCsvSections = []
for section in sections:
sectionType = section.get("content_type", "paragraph")
if sectionType == "table":
tableSections.append(section)
elif sectionType == "code_block":
# Check if any element is a code_block with language "csv"
for element in section.get("elements", []):
content = element.get("content", {})
if isinstance(content, dict) and content.get("language", "").lower() == "csv":
codeBlockCsvSections.append(section)
break
# If no usable sections found, return empty CSV
if not tableSections and not codeBlockCsvSections:
self.logger.warning("No table or CSV code_block sections found in CSV document - returning empty CSV")
emptyCsv = self._convertRowsToCsv([["No table data available"]])
return [
RenderedDocument(
documentData=emptyCsv.encode('utf-8'),
mimeType="text/csv",
filename=self._determineFilename(title, "text/csv"),
documentType=metadata.get("documentType") if isinstance(metadata, dict) else None,
metadata=metadata if isinstance(metadata, dict) else None
)
]
allCsvSections = tableSections + codeBlockCsvSections
# Generate one CSV file per section
renderedDocuments = []
for i, csvSection in enumerate(allCsvSections):
sectionType = csvSection.get("content_type", "paragraph")
sectionTitle = csvSection.get("title")
csvContent = ""
if sectionType == "code_block":
# Extract raw CSV content directly from code_block elements
rawCsvParts = []
for element in csvSection.get("elements", []):
content = element.get("content", {})
if isinstance(content, dict) and content.get("language", "").lower() == "csv":
code = content.get("code", "")
if code:
rawCsvParts.append(code)
csvContent = "\n".join(rawCsvParts)
else:
# Table section — render via table logic
csvRows = []
if sectionTitle:
csvRows.append([sectionTitle])
csvRows.append([]) # Empty row after title
elements = csvSection.get("elements", [])
for element in elements:
tableRows = self._renderJsonTableToCsv(element)
if tableRows:
csvRows.extend(tableRows)
csvContent = self._convertRowsToCsv(csvRows)
# Determine filename
if len(allCsvSections) == 1:
filename = f"{baseFilename}.csv"
else:
sectionId = csvSection.get("id", f"csv_{i+1}")
if sectionTitle:
safeTitle = "".join(c for c in sectionTitle if c.isalnum() or c in (' ', '-', '_')).strip()
safeTitle = safeTitle.replace(' ', '_')[:30]
filename = f"{baseFilename}_{safeTitle}.csv"
else:
filename = f"{baseFilename}_{sectionId}.csv"
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
renderedDocuments.append(
RenderedDocument(
documentData=csvContent.encode('utf-8'),
mimeType="text/csv",
filename=filename,
documentType=documentType,
metadata=metadata if isinstance(metadata, dict) else None
)
)
return renderedDocuments
except Exception as e:
self.logger.error(f"Error rendering CSV: {str(e)}")
# Return minimal CSV fallback
fallbackCsv = self._convertRowsToCsv([["Title", "Content"], [title, f"Error rendering report: {str(e)}"]])
return [
RenderedDocument(
documentData=fallbackCsv.encode('utf-8'),
mimeType="text/csv",
filename=self._determineFilename(title, "text/csv"),
metadata=extractedContent.get("metadata", {}) if extractedContent else None
)
]
async def _generateCsvFromJson(self, jsonContent: Dict[str, Any], title: str) -> str:
"""Generate CSV content from structured JSON document. DEPRECATED: Use render() method instead."""
# This method is kept for backward compatibility but is no longer used
# The render() method now handles CSV generation directly
try:
# Validate JSON structure (standardized schema: {metadata: {...}, documents: [{sections: [...]}]})
if not self._validateJsonStructure(jsonContent):
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
# Extract sections and metadata from standardized schema
sections = self._extractSections(jsonContent)
metadata = self._extractMetadata(jsonContent)
# Use provided title (which comes from documents[].title) as primary source
# Fallback to metadata.title only if title parameter is empty
documentTitle = title if title else metadata.get("title", "Generated Document")
# Generate CSV content
csvRows = []
# Add title row
if documentTitle:
csvRows.append([documentTitle])
csvRows.append([]) # Empty row
# Process each section in order - only table sections
for section in sections:
sectionType = section.get("content_type", "paragraph")
if sectionType == "table":
sectionCsv = self._renderJsonSectionToCsv(section)
if sectionCsv:
csvRows.extend(sectionCsv)
csvRows.append([]) # Empty row between sections
# Convert to CSV string
csvContent = self._convertRowsToCsv(csvRows)
return csvContent
except Exception as e:
self.logger.error(f"Error generating CSV from JSON: {str(e)}")
raise Exception(f"CSV generation failed: {str(e)}")
def _renderJsonSectionToCsv(self, section: Dict[str, Any]) -> List[List[str]]:
"""Render a single JSON section to CSV rows."""
try:
sectionType = section.get("content_type", "paragraph")
elements = section.get("elements", [])
csvRows = []
# Add section title if available
sectionTitle = section.get("title")
if sectionTitle:
csvRows.append([f"# {sectionTitle}"])
# Process each element in the section
for element in elements:
if sectionType == "table":
csvRows.extend(self._renderJsonTableToCsv(element))
elif sectionType == "list":
csvRows.extend(self._renderJsonListToCsv(element))
elif sectionType == "heading":
csvRows.extend(self._renderJsonHeadingToCsv(element))
elif sectionType == "paragraph":
csvRows.extend(self._renderJsonParagraphToCsv(element))
elif sectionType == "code":
csvRows.extend(self._renderJsonCodeToCsv(element))
else:
# Fallback to paragraph for unknown types
csvRows.extend(self._renderJsonParagraphToCsv(element))
return csvRows
except Exception as e:
self.logger.warning(f"Error rendering section {section.get('id', 'unknown')}: {str(e)}")
return [["[Error rendering section]"]]
def _renderJsonTableToCsv(self, tableData: Dict[str, Any]) -> List[List[str]]:
"""Render a JSON table to CSV rows."""
try:
# Extract from nested content structure
content = tableData.get("content", {})
if not isinstance(content, dict):
return []
headers = content.get("headers", [])
rows = content.get("rows", [])
csvRows = []
if headers:
csvRows.append(headers)
if rows:
csvRows.extend(rows)
return csvRows
except Exception as e:
self.logger.warning(f"Error rendering table: {str(e)}")
return [["[Error rendering table]"]]
def _renderJsonListToCsv(self, listData: Dict[str, Any]) -> List[List[str]]:
"""Render a JSON list to CSV rows."""
try:
# Extract from nested content structure
content = listData.get("content", {})
if not isinstance(content, dict):
return []
items = content.get("items", [])
csvRows = []
for item in items:
if isinstance(item, dict):
text = item.get("text", "")
subitems = item.get("subitems", [])
csvRows.append([text])
# Add subitems as indented rows
for subitem in subitems:
if isinstance(subitem, dict):
csvRows.append([f" - {subitem.get('text', '')}"])
else:
csvRows.append([f" - {subitem}"])
else:
csvRows.append([str(item)])
return csvRows
except Exception as e:
self.logger.warning(f"Error rendering list: {str(e)}")
return [["[Error rendering list]"]]
def _renderJsonHeadingToCsv(self, headingData: Dict[str, Any]) -> List[List[str]]:
"""Render a JSON heading to CSV rows."""
try:
# Extract from nested content structure
content = headingData.get("content", {})
if not isinstance(content, dict):
return []
text = content.get("text", "")
level = content.get("level", 1)
if text:
# Use # symbols for heading levels
headingText = f"{'#' * level} {text}"
return [[headingText]]
return []
except Exception as e:
self.logger.warning(f"Error rendering heading: {str(e)}")
return [["[Error rendering heading]"]]
def _renderJsonParagraphToCsv(self, paragraphData: Dict[str, Any]) -> List[List[str]]:
"""Render a JSON paragraph to CSV rows."""
try:
# Extract from nested content structure
content = paragraphData.get("content", {})
if isinstance(content, dict):
text = content.get("text", "")
elif isinstance(content, str):
text = content
else:
text = ""
if text:
# Split long paragraphs into multiple rows if needed
if len(text) > 100:
words = text.split()
rows = []
currentRow = []
currentLength = 0
for word in words:
if currentLength + len(word) > 100 and currentRow:
rows.append([" ".join(currentRow)])
currentRow = [word]
currentLength = len(word)
else:
currentRow.append(word)
currentLength += len(word) + 1
if currentRow:
rows.append([" ".join(currentRow)])
return rows
else:
return [[text]]
return []
except Exception as e:
self.logger.warning(f"Error rendering paragraph: {str(e)}")
return [["[Error rendering paragraph]"]]
def _renderJsonCodeToCsv(self, codeData: Dict[str, Any]) -> List[List[str]]:
"""Render a JSON code block to CSV rows."""
try:
# Extract from nested content structure
content = codeData.get("content", {})
if not isinstance(content, dict):
return []
code = content.get("code", "")
language = content.get("language", "")
csvRows = []
if language:
csvRows.append([f"Code ({language}):"])
if code:
# Split code into lines
codeLines = code.split('\n')
for line in codeLines:
csvRows.append([f" {line}"])
return csvRows
except Exception as e:
self.logger.warning(f"Error rendering code block: {str(e)}")
return [["[Error rendering code block]"]]
def _convertRowsToCsv(self, rows: List[List[str]]) -> str:
"""Convert rows to CSV string."""
import csv
import io
output = io.StringIO()
writer = csv.writer(output)
for row in rows:
if row: # Only write non-empty rows
writer.writerow(row)
return output.getvalue()
def _cleanCsvContent(self, content: str, title: str) -> str:
"""Clean and validate CSV content from AI."""
content = content.strip()
# Remove markdown code blocks if present
if content.startswith("```") and content.endswith("```"):
lines = content.split('\n')
if len(lines) > 2:
content = '\n'.join(lines[1:-1]).strip()
return content
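When a document contains several table sections, each becomes its own CSV file and the filename is derived from the section title. The sanitization rule, restated as a standalone sketch (function name illustrative):

```python
from typing import Optional


def section_filename(base: str, title: Optional[str], section_id: str, index: int) -> str:
    """Mirror of the multi-section filename logic in render() above."""
    if title:
        safe = "".join(c for c in title if c.isalnum() or c in (' ', '-', '_')).strip()
        return f"{base}_{safe.replace(' ', '_')[:30]}.csv"
    return f"{base}_{section_id or f'csv_{index + 1}'}.csv"


print(section_filename("report", "Q3 Revenue: EU", "sec_2", 1))  # report_Q3_Revenue_EU.csv
```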

File diff suppressed because it is too large

View file

@ -0,0 +1,841 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
HTML renderer for report generation.
"""
from .documentRendererBaseTemplate import BaseRenderer
from modules.datamodels.datamodelDocument import RenderedDocument
from typing import Dict, Any, List, Optional
class RendererHtml(BaseRenderer):
"""Renders content to HTML format with format-specific extraction."""
@classmethod
def getSupportedFormats(cls) -> List[str]:
"""Return supported HTML formats."""
return ['html', 'htm']
@classmethod
def getFormatAliases(cls) -> List[str]:
"""Return format aliases."""
return ['web', 'webpage']
@classmethod
def getPriority(cls) -> int:
"""Return priority for HTML renderer."""
return 100
@classmethod
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
"""Return output style classification: HTML web pages are rendered documents."""
return 'document'
@classmethod
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
"""
Return list of section content types that HTML renderer accepts.
HTML renderer accepts all section types (HTML pages can contain all content types including images).
"""
from modules.datamodels.datamodelJson import supportedSectionTypes
return list(supportedSectionTypes)
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
"""
Render HTML document with images as separate files.
Returns list of documents: [HTML document, image1, image2, ...]
"""
import base64
# Extract images first
images = self._extractImages(extractedContent)
# Store images in instance for later retrieval
self._renderedImages = images
# Generate HTML using AI-analyzed styling
htmlContent = await self._generateHtmlFromJson(extractedContent, title, userPrompt, aiService)
# Replace base64 data URIs with relative file paths if images exist
if images:
htmlContent = self._replaceImageDataUris(htmlContent, images)
# Determine HTML filename from document or title
documents = extractedContent.get("documents", [])
if documents and isinstance(documents[0], dict):
htmlFilename = documents[0].get("filename")
if not htmlFilename:
htmlFilename = self._determineFilename(title, "text/html")
else:
htmlFilename = self._determineFilename(title, "text/html")
# Extract metadata for document type and other info
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
# Start with HTML document
resultDocuments = [
RenderedDocument(
documentData=htmlContent.encode('utf-8'),
mimeType="text/html",
filename=htmlFilename,
documentType=documentType,
metadata=metadata if isinstance(metadata, dict) else None
)
]
# Add images as separate documents
for img in images:
base64Data = img.get("base64Data", "")
filename = img.get("filename", f"image_{len(resultDocuments)}.png")
mimeType = img.get("mimeType", "image/png")
if base64Data:
try:
# Decode base64 to bytes
imageBytes = base64.b64decode(base64Data)
resultDocuments.append(
RenderedDocument(
documentData=imageBytes,
mimeType=mimeType,
filename=filename
)
)
self.logger.debug(f"Added image file: {filename} ({len(imageBytes)} bytes)")
except Exception as e:
self.logger.warning(f"Error creating image file {filename}: {str(e)}")
return resultDocuments
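# Example return shape (illustrative filenames): for a page with two embedded
# images, resultDocuments == [report.html, image_sec_1.png, image_sec_2.png],
# where the HTML references the images by relative file path, not data URI.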
async def _generateHtmlFromJson(self, jsonContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> str:
"""Generate HTML content from structured JSON document using AI-generated styling."""
try:
# Get style set: use styles from metadata if available, otherwise enhance with AI
styles = await self._getStyleSet(jsonContent, userPrompt, aiService)
# Validate JSON structure
if not self._validateJsonStructure(jsonContent):
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
# Extract sections and metadata from standardized schema
sections = self._extractSections(jsonContent)
metadata = self._extractMetadata(jsonContent)
# Use provided title (which comes from documents[].title) as primary source
# Fallback to metadata.title only if title parameter is empty
documentTitle = title if title else metadata.get("title", "Generated Document")
# Build HTML document
htmlParts = []
# HTML document structure
htmlParts.append('<!DOCTYPE html>')
htmlParts.append('<html lang="en">')
htmlParts.append('<head>')
htmlParts.append('<meta charset="UTF-8">')
htmlParts.append('<meta name="viewport" content="width=device-width, initial-scale=1.0">')
htmlParts.append(f'<title>{documentTitle}</title>')
htmlParts.append('<style>')
htmlParts.append(self._generateCssStyles(styles))
htmlParts.append('</style>')
htmlParts.append('</head>')
htmlParts.append('<body>')
# Document header
htmlParts.append(f'<header><h1 class="document-title">{documentTitle}</h1></header>')
# Main content
htmlParts.append('<main>')
# Process each section
for section in sections:
sectionHtml = self._renderJsonSection(section, styles)
if sectionHtml:
htmlParts.append(sectionHtml)
htmlParts.append('</main>')
# Footer
htmlParts.append('<footer>')
htmlParts.append(f'<p class="generated-info">Generated: {self._formatTimestamp()}</p>')
htmlParts.append('</footer>')
htmlParts.append('</body>')
htmlParts.append('</html>')
return '\n'.join(htmlParts)
except Exception as e:
self.logger.error(f"Error generating HTML from JSON: {str(e)}")
raise Exception(f"HTML generation failed: {str(e)}")
async def _getStyleSet(self, extractedContent: Dict[str, Any] = None, userPrompt: str = None, aiService=None, templateName: str = None) -> Dict[str, Any]:
"""Get style set - use styles from document generation metadata if available,
otherwise enhance default styles with AI if userPrompt provided.
IMPORTANT: In a dynamic, scalable AI system, styling should come from document generation,
not be generated separately by renderers. Only fall back to AI if styles not provided.
Args:
extractedContent: Document content with metadata (may contain styles)
userPrompt: User's prompt (AI will detect style instructions in any language)
aiService: AI service (used only if styles not in metadata and userPrompt provided)
templateName: Name of template style set (None = default)
Returns:
Dict with style definitions for all document styles
"""
# Get default style set
defaultStyleSet = self._getDefaultStyleSet()
# FIRST: Check if styles are provided in document generation metadata (preferred approach)
if extractedContent:
metadata = extractedContent.get("metadata", {})
if isinstance(metadata, dict):
styles = metadata.get("styles")
if styles and isinstance(styles, dict):
self.logger.debug("Using styles from document generation metadata")
return self._validateStylesContrast(styles)
# FALLBACK: Enhance with AI if userPrompt provided (only if styles not in metadata)
if userPrompt and aiService:
self.logger.info(f"Styles not in metadata, enhancing with AI based on user prompt...")
enhancedStyleSet = await self._enhanceStylesWithAI(userPrompt, defaultStyleSet, aiService)
return self._validateStylesContrast(enhancedStyleSet)
else:
# Use default styles only
return defaultStyleSet
async def _enhanceStylesWithAI(self, userPrompt: str, defaultStyleSet: Dict[str, Any], aiService) -> Dict[str, Any]:
"""Enhance default styles with AI based on user prompt."""
try:
style_template = self._createAiStyleTemplate("html", userPrompt, defaultStyleSet)
enhanced_styles = await self._getAiStyles(aiService, style_template, defaultStyleSet)
return enhanced_styles
except Exception as e:
self.logger.warning(f"AI style enhancement failed: {str(e)}, using default styles")
return defaultStyleSet
def _validateStylesContrast(self, styles: Dict[str, Any]) -> Dict[str, Any]:
"""Validate and fix contrast issues in AI-generated styles."""
try:
# Fix table header contrast
if "table_header" in styles:
header = styles["table_header"]
bgColor = header.get("background", "#FFFFFF")
textColor = header.get("color", "#000000")
# If both are white or both are dark, fix it
if bgColor.upper() == "#FFFFFF" and textColor.upper() == "#FFFFFF":
header["background"] = "#4F4F4F"
header["color"] = "#FFFFFF"
elif bgColor.upper() == "#000000" and textColor.upper() == "#000000":
header["background"] = "#4F4F4F"
header["color"] = "#FFFFFF"
# Fix table cell contrast
if "table_cell" in styles:
cell = styles["table_cell"]
bgColor = cell.get("background", "#FFFFFF")
textColor = cell.get("color", "#000000")
# If both are white or both are dark, fix it
if bgColor.upper() == "#FFFFFF" and textColor.upper() == "#FFFFFF":
cell["background"] = "#FFFFFF"
cell["color"] = "#2F2F2F"
elif bgColor.upper() == "#000000" and textColor.upper() == "#000000":
cell["background"] = "#FFFFFF"
cell["color"] = "#2F2F2F"
return styles
except Exception as e:
self.logger.warning(f"Style validation failed: {str(e)}")
return self._getDefaultStyleSet()
def _getDefaultStyleSet(self) -> Dict[str, Any]:
"""Default HTML style set - used when no style instructions present."""
return {
"title": {"font_size": "2.5em", "color": "#1F4E79", "font_weight": "bold", "text_align": "center", "margin": "0 0 1em 0"},
"heading1": {"font_size": "2em", "color": "#2F2F2F", "font_weight": "bold", "text_align": "left", "margin": "1.5em 0 0.5em 0"},
"heading2": {"font_size": "1.5em", "color": "#4F4F4F", "font_weight": "bold", "text_align": "left", "margin": "1em 0 0.5em 0"},
"paragraph": {"font_size": "1em", "color": "#2F2F2F", "font_weight": "normal", "text_align": "left", "margin": "0 0 1em 0", "line_height": "1.6"},
"table": {"border": "1px solid #ddd", "border_collapse": "collapse", "width": "100%", "margin": "1em 0"},
"table_header": {"background": "#4F4F4F", "color": "#FFFFFF", "font_weight": "bold", "text_align": "center", "padding": "12px"},
"table_cell": {"background": "#FFFFFF", "color": "#2F2F2F", "font_weight": "normal", "text_align": "left", "padding": "8px", "border": "1px solid #ddd"},
"bullet_list": {"font_size": "1em", "color": "#2F2F2F", "margin": "0 0 1em 0", "padding_left": "20px"},
"code_block": {"font_family": "Courier New, monospace", "font_size": "0.9em", "color": "#2F2F2F", "background": "#F5F5F5", "padding": "1em", "border": "1px solid #ddd", "border_radius": "4px", "margin": "1em 0"},
"image": {"max_width": "100%", "height": "auto", "margin": "1em 0", "border_radius": "4px"},
"body": {"font_family": "Arial, sans-serif", "background": "#FFFFFF", "color": "#2F2F2F", "margin": "0", "padding": "20px"}
}
def _generateCssStyles(self, styles: Dict[str, Any]) -> str:
"""Generate CSS from style definitions."""
css_parts = []
# Body styles
body_style = styles.get("body", {})
css_parts.append("body {")
for property_name, value in body_style.items():
css_property = property_name.replace("_", "-")
css_parts.append(f" {css_property}: {value};")
css_parts.append("}")
# Document title
title_style = styles.get("title", {})
css_parts.append(".document-title {")
for property_name, value in title_style.items():
css_property = property_name.replace("_", "-")
css_parts.append(f" {css_property}: {value};")
css_parts.append("}")
# Headings
for heading_level in ["heading1", "heading2"]:
heading_style = styles.get(heading_level, {})
css_class = f"h{heading_level[-1]}"
css_parts.append(f"{css_class} {{")
for property_name, value in heading_style.items():
css_property = property_name.replace("_", "-")
css_parts.append(f" {css_property}: {value};")
css_parts.append("}")
# Paragraphs
paragraph_style = styles.get("paragraph", {})
css_parts.append("p {")
for property_name, value in paragraph_style.items():
css_property = property_name.replace("_", "-")
css_parts.append(f" {css_property}: {value};")
css_parts.append("}")
# Tables
table_style = styles.get("table", {})
css_parts.append("table {")
for property_name, value in table_style.items():
css_property = property_name.replace("_", "-")
css_parts.append(f" {css_property}: {value};")
css_parts.append("}")
# Table headers
table_header_style = styles.get("table_header", {})
css_parts.append("th {")
for property_name, value in table_header_style.items():
css_property = property_name.replace("_", "-")
css_parts.append(f" {css_property}: {value};")
css_parts.append("}")
# Table cells
table_cell_style = styles.get("table_cell", {})
css_parts.append("td {")
for property_name, value in table_cell_style.items():
css_property = property_name.replace("_", "-")
css_parts.append(f" {css_property}: {value};")
css_parts.append("}")
# Lists
bullet_list_style = styles.get("bullet_list", {})
css_parts.append("ul {")
for property_name, value in bullet_list_style.items():
css_property = property_name.replace("_", "-")
css_parts.append(f" {css_property}: {value};")
css_parts.append("}")
# Code blocks
code_block_style = styles.get("code_block", {})
css_parts.append("pre {")
for property_name, value in code_block_style.items():
css_property = property_name.replace("_", "-")
css_parts.append(f" {css_property}: {value};")
css_parts.append("}")
# Images
image_style = styles.get("image", {})
css_parts.append("img {")
for property_name, value in image_style.items():
css_property = property_name.replace("_", "-")
css_parts.append(f" {css_property}: {value};")
css_parts.append("}")
# Generated info
css_parts.append(".generated-info {")
css_parts.append(" font-size: 0.9em;")
css_parts.append(" color: #666;")
css_parts.append(" text-align: center;")
css_parts.append(" margin-top: 2em;")
css_parts.append(" padding-top: 1em;")
css_parts.append(" border-top: 1px solid #ddd;")
css_parts.append("}")
return '\n'.join(css_parts)
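# Example (illustrative): a style entry
#     {"paragraph": {"font_size": "1em", "line_height": "1.6"}}
# is emitted as
#     p {
#      font-size: 1em;
#      line-height: 1.6;
#     }
# i.e. snake_case style keys become kebab-case CSS properties via replace("_", "-").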
def _renderJsonSection(self, section: Dict[str, Any], styles: Dict[str, Any]) -> str:
"""Render a single JSON section to HTML using AI-generated styles.
Supports three content formats: reference, object (base64), extracted_text.
IMPORTANT: Respects sectionType (content_type) to select the correct rendering logic.
"""
try:
sectionType = self._getSectionType(section)
sectionData = self._getSectionData(section)
# IMPORTANT: Respect sectionType (content_type) FIRST, then process elements accordingly
# Process elements according to section's content_type, not just element types
if sectionType == "table":
# Render the first element directly (only the first element is used)
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonTable(element, styles)
return ""
elif sectionType == "bullet_list":
# Render the first element directly (only the first element is used)
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonBulletList(element, styles)
return ""
elif sectionType == "heading":
# Render the first element directly (only the first element is used)
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonHeading(element, styles)
return ""
elif sectionType == "paragraph":
# Process paragraph elements, including extracted_text
if isinstance(sectionData, list):
htmlParts = []
for element in sectionData:
element_type = element.get("type", "") if isinstance(element, dict) else ""
if element_type == "reference":
doc_ref = element.get("documentReference", "")
label = element.get("label", "Reference")
htmlParts.append(f'<p class="reference"><em>[Reference: {label}]</em></p>')
elif element_type == "extracted_text":
content = element.get("content", "")
source = element.get("source", "")
if content:
source_text = f' <small><em>(Source: {source})</em></small>' if source else ''
htmlParts.append(f'<p>{content}{source_text}</p>')
elif isinstance(element, dict):
# Regular paragraph element - extract from nested content structure (standard JSON format)
content = element.get("content", {})
if isinstance(content, dict):
text = content.get("text", "")
elif isinstance(content, str):
text = content
else:
text = ""
if text:
htmlParts.append(f'<p>{text}</p>')
elif isinstance(element, str):
htmlParts.append(f'<p>{element}</p>')
if htmlParts:
return '\n'.join(htmlParts)
# If sectionData is not a list, treat it as a dict
if isinstance(sectionData, dict):
return self._renderJsonParagraph(sectionData, styles)
return ""
elif sectionType == "code_block":
# Render the first element directly (only the first element is used)
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonCodeBlock(element, styles)
return ""
elif sectionType == "image":
# Render the first element directly (only the first element is used)
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonImage(element, styles)
return ""
else:
# Fallback: Check for special element types first
if isinstance(sectionData, list):
htmlParts = []
for element in sectionData:
element_type = element.get("type", "") if isinstance(element, dict) else ""
if element_type == "reference":
doc_ref = element.get("documentReference", "")
label = element.get("label", "Reference")
htmlParts.append(f'<p class="reference"><em>[Reference: {label}]</em></p>')
elif element_type == "extracted_text":
content = element.get("content", "")
source = element.get("source", "")
if content:
source_text = f' <small><em>(Source: {source})</em></small>' if source else ''
htmlParts.append(f'<p>{content}{source_text}</p>')
if htmlParts:
return '\n'.join(htmlParts)
# Fallback to paragraph for unknown types
if isinstance(sectionData, dict):
return self._renderJsonParagraph(sectionData, styles)
return ""
except Exception as e:
self.logger.warning(f"Error rendering section {self._getSectionId(section)}: {str(e)}")
return f'<div class="error">[Error rendering section: {str(e)}]</div>'
def _renderJsonTable(self, tableData: Dict[str, Any], styles: Dict[str, Any]) -> str:
"""Render a JSON table to HTML using AI-generated styles."""
try:
# Extract from nested content structure: element.content.{headers, rows}
content = tableData.get("content", {})
if not isinstance(content, dict):
return ""
headers = content.get("headers", [])
rows = content.get("rows", [])
if not headers or not rows:
return ""
htmlParts = ['<table>']
# Table header
htmlParts.append('<thead><tr>')
for header in headers:
htmlParts.append(f'<th>{header}</th>')
htmlParts.append('</tr></thead>')
# Table body
htmlParts.append('<tbody>')
for row in rows:
htmlParts.append('<tr>')
for cellData in row:
htmlParts.append(f'<td>{cellData}</td>')
htmlParts.append('</tr>')
htmlParts.append('</tbody>')
htmlParts.append('</table>')
return '\n'.join(htmlParts)
except Exception as e:
self.logger.warning(f"Error rendering table: {str(e)}")
return ""
def _renderJsonBulletList(self, listData: Dict[str, Any], styles: Dict[str, Any]) -> str:
"""Render a JSON bullet list to HTML using AI-generated styles."""
try:
# Extract from nested content structure: element.content.{items}
content = listData.get("content", {})
if not isinstance(content, dict):
return ""
items = content.get("items", [])
if not items:
return ""
htmlParts = ['<ul>']
for item in items:
if isinstance(item, str):
htmlParts.append(f'<li>{item}</li>')
elif isinstance(item, dict) and "text" in item:
htmlParts.append(f'<li>{item["text"]}</li>')
htmlParts.append('</ul>')
return '\n'.join(htmlParts)
except Exception as e:
self.logger.warning(f"Error rendering bullet list: {str(e)}")
return ""
def _renderJsonHeading(self, headingData: Dict[str, Any], styles: Dict[str, Any]) -> str:
"""Render a JSON heading to HTML using AI-generated styles."""
try:
# Extract from nested content structure: element.content.{text, level}
content = headingData.get("content", {})
if not isinstance(content, dict):
return ""
text = content.get("text", "")
level = content.get("level", 1)
if text:
level = max(1, min(6, level))
return f'<h{level}>{text}</h{level}>'
return ""
except Exception as e:
self.logger.warning(f"Error rendering heading: {str(e)}")
return ""
def _renderJsonParagraph(self, paragraphData: Dict[str, Any], styles: Dict[str, Any]) -> str:
"""Render a JSON paragraph to HTML using AI-generated styles."""
try:
# Normalize inputs - paragraphData is typically a list of elements from _getSectionData
if isinstance(paragraphData, list):
# Extract text from all paragraph elements (expects nested content structure)
texts = []
for el in paragraphData:
if isinstance(el, dict):
content = el.get("content", {})
if isinstance(content, dict):
text = content.get("text", "")
elif isinstance(content, str):
text = content
else:
text = ""
if text:
texts.append(text)
elif isinstance(el, str):
texts.append(el)
if texts:
# Join multiple paragraphs with <p> tags
return '\n'.join(f'<p>{text}</p>' for text in texts)
return ""
elif isinstance(paragraphData, str):
return f'<p>{paragraphData}</p>'
elif isinstance(paragraphData, dict):
# Handle nested content structure: element.content vs element.text
# Extract from nested content structure
content = paragraphData.get("content", {})
if isinstance(content, dict):
text = content.get("text", "")
elif isinstance(content, str):
text = content
else:
text = ""
if text:
return f'<p>{text}</p>'
return ""
else:
return ""
except Exception as e:
self.logger.warning(f"Error rendering paragraph: {str(e)}")
return ""
def _renderJsonCodeBlock(self, codeData: Dict[str, Any], styles: Dict[str, Any]) -> str:
"""Render a JSON code block to HTML using AI-generated styles."""
try:
# Extract from nested content structure: element.content.{code, language}
content = codeData.get("content", {})
if not isinstance(content, dict):
return ""
code = content.get("code", "")
language = content.get("language", "")
if code:
if language:
return f'<pre><code class="language-{language}">{code}</code></pre>'
else:
return f'<pre><code>{code}</code></pre>'
return ""
except Exception as e:
self.logger.warning(f"Error rendering code block: {str(e)}")
return ""
def _renderJsonImage(self, imageData: Dict[str, Any], styles: Dict[str, Any]) -> str:
"""Render a JSON image to HTML with placeholder for later replacement. Expects nested content structure."""
try:
import html
# Extract from nested content structure (standard JSON format)
content = imageData.get("content", {})
if not isinstance(content, dict):
return ""
base64Data = content.get("base64Data", "")
altText = content.get("altText", "Image")
caption = content.get("caption", "")
# Escape HTML in altText and caption to prevent injection
altTextEscaped = html.escape(str(altText))
captionEscaped = html.escape(str(caption)) if caption else ""
if base64Data:
# Use data URI as placeholder - will be replaced with file path in _replaceImageDataUris
# Include a marker so we can find and replace it
imageMarker = f"<!--IMAGE_MARKER:{len(base64Data)}:{altTextEscaped[:50]}-->"
# Add max-width and max-height to ensure image fits within page dimensions
# Typical page width is ~800-1200px, height varies but we limit to 600px for readability
imgTag = f'<img src="data:image/png;base64,{base64Data}" alt="{altTextEscaped}" style="max-width: 100%; max-height: 600px; width: auto; height: auto;">'
if captionEscaped:
return f'{imageMarker}<figure>{imgTag}<figcaption>{captionEscaped}</figcaption></figure>'
else:
return f'{imageMarker}{imgTag}'
return ""
except Exception as e:
self.logger.error(f"Error embedding image in HTML: {str(e)}")
import html  # re-import: the import above is local to the try block
content = imageData.get("content", {}) if isinstance(imageData, dict) else {}
altText = content.get("altText", "Image") if isinstance(content, dict) else "Image"
errorMsg = html.escape(f"[Error: Could not embed image '{altText}'. {str(e)}]")
return f'<div class="error" style="color: red; padding: 10px; border: 1px solid red;">{errorMsg}</div>'
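# Illustrative example (values hypothetical): for a 12345-character base64
# payload with alt text "Chart", the method emits
#   <!--IMAGE_MARKER:12345:Chart--><img src="data:image/png;base64,..." alt="Chart" style="...">
# The marker lets _replaceImageDataUris locate the inline data URI and swap
# it for a file path after the image has been written to disk.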
def _extractImages(self, jsonContent: Dict[str, Any]) -> List[Dict[str, Any]]:
"""
Extract all images from JSON structure.
Returns:
List of image data dictionaries with base64Data, altText, caption, sectionId
"""
images = []
try:
# Extract from standardized schema: {metadata: {...}, documents: [{sections: [...]}]}
documents = jsonContent.get("documents", [])
if not documents or not isinstance(documents, list):
return images
for doc in documents:
if not isinstance(doc, dict):
continue
sections = doc.get("sections", [])
for section in sections:
if section.get("content_type") == "image":
elements = section.get("elements", [])
for element in elements:
# Extract from nested content structure
content = element.get("content", {})
base64Data = ""
if isinstance(content, dict):
base64Data = content.get("base64Data", "")
elif isinstance(content, str):
# Content might be base64 string directly (shouldn't happen)
pass
# If base64Data not found in content, try direct element fields (fallback)
if not base64Data:
base64Data = element.get("base64Data", "")
# If base64Data still not found, try extracting from url data URI
if not base64Data:
url = element.get("url", "") or (content.get("url", "") if isinstance(content, dict) else "")
if url and isinstance(url, str) and url.startswith("data:image/"):
# Extract base64 from data URI: data:image/png;base64,<base64>
import re
match = re.match(r'data:image/[^;]+;base64,(.+)', url)
if match:
base64Data = match.group(1)
if base64Data:
sectionId = section.get("id", "unknown")
# Determine MIME type and file extension
mimeType = element.get("mimeType", "") or (content.get("mimeType", "") if isinstance(content, dict) else "")
if not mimeType or mimeType == "unknown":
# Try to detect the MIME type from the base64 signature
if base64Data.startswith("/9j/"):
mimeType = "image/jpeg"
elif base64Data.startswith("iVBORw0KGgo"):
mimeType = "image/png"
else:
mimeType = "image/png" # Default
# Determine file extension based on MIME type
extension = "png"
if mimeType == "image/jpeg" or mimeType == "image/jpg":
extension = "jpg"
elif mimeType == "image/png":
extension = "png"
elif mimeType == "image/gif":
extension = "gif"
elif mimeType == "image/webp":
extension = "webp"
# Generate filename from section ID
filename = f"{sectionId}.{extension}"
# Clean filename (remove invalid characters)
filename = "".join(c if c.isalnum() or c in "._-" else "_" for c in filename)
images.append({
"base64Data": base64Data,
"altText": element.get("altText", "Image"),
"caption": element.get("caption"),
"sectionId": sectionId,
"filename": filename,
"mimeType": mimeType
})
self.logger.debug(f"Extracted image from section {sectionId}: {filename}")
self.logger.info(f"Extracted {len(images)} image(s) from JSON structure")
return images
except Exception as e:
self.logger.warning(f"Error extracting images: {str(e)}")
return []
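# Illustrative sketch of the input shape _extractImages walks (all values
# hypothetical). A payload such as
#   {"metadata": {}, "documents": [{"sections": [
#       {"id": "img_1", "content_type": "image", "elements": [
#           {"content": {"base64Data": "iVBORw0KGgo...", "altText": "Logo"}}]}]}]}
# yields one image dict with mimeType "image/png" (detected from the base64
# signature) and filename "img_1.png" (derived from the section id).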
def _replaceImageDataUris(self, htmlContent: str, images: List[Dict[str, Any]]) -> str:
"""
Replace base64 data URIs in HTML with relative file paths.
Args:
htmlContent: HTML content with data URIs
images: List of image data dictionaries
Returns:
HTML content with relative file paths
"""
try:
import base64
import re
# Find entire img tags with data URIs and replace them
# Pattern: <img src="data:image/[type];base64,<base64>" [other attributes]>
imgTagPattern = r'<img\s+src="data:image/[^"]+"[^>]*>'
def replaceImgTag(match):
imgTag = match.group(0)
# Extract base64 data from the img tag
base64Match = re.search(r'data:image/[^;]+;base64,([A-Za-z0-9+/=]+)', imgTag)
if not base64Match:
return imgTag # Return original if no base64 found
base64Data = base64Match.group(1)
# Find matching image in images list
matchingImage = None
for img in images:
imgBase64 = img.get("base64Data", "")
# Compare base64 data (lengths may differ due to padding)
if imgBase64 == base64Data or imgBase64.startswith(base64Data[:100]) or base64Data.startswith(imgBase64[:100]):
matchingImage = img
break
if matchingImage:
import html
# Use filename from image data (generated from section ID)
filename = matchingImage.get("filename", f"image_{images.index(matchingImage) + 1}.png")
# Extract existing alt text or use from matchingImage
altMatch = re.search(r'alt="([^"]*)"', imgTag)
existingAlt = altMatch.group(1) if altMatch else ""
altText = html.escape(str(matchingImage.get("altText", existingAlt or "Image")))
caption = html.escape(str(matchingImage.get("caption", ""))) if matchingImage.get("caption") else ""
# Create new img tag with filename
imgTag = f'<img src="{filename}" alt="{altText}">'
if caption:
return f'<figure>{imgTag}<figcaption>{caption}</figcaption></figure>'
else:
return imgTag
else:
# Keep original if no match found
return match.group(0)
# Replace all img tags that carry data URIs (leftover IMAGE_MARKER comments are removed below)
updatedHtml = re.sub(imgTagPattern, replaceImgTag, htmlContent)
# Remove any IMAGE_MARKER comments that are left over
updatedHtml = re.sub(r'<!--IMAGE_MARKER:[^>]+-->', '', updatedHtml)
return updatedHtml
except Exception as e:
self.logger.warning(f"Error replacing image data URIs: {str(e)}")
return htmlContent # Return original if replacement fails
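# Illustrative before/after (filenames hypothetical): an embedded tag like
#   <img src="data:image/png;base64,iVBORw0KGgo..." alt="Logo" style="max-width: 100%;">
# is rewritten, once the image bytes are saved as img_1.png, to
#   <img src="img_1.png" alt="Logo">
# and any leftover <!--IMAGE_MARKER:...--> comments are stripped in the same pass.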
def getRenderedImages(self) -> List[Dict[str, Any]]:
"""
Get images that were extracted during rendering.
Returns list of image dicts with base64Data, altText, caption, and filename.
"""
if not hasattr(self, '_renderedImages'):
return []
return self._renderedImages

View file

@@ -0,0 +1,355 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Image renderer for report generation using AI image generation.
"""
from .documentRendererBaseTemplate import BaseRenderer
from modules.datamodels.datamodelDocument import RenderedDocument
from typing import Dict, Any, List, Optional
import logging
import base64
logger = logging.getLogger(__name__)
class RendererImage(BaseRenderer):
"""Renders content to image format using AI image generation."""
@classmethod
def getSupportedFormats(cls) -> List[str]:
"""Return supported image formats."""
return ['png', 'jpg', 'jpeg', 'image']
@classmethod
def getFormatAliases(cls) -> List[str]:
"""Return format aliases."""
return ['img', 'picture', 'photo', 'graphic']
@classmethod
def getPriority(cls) -> int:
"""Return priority for image renderer."""
return 90
@classmethod
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
"""Return output style classification: Images are visual media."""
return 'image'
@classmethod
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
"""
Return list of section content types that Image renderer accepts.
Image renderer only accepts image sections (images are generated from image sections).
"""
return ["image"]
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
"""Render extracted JSON content to image format using AI image generation."""
try:
# Generate AI image from content
imageContent = await self._generateAiImage(extractedContent, title, userPrompt, aiService)
# Determine filename from document or title
documents = extractedContent.get("documents", [])
if documents and isinstance(documents[0], dict):
filename = documents[0].get("filename")
if not filename:
filename = self._determineFilename(title, "image/png")
else:
filename = self._determineFilename(title, "image/png")
# Convert image content to bytes (base64 string or bytes)
if isinstance(imageContent, str):
try:
imageBytes = base64.b64decode(imageContent)
except Exception:
imageBytes = imageContent.encode('utf-8')
else:
imageBytes = imageContent
# Extract metadata for document type and other info
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
return [
RenderedDocument(
documentData=imageBytes,
mimeType="image/png",
filename=filename,
documentType=documentType,
metadata=metadata if isinstance(metadata, dict) else None
)
]
except Exception as e:
self.logger.error(f"Error rendering image: {str(e)}")
# Re-raise the exception instead of using fallback
raise Exception(f"Image rendering failed: {str(e)}")
async def _generateAiImage(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> str:
"""Generate AI image from extracted content."""
try:
if not aiService:
raise ValueError("AI service is required for image generation")
# Validate JSON structure (standardized schema: {metadata: {...}, documents: [{sections: [...]}]})
if not self._validateJsonStructure(extractedContent):
raise ValueError("Extracted content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
# Extract metadata from standardized schema
metadata = self._extractMetadata(extractedContent)
# Use provided title (which comes from documents[].title) as primary source
# Fallback to metadata.title only if title parameter is empty
documentTitle = title if title else metadata.get("title", "Generated Document")
# Create AI prompt for image generation
imagePrompt = await self._createImageGeneratePrompt(extractedContent, documentTitle, userPrompt, aiService)
# Save image generation prompt to debug
aiService.services.utils.writeDebugFile(imagePrompt, "image_generation_prompt")
# Format prompt as JSON with image generation parameters
from modules.datamodels.datamodelAi import AiCallPromptImage, AiCallOptions, OperationTypeEnum
import json
promptModel = AiCallPromptImage(
prompt=imagePrompt,
size="1024x1024",
quality="standard",
style="vivid"
)
promptJson = promptModel.model_dump_json(exclude_none=True, indent=2)
# Use unified callAiContent method
options = AiCallOptions(
operationType=OperationTypeEnum.IMAGE_GENERATE,
resultFormat="base64"
)
# Use unified callAiContent method
imageResponse = await aiService.callAiContent(
prompt=promptJson,
options=options,
outputFormat="base64"
)
# Save image generation response to debug
aiService.services.utils.writeDebugFile(str(imageResponse.content), "image_generation_response")
# Extract base64 image data from AiResponse
# AiResponse.documents contains DocumentData objects
if imageResponse.documents and len(imageResponse.documents) > 0:
imageData = imageResponse.documents[0].documentData
if imageData:
return imageData
# Fallback: check content field (might be base64 string)
if imageResponse.content:
return imageResponse.content
raise ValueError("No image data returned from AI")
except Exception as e:
self.logger.error(f"Error generating AI image: {str(e)}")
raise Exception(f"AI image generation failed: {str(e)}")
async def _createImageGeneratePrompt(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> str:
"""Create a detailed prompt for AI image generation based on the content."""
try:
# Start with base prompt
promptParts = []
# Add user's original intent if available
if userPrompt:
sanitized_prompt = aiService.services.utils.sanitizePromptContent(userPrompt, 'userinput') if aiService else userPrompt
promptParts.append(f"User Request: {sanitized_prompt}")
# Add document title
promptParts.append(f"Document Title: {title}")
# Analyze content and create visual description
sections = self._extractSections(extractedContent)
contentDescription = self._analyzeContentForVisualDescription(sections)
if contentDescription:
promptParts.append(f"Content to Visualize: {contentDescription}")
# Add style guidance
styleGuidance = self._getStyleGuidanceFromContent(extractedContent, userPrompt)
if styleGuidance:
promptParts.append(f"Visual Style: {styleGuidance}")
# Combine all parts
fullPrompt = "Create a professional, informative image that visualizes the following content:\n\n" + "\n\n".join(promptParts)
# Add technical requirements
fullPrompt += "\n\nTechnical Requirements:"
fullPrompt += "\n- High quality, professional appearance"
fullPrompt += "\n- Clear, readable text if any text is included"
fullPrompt += "\n- Appropriate colors and layout"
fullPrompt += "\n- Suitable for business/professional use"
# Truncate prompt if it exceeds DALL-E's 4000 character limit
if len(fullPrompt) > 4000:
# Use AI to compress the prompt intelligently
compressedPrompt = await self._compressPromptWithAi(fullPrompt, aiService)
if compressedPrompt and len(compressedPrompt) <= 4000:
return compressedPrompt
# Fallback to minimal prompt if AI compression fails or is still too long
minimalPrompt = f"Create a professional image representing: {title}"
if userPrompt:
sanitized_prompt = aiService.services.utils.sanitizePromptContent(userPrompt, 'userinput') if aiService else userPrompt
minimalPrompt += f" - {sanitized_prompt}"
# If even the minimal prompt is too long, truncate it
if len(minimalPrompt) > 4000:
minimalPrompt = minimalPrompt[:3997] + "..."
return minimalPrompt
return fullPrompt
except Exception as e:
self.logger.warning(f"Error creating image prompt: {str(e)}")
# Fallback to simple prompt
return f"Create a professional image representing: {title}"
async def _compressPromptWithAi(self, longPrompt: str, aiService=None) -> str:
"""Use AI to intelligently compress a long prompt while preserving key information."""
try:
if not aiService:
return None
compressionPrompt = f"""
You are an expert at creating concise, effective prompts for AI image generation.
The following prompt is too long for DALL-E (4000 character limit) and needs to be compressed to under 4000 characters while preserving the most important visual information.
Original prompt ({len(longPrompt)} characters):
{longPrompt}
Please create a compressed version that:
1. Keeps the most important visual elements and requirements
2. Maintains the core intent and style guidance
3. Preserves technical requirements
4. Stays under 4000 characters
5. Is optimized for DALL-E image generation
Return only the compressed prompt, no explanations.
"""
# Use AI to compress the prompt - call the AI service correctly
# The ai_service has an aiObjects attribute that contains the actual AI interface
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
request = AiCallRequest(
prompt=compressionPrompt,
options=AiCallOptions(
operationType=OperationTypeEnum.DATA_GENERATE,
maxTokens=None, # Let the model use its full context length
temperature=0.3 # Lower temperature for more consistent compression
)
)
response = await aiService.callAi(request)
compressed = response.content.strip()
# Validate the compressed prompt
if compressed and len(compressed) <= 4000 and len(compressed) > 50:
self.logger.info(f"Successfully compressed prompt from {len(longPrompt)} to {len(compressed)} characters")
return compressed
else:
self.logger.warning(f"AI compression failed or produced invalid result: {len(compressed) if compressed else 0} chars")
return None
except Exception as e:
self.logger.warning(f"Error compressing prompt with AI: {str(e)}")
return None
def _analyzeContentForVisualDescription(self, sections: List[Dict[str, Any]]) -> str:
"""Analyze content sections and create a visual description for AI."""
try:
descriptions = []
for section in sections:
sectionType = self._getSectionType(section)
sectionData = self._getSectionData(section)
if sectionType == "table":
headers = sectionData.get("headers", [])
rows = sectionData.get("rows", [])
if headers and rows:
descriptions.append(f"Data table with {len(headers)} columns and {len(rows)} rows: {', '.join(headers)}")
elif sectionType == "bullet_list":
items = sectionData.get("items", [])
if items:
descriptions.append(f"List with {len(items)} items")
elif sectionType == "heading":
text = sectionData.get("text", "")
level = sectionData.get("level", 1)
if text:
descriptions.append(f"Heading {level}: {text}")
elif sectionType == "paragraph":
text = sectionData.get("text", "")
if text and len(text) > 10: # Only include substantial paragraphs
# Truncate long text
truncated = text[:100] + "..." if len(text) > 100 else text
descriptions.append(f"Text content: {truncated}")
elif sectionType == "code_block":
code = sectionData.get("code", "")
language = sectionData.get("language", "")
if code:
descriptions.append(f"Code block ({language}): {code[:50]}...")
return "; ".join(descriptions) if descriptions else "General document content"
except Exception as e:
self.logger.warning(f"Error analyzing content: {str(e)}")
return "Document content"
def _getStyleGuidanceFromContent(self, extractedContent: Dict[str, Any], userPrompt: str = None) -> str:
"""Determine visual style guidance based on content and user prompt."""
try:
styleElements = []
# Analyze user prompt for style hints
if userPrompt:
promptLower = userPrompt.lower()
if any(word in promptLower for word in ["modern", "contemporary", "sleek"]):
styleElements.append("modern, clean design")
elif any(word in promptLower for word in ["classic", "traditional", "formal"]):
styleElements.append("classic, formal design")
elif any(word in promptLower for word in ["creative", "artistic", "colorful"]):
styleElements.append("creative, artistic design")
elif any(word in promptLower for word in ["corporate", "business", "professional"]):
styleElements.append("corporate, professional design")
# Analyze content type for additional style hints
sections = self._extractSections(extractedContent)
hasTables = any(self._getSectionType(s) == "table" for s in sections)
hasLists = any(self._getSectionType(s) == "bullet_list" for s in sections)
hasCode = any(self._getSectionType(s) == "code_block" for s in sections)
if hasTables:
styleElements.append("data-focused layout")
if hasLists:
styleElements.append("organized, structured presentation")
if hasCode:
styleElements.append("technical, developer-friendly")
# Default style if no specific guidance
if not styleElements:
styleElements.append("professional, clean design")
return ", ".join(styleElements)
except Exception as e:
self.logger.warning(f"Error determining style guidance: {str(e)}")
return "professional design"

View file

@@ -0,0 +1,129 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
JSON renderer for report generation.
"""
from .documentRendererBaseTemplate import BaseRenderer
from modules.datamodels.datamodelDocument import RenderedDocument
from typing import Dict, Any, List, Optional
import json
class RendererJson(BaseRenderer):
"""Renders content to JSON format with format-specific extraction."""
@classmethod
def getSupportedFormats(cls) -> List[str]:
"""Return supported JSON formats."""
return ['json']
@classmethod
def getFormatAliases(cls) -> List[str]:
"""Return format aliases."""
return ['data']
@classmethod
def getPriority(cls) -> int:
"""Return priority for JSON renderer."""
return 80
@classmethod
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
"""Return output style classification: JSON document renderer converts structured document content to JSON."""
return 'document'
@classmethod
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
"""
Return list of section content types that JSON renderer accepts.
JSON renderer accepts all section types except images (images cannot be serialized to JSON).
"""
from modules.datamodels.datamodelJson import supportedSectionTypes
# Return all types except image
return [st for st in supportedSectionTypes if st != "image"]
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
"""Render extracted JSON content to JSON format."""
try:
# The extracted content should already be JSON from the AI
# Just validate and format it
jsonContent = self._cleanJsonContent(extractedContent, title)
# Determine filename from document or title
documents = extractedContent.get("documents", [])
if documents and isinstance(documents[0], dict):
filename = documents[0].get("filename")
if not filename:
filename = self._determineFilename(title, "application/json")
else:
filename = self._determineFilename(title, "application/json")
# Extract metadata for document type and other info
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
return [
RenderedDocument(
documentData=jsonContent.encode('utf-8'),
mimeType="application/json",
filename=filename,
documentType=documentType,
metadata=metadata if isinstance(metadata, dict) else None
)
]
except Exception as e:
self.logger.error(f"Error rendering JSON: {str(e)}")
# Return minimal JSON fallback
fallbackData = {
"title": title,
"sections": [{"content_type": "paragraph", "elements": [{"text": f"Error rendering report: {str(e)}"}]}],
"metadata": {"error": str(e)}
}
fallbackContent = json.dumps(fallbackData, indent=2)
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
return [
RenderedDocument(
documentData=fallbackContent.encode('utf-8'),
mimeType="application/json",
filename=self._determineFilename(title, "application/json"),
documentType=documentType,
metadata=metadata if isinstance(metadata, dict) else None
)
]
def _cleanJsonContent(self, content: Dict[str, Any], title: str) -> str:
"""Clean and validate JSON content from AI."""
try:
# Validate JSON structure
if not isinstance(content, dict):
raise ValueError("Content must be a dictionary")
# Ensure it has the expected structure
if "sections" not in content:
# Convert old format to new format
content = {
"sections": [{"content_type": "paragraph", "elements": [{"text": str(content)}]}],
"metadata": {"title": title}
}
# Ensure metadata exists
if "metadata" not in content:
content["metadata"] = {}
# Set title in metadata if not present
if "title" not in content["metadata"]:
content["metadata"]["title"] = title
# Re-format with proper indentation
return json.dumps(content, indent=2, ensure_ascii=False)
except Exception as e:
self.logger.warning(f"Error cleaning JSON content: {str(e)}")
# Return minimal valid JSON
fallbackData = {
"sections": [{"content_type": "paragraph", "elements": [{"text": str(content)}]}],
"metadata": {"title": title, "error": str(e)}
}
return json.dumps(fallbackData, indent=2, ensure_ascii=False)
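# Illustrative example (input hypothetical): a legacy payload without a
# "sections" key, e.g. {"summary": "Q1 numbers"}, is wrapped as
#   {"sections": [{"content_type": "paragraph",
#                  "elements": [{"text": "{'summary': 'Q1 numbers'}"}]}],
#    "metadata": {"title": "<title>"}}
# before being serialized with json.dumps(indent=2, ensure_ascii=False).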

View file

@@ -0,0 +1,349 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Markdown renderer for report generation.
"""
from .documentRendererBaseTemplate import BaseRenderer
from modules.datamodels.datamodelDocument import RenderedDocument
from typing import Dict, Any, List, Optional
class RendererMarkdown(BaseRenderer):
"""Renders content to Markdown format with format-specific extraction."""
@classmethod
def getSupportedFormats(cls) -> List[str]:
"""Return supported Markdown formats."""
return ['md', 'markdown']
@classmethod
def getFormatAliases(cls) -> List[str]:
"""Return format aliases."""
return ['mdown', 'mkd']
@classmethod
def getPriority(cls) -> int:
"""Return priority for markdown renderer."""
return 95
@classmethod
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
"""Return output style classification: Markdown documents are formatted documents."""
return 'document'
@classmethod
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
"""
Return list of section content types that Markdown renderer accepts.
Markdown renderer accepts all section types except images.
"""
from modules.datamodels.datamodelJson import supportedSectionTypes
return [st for st in supportedSectionTypes if st != "image"]
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
"""Render extracted JSON content to Markdown format."""
try:
# Generate markdown from JSON structure
markdownContent = self._generateMarkdownFromJson(extractedContent, title)
# Determine filename from document or title
documents = extractedContent.get("documents", [])
if documents and isinstance(documents[0], dict):
filename = documents[0].get("filename")
if not filename:
filename = self._determineFilename(title, "text/markdown")
else:
filename = self._determineFilename(title, "text/markdown")
# Extract metadata for document type and other info
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
return [
RenderedDocument(
documentData=markdownContent.encode('utf-8'),
mimeType="text/markdown",
filename=filename,
documentType=documentType,
metadata=metadata if isinstance(metadata, dict) else None
)
]
except Exception as e:
self.logger.error(f"Error rendering markdown: {str(e)}")
# Return minimal markdown fallback
fallbackContent = f"# {title}\n\nError rendering report: {str(e)}"
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
return [
RenderedDocument(
documentData=fallbackContent.encode('utf-8'),
mimeType="text/markdown",
filename=self._determineFilename(title, "text/markdown"),
documentType=documentType,
metadata=metadata if isinstance(metadata, dict) else None
)
]
def _generateMarkdownFromJson(self, jsonContent: Dict[str, Any], title: str) -> str:
"""Generate markdown content from structured JSON document."""
try:
# Validate JSON structure (standardized schema: {metadata: {...}, documents: [{sections: [...]}]})
if not self._validateJsonStructure(jsonContent):
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
# Extract sections and metadata from standardized schema
sections = self._extractSections(jsonContent)
metadata = self._extractMetadata(jsonContent)
# Use provided title (which comes from documents[].title) as primary source
# Fallback to metadata.title only if title parameter is empty
documentTitle = title if title else metadata.get("title", "Generated Document")
# Build markdown content
markdownParts = []
# Document title
markdownParts.append(f"# {documentTitle}")
markdownParts.append("")
# Process each section
for section in sections:
sectionMarkdown = self._renderJsonSection(section)
if sectionMarkdown:
markdownParts.append(sectionMarkdown)
markdownParts.append("") # Add spacing between sections
# Add generation info
markdownParts.append("---")
markdownParts.append(f"*Generated: {self._formatTimestamp()}*")
return '\n'.join(markdownParts)
except Exception as e:
self.logger.error(f"Error generating markdown from JSON: {str(e)}")
raise Exception(f"Markdown generation failed: {str(e)}")
def _renderJsonSection(self, section: Dict[str, Any]) -> str:
"""Render a single JSON section to markdown.
Supports three content formats: reference, object (base64), extracted_text.
"""
try:
sectionType = self._getSectionType(section)
sectionData = self._getSectionData(section)
# Check for three content formats from Phase 5D in elements
if isinstance(sectionData, list):
markdownParts = []
for element in sectionData:
element_type = element.get("type", "") if isinstance(element, dict) else ""
# Support three content formats from Phase 5D
if element_type == "reference":
# Document reference format
doc_ref = element.get("documentReference", "")
label = element.get("label", "Reference")
markdownParts.append(f"*[Reference: {label}]*")
continue
elif element_type == "extracted_text":
# Extracted text format
content = element.get("content", "")
source = element.get("source", "")
if content:
source_text = f" *(Source: {source})*" if source else ""
markdownParts.append(f"{content}{source_text}")
continue
# If we processed reference/extracted_text elements, return them
if markdownParts:
return '\n\n'.join(markdownParts)
if sectionType == "table":
# Work directly with elements like other renderers
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonTable(element)
return ""
elif sectionType == "bullet_list":
# Work directly with elements like other renderers
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonBulletList(element)
return ""
elif sectionType == "heading":
# Work directly with elements like other renderers
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonHeading(element)
return ""
elif sectionType == "paragraph":
# Work directly with elements like other renderers
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonParagraph(element)
elif isinstance(sectionData, dict):
return self._renderJsonParagraph(sectionData)
return ""
elif sectionType == "code_block":
# Work directly with elements like other renderers
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonCodeBlock(element)
return ""
elif sectionType == "image":
# Work directly with elements like other renderers
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonImage(element)
return ""
else:
# Fallback to paragraph for unknown types
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonParagraph(element)
elif isinstance(sectionData, dict):
return self._renderJsonParagraph(sectionData)
return ""
except Exception as e:
self.logger.warning(f"Error rendering section {self._getSectionId(section)}: {str(e)}")
return f"*[Error rendering section: {str(e)}]*"
def _renderJsonTable(self, tableData: Dict[str, Any]) -> str:
"""Render a JSON table to markdown."""
try:
# Extract from nested content structure: element.content.{headers, rows}
content = tableData.get("content", {})
if not isinstance(content, dict):
return ""
headers = content.get("headers", [])
rows = content.get("rows", [])
if not headers or not rows:
return ""
markdownParts = []
# Create table header
headerLine = " | ".join(str(header) for header in headers)
markdownParts.append(headerLine)
# Add separator line
separatorLine = " | ".join("---" for _ in headers)
markdownParts.append(separatorLine)
# Add data rows
for row in rows:
rowLine = " | ".join(str(cellData) for cellData in row)
markdownParts.append(rowLine)
return '\n'.join(markdownParts)
except Exception as e:
self.logger.warning(f"Error rendering table: {str(e)}")
return ""
def _renderJsonBulletList(self, listData: Dict[str, Any]) -> str:
"""Render a JSON bullet list to markdown."""
try:
# Extract from nested content structure: element.content.{items}
content = listData.get("content", {})
if not isinstance(content, dict):
return ""
items = content.get("items", [])
if not items:
return ""
markdownParts = []
for item in items:
if isinstance(item, str):
markdownParts.append(f"- {item}")
elif isinstance(item, dict) and "text" in item:
markdownParts.append(f"- {item['text']}")
return '\n'.join(markdownParts)
except Exception as e:
self.logger.warning(f"Error rendering bullet list: {str(e)}")
return ""
def _renderJsonHeading(self, headingData: Dict[str, Any]) -> str:
"""Render a JSON heading to markdown."""
try:
# Extract from nested content structure: element.content.{text, level}
content = headingData.get("content", {})
if not isinstance(content, dict):
return ""
text = content.get("text", "")
level = content.get("level", 1)
if text:
level = max(1, min(6, level))
return f"{'#' * level} {text}"
return ""
except Exception as e:
self.logger.warning(f"Error rendering heading: {str(e)}")
return ""
def _renderJsonParagraph(self, paragraphData: Dict[str, Any]) -> str:
"""Render a JSON paragraph to markdown."""
try:
# Extract from nested content structure
content = paragraphData.get("content", {})
if isinstance(content, dict):
text = content.get("text", "")
elif isinstance(content, str):
text = content
else:
text = ""
return text if text else ""
except Exception as e:
self.logger.warning(f"Error rendering paragraph: {str(e)}")
return ""
def _renderJsonCodeBlock(self, codeData: Dict[str, Any]) -> str:
"""Render a JSON code block to markdown."""
try:
# Extract from nested content structure
content = codeData.get("content", {})
if not isinstance(content, dict):
return ""
code = content.get("code", "")
language = content.get("language", "")
if code:
if language:
return f"```{language}\n{code}\n```"
else:
return f"```\n{code}\n```"
return ""
except Exception as e:
self.logger.warning(f"Error rendering code block: {str(e)}")
return ""
def _renderJsonImage(self, imageData: Dict[str, Any]) -> str:
"""Render a JSON image to markdown."""
try:
# Extract from nested content structure: element.content.{base64Data, altText, caption}
content = imageData.get("content", {})
if not isinstance(content, dict):
return ""
altText = content.get("altText", "Image")
base64Data = content.get("base64Data", "")
if base64Data:
# Full base64 payloads are too large to inline usefully in markdown,
# so emit a truncated data-URI placeholder with the alt text
return f"![{altText}](data:image/png;base64,{base64Data[:50]}...)"
else:
return f"![{altText}](image-placeholder)"
except Exception as e:
self.logger.warning(f"Error rendering image: {str(e)}")
return f"![{imageData.get('altText', 'Image')}](image-error)"

View file

@@ -0,0 +1,944 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
PDF renderer for report generation using reportlab.
"""
from .documentRendererBaseTemplate import BaseRenderer
from modules.datamodels.datamodelDocument import RenderedDocument
from typing import Dict, Any, List, Optional
import io
import base64
try:
from reportlab.lib.pagesizes import letter, A4
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle, PageBreak
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
from reportlab.lib import colors
from reportlab.lib.enums import TA_CENTER, TA_LEFT, TA_RIGHT, TA_JUSTIFY
REPORTLAB_AVAILABLE = True
except ImportError:
REPORTLAB_AVAILABLE = False
class RendererPdf(BaseRenderer):
"""Renders content to PDF format using reportlab."""
@classmethod
def getSupportedFormats(cls) -> List[str]:
"""Return supported PDF formats."""
return ['pdf']
@classmethod
def getFormatAliases(cls) -> List[str]:
"""Return format aliases."""
return ['document', 'print']
@classmethod
def getPriority(cls) -> int:
"""Return priority for PDF renderer."""
return 120
@classmethod
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
"""Return output style classification: PDF documents are formatted documents."""
return 'document'
@classmethod
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
"""
Return list of section content types that PDF renderer accepts.
PDF renderer accepts all section types (PDF documents can contain all content types).
"""
from modules.datamodels.datamodelJson import supportedSectionTypes
return list(supportedSectionTypes)
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
"""Render extracted JSON content to PDF format using AI-analyzed styling."""
try:
if not REPORTLAB_AVAILABLE:
# Fallback to HTML if reportlab not available
from .rendererHtml import RendererHtml
html_renderer = RendererHtml()
return await html_renderer.render(extractedContent, title, userPrompt, aiService)
# Generate PDF using AI-analyzed styling
pdf_content = await self._generatePdfFromJson(extractedContent, title, userPrompt, aiService)
# Extract metadata for document type and other info
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
# Determine filename from document or title
documents = extractedContent.get("documents", [])
if documents and isinstance(documents[0], dict):
filename = documents[0].get("filename")
if not filename:
filename = self._determineFilename(title, "application/pdf")
else:
filename = self._determineFilename(title, "application/pdf")
# Convert PDF content to bytes if it's a string (base64)
if isinstance(pdf_content, str):
# Try to decode as base64, otherwise encode as UTF-8
try:
pdf_bytes = base64.b64decode(pdf_content)
except Exception:
pdf_bytes = pdf_content.encode('utf-8')
else:
pdf_bytes = pdf_content
return [
RenderedDocument(
documentData=pdf_bytes,
mimeType="application/pdf",
filename=filename,
documentType=documentType,
metadata=metadata if isinstance(metadata, dict) else None
)
]
except Exception as e:
self.logger.error(f"Error rendering PDF: {str(e)}")
# Return minimal fallback
fallbackContent = f"PDF Generation Error: {str(e)}"
return [
RenderedDocument(
documentData=fallbackContent.encode('utf-8'),
mimeType="text/plain",
filename=self._determineFilename(title, "text/plain")
)
]
async def _generatePdfFromJson(self, json_content: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> str:
"""Generate PDF content from structured JSON document using AI-generated styling."""
try:
# Get style set: use styles from metadata if available, otherwise enhance with AI
styles = await self._getStyleSet(json_content, userPrompt, aiService)
# Validate JSON structure
if not self._validateJsonStructure(json_content):
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
# Extract sections and metadata from standardized schema
sections = self._extractSections(json_content)
metadata = self._extractMetadata(json_content)
# Use provided title (which comes from documents[].title) as primary source
# Fallback to metadata.title only if title parameter is empty
document_title = title if title else metadata.get("title", "Generated Document")
# Make title shorter to prevent wrapping/overlapping
if len(document_title) > 40:
document_title = document_title[:37] + "..."  # truncate long titles instead of replacing them
# Create a buffer to hold the PDF
buffer = io.BytesIO()
# Create PDF document
doc = SimpleDocTemplate(
buffer,
pagesize=A4,
rightMargin=72,
leftMargin=72,
topMargin=72,
bottomMargin=18
)
# Build PDF content
story = []
# Title page
title_style = self._createTitleStyle(styles)
story.append(Paragraph(document_title, title_style))
story.append(Spacer(1, 50)) # Increased spacing to prevent overlap
story.append(Paragraph(f"Generated: {self._formatTimestamp()}", self._createNormalStyle(styles)))
story.append(Spacer(1, 30)) # Add spacing before page break
story.append(PageBreak())
# Process each section (sections already extracted above)
self.services.utils.debugLogToFile(f"PDF SECTIONS TO PROCESS: {len(sections)} sections", "PDF_RENDERER")
for i, section in enumerate(sections):
self.services.utils.debugLogToFile(f"PDF SECTION {i}: content_type={section.get('content_type', 'unknown')}, id={section.get('id', 'unknown')}", "PDF_RENDERER")
section_elements = self._renderJsonSection(section, styles)
self.services.utils.debugLogToFile(f"PDF SECTION {i} ELEMENTS: {len(section_elements)} elements", "PDF_RENDERER")
story.extend(section_elements)
# Build PDF
doc.build(story)
# Get PDF content as base64
buffer.seek(0)
pdf_bytes = buffer.getvalue()
pdf_base64 = base64.b64encode(pdf_bytes).decode('utf-8')
return pdf_base64
except Exception as e:
self.logger.error(f"Error generating PDF from JSON: {str(e)}")
raise Exception(f"PDF generation failed: {str(e)}")
async def _getStyleSet(self, extractedContent: Dict[str, Any] = None, userPrompt: str = None, aiService=None, templateName: str = None) -> Dict[str, Any]:
"""Get style set - use styles from document generation metadata if available,
otherwise enhance default styles with AI if userPrompt provided.
IMPORTANT: In a dynamic, scalable AI system, styling should come from document generation,
not be generated separately by renderers. Only fall back to AI if styles not provided.
Args:
extractedContent: Document content with metadata (may contain styles)
userPrompt: User's prompt (AI will detect style instructions in any language)
aiService: AI service (used only if styles not in metadata and userPrompt provided)
templateName: Name of template style set (None = default)
Returns:
Dict with style definitions for all document styles
"""
# Get default style set
defaultStyleSet = self._getDefaultStyleSet()
# FIRST: Check if styles are provided in document generation metadata (preferred approach)
if extractedContent:
metadata = extractedContent.get("metadata", {})
if isinstance(metadata, dict):
styles = metadata.get("styles")
if styles and isinstance(styles, dict):
self.logger.debug("Using styles from document generation metadata")
enhancedStyleSet = self._convertColorsFormat(styles)
return self._validateStylesContrast(enhancedStyleSet)
# FALLBACK: Enhance with AI if userPrompt provided (only if styles not in metadata)
if userPrompt and aiService:
self.logger.info(f"Styles not in metadata, enhancing with AI based on user prompt...")
enhancedStyleSet = await self._enhanceStylesWithAI(userPrompt, defaultStyleSet, aiService)
# Convert colors to PDF format after getting styles
enhancedStyleSet = self._convertColorsFormat(enhancedStyleSet)
return self._validateStylesContrast(enhancedStyleSet)
else:
# Use default styles only
return defaultStyleSet
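# Illustrative precedence example (keys hypothetical): if the generator
# already supplied styles, e.g.
#   {"metadata": {"styles": {"title": {"font_size": 22, "color": "#1F4E79"}}}, ...}
# the metadata branch above wins and no AI call is made; with no metadata
# styles but a userPrompt, the defaults are AI-enhanced; otherwise the
# defaults are returned unchanged.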
async def _enhanceStylesWithAI(self, userPrompt: str, defaultStyleSet: Dict[str, Any], aiService) -> Dict[str, Any]:
"""Enhance default styles with AI based on user prompt."""
try:
style_template = self._createAiStyleTemplate("pdf", userPrompt, defaultStyleSet)
enhanced_styles = await self._getAiStyles(aiService, style_template, defaultStyleSet)
return enhanced_styles
except Exception as e:
self.logger.warning(f"AI style enhancement failed: {str(e)}, using default styles")
return defaultStyleSet
def _validateStylesContrast(self, styles: Dict[str, Any]) -> Dict[str, Any]:
"""Validate and fix contrast issues in AI-generated styles."""
try:
# Fix table header contrast
if "table_header" in styles:
header = styles["table_header"]
bg_color = header.get("background", "#FFFFFF")
text_color = header.get("text_color", "#000000")
# If both are white or both are dark, fix it
if bg_color.upper() == "#FFFFFF" and text_color.upper() == "#FFFFFF":
header["background"] = "#4F4F4F"
header["text_color"] = "#FFFFFF"
elif bg_color.upper() == "#000000" and text_color.upper() == "#000000":
header["background"] = "#4F4F4F"
header["text_color"] = "#FFFFFF"
# Fix table cell contrast
if "table_cell" in styles:
cell = styles["table_cell"]
bg_color = cell.get("background", "#FFFFFF")
text_color = cell.get("text_color", "#000000")
# If both are white or both are dark, fix it
if bg_color.upper() == "#FFFFFF" and text_color.upper() == "#FFFFFF":
cell["background"] = "#FFFFFF"
cell["text_color"] = "#2F2F2F"
elif bg_color.upper() == "#000000" and text_color.upper() == "#000000":
cell["background"] = "#FFFFFF"
cell["text_color"] = "#2F2F2F"
return styles
except Exception as e:
self.logger.warning(f"Style validation failed: {str(e)}")
return self._getDefaultStyleSet()
def _getDefaultStyleSet(self) -> Dict[str, Any]:
"""Default PDF style set - used when no style instructions present."""
return {
"title": {"font_size": 24, "color": "#1F4E79", "bold": True, "align": "center", "space_after": 30},
"heading1": {"font_size": 18, "color": "#2F2F2F", "bold": True, "align": "left", "space_after": 12, "space_before": 12},
"heading2": {"font_size": 14, "color": "#4F4F4F", "bold": True, "align": "left", "space_after": 8, "space_before": 8},
"paragraph": {"font_size": 11, "color": "#2F2F2F", "bold": False, "align": "left", "space_after": 6, "line_height": 1.2},
"table_header": {"background": "#4F4F4F", "text_color": "#FFFFFF", "bold": True, "align": "center", "font_size": 12},
"table_cell": {"background": "#FFFFFF", "text_color": "#2F2F2F", "bold": False, "align": "left", "font_size": 10},
"bullet_list": {"font_size": 11, "color": "#2F2F2F", "space_after": 3},
"code_block": {"font": "Courier", "font_size": 9, "color": "#2F2F2F", "background": "#F5F5F5", "space_after": 6}
}
async def _getAiStylesWithPdfColors(self, ai_service, style_template: str, default_styles: Dict[str, Any]) -> Dict[str, Any]:
"""Get AI styles with proper PDF color conversion."""
if not ai_service:
return default_styles
try:
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
request_options = AiCallOptions()
request_options.operationType = OperationTypeEnum.DATA_GENERATE
request = AiCallRequest(prompt=style_template, context="", options=request_options)
# Check if AI service is properly configured
if not hasattr(ai_service, 'aiObjects') or not ai_service.aiObjects:
self.logger.warning("AI service not properly configured, using defaults")
return default_styles
response = await ai_service.callAi(request)
# Check if response is valid
if not response:
self.logger.warning("AI service returned no response, using defaults")
return default_styles
import json
import re
# Clean and parse JSON
result = response.content.strip() if response and response.content else ""
# Check if result is empty
if not result:
self.logger.warning("AI styling returned empty response, using defaults")
return default_styles
# Log the raw response for debugging
self.logger.debug(f"AI styling raw response: {result[:200]}...")
# Extract JSON from various formats
json_match = re.search(r'```json\s*\n(.*?)\n```', result, re.DOTALL)
if json_match:
result = json_match.group(1).strip()
elif result.startswith('```json'):
result = re.sub(r'^```json\s*', '', result)
result = re.sub(r'\s*```$', '', result)
elif result.startswith('```'):
result = re.sub(r'^```\s*', '', result)
result = re.sub(r'\s*```$', '', result)
# Try to extract JSON from explanatory text
json_patterns = [
r'\{[^{}]*"title"[^{}]*\}', # Simple JSON object
r'\{.*?"title".*?\}', # JSON with title field
r'\{.*?"font_size".*?\}', # JSON with font_size field
]
for pattern in json_patterns:
json_match = re.search(pattern, result, re.DOTALL)
if json_match:
result = json_match.group(0)
break
# Additional cleanup - remove any leading/trailing whitespace and newlines
result = result.strip()
# Check if result is still empty after cleanup
if not result:
self.logger.warning("AI styling returned empty content after cleanup, using defaults")
return default_styles
# Try to parse JSON
try:
styles = json.loads(result)
self.logger.debug(f"Successfully parsed AI styles: {list(styles.keys())}")
except json.JSONDecodeError as json_error:
self.logger.warning(f"AI styling returned invalid JSON: {json_error}")
# Write the full response to a debug file (logger output would be truncated)
self.services.utils.debugLogToFile(f"FULL AI RESPONSE THAT FAILED TO PARSE: {result}", "PDF_RENDERER")
self.services.utils.debugLogToFile(f"RESPONSE LENGTH: {len(result)} characters", "PDF_RENDERER")
self.logger.warning(f"Raw content that failed to parse: {result}")
# Try to fix incomplete JSON by adding missing closing braces
open_braces = result.count('{')
close_braces = result.count('}')
if open_braces > close_braces:
# JSON is incomplete, add missing closing braces
missing_braces = open_braces - close_braces
result = result + '}' * missing_braces
self.logger.info(f"Added {missing_braces} missing closing brace(s)")
# Try parsing the fixed JSON
try:
styles = json.loads(result)
self.logger.info("Successfully fixed incomplete JSON")
except json.JSONDecodeError as fix_error:
self.logger.warning(f"Fixed JSON still invalid: {fix_error}")
# Try to extract just the JSON part if it's embedded in text
json_start = result.find('{')
json_end = result.rfind('}')
if json_start != -1 and json_end != -1 and json_end > json_start:
json_part = result[json_start:json_end+1]
try:
styles = json.loads(json_part)
self.logger.info("Successfully extracted JSON from explanatory text")
except json.JSONDecodeError:
self.logger.warning("Could not extract valid JSON from response, using defaults")
return default_styles
else:
return default_styles
else:
# Try to extract just the JSON part if it's embedded in text
json_start = result.find('{')
json_end = result.rfind('}')
if json_start != -1 and json_end != -1 and json_end > json_start:
json_part = result[json_start:json_end+1]
try:
styles = json.loads(json_part)
self.logger.info("Successfully extracted JSON from explanatory text")
except json.JSONDecodeError:
self.logger.warning("Could not extract valid JSON from response, using defaults")
return default_styles
else:
return default_styles
# Convert colors to PDF format (keep as hex strings, PDF renderer will convert them)
styles = self._convertColorsFormat(styles)
return styles
except Exception as e:
self.logger.warning(f"AI styling failed: {str(e)}, using defaults")
return default_styles
def _convertColorsFormat(self, styles: Dict[str, Any]) -> Dict[str, Any]:
"""Convert colors to proper format for PDF compatibility."""
try:
for style_name, style_config in styles.items():
if isinstance(style_config, dict):
for prop, value in style_config.items():
if isinstance(value, str) and value.startswith('#') and len(value) == 7:
# Convert #RRGGBB to AARRGGBB (prepend opaque FF alpha, drop the '#') for consistency
styles[style_name][prop] = f"FF{value[1:]}"
elif isinstance(value, str) and value.startswith('#') and len(value) == 9:
# Already aRGB format, keep as is
pass
return styles
except Exception as e:
self.logger.warning(f"Color conversion failed: {str(e)}")
return styles
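# Illustrative example: "#1F4E79" (7 chars) becomes "FF1F4E79", an opaque
# aRGB string without the leading '#'. Nine-character values such as
# "#FF1F4E79" are left untouched; _hexToColor later strips both the '#'
# and the alpha prefix again for reportlab.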
def _getSafeColor(self, color_value: str, default: str = "#000000") -> str:
"""Get a safe hex color value for PDF."""
if isinstance(color_value, str) and color_value.startswith('#'):
if len(color_value) == 7:
return f"FF{color_value[1:]}"
elif len(color_value) == 9:
return color_value
return default
def _createTitleStyle(self, styles: Dict[str, Any]) -> ParagraphStyle:
"""Create title style from style definitions."""
title_style_def = styles.get("title", {})
# DEBUG: Show what color and spacing is being used for title
title_color = title_style_def.get("color", "#1F4E79")
title_space_after = title_style_def.get("space_after", 30)
self.services.utils.debugLogToFile(f"PDF TITLE COLOR: {title_color} -> {self._hexToColor(title_color)}", "PDF_RENDERER")
self.services.utils.debugLogToFile(f"PDF TITLE SPACE_AFTER: {title_space_after}", "PDF_RENDERER")
return ParagraphStyle(
'CustomTitle',
fontSize=title_style_def.get("font_size", 20), # Reduced from 24 to 20
spaceAfter=title_style_def.get("space_after", 30),
alignment=self._getAlignment(title_style_def.get("align", "center")),
textColor=self._hexToColor(title_color),
leading=title_style_def.get("font_size", 20) * 1.4, # Add line spacing for multi-line titles
spaceBefore=0 # Ensure no space before title
)
def _createHeadingStyle(self, styles: Dict[str, Any], level: int) -> ParagraphStyle:
"""Create heading style from style definitions."""
heading_key = f"heading{level}"
heading_style_def = styles.get(heading_key, styles.get("heading1", {}))
return ParagraphStyle(
f'CustomHeading{level}',
fontSize=heading_style_def.get("font_size", 18 - level * 2),
spaceAfter=heading_style_def.get("space_after", 12),
spaceBefore=heading_style_def.get("space_before", 12),
alignment=self._getAlignment(heading_style_def.get("align", "left")),
textColor=self._hexToColor(heading_style_def.get("color", "#2F2F2F"))
)
def _createNormalStyle(self, styles: Dict[str, Any]) -> ParagraphStyle:
"""Create normal paragraph style from style definitions."""
paragraph_style_def = styles.get("paragraph", {})
return ParagraphStyle(
'CustomNormal',
fontSize=paragraph_style_def.get("font_size", 11),
spaceAfter=paragraph_style_def.get("space_after", 6),
alignment=self._getAlignment(paragraph_style_def.get("align", "left")),
textColor=self._hexToColor(paragraph_style_def.get("color", "#2F2F2F")),
leading=paragraph_style_def.get("line_height", 1.2) * paragraph_style_def.get("font_size", 11)
)
def _getAlignment(self, align: str) -> int:
"""Convert alignment string to reportlab alignment constant."""
if not align or not isinstance(align, str):
return TA_LEFT
align_map = {
"center": TA_CENTER,
"left": TA_LEFT,
"justify": TA_JUSTIFY,
"right": TA_LEFT, # ReportLab doesn't have TA_RIGHT, use LEFT as fallback
"0": TA_LEFT, # Handle numeric strings
"1": TA_CENTER,
"2": TA_JUSTIFY
}
return align_map.get(align.lower().strip(), TA_LEFT)
def _getTableAlignment(self, align: str) -> str:
"""Convert alignment string to ReportLab table alignment string."""
if not align or not isinstance(align, str):
return 'LEFT'
align_map = {
"center": 'CENTER',
"left": 'LEFT',
"justify": 'LEFT', # Tables don't support justify, use LEFT
"right": 'RIGHT',
"0": 'LEFT', # Handle numeric strings
"1": 'CENTER',
"2": 'LEFT' # Tables don't support justify, use LEFT
}
return align_map.get(align.lower().strip(), 'LEFT')
def _hexToColor(self, hex_color: str) -> colors.Color:
"""Convert hex color to reportlab color."""
try:
hex_color = hex_color.lstrip('#')
# Handle aRGB format (8 characters: FF + RGB)
if len(hex_color) == 8:
# Skip the alpha channel (first 2 characters)
hex_color = hex_color[2:]
# Handle RGB format (6 characters)
if len(hex_color) == 6:
r = int(hex_color[0:2], 16) / 255.0
g = int(hex_color[2:4], 16) / 255.0
b = int(hex_color[4:6], 16) / 255.0
return colors.Color(r, g, b)
# Fallback for other formats
return colors.black
except Exception:
return colors.black
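# Illustrative examples:
#   _hexToColor("FF1F4E79") -> colors.Color(0x1F/255, 0x4E/255, 0x79/255)
#   _hexToColor("#2F2F2F")  -> colors.Color(0x2F/255, 0x2F/255, 0x2F/255)
#   anything unparseable    -> colors.black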
def _renderJsonSection(self, section: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
"""Render a single JSON section to PDF elements using AI-generated styles.
Supports three content formats: reference, object (base64), extracted_text.
"""
try:
section_type = self._getSectionType(section)
elements = self._getSectionData(section)
# Process each element in the section
all_elements = []
for element in elements:
element_type = element.get("type", "") if isinstance(element, dict) else ""
# Support three content formats from Phase 5D
if element_type == "reference":
# Document reference format
doc_ref = element.get("documentReference", "")
label = element.get("label", "Reference")
ref_style = ParagraphStyle(
'Reference',
parent=self._createNormalStyle(styles),
fontName='Helvetica-Oblique',  # italic face; ParagraphStyle has no fontStyle attribute
textColor=colors.grey
)
all_elements.append(Paragraph(f"[Reference: {label}]", ref_style))
all_elements.append(Spacer(1, 6))
continue
elif element_type == "extracted_text":
# Extracted text format
content = element.get("content", "")
source = element.get("source", "")
if content:
source_text = f" <i>(Source: {source})</i>" if source else ""
all_elements.append(Paragraph(f"{content}{source_text}", self._createNormalStyle(styles)))
all_elements.append(Spacer(1, 6))
continue
# Check element type, not section type (elements can have different types than section)
if element_type == "table":
all_elements.extend(self._renderJsonTable(element, styles))
elif element_type == "bullet_list":
all_elements.extend(self._renderJsonBulletList(element, styles))
elif element_type == "heading":
all_elements.extend(self._renderJsonHeading(element, styles))
elif element_type == "paragraph":
all_elements.extend(self._renderJsonParagraph(element, styles))
elif element_type == "code_block":
all_elements.extend(self._renderJsonCodeBlock(element, styles))
elif element_type == "image":
all_elements.extend(self._renderJsonImage(element, styles))
else:
# Fallback: if element_type not set, use section_type as fallback
if section_type == "table":
all_elements.extend(self._renderJsonTable(element, styles))
elif section_type == "bullet_list":
all_elements.extend(self._renderJsonBulletList(element, styles))
elif section_type == "heading":
all_elements.extend(self._renderJsonHeading(element, styles))
elif section_type == "paragraph":
all_elements.extend(self._renderJsonParagraph(element, styles))
elif section_type == "code_block":
all_elements.extend(self._renderJsonCodeBlock(element, styles))
elif section_type == "image":
all_elements.extend(self._renderJsonImage(element, styles))
else:
# Final fallback to paragraph for unknown types
all_elements.extend(self._renderJsonParagraph(element, styles))
return all_elements
except Exception as e:
self.logger.warning(f"Error rendering section {self._getSectionId(section)}: {str(e)}")
return [Paragraph(f"[Error rendering section: {str(e)}]", self._createNormalStyle(styles))]
def _renderJsonTable(self, table_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
"""Render a JSON table to PDF elements using AI-generated styles."""
try:
# Handle nested content structure: element.content.headers vs element.headers
# Extract from nested content structure
content = table_data.get("content", {})
if not isinstance(content, dict):
return []
headers = content.get("headers", [])
rows = content.get("rows", [])
if not headers or not rows:
return []
# Prepare table data
table_data_list = [headers] + rows
# Create table
table = Table(table_data_list)
# Apply styling
table_header_style = styles.get("table_header", {})
table_cell_style = styles.get("table_cell", {})
table_style = [
('BACKGROUND', (0, 0), (-1, 0), self._hexToColor(table_header_style.get("background", "#4F4F4F"))),
('TEXTCOLOR', (0, 0), (-1, 0), self._hexToColor(table_header_style.get("text_color", "#FFFFFF"))),
('ALIGN', (0, 0), (-1, -1), self._getTableAlignment(table_cell_style.get("align", "left"))),
('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold' if table_header_style.get("bold", True) else 'Helvetica'),
('FONTSIZE', (0, 0), (-1, 0), table_header_style.get("font_size", 12)),
('BOTTOMPADDING', (0, 0), (-1, 0), 12),
('BACKGROUND', (0, 1), (-1, -1), self._hexToColor(table_cell_style.get("background", "#FFFFFF"))),
('FONTSIZE', (0, 1), (-1, -1), table_cell_style.get("font_size", 10)),
('GRID', (0, 0), (-1, -1), 1, colors.black)
]
table.setStyle(TableStyle(table_style))
return [table, Spacer(1, 12)]
except Exception as e:
self.logger.warning(f"Error rendering table: {str(e)}")
return []
def _renderJsonBulletList(self, list_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
"""Render a JSON bullet list to PDF elements using AI-generated styles."""
try:
# Extract from nested content structure
content = list_data.get("content", {})
if not isinstance(content, dict):
return []
items = content.get("items", [])
bullet_style_def = styles.get("bullet_list", {})
elements = []
for item in items:
if isinstance(item, str):
elements.append(Paragraph(f"{item}", self._createNormalStyle(styles)))
elif isinstance(item, dict) and "text" in item:
elements.append(Paragraph(f"{item['text']}", self._createNormalStyle(styles)))
if elements:
elements.append(Spacer(1, bullet_style_def.get("space_after", 3)))
return elements
except Exception as e:
self.logger.warning(f"Error rendering bullet list: {str(e)}")
return []
def _renderJsonHeading(self, heading_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
"""Render a JSON heading to PDF elements using AI-generated styles."""
try:
# Extract from nested content structure
content = heading_data.get("content", {})
if not isinstance(content, dict):
return []
text = content.get("text", "")
level = content.get("level", 1)
if text:
level = max(1, min(6, level))
heading_style = self._createHeadingStyle(styles, level)
return [Paragraph(text, heading_style)]
return []
except Exception as e:
self.logger.warning(f"Error rendering heading: {str(e)}")
return []
def _renderJsonParagraph(self, paragraph_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
"""Render a JSON paragraph to PDF elements using AI-generated styles."""
try:
# Extract from nested content structure
content = paragraph_data.get("content", {})
if isinstance(content, dict):
text = content.get("text", "")
elif isinstance(content, str):
text = content
else:
text = ""
if text:
return [Paragraph(text, self._createNormalStyle(styles))]
return []
except Exception as e:
self.logger.warning(f"Error rendering paragraph: {str(e)}")
return []
def _renderJsonCodeBlock(self, code_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
"""Render a JSON code block to PDF elements using AI-generated styles."""
try:
# Extract from nested content structure
content = code_data.get("content", {})
if not isinstance(content, dict):
return []
code = content.get("code", "")
language = content.get("language", "")
code_style_def = styles.get("code_block", {})
if code:
elements = []
if language:
lang_style = ParagraphStyle(
'CodeLanguage',
fontSize=code_style_def.get("font_size", 9),
textColor=self._hexToColor(code_style_def.get("color", "#2F2F2F")),
fontName='Helvetica-Bold'
)
elements.append(Paragraph(f"Code ({language}):", lang_style))
code_style = ParagraphStyle(
'CodeBlock',
fontSize=code_style_def.get("font_size", 9),
textColor=self._hexToColor(code_style_def.get("color", "#2F2F2F")),
fontName=code_style_def.get("font", "Courier"),
backColor=self._hexToColor(code_style_def.get("background", "#F5F5F5")),
spaceAfter=code_style_def.get("space_after", 6)
)
elements.append(Paragraph(code, code_style))
return elements
return []
except Exception as e:
self.logger.warning(f"Error rendering code block: {str(e)}")
return []
def _renderJsonImage(self, image_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
"""Render a JSON image to PDF elements using reportlab."""
try:
# Extract from nested content structure
content = image_data.get("content", {})
base64_data = ""
alt_text = "Image"
caption = ""
if isinstance(content, dict):
# Nested content structure
base64_data = content.get("base64Data", "")
alt_text = content.get("altText", "Image")
caption = content.get("caption", "")
elif isinstance(content, str):
# Content might be base64 string directly (shouldn't happen, but handle it)
self.logger.warning("Image content is a string, not a dict. This should not happen.")
return [Paragraph(f"[Image: Invalid format]", self._createNormalStyle(styles))]
# If base64Data not found in content, try direct element fields (fallback)
if not base64_data:
base64_data = image_data.get("base64Data", "")
if not alt_text or alt_text == "Image":
alt_text = image_data.get("altText", "Image")
if not caption:
caption = image_data.get("caption", "")
# If base64Data still not found, try extracting from url data URI
if not base64_data:
url = image_data.get("url", "") or (content.get("url", "") if isinstance(content, dict) else "")
if url and isinstance(url, str) and url.startswith("data:image/"):
# Extract base64 from data URI: data:image/png;base64,<base64>
import re
match = re.match(r'data:image/[^;]+;base64,(.+)', url)
if match:
base64_data = match.group(1)
if not base64_data:
self.logger.warning(f"No base64 data found for image. Alt text: {alt_text}")
return [Paragraph(f"[Image: {alt_text}]", self._createNormalStyle(styles))]
# Validate that base64_data is actually base64 (not the entire element rendered as text)
if len(base64_data) > 10000: # Very long string might be entire element JSON
self.logger.warning(f"Base64 data seems too long ({len(base64_data)} chars), might be incorrectly extracted")
# Ensure base64_data is a string, not bytes or other type
if not isinstance(base64_data, str):
self.logger.warning(f"Base64 data is not a string: {type(base64_data)}")
return [Paragraph(f"[Image: {alt_text} - Invalid data type]", self._createNormalStyle(styles))]
try:
from reportlab.platypus import Image as ReportLabImage
from reportlab.lib.units import inch
import base64
import io
# Decode base64 image data
imageBytes = base64.b64decode(base64_data)
imageStream = io.BytesIO(imageBytes)
# Create reportlab Image element
# Try to get image dimensions from PIL
try:
from PIL import Image as PILImage
from reportlab.lib.pagesizes import A4
pilImage = PILImage.open(imageStream)
originalWidth, originalHeight = pilImage.size
# Calculate available page dimensions (A4 with margins: 72pt left/right, 72pt top, 18pt bottom)
pageWidth = A4[0] # 595.27 points
pageHeight = A4[1] # 841.89 points
leftMargin = 72
rightMargin = 72
topMargin = 72
bottomMargin = 18
# Use actual frame dimensions from SimpleDocTemplate
# Frame is smaller than page minus margins due to internal spacing
# From error message: frame is 439.27559055118115 x 739.8897637795277
# Use conservative values with safety margin
availableWidth = 430.0 # Slightly smaller than frame width for safety
availableHeight = 730.0 # Slightly smaller than frame height for safety
# Convert original image size from pixels to points
# PIL provides size in pixels, need to convert to points
# Standard conversion: 1 inch = 72 points, typical screen DPI = 96 pixels/inch
# So: pixels * (72/96) = points, or pixels * 0.75 = points
# But for images, we should use the image's actual DPI if available
dpi = pilImage.info.get('dpi', (96, 96))[0] # Default to 96 DPI if not specified
if dpi <= 0:
dpi = 96 # Fallback to 96 DPI
# Convert pixels to points: 1 point = 1/72 inch, so pixels * (72/dpi) = points
imgWidthPoints = originalWidth * (72.0 / dpi)
imgHeightPoints = originalHeight * (72.0 / dpi)
# Scale to fit within available page dimensions while maintaining aspect ratio
widthScale = availableWidth / imgWidthPoints if imgWidthPoints > 0 else 1.0
heightScale = availableHeight / imgHeightPoints if imgHeightPoints > 0 else 1.0
# Use the smaller scale to ensure image fits both width and height
scale = min(widthScale, heightScale, 1.0) # Don't scale up, only down
imgWidth = imgWidthPoints * scale
imgHeight = imgHeightPoints * scale
# Additional safety check: ensure dimensions don't exceed available space
if imgWidth > availableWidth:
scale = availableWidth / imgWidth
imgWidth = availableWidth
imgHeight = imgHeight * scale
if imgHeight > availableHeight:
scale = availableHeight / imgHeight
imgHeight = availableHeight
imgWidth = imgWidth * scale
# Reset stream for reportlab
imageStream.seek(0)
except Exception as e:
# Fallback: use default size that fits page
self.logger.warning(f"Error calculating image size: {str(e)}, using safe default")
# Use 80% of available width as safe default
imgWidth = 4 * inch # ~288 points, safe for ~451pt available width
imgHeight = 3 * inch # ~216 points, safe for ~751pt available height
imageStream.seek(0)
# Create reportlab Image
reportlabImage = ReportLabImage(imageStream, width=imgWidth, height=imgHeight)
elements = [reportlabImage]
# Add caption if available
if caption:
captionStyle = self._createNormalStyle(styles)
captionStyle.fontSize = 10
captionStyle.textColor = self._hexToColor(styles.get("paragraph", {}).get("color", "#666666"))
elements.append(Paragraph(f"<i>{caption}</i>", captionStyle))
elif alt_text and alt_text != "Image":
# Use alt text as caption if no caption provided, but avoid usageHint format
if "Render as visual element:" in alt_text:
# Extract filename from usageHint if possible
parts = alt_text.split("Render as visual element:")
if len(parts) > 1:
filename = parts[1].strip()
caption_text = f"Figure: {filename}"
else:
caption_text = alt_text
else:
caption_text = f"Figure: {alt_text}"
captionStyle = self._createNormalStyle(styles)
captionStyle.fontSize = 10
captionStyle.textColor = self._hexToColor(styles.get("paragraph", {}).get("color", "#666666"))
elements.append(Paragraph(f"<i>{caption_text}</i>", captionStyle))
return elements
except Exception as imgError:
self.logger.error(f"Error embedding image in PDF: {str(imgError)}")
# Return error message instead of placeholder
errorStyle = self._createNormalStyle(styles)
errorStyle.textColor = self._hexToColor("#FF0000") # Red color for error
errorMsg = f"[Error: Could not embed image '{alt_text}'. {str(imgError)}]"
return [Paragraph(errorMsg, errorStyle)]
except Exception as e:
self.logger.error(f"Error rendering image: {str(e)}")
errorStyle = self._createNormalStyle(styles)
errorStyle.textColor = self._hexToColor("#FF0000") # Red color for error
errorMsg = f"[Error: Could not render image '{image_data.get('altText', 'Image')}'. {str(e)}]"
return [Paragraph(errorMsg, errorStyle)]

File diff suppressed because it is too large

View file

@ -0,0 +1,380 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Text renderer for report generation.
"""
from .documentRendererBaseTemplate import BaseRenderer
from modules.datamodels.datamodelDocument import RenderedDocument
from typing import Dict, Any, List, Optional
class RendererText(BaseRenderer):
"""Renders content to plain text format with format-specific extraction."""
@classmethod
def getSupportedFormats(cls) -> List[str]:
"""Return supported text formats (excluding formats with dedicated renderers)."""
return [
'txt', 'text', 'plain',
# Programming languages
'js', 'javascript', 'ts', 'typescript', 'jsx', 'tsx',
'py', 'python', 'java', 'cpp', 'c', 'h', 'hpp',
'cs', 'csharp', 'php', 'rb', 'ruby', 'go', 'rs', 'rust',
'swift', 'kt', 'kotlin', 'scala', 'r', 'm', 'objc',
'sh', 'bash', 'zsh', 'fish', 'ps1', 'bat', 'cmd',
# Web technologies (excluding html/htm which have dedicated renderer)
'css', 'scss', 'sass', 'less', 'xml', 'yaml', 'yml', 'toml', 'ini', 'cfg',
# Data formats (excluding csv, md/markdown which have dedicated renderers)
'tsv', 'log', 'rst', 'sql', 'dockerfile', 'dockerignore', 'gitignore',
# Configuration files
'env', 'properties', 'conf', 'config', 'rc',
'gitattributes', 'editorconfig', 'eslintrc',
# Documentation
'readme', 'changelog', 'license', 'authors',
'contributing', 'todo', 'notes', 'docs'
]
@classmethod
def getFormatAliases(cls) -> List[str]:
"""Return format aliases."""
return [
'ascii', 'utf8', 'utf-8', 'code', 'source',
'script', 'program', 'file', 'document',
'raw', 'unformatted', 'plaintext'
]
@classmethod
def getPriority(cls) -> int:
"""Return priority for text renderer."""
return 90
@classmethod
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
"""
Return output style classification based on format.
For txt/text/plain: 'document' (unstructured text)
For all other formats: 'code' (structured formats with rules/syntax)
Note: formatName parameter is provided by registry when calling this method.
"""
# Plain text formats are document style
if formatName and formatName.lower() in ['txt', 'text', 'plain']:
return 'document'
# All other formats handled by RendererText are code style
return 'code'
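# Illustrative examples (not in the original source) of the classification above:
#   RendererText.getOutputStyle("txt") -> 'document'  (unstructured plain text)
#   RendererText.getOutputStyle("py")  -> 'code'      (structured format with syntax rules)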
@classmethod
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
"""
Return list of section content types that Text renderer accepts.
Text renderer accepts all section types except images (text formats cannot display images).
"""
from modules.datamodels.datamodelJson import supportedSectionTypes
# Text renderer accepts all types except images
return [st for st in supportedSectionTypes if st != "image"]
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: Optional[str] = None, aiService=None) -> List[RenderedDocument]:
"""Render extracted JSON content to plain text format."""
try:
# Generate text from JSON structure
textContent = self._generateTextFromJson(extractedContent, title)
# Determine filename from document or title
documents = extractedContent.get("documents", [])
if documents and isinstance(documents[0], dict):
filename = documents[0].get("filename")
if not filename:
filename = self._determineFilename(title, "text/plain")
else:
filename = self._determineFilename(title, "text/plain")
# Extract metadata for document type and other info
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
return [
RenderedDocument(
documentData=textContent.encode('utf-8'),
mimeType="text/plain",
filename=filename,
documentType=documentType,
metadata=metadata if isinstance(metadata, dict) else None
)
]
except Exception as e:
self.logger.error(f"Error rendering text: {str(e)}")
# Return minimal text fallback
fallbackContent = f"{title}\n\nError rendering report: {str(e)}"
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
return [
RenderedDocument(
documentData=fallbackContent.encode('utf-8'),
mimeType="text/plain",
filename=self._determineFilename(title, "text/plain"),
documentType=documentType,
metadata=metadata if isinstance(metadata, dict) else None
)
]
def _generateTextFromJson(self, jsonContent: Dict[str, Any], title: str) -> str:
"""Generate text content from structured JSON document."""
try:
# Validate JSON structure
if not self._validateJsonStructure(jsonContent):
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
# Extract sections and metadata from standardized schema
sections = self._extractSections(jsonContent)
metadata = self._extractMetadata(jsonContent)
# Use provided title (which comes from documents[].title) as primary source
# Fallback to metadata.title only if title parameter is empty
documentTitle = title if title else metadata.get("title", "Generated Document")
# Build text content
textParts = []
# Document title
textParts.append(documentTitle)
textParts.append("=" * len(documentTitle))
textParts.append("")
# Process each section
for section in sections:
sectionText = self._renderJsonSection(section)
if sectionText:
textParts.append(sectionText)
textParts.append("") # Add spacing between sections
# Add generation info
textParts.append("")
textParts.append(f"Generated: {self._formatTimestamp()}")
return '\n'.join(textParts)
except Exception as e:
self.logger.error(f"Error generating text from JSON: {str(e)}")
raise Exception(f"Text generation failed: {str(e)}")
def _renderJsonSection(self, section: Dict[str, Any]) -> str:
"""Render a single JSON section to text.
Supports three content formats: reference, object (base64), extracted_text.
"""
try:
sectionType = self._getSectionType(section)
sectionData = self._getSectionData(section)
# Check for three content formats from Phase 5D in elements
if isinstance(sectionData, list):
textParts = []
for element in sectionData:
element_type = element.get("type", "") if isinstance(element, dict) else ""
# Support three content formats from Phase 5D
if element_type == "reference":
# Document reference format
doc_ref = element.get("documentReference", "")
label = element.get("label", "Reference")
textParts.append(f"[Reference: {label}]")
continue
elif element_type == "extracted_text":
# Extracted text format
content = element.get("content", "")
source = element.get("source", "")
if content:
source_text = f" (Source: {source})" if source else ""
textParts.append(f"{content}{source_text}")
continue
# If we processed reference/extracted_text elements, return them
if textParts:
return '\n\n'.join(textParts)
if sectionType == "table":
# Work directly with elements like other renderers
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonTable(element)
return ""
elif sectionType == "bullet_list":
# Work directly with elements like other renderers
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonBulletList(element)
return ""
elif sectionType == "heading":
# Work directly with elements like other renderers
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonHeading(element)
return ""
elif sectionType == "paragraph":
# Render each paragraph element in the elements array
renderedElements = []
for element in sectionData:
renderedElements.append(self._renderJsonParagraph(element))
return "\n".join(renderedElements)
elif sectionType == "code_block":
# Work directly with elements like other renderers
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonCodeBlock(element)
return ""
elif sectionType == "image":
# Work directly with elements like other renderers
if isinstance(sectionData, list) and sectionData:
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
return self._renderJsonImage(element)
return ""
else:
# Fallback to paragraph for unknown types - render each element
# sectionData is already the elements array from _getSectionData
renderedElements = []
for element in sectionData:
renderedElements.append(self._renderJsonParagraph(element))
return "\n".join(renderedElements)
except Exception as e:
self.logger.warning(f"Error rendering section {self._getSectionId(section)}: {str(e)}")
return f"[Error rendering section: {str(e)}]"
def _renderJsonTable(self, tableData: Dict[str, Any]) -> str:
"""Render a JSON table to text."""
try:
# Extract from nested content structure: element.content.{headers, rows}
content = tableData.get("content", {})
if not isinstance(content, dict):
return ""
headers = content.get("headers", [])
rows = content.get("rows", [])
if not headers or not rows:
return ""
textParts = []
# Create table header
headerLine = " | ".join(str(header) for header in headers)
textParts.append(headerLine)
# Add separator line
separatorLine = " | ".join("-" * len(str(header)) for header in headers)
textParts.append(separatorLine)
# Add data rows
for row in rows:
rowLine = " | ".join(str(cellData) for cellData in row)
textParts.append(rowLine)
return '\n'.join(textParts)
except Exception as e:
self.logger.warning(f"Error rendering table: {str(e)}")
return ""
def _renderJsonBulletList(self, listData: Dict[str, Any]) -> str:
"""Render a JSON bullet list to text."""
try:
# Extract from nested content structure: element.content.{items}
content = listData.get("content", {})
if not isinstance(content, dict):
return ""
items = content.get("items", [])
if not items:
return ""
textParts = []
for item in items:
if isinstance(item, str):
textParts.append(f"- {item}")
elif isinstance(item, dict) and "text" in item:
textParts.append(f"- {item['text']}")
return '\n'.join(textParts)
except Exception as e:
self.logger.warning(f"Error rendering bullet list: {str(e)}")
return ""
def _renderJsonHeading(self, headingData: Dict[str, Any]) -> str:
"""Render a JSON heading to text."""
try:
# Extract from nested content structure: element.content.{text, level}
content = headingData.get("content", {})
if not isinstance(content, dict):
return ""
text = content.get("text", "")
level = content.get("level", 1)
if text:
level = max(1, min(6, level))
if level == 1:
return f"{text}\n{'=' * len(text)}"
elif level == 2:
return f"{text}\n{'-' * len(text)}"
else:
return f"{'#' * level} {text}"
return ""
except Exception as e:
self.logger.warning(f"Error rendering heading: {str(e)}")
return ""
def _renderJsonParagraph(self, paragraphData: Dict[str, Any]) -> str:
"""Render a JSON paragraph to text."""
try:
# Extract from nested content structure
content = paragraphData.get("content", {})
if isinstance(content, dict):
text = content.get("text", "")
elif isinstance(content, str):
text = content
else:
text = ""
return text if text else ""
except Exception as e:
self.logger.warning(f"Error rendering paragraph: {str(e)}")
return ""
def _renderJsonCodeBlock(self, codeData: Dict[str, Any]) -> str:
"""Render a JSON code block to text."""
try:
# Extract from nested content structure: element.content.{code, language}
content = codeData.get("content", {})
if not isinstance(content, dict):
return ""
code = content.get("code", "")
language = content.get("language", "")
if code:
if language:
return f"Code ({language}):\n{code}"
else:
return code
return ""
except Exception as e:
self.logger.warning(f"Error rendering code block: {str(e)}")
return ""
def _renderJsonImage(self, imageData: Dict[str, Any]) -> str:
"""Render a JSON image to text."""
try:
# Extract from nested content structure: element.content.{base64Data, altText, caption}
content = imageData.get("content", {})
if isinstance(content, dict):
altText = content.get("altText", "Image")
else:
altText = imageData.get("altText", "Image")
return f"[Image: {altText}]"
except Exception as e:
self.logger.warning(f"Error rendering image: {str(e)}")
return f"[Image: Image]"

File diff suppressed because it is too large

File diff suppressed because it is too large

View file

@ -0,0 +1,163 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Content Integrator for hierarchical document generation.
Merges generated content into document structure and validates completeness.
"""
import logging
from typing import Dict, Any, List, Tuple
logger = logging.getLogger(__name__)
class ContentIntegrator:
"""Integrates generated content into document structure"""
def __init__(self, services: Any = None):
self.services = services
def integrateContent(
self,
structure: Dict[str, Any],
generatedSections: List[Dict[str, Any]]
) -> Dict[str, Any]:
"""
Merge generated sections into document structure.
Args:
structure: Original document structure
generatedSections: List of sections with populated elements
Returns:
Complete document structure ready for rendering
"""
try:
# Create mapping of section IDs to generated sections
sectionMap = {section.get("id"): section for section in generatedSections}
# Process each document
for doc in structure.get("documents", []):
sections = doc.get("sections", [])
for idx, section in enumerate(sections):
sectionId = section.get("id")
# Find corresponding generated section
if sectionId in sectionMap:
generatedSection = sectionMap[sectionId]
# Merge elements into structure section
if "elements" in generatedSection:
section["elements"] = generatedSection["elements"]
# Preserve error information if present
if generatedSection.get("error"):
section["error"] = True
section["errorMessage"] = generatedSection.get("errorMessage")
section["originalContentType"] = generatedSection.get("originalContentType")
else:
# Section not generated - create error section
logger.warning(f"Section {sectionId} not found in generated sections")
section = self.createErrorSection(
section,
f"Section {sectionId} was not generated"
)
sections[idx] = section
# Debug: Write final merged structure to debug file (harmonized - no guards needed)
import json
structureJson = json.dumps(structure, indent=2, ensure_ascii=False)
self.services.utils.writeDebugFile(
structureJson,
"document_generation_final_merged_json"
)
logger.debug(f"Logged final merged JSON structure ({len(structureJson)} chars)")
return structure
except Exception as e:
logger.error(f"Error integrating content: {str(e)}")
raise
def validateCompleteness(
self,
document: Dict[str, Any]
) -> Tuple[bool, List[str]]:
"""
Validate that all sections have content.
Args:
document: Document structure to validate
Returns:
(is_complete, list_of_missing_sections)
"""
missingSections = []
try:
for doc in document.get("documents", []):
sections = doc.get("sections", [])
for section in sections:
sectionId = section.get("id", "unknown")
elements = section.get("elements", [])
# Check if section has content
if not elements or len(elements) == 0:
# Skip error sections (they have error text)
if not section.get("error"):
missingSections.append(sectionId)
else:
# Validate elements have actual content
hasContent = False
for element in elements:
# Check different content types
if element.get("text") or element.get("base64Data") or \
element.get("headers") or element.get("items") or \
element.get("code"):
hasContent = True
break
if not hasContent and not section.get("error"):
missingSections.append(sectionId)
return len(missingSections) == 0, missingSections
except Exception as e:
logger.error(f"Error validating completeness: {str(e)}")
return False, [f"Validation error: {str(e)}"]
def createErrorSection(
self,
originalSection: Dict[str, Any],
errorMessage: str
) -> Dict[str, Any]:
"""
Create error placeholder section.
Args:
originalSection: Original section that failed
errorMessage: Error message to display
Returns:
Error section with placeholder content
"""
contentType = originalSection.get("content_type", "content")
sectionId = originalSection.get("id", "unknown")
return {
"id": sectionId,
"content_type": "paragraph", # Change to paragraph for error display
"elements": [{
"text": f"[ERROR: Failed to generate {contentType} for section '{sectionId}'. Error: {errorMessage}]"
}],
"order": originalSection.get("order", 0),
"error": True,
"errorMessage": errorMessage,
"originalContentType": contentType,
"title": originalSection.get("title"),
"generation_hint": originalSection.get("generation_hint"),
"complexity": originalSection.get("complexity")
}

View file

@ -0,0 +1,253 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
import json
import logging
import os
from typing import Any, Dict
logger = logging.getLogger(__name__)
def getFileExtension(fileName: str) -> str:
"""Extract file extension from fileName (without dot, lowercased)."""
if '.' in fileName:
return fileName.rsplit('.', 1)[-1].lower()
return ''
def getMimeTypeFromExtension(extension: str) -> str:
"""
Get MIME type based on file extension.
This method consolidates MIME type detection from extension.
Args:
extension: File extension (with or without dot)
Returns:
str: MIME type for the extension
"""
# Normalize extension (remove dot if present)
if extension.startswith('.'):
extension = extension[1:]
# Map extensions to MIME types
mime_types = {
'txt': 'text/plain',
'json': 'application/json',
'xml': 'application/xml',
'csv': 'text/csv',
'html': 'text/html',
'htm': 'text/html',
'md': 'text/markdown',
'py': 'text/x-python',
'js': 'application/javascript',
'css': 'text/css',
'pdf': 'application/pdf',
'doc': 'application/msword',
'docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
'xls': 'application/vnd.ms-excel',
'xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
'ppt': 'application/vnd.ms-powerpoint',
'pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
'svg': 'image/svg+xml',
'jpg': 'image/jpeg',
'jpeg': 'image/jpeg',
'png': 'image/png',
'gif': 'image/gif',
'bmp': 'image/bmp',
'webp': 'image/webp',
'zip': 'application/zip',
'rar': 'application/x-rar-compressed',
'7z': 'application/x-7z-compressed',
'tar': 'application/x-tar',
'gz': 'application/gzip'
}
return mime_types.get(extension.lower(), 'application/octet-stream')
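# Illustrative doctest-style examples (not in the original source) for the two
# helpers above:
#   >>> getFileExtension("report.final.PDF")
#   'pdf'
#   >>> getFileExtension("Makefile")
#   ''
#   >>> getMimeTypeFromExtension(".md")
#   'text/markdown'
#   >>> getMimeTypeFromExtension("unknown")
#   'application/octet-stream'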
def detectContentTypeFromData(fileData: bytes, fileName: str) -> str:
"""
Detect content type from file data and fileName.
This method makes the MIME type detection function accessible through the service center.
Args:
fileData: Raw file data as bytes
fileName: Name of the file
Returns:
str: Detected MIME type
"""
try:
# Check file extension first
ext = os.path.splitext(fileName)[1].lower()
if ext:
# Map common extensions to MIME types
extToMime = {
'.txt': 'text/plain',
'.md': 'text/markdown',
'.csv': 'text/csv',
'.json': 'application/json',
'.xml': 'application/xml',
'.js': 'application/javascript',
'.py': 'application/x-python',
'.svg': 'image/svg+xml',
'.jpg': 'image/jpeg',
'.jpeg': 'image/jpeg',
'.png': 'image/png',
'.gif': 'image/gif',
'.bmp': 'image/bmp',
'.webp': 'image/webp',
'.pdf': 'application/pdf',
'.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
'.doc': 'application/msword',
'.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
'.xls': 'application/vnd.ms-excel',
'.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
'.ppt': 'application/vnd.ms-powerpoint',
'.html': 'text/html',
'.htm': 'text/html',
'.css': 'text/css',
'.zip': 'application/zip',
'.rar': 'application/x-rar-compressed',
'.7z': 'application/x-7z-compressed',
'.tar': 'application/x-tar',
'.gz': 'application/gzip'
}
if ext in extToMime:
return extToMime[ext]
# Try to detect from content
if fileData.startswith(b'%PDF'):
return 'application/pdf'
elif fileData.startswith(b'PK\x03\x04'):
# ZIP-based formats (docx, xlsx, pptx)
return 'application/zip'
elif fileData.startswith(b'<'):
# XML-based formats
try:
text = fileData.decode('utf-8', errors='ignore')
if '<svg' in text.lower():
return 'image/svg+xml'
elif '<html' in text.lower():
return 'text/html'
else:
return 'application/xml'
except Exception:
pass
elif fileData.startswith(b'\x89PNG\r\n\x1a\n'):
return 'image/png'
elif fileData.startswith(b'\xff\xd8\xff'):
return 'image/jpeg'
elif fileData.startswith(b'GIF87a') or fileData.startswith(b'GIF89a'):
return 'image/gif'
elif fileData.startswith(b'BM'):
return 'image/bmp'
elif fileData.startswith(b'RIFF') and fileData[8:12] == b'WEBP':
return 'image/webp'
return 'application/octet-stream'
except Exception as e:
logger.error(f"Error detecting content type from data: {str(e)}")
return 'application/octet-stream'
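# Illustrative examples (not in the original source) of the detection order above:
# a known extension wins, otherwise magic bytes are inspected.
#   >>> detectContentTypeFromData(b'%PDF-1.7 ...', 'upload.pdf')
#   'application/pdf'
#   >>> detectContentTypeFromData(b'%PDF-1.7 ...', 'upload')   # no extension
#   'application/pdf'
#   >>> detectContentTypeFromData(b'\x89PNG\r\n\x1a\n...', 'blob')
#   'image/png'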
def detectMimeTypeFromData(file_bytes: bytes, fileName: str, service=None) -> str:
"""Detect MIME type from file bytes and fileName using a service if provided."""
try:
if service and hasattr(service, 'detectContentTypeFromData'):
detected = service.detectContentTypeFromData(file_bytes, fileName)
if detected and detected != 'application/octet-stream':
return detected
# Fallback: use our consolidated function
return detectContentTypeFromData(file_bytes, fileName)
except Exception as e:
logger.warning(f"Error in MIME type detection for {fileName}: {str(e)}")
return 'application/octet-stream'
def detectMimeTypeFromContent(content: Any, fileName: str, service=None) -> str:
"""Detect MIME type from content and fileName using a service if provided."""
try:
if isinstance(content, str):
file_bytes = content.encode('utf-8')
elif isinstance(content, dict):
file_bytes = json.dumps(content, ensure_ascii=False).encode('utf-8')
else:
file_bytes = str(content).encode('utf-8')
return detectMimeTypeFromData(file_bytes, fileName, service)
except Exception as e:
logger.warning(f"Error in MIME type detection for {fileName}: {str(e)}")
return 'application/octet-stream'
def convertDocumentDataToString(document_data: Any, file_extension: str) -> str:
"""Convert document data to string content based on file type with enhanced processing."""
try:
if document_data is None:
return ""
if isinstance(document_data, bytes):
# IMPORTANT: Decode bytes to string for text files (HTML, text, etc.)
try:
return document_data.decode('utf-8')
except UnicodeDecodeError:
# Fallback: try latin1 or return with error replacement
try:
return document_data.decode('latin1')
except Exception:
return document_data.decode('utf-8', errors='replace')
if isinstance(document_data, str):
return document_data
if isinstance(document_data, dict):
if file_extension == 'json':
return json.dumps(document_data, indent=2, ensure_ascii=False)
elif file_extension in ['txt', 'md', 'html', 'css', 'js', 'py']:
text_fields = ['content', 'text', 'data', 'result', 'summary', 'extracted_content', 'table_data']
for field in text_fields:
if field in document_data:
content = document_data[field]
if isinstance(content, str):
return content
elif isinstance(content, (dict, list)):
return json.dumps(content, indent=2, ensure_ascii=False)
return json.dumps(document_data, indent=2, ensure_ascii=False)
elif file_extension == 'csv':
csv_fields = ['table_data', 'csv_data', 'rows', 'data', 'content', 'text']
for field in csv_fields:
if field in document_data:
content = document_data[field]
if isinstance(content, str):
return content
elif isinstance(content, list):
if content and isinstance(content[0], (list, dict)):
import csv
import io
output = io.StringIO()
if isinstance(content[0], dict):
if content:
fieldnames = content[0].keys()
writer = csv.DictWriter(output, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(content)
else:
writer = csv.writer(output)
writer.writerows(content)
return output.getvalue()
return json.dumps(document_data, indent=2, ensure_ascii=False)
else:
return json.dumps(document_data, indent=2, ensure_ascii=False)
elif isinstance(document_data, list):
if file_extension == 'csv':
import csv
import io
output = io.StringIO()
if document_data and isinstance(document_data[0], dict):
fieldnames = document_data[0].keys()
writer = csv.DictWriter(output, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(document_data)
else:
writer = csv.writer(output)
writer.writerows(document_data)
return output.getvalue()
else:
return json.dumps(document_data, indent=2, ensure_ascii=False)
else:
return str(document_data)
except Exception as e:
logger.error(f"Error converting document data to string: {str(e)}")
return str(document_data)
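# Illustrative example (not in the original source): a list of dicts targeted at
# a CSV file is serialized with a header row derived from the first item.
#   >>> convertDocumentDataToString([{"name": "A", "qty": 1}, {"name": "B", "qty": 2}], "csv")
#   'name,qty\r\nA,1\r\nB,2\r\n'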

View file

@ -0,0 +1,560 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
JSON Schema definitions for AI-generated document structures (unified).
This module provides schemas that guide AI to generate structured JSON output
that matches the master template in modules.datamodels.datamodelJson.
"""
from typing import Dict, Any
def getMultiDocumentSchema() -> Dict[str, Any]:
"""Get the JSON schema for multi-document generation (unified)."""
return {
"type": "object",
"required": ["metadata", "documents"],
"properties": {
"metadata": {
"type": "object",
"required": ["split_strategy"],
"properties": {
"split_strategy": {
"type": "string",
"enum": [
"single_document",
"per_entity",
"by_section",
"by_criteria",
"by_data_type",
"custom"
],
"description": "Strategy for splitting content into multiple files"
},
"splitCriteria": {
"type": "object",
"description": "Custom criteria for splitting (e.g., entity_id, category, etc.)"
},
"fileNamingPattern": {
"type": "string",
"description": "Pattern for generating filenames (e.g., '{entity_name}_data.docx')"
},
"source_documents": {
"type": "array",
"items": {"type": "string"},
"description": "List of source document IDs"
},
"extraction_method": {
"type": "string",
"default": "ai_generation",
"description": "Method used for extraction"
}
}
},
"documents": {
"type": "array",
"description": "Array of individual documents to generate",
"items": {
"type": "object",
"required": ["id", "title", "sections", "filename"],
"properties": {
"id": {"type": "string", "description": "Unique document identifier"},
"title": {"type": "string", "description": "Document title"},
"filename": {"type": "string", "description": "Generated filename"},
"sections": {
"type": "array",
"description": "Document sections containing structured content",
"items": {
"type": "object",
"required": ["id", "content_type", "elements", "order"],
"properties": {
"id": {"type": "string", "description": "Unique section identifier"},
"title": {"type": "string", "description": "Section title (optional)"},
"content_type": {
"type": "string",
"enum": [
"table",
"bullet_list",
"paragraph",
"heading",
"code_block",
"image",
"mixed"
],
"description": "Primary content type of this section"
},
"elements": {
"type": "array",
"description": "Content elements in this section",
"items": {
"oneOf": [
{"$ref": "#/definitions/table"},
{"$ref": "#/definitions/bullet_list"},
{"$ref": "#/definitions/paragraph"},
{"$ref": "#/definitions/heading"},
{"$ref": "#/definitions/code_block"},
{"$ref": "#/definitions/image"}
]
}
},
"order": {"type": "integer", "description": "Section order in document"},
"metadata": {
"type": "object",
"description": "Additional section metadata"
}
}
}
},
"metadata": {
"type": "object",
"description": "Document-specific metadata"
}
}
}
}
},
"definitions": {
"table": {
"type": "object",
"required": ["headers", "rows"],
"properties": {
"headers": {
"type": "array",
"items": {"type": "string"},
"description": "Table column headers"
},
"rows": {
"type": "array",
"items": {
"type": "array",
"items": {"type": "string"}
},
"description": "Table data rows"
},
"caption": {
"type": "string",
"description": "Table caption (optional)"
}
}
},
"bullet_list": {
"type": "object",
"required": ["items"],
"properties": {
"items": {
"type": "array",
"items": {
"type": "object",
"required": ["text"],
"properties": {
"text": {"type": "string", "description": "List item text"},
"subitems": {
"type": "array",
"items": {"$ref": "#/definitions/list_item"},
"description": "Nested sub-items (optional)"
}
}
},
"description": "List items"
},
"list_type": {
"type": "string",
"enum": ["bullet", "numbered", "checklist"],
"default": "bullet",
"description": "Type of list"
}
}
},
"list_item": {
"type": "object",
"required": ["text"],
"properties": {
"text": {"type": "string", "description": "List item text"},
"subitems": {
"type": "array",
"items": {"$ref": "#/definitions/list_item"},
"description": "Nested sub-items (optional)"
}
}
},
"paragraph": {
"type": "object",
"required": ["text"],
"properties": {
"text": {"type": "string", "description": "Paragraph text"},
"formatting": {
"type": "object",
"description": "Text formatting (bold, italic, etc.)"
}
}
},
"heading": {
"type": "object",
"required": ["text", "level"],
"properties": {
"text": {"type": "string", "description": "Heading text"},
"level": {
"type": "integer",
"minimum": 1,
"maximum": 6,
"description": "Heading level (1-6)"
}
}
},
"code_block": {
"type": "object",
"required": ["code"],
"properties": {
"code": {"type": "string", "description": "Code content"},
"language": {"type": "string", "description": "Programming language (optional)"}
}
},
"image": {
"type": "object",
"required": ["url"],
"properties": {
"url": {"type": "string", "description": "Image URL or data URI"},
"caption": {"type": "string", "description": "Image caption (optional)"},
"alt": {"type": "string", "description": "Alt text (optional)"}
}
}
}
}
def getDocumentSchema() -> Dict[str, Any]:
"""Get the JSON schema for structured document generation (single document)."""
return {
"type": "object",
"required": ["metadata", "sections"],
"properties": {
"metadata": {
"type": "object",
"required": ["title"],
"properties": {
"title": {"type": "string", "description": "Document title"},
"source_documents": {
"type": "array",
"items": {"type": "string"},
"description": "List of source document IDs"
},
"extraction_method": {
"type": "string",
"default": "ai_generation",
"description": "Method used for extraction"
}
}
},
"sections": {
"type": "array",
"description": "Document sections containing structured content",
"items": {
"type": "object",
"required": ["id", "content_type", "elements", "order"],
"properties": {
"id": {"type": "string", "description": "Unique section identifier"},
"title": {"type": "string", "description": "Section title (optional)"},
"content_type": {
"type": "string",
"enum": [
"table",
"bullet_list",
"paragraph",
"heading",
"code_block",
"image",
"mixed"
],
"description": "Primary content type of this section"
},
"elements": {
"type": "array",
"description": "Content elements in this section",
"items": {
"oneOf": [
{"$ref": "#/definitions/table"},
{"$ref": "#/definitions/bullet_list"},
{"$ref": "#/definitions/paragraph"},
{"$ref": "#/definitions/heading"},
{"$ref": "#/definitions/code_block"},
{"$ref": "#/definitions/image"}
]
}
},
"order": {"type": "integer", "description": "Section order in document"},
"metadata": {
"type": "object",
"description": "Additional section metadata"
}
}
}
},
"summary": {
"type": "string",
"description": "Document summary (optional)"
},
"tags": {
"type": "array",
"items": {"type": "string"},
"description": "Document tags for categorization"
}
},
"definitions": {
"table": {
"type": "object",
"required": ["headers", "rows"],
"properties": {
"headers": {
"type": "array",
"items": {"type": "string"},
"description": "Table column headers"
},
"rows": {
"type": "array",
"items": {
"type": "array",
"items": {"type": "string"}
},
"description": "Table data rows"
},
"caption": {
"type": "string",
"description": "Table caption (optional)"
}
}
},
"bullet_list": {
"type": "object",
"required": ["items"],
"properties": {
"items": {
"type": "array",
"items": {
"type": "object",
"required": ["text"],
"properties": {
"text": {"type": "string", "description": "List item text"},
"subitems": {
"type": "array",
"items": {"$ref": "#/definitions/list_item"},
"description": "Nested sub-items (optional)"
}
}
},
"description": "List items"
},
"list_type": {
"type": "string",
"enum": ["bullet", "numbered", "checklist"],
"default": "bullet",
"description": "Type of list"
}
}
},
"list_item": {
"type": "object",
"required": ["text"],
"properties": {
"text": {"type": "string", "description": "List item text"},
"subitems": {
"type": "array",
"items": {"$ref": "#/definitions/list_item"},
"description": "Nested sub-items (optional)"
}
}
},
"paragraph": {
"type": "object",
"required": ["text"],
"properties": {
"text": {"type": "string", "description": "Paragraph text"},
"formatting": {
"type": "object",
"description": "Text formatting (bold, italic, etc.)"
}
}
},
"heading": {
"type": "object",
"required": ["text", "level"],
"properties": {
"text": {"type": "string", "description": "Heading text"},
"level": {
"type": "integer",
"minimum": 1,
"maximum": 6,
"description": "Heading level (1-6)"
}
}
},
"code_block": {
"type": "object",
"required": ["code"],
"properties": {
"code": {"type": "string", "description": "Code content"},
"language": {"type": "string", "description": "Programming language (optional)"}
}
},
"image": {
"type": "object",
"required": ["url"],
"properties": {
"url": {"type": "string", "description": "Image URL or data URI"},
"caption": {"type": "string", "description": "Image caption (optional)"},
"alt": {"type": "string", "description": "Alt text (optional)"}
}
}
}
}
def getExtractionPromptTemplate() -> str:
"""Get the template for AI extraction prompts that request JSON output."""
return """
You are extracting structured content from documents. Your task is to analyze the provided content and generate a structured JSON document.
IMPORTANT: You must respond with valid JSON only. No additional text, explanations, or formatting outside the JSON structure.
JSON Schema Requirements:
- Extract the actual data from the source documents
- If content is a table, extract it as a table with headers and rows
- If content is a list, extract it as a structured list with items
- If content is text, extract it as paragraphs or headings
- Preserve the original structure and data - do not summarize or interpret
- Use the exact JSON schema provided
Content Types to Extract:
1. Tables: Extract all rows and columns with proper headers
2. Lists: Extract all items with proper nesting
3. Headings: Extract with appropriate levels
4. Paragraphs: Extract as structured text
5. Code: Extract code blocks with language identification
Return only the JSON structure following the schema. Do not include any text before or after the JSON.
"""
def getGenerationPromptTemplate() -> str:
"""Get the template for AI generation prompts that work with JSON input."""
return """
You are generating a document from structured JSON data. Your task is to create a well-formatted document based on the provided structured content.
IMPORTANT: You must respond with valid JSON only, following the document schema.
Generation Guidelines:
- Use the provided JSON structure as the foundation
- Enhance the content with proper formatting and organization
- Ensure logical flow and readability
- Maintain the original data integrity
- Add appropriate headings and sections
- Organize content in a logical sequence
Content Enhancement:
- Tables: Ensure proper headers and data alignment
- Lists: Use appropriate list types (bullet, numbered, checklist)
- Headings: Use appropriate heading levels for hierarchy
- Paragraphs: Ensure proper text flow and formatting
- Code: Preserve code blocks with proper language identification
Return only the enhanced JSON structure following the schema. Do not include any text before or after the JSON.
"""
def getAdaptiveJsonSchema(promptAnalysis: Dict[str, Any] = None) -> Dict[str, Any]:
"""Automatically select appropriate schema based on prompt analysis."""
if promptAnalysis and promptAnalysis.get("is_multi_file", False):
return getMultiDocumentSchema()
else:
return getDocumentSchema()
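# Illustrative usage (not in the original source): schema selection keys off the
# prompt analysis flag.
#   >>> getAdaptiveJsonSchema({"is_multi_file": True})["properties"].keys()
#   dict_keys(['metadata', 'documents'])
#   >>> getAdaptiveJsonSchema()["required"]
#   ['metadata', 'sections']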
def validateJsonDocument(jsonData: Dict[str, Any]) -> bool:
"""Validate that the JSON data follows the unified document schema."""
try:
# Basic validation - check required fields
if not isinstance(jsonData, dict):
return False
# Check if it's multi-document or single-document structure
if "documents" in jsonData:
# Multi-document structure
if "metadata" not in jsonData:
return False
metadata = jsonData["metadata"]
if not isinstance(metadata, dict) or "split_strategy" not in metadata:
return False
documents = jsonData["documents"]
if not isinstance(documents, list):
return False
# Validate each document
for doc in documents:
if not isinstance(doc, dict):
return False
required_fields = ["id", "title", "sections", "filename"]
for field in required_fields:
if field not in doc:
return False
# Validate sections in each document
sections = doc.get("sections", [])
if not isinstance(sections, list):
return False
for section in sections:
if not isinstance(section, dict):
return False
section_required = ["id", "content_type", "elements", "order"]
for field in section_required:
if field not in section:
return False
# Validate content_type
valid_types = ["table", "bullet_list", "paragraph", "heading", "code_block", "image", "mixed"]
if section["content_type"] not in valid_types:
return False
# Validate elements
if not isinstance(section["elements"], list):
return False
elif "sections" in jsonData:
# Single-document structure (existing validation)
if "metadata" not in jsonData:
return False
metadata = jsonData["metadata"]
if not isinstance(metadata, dict) or "title" not in metadata:
return False
sections = jsonData["sections"]
if not isinstance(sections, list):
return False
# Validate each section
for i, section in enumerate(sections):
if not isinstance(section, dict):
return False
required_fields = ["id", "content_type", "elements", "order"]
for field in required_fields:
if field not in section:
return False
# Validate content_type
valid_types = ["table", "bullet_list", "paragraph", "heading", "code_block", "image", "mixed"]
if section["content_type"] not in valid_types:
return False
# Validate elements
if not isinstance(section["elements"], list):
return False
else:
return False
return True
except Exception:
return False
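# Illustrative example (not in the original source): a minimal single-document
# payload that passes the validator above.
#   validateJsonDocument({
#       "metadata": {"title": "Demo"},
#       "sections": [{"id": "s1", "content_type": "paragraph",
#                     "elements": [{"text": "Hello"}], "order": 1}],
#   })  # -> True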

View file

@ -0,0 +1,200 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Prompt builder for document generation.
This module builds prompts for generating documents from extracted content.
"""
import logging
from typing import Any, Dict, Optional
from modules.datamodels.datamodelJson import jsonTemplateDocument
logger = logging.getLogger(__name__)
async def buildGenerationPrompt(
outputFormat: str,
userPrompt: str,
title: str,
extracted_content: Optional[str] = None,
continuationContext: Optional[Dict[str, Any]] = None,
services: Any = None,
useContentParts: bool = False # ARCHITECTURE: If True, don't include full content in prompt (ContentParts will be used directly)
) -> str:
"""
Build the unified generation prompt using a single JSON template.
Generic solution that works for any user request.
Args:
outputFormat: Target output format (html, pdf, docx, etc.) - not used in prompt
userPrompt: User's original prompt for document generation
title: Title for the document
extracted_content: Optional extracted content from documents to prepend to prompt
continuationContext: Optional context from previous generation for continuation
services: Optional services instance for accessing user language
useContentParts: If True, omit the full extracted content from the prompt (ContentParts are passed directly to the AI call for model-aware chunking)
Returns:
Complete generation prompt string
"""
# Extract user language for document language instruction
userLanguage = 'en' # Default fallback
if services:
try:
# Prefer detected language if available
if hasattr(services, 'currentUserLanguage') and services.currentUserLanguage:
userLanguage = services.currentUserLanguage
elif hasattr(services, 'user') and services.user and hasattr(services.user, 'language'):
userLanguage = services.user.language
except Exception:
pass
# Create the JSON template - fall back to a generic title when none is provided
titleValue = title if title else "Generated Document"
jsonTemplate = jsonTemplateDocument.replace("{{DOCUMENT_TITLE}}", titleValue)
# Build prompt based on whether this is a continuation or first call
# Check if we have valid continuation context with actual JSON fragment
# CRITICAL: Allow continuation even if section_count is 0 (broken JSON that couldn't be parsed)
# as long as we have last_raw_json - this handles cases where JSON is too broken to extract sections
hasContinuation = (
continuationContext
and continuationContext.get("last_raw_json", "")
and continuationContext.get("last_raw_json", "").strip() != "{}"
)
if hasContinuation:
# CONTINUATION PROMPT - use centralized jsonContinuation system
delivered_summary = continuationContext.get("delivered_summary", "")
# Use centralized system: overlap_context and hierarchy_context from jsonContinuation.getContexts()
overlap_context = continuationContext.get("overlap_context")
hierarchy_context = continuationContext.get("hierarchy_context")
# Build continuation text with delivered summary and cut-off information
# CRITICAL: Always include cut-off information if available (per loop_plan.md)
continuationText = f"{delivered_summary}\n\n"
continuationText += "⚠️ CONTINUATION: Response was cut off. Generate ONLY the remaining content that comes AFTER the reference elements below.\n\n"
# Add cut-off point information using centralized jsonContinuation contexts
# These are shown ONLY as REFERENCE to know where generation stopped
if hierarchy_context:
continuationText += "# REFERENCE: Structure context (already delivered - DO NOT repeat):\n"
continuationText += f"{hierarchy_context}\n\n"
if overlap_context:
continuationText += "# REFERENCE: Overlap context - incomplete element at cut point (DO NOT repeat):\n"
continuationText += f"{overlap_context}\n\n"
continuationText += "⚠️ CRITICAL: The elements above are REFERENCE ONLY. They are already delivered.\n"
continuationText += "Generate ONLY what comes AFTER these elements. DO NOT regenerate the entire JSON structure.\n"
continuationText += "Start directly with the next element/section that should follow.\n\n"
# PROMPT FOR CONTINUATION
generationPrompt = f"""{'='*80}
USER REQUEST / USER PROMPT:
{'='*80}
{userPrompt}
{'='*80}
END OF USER REQUEST / USER PROMPT
{'='*80}
CONTINUATION MODE: Response was incomplete. Generate ONLY the remaining content.
LANGUAGE REQUIREMENT: All generated content must be in the language '{userLanguage}'. Generate all text, headings, paragraphs, and content in this language.
{continuationText}
JSON structure template:
{jsonTemplate}
Rules:
- Return ONLY valid JSON (no comments, no trailing commas, double quotes only).
- Reference elements shown above are ALREADY DELIVERED - DO NOT repeat them.
- Generate ONLY the remaining content that comes AFTER the reference elements.
- DO NOT regenerate the entire JSON structure - start directly with what comes next.
- All content must be in the language '{userLanguage}'.
- Output JSON only; no markdown fences or extra text.
Continue generating the remaining content now.
"""
else:
# PROMPT FOR FIRST CALL
# Structure: User request + Extracted content FIRST (if available), then JSON template, then instructions
# ARCHITECTURE: If useContentParts=True, don't include full content in prompt
# ContentParts will be passed directly to callAi for model-aware chunking
if extracted_content and not useContentParts:
# If we have extracted content, put it FIRST and make it very clear it's the source data
generationPrompt = f"""{'='*80}
USER REQUEST / USER PROMPT:
{'='*80}
{userPrompt}
{'='*80}
END OF USER REQUEST / USER PROMPT
{'='*80}
{'='*80}
CRITICAL: USE THIS EXTRACTED CONTENT AS YOUR DATA SOURCE
{'='*80}
The content below contains the ACTUAL DATA extracted from the source documents.
You MUST use this data - DO NOT generate fake or example data.
{'='*80}
EXTRACTED CONTENT FROM DOCUMENTS:
{'='*80}
{extracted_content}
{'='*80}
END OF EXTRACTED CONTENT
{'='*80}
LANGUAGE REQUIREMENT: All generated content must be in the language '{userLanguage}'. Generate all text, headings, paragraphs, and content in this language. If the extracted content is in a different language, translate it to '{userLanguage}' while preserving the structure and meaning.
Generate a VALID JSON response using the EXTRACTED CONTENT above as your data source.
The JSON structure template below shows ONLY the structure pattern - the example values are NOT real data.
You MUST use the actual data from EXTRACTED CONTENT above, NOT the example values from the template.
JSON structure template (structure only - use data from EXTRACTED CONTENT above):
{jsonTemplate}
Instructions:
- Return ONLY valid JSON (strict). No comments. No trailing commas. Use double quotes.
- Do NOT reuse example section IDs; create your own.
- CRITICAL: Use the ACTUAL DATA from EXTRACTED CONTENT above, NOT the example values from the template.
- Generate complete content based on the user request and the extracted content. Do NOT just give an instruction or comments. Deliver the complete response.
- All content must be in the language '{userLanguage}'.
- IMPORTANT: Set a meaningful "filename" in each document with appropriate file extension (e.g., "prime_numbers.txt", "report.docx", "data.json"). The filename should reflect the content and task objective.
- Output JSON only; no markdown fences or extra text.
Generate your complete response using the extracted content data.
"""
else:
# No extracted content - generate from scratch
generationPrompt = f"""{'='*80}
USER REQUEST / USER PROMPT:
{'='*80}
{userPrompt}
{'='*80}
END OF USER REQUEST / USER PROMPT
{'='*80}
LANGUAGE REQUIREMENT: All generated content must be in the language '{userLanguage}'. Generate all text, headings, paragraphs, and content in this language.
Generate a VALID JSON response for the user request. The template below shows ONLY the structure pattern - it is NOT existing content.
JSON structure template:
{jsonTemplate}
Instructions:
- Return ONLY valid JSON (strict). No comments. No trailing commas. Use double quotes.
- Do NOT reuse example section IDs; create your own.
- Generate complete content based on the user request. Do NOT just give an instruction or comments. Deliver the complete response.
- All content must be in the language '{userLanguage}'.
- IMPORTANT: Set a meaningful "filename" in each document with appropriate file extension (e.g., "prime_numbers.txt", "report.docx", "data.json"). The filename should reflect the content and task objective.
- Output JSON only; no markdown fences or extra text.
Generate your complete response.
"""
return generationPrompt.strip()
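# Caller-side sketch (illustrative only): how the prompt builder above is typically
# driven in a generate/continue loop. buildGenerationPrompt, callModel and
# isJsonComplete are hypothetical stand-ins for the surrounding pipeline, not
# functions defined in this module.
#
#   response = callModel(buildGenerationPrompt(userPrompt, jsonTemplate))
#   while not isJsonComplete(response):
#       overlap = response[-500:]  # tail of the partial JSON, fed back as overlap_context
#       response += callModel(buildGenerationPrompt(userPrompt, jsonTemplate, overlap_context=overlap))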

View file

@@ -0,0 +1,540 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Structure Generator for hierarchical document generation.
Generates document skeleton with section placeholders.
"""
import logging
import json
from typing import Dict, Any, Optional, List
from modules.datamodels.datamodelJson import jsonTemplateDocument
logger = logging.getLogger(__name__)
class StructureGenerator:
"""Generates document structure with section placeholders"""
def __init__(self, services: Any):
self.services = services
async def generateStructure(
self,
userPrompt: str,
documentList: Optional[Any] = None,
cachedContent: Optional[Dict[str, Any]] = None,
contentParts: Optional[List[Any]] = None,
maxSectionLength: int = 500,
existingImages: Optional[List[Dict[str, Any]]] = None
) -> Dict[str, Any]:
"""
Generate document structure with sections.
Args:
userPrompt: User's original prompt
documentList: Optional document references
cachedContent: Optional extracted content cache
contentParts: Optional list of ContentParts to analyze for structure generation
maxSectionLength: Maximum words for simple sections
existingImages: Optional list of existing images to include
Returns:
Document structure with empty elements arrays and contentPartIds per section
"""
try:
# Create structure generation prompt
structurePrompt = self._createStructurePrompt(
userPrompt=userPrompt,
cachedContent=cachedContent,
contentParts=contentParts,
maxSectionLength=maxSectionLength,
existingImages=existingImages or []
)
# Debug: Log structure generation prompt (harmonized - no checks needed)
self.services.utils.writeDebugFile(
structurePrompt,
"document_generation_structure_prompt"
)
# Call AI to generate structure
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum
options = AiCallOptions(
operationType=OperationTypeEnum.DATA_GENERATE,
resultFormat="json"
)
aiResponse = await self.services.ai.callAiContent(
prompt=structurePrompt,
options=options,
outputFormat="json"
)
# Debug: Log structure generation response (harmonized - no checks needed)
self.services.utils.writeDebugFile(
aiResponse.content if aiResponse and aiResponse.content else '',
"document_generation_structure_response"
)
if not aiResponse or not aiResponse.content:
raise ValueError("AI structure generation returned empty response")
# Extract and parse JSON
extractedJson = self.services.utils.jsonExtractString(aiResponse.content)
if not extractedJson:
raise ValueError("No JSON found in AI structure response")
structure = json.loads(extractedJson)
# Validate and enhance structure
structure = self._validateAndEnhanceStructure(structure, maxSectionLength)
return structure
except Exception as e:
logger.error(f"Error generating structure: {str(e)}")
raise
def _createStructurePrompt(
self,
userPrompt: str,
cachedContent: Optional[Dict[str, Any]] = None,
contentParts: Optional[List[Any]] = None,
maxSectionLength: int = 500,
existingImages: Optional[List[Dict[str, Any]]] = None
) -> str:
"""
Create prompt for structure generation.
"""
# Get user language
userLanguage = self._getUserLanguage()
# Format cached content if available
cachedContentText = ""
if cachedContent and cachedContent.get("extractedContent"):
cachedContentText = self._formatCachedContent(cachedContent)
# Use provided existingImages or extract from cachedContent
if existingImages is None:
existingImages = []
if cachedContent and cachedContent.get("imageDocuments"):
existingImages = cachedContent.get("imageDocuments", [])
# Format ContentParts as JSON for structure generation
contentPartsJson = ""
if contentParts:
try:
# Convert ContentParts to dict format for JSON serialization
contentPartsList = []
for part in contentParts:
if hasattr(part, 'dict'):
partDict = part.dict()
elif isinstance(part, dict):
partDict = part
else:
# Try to convert to dict
partDict = {
"id": getattr(part, 'id', ''),
"typeGroup": getattr(part, 'typeGroup', ''),
"mimeType": getattr(part, 'mimeType', ''),
"label": getattr(part, 'label', ''),
"metadata": getattr(part, 'metadata', {})
}
# Only include essential fields for structure generation (not full data)
contentPartsList.append({
"id": partDict.get("id", ""),
"typeGroup": partDict.get("typeGroup", ""),
"mimeType": partDict.get("mimeType", ""),
"label": partDict.get("label", ""),
"metadata": partDict.get("metadata", {})
})
contentPartsJson = json.dumps(contentPartsList, indent=2, ensure_ascii=False)
except Exception as e:
logger.warning(f"Could not format ContentParts as JSON: {str(e)}")
contentPartsJson = ""
# Create structure template
structureTemplate = jsonTemplateDocument.replace("{{DOCUMENT_TITLE}}", "Document Title")
prompt = f"""{'='*80}
USER REQUEST:
{'='*80}
{userPrompt}
{'='*80}
TASK: Generate a document STRUCTURE (skeleton) with sections.
Do NOT generate actual content yet - only the structure.
{'='*80}
EXTRACTED CONTENT (if available):
{'='*80}
{cachedContentText if cachedContentText else "No source documents provided."}
{'='*80}
INSTRUCTIONS:
1. Analyze the user request, extracted content, and available ContentParts
2. Create a document structure with CONTENT sections only
3. For each section, specify:
- id: Unique identifier (e.g., "section_title_1", "section_image_1")
- content_type: "heading" | "paragraph" | "image" | "table" | "bullet_list" | "code_block"
- complexity: "simple" (can generate directly) or "complex" (needs sub-prompt)
- generation_hint: Brief description of what content should be generated
- contentPartIds: Array of ContentPart IDs that should be used for this section (e.g., ["part_1", "part_2"]) - can be empty []
- extractionPrompt: (optional) Specific prompt for extracting/processing ContentParts for this section
- image_prompt: (only for image sections) Detailed prompt for image generation
- order: Section order number (starting from 1)
- elements: [] (empty array - will be populated later)
4. Identify image sections:
- If user requests illustrations/images, create image sections
- If existing images are provided in documentList (check EXISTING IMAGES section below), create image sections that reference them
- Add image_prompt field with detailed description for image generation (only for new images)
- Set complexity to "complex" for new images, "simple" for existing/render images
- For existing images: Set image_source to "existing" and image_reference_id to the image document ID
- For images to render (from input documents): Set image_source to "render" and image_reference_id to the image document ID
- Example for new image: {{"id": "section_image_1", "content_type": "image", "complexity": "complex", "generation_hint": "Illustration for chapter 1", "image_prompt": "A detailed description for image generation", "order": 2, "elements": []}}
- Example for existing image: {{"id": "section_image_1", "content_type": "image", "complexity": "simple", "generation_hint": "Include provided image", "image_source": "existing", "image_reference_id": "doc_id_here", "order": 2, "elements": []}}
- Example for render image: {{"id": "section_image_1", "content_type": "image", "complexity": "simple", "generation_hint": "Render input image", "image_source": "render", "image_reference_id": "doc_id_here", "order": 2, "elements": []}}
{'='*80}
EXISTING IMAGES (to include in document):
{'='*80}
{self._formatExistingImages(existingImages) if existingImages else "No existing images provided."}
{'='*80}
5. Identify complex text sections:
- Long chapters (>{maxSectionLength} words expected) should be marked as "complex"
- Short paragraphs/headings should be "simple"
6. CRITICAL RULES FOR CONTENT PARTS:
- Analyze available ContentParts and determine which ones are needed for each section
- For image sections (content_type == "image"): Include image ContentParts in contentPartIds - images will be integrated as visual elements
- For other sections (heading, paragraph, etc.): If image ContentParts are referenced, they will be referenced as text in the document language (not integrated as images)
- Each section can reference multiple ContentParts via contentPartIds array
- If specific extraction/processing is needed for ContentParts, provide extractionPrompt
- Image references in non-image sections should be automatically derived in the document language (e.g., "siehe Bild 1" in German, "see Image 1" in English)
7. CRITICAL RULES:
- Return ONLY valid JSON (no comments, no trailing commas, double quotes only)
- Follow the exact JSON schema structure provided
- IMPORTANT: All sections MUST have empty elements arrays: "elements": [] (the template shows examples with content, but you must use empty arrays)
- ALL sections MUST include "generation_hint" field with a brief description of what content should be generated
- ALL sections MUST include "complexity" field: "simple" for short content, "complex" for long chapters/images
- ALL sections MUST include "contentPartIds" field (can be empty array [] if no ContentParts needed)
- Image sections MUST include "image_prompt" field with detailed description for image generation
- Order numbers MUST start from 1 (not 0)
- All content must be in the language '{userLanguage}'
- Do NOT generate actual content - only structure (skeleton)
- Use only supported content_type values: "heading", "paragraph", "image", "table", "bullet_list", "code_block"
8. Return ONLY valid JSON following this structure:
{structureTemplate}
Return ONLY the JSON structure. No explanations.
"""
return prompt
def _validateAndEnhanceStructure(
self,
structure: Dict[str, Any],
maxSectionLength: int
) -> Dict[str, Any]:
"""
Validate structure and enhance with complexity identification.
"""
try:
# Ensure structure has required fields
if "documents" not in structure:
if "sections" in structure:
# Convert single-document format to multi-document format
structure = {
"metadata": structure.get("metadata", {}),
"documents": [{
"id": "doc_1",
"title": structure.get("metadata", {}).get("title", "Document"),
"filename": "document.json",
"sections": structure.get("sections", [])
}]
}
else:
raise ValueError("Structure missing 'documents' or 'sections' field")
# Process each document
for doc in structure.get("documents", []):
sections = doc.get("sections", [])
# Process and validate sections according to standardized schema
for idx, section in enumerate(sections):
# Ensure required fields
if "id" not in section:
section["id"] = f"section_{idx + 1}"
sectionId = section.get("id", "")
section["order"] = idx + 1
if "elements" not in section:
section["elements"] = []
# Ensure contentPartIds field exists (can be empty array)
if "contentPartIds" not in section:
section["contentPartIds"] = []
# Ensure extractionPrompt field exists (optional)
if "extractionPrompt" not in section:
section["extractionPrompt"] = None
# Identify complexity if not set
if "complexity" not in section:
section["complexity"] = self._identifySectionComplexity(
section,
maxSectionLength
)
# Ensure generation_hint exists (required for content generation)
if "generation_hint" not in section or not section.get("generation_hint"):
# Create meaningful generation hint from section id or content type
contentType = section.get("content_type", "")
# Extract meaningful hint from section ID
meaningfulHint = self._extractMeaningfulHint(sectionId, contentType, section.get("elements", []))
section["generation_hint"] = meaningfulHint
# Ensure image sections have proper configuration
if section.get("content_type") == "image":
imageSource = section.get("image_source", "generate")
if imageSource == "existing" or imageSource == "render":
# Existing or render image - ensure image_reference_id is set
if "image_reference_id" not in section:
logger.warning(f"Image section {sectionId} has image_source='{imageSource}' but no image_reference_id")
# Existing/render images are simple (no generation needed, code integration)
section["complexity"] = "simple"
else:
# New image generation - ensure image_prompt
if "image_prompt" not in section or not section.get("image_prompt"):
# Try to extract from generation_hint
generationHint = section.get("generation_hint", "")
if generationHint:
# Enhance generation_hint to be a proper image prompt
section["image_prompt"] = self._enhanceImagePrompt(generationHint)
else:
# Create default based on document context
docTitle = doc.get("title", "Document")
section["image_prompt"] = f"Generate an illustration for: {docTitle}"
# Ensure complexity is set to complex for new image generation
section["complexity"] = "complex"
return structure
except Exception as e:
logger.error(f"Error validating structure: {str(e)}")
raise
def _identifySectionComplexity(
self,
section: Dict[str, Any],
maxSectionLength: int
) -> str:
"""
Identify if section is simple or complex.
Rules:
- Images: always complex
- Long chapters (>maxSectionLength words): complex
- Others: simple
"""
contentType = section.get("content_type", "")
# Images are always complex
if contentType == "image":
return "complex"
# Check generation_hint for length indicators
generationHint = section.get("generation_hint", "").lower()
# Keywords indicating long content
longContentKeywords = [
"chapter", "long", "detailed", "comprehensive",
"extensive", "full", "complete story"
]
if any(keyword in generationHint for keyword in longContentKeywords):
return "complex"
# Default to simple
return "simple"
def _extractMeaningfulHint(
self,
sectionId: str,
contentType: str,
elements: List[Any]
) -> str:
"""
Extract meaningful generation hint from section ID, content type, or elements.
Args:
sectionId: Section identifier (e.g., "section_heading_current_state")
contentType: Content type (e.g., "heading", "paragraph")
elements: Existing elements if any
Returns:
Meaningful generation hint string
"""
sectionIdLower = sectionId.lower()
# Try to extract text from existing elements first (most accurate)
if elements and isinstance(elements, list) and len(elements) > 0:
firstElement = elements[0]
if isinstance(firstElement, dict):
if "text" in firstElement and firstElement["text"]:
if contentType == "heading":
return firstElement["text"]
elif contentType == "paragraph":
return f"Content paragraph: {firstElement['text'][:50]}..."
# Extract meaningful text from section ID
# Remove common prefixes: "section_", "section_heading_", "section_paragraph_", etc.
meaningfulPart = sectionId
for prefix in ["section_heading_", "section_paragraph_", "section_bullet_list_",
"section_code_block_", "section_image_", "section_"]:
if meaningfulPart.lower().startswith(prefix):
meaningfulPart = meaningfulPart[len(prefix):]
break
# Convert snake_case to Title Case
# e.g., "current_state" -> "Current State"
words = meaningfulPart.replace("_", " ").split()
titleCase = " ".join(word.capitalize() for word in words if word)
# Handle special cases
if "introduction" in sectionIdLower or "intro" in sectionIdLower:
return "Introduction paragraph"
elif "conclusion" in sectionIdLower:
return "Conclusion paragraph"
elif "footer" in sectionIdLower or "copyright" in sectionIdLower:
return "Footer content"
elif "title" in sectionIdLower and "main" in sectionIdLower:
# Main title - try to get from document title or use generic
return "Main document title"
# Create hint based on content type and extracted text
if contentType == "heading":
if titleCase:
return titleCase
else:
return "Section heading"
elif contentType == "paragraph":
if titleCase:
return f"Content paragraph about {titleCase.lower()}"
else:
return f"Content paragraph"
elif contentType == "bullet_list":
if titleCase:
return f"Bullet list: {titleCase.lower()}"
else:
return "Bullet list items"
elif contentType == "code_block":
return "Code content"
else:
if titleCase:
return f"Content for {titleCase.lower()}"
else:
return f"Content for {contentType} section"
def _extractImagePrompts(
self,
structure: Dict[str, Any]
) -> Dict[str, str]:
"""
Extract image generation prompts from structure.
Maps section_id -> image_prompt
"""
imagePrompts = {}
for doc in structure.get("documents", []):
for section in doc.get("sections", []):
if section.get("content_type") == "image":
sectionId = section.get("id")
imagePrompt = section.get("image_prompt")
if sectionId and imagePrompt:
imagePrompts[sectionId] = imagePrompt
return imagePrompts
def _formatCachedContent(
self,
cachedContent: Dict[str, Any]
) -> str:
"""
Format cached content for prompt inclusion.
"""
try:
extractedContent = cachedContent.get("extractedContent", [])
if not extractedContent:
return "No content extracted."
# Format ContentPart objects
formattedParts = []
for extracted in extractedContent:
if hasattr(extracted, 'parts'):
for part in extracted.parts:
if hasattr(part, 'content'):
formattedParts.append(part.content)
elif isinstance(extracted, dict):
formattedParts.append(str(extracted))
else:
formattedParts.append(str(extracted))
return "\n\n".join(formattedParts) if formattedParts else "No content extracted."
except Exception as e:
logger.warning(f"Error formatting cached content: {str(e)}")
return "Error formatting cached content."
def _enhanceImagePrompt(self, generationHint: str) -> str:
"""
Enhance generation hint to be a proper image generation prompt.
Adds visual details and style guidance if missing.
"""
# If hint already contains visual details, use as-is
visualKeywords = ["illustration", "image", "picture", "visual", "depict", "show", "drawing"]
if any(keyword.lower() in generationHint.lower() for keyword in visualKeywords):
return generationHint
# Enhance with visual description
enhanced = f"Create a professional illustration: {generationHint}"
return enhanced
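# Example: _enhanceImagePrompt("A castle at dawn") -> "Create a professional illustration: A castle at dawn";
# a hint that already names a visual form, e.g. "Illustration of a castle at dawn", is returned unchanged.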
def _formatExistingImages(self, imageDocuments: List[Dict[str, Any]]) -> str:
"""Format existing images list for prompt inclusion"""
if not imageDocuments:
return "No existing images provided."
formatted = []
for i, imgDoc in enumerate(imageDocuments, 1):
formatted.append(f"{i}. Image ID: {imgDoc.get('id')}")
formatted.append(f" File Name: {imgDoc.get('fileName', 'Unknown')}")
formatted.append(f" MIME Type: {imgDoc.get('mimeType', 'Unknown')}")
formatted.append(f" Alt Text: {imgDoc.get('altText', 'Image')}")
formatted.append("")
return "\n".join(formatted)
def _getUserLanguage(self) -> str:
"""Get user language for document generation"""
try:
if self.services:
if hasattr(self.services, 'currentUserLanguage') and self.services.currentUserLanguage:
return self.services.currentUserLanguage
elif hasattr(self.services, 'user') and self.services.user and hasattr(self.services.user, 'language'):
return self.services.user.language
except Exception:
pass
return 'en' # Default fallback
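# Illustrative shape (not authoritative) of a validated structure as returned by
# generateStructure() after _validateAndEnhanceStructure has filled in defaults;
# the field values below are made-up examples:
#
#   {
#       "metadata": {"title": "Quarterly Report"},
#       "documents": [{
#           "id": "doc_1",
#           "title": "Quarterly Report",
#           "filename": "report.docx",
#           "sections": [
#               {"id": "section_heading_overview", "content_type": "heading",
#                "complexity": "simple", "generation_hint": "Overview",
#                "contentPartIds": [], "extractionPrompt": None,
#                "order": 1, "elements": []},
#               {"id": "section_image_1", "content_type": "image",
#                "complexity": "complex", "generation_hint": "Cover illustration",
#                "image_prompt": "Create a professional illustration: cover art",
#                "contentPartIds": [], "extractionPrompt": None,
#                "order": 2, "elements": []}
#           ]
#       }]
#   }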

View file

@@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Messaging service for the service center."""
from .mainServiceMessaging import MessagingService
__all__ = ["MessagingService"]

View file

@@ -0,0 +1,368 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Messaging service for sending messages across different channels.
Provides subscription-based messaging functionality.
Supports both service center (context, get_service) and legacy (services) initialization.
"""
import logging
import re
from typing import List, Optional, Callable, Any
from modules.datamodels.datamodelMessaging import (
MessagingSubscription,
MessagingSubscriptionRegistration,
MessagingDelivery,
MessagingChannel,
MessagingEventParameters,
MessagingSendResult,
MessagingSubscriptionExecutionResult,
DeliveryStatus
)
from modules.interfaces.interfaceMessaging import getInterface as getMessagingInterface
from modules.shared.timeUtils import getUtcTimestamp
logger = logging.getLogger(__name__)
class _ServicesAdapter:
"""Minimal adapter providing interfaceDbComponent for service center mode."""
def __init__(self, context: Any):
from modules.interfaces.interfaceDbManagement import getInterface as getComponentInterface
self.interfaceDbComponent = getComponentInterface(
context.user,
mandateId=context.mandate_id
)
class MessagingService:
"""
Messaging service providing subscription-based messaging functionality.
"""
def __init__(self, context_or_services: Any, get_service: Optional[Callable[[str], Any]] = None):
"""Initialize messaging service.
Args:
context_or_services: ServiceCenterContext (when get_service is callable) or legacy Services hub
get_service: Callable to resolve services (service center mode only)
"""
if get_service is not None and callable(get_service):
# Service center: (context, get_service)
self.services = _ServicesAdapter(context_or_services)
else:
# Legacy: (services,)
self.services = context_or_services
self._messagingInterface = None
def sendMessage(
self,
subject: str,
message: str,
registration: MessagingSubscriptionRegistration
) -> MessagingSendResult:
"""
Send a message to a user over a channel.
Creates a MessagingDelivery record.
Args:
subject: Message subject (used for email, empty for SMS)
message: Message text
registration: MessagingSubscriptionRegistration with channel info and userId
Returns:
MessagingSendResult with status and delivery ID
"""
# Create delivery record
delivery = MessagingDelivery(
subscriptionId=registration.subscriptionId,
userId=registration.userId,
channel=registration.channel,
status=DeliveryStatus.PENDING
)
# Persist delivery record
try:
deliveryRecord = self.services.interfaceDbComponent.createDelivery(delivery)
except Exception as e:
logger.error(f"Failed to create delivery record: {str(e)}")
return MessagingSendResult(
success=False,
errorMessage=f"Failed to create delivery record: {str(e)}"
)
try:
# Convert plain text to HTML for email channel
messageToSend = message
if registration.channel == MessagingChannel.EMAIL:
messageToSend = self._textToHtml(message)
# Send via interfaceMessaging
success = self._getMessagingInterface().send(
channel=registration.channel,
recipient=registration.channelConfig,
subject=subject,
message=messageToSend
)
if success:
# Update delivery record
self.services.interfaceDbComponent.updateDelivery(
deliveryRecord["id"],
{
"status": DeliveryStatus.SENT,
"sentAt": getUtcTimestamp()
}
)
return MessagingSendResult(
success=True,
deliveryId=deliveryRecord["id"]
)
else:
# Update delivery record with error
self.services.interfaceDbComponent.updateDelivery(
deliveryRecord["id"],
{
"status": DeliveryStatus.FAILED,
"errorMessage": "Failed to send message"
}
)
return MessagingSendResult(
success=False,
deliveryId=deliveryRecord["id"],
errorMessage="Failed to send message"
)
except Exception as e:
logger.error(f"Error sending message: {str(e)}")
# Update delivery record with error
try:
self.services.interfaceDbComponent.updateDelivery(
deliveryRecord["id"],
{
"status": DeliveryStatus.FAILED,
"errorMessage": str(e)
}
)
except Exception as updateError:
logger.error(f"Failed to update delivery record: {str(updateError)}")
return MessagingSendResult(
success=False,
deliveryId=deliveryRecord["id"],
errorMessage=str(e)
)
def _textToHtml(self, text: str) -> str:
"""
Convert plain text to simple HTML for email display.
- Escapes HTML special characters
- Converts newlines to <br> tags
- Wraps URLs in clickable links
- Wraps in a basic HTML structure with nice styling
Args:
text: Plain text message
Returns:
HTML formatted message
"""
import html
# Check if already HTML (contains HTML tags)
if re.search(r'<[^>]+>', text):
return text
# Escape HTML special characters
escaped = html.escape(text)
# Convert URLs to clickable links (before converting newlines)
urlPattern = r'(https?://[^\s<>"\']+)'
escaped = re.sub(urlPattern, r'<a href="\1" style="color: #0066cc;">\1</a>', escaped)
# Convert newlines to <br> tags
escaped = escaped.replace('\n', '<br>\n')
# Wrap in a nice HTML structure
htmlContent = f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
body {{
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
font-size: 14px;
line-height: 1.6;
color: #333333;
max-width: 600px;
margin: 0 auto;
padding: 20px;
}}
a {{
color: #0066cc;
}}
</style>
</head>
<body>
{escaped}
</body>
</html>"""
return htmlContent
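# Example (illustrative, hypothetical URL): a plain-text message such as
#   "Build failed.\nSee https://ci.example.com/run/42 for details."
# becomes an HTML document whose body reads
#   Build failed.<br>
#   See <a href="https://ci.example.com/run/42" style="color: #0066cc;">https://ci.example.com/run/42</a> for details.
# Input that already contains HTML tags is passed through unchanged.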
def sendEmailDirect(
self,
recipient: str,
subject: str,
message: str,
userId: Optional[str] = None
) -> bool:
"""
Send email directly without requiring a subscription.
Used for authentication flows (registration, password reset).
Plain text messages are automatically converted to HTML format.
Args:
recipient: Email address of the recipient
subject: Email subject
message: Email body (can be HTML or plain text - plain text is auto-converted)
userId: Optional user ID for logging/audit purposes
Returns:
bool: True if email was sent successfully, False otherwise
"""
try:
# Convert plain text to HTML if needed
htmlMessage = self._textToHtml(message)
messagingInterface = self._getMessagingInterface()
success = messagingInterface.send(
channel=MessagingChannel.EMAIL,
recipient=recipient,
subject=subject,
message=htmlMessage
)
if success:
logger.info(f"Email sent successfully to {recipient} (userId: {userId})")
else:
logger.warning(f"Failed to send email to {recipient} (userId: {userId})")
return success
except Exception as e:
logger.error(f"Error sending email to {recipient}: {str(e)}", exc_info=True)
return False
def executeSubscription(
self,
subscriptionId: str,
eventParameters: MessagingEventParameters
) -> MessagingSubscriptionExecutionResult:
"""
Execute a subscription function.
Args:
subscriptionId: ID of the subscription
eventParameters: Parameters from the trigger (as a Pydantic model)
Returns:
MessagingSubscriptionExecutionResult
Raises:
ValueError: If the subscription does not exist or is not enabled
FileNotFoundError: If the subscription function cannot be found
"""
# Check that the subscription exists and is enabled
subscription = self.services.interfaceDbComponent.getSubscription(subscriptionId)
if not subscription:
raise ValueError(f"Subscription {subscriptionId} not found")
if not subscription.enabled:
logger.warning(f"Subscription {subscriptionId} is disabled, skipping execution")
return MessagingSubscriptionExecutionResult(
success=False,
messagesSent=0,
errorMessage="Subscription is disabled"
)
# Fetch all active registrations for this subscription
registrations = self._getSubscribers(subscriptionId)
if not registrations:
logger.info(f"No active registrations for subscription {subscriptionId}")
return MessagingSubscriptionExecutionResult(
success=True,
messagesSent=0
)
# Load the subscription function dynamically
subscriptionFunction = self._loadSubscriptionFunction(subscriptionId)
if not subscriptionFunction:
errorMsg = f"Subscription function not found for {subscriptionId}"
logger.error(errorMsg)
raise FileNotFoundError(errorMsg)
# Execute the function with the registrations
try:
return subscriptionFunction.execute(eventParameters, registrations, self)
except Exception as e:
logger.error(f"Error executing subscription {subscriptionId}: {str(e)}", exc_info=True)
return MessagingSubscriptionExecutionResult(
success=False,
messagesSent=0,
errorMessage=str(e)
)
def _getSubscribers(
self,
subscriptionId: str,
channel: Optional[MessagingChannel] = None
) -> List[MessagingSubscriptionRegistration]:
"""Holt alle aktiven Subscriber einer Subscription"""
filters = {"enabled": True}
if channel:
filters["channel"] = channel.value
registrations = self.services.interfaceDbComponent.getAllRegistrations(
subscriptionId=subscriptionId
)
# Filter nach enabled und channel
filteredRegistrations = []
for reg in registrations:
if reg.enabled and (not channel or reg.channel == channel):
filteredRegistrations.append(reg)
return filteredRegistrations
def _loadSubscriptionFunction(self, subscriptionId: str) -> Optional[Any]:
"""
Load the subscription function module dynamically.
Returns:
Module exposing an execute method, or None if not found
Note:
subscriptionId is used directly as the file name (e.g. "SystemErrors" -> subSubscriptionSystemErrors.py)
"""
# Format: subSubscription{subscriptionId}.py
functionName = f"subSubscription{subscriptionId}"
moduleName = f"modules.serviceCenter.services.serviceMessaging.subscriptions.{functionName}"
try:
# Dynamic import
import importlib
subscriptionModule = importlib.import_module(moduleName)
return subscriptionModule
except ImportError:
# Function does not exist yet - that is fine
logger.debug(f"Subscription function {moduleName} not found (this is OK if not yet implemented)")
return None
def _getMessagingInterface(self):
"""Holt das Messaging-Interface (interfaceMessaging)"""
if not self._messagingInterface:
self._messagingInterface = getMessagingInterface()
return self._messagingInterface
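# Minimal usage sketch (assumption: this service is registered under the key
# "messaging" in the service center registry; the recipient address is made up):
#
#   messaging = get_service("messaging")
#   ok = messaging.sendEmailDirect(
#       recipient="user@example.com",
#       subject="Password reset",
#       message="Use the link below to reset your password.\nhttps://example.com/reset",
#   )
#   # Plain text is converted to HTML automatically; ok is True on success.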

View file

@@ -0,0 +1,3 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Subscription functions for the messaging service."""

View file

@@ -0,0 +1,72 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Example subscription function for System Errors.
This is a template that can be used as a reference for creating other subscription functions.
"""
from typing import List
from modules.datamodels.datamodelMessaging import (
MessagingEventParameters,
MessagingSubscriptionExecutionResult,
MessagingSubscriptionRegistration,
MessagingChannel
)
def execute(
eventParameters: MessagingEventParameters,
registrations: List[MessagingSubscriptionRegistration],
messagingService
) -> MessagingSubscriptionExecutionResult:
"""
Subscription function for system errors.
Receives eventParameters from the trigger; registrations have already been fetched.
Args:
eventParameters: Event parameters from the trigger
registrations: List of active registrations for this subscription
messagingService: MessagingService instance
Returns:
MessagingSubscriptionExecutionResult with status and number of messages sent
"""
# Group by channel
emailRegistrations = [r for r in registrations if r.channel == MessagingChannel.EMAIL]
smsRegistrations = [r for r in registrations if r.channel == MessagingChannel.SMS]
# Prepare messages (may differ per channel)
triggerData = eventParameters.triggerData
errors = triggerData.get('errors', [])
timestamp = triggerData.get('timestamp', 'Unknown')
emailSubject = "System Error Report"
emailMessage = f"System errors detected at {timestamp}:\n\n{errors}"
smsMessage = f"System Error: {len(errors)} errors detected at {timestamp}"
messagesSent = 0
# Send via sendMessage
for reg in emailRegistrations:
sendResult = messagingService.sendMessage(
subject=emailSubject,
message=emailMessage,
registration=reg
)
if sendResult.success:
messagesSent += 1
for reg in smsRegistrations:
sendResult = messagingService.sendMessage(
subject="", # SMS hat kein Subject
message=smsMessage,
registration=reg
)
if sendResult.success:
messagesSent += 1
return MessagingSubscriptionExecutionResult(
success=True,
messagesSent=messagesSent
)
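# Illustrative trigger-side call (constructor fields inferred from the usage above;
# the subscription ID "SystemErrors" maps to this file by naming convention):
#
#   params = MessagingEventParameters(
#       triggerData={"errors": ["DB timeout", "queue overflow"], "timestamp": "2026-03-05T12:00:00Z"}
#   )
#   result = messagingService.executeSubscription("SystemErrors", params)
#   # result.messagesSent == number of registrations notified successfully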

View file

@@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""SharePoint service."""
from .mainServiceSharepoint import SharepointService
__all__ = ["SharepointService"]

View file

@@ -0,0 +1,825 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Connector for SharePoint operations using Microsoft Graph API."""
import logging
import aiohttp
import asyncio
import time
from typing import Dict, Any, List, Optional, Callable
logger = logging.getLogger(__name__)
# Cache for discoverSites() to avoid hitting Graph API on every folder-options call (e.g. when UI loads site list).
# Key: token prefix (per user), Value: (expiry_ts, sites). TTL 5 minutes.
_discoverSitesCache: Dict[str, tuple] = {}
_DISCOVER_SITES_TTL_SEC = 300
class SharepointService:
"""SharePoint connector using Microsoft Graph API for reliable authentication."""
def __init__(self, context, get_service: Callable[[str], Any]):
"""Initialize SharePoint service without access token.
Args:
context: ServiceCenterContext with user, mandate_id, etc.
get_service: Service resolver for dependency injection (e.g. security)
Use setAccessTokenFromConnection() method to configure the access token before making API calls.
"""
self._context = context
self._get_service = get_service
self.accessToken = None
self.baseUrl = "https://graph.microsoft.com/v1.0"
def setAccessTokenFromConnection(self, userConnection) -> bool:
"""Set access token from UserConnection.
Args:
userConnection: UserConnection object or dict containing token information
Returns:
bool: True if token was set successfully, False otherwise
"""
try:
if not userConnection:
logger.error("UserConnection is required to set access token")
return False
# Handle both dict and UserConnection object
if isinstance(userConnection, dict):
connectionId = userConnection.get('id')
else:
connectionId = getattr(userConnection, 'id', None)
if not connectionId:
logger.error("UserConnection must have an 'id' field")
return False
# Get a fresh token for this specific connection via security service
security = self._get_service("security")
if not security:
logger.error("Security service not available for token access")
return False
token = security.getFreshToken(connectionId)
if not token:
logger.error(f"No token found for connection {connectionId}")
return False
self.accessToken = token.tokenAccess
logger.info(f"Access token set for connection {connectionId}")
return True
except Exception as e:
logger.error(f"Error setting access token: {str(e)}")
return False
async def _makeGraphApiCall(self, endpoint: str, method: str = "GET", data: bytes = None) -> Dict[str, Any]:
"""Make a Microsoft Graph API call with proper error handling."""
try:
if self.accessToken is None:
logger.error("Access token is not set. Please call setAccessTokenFromConnection() before using the SharePoint service.")
return {"error": "Access token is not set. Please call setAccessTokenFromConnection() before using the SharePoint service."}
headers = {
"Authorization": f"Bearer {self.accessToken}",
"Content-Type": "application/json" if data and method != "PUT" else "application/octet-stream" if data else "application/json"
}
# Remove leading slash from endpoint to avoid double slash
cleanEndpoint = endpoint.lstrip('/')
url = f"{self.baseUrl}/{cleanEndpoint}"
logger.debug(f"Making Graph API call: {method} {url}")
timeout = aiohttp.ClientTimeout(total=30)
async with aiohttp.ClientSession(timeout=timeout) as session:
if method == "GET":
async with session.get(url, headers=headers) as response:
if response.status == 200:
return await response.json()
else:
error_text = await response.text()
logger.error(f"Graph API call failed: {response.status} - {error_text}")
return {"error": f"API call failed: {response.status} - {error_text}"}
elif method == "PUT":
async with session.put(url, headers=headers, data=data) as response:
if response.status in [200, 201]:
return await response.json()
else:
error_text = await response.text()
logger.error(f"Graph API call failed: {response.status} - {error_text}")
return {"error": f"API call failed: {response.status} - {error_text}"}
elif method == "POST":
async with session.post(url, headers=headers, data=data) as response:
if response.status in [200, 201]:
return await response.json()
else:
error_text = await response.text()
logger.error(f"Graph API call failed: {response.status} - {error_text}")
return {"error": f"API call failed: {response.status} - {error_text}"}
elif method == "DELETE":
async with session.delete(url, headers=headers) as response:
if response.status in [200, 204]:
return {}
else:
error_text = await response.text()
logger.error(f"Graph API call failed: {response.status} - {error_text}")
return {"error": f"API call failed: {response.status} - {error_text}"}
except asyncio.TimeoutError:
logger.error(f"Graph API call timed out after 30 seconds: {endpoint}")
return {"error": f"API call timed out after 30 seconds: {endpoint}"}
except Exception as e:
logger.error(f"Error making Graph API call: {str(e)}")
return {"error": f"Error making Graph API call: {str(e)}"}
async def discoverSites(self) -> List[Dict[str, Any]]:
"""Discover all SharePoint sites accessible to the user (cached per token, 5-minute TTL)."""
try:
# Serve from the module-level cache while the entry for this token is still fresh
cacheKey = (self.accessToken or "")[:24]
cached = _discoverSitesCache.get(cacheKey)
if cached and cached[0] > time.time():
return cached[1]
result = await self._makeGraphApiCall("sites?search=*")
if "error" in result:
logger.error(f"Error discovering SharePoint sites: {result['error']}")
return []
sites = result.get("value", [])
logger.info(f"Discovered {len(sites)} SharePoint sites")
processedSites = []
for site in sites:
siteInfo = {
"id": site.get("id"),
"displayName": site.get("displayName"),
"name": site.get("name"),
"webUrl": site.get("webUrl"),
"description": site.get("description"),
"createdDateTime": site.get("createdDateTime"),
"lastModifiedDateTime": site.get("lastModifiedDateTime")
}
processedSites.append(siteInfo)
logger.debug(f"Site: {siteInfo['displayName']} - {siteInfo['webUrl']}")
_discoverSitesCache[cacheKey] = (time.time() + _DISCOVER_SITES_TTL_SEC, processedSites)
return processedSites
except Exception as e:
logger.error(f"Error discovering SharePoint sites: {str(e)}")
return []
async def findSiteByName(self, siteName: str) -> Optional[Dict[str, Any]]:
"""Find a specific SharePoint site by name using direct Graph API call."""
try:
# Try to get the site directly by name using Graph API
endpoint = f"sites/{siteName}"
result = await self._makeGraphApiCall(endpoint)
if result and "error" not in result:
siteInfo = {
"id": result.get("id"),
"displayName": result.get("displayName"),
"name": result.get("name"),
"webUrl": result.get("webUrl"),
"description": result.get("description"),
"createdDateTime": result.get("createdDateTime"),
"lastModifiedDateTime": result.get("lastModifiedDateTime")
}
logger.info(f"Found site directly: {siteInfo['displayName']} - {siteInfo['webUrl']}")
return siteInfo
except Exception as e:
logger.debug(f"Direct site lookup failed for '{siteName}': {str(e)}")
# Fallback to discovery if direct lookup fails
logger.info(f"Direct lookup failed, trying discovery for site: {siteName}")
sites = await self.discoverSites()
if not sites:
logger.warning("No sites discovered")
return None
logger.info(f"Discovered {len(sites)} SharePoint sites:")
for site in sites:
logger.info(f" - {site.get('displayName', 'Unknown')} (ID: {site.get('id', 'Unknown')})")
# Try exact match first
for site in sites:
if site.get("displayName", "").strip().lower() == siteName.strip().lower():
logger.info(f"Found exact match: {site.get('displayName')}")
return site
# Try partial match
for site in sites:
if siteName.lower() in site.get("displayName", "").lower():
logger.info(f"Found partial match: {site.get('displayName')}")
return site
logger.warning(f"No site found matching: {siteName}")
return None
async def findSiteByWebUrl(self, webUrl: str) -> Optional[Dict[str, Any]]:
"""Find a SharePoint site using its web URL (useful for guest sites)."""
try:
# Use the web URL format: sites/{hostname}:/sites/{site-path}
# Extract hostname and site path from the web URL
if not webUrl.startswith("https://"):
webUrl = f"https://{webUrl}"
# Parse the URL to extract hostname and site path
from urllib.parse import urlparse
parsed = urlparse(webUrl)
hostname = parsed.hostname
pathParts = parsed.path.strip('/').split('/')
if len(pathParts) >= 2 and pathParts[0] == 'sites':
sitePath = '/'.join(pathParts[1:]) # Everything after 'sites/'
else:
logger.error(f"Invalid SharePoint URL format: {webUrl}")
return None
endpoint = f"sites/{hostname}:/sites/{sitePath}"
logger.debug(f"Trying web URL format: {endpoint}")
result = await self._makeGraphApiCall(endpoint)
if result and "error" not in result:
siteInfo = {
"id": result.get("id"),
"displayName": result.get("displayName"),
"name": result.get("name"),
"webUrl": result.get("webUrl"),
"description": result.get("description"),
"createdDateTime": result.get("createdDateTime"),
"lastModifiedDateTime": result.get("lastModifiedDateTime")
}
logger.info(f"Found site by web URL: {siteInfo['displayName']} - {siteInfo['webUrl']} (ID: {siteInfo['id']})")
return siteInfo
else:
logger.warning(f"Site not found using web URL: {webUrl}")
return None
except Exception as e:
logger.error(f"Error finding site by web URL: {str(e)}")
return None
async def findSiteByUrl(self, hostname: str, sitePath: str) -> Optional[Dict[str, Any]]:
"""Find a SharePoint site using the site URL format."""
try:
# For guest sites, try different URL formats
urlFormats = [
f"sites/{hostname}:/sites/{sitePath}", # Standard format
f"sites/{hostname}:/sites/{sitePath}/", # With trailing slash
f"sites/{hostname}:/sites/{sitePath.lower()}", # Lowercase
f"sites/{hostname}:/sites/{sitePath.lower()}/", # Lowercase with slash
]
for endpoint in urlFormats:
logger.debug(f"Trying URL format: {endpoint}")
result = await self._makeGraphApiCall(endpoint)
if result and "error" not in result:
siteInfo = {
"id": result.get("id"),
"displayName": result.get("displayName"),
"name": result.get("name"),
"webUrl": result.get("webUrl"),
"description": result.get("description"),
"createdDateTime": result.get("createdDateTime"),
"lastModifiedDateTime": result.get("lastModifiedDateTime")
}
logger.info(f"Found site by URL: {siteInfo['displayName']} - {siteInfo['webUrl']} (ID: {siteInfo['id']})")
return siteInfo
else:
logger.debug(f"URL format failed: {endpoint} - {result.get('error', 'Unknown error')}")
logger.warning(f"Site not found using any URL format for: {hostname}:/sites/{sitePath}")
return None
except Exception as e:
logger.error(f"Error finding site by URL: {str(e)}")
return None
async def getFolderByPath(self, siteId: str, folderPath: str) -> Optional[Dict[str, Any]]:
"""Get folder information by path within a site."""
try:
# Clean the path
cleanPath = folderPath.lstrip('/')
# If path is empty, get root directly
if not cleanPath:
endpoint = f"sites/{siteId}/drive/root"
else:
endpoint = f"sites/{siteId}/drive/root:/{cleanPath}"
result = await self._makeGraphApiCall(endpoint)
if "error" in result:
logger.warning(f"Folder not found at path {folderPath}: {result['error']}")
return None
return result
except Exception as e:
logger.error(f"Error getting folder by path: {str(e)}")
return None
async def uploadFile(self, siteId: str, folderPath: str, fileName: str, content: bytes) -> Dict[str, Any]:
"""Upload a file to SharePoint."""
try:
# Clean the path
cleanPath = folderPath.lstrip('/')
uploadPath = f"{cleanPath.rstrip('/')}/{fileName}"
endpoint = f"sites/{siteId}/drive/root:/{uploadPath}:/content"
logger.info(f"Uploading file to: {endpoint}")
result = await self._makeGraphApiCall(endpoint, method="PUT", data=content)
if "error" in result:
logger.error(f"Upload failed: {result['error']}")
return result
logger.info(f"File uploaded successfully: {fileName}")
return result
except Exception as e:
logger.error(f"Error uploading file: {str(e)}")
return {"error": f"Error uploading file: {str(e)}"}
async def downloadFile(self, siteId: str, fileId: str) -> Optional[bytes]:
"""Download a file from SharePoint."""
try:
if self.accessToken is None:
logger.error("Access token is not set. Please call setAccessTokenFromConnection() before using the SharePoint service.")
return None
endpoint = f"sites/{siteId}/drive/items/{fileId}/content"
headers = {"Authorization": f"Bearer {self.accessToken}"}
timeout = aiohttp.ClientTimeout(total=30)
async with aiohttp.ClientSession(timeout=timeout) as session:
async with session.get(f"{self.baseUrl}/{endpoint}", headers=headers) as response:
if response.status == 200:
return await response.read()
else:
logger.error(f"Download failed: {response.status}")
return None
except Exception as e:
logger.error(f"Error downloading file: {str(e)}")
return None
async def listFolderContents(self, siteId: str, folderPath: str = "") -> List[Dict[str, Any]]:
"""List contents of a folder."""
try:
if not folderPath or folderPath == "/":
endpoint = f"sites/{siteId}/drive/root/children"
else:
cleanPath = folderPath.lstrip('/')
endpoint = f"sites/{siteId}/drive/root:/{cleanPath}:/children"
result = await self._makeGraphApiCall(endpoint)
if "error" in result:
logger.warning(f"Failed to list folder contents: {result['error']}")
return []
items = result.get("value", [])
processedItems = []
for item in items:
# Determine if it's a folder or file
isFolder = 'folder' in item
itemInfo = {
"id": item.get("id"),
"name": item.get("name"),
"type": "folder" if isFolder else "file",
"size": item.get("size", 0),
"createdDateTime": item.get("createdDateTime"),
"lastModifiedDateTime": item.get("lastModifiedDateTime"),
"webUrl": item.get("webUrl")
}
if "file" in item:
itemInfo["mimeType"] = item["file"].get("mimeType")
itemInfo["downloadUrl"] = item.get("@microsoft.graph.downloadUrl")
if "folder" in item:
itemInfo["childCount"] = item["folder"].get("childCount", 0)
processedItems.append(itemInfo)
return processedItems
except Exception as e:
logger.error(f"Error listing folder contents: {str(e)}")
return []
async def searchFiles(self, siteId: str, query: str) -> List[Dict[str, Any]]:
"""Search for files in a site."""
try:
searchQuery = query.replace("'", "''") # Escape single quotes for OData
endpoint = f"sites/{siteId}/drive/root/search(q='{searchQuery}')"
result = await self._makeGraphApiCall(endpoint)
if "error" in result:
logger.warning(f"Search failed: {result['error']}")
return []
items = result.get("value", [])
processedItems = []
for item in items:
isFolder = 'folder' in item
itemInfo = {
"id": item.get("id"),
"name": item.get("name"),
"type": "folder" if isFolder else "file",
"size": item.get("size", 0),
"createdDateTime": item.get("createdDateTime"),
"lastModifiedDateTime": item.get("lastModifiedDateTime"),
"webUrl": item.get("webUrl"),
"parentPath": item.get("parentReference", {}).get("path", "")
}
if "file" in item:
itemInfo["mimeType"] = item["file"].get("mimeType")
itemInfo["downloadUrl"] = item.get("@microsoft.graph.downloadUrl")
processedItems.append(itemInfo)
return processedItems
except Exception as e:
logger.error(f"Error searching files: {str(e)}")
return []
async def copyFileAsync(self, siteId: str, sourceFolder: str, sourceFile: str, destFolder: str, destFile: str) -> None:
"""Copy a file from source to destination folder (like original synchronizer)."""
try:
# First, download the source file
sourcePath = f"{sourceFolder}/{sourceFile}"
fileContent = await self.downloadFileByPath(siteId=siteId, filePath=sourcePath)
if not fileContent:
raise Exception(f"Failed to download source file: {sourcePath}")
# Upload to destination and verify the result (uploadFile reports failures via an "error" key)
uploadResult = await self.uploadFile(
siteId=siteId,
folderPath=destFolder,
fileName=destFile,
content=fileContent
)
if uploadResult and "error" in uploadResult:
raise Exception(f"Failed to upload destination file: {destFolder}/{destFile} - {uploadResult['error']}")
logger.info(f"File copied: {sourceFile} -> {destFile}")
except Exception as e:
# Provide more specific error information
errorMsg = str(e)
if "itemNotFound" in errorMsg or "404" in errorMsg:
raise Exception(f"Source file not found (404): {sourcePath} - {errorMsg}")
else:
raise Exception(f"Error copying file: {errorMsg}")
async def deleteFile(self, siteId: str, itemId: str) -> bool:
"""Delete a file (or folder) from SharePoint by item ID. Returns True on success."""
try:
if not siteId or not itemId:
logger.warning("deleteFile: siteId and itemId are required")
return False
endpoint = f"sites/{siteId}/drive/items/{itemId}"
result = await self._makeGraphApiCall(endpoint, method="DELETE")
if result and "error" in result:
logger.warning(f"deleteFile failed: {result.get('error')}")
return False
return True
except Exception as e:
logger.error(f"Error deleting file: {str(e)}")
return False
async def downloadFileByPath(self, siteId: str, filePath: str) -> Optional[bytes]:
"""Download a file by its path within a site."""
try:
if self.accessToken is None:
logger.error("Access token is not set. Please call setAccessTokenFromConnection() before using the SharePoint service.")
return None
# Clean the path
cleanPath = filePath.strip('/')
endpoint = f"sites/{siteId}/drive/root:/{cleanPath}:/content"
# Use direct HTTP call for file downloads (binary content)
headers = {
"Authorization": f"Bearer {self.accessToken}",
}
# Remove leading slash from endpoint to avoid double slash
cleanEndpoint = endpoint.lstrip('/')
url = f"{self.baseUrl}/{cleanEndpoint}"
logger.debug(f"Downloading file: GET {url}")
timeout = aiohttp.ClientTimeout(total=30)
async with aiohttp.ClientSession(timeout=timeout) as session:
async with session.get(url, headers=headers) as response:
if response.status == 200:
return await response.read()
else:
error_text = await response.text()
logger.error(f"File download failed: {response.status} - {error_text}")
return None
except Exception as e:
logger.error(f"Error downloading file by path: {str(e)}")
return None
async def _getItemById(self, siteId: str, driveId: str, itemId: str) -> Optional[Dict[str, Any]]:
"""Verify that an item exists by getting it by ID."""
try:
endpoint = f"sites/{siteId}/drives/{driveId}/items/{itemId}"
result = await self._makeGraphApiCall(endpoint)
if "error" in result:
logger.warning(f"Item {itemId} not found: {result['error']}")
return None
return result
except Exception as e:
logger.warning(f"Error verifying item {itemId}: {str(e)}")
return None
async def _findDriveForItem(self, siteId: str, itemId: str) -> Optional[str]:
"""Find which drive contains a specific item by trying to get it from all drives."""
try:
endpoint = f"sites/{siteId}/drives"
drivesResult = await self._makeGraphApiCall(endpoint)
if "error" in drivesResult:
logger.warning(f"Could not get drives for site {siteId}: {drivesResult['error']}")
return None
drives = drivesResult.get("value", [])
if not drives:
logger.warning(f"No drives found for site {siteId}")
return None
for drive in drives:
driveId = drive.get("id")
if not driveId:
continue
itemInfo = await self._getItemById(siteId, driveId, itemId)
if itemInfo:
logger.info(f"Found item {itemId} in drive {drive.get('name', driveId)}")
return driveId
logger.warning(f"Item {itemId} not found in any drive for site {siteId}")
return None
except Exception as e:
logger.warning(f"Error finding drive for item {itemId}: {str(e)}")
return None
async def getFolderUsageAnalytics(self, siteId: str, driveId: str, itemId: str, startDateTime: Optional[str] = None, endDateTime: Optional[str] = None, interval: str = "day") -> Dict[str, Any]:
"""Get usage analytics for a folder or file."""
try:
from datetime import datetime, timedelta, timezone
if not endDateTime:
endDateTime = datetime.now(timezone.utc).isoformat().replace('+00:00', 'Z')
if not startDateTime:
startDate = datetime.now(timezone.utc) - timedelta(days=30)
startDateTime = startDate.isoformat().replace('+00:00', 'Z')
endpoint = f"sites/{siteId}/drives/{driveId}/items/{itemId}/getActivitiesByInterval"
endpoint += f"?startDateTime={startDateTime}&endDateTime={endDateTime}&interval={interval}"
result = await self._makeGraphApiCall(endpoint)
if "error" in result:
errorMsg = result.get('error', '')
if isinstance(errorMsg, str) and '404' in errorMsg:
itemInfo = await self._getItemById(siteId, driveId, itemId)
if not itemInfo:
correctDriveId = await self._findDriveForItem(siteId, itemId)
if correctDriveId and correctDriveId != driveId:
endpoint = f"sites/{siteId}/drives/{correctDriveId}/items/{itemId}/getActivitiesByInterval"
endpoint += f"?startDateTime={startDateTime}&endDateTime={endDateTime}&interval={interval}"
result = await self._makeGraphApiCall(endpoint)
if "error" not in result:
return result
itemInfo = await self._getItemById(siteId, correctDriveId, itemId)
if itemInfo:
return {
"value": [],
"note": "No analytics data available for this item. The item exists but may not have activity data or analytics may not be supported for this item type."
}
else:
return result
else:
return result
return result
except Exception as e:
logger.error(f"Error getting folder usage analytics: {str(e)}")
return {"error": f"Error getting folder usage analytics: {str(e)}"}
async def getDriveId(self, siteId: str, driveName: Optional[str] = None) -> Optional[str]:
"""Get drive ID for a site."""
try:
endpoint = f"sites/{siteId}/drives"
result = await self._makeGraphApiCall(endpoint)
if "error" in result:
logger.error(f"Error getting drives: {result['error']}")
return None
drives = result.get("value", [])
if not driveName:
for drive in drives:
if drive.get("name") == "Documents" or drive.get("name") == "Shared Documents":
return drive.get("id")
if drives:
return drives[0].get("id")
return None
for drive in drives:
if drive.get("name", "").lower() == driveName.lower():
return drive.get("id")
return None
except Exception as e:
logger.error(f"Error getting drive ID: {str(e)}")
return None
def extractSiteFromStandardPath(self, pathQuery: str) -> Optional[Dict[str, str]]:
"""
Extract site name from Microsoft-standard server-relative path:
/sites/company-share/Freigegebene Dokumente/...
Returns dict with keys: siteName, innerPath (no leading slash) on success, else None.
"""
try:
if not pathQuery or not pathQuery.startswith('/sites/'):
return None
remainder = pathQuery[7:]
if '/' not in remainder:
return {"siteName": remainder, "innerPath": ""}
siteName, inner = remainder.split('/', 1)
siteName = siteName.strip()
innerPath = inner.strip()
if not siteName:
return None
return {"siteName": siteName, "innerPath": innerPath}
except Exception as e:
logger.error(f"Error extracting site from standard path '{pathQuery}': {str(e)}")
return None
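# Examples (following the path convention described above):
#   extractSiteFromStandardPath("/sites/company-share/Freigegebene Dokumente/Work")
#       -> {"siteName": "company-share", "innerPath": "Freigegebene Dokumente/Work"}
#   extractSiteFromStandardPath("/sites/company-share")
#       -> {"siteName": "company-share", "innerPath": ""}
#   extractSiteFromStandardPath("Freigegebene Dokumente/Work") -> None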
async def getSiteByStandardPath(self, sitePath: str, allSites: Optional[List[Dict[str, Any]]] = None) -> Optional[Dict[str, Any]]:
"""Get SharePoint site directly by Microsoft-standard path (/sites/SiteName)."""
try:
from urllib.parse import urlparse
hostname = None
if allSites and len(allSites) > 0:
webUrl = allSites[0].get("webUrl", "")
hostname = urlparse(webUrl).hostname if webUrl else None
if not hostname:
rootSite = await self._makeGraphApiCall("sites/root")
if rootSite and "webUrl" in rootSite and "error" not in rootSite:
hostname = urlparse(rootSite.get("webUrl", "")).hostname
if not hostname:
minimalSites = await self.discoverSites()
if not minimalSites:
return None
hostname = urlparse(minimalSites[0].get("webUrl", "")).hostname
if not hostname:
return None
endpoint = f"sites/{hostname}:/sites/{sitePath}"
result = await self._makeGraphApiCall(endpoint)
if "error" in result:
return None
return {
"id": result.get("id"),
"displayName": result.get("displayName"),
"name": result.get("name"),
"webUrl": result.get("webUrl"),
"description": result.get("description"),
"createdDateTime": result.get("createdDateTime"),
"lastModifiedDateTime": result.get("lastModifiedDateTime")
}
except Exception as e:
logger.error(f"Error getting site by standard path '{sitePath}': {str(e)}")
return None
def filterSitesByHint(self, sites: List[Dict[str, Any]], siteHint: str) -> List[Dict[str, Any]]:
"""Filter discovered sites by a human-entered site hint (case-insensitive substring)."""
try:
if not siteHint:
return sites
hint = siteHint.strip().lower()
filtered: List[Dict[str, Any]] = []
for site in sites:
name = (site.get("displayName") or "").lower()
webUrl = (site.get("webUrl") or "").lower()
if hint in name or hint in webUrl:
filtered.append(site)
return filtered if filtered else sites
except Exception as e:
logger.error(f"Error filtering sites by hint '{siteHint}': {str(e)}")
return sites
async def resolveSitesFromPathQuery(self, pathQuery: str, allSites: Optional[List[Dict[str, Any]]] = None) -> List[Dict[str, Any]]:
"""Resolve sites from pathQuery. Handles both Microsoft-standard paths and regular paths."""
try:
if pathQuery.startswith('/sites/'):
parsedPath = self.extractSiteFromStandardPath(pathQuery)
if parsedPath:
siteName = parsedPath.get("siteName")
directSite = await self.getSiteByStandardPath(siteName, allSites)
if directSite:
logger.info(f"Got site directly by standard path - no need to discover all sites")
return [directSite]
else:
logger.warning(f"Could not get site directly, falling back to site discovery")
if not allSites:
allSites = await self.discoverSites()
if not allSites:
logger.warning("No SharePoint sites found or accessible")
return []
if pathQuery.startswith('/sites/'):
parsedPath = self.extractSiteFromStandardPath(pathQuery)
if parsedPath:
siteName = parsedPath.get("siteName")
sites = self.filterSitesByHint(allSites, siteName)
if not sites:
logger.warning(f"No SharePoint site found matching '{siteName}'")
return []
logger.info(f"Filtered to site(s) matching '{siteName}': {[s['displayName'] for s in sites]}")
return sites
else:
return allSites
else:
return allSites
except Exception as e:
logger.error(f"Error resolving sites from pathQuery '{pathQuery}': {str(e)}")
return []
def validatePathQuery(self, pathQuery: str) -> tuple[bool, Optional[str]]:
"""Validate pathQuery format. Returns (isValid, errorMessage)."""
try:
if not pathQuery or pathQuery.strip() == "" or pathQuery.strip() == "*":
return False, "pathQuery cannot be empty or '*'"
if not pathQuery.startswith('/'):
return False, "pathQuery must start with '/' and include site name with Microsoft-standard syntax /sites/<SiteName>/... e.g. /sites/company-share/Freigegebene Dokumente/Work"
validPathPrefixes = ['/sites/', '/Documents', '/documents', '/Shared Documents', '/shared documents']
if not any(pathQuery.startswith(prefix) for prefix in validPathPrefixes):
return False, f"Invalid pathQuery '{pathQuery}'. This appears to be search terms, not a valid SharePoint path. Use findDocumentPath action first to search for folders, then use the returned folder path as pathQuery."
return True, None
except Exception as e:
logger.error(f"Error validating pathQuery '{pathQuery}': {str(e)}")
return False, f"Error validating pathQuery: {str(e)}"
def detectFolderType(self, item: Dict[str, Any]) -> bool:
"""Detect if an item is a folder using improved detection logic."""
try:
if 'folder' in item:
return True
webUrl = item.get('webUrl', '')
name = item.get('name', '')
if '.' not in name and ('/' in webUrl or '\\' in webUrl):
return True
return False
except Exception as e:
logger.error(f"Error detecting folder type: {str(e)}")
return False
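# Minimal usage sketch (assumptions: the registry key is "sharepoint"; the site URL
# and folder path are made-up examples):
#
#   async def listSharedDocs(get_service, userConnection):
#       sp = get_service("sharepoint")
#       if not sp.setAccessTokenFromConnection(userConnection):
#           return []
#       site = await sp.findSiteByWebUrl("https://contoso.sharepoint.com/sites/company-share")
#       if not site:
#           return []
#       return await sp.listFolderContents(site["id"], "Freigegebene Dokumente")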

Some files were not shown because too many files have changed in this diff.