first version of service center implemented on chatbot

parent 6dc2afafb9
commit 53d2d9d873
111 changed files with 37504 additions and 49 deletions
8
app.py
@@ -324,6 +324,14 @@ async def lifespan(app: FastAPI):
     except Exception as e:
         logger.error(f"Feature catalog registration failed: {e}")
 
+    # Pre-warm service center modules (avoids first-request import latency)
+    try:
+        from modules.serviceCenter import preWarm
+        preWarm()
+        logger.info("Service center pre-warm completed")
+    except Exception as e:
+        logger.warning(f"Service center pre-warm failed (non-critical): {e}")
+
     # Get event user for feature lifecycle (system-level user for background operations)
     rootInterface = getRootInterface()
     eventUser = rootInterface.getUserByUsername("event")

318
docs/SERVICE_ARCHITECTURE_DOCUMENTATION.md
Normal file
@@ -0,0 +1,318 @@
# Gateway Service Architecture Documentation

This document describes the structure, design patterns, and key components of the two service architectures in the gateway:

1. **`modules/serviceCenter`**: the new service center (context-based DI, RBAC-aware)
2. **`modules/services`**: the legacy services hub (monolithic hub, eager loading)

---

## 1. `modules/serviceCenter`: New Service Center

### 1.1 File/Folder Structure

```
modules/serviceCenter/
├── __init__.py              # Public API: getService, preWarm, registerServiceObjects, can_access_service
├── context.py               # ServiceCenterContext dataclass
├── registry.py              # Service definitions (CORE_SERVICES, IMPORTABLE_SERVICES, RBAC)
├── resolver.py              # Resolution logic, dependency injection, legacy fallback
├── core/                    # Core services (internal building blocks, no RBAC)
│   ├── __init__.py
│   ├── serviceUtils/
│   │   └── mainServiceUtils.py
│   ├── serviceSecurity/
│   │   └── mainServiceSecurity.py
│   └── serviceStreaming/
│       └── mainServiceStreaming.py
└── services/                # Feature-facing importable services (RBAC-protected)
    ├── __init__.py
    ├── serviceAi/
    ├── serviceBilling/
    ├── serviceChat/
    ├── serviceExtraction/
    ├── serviceGeneration/
    ├── serviceMessaging/
    ├── serviceNeutralization/
    ├── serviceSharepoint/
    ├── serviceTicket/
    └── serviceWeb/
```

**Design distinction:**

- **Core services:** Internal utilities (utils, security, streaming). Never directly requested by features. No RBAC. Security and streaming are fully migrated; the legacy hub delegates to the service center.
- **Importable services:** Feature-facing. Each has an `objectKey` for RBAC (e.g. `service.web`, `service.extraction`).

---

### 1.2 Service Registration and Discovery

**Registration:** Services are declared statically in `registry.py`.

- **CORE_SERVICES**: Internal services with `module`, `class`, `dependencies`.
- **IMPORTABLE_SERVICES**: Feature-facing services with `module`, `class`, `dependencies`, `objectKey`, `label`.
- **SERVICE_RBAC_OBJECTS**: Derived from IMPORTABLE_SERVICES for catalog registration.

**Discovery:** No dynamic discovery. All services are explicitly listed in the registry. Adding a service requires editing `registry.py`.

```python
# Example from registry.py
IMPORTABLE_SERVICES = {
    "ai": {
        "module": "modules.serviceCenter.services.serviceAi.mainServiceAi",
        "class": "AiService",
        "dependencies": ["chat", "utils", "extraction", "billing"],
        "objectKey": "service.ai",
        "label": {"en": "AI", "de": "KI", "fr": "IA"},
    },
    # ...
}
```
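
The documentation says `SERVICE_RBAC_OBJECTS` is derived from `IMPORTABLE_SERVICES` but does not show how. The following is a minimal sketch of one way such a derivation could look. The helper name `build_rbac_objects` and the shape of each catalog entry (`serviceKey`, `objectKey`, `label`) are assumptions made for illustration, not the actual contents of `registry.py`.

```python
from typing import Any, Dict, List

# Trimmed copy of the registry entry from the documentation (illustration only).
IMPORTABLE_SERVICES: Dict[str, Dict[str, Any]] = {
    "ai": {
        "module": "modules.serviceCenter.services.serviceAi.mainServiceAi",
        "class": "AiService",
        "dependencies": ["chat", "utils", "extraction", "billing"],
        "objectKey": "service.ai",
        "label": {"en": "AI", "de": "KI", "fr": "IA"},
    },
}


def build_rbac_objects(services: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect one catalog entry per importable service (hypothetical entry shape)."""
    return [
        {"serviceKey": key, "objectKey": spec["objectKey"], "label": spec["label"]}
        for key, spec in services.items()
    ]


# Derived once at import time, so the registry stays the single source of truth.
SERVICE_RBAC_OBJECTS = build_rbac_objects(IMPORTABLE_SERVICES)
```

Deriving the list rather than maintaining it by hand means a new service only has to be added in one place.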

---

### 1.3 Dependency Injection and Factory Patterns

**Constructor pattern:** Services receive two arguments from the resolver:

1. `context: ServiceCenterContext` (user, mandate_id, feature_instance_id, workflow)
2. `get_service: Callable[[str], Any]` (function to resolve other services by key)

```python
# Service Center service constructor
def __init__(self, context, get_service: Callable[[str], Any]):
    self._context = context
    self._get_service = get_service
```

**Dependency resolution:**

- The resolver (`resolver.py`) builds a `get_service` callable that recursively resolves dependencies.
- Dependencies are declared in the registry (e.g. `"dependencies": ["chat", "utils", "extraction", "billing"]`).
- Circular dependencies are detected and raise `RuntimeError`.
- Resolution is cached per `(user, mandate_id, feature_instance_id)` + `key`.

**Legacy fallback:** If a service fails to load from the service center, the resolver falls back to the legacy `Services` hub when `legacy_hub` is provided.
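
The resolution rules above (recursive `get_service`, cycle detection, per-context caching) can be sketched as follows. This is an illustrative reconstruction, not the code in `resolver.py`: the in-memory `REGISTRY`, the simplified `ServiceCenterContext` with a string user, the module-level `_cache`, and the two toy service classes are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class ServiceCenterContext:
    user: str  # simplified; the real context presumably carries a user object
    mandate_id: Optional[str] = None
    feature_instance_id: Optional[str] = None
    workflow: Any = None


class UtilsService:
    def __init__(self, context, get_service):
        self._context = context


class ChatService:
    def __init__(self, context, get_service):
        self._context = context
        self.utils = get_service("utils")  # dependency pulled in by key


# Hypothetical registry holding classes directly instead of module paths.
REGISTRY = {
    "utils": {"class": UtilsService, "dependencies": []},
    "chat": {"class": ChatService, "dependencies": ["utils"]},
}

_cache: Dict[Tuple, Any] = {}


def resolve(key: str, ctx: ServiceCenterContext,
            resolving: Optional[Set[str]] = None, registry=REGISTRY) -> Any:
    """Resolve a service for a context, caching per (user, mandate, feature, key)."""
    cache_key = (ctx.user, ctx.mandate_id, ctx.feature_instance_id, key)
    if cache_key in _cache:
        return _cache[cache_key]
    resolving = set() if resolving is None else resolving
    if key in resolving:
        raise RuntimeError(f"Circular dependency while resolving '{key}'")
    resolving.add(key)
    try:
        spec = registry[key]

        def get_service(dep_key: str) -> Any:
            return resolve(dep_key, ctx, resolving, registry)

        for dep in spec["dependencies"]:  # resolve declared dependencies first
            get_service(dep)
        instance = spec["class"](ctx, get_service)
        _cache[cache_key] = instance
        return instance
    finally:
        resolving.discard(key)
```

Passing the `resolving` set down the recursion is what lets a cycle such as `a -> b -> a` surface as a `RuntimeError` instead of unbounded recursion.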

---

### 1.4 Main Entry Points and Usage Patterns

| Entry Point | Purpose |
|-------------|---------|
| `getService(key, context, legacy_hub=None)` | Resolve a service by key for the given context |
| `preWarm(service_keys=None)` | Pre-load service modules at startup (called in `app.py` lifespan) |
| `registerServiceObjects(catalogService)` | Register service RBAC objects (called via `registerAllFeaturesInCatalog`) |
| `can_access_service(user, rbac, service_key, ...)` | RBAC check for service access |
| `ServiceCenterContext(user, mandate_id, feature_instance_id, workflow)` | Context dataclass |

**Typical usage (chatbot feature):**

```python
from modules.serviceCenter import getService
from modules.serviceCenter.context import ServiceCenterContext

ctx = ServiceCenterContext(user=user, mandate_id=mandateId, feature_instance_id=featureInstanceId, workflow=workflow)
ai_service = getService("ai", ctx, legacy_hub=None)
chat_service = getService("chat", ctx, legacy_hub=None)
```

**Feature-level hub (e.g. chatbot):**

- `getChatbotServices()` builds a lightweight hub with only the required services.
- Uses `REQUIRED_SERVICES` to resolve only `chat`, `ai`, `billing`, `streaming`.
- Returns a `_ChatbotServiceHub` object with `chat`, `ai`, `billing`, `streaming`, `interfaceDbComponent`, etc.
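
A feature-level hub of this kind can be sketched in a few lines. The `REQUIRED_SERVICES` format below follows the chatbot definition in this commit; the function `build_feature_hub`, the use of `SimpleNamespace` in place of `_ChatbotServiceHub`, and the injected `resolve_service` callable are assumptions for illustration.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List

REQUIRED_SERVICES: List[Dict[str, Any]] = [
    {"serviceKey": "chat", "meta": {"usage": "File info, document handling"}},
    {"serviceKey": "ai", "meta": {"usage": "AI calls, conversation name generation"}},
    {"serviceKey": "billing", "meta": {"usage": "Usage tracking, balance checks"}},
    {"serviceKey": "streaming", "meta": {"usage": "Event manager, ChatStreamingHelper"}},
]


def build_feature_hub(required: List[Dict[str, Any]],
                      resolve_service: Callable[[str], Any]) -> SimpleNamespace:
    """Resolve only the listed services and expose them as hub attributes."""
    hub = SimpleNamespace()
    for entry in required:
        key = entry["serviceKey"]
        setattr(hub, key, resolve_service(key))
    return hub


# Usage with a stand-in resolver; in the gateway this would wrap getService(key, ctx).
hub = build_feature_hub(REQUIRED_SERVICES, lambda key: f"<{key} service>")
```

Services that are not listed (for example `sharepoint`) are never resolved, which is the point of the lightweight hub.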

---

### 1.5 Initialization and Bootstrapping

1. **`app.py` lifespan:**
   - `registerAllFeaturesInCatalog(catalogService)` → calls `registerServiceObjects(catalogService)` for service RBAC objects
   - `preWarm()`: imports all service modules to avoid first-request latency

2. **`registerAllFeaturesInCatalog` (modules/system/registry.py):**
   - Registers system RBAC objects
   - Registers service center RBAC objects via `registerServiceObjects`
   - Registers feature RBAC objects

3. **First request:**
   - `getService(key, ctx)` → `resolve()` loads module, instantiates class with `(context, get_service)`, caches instance
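
The pre-warm step amounts to importing every registered service module once at startup. The sketch below shows that idea with `importlib.import_module`; the function name `pre_warm`, its return value, and the `MODULE_PATHS` mapping (standard-library modules standing in for the real service module paths) are assumptions, not the gateway's `preWarm()` implementation.

```python
import importlib
import logging
from typing import Dict, Iterable, List, Optional

logger = logging.getLogger(__name__)

# Stand-in registry: stdlib modules replace the real service module paths
# so that the sketch runs anywhere. The last entry is deliberately missing.
MODULE_PATHS: Dict[str, str] = {
    "json": "json",
    "csv": "csv",
    "missing": "no_such_module_for_demo",
}


def pre_warm(service_keys: Optional[Iterable[str]] = None,
             module_paths: Dict[str, str] = MODULE_PATHS) -> List[str]:
    """Import the modules for the given keys (all keys if None); return the loaded keys."""
    keys = list(service_keys) if service_keys is not None else list(module_paths)
    loaded: List[str] = []
    for key in keys:
        try:
            importlib.import_module(module_paths[key])
            loaded.append(key)
        except ImportError as exc:
            # Non-critical: a failed pre-warm only means the import happens later.
            logger.warning("Pre-warm skipped %s: %s", key, exc)
    return loaded
```

Because Python caches imported modules in `sys.modules`, the first `getService` call after a successful pre-warm no longer pays the import cost.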

---

## 2. `modules/services`: Legacy Services Hub

### 2.1 File/Folder Structure

```
modules/services/
├── __init__.py                  # Services class, getInterface(), PublicService, _loadFeatureInterfaces, _loadFeatureServices
├── serviceAi/
│   └── mainServiceAi.py
├── serviceBilling/
│   └── mainServiceBilling.py
├── serviceChat/
│   └── mainServiceChat.py
├── serviceExtraction/
│   ├── extractors/
│   ├── chunking/
│   ├── merging/
│   ├── subRegistry.py
│   ├── subPipeline.py
│   └── mainServiceExtraction.py
├── serviceGeneration/
│   ├── paths/
│   ├── renderers/
│   └── mainServiceGeneration.py
├── serviceMessaging/
│   └── mainServiceMessaging.py
├── serviceNormalization/
│   └── mainServiceNormalization.py
├── serviceSharepoint/
│   └── mainServiceSharepoint.py
├── serviceStreaming/
│   ├── eventManager.py
│   ├── helpers.py
│   └── mainServiceStreaming.py
├── serviceTicket/
│   └── mainServiceTicket.py
├── serviceUtils/
│   └── mainServiceUtils.py
├── serviceWeb/
│   └── mainServiceWeb.py
└── serviceSecurity/
    └── mainServiceSecurity.py
```

**No core vs. services split.** All services live in flat `serviceX/` directories.

---

### 2.2 Service Registration and Discovery

**Registration:** Services are **eagerly loaded** in `Services.__init__()` via hardcoded imports. No registry file.

**Discovery:**

- **Shared services:** Loaded explicitly in `__init__` from `modules/services/serviceX/mainServiceX.py`.
- **Feature services:** Discovered dynamically via `_loadFeatureServices()`, which scans `modules/features/*/service*/mainService*.py` and instantiates classes ending with `"Service"`.

```python
# Shared services: hardcoded in Services.__init__
from .serviceSharepoint.mainServiceSharepoint import SharepointService
self.sharepoint = PublicService(SharepointService(self))
from .serviceChat.mainServiceChat import ChatService
self.chat = PublicService(ChatService(self))
# ... etc.
```
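
For contrast with the hardcoded shared services, the glob-based feature discovery described above might look roughly like this. It is a sketch under assumptions: the function name `discover_feature_services`, the attribute naming rule (directory `serviceAi` becomes `ai`), and returning classes instead of instances are illustrative choices, not the behaviour of `_loadFeatureServices()`.

```python
import importlib.util
import inspect
from pathlib import Path
from typing import Dict


def discover_feature_services(features_root: Path) -> Dict[str, type]:
    """Map hub attribute names to service classes found under features_root."""
    found: Dict[str, type] = {}
    for path in sorted(features_root.glob("*/service*/mainService*.py")):
        # Load the file as a standalone module without requiring a package layout.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for name, cls in inspect.getmembers(module, inspect.isclass):
            # Keep only classes defined in this file whose name ends with "Service".
            if name.endswith("Service") and cls.__module__ == module.__name__:
                attr = path.parent.name[len("service"):]  # "serviceAi" -> "Ai"
                found[attr[:1].lower() + attr[1:]] = cls
    return found
```

A caller would then instantiate each class with the hub and assign it to the matching attribute, which is how a feature-level `serviceAi` can override the shared `services.ai`.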

---

### 2.3 Dependency Injection / Factory Patterns

**Constructor pattern:** Services receive the entire `Services` hub as their single dependency.

```python
# Legacy service constructor
def __init__(self, services):
    self.services = services
```

**No explicit dependency graph.** Services access other services via `self.services.<attr>` (e.g. `self.services.interfaceDbComponent`, `self.services.extraction`). All services are loaded at construction time.

**PublicService proxy:** Services are wrapped in `PublicService(target, functionsOnly=True)` to expose only callable, non-private attributes. This reduces accidental access to internal state.
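
A proxy with the behaviour described for `PublicService` can be written with `__getattr__`. The version below is a minimal sketch that matches the documented constructor signature; the internals and the `DemoService` class are assumptions, and the real class may differ.

```python
class PublicService:
    """Expose only public (and, by default, callable) attributes of a target object."""

    def __init__(self, target, functionsOnly=True):
        self._target = target
        self._functionsOnly = functionsOnly

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for attributes of the target.
        if name.startswith("_"):
            raise AttributeError(f"'{name}' is private")
        value = getattr(self._target, name)
        if self._functionsOnly and not callable(value):
            raise AttributeError(f"'{name}' is not a callable attribute")
        return value


class DemoService:
    apiKey = "internal-state"  # non-callable, hidden by the proxy

    def _refresh(self):  # private, hidden by the proxy
        return "refreshed"

    def ping(self):  # public callable, visible through the proxy
        return "pong"


service = PublicService(DemoService())
reply = service.ping()
```

Accessing `service.apiKey` or `service._refresh` raises `AttributeError`, so callers can only reach the public methods.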

**BillingService:** Uses a separate factory `getService(currentUser, mandateId, featureInstanceId, featureCode)` and a module-level cache. Not integrated with the hub’s constructor pattern.

---

### 2.4 Main Entry Points and Usage Patterns

| Entry Point | Purpose |
|-------------|---------|
| `getInterface(user, workflow, mandateId, featureInstanceId)` | Returns a `Services` instance |
| `Services` | Central hub with all services and interfaces |

**Typical usage:**

```python
from modules.services import getInterface as getServices

services = getServices(user, workflow, mandateId=mandateId, featureInstanceId=featureInstanceId)
ai = services.ai
extraction = services.extraction
```

**Interfaces loaded at construction:**

- `interfaceDbApp`, `interfaceDbComponent`, `interfaceDbChat`, `rbac`
- Plus dynamically loaded `interfaceFeature*` from feature containers

---

### 2.5 Initialization and Bootstrapping

1. **No startup bootstrap:** services load on the first `getInterface()` call.
2. **Construction flow:**
   - `getInterface(user, ...)` → `Services(user, ...)`
   - `Services.__init__`:
     - Loads DB interfaces (`interfaceDbApp`, `interfaceDbComponent`, `interfaceDbChat`)
     - Instantiates all shared services (sharepoint, ticket, chat, utils, security, streaming, ai, extraction, generation, web)
     - Calls `_loadFeatureInterfaces()`, which discovers `interfaceFeature*.py` in features
     - Calls `_loadFeatureServices()`, which discovers `service*/mainService*.py` in features and overrides hub attributes

3. **Feature services:** If a feature defines `serviceAi/mainServiceAi.py`, it overrides `services.ai`. The shared `serviceAi` is only used when no feature override exists.

---

## 3. Side-by-Side Comparison

| Aspect | Service Center | Legacy Services |
|--------|----------------|-----------------|
| **Location** | `modules/serviceCenter/` | `modules/services/` |
| **Entry point** | `getService(key, context, legacy_hub)` | `getInterface(user, ...)` → `Services` |
| **Constructor** | `(context, get_service)` | `(services)`, the full hub |
| **Context** | `ServiceCenterContext` (user, mandate_id, feature_instance_id, workflow) | Full `Services` with interfaces |
| **Dependencies** | Declared in registry, resolved lazily via `get_service("key")` | Via `self.services.<attr>` |
| **Loading** | Lazy (on first `getService`), cached per context | Eager (all at construction) |
| **RBAC** | Per-service `objectKey` in registry, `can_access_service()` | Shared via hub `.rbac` |
| **Feature services** | Not applicable; features use `getService(key, ctx)` | Discovered via `_loadFeatureServices()` |
| **Pre-warm** | `preWarm()` in app lifespan | None |
| **Bootstrap** | `registerServiceObjects()` via `registerAllFeaturesInCatalog` | None |

---

## 4. Coexistence and Migration

- **Service center** can fall back to **legacy hub** when `legacy_hub` is passed to `getService`.
- **Chatbot** uses service center via `getChatbotServices()` and does not use the legacy hub.
- **Billing, routes, teamsbot, commcoach, etc.** still use `modules.services` (e.g. `getInterface`, `getService` from `serviceBilling`).
- **`ServiceCenterContext`** is used when calling `getService`. Features that pass `workflow=None` use a placeholder workflow for billing/featureCode.
- Migration plan: `docs/SERVICE_CENTER_MIGRATION_PLAN.md`.

---

## 5. Service Center Resolver Flow

```
getService("ai", ctx, legacy_hub)
  → resolve("ai", ctx, cache, resolving, legacy_hub)
    → Check cache (cache_key = user_mandate_feature_ai)
    → If cache hit: return cached instance
    → If miss:
      → _load_service_class("modules.serviceCenter.services.serviceAi.mainServiceAi", "AiService")
      → Resolve dependencies: chat, utils, extraction, billing (recursive resolve)
      → instance = AiService(ctx, get_service)
      → cache[cache_key] = instance
      → return instance
    → On ImportError/ModuleNotFoundError: _get_from_legacy(legacy_hub, "ai") if legacy_hub
```
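
The last branch of the flow, falling back to the legacy hub when the service center module cannot be imported, can be sketched like this. The function `get_service_or_legacy` and its `build` parameter are hypothetical; only the import-error condition and the attribute lookup on `legacy_hub` are taken from the flow above.

```python
import importlib
from types import SimpleNamespace
from typing import Any, Callable, Optional


def get_service_or_legacy(
    key: str,
    module_path: str,
    class_name: str,
    build: Callable[[type], Any],
    legacy_hub: Optional[Any] = None,
) -> Any:
    """Build the service from its module, or fall back to the legacy hub attribute."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        if legacy_hub is None:
            raise  # no fallback available, surface the import problem
        return getattr(legacy_hub, key)
    return build(getattr(module, class_name))


# Usage with stand-ins: the module does not exist, so the hub attribute is returned.
legacy_hub = SimpleNamespace(ai="legacy-ai-service")
service = get_service_or_legacy(
    "ai", "no_such_module_for_demo", "AiService",
    build=lambda cls: cls(), legacy_hub=legacy_hub,
)
```

Note that only import failures trigger the fallback in this sketch; an exception raised inside the service constructor would still propagate.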

---

## 6. Key Files Reference

| File | Purpose |
|------|---------|
| `serviceCenter/registry.py` | Service definitions, dependency graph, RBAC keys |
| `serviceCenter/resolver.py` | Resolution logic, caching, legacy fallback |
| `serviceCenter/context.py` | `ServiceCenterContext` dataclass |
| `serviceCenter/__init__.py` | `getService`, `preWarm`, `registerServiceObjects`, `can_access_service` |
| `services/__init__.py` | `Services` class, `getInterface`, `PublicService`, feature discovery |
| `system/registry.py` | `registerAllFeaturesInCatalog` (calls `registerServiceObjects`) |
| `app.py` | Lifespan: `preWarm()`, `registerAllFeaturesInCatalog()` |
| `features/chatbot/mainChatbot.py` | Example: `getChatbotServices()` using service center |

217
docs/SERVICE_CENTER_MIGRATION_PLAN.md
Normal file
@@ -0,0 +1,217 @@
# Service Center Migration Plan

## Overview

This document describes a **step-by-step plan** to migrate from the old `modules/services` (Services hub) to the new `modules/serviceCenter`. The migration is **incremental** (one feature at a time), with UI-driven testing after each step.

**Recommended first feature: Chatbot.** It has a clear UI, limited service dependencies, and is already partially using the service center (AI, generation, billing).

---

## Architecture Summary

### Current State

| Component | Location | Notes |
|-----------|----------|-------|
| **Service Center** | `modules/serviceCenter/` | New: registry, resolver, context-based DI |
| **Services Hub** | `modules/services/` | Legacy: `getInterface()` → `Services` instance |
| **Chatbot** | `modules/features/chatbot/` | Uses `getServices()` → `.chat`, `.ai` |

### Service Center vs Legacy Services

| Aspect | Service Center | Legacy Services |
|--------|----------------|-----------------|
| **Constructor** | `(context: ServiceCenterContext, get_service)` | `(services: Services)`, receives the hub |
| **Context** | Minimal: user, mandate_id, feature_instance_id, workflow | Full hub with all interfaces |
| **Dependencies** | Injected via `get_service("key")` | Via `self.services.<attr>` |
| **RBAC** | Per-service `objectKey` in registry | Shared via hub |
| **Pre-warm** | `preWarm()` at app startup | Loaded on first use |

### Services Already Using Service Center (in Services class)

The `Services` class in `modules/services/__init__.py` already uses `getService()` for:

- `messaging`
- `ai`
- `generation`
- `billing`

### Services Still Using Legacy Direct Imports

- `chat` ← **Target for Phase 1**
- `sharepoint`
- `ticket`
- `utils`
- `security`
- `streaming`
- `extraction`
- `web`

---

## Phase 1: Migrate Chatbot to Use Service Center for Chat

**Goal:** Switch the Chatbot feature to get the Chat service from Service Center instead of the legacy hub. This validates the full flow with minimal risk.

### Step 1.1: Switch Services Class to Use Service Center for Chat

**File:** `modules/services/__init__.py`

**Change:** Replace the direct ChatService import with `getService("chat", ...)`.

```python
# BEFORE (line ~126-127):
from .serviceChat.mainServiceChat import ChatService
self.chat = PublicService(ChatService(self))

# AFTER:
self.chat = PublicService(getService("chat", _ctx, legacy_hub=self))
```

The `_ctx` (ServiceCenterContext) is already created for messaging/ai/generation. Add `workflow=self.workflow` to the context if not already present (it should be; check the existing `_ctx` creation around line 109–116).

**Verification:**

1. Ensure `ServiceCenterContext` includes `workflow` when Services has one (chatbot often passes `workflow=None` initially).
2. The service center ChatService gets `interfaceDbComponent` from `getInterface(context.user, mandateId=context.mandate_id)`, same as legacy. The chatbot calls `getFileInfo(fileId)`, which only needs `interfaceDbComponent`, not workflow.

### Step 1.2: Ensure Service Center Context Has Workflow

**File:** `modules/services/__init__.py`

Verify the existing context creation:

```python
_ctx = ServiceCenterContext(
    user=self.user,
    mandate_id=self.mandateId,
    feature_instance_id=self.featureInstanceId,
    workflow=self.workflow,
)
```

If `workflow` is missing, add it. The ChatService uses `_context.workflow` for methods like `getChatDocumentsFromDocumentList`; for `getFileInfo` it is not needed.

### Step 1.3: Run Unit Tests

```powershell
cd c:\Users\IdaDittrich\Documents\01_Code\gateway
pytest tests/unit/serviceCenter/test_service_center_imports.py -v
python tests/scripts/smoke_test_service_center.py
```

### Step 1.4: Manual UI Test (Chatbot with File Upload)

1. **Start the gateway:**
   ```powershell
   cd c:\Users\IdaDittrich\Documents\01_Code\gateway
   uvicorn app:app --reload --host 0.0.0.0 --port 8000
   ```

2. **Start the frontend** (if using frontend_nyla):
   ```powershell
   cd c:\Users\IdaDittrich\Documents\01_Code\frontend_nyla
   npm run dev
   ```

3. **Log in** as a user with access to the Chatbot feature.

4. **Open a Chatbot instance** (navigate to the chatbot feature, select or create an instance).

5. **Create a new conversation:** click "New conversation" or equivalent.

6. **Attach a file:** upload a PDF or document before sending.

7. **Send a message**, e.g. "Summarize this document."

8. **Verify:**
   - No 500 errors in gateway logs
   - File is processed (chat service’s `getFileInfo` is used when creating `ChatbotDocument`s)
   - AI response streams back correctly (AI service already from service center)

### Step 1.5: Rollback if Needed

If something breaks, revert the change in `modules/services/__init__.py`:

```python
from .serviceChat.mainServiceChat import ChatService
self.chat = PublicService(ChatService(self))
```

---

## Phase 2 (Future): Migrate Extraction for Chatbot

The chatbot may use extraction when processing documents. After Phase 1 is stable:

1. Switch `Services` to use `getService("extraction", _ctx, legacy_hub=self)` instead of direct import.
2. Ensure `ExtractionService` in service center has the same interface as the legacy one.
3. Re-test chatbot with document-heavy prompts.

---

## Phase 3 (Future): Migrate Remaining Services

| Service | Used By | Priority |
|---------|---------|----------|
| utils | Chat, Extraction, AI, Web, Generation | High (core) |
| security | Sharepoint | Medium |
| streaming | Workflows, Chatbot SSE | Medium |
| sharepoint | Sharepoint workflows | Medium |
| ticket | Ticket system | Low |
| web | Web research workflows | Medium |

---

## Service Center Bootstrap (Already Done)

The app already:

- Calls `preWarm()` at startup (`app.py` lifespan)
- Has `registerServiceObjects()` available for RBAC catalog (call from bootstrap if needed)

### Optional: Register Service RBAC Objects

If you want service-level RBAC (e.g. `can_access_service()`), call during bootstrap:

```python
# In app.py lifespan or interfaceBootstrap
from modules.serviceCenter import registerServiceObjects
from modules.security.rbacCatalog import getCatalogService
catalogService = getCatalogService()
registerServiceObjects(catalogService)
```

---

## Testing Checklist (Chatbot Phase 1)

- [ ] Unit tests pass: `pytest tests/unit/serviceCenter/ -v`
- [ ] Smoke test passes: `python tests/scripts/smoke_test_service_center.py`
- [ ] Gateway starts without import errors
- [ ] Chatbot UI loads
- [ ] New conversation creates successfully
- [ ] Message without file sends and gets AI response
- [ ] Message with file attachment sends and gets AI response
- [ ] No errors in gateway logs during the above flows

---

## File Summary for Phase 1

| File | Action |
|------|--------|
| `modules/services/__init__.py` | Replace `ChatService` import with `getService("chat", _ctx, legacy_hub=self)` |
| (No other changes) | Service center ChatService and resolver already support legacy fallback |

---

## FAQ

**Q: Why start with Chat instead of Utils?**
A: Chat has a clear UI path (chatbot) and only a few call sites. Utils is used everywhere; migrating it later reduces risk.

**Q: What if `getService("chat", ctx)` fails?**
A: The resolver passes `legacy_hub=self`, so it falls back to the legacy `Services.chat` if the service center module fails to load. You get graceful degradation.

**Q: Can I test without the frontend?**
A: Yes. Use the API directly, e.g. `POST /api/chatbot/{instanceId}/start/stream` with a valid `UserInputRequest` (with `listFileId` for file upload).

92
docs/SERVICE_CENTER_VS_LEGACY_COMPARISON.md
Normal file
@@ -0,0 +1,92 @@
# Service Center vs Legacy Services Hub: Comparison & Assessment

## Executive Summary

The **Service Center** (`modules/serviceCenter`) is a superior architecture compared to the legacy **Services Hub** (`modules/services`). It was worthwhile to create it. The main benefits are: **explicit dependency graph**, **lazy loading**, **per-service RBAC**, and **context-scoped resolution** without carrying the entire hub. The legacy hub remains valid for incremental migration and backward compatibility.

---

## 1. Architecture Comparison

| Aspect | Service Center | Legacy Services Hub |
|--------|----------------|---------------------|
| **Location** | `modules/serviceCenter/` | `modules/services/` |
| **Entry point** | `getService(key, context, legacy_hub)` | `getInterface(user, ...)` → `Services` |
| **Constructor** | `(context, get_service)` | `(services)`, the full hub |
| **Dependencies** | Declared in registry, resolved lazily via `get_service("key")` | Via `self.services.<attr>`; all services always present |
| **Loading** | **Lazy**: only requested services + deps | **Eager**: everything at construction |
| **RBAC** | Per-service `objectKey`, `can_access_service()` | Shared via hub `.rbac` |
| **Caching** | Per-context cache (user + mandate + featureInstance) | No instance cache; new `Services` each call |
| **Feature override** | N/A; features use `getService` directly | Feature services override hub attributes |
| **Pre-warm** | `preWarm()` at app startup | None |
| **Structure** | Core vs importable split; explicit registry | Flat `serviceX/` dirs; discovery via glob |

---

## 2. Which Setup is Better?

**Service Center is better** for these reasons:

### 2.1 Explicit Dependency Graph
- Dependencies are declared in `registry.py` (e.g. `"ai": {"dependencies": ["chat", "utils", "extraction", "billing"]}`).
- Circular dependencies are detected and raise `RuntimeError`.
- Easier to reason about and refactor.

### 2.2 Lazy Loading & Resource Efficiency
- Only requested services (and their transitive deps) are loaded.
- A feature like chatbot needs `chat`, `ai`, `billing`, `streaming`, not `sharepoint`, `ticket`, `neutralization`, etc.
- Legacy hub loads **everything** on first `getInterface()`.

### 2.3 Context-Scoped Resolution
- Each request gets a `ServiceCenterContext` (user, mandate_id, feature_instance_id, workflow).
- Resolution is cached per context. Same user+mandate+feature → same instances.
- No need to pass or construct a full hub.

### 2.4 Per-Service RBAC
- Services have `objectKey` (e.g. `service.ai`, `service.extraction`).
- `can_access_service(user, rbac, service_key)` checks before resolving.
- Finer-grained control than a single hub-level RBAC.
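
A per-service check of this kind reduces to looking up the service's `objectKey` and asking the RBAC layer about it. The sketch below keeps the documented argument order `(user, rbac, service_key)`, but the `rbac.has_permission(user, object_key)` method, the `FakeRbac` class, and the trimmed `IMPORTABLE_SERVICES` mapping are assumptions; the gateway's RBAC interface is not shown in this commit.

```python
from typing import Any, Dict, Set, Tuple

# Trimmed registry: only the field needed for the access check.
IMPORTABLE_SERVICES: Dict[str, Dict[str, Any]] = {
    "ai": {"objectKey": "service.ai"},
    "extraction": {"objectKey": "service.extraction"},
}


class FakeRbac:
    """Stand-in RBAC backend holding explicit (user, objectKey) grants."""

    def __init__(self, grants: Set[Tuple[str, str]]):
        self._grants = grants

    def has_permission(self, user: str, object_key: str) -> bool:  # hypothetical API
        return (user, object_key) in self._grants


def can_access_service(user: str, rbac: Any, service_key: str,
                       registry: Dict[str, Dict[str, Any]] = IMPORTABLE_SERVICES) -> bool:
    """Return True only if the service is registered and the user holds its objectKey."""
    spec = registry.get(service_key)
    if spec is None:
        return False  # unknown services are denied rather than silently allowed
    return rbac.has_permission(user, spec["objectKey"])


rbac = FakeRbac({("alice", "service.ai")})
allowed = can_access_service("alice", rbac, "ai")
denied = can_access_service("alice", rbac, "extraction")
```

Denying unknown keys by default is a design choice of this sketch; the real function may treat them differently.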

### 2.5 Separation of Concerns
- **Core services** (utils, security, streaming): internal, no RBAC.
- **Importable services** (ai, billing, extraction, etc.): feature-facing, RBAC-protected.
- Clear distinction vs. flat structure in legacy.

### 2.6 Pre-warm for Cold Start
- `preWarm()` imports all service modules at startup.
- First request avoids import latency.
- Legacy has no equivalent.

---

## 3. When Legacy Still Makes Sense

- **Migration**: Features that haven’t moved yet still use `getInterface()`.
- **Feature overrides**: Feature-specific services (e.g. `serviceAi/mainServiceAi.py` in a feature) that override hub attributes.
- **Backward compatibility**: `legacy_hub` fallback in Service Center allows gradual migration.

---

## 4. Did It Make Sense to Create the Service Center?

**Yes.** The legacy hub has inherent limitations:

1. **Monolithic hub:** every `getInterface()` constructs a full `Services` object with all services, interfaces, and feature discovery.
2. **Implicit dependencies:** services grab what they need via `self.services.<attr>`, leading to hidden coupling.
3. **No explicit RBAC per service:** access control is at the hub level.
4. **Eager loading:** every request pays for all services even when only a few are used.

Service Center addresses these while keeping a migration path via `legacy_hub` fallback. The Chatbot feature already uses it successfully.

---

## 5. Benchmark Script

Run the comparison script to measure runtime and memory:

```bash
# From gateway root
python tests/benchmarks/benchmark_service_center_vs_legacy.py
```

See `tests/benchmarks/benchmark_service_center_vs_legacy.py` for details on metrics and methodology.

|
|
@ -34,7 +34,6 @@ from modules.features.chatbot.bridges.tools import (
|
|||
create_tavily_search_tool,
|
||||
create_send_streaming_message_tool,
|
||||
)
|
||||
from modules.services.serviceStreaming import ChatStreamingHelper
|
||||
from modules.datamodels.datamodelUam import User
|
||||
|
||||
if TYPE_CHECKING:
|
||||
|
|
@ -585,6 +584,7 @@ class Chatbot:
|
|||
workflow_id: str = "default"
|
||||
config: Optional["ChatbotConfig"] = None
|
||||
_event_manager: Any = None
|
||||
_chat_streaming_helper: Any = None # From service center streaming service
|
||||
|
||||
@classmethod
|
||||
async def create(
|
||||
|
|
@ -596,6 +596,7 @@ class Chatbot:
|
|||
config: Optional["ChatbotConfig"] = None,
|
||||
event_manager=None,
|
||||
planner_model: Optional[AICenterChatModel] = None,
|
||||
chat_streaming_helper=None,
|
||||
) -> "Chatbot":
|
||||
"""Factory method to create and configure a Chatbot instance.
|
||||
|
||||
|
|
@ -607,6 +608,7 @@ class Chatbot:
|
|||
config: Optional chatbot configuration for dynamic tool enablement.
|
||||
event_manager: Optional event manager for streaming (passed from route).
|
||||
planner_model: Optional fast model for planner/routing (default: same as model).
|
||||
chat_streaming_helper: ChatStreamingHelper from service center streaming service.
|
||||
|
||||
Returns:
|
||||
A configured Chatbot instance.
|
||||
|
|
@ -619,6 +621,7 @@ class Chatbot:
|
|||
config=config,
|
||||
_event_manager=event_manager,
|
||||
planner_model=planner_model,
|
||||
_chat_streaming_helper=chat_streaming_helper,
|
||||
)
|
||||
configured_tools = await instance._configure_tools()
|
||||
instance._tools = configured_tools
|
||||
|
|
@@ -1244,10 +1247,11 @@ class Chatbot:
             if etype == "on_chain_end" and _is_root(event):
                 output_obj = edata.get("output")
 
-                # Extract message list from the graph's final output
-                final_msgs = ChatStreamingHelper.extract_messages_from_output(
-                    output_obj=output_obj
-                )
+                # Extract message list from the graph's final output (ChatStreamingHelper from service center)
+                helper = self._chat_streaming_helper
+                if not helper:
+                    raise RuntimeError("ChatStreamingHelper required; pass chat_streaming_helper to Chatbot.create()")
+                final_msgs = helper.extract_messages_from_output(output_obj=output_obj)
 
                 # Normalize for the frontend (only user/assistant with text content)
                 # Exclude planner-only and SQL-path intermediate messages from chat display
@@ -1255,9 +1259,9 @@ class Chatbot:
                 chat_history_payload: List[dict] = []
                 for m in final_msgs:
                     if isinstance(m, BaseMessage):
-                        d = ChatStreamingHelper.message_to_dict(msg=m)
+                        d = helper.message_to_dict(msg=m)
                     elif isinstance(m, dict):
-                        d = ChatStreamingHelper.dict_message_to_dict(obj=m)
+                        d = helper.dict_message_to_dict(obj=m)
                     else:
                         continue
                     if d.get("role") not in ("user", "assistant") or not d.get("content"):

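The hunks above replace static calls on an imported `ChatStreamingHelper` with a helper instance handed to `Chatbot.create()`. The following is a minimal sketch of that injection pattern, not the project's code: `DemoHelper`, `MiniChatbot` and `final_messages` are hypothetical names, and the assumption that graph output is a dict with a `messages` list is made only for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, List


class DemoHelper:
    """Hypothetical stand-in for the injected streaming helper."""

    @staticmethod
    def extract_messages_from_output(*, output_obj: Any) -> List[Any]:
        # Assumption for this sketch: the output is a dict holding a "messages" list.
        if isinstance(output_obj, dict):
            return list(output_obj.get("messages", []))
        return []


@dataclass
class MiniChatbot:
    """Illustrative only; mirrors the factory-plus-private-field shape of the diff."""

    _chat_streaming_helper: Any = None

    @classmethod
    async def create(cls, chat_streaming_helper: Any = None) -> "MiniChatbot":
        return cls(_chat_streaming_helper=chat_streaming_helper)

    def final_messages(self, output_obj: Any) -> List[Any]:
        helper = self._chat_streaming_helper
        if not helper:
            # Fail fast instead of silently falling back to a module-level import.
            raise RuntimeError("helper required; pass chat_streaming_helper to create()")
        return helper.extract_messages_from_output(output_obj=output_obj)


bot = asyncio.run(MiniChatbot.create(chat_streaming_helper=DemoHelper))
messages = bot.final_messages({"messages": ["hi", "hello"]})
```

Passing the dependency through the factory keeps the class free of a direct import from the service package, at the cost of a runtime error when a caller forgets to supply it.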
@@ -48,12 +48,25 @@ RESOURCE_OBJECTS = [
     },
 ]
 
-# Service requirements for chatbot — resolved via service center
+# Service requirements - services this feature needs from the service center
+# Format: [{serviceKey, meta}]. Used by getChatbotServices() to resolve only needed services.
 REQUIRED_SERVICES = [
-    {"serviceKey": "chat", "meta": {"usage": "File info, document handling"}},
-    {"serviceKey": "ai", "meta": {"usage": "AI calls, conversation name generation"}},
-    {"serviceKey": "billing", "meta": {"usage": "Usage tracking, balance checks"}},
-    {"serviceKey": "streaming", "meta": {"usage": "Event manager, ChatStreamingHelper"}},
+    {
+        "serviceKey": "chat",
+        "meta": {"usage": "File info, document handling"}
+    },
+    {
+        "serviceKey": "ai",
+        "meta": {"usage": "AI calls, conversation name generation"}
+    },
+    {
+        "serviceKey": "billing",
+        "meta": {"usage": "Usage tracking, balance checks"}
+    },
+    {
+        "serviceKey": "streaming",
+        "meta": {"usage": "Event manager, ChatStreamingHelper"}
+    },
 ]
 
 # Template roles for this feature
@@ -123,6 +136,108 @@ def getFeatureDefinition() -> Dict[str, Any]:
     }
 
 
+def getRequiredServiceKeys() -> List[str]:
+    """Return list of service keys this feature requires."""
+    return [s["serviceKey"] for s in REQUIRED_SERVICES]
+
+
+def getChatbotServices(
+    user,
+    mandateId: Optional[str] = None,
+    featureInstanceId: Optional[str] = None,
+    workflow=None,
+) -> Any:
+    """
+    Get a service hub for the chatbot feature using the service center.
+    Resolves only the services declared in REQUIRED_SERVICES.
+
+    Returns a hub-like object with: chat, ai, billing, streaming,
+    plus interfaceDbComponent, user, mandateId, featureInstanceId.
+    """
+    from modules.serviceCenter import getService
+    from modules.serviceCenter.context import ServiceCenterContext
+    from modules.interfaces.interfaceDbManagement import getInterface as getComponentInterface
+
+    # Provide workflow or placeholder so billing/etc get featureCode
+    _workflow = workflow
+    if _workflow is None:
+        _workflow = type("_Placeholder", (), {"featureCode": FEATURE_CODE})()
+    ctx = ServiceCenterContext(
+        user=user,
+        mandate_id=mandateId,
+        feature_instance_id=featureInstanceId,
+        workflow=_workflow,
+    )
+
+    hub = _ChatbotServiceHub()
+    hub.user = user
+    hub.mandateId = mandateId
+    hub.featureInstanceId = featureInstanceId
+    hub.workflow = workflow
+    hub.interfaceDbComponent = getComponentInterface(user, mandateId=mandateId, featureInstanceId=featureInstanceId)
+
+    for spec in REQUIRED_SERVICES:
+        key = spec["serviceKey"]
+        try:
+            svc = getService(key, ctx, legacy_hub=None)
+            setattr(hub, key, svc)
+        except Exception as e:
+            logger.warning(f"Could not resolve service '{key}' for chatbot: {e}")
+            setattr(hub, key, None)
+
+    return hub
+
+
+def getChatStreamingHelper():
+    """
+    Get ChatStreamingHelper utility class (used by chatbot for message normalization).
+    Resolves via service center streaming service.
+    """
+    from modules.serviceCenter import getService
+    from modules.serviceCenter.context import ServiceCenterContext
+    # Minimal context - streaming service only needs it for resolver
+    ctx = ServiceCenterContext(user=__get_placeholder_user(), mandate_id=None, feature_instance_id=None)
+    streaming = getService("streaming", ctx, legacy_hub=None)
+    return streaming.getChatStreamingHelper() if streaming else None
+
+
+def __get_placeholder_user():
+    """Placeholder user for contexts that only need service resolution (e.g. ChatStreamingHelper)."""
+    from modules.datamodels.datamodelUam import User
+    return User(id="system", email="system@placeholder", firstName="System", lastName="Placeholder")
+
+
+def getEventManager(user, mandateId: Optional[str] = None, featureInstanceId: Optional[str] = None):
+    """
+    Get the global event manager for SSE streaming (used by chatbot routes).
+    """
+    from modules.serviceCenter import getService
+    from modules.serviceCenter.context import ServiceCenterContext
+
+    ctx = ServiceCenterContext(
+        user=user,
+        mandate_id=mandateId,
+        feature_instance_id=featureInstanceId,
+    )
+    streaming = getService("streaming", ctx, legacy_hub=None)
+    return streaming.getEventManager()
+
+
+class _ChatbotServiceHub:
+    """Lightweight hub exposing only services required by the chatbot feature."""
+    user = None
+    mandateId = None
+    featureInstanceId = None
+    workflow = None
+    interfaceDbComponent = None
+    chat = None
+    ai = None
+    billing = None
+    streaming = None
+    featureCode = "chatbot"
+    allowedProviders = None
+
+
 def getUiObjects() -> List[Dict[str, Any]]:
     """Return UI objects for RBAC catalog registration."""
     return UI_OBJECTS

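`getChatbotServices()` above walks a declarative list and attaches one attribute per service, turning a failed resolution into `None` rather than aborting. A minimal sketch of that loop, assuming a hypothetical `FACTORIES` table in place of the service center's `getService()`; `REQUIRED`, `FACTORIES` and `build_hub` are illustrative names only.

```python
import logging
from types import SimpleNamespace
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)

# Hypothetical declarations, same shape as REQUIRED_SERVICES in the diff.
REQUIRED = [
    {"serviceKey": "chat", "meta": {"usage": "File info"}},
    {"serviceKey": "billing", "meta": {"usage": "Balance checks"}},
]

# Hypothetical factories standing in for the service center's getService().
FACTORIES: Dict[str, Callable[[], Any]] = {
    "chat": lambda: "chat-service",
    # "billing" is deliberately missing to exercise the failure path.
}


def build_hub(required: List[dict]) -> SimpleNamespace:
    """Resolve each declared key; a failing service becomes None instead of aborting."""
    hub = SimpleNamespace()
    for spec in required:
        key = spec["serviceKey"]
        try:
            setattr(hub, key, FACTORIES[key]())
        except Exception as exc:
            logger.warning("Could not resolve service '%s': %s", key, exc)
            setattr(hub, key, None)
    return hub


hub = build_hub(REQUIRED)
```

With this design every consumer has to check for `None` before using a service, which is what `_preflight_billing_check` does for `services.billing`.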
@@ -31,7 +31,7 @@ from modules.features.chatbot.interfaceFeatureChatbot import ChatbotConversation
 
 # Import chatbot feature
 from modules.features.chatbot import chatProcess
-from modules.services.serviceStreaming import get_event_manager
+from modules.features.chatbot.mainChatbot import getEventManager
 
 # Pre-warm AI connectors when this router loads (before first request).
 # Ensures connectors are ready; avoids 4–8 s delay on first chatbot message.
@@ -250,7 +250,7 @@ async def stream_chatbot_start(
     # Validate instance access
     mandateId = _validateInstanceAccess(instanceId, context)
 
-    event_manager = get_event_manager()
+    event_manager = getEventManager(context.user, mandateId=mandateId, featureInstanceId=instanceId)
 
     try:
         # Use workflowId from query parameter if provided, otherwise from request body
@@ -462,7 +462,7 @@ async def stop_chatbot(
 ) -> ChatbotConversation:
     """Stops a running chatbot workflow."""
     # Validate instance access
-    _validateInstanceAccess(instanceId, context)
+    mandateId = _validateInstanceAccess(instanceId, context)
 
     try:
         # Get chatbot interface with instance context
@@ -489,7 +489,7 @@ async def stop_chatbot(
             "lastActivity": getUtcTimestamp()
         })
 
-        event_manager = get_event_manager()
+        event_manager = getEventManager(context.user, mandateId=mandateId, featureInstanceId=instanceId)
         # Store log entry (createLog emits when event_manager is provided)
         interfaceDbChat.createLog({
             "id": f"log_{uuid.uuid4()}",

@@ -91,15 +91,17 @@ async def chatProcess(
         ChatbotConversation instance
     """
     try:
-        # Get services from service center (only chat, ai, billing, streaming — avoids ~90ms legacy hub)
+        # Get services from service center (only services declared in mainChatbot.REQUIRED_SERVICES)
         services = getChatbotServices(currentUser, mandateId=mandateId, featureInstanceId=featureInstanceId)
-        services.featureCode = 'chatbot'
 
-        # Config and model warm run in background task — return stream ~2–3 s faster for normal feel
-        chatbot_config = None
+        # Load instance config and apply allowedProviders for AI calls (conversation name + main chat)
+        chatbot_config = await _load_chatbot_config(featureInstanceId)
+        if chatbot_config.model.allowedProviders:
+            services.allowedProviders = chatbot_config.model.allowedProviders
+            logger.info(f"Chatbot instance {featureInstanceId}: restricting to providers {chatbot_config.model.allowedProviders}")
 
-        # Reuse hub's interfaceDbChat (ChatObjects) - avoids duplicate DB init
-        interfaceDbChat = services.interfaceDbChat
+        from modules.features.chatbot.interfaceFeatureChatbot import getInterface as getChatbotInterface
+        interfaceDbChat = getChatbotInterface(currentUser, mandateId=mandateId, featureInstanceId=featureInstanceId)
 
         # Create or load workflow (event_manager passed from route)
         if workflowId:
@@ -162,6 +164,10 @@ async def chatProcess(
             # Create event queue for new workflow (for streaming)
             event_manager.create_queue(workflow.id)
 
+        # Reload workflow to get current message count
+        workflow = interfaceDbChat.getWorkflow(workflow.id)
+        services.workflow = workflow  # Required for chat service document resolution
+
         # Process uploaded files and create ChatbotDocuments
         user_documents = []
         if userInput.listFileId and len(userInput.listFileId) > 0:
@@ -1204,49 +1210,45 @@ def _preflight_billing_check(services, mandateId: str, featureInstanceId: Option
     """
     Pre-flight billing check before starting chatbot AI processing.
     Raises if mandate has insufficient balance or no providers allowed.
+    Uses services.billing from service center (REQUIRED_SERVICES).
+    Exception types from BillingService class (service center billing API).
     """
-    from modules.services.serviceBilling.mainServiceBilling import (
-        getService as getBillingService,
-        InsufficientBalanceException,
-        ProviderNotAllowedException,
-        BillingContextError,
-    )
-    user = services.user
-    featureCode = "chatbot"
+    from modules.serviceCenter.services.serviceBilling import BillingService
+
+    billingService = services.billing
+    if not billingService:
+        raise BillingService.BillingContextError("Billing service not available for chatbot")
     try:
-        billingService = getBillingService(user, mandateId, featureInstanceId, featureCode)
         balanceCheck = billingService.checkBalance(0.01)
         if not balanceCheck.allowed:
-            raise InsufficientBalanceException(
+            raise BillingService.InsufficientBalanceException(
                 currentBalance=balanceCheck.currentBalance or 0.0,
                 requiredAmount=0.01,
                 message=f"Ungenuegendes Guthaben. Aktuell: CHF {balanceCheck.currentBalance:.2f}"
             )
         rbacAllowedProviders = billingService.getallowedProviders()
         if not rbacAllowedProviders:
-            raise ProviderNotAllowedException(
+            raise BillingService.ProviderNotAllowedException(
                 provider="any",
                 message="Keine AI-Provider fuer Ihre Rolle freigegeben. Kontaktieren Sie Ihren Administrator."
             )
-    except (InsufficientBalanceException, ProviderNotAllowedException):
+    except (BillingService.InsufficientBalanceException, BillingService.ProviderNotAllowedException):
         raise
     except Exception as e:
         logger.error(f"Billing pre-flight failed: {e}")
-        raise BillingContextError(f"Billing check failed: {e}")
+        raise BillingService.BillingContextError(f"Billing check failed: {e}")
 
 
 def _create_chatbot_billing_callback(services, workflow_id: str):
     """
     Create billing callback for AICenterChatModel. Records each AI call to poweron_billing.
+    Uses services.billing from service center (REQUIRED_SERVICES).
     """
-    from modules.services.serviceBilling.mainServiceBilling import getService as getBillingService
     from modules.datamodels.datamodelAi import AiCallResponse
 
-    user = services.user
-    mandateId = services.mandateId
-    featureInstanceId = getattr(services, "featureInstanceId", None)
-    featureCode = "chatbot"
-    billingService = getBillingService(user, mandateId, featureInstanceId, featureCode)
+    billingService = services.billing
+    if not billingService:
+        return lambda _: None  # No-op callback if billing unavailable
 
     def _billing_callback(response: AiCallResponse) -> None:
         if not response or getattr(response, "errorCount", 0) > 0:
@@ -1389,6 +1391,11 @@ async def _processChatbotMessageLangGraph(
         )
 
         # Create chatbot instance with config for dynamic tool configuration
+        chat_streaming_helper = None
+        if services.streaming:
+            chat_streaming_helper = services.streaming.getChatStreamingHelper()
+        if not chat_streaming_helper:
+            logger.warning("ChatStreamingHelper not available from streaming service; message normalization may fail")
         chatbot = await Chatbot.create(
             model=model,
             memory=memory,
@@ -1397,6 +1404,7 @@ async def _processChatbotMessageLangGraph(
             config=config,
             event_manager=event_manager,
             planner_model=planner_model,
+            chat_streaming_helper=chat_streaming_helper,
         )
 
         # Emit synthetic status for real-time UI feedback

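`_preflight_billing_check` re-raises its own domain exceptions untouched and converts every other failure into a single context error. A minimal sketch of that two-tier handling follows; the exception classes and the `preflight`, `outcome` and `broken_backend` helpers are hypothetical and do not reflect the `BillingService` API.

```python
class BillingContextError(Exception):
    """Hypothetical wrapper error for unexpected billing failures."""


class InsufficientBalanceError(Exception):
    """Hypothetical domain error that callers handle explicitly."""


def preflight(check_balance, minimum: float = 0.01) -> None:
    """Raise a domain error on low balance; wrap anything unexpected."""
    try:
        balance = check_balance()
        if balance < minimum:
            raise InsufficientBalanceError(f"balance {balance:.2f} below {minimum:.2f}")
    except InsufficientBalanceError:
        # Domain errors pass through unchanged so callers can branch on them.
        raise
    except Exception as exc:
        # Everything else is normalized into one error type, keeping the cause.
        raise BillingContextError(f"Billing check failed: {exc}") from exc


def outcome(check_balance) -> str:
    """Map the three possible results to a label, as a route handler might."""
    try:
        preflight(check_balance)
        return "ok"
    except InsufficientBalanceError:
        return "insufficient"
    except BillingContextError:
        return "context-error"


def broken_backend() -> float:
    raise ConnectionError("db down")  # simulated infrastructure failure
```

The order of the `except` clauses matters: the specific domain errors must be listed before the generic `Exception` handler, otherwise they would be wrapped as well.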
136
modules/serviceCenter/__init__.py
Normal file
@@ -0,0 +1,136 @@
+# Copyright (c) 2025 Patrick Motsch
+# All rights reserved.
+"""
+Service Center.
+Central registry for core and importable services with per-feature resolution.
+"""
+
+import logging
+from typing import Any, List, Optional
+
+from modules.serviceCenter.context import ServiceCenterContext
+from modules.serviceCenter.registry import (
+    CORE_SERVICES,
+    IMPORTABLE_SERVICES,
+    SERVICE_RBAC_OBJECTS,
+)
+from modules.serviceCenter.resolver import (
+    resolve,
+    get_resolution_cache,
+    clear_cache,
+)
+
+logger = logging.getLogger(__name__)
+
+
+def getService(
+    key: str,
+    context: ServiceCenterContext,
+    legacy_hub: Optional[Any] = None,
+) -> Any:
+    """
+    Get a service instance by key for the given context.
+
+    Args:
+        key: Service key (e.g., "web", "extraction", "utils")
+        context: ServiceCenterContext with user, mandate_id, feature_instance_id, workflow
+        legacy_hub: Optional legacy Services instance for fallback when service not yet migrated
+
+    Returns:
+        Service instance
+    """
+    cache = get_resolution_cache()
+    resolving = set()
+    return resolve(key, context, cache, resolving, legacy_hub=legacy_hub)
+
+
+def preWarm(service_keys: Optional[List[str]] = None) -> None:
+    """
+    Pre-load service modules at startup to avoid first-request latency.
+
+    Args:
+        service_keys: Optional list of keys to preload. If None, preloads all registered services.
+    """
+    import importlib
+
+    keys = service_keys or list(CORE_SERVICES.keys()) + list(IMPORTABLE_SERVICES.keys())
+    for key in keys:
+        spec = CORE_SERVICES.get(key) or IMPORTABLE_SERVICES.get(key)
+        if not spec:
+            continue
+        try:
+            importlib.import_module(spec["module"])
+            logger.debug(f"Pre-warmed service module: {key}")
+        except (ImportError, ModuleNotFoundError) as e:
+            logger.debug(f"Could not pre-warm {key}: {e}")
+
+
+def registerServiceObjects(catalogService) -> bool:
+    """Register service RBAC objects in the catalog. Call at startup."""
+    try:
+        for obj in SERVICE_RBAC_OBJECTS:
+            catalogService.registerResourceObject(
+                featureCode="system",
+                objectKey=obj["objectKey"],
+                label=obj["label"],
+                meta=obj.get("meta"),
+            )
+        logger.info(f"Registered {len(SERVICE_RBAC_OBJECTS)} service RBAC objects")
+        return True
+    except Exception as e:
+        logger.error(f"Failed to register service RBAC objects: {e}")
+        return False
+
+
+def can_access_service(
+    user,
+    rbac,
+    service_key: str,
+    mandate_id: Optional[str] = None,
+    feature_instance_id: Optional[str] = None,
+    allow_when_no_rbac: bool = True,
+) -> bool:
+    """
+    Check if user has permission to access the given service.
+
+    Args:
+        user: User object
+        rbac: RbacClass instance (e.g. from interfaceDbApp.rbac)
+        service_key: Service key (e.g., "web", "extraction")
+        mandate_id: Optional mandate context
+        feature_instance_id: Optional feature instance context
+        allow_when_no_rbac: If True, allow when rbac is None (migration/default)
+
+    Returns:
+        True if user has view permission on the service
+    """
+    if not rbac:
+        return allow_when_no_rbac
+    if service_key not in IMPORTABLE_SERVICES:
+        return False
+    obj = IMPORTABLE_SERVICES[service_key]
+    object_key = obj.get("objectKey")
+    if not object_key:
+        return False
+    from modules.datamodels.datamodelRbac import AccessRuleContext
+    permissions = rbac.getUserPermissions(
+        user,
+        AccessRuleContext.RESOURCE,
+        object_key,
+        mandateId=mandate_id,
+        featureInstanceId=feature_instance_id,
+    )
+    return permissions.view if permissions else False
+
+
+__all__ = [
+    "ServiceCenterContext",
+    "getService",
+    "preWarm",
+    "clear_cache",
+    "registerServiceObjects",
+    "can_access_service",
+    "SERVICE_RBAC_OBJECTS",
+    "CORE_SERVICES",
+    "IMPORTABLE_SERVICES",
+]

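`getService()` hands a `cache` and a fresh `resolving` set to `resolve()`, whose implementation lives in `resolver.py` and is not part of this excerpt. The sketch below shows one plausible way such a pair could be used, namely memoizing instances and detecting circular dependencies. The `REGISTRY` contents, the simplified `resolve` signature and the choice to key the cache by service key alone are assumptions for illustration, not the actual resolver.

```python
from typing import Any, Callable, Dict, Set

# Hypothetical registry: key -> factory that receives a getter for its dependencies.
REGISTRY: Dict[str, Callable[[Callable[[str], Any]], Any]] = {
    "utils": lambda get: {"name": "utils"},
    "web": lambda get: {"name": "web", "utils": get("utils")},
    "a": lambda get: get("b"),
    "b": lambda get: get("a"),  # deliberate cycle
}


def resolve(key: str, cache: Dict[str, Any], resolving: Set[str]) -> Any:
    """Return a cached instance, or build it while tracking in-progress keys."""
    if key in cache:
        return cache[key]
    if key in resolving:
        raise RuntimeError(f"Circular service dependency at '{key}'")
    if key not in REGISTRY:
        raise KeyError(f"Unknown service '{key}'")
    resolving.add(key)
    try:
        # Dependencies are resolved through the same cache and resolving set.
        instance = REGISTRY[key](lambda dep: resolve(dep, cache, resolving))
    finally:
        resolving.discard(key)
    cache[key] = instance
    return instance


cache: Dict[str, Any] = {}
web = resolve("web", cache, set())
```

A real resolver would probably scope the cache by user, mandate and feature instance as well, since services are constructed with a `ServiceCenterContext`; the sketch leaves that out.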
32
modules/serviceCenter/context.py
Normal file
@@ -0,0 +1,32 @@
+# Copyright (c) 2025 Patrick Motsch
+# All rights reserved.
+"""
+Service Center Context.
+Minimal context passed to services: user, mandate, feature instance, workflow.
+"""
+
+from dataclasses import dataclass
+from typing import Any, Optional
+
+from modules.datamodels.datamodelUam import User
+
+
+@dataclass
+class ServiceCenterContext:
+    """Context for service resolution: user, mandate, feature instance, optional workflow."""
+
+    user: User
+    mandate_id: Optional[str] = None
+    feature_instance_id: Optional[str] = None
+    workflow_id: Optional[str] = None
+    workflow: Any = None
+
+    @property
+    def mandateId(self) -> Optional[str]:
+        """Alias for mandate_id (backward compatibility)."""
+        return self.mandate_id
+
+    @property
+    def featureInstanceId(self) -> Optional[str]:
+        """Alias for feature_instance_id (backward compatibility)."""
+        return self.feature_instance_id

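The camelCase properties on `ServiceCenterContext` are read-only aliases, not dataclass fields. A minimal sketch of what that implies for callers, using a hypothetical reduced class `Ctx` with a plain string in place of the `User` model:

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Ctx:
    """Hypothetical reduced version of ServiceCenterContext."""

    user: str
    mandate_id: Optional[str] = None

    @property
    def mandateId(self) -> Optional[str]:
        # Read-only camelCase alias; not a field, so not accepted by __init__.
        return self.mandate_id


ctx = Ctx(user="alice", mandate_id="m-1")
field_names = [f.name for f in fields(Ctx)]
```

Construction therefore has to use the snake_case keywords, as `getChatbotServices()` does with `mandate_id=mandateId`; the alias only helps code that reads the value.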
3
modules/serviceCenter/core/__init__.py
Normal file
@@ -0,0 +1,3 @@
+# Copyright (c) 2025 Patrick Motsch
+# All rights reserved.
+"""Core services - internal building blocks, not requested by features."""

7
modules/serviceCenter/core/serviceSecurity/__init__.py
Normal file
@@ -0,0 +1,7 @@
+# Copyright (c) 2025 Patrick Motsch
+# All rights reserved.
+"""Security core service."""
+
+from .mainServiceSecurity import SecurityService
+
+__all__ = ["SecurityService"]

@@ -0,0 +1,81 @@
+# Copyright (c) 2025 Patrick Motsch
+# All rights reserved.
+"""
+Security service for token management operations.
+Core service - not requested by features directly.
+"""
+
+import logging
+from typing import Optional, Callable, Any
+
+from modules.datamodels.datamodelSecurity import Token
+from modules.auth import TokenManager
+
+logger = logging.getLogger(__name__)
+
+
+class SecurityService:
+    """Security service providing token management operations."""
+
+    def __init__(self, context: Any, get_service: Callable[[str], Any]):
+        """Initialize with service center context and resolver."""
+        self._context = context
+        self._get_service = get_service
+        self._tokenManager = TokenManager()
+        from modules.interfaces.interfaceDbApp import getInterface as getAppInterface
+        self._interfaceDbApp = getAppInterface(
+            context.user,
+            mandateId=context.mandate_id,
+        )
+
+    def getFreshToken(self, connectionId: str, secondsBeforeExpiry: int = 30 * 60) -> Optional[Token]:
+        """Get a fresh token for a connection, refreshing when expiring soon."""
+        try:
+            token = self._interfaceDbApp.getConnectionToken(connectionId)
+            if not token:
+                return None
+            return self._tokenManager.ensureFreshToken(
+                token,
+                secondsBeforeExpiry=secondsBeforeExpiry,
+                saveCallback=lambda t: self._interfaceDbApp.saveConnectionToken(t)
+            )
+        except Exception as e:
+            logger.error(f"getFreshToken: Error fetching or refreshing token for connection {connectionId}: {e}")
+            return None
+
+    def refreshToken(self, oldToken: Token) -> Optional[Token]:
+        """Refresh an expired token using the appropriate OAuth service."""
+        try:
+            return self._tokenManager.refreshToken(oldToken)
+        except Exception as e:
+            logger.error(f"refreshToken: Error refreshing token: {e}")
+            return None
+
+    def ensureFreshToken(self, token: Token, *, secondsBeforeExpiry: int = 30 * 60,
+                         saveCallback: Optional[Callable[[Token], None]] = None) -> Optional[Token]:
+        """Ensure a token is fresh; refresh if expiring within threshold."""
+        try:
+            return self._tokenManager.ensureFreshToken(
+                token,
+                secondsBeforeExpiry=secondsBeforeExpiry,
+                saveCallback=saveCallback
+            )
+        except Exception as e:
+            logger.error(f"ensureFreshToken: Error ensuring fresh token: {e}")
+            return None
+
+    def refreshMicrosoftToken(self, refreshToken: str, userId: str, oldToken: Token) -> Optional[Token]:
+        """Refresh Microsoft OAuth token using refresh token."""
+        try:
+            return self._tokenManager.refreshMicrosoftToken(refreshToken, userId, oldToken)
+        except Exception as e:
+            logger.error(f"refreshMicrosoftToken: Error refreshing Microsoft token: {e}")
+            return None
+
+    def refreshGoogleToken(self, refreshToken: str, userId: str, oldToken: Token) -> Optional[Token]:
+        """Refresh Google OAuth token using refresh token."""
+        try:
+            return self._tokenManager.refreshGoogleToken(refreshToken, userId, oldToken)
+        except Exception as e:
+            logger.error(f"refreshGoogleToken: Error refreshing Google token: {e}")
+            return None

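`SecurityService.ensureFreshToken` delegates to `TokenManager`, which is not shown in this diff. As a rough illustration of the "refresh if expiring within threshold" idea only, here is a sketch under stated assumptions: `DemoToken`, `ensure_fresh`, the `expires_at` field and the injected `refresh` and `save` callables are all hypothetical and say nothing about how `TokenManager` is implemented.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DemoToken:
    """Hypothetical token; the real Token model is not shown in this diff."""

    value: str
    expires_at: float  # Unix timestamp in seconds


def ensure_fresh(
    token: DemoToken,
    refresh: Callable[[DemoToken], Optional[DemoToken]],
    *,
    seconds_before_expiry: int = 30 * 60,
    save: Optional[Callable[[DemoToken], None]] = None,
    now: Optional[float] = None,
) -> Optional[DemoToken]:
    """Return the token unchanged while it is fresh; otherwise refresh and persist it."""
    current = time.time() if now is None else now
    if token.expires_at - current > seconds_before_expiry:
        return token
    refreshed = refresh(token)
    if refreshed is not None and save is not None:
        save(refreshed)  # persist only a successfully refreshed token
    return refreshed


saved = []
old = DemoToken("old", expires_at=1_000.0)


def renew(t: DemoToken) -> DemoToken:
    return DemoToken("new", expires_at=t.expires_at + 3_600)


still_fresh = ensure_fresh(old, renew, save=saved.append, now=0.0, seconds_before_expiry=600)
renewed = ensure_fresh(old, renew, save=saved.append, now=900.0, seconds_before_expiry=600)
```

The `now` parameter exists only to make the sketch deterministic; the 30-minute default mirrors the `secondsBeforeExpiry` default in the diff.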
9
modules/serviceCenter/core/serviceStreaming/__init__.py
Normal file
@@ -0,0 +1,9 @@
+# Copyright (c) 2025 Patrick Motsch
+# All rights reserved.
+"""Streaming core service for SSE event management."""
+
+from .eventManager import EventManager, get_event_manager
+from .helpers import ChatStreamingHelper
+from .mainServiceStreaming import StreamingService
+
+__all__ = ["EventManager", "get_event_manager", "ChatStreamingHelper", "StreamingService"]

158
modules/serviceCenter/core/serviceStreaming/eventManager.py
Normal file
@@ -0,0 +1,158 @@
+# Copyright (c) 2025 Patrick Motsch
+# All rights reserved.
+"""
+Event manager for SSE streaming.
+Manages event queues for Server-Sent Events (SSE) streaming across features.
+"""
+
+import logging
+import asyncio
+from typing import Dict, Optional, Any
+
+logger = logging.getLogger(__name__)
+
+
+class EventManager:
+    """
+    Manages event queues for SSE streaming.
+    Each workflow has its own async queue for events.
+    """
+
+    def __init__(self):
+        """Initialize the event manager."""
+        self._queues: Dict[str, asyncio.Queue] = {}
+        self._cleanup_tasks: Dict[str, asyncio.Task] = {}
+
+    def create_queue(self, workflow_id: str) -> asyncio.Queue:
+        """
+        Create an event queue for a workflow.
+
+        Args:
+            workflow_id: Workflow ID
+
+        Returns:
+            Async queue for events
+        """
+        if workflow_id not in self._queues:
+            self._queues[workflow_id] = asyncio.Queue()
+            logger.debug(f"Created event queue for workflow {workflow_id}")
+        return self._queues[workflow_id]
+
+    def get_queue(self, workflow_id: str) -> Optional[asyncio.Queue]:
+        """
+        Get the event queue for a workflow.
+
+        Args:
+            workflow_id: Workflow ID
+
+        Returns:
+            Async queue if exists, None otherwise
+        """
+        return self._queues.get(workflow_id)
+
+    def has_queue(self, workflow_id: str) -> bool:
+        """
+        Check if a queue exists for a workflow.
+
+        Args:
+            workflow_id: Workflow ID
+
+        Returns:
+            True if queue exists, False otherwise
+        """
+        return workflow_id in self._queues
+
+    async def emit_event(
+        self,
+        context_id: str,
+        event_type: str,
+        data: Dict[str, Any],
+        event_category: str = "chat",
+        message: Optional[str] = None,
+        step: Optional[str] = None
+    ) -> None:
+        """
+        Emit an event to the queue for a workflow.
+
+        Args:
+            context_id: Workflow ID (context)
+            event_type: Type of event (e.g., "chatdata", "complete", "error")
+            data: Event data dictionary
+            event_category: Category of event (e.g., "chat", "workflow")
+            message: Optional message string
+            step: Optional step identifier
+        """
+        queue = self._queues.get(context_id)
+        if not queue:
+            # DEBUG level: This is normal for background workflows without active SSE listener
+            return
+
+        event = {
+            "type": event_type,
+            "data": data,
+            "category": event_category,
+            "message": message,
+            "step": step
+        }
+
+        try:
+            await queue.put(event)
+            logger.debug(f"Emitted {event_type} event for workflow {context_id}")
+        except Exception as e:
+            logger.error(f"Error emitting event for workflow {context_id}: {e}", exc_info=True)
+
+    async def cleanup(self, workflow_id: str, delay: float = 60.0) -> None:
+        """
+        Schedule cleanup of a queue after a delay.
+
+        Args:
+            workflow_id: Workflow ID
+            delay: Delay in seconds before cleanup
+        """
+        # Cancel existing cleanup task if any
+        if workflow_id in self._cleanup_tasks:
+            self._cleanup_tasks[workflow_id].cancel()
+
+        async def _cleanup():
+            try:
+                await asyncio.sleep(delay)
+                if workflow_id in self._queues:
+                    # Drain remaining events
+                    queue = self._queues[workflow_id]
+                    while not queue.empty():
+                        try:
+                            queue.get_nowait()
+                        except asyncio.QueueEmpty:
+                            break
+
+                    # Remove queue
+                    del self._queues[workflow_id]
+                    logger.info(f"Cleaned up event queue for workflow {workflow_id}")
+            except asyncio.CancelledError:
+                logger.debug(f"Cleanup cancelled for workflow {workflow_id}")
+            except Exception as e:
+                logger.error(f"Error during cleanup for workflow {workflow_id}: {e}", exc_info=True)
+            finally:
+                if workflow_id in self._cleanup_tasks:
+                    del self._cleanup_tasks[workflow_id]
+
+        # Schedule cleanup
+        task = asyncio.create_task(_cleanup())
+        self._cleanup_tasks[workflow_id] = task
+
+
+# Global event manager instance
+_event_manager: Optional[EventManager] = None
+
+
+def get_event_manager() -> EventManager:
+    """
+    Get the global event manager instance.
+
+    Returns:
+        EventManager instance
+    """
+    global _event_manager
+    if _event_manager is None:
+        _event_manager = EventManager()
+    return _event_manager

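The `EventManager` above keeps one `asyncio.Queue` per workflow and drops events for workflows that have no queue yet. A minimal producer and consumer sketch of that behaviour follows; `MiniEventManager` and `demo` are hypothetical reductions written for this example, and the consuming loop only approximates what an SSE endpoint generator would do.

```python
import asyncio
from typing import Any, Dict, List, Optional


class MiniEventManager:
    """Hypothetical reduced version of EventManager: one asyncio.Queue per workflow."""

    def __init__(self) -> None:
        self._queues: Dict[str, asyncio.Queue] = {}

    def create_queue(self, workflow_id: str) -> asyncio.Queue:
        if workflow_id not in self._queues:
            self._queues[workflow_id] = asyncio.Queue()
        return self._queues[workflow_id]

    async def emit_event(self, workflow_id: str, event_type: str, data: Dict[str, Any]) -> None:
        queue: Optional[asyncio.Queue] = self._queues.get(workflow_id)
        if queue is None:
            return  # no listener registered: the event is dropped silently
        await queue.put({"type": event_type, "data": data})


async def demo() -> List[str]:
    manager = MiniEventManager()
    # Emitted before create_queue(), so this event never reaches a consumer.
    await manager.emit_event("wf-1", "chatdata", {"text": "lost"})
    queue = manager.create_queue("wf-1")
    await manager.emit_event("wf-1", "chatdata", {"text": "hello"})
    await manager.emit_event("wf-1", "complete", {})

    received: List[str] = []
    while True:  # consume until the terminal event arrives
        event = await queue.get()
        received.append(event["type"])
        if event["type"] == "complete":
            return received


received_types = asyncio.run(demo())
```

This is why `chatProcess` calls `event_manager.create_queue(workflow.id)` before any processing starts: events emitted earlier would be lost.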
242
modules/serviceCenter/core/serviceStreaming/helpers.py
Normal file
@@ -0,0 +1,242 @@
+# Copyright (c) 2025 Patrick Motsch
+# All rights reserved.
+"""Streaming helper utilities for chat message processing and normalization."""
+
+from __future__ import annotations
+
+from typing import Any, Dict, List, Literal, Mapping, Optional
+
+from langchain_core.messages import (
+    AIMessage,
+    BaseMessage,
+    HumanMessage,
+    SystemMessage,
+    ToolMessage,
+)
+
+Role = Literal["user", "assistant", "system", "tool"]
+
+
+class ChatStreamingHelper:
+    """Pure helper methods for streaming and message normalization.
+
+    This class provides static utility methods for converting between different
+    message formats, extracting content, and normalizing message structures
+    for streaming chat applications.
+    """
+
+    @staticmethod
+    def role_from_message(*, msg: BaseMessage) -> Role:
+        """Extract the role from a BaseMessage instance.
+
+        Args:
+            msg: The BaseMessage instance to extract the role from.
+
+        Returns:
+            The role as a string literal: "user", "assistant", "system", or "tool".
+            Defaults to "assistant" if the message type is not recognized.
+
+        Examples:
+            >>> from langchain_core.messages import HumanMessage
+            >>> msg = HumanMessage(content="Hello")
+            >>> ChatStreamingHelper.role_from_message(msg=msg)
+            'user'
+        """
+        if isinstance(msg, HumanMessage):
+            return "user"
+        if isinstance(msg, AIMessage):
+            return "assistant"
+        if isinstance(msg, SystemMessage):
+            return "system"
+        if isinstance(msg, ToolMessage):
+            return "tool"
+        return getattr(msg, "role", "assistant")
+
+    @staticmethod
+    def flatten_content(*, content: Any) -> str:
+        """Convert complex content structures to plain text.
+
+        This method handles various content formats including strings, lists of
+        content parts, and dictionaries with text fields. It's designed to
+        normalize content from different message sources into a consistent
+        plain text format.
+
+        Args:
+            content: The content to flatten. Can be:
+                - str: Returned as-is after stripping whitespace
+                - list: Each item processed and joined with newlines
+                - dict: Text extracted from "text" or "content" fields
+                - None: Returns empty string
+                - Any other type: Converted to string
+
+        Returns:
+            The flattened content as a plain text string with whitespace stripped.
+
+        Examples:
+            >>> content = [{"type": "text", "text": "Hello"}, {"type": "text", "text": "world"}]
+            >>> ChatStreamingHelper.flatten_content(content=content)
+            'Hello\nworld'
+
+            >>> content = {"text": "Simple message"}
+            >>> ChatStreamingHelper.flatten_content(content=content)
+            'Simple message'
+        """
+        if content is None:
+            return ""
+        if isinstance(content, str):
+            return content.strip()
+        if isinstance(content, list):
+            parts: List[str] = []
+            for part in content:
+                if isinstance(part, dict):
+                    if "text" in part and isinstance(part["text"], str):
+                        parts.append(part["text"])
+                    elif part.get("type") == "text" and isinstance(
+                        part.get("text"), str
+                    ):
+                        parts.append(part["text"])
+                    elif "content" in part and isinstance(part["content"], str):
+                        parts.append(part["content"])
+                    else:
+                        # Fallback for unknown dictionary structures
+                        val = part.get("value")
+                        if isinstance(val, str):
+                            parts.append(val)
+                else:
+                    parts.append(str(part))
+            return "\n".join(p.strip() for p in parts if p is not None)
+        if isinstance(content, dict):
+            if "text" in content and isinstance(content["text"], str):
+                return content["text"].strip()
+            if "content" in content and isinstance(content["content"], str):
+                return content["content"].strip()
+        return str(content).strip()
+
+    @staticmethod
+    def message_to_dict(*, msg: BaseMessage) -> Dict[str, Any]:
+        """Convert a BaseMessage instance to a dictionary for streaming output.
+
+        This method normalizes BaseMessage instances into a consistent dictionary
+        format suitable for JSON serialization and streaming to clients.
+
+        Args:
+            msg: The BaseMessage instance to convert.
+
+        Returns:
|
||||
A dictionary containing:
|
||||
- "role": The message role (user, assistant, system, tool)
|
||||
- "content": The flattened message content as plain text
|
||||
- "tool_calls": Tool calls if present (optional)
|
||||
- "name": Message name if present (optional)
|
||||
|
||||
Examples:
|
||||
>>> from langchain_core.messages import HumanMessage
|
||||
>>> msg = HumanMessage(content="Hello there")
|
||||
>>> result = ChatStreamingHelper.message_to_dict(msg=msg)
|
||||
>>> result["role"]
|
||||
'user'
|
||||
>>> result["content"]
|
||||
'Hello there'
|
||||
"""
|
||||
payload: Dict[str, Any] = {
|
||||
"role": ChatStreamingHelper.role_from_message(msg=msg),
|
||||
"content": ChatStreamingHelper.flatten_content(
|
||||
content=getattr(msg, "content", "")
|
||||
),
|
||||
}
|
||||
tool_calls = getattr(msg, "tool_calls", None)
|
||||
if tool_calls:
|
||||
payload["tool_calls"] = tool_calls
|
||||
name = getattr(msg, "name", None)
|
||||
if name:
|
||||
payload["name"] = name
|
||||
return payload
|
||||
|
||||
@staticmethod
|
||||
def dict_message_to_dict(*, obj: Mapping[str, Any]) -> Dict[str, Any]:
|
||||
"""Convert a dictionary-shaped message to a normalized dictionary.
|
||||
|
||||
This method handles messages that come from serialized state and are
|
||||
represented as dictionaries rather than BaseMessage instances. It
|
||||
normalizes various dictionary formats into a consistent structure.
|
||||
|
||||
Args:
|
||||
obj: The dictionary-shaped message to convert. Expected to contain
|
||||
fields like "role", "type", "content", "text", etc.
|
||||
|
||||
Returns:
|
||||
A normalized dictionary containing:
|
||||
- "role": The message role (user, assistant, system, tool)
|
||||
- "content": The flattened message content as plain text
|
||||
- "tool_calls": Tool calls if present (optional)
|
||||
- "name": Message name if present (optional)
|
||||
|
||||
Examples:
|
||||
>>> obj = {"type": "human", "content": "Hello"}
|
||||
>>> result = ChatStreamingHelper.dict_message_to_dict(obj=obj)
|
||||
>>> result["role"]
|
||||
'user'
|
||||
>>> result["content"]
|
||||
'Hello'
|
||||
"""
|
||||
role: Optional[str] = obj.get("role")
|
||||
if not role:
|
||||
# Handle alternative type field mappings
|
||||
typ = obj.get("type")
|
||||
if typ in ("human", "user"):
|
||||
role = "user"
|
||||
elif typ in ("ai", "assistant"):
|
||||
role = "assistant"
|
||||
elif typ in ("system",):
|
||||
role = "system"
|
||||
elif typ in ("tool", "function"):
|
||||
role = "tool"
|
||||
|
||||
content = obj.get("content")
|
||||
if content is None and "text" in obj:
|
||||
content = obj["text"]
|
||||
|
||||
out: Dict[str, Any] = {
|
||||
"role": role or "assistant",
|
||||
"content": ChatStreamingHelper.flatten_content(content=content),
|
||||
}
|
||||
if "tool_calls" in obj:
|
||||
out["tool_calls"] = obj["tool_calls"]
|
||||
if obj.get("name"):
|
||||
out["name"] = obj["name"]
|
||||
return out
|
||||
|
||||
@staticmethod
|
||||
def extract_messages_from_output(*, output_obj: Any) -> List[Any]:
|
||||
"""Extract messages from LangGraph output objects.
|
||||
|
||||
This method handles various output formats from LangGraph execution,
|
||||
extracting the messages list from different possible structures.
|
||||
|
||||
Args:
|
||||
output_obj: The output object from LangGraph execution. Can be:
|
||||
- An object with a "messages" attribute
|
||||
- A dictionary with a "messages" key
|
||||
- Any other object (returns empty list)
|
||||
|
||||
Returns:
|
||||
A list of extracted messages, or an empty list if no messages
|
||||
are found or if the output object is None.
|
||||
|
||||
Examples:
|
||||
>>> output = {"messages": [{"role": "user", "content": "Hello"}]}
|
||||
>>> messages = ChatStreamingHelper.extract_messages_from_output(output_obj=output)
|
||||
>>> len(messages)
|
||||
1
|
||||
"""
|
||||
if output_obj is None:
|
||||
return []
|
||||
|
||||
# Try to parse dicts first
|
||||
if isinstance(output_obj, dict):
|
||||
msgs = output_obj.get("messages")
|
||||
return msgs if isinstance(msgs, list) else []
|
||||
|
||||
# Then try to get messages attribute
|
||||
msgs = getattr(output_obj, "messages", None)
|
||||
return msgs if isinstance(msgs, list) else []
|
||||
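The flattening rules documented in `flatten_content` can be exercised in isolation. The following is a minimal standalone sketch, not the committed implementation: `flatten` is a hypothetical name, and the LangChain dependency is dropped. Because this diff view does not preserve indentation, the sketch assumes that dict parts without a string `text`, `content`, or `value` field are skipped, while non-dict list items are stringified.

```python
from typing import Any, List


def flatten(content: Any) -> str:
    """Standalone approximation of the flatten_content rules (hypothetical helper)."""
    if content is None:
        return ""
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        parts: List[str] = []
        for part in content:
            if isinstance(part, dict):
                # Field priority: "text", then "content", then "value".
                for field in ("text", "content", "value"):
                    value = part.get(field)
                    if isinstance(value, str):
                        parts.append(value)
                        break
                # Assumption: dict parts with none of these string fields are skipped.
            else:
                parts.append(str(part))
        return "\n".join(p.strip() for p in parts)
    if isinstance(content, dict):
        for field in ("text", "content"):
            value = content.get(field)
            if isinstance(value, str):
                return value.strip()
    # Anything else (numbers, unknown dict shapes) is stringified.
    return str(content).strip()


if __name__ == "__main__":
    print(flatten([{"type": "text", "text": "Hello"}, {"type": "text", "text": " world "}]))
```

Keeping the field priority in one tuple makes the lookup order explicit, which may be easier to review than a chain of `elif` branches.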
@ -0,0 +1,31 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Streaming service for SSE event management.
Core service - not requested by features directly.
"""

import logging
from typing import Any, Callable

from modules.serviceCenter.core.serviceStreaming.eventManager import EventManager, get_event_manager
from modules.serviceCenter.core.serviceStreaming.helpers import ChatStreamingHelper

logger = logging.getLogger(__name__)


class StreamingService:
    """Streaming service providing access to SSE event infrastructure."""

    def __init__(self, context: Any, get_service: Callable[[str], Any]):
        """Initialize with service center context and resolver."""
        self._context = context
        self._get_service = get_service

    def getEventManager(self) -> EventManager:
        """Get the global event manager instance for SSE streaming."""
        return get_event_manager()

    def getChatStreamingHelper(self):
        """Get ChatStreamingHelper utility for message normalization (no legacy import at call site)."""
        return ChatStreamingHelper
7
modules/serviceCenter/core/serviceUtils/__init__.py
Normal file
@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Utils core service."""

from .mainServiceUtils import UtilsService

__all__ = ["UtilsService"]
185
modules/serviceCenter/core/serviceUtils/mainServiceUtils.py
Normal file
|
|
@ -0,0 +1,185 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Utility service for common operations across the gateway.
|
||||
Provides centralized access to configuration, events, and other utilities.
|
||||
Core service - not requested by features directly.
|
||||
"""
|
||||
|
||||
import logging
|
||||
from typing import Any, Optional, Dict, Callable, List
|
||||
from modules.shared.configuration import APP_CONFIG
|
||||
from modules.shared.eventManagement import eventManager
|
||||
from modules.shared.timeUtils import getUtcTimestamp
|
||||
from modules.shared import jsonUtils
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class UtilsService:
|
||||
"""Utility service providing common operations."""
|
||||
|
||||
def __init__(self, context, get_service: Callable[[str], Any]):
|
||||
"""Initialize with service center context and resolver."""
|
||||
self._context = context
|
||||
self._get_service = get_service
|
||||
|
||||
# ===== Event handling =====
|
||||
|
||||
def eventRegisterCron(self, job_id: str, func: Callable, cron_kwargs: Dict[str, Any],
|
||||
replace_existing: bool = True, coalesce: bool = True,
|
||||
max_instances: int = 1, misfire_grace_time: int = 1800):
|
||||
"""Register a cron job with the event manager."""
|
||||
try:
|
||||
eventManager.registerCron(
|
||||
jobId=job_id,
|
||||
func=func,
|
||||
cronKwargs=cron_kwargs,
|
||||
replaceExisting=replace_existing,
|
||||
coalesce=coalesce,
|
||||
maxInstances=max_instances,
|
||||
misfireGraceTime=misfire_grace_time
|
||||
)
|
||||
logger.info(f"Registered cron job '{job_id}' with schedule: {cron_kwargs}")
|
||||
except Exception as e:
|
||||
logger.error(f"Error registering cron job '{job_id}': {str(e)}")
|
||||
|
||||
def eventRegisterInterval(self, job_id: str, func: Callable, seconds: Optional[int] = None,
|
||||
minutes: Optional[int] = None, hours: Optional[int] = None,
|
||||
replace_existing: bool = True, coalesce: bool = True,
|
||||
max_instances: int = 1, misfire_grace_time: int = 1800):
|
||||
"""Register an interval job with the event manager."""
|
||||
try:
|
||||
eventManager.registerInterval(
|
||||
jobId=job_id,
|
||||
func=func,
|
||||
seconds=seconds,
|
||||
minutes=minutes,
|
||||
hours=hours,
|
||||
replaceExisting=replace_existing,
|
||||
coalesce=coalesce,
|
||||
maxInstances=max_instances,
|
||||
misfireGraceTime=misfire_grace_time
|
||||
)
|
||||
logger.info(f"Registered interval job '{job_id}' (h={hours}, m={minutes}, s={seconds})")
|
||||
except Exception as e:
|
||||
logger.error(f"Error registering interval job '{job_id}': {str(e)}")
|
||||
|
||||
def eventRemove(self, job_id: str):
|
||||
"""Remove a scheduled job from the event manager."""
|
||||
try:
|
||||
eventManager.remove(job_id)
|
||||
logger.info(f"Removed job '{job_id}'")
|
||||
except Exception as e:
|
||||
logger.error(f"Error removing job '{job_id}': {str(e)}")
|
||||
|
||||
def configGet(self, key: str, default: Any = None, user_id: str = "system") -> Any:
|
||||
"""Get a configuration value with optional default."""
|
||||
try:
|
||||
return APP_CONFIG.get(key, default, user_id)
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting config '{key}': {str(e)}")
|
||||
return default
|
||||
|
||||
def timestampGetUtc(self) -> float:
|
||||
"""Get current UTC timestamp."""
|
||||
try:
|
||||
return getUtcTimestamp()
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting UTC timestamp: {str(e)}")
|
||||
return 0.0
|
||||
|
||||
# ===== Debug Tools =====
|
||||
|
||||
def writeDebugFile(self, content: str, fileType: str, documents: Optional[List] = None) -> None:
|
||||
"""Wrapper to write debug files via shared debugLogger."""
|
||||
try:
|
||||
from modules.shared.debugLogger import writeDebugFile as _writeDebugFile
|
||||
_writeDebugFile(content, fileType, documents)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
def debugLogToFile(self, message: str, context: str = "DEBUG"):
|
||||
"""Wrapper to log debug messages via shared debugLogger."""
|
||||
try:
|
||||
from modules.shared.debugLogger import debugLogToFile as _debugLogToFile
|
||||
_debugLogToFile(message, context)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
def storeDebugMessageAndDocuments(self, message, currentUser, mandateId=None, featureInstanceId=None):
|
||||
"""Wrapper to store debug messages and documents via interfaceDbChat."""
|
||||
try:
|
||||
from modules.interfaces.interfaceDbChat import storeDebugMessageAndDocuments as _storeDebugMessageAndDocuments
|
||||
_storeDebugMessageAndDocuments(message, currentUser, mandateId=mandateId, featureInstanceId=featureInstanceId)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
def writeDebugArtifact(self, fileName: str, obj: Any):
|
||||
"""Backward-compatible wrapper that now writes via writeDebugFile."""
|
||||
try:
|
||||
import json
|
||||
if isinstance(obj, (dict, list)):
|
||||
content = json.dumps(obj, ensure_ascii=False, indent=2)
|
||||
else:
|
||||
content = str(obj)
|
||||
from modules.shared.debugLogger import writeDebugFile as _writeDebugFile
|
||||
_writeDebugFile(content, fileName)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# ===== Prompt sanitization =====
|
||||
|
||||
def sanitizePromptContent(self, content: str, contentType: str = "text") -> str:
|
||||
"""Centralized prompt content sanitization."""
|
||||
if not content:
|
||||
return ""
|
||||
try:
|
||||
import re
|
||||
content_str = str(content)
|
||||
sanitized = re.sub(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]', '', content_str)
|
||||
if contentType == "userinput":
|
||||
sanitized = sanitized.replace('{', '{{').replace('}', '}}')
|
||||
sanitized = sanitized.replace('"', '\\"').replace("'", "\\'")
|
||||
return f"'{sanitized}'"
|
||||
elif contentType == "json":
|
||||
sanitized = sanitized.replace('\\', '\\\\').replace('"', '\\"')
|
||||
sanitized = sanitized.replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t')
|
||||
elif contentType == "document":
|
||||
sanitized = sanitized.replace('\\', '\\\\').replace('"', '\\"').replace("'", "\\'")
|
||||
sanitized = sanitized.replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t')
|
||||
else:
|
||||
sanitized = sanitized.replace('\\', '\\\\').replace('"', '\\"').replace("'", "\\'")
|
||||
sanitized = sanitized.replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t')
|
||||
return sanitized
|
||||
except Exception as e:
|
||||
logger.error(f"Error sanitizing prompt content: {str(e)}")
|
||||
return "[ERROR: Content could not be safely sanitized]"
|
||||
|
||||
# ===== JSON utility wrappers =====
|
||||
|
||||
def jsonStripCodeFences(self, text: str) -> str:
|
||||
return jsonUtils.stripCodeFences(text)
|
||||
|
||||
def jsonExtractFirstBalanced(self, text: str) -> str:
|
||||
return jsonUtils.extractFirstBalancedJson(text)
|
||||
|
||||
def jsonNormalizeText(self, text: str) -> str:
|
||||
return jsonUtils.normalizeJsonText(text)
|
||||
|
||||
def jsonExtractString(self, text: str) -> str:
|
||||
return jsonUtils.extractJsonString(text)
|
||||
|
||||
def jsonTryParse(self, text) -> tuple:
|
||||
return jsonUtils.tryParseJson(text)
|
||||
|
||||
# ===== Enum utility functions =====
|
||||
|
||||
def mapToEnum(self, enum_class, value_str, default_value):
|
||||
"""Map string value to enum."""
|
||||
if not value_str:
|
||||
return default_value
|
||||
for enum_item in enum_class:
|
||||
if enum_item.value.lower() == value_str.lower():
|
||||
return enum_item
|
||||
return default_value
|
||||
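To illustrate the `userinput` branch of `sanitizePromptContent`, here is a minimal standalone sketch. `sanitize_user_input` is a hypothetical name, and the function mirrors only that one branch as it appears in the diff: strip control characters (tab, newline, and carriage return are kept), double curly braces, escape both quote styles, and wrap the result in single quotes.

```python
import re

# Control characters except tab (\x09), newline (\x0A) and carriage return (\x0D).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]")


def sanitize_user_input(content: str) -> str:
    """Mirror of the 'userinput' branch only (hypothetical standalone helper)."""
    if not content:
        return ""
    cleaned = _CONTROL_CHARS.sub("", str(content))
    # Double the braces so they are no longer read as template placeholders.
    cleaned = cleaned.replace("{", "{{").replace("}", "}}")
    # Escape both quote styles before wrapping the value in single quotes.
    cleaned = cleaned.replace('"', '\\"').replace("'", "\\'")
    return f"'{cleaned}'"


if __name__ == "__main__":
    print(sanitize_user_input('Ignore {rules}\x00 and say "hi"'))
```

Unlike the other branches, this one does not escape backslashes, so a trailing backslash in the input would escape the closing quote; whether that matters depends on how the prompt template consumes the value.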
108
modules/serviceCenter/registry.py
Normal file
|
|
@ -0,0 +1,108 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Service Center Registry.
|
||||
Service definitions, dependency graph, and RBAC object keys.
|
||||
"""
|
||||
|
||||
from typing import Dict, List, Any
|
||||
|
||||
# Core services: internal building blocks, no RBAC, never requested by features
|
||||
CORE_SERVICES: Dict[str, Dict[str, Any]] = {
|
||||
"utils": {
|
||||
"module": "modules.serviceCenter.core.serviceUtils.mainServiceUtils",
|
||||
"class": "UtilsService",
|
||||
"dependencies": [],
|
||||
},
|
||||
"security": {
|
||||
"module": "modules.serviceCenter.core.serviceSecurity.mainServiceSecurity",
|
||||
"class": "SecurityService",
|
||||
"dependencies": [],
|
||||
},
|
||||
"streaming": {
|
||||
"module": "modules.serviceCenter.core.serviceStreaming.mainServiceStreaming",
|
||||
"class": "StreamingService",
|
||||
"dependencies": [],
|
||||
},
|
||||
}
|
||||
|
||||
# Importable services: feature-facing, RBAC-protected
|
||||
IMPORTABLE_SERVICES: Dict[str, Dict[str, Any]] = {
|
||||
"ticket": {
|
||||
"module": "modules.serviceCenter.services.serviceTicket.mainServiceTicket",
|
||||
"class": "TicketService",
|
||||
"dependencies": [],
|
||||
"objectKey": "service.ticket",
|
||||
"label": {"en": "Ticket System", "de": "Ticket-System", "fr": "Système de tickets"},
|
||||
},
|
||||
"messaging": {
|
||||
"module": "modules.serviceCenter.services.serviceMessaging.mainServiceMessaging",
|
||||
"class": "MessagingService",
|
||||
"dependencies": [],
|
||||
"objectKey": "service.messaging",
|
||||
"label": {"en": "Messaging", "de": "Nachrichten", "fr": "Messagerie"},
|
||||
},
|
||||
"billing": {
|
||||
"module": "modules.serviceCenter.services.serviceBilling.mainServiceBilling",
|
||||
"class": "BillingService",
|
||||
"dependencies": [],
|
||||
"objectKey": "service.billing",
|
||||
"label": {"en": "Billing", "de": "Abrechnung", "fr": "Facturation"},
|
||||
},
|
||||
"sharepoint": {
|
||||
"module": "modules.serviceCenter.services.serviceSharepoint.mainServiceSharepoint",
|
||||
"class": "SharepointService",
|
||||
"dependencies": ["security"],
|
||||
"objectKey": "service.sharepoint",
|
||||
"label": {"en": "SharePoint", "de": "SharePoint", "fr": "SharePoint"},
|
||||
},
|
||||
"chat": {
|
||||
"module": "modules.serviceCenter.services.serviceChat.mainServiceChat",
|
||||
"class": "ChatService",
|
||||
"dependencies": ["utils"],
|
||||
"objectKey": "service.chat",
|
||||
"label": {"en": "Chat", "de": "Chat", "fr": "Chat"},
|
||||
},
|
||||
"extraction": {
|
||||
"module": "modules.serviceCenter.services.serviceExtraction.mainServiceExtraction",
|
||||
"class": "ExtractionService",
|
||||
"dependencies": ["chat", "utils"],
|
||||
"objectKey": "service.extraction",
|
||||
"label": {"en": "Extraction", "de": "Extraktion", "fr": "Extraction"},
|
||||
},
|
||||
"generation": {
|
||||
"module": "modules.serviceCenter.services.serviceGeneration.mainServiceGeneration",
|
||||
"class": "GenerationService",
|
||||
"dependencies": ["utils", "chat"],
|
||||
"objectKey": "service.generation",
|
||||
"label": {"en": "Generation", "de": "Generierung", "fr": "Génération"},
|
||||
},
|
||||
"ai": {
|
||||
"module": "modules.serviceCenter.services.serviceAi.mainServiceAi",
|
||||
"class": "AiService",
|
||||
"dependencies": ["chat", "utils", "extraction", "billing"],
|
||||
"objectKey": "service.ai",
|
||||
"label": {"en": "AI", "de": "KI", "fr": "IA"},
|
||||
},
|
||||
"web": {
|
||||
"module": "modules.serviceCenter.services.serviceWeb.mainServiceWeb",
|
||||
"class": "WebService",
|
||||
"dependencies": ["ai", "chat", "utils"],
|
||||
"objectKey": "service.web",
|
||||
"label": {"en": "Web Research", "de": "Web-Recherche", "fr": "Recherche Web"},
|
||||
},
|
||||
"neutralization": {
|
||||
"module": "modules.serviceCenter.services.serviceNeutralization.mainServiceNeutralization",
|
||||
"class": "NeutralizationService",
|
||||
"dependencies": ["extraction", "generation"],
|
||||
"objectKey": "service.neutralization",
|
||||
"label": {"en": "Neutralization", "de": "Neutralisierung", "fr": "Neutralisation"},
|
||||
},
|
||||
}
|
||||
|
||||
# RBAC objects for service-level access control (for catalog registration)
|
||||
SERVICE_RBAC_OBJECTS: List[Dict[str, Any]] = [
|
||||
{"objectKey": s["objectKey"], "label": s["label"]}
|
||||
for s in IMPORTABLE_SERVICES.values()
|
||||
if "objectKey" in s
|
||||
]
|
||||
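The `dependencies` lists in the registry form a directed graph, and it can help to check that graph for cycles before the resolver meets one at runtime. The sketch below is an assumption-laden illustration rather than part of the commit: `DEPENDENCIES` is a hand-copied subset of the registry (keys and dependency lists only), `resolution_order` is a hypothetical helper, and it relies on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Dependency lists copied by hand from CORE_SERVICES and IMPORTABLE_SERVICES.
DEPENDENCIES: Dict[str, List[str]] = {
    "utils": [],
    "security": [],
    "streaming": [],
    "ticket": [],
    "messaging": [],
    "billing": [],
    "sharepoint": ["security"],
    "chat": ["utils"],
    "extraction": ["chat", "utils"],
    "generation": ["utils", "chat"],
    "ai": ["chat", "utils", "extraction", "billing"],
    "web": ["ai", "chat", "utils"],
    "neutralization": ["extraction", "generation"],
}


def resolution_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return service keys so that every dependency precedes its dependents.

    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(resolution_order(DEPENDENCIES))
```

A check like this could run in a unit test against the real registry dictionaries, so that a newly added circular dependency fails fast instead of surfacing as a `RuntimeError` on first resolution.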
170
modules/serviceCenter/resolver.py
Normal file
|
|
@ -0,0 +1,170 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Service Center Resolver.
|
||||
Resolution logic, dependency injection, and optional legacy fallback.
|
||||
"""
|
||||
|
||||
import importlib
|
||||
import logging
|
||||
from typing import Any, Callable, Dict, Optional, Set
|
||||
|
||||
from modules.serviceCenter.context import ServiceCenterContext
|
||||
from modules.serviceCenter.registry import CORE_SERVICES, IMPORTABLE_SERVICES
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Type for get_service callable passed to services
|
||||
GetServiceFunc = Callable[[str], Any]
|
||||
|
||||
|
||||
def _make_context_id(ctx: ServiceCenterContext) -> str:
|
||||
"""Create a per-process cache key from context (based on id(ctx.user), so it is only valid while that user object is alive)."""
|
||||
return f"{id(ctx.user)}_{ctx.mandate_id or ''}_{ctx.feature_instance_id or ''}"
|
||||
|
||||
|
||||
def _load_service_class(module_path: str, class_name: str):
|
||||
"""Load service class from module."""
|
||||
module = importlib.import_module(module_path)
|
||||
return getattr(module, class_name)
|
||||
|
||||
|
||||
def _create_legacy_hub(ctx: ServiceCenterContext) -> Any:
|
||||
"""Create legacy Services instance for fallback when service not yet migrated."""
|
||||
from modules.services import getInterface
|
||||
return getInterface(
|
||||
ctx.user,
|
||||
workflow=ctx.workflow,
|
||||
mandateId=ctx.mandate_id,
|
||||
featureInstanceId=ctx.feature_instance_id,
|
||||
)
|
||||
|
||||
|
||||
def _get_from_legacy(legacy_hub: Any, key: str) -> Any:
|
||||
"""Map service key to legacy hub attribute (for fallback when service center module fails)."""
|
||||
key_to_attr = {
|
||||
"utils": "utils",
|
||||
"security": "security",
|
||||
"streaming": "streaming",
|
||||
"ticket": "ticket",
|
||||
"messaging": "messaging",
|
||||
"billing": "billing",
|
||||
"sharepoint": "sharepoint",
|
||||
"chat": "chat",
|
||||
"extraction": "extraction",
|
||||
"generation": "generation",
|
||||
"ai": "ai",
|
||||
"web": "web",
|
||||
"neutralization": "neutralization",
|
||||
}
|
||||
attr = key_to_attr.get(key)
|
||||
if attr and hasattr(legacy_hub, attr):
|
||||
return getattr(legacy_hub, attr)
|
||||
return None
|
||||
|
||||
|
||||
def resolve(
|
||||
key: str,
|
||||
context: ServiceCenterContext,
|
||||
cache: Dict[str, Any],
|
||||
resolving: Set[str],
|
||||
legacy_hub: Optional[Any] = None,
|
||||
) -> Any:
|
||||
"""
|
||||
Resolve a service by key. Uses cache, resolves dependencies recursively.
|
||||
Falls back to legacy_hub if service module cannot be loaded.
|
||||
"""
|
||||
cache_key = f"{_make_context_id(context)}_{key}"
|
||||
if cache_key in cache:
|
||||
return cache[cache_key]
|
||||
|
||||
if key in resolving:
|
||||
raise RuntimeError(f"Circular dependency detected for service: {key}")
|
||||
|
||||
def get_service(dep_key: str) -> Any:
|
||||
return resolve(dep_key, context, cache, resolving, legacy_hub)
|
||||
|
||||
# Try core first
|
||||
if key in CORE_SERVICES:
|
||||
spec = CORE_SERVICES[key]
|
||||
try:
|
||||
cls = _load_service_class(spec["module"], spec["class"])
|
||||
resolving.add(key)
|
||||
try:
|
||||
for dep in spec.get("dependencies", []):
|
||||
get_service(dep)
|
||||
finally:
|
||||
resolving.discard(key)
|
||||
instance = cls(context, get_service)
|
||||
cache[cache_key] = instance
|
||||
return instance
|
||||
except (ImportError, ModuleNotFoundError, AttributeError) as e:
|
||||
logger.debug(f"Could not load core service '{key}' from service center: {e}")
|
||||
if legacy_hub:
|
||||
fallback = _get_from_legacy(legacy_hub, key)
|
||||
if fallback is not None:
|
||||
cache[cache_key] = fallback
|
||||
return fallback
|
||||
raise
|
||||
|
||||
# Try importable
|
||||
if key in IMPORTABLE_SERVICES:
|
||||
spec = IMPORTABLE_SERVICES[key]
|
||||
try:
|
||||
cls = _load_service_class(spec["module"], spec["class"])
|
||||
resolving.add(key)
|
||||
try:
|
||||
for dep in spec.get("dependencies", []):
|
||||
get_service(dep)
|
||||
finally:
|
||||
resolving.discard(key)
|
||||
instance = cls(context, get_service)
|
||||
cache[cache_key] = instance
|
||||
return instance
|
||||
except (ImportError, ModuleNotFoundError, AttributeError) as e:
|
||||
logger.debug(f"Could not load importable service '{key}' from service center: {e}")
|
||||
if legacy_hub:
|
||||
fallback = _get_from_legacy(legacy_hub, key)
|
||||
if fallback is not None:
|
||||
cache[cache_key] = fallback
|
||||
return fallback
|
||||
raise
|
||||
|
||||
if legacy_hub:
|
||||
fallback = _get_from_legacy(legacy_hub, key)
|
||||
if fallback is not None:
|
||||
cache[cache_key] = fallback
|
||||
return fallback
|
||||
|
||||
raise KeyError(f"Unknown service: {key}")
|
||||
|
||||
|
||||
# Module-level cache for service instances (per context)
|
||||
_resolution_cache: Dict[str, Any] = {}
|
||||
_cache_lock: Optional[Any] = None
|
||||
|
||||
try:
|
||||
from threading import Lock
|
||||
_cache_lock = Lock()
|
||||
except ImportError:
|
||||
pass
|
||||
|
||||
|
||||
def get_resolution_cache() -> Dict[str, Any]:
|
||||
"""Get the module-level resolution cache (for preWarm/clear)."""
|
||||
return _resolution_cache
|
||||
|
||||
|
||||
def clear_cache() -> None:
|
||||
"""Clear the resolution cache."""
|
||||
lock = _cache_lock if _cache_lock is not None else _DummyLock()
|
||||
with lock:
|
||||
_resolution_cache.clear()
|
||||
|
||||
|
||||
class _DummyLock:
|
||||
def __enter__(self):
|
||||
return self
|
||||
|
||||
def __exit__(self, *args):
|
||||
pass
|
||||
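The core of `resolve` (cache lookup, cycle guard, dependency-first construction) can be shown without the surrounding infrastructure. This is a simplified sketch under stated assumptions: `DemoService`, `make_spec`, and `SPECS` are hypothetical stand-ins, specs hold a factory callable instead of a module path, and the context-scoped cache key and the legacy-hub fallback are deliberately left out.

```python
from typing import Any, Callable, Dict, List, Set


class DemoService:
    """Stand-in for a real service; stores its name and the resolver callable."""

    def __init__(self, name: str, get_service: Callable[[str], Any]):
        self.name = name
        self.get_service = get_service


def make_spec(name: str, dependencies: List[str]) -> Dict[str, Any]:
    """Build a registry-like spec whose factory creates a DemoService."""
    return {
        "factory": lambda get_service: DemoService(name, get_service),
        "dependencies": dependencies,
    }


def resolve(key: str, specs: Dict[str, Dict[str, Any]],
            cache: Dict[str, Any], resolving: Set[str]) -> Any:
    """Resolve `key`, creating dependencies first and caching every instance."""
    if key in cache:
        return cache[key]
    if key in resolving:
        raise RuntimeError(f"Circular dependency detected for service: {key}")
    if key not in specs:
        raise KeyError(f"Unknown service: {key}")

    def get_service(dep_key: str) -> Any:
        return resolve(dep_key, specs, cache, resolving)

    # Mark the key as in progress only while its dependencies are resolved.
    resolving.add(key)
    try:
        for dep in specs[key]["dependencies"]:
            get_service(dep)
    finally:
        resolving.discard(key)

    instance = specs[key]["factory"](get_service)
    cache[key] = instance
    return instance


SPECS = {
    "utils": make_spec("utils", []),
    "chat": make_spec("chat", ["utils"]),
    "extraction": make_spec("extraction", ["chat", "utils"]),
}
```

Calling `resolve("extraction", SPECS, {}, set())` builds `utils` and `chat` before `extraction`, and a later lookup of `chat` with the same cache returns the instance that was already created.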
3
modules/serviceCenter/services/__init__.py
Normal file
@ -0,0 +1,3 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Importable services - feature-facing, RBAC-protected."""
7
modules/serviceCenter/services/serviceAi/__init__.py
Normal file
@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""AI service."""

from .mainServiceAi import AiService

__all__ = ["AiService"]
1573
modules/serviceCenter/services/serviceAi/mainServiceAi.py
Normal file
File diff suppressed because it is too large
665
modules/serviceCenter/services/serviceAi/subAiCallLooping.py
Normal file
|
|
@ -0,0 +1,665 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
AI Call Looping Module
|
||||
|
||||
Handles AI calls with looping and repair logic, including:
|
||||
- Looping with JSON repair and continuation
|
||||
- KPI definition and tracking
|
||||
- Progress tracking and iteration management
|
||||
|
||||
FLOW LOGIC
|
||||
|
||||
VARIABLES:
|
||||
- jsonBase: str (merged JSON so far, starts empty)
|
||||
- lastValidCompletePart: str (fallback for failures)
|
||||
- mergeFailCount: int = 0 (max 3)
|
||||
|
||||
FLOW:
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ 1. BUILD PROMPT │
|
||||
│ - First: original prompt │
|
||||
│ - Next: buildContinuationContext(lastRawResponse) │
|
||||
├─────────────────────────────────────────────────────────────────┤
|
||||
│ 2. CALL AI → response fragment │
|
||||
├─────────────────────────────────────────────────────────────────┤
|
||||
│ 4. MERGE jsonBase + response │
|
||||
│ ├─ FAILS: repeat prompt, fails++ (if >=3 return fallback) │
|
||||
│ └─ SUCCEEDS: try parse │
|
||||
│ ├─ SUCCEEDS: FINISHED │
|
||||
│ └─ FAILS: → step 5 │
|
||||
├─────────────────────────────────────────────────────────────────┤
|
||||
│ 5. GET CONTEXTS (merge OK, parse failed) │
|
||||
│ getContexts(mergedJson) → │
|
||||
│ - If no cut point: overlapContext = "" │
|
||||
│ - Store contexts for next iteration │
|
||||
├─────────────────────────────────────────────────────────────────┤
|
||||
│ 6. DECIDE │
|
||||
│ ├─ jsonParsingSuccess=true AND overlapContext="": │
|
||||
│ │ FINISHED. return completePart │
|
||||
│ ├─ jsonParsingSuccess=true AND overlapContext!="": │
|
||||
│ │ CONTINUE, fails=0 │
|
||||
│ └─ ELSE: repeat prompt, fails++ │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
|
||||
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
from typing import Dict, Any, List, Optional, Callable
|
||||
|
||||
from modules.datamodels.datamodelAi import (
|
||||
AiCallRequest, AiCallOptions
|
||||
)
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from .subJsonResponseHandling import JsonResponseHandler
|
||||
from .subLoopingUseCases import LoopingUseCaseRegistry
|
||||
from modules.workflows.processing.shared.stateTools import checkWorkflowStopped
|
||||
from modules.shared.jsonContinuation import getContexts
|
||||
from modules.shared.jsonUtils import (
    buildContinuationContext,
    closeJsonStructures,
    extractJsonString,
    normalizeJsonText,
    stripCodeFences,
    tryParseJson,
)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class AiCallLooper:
|
||||
"""Handles AI calls with looping and repair logic."""
|
||||
|
||||
def __init__(self, services, aiService, responseParser):
|
||||
"""Initialize AiCallLooper with service center, AI service, and response parser access."""
|
||||
self.services = services
|
||||
self.aiService = aiService
|
||||
self.responseParser = responseParser
|
||||
self.useCaseRegistry = LoopingUseCaseRegistry() # Initialize use case registry
|
||||
|
||||
async def callAiWithLooping(
|
||||
self,
|
||||
prompt: str,
|
||||
options: AiCallOptions,
|
||||
debugPrefix: str = "ai_call",
|
||||
promptBuilder: Optional[Callable] = None,
|
||||
promptArgs: Optional[Dict[str, Any]] = None,
|
||||
operationId: Optional[str] = None,
|
||||
userPrompt: Optional[str] = None,
|
||||
contentParts: Optional[List[ContentPart]] = None, # ARCHITECTURE: Support ContentParts for large content
|
||||
useCaseId: Optional[str] = None # REQUIRED at runtime: explicit use case ID - no auto-detection, no fallback (a missing value raises ValueError)
|
||||
) -> str:
|
||||
"""
|
||||
Shared core function for AI calls with repair-based looping system.
|
||||
Automatically repairs broken JSON and continues generation seamlessly.
|
||||
|
||||
Args:
|
||||
prompt: The prompt to send to AI
|
||||
options: AI call configuration options
|
||||
debugPrefix: Prefix for debug file names
|
||||
promptBuilder: Optional function to rebuild prompts for continuation
|
||||
promptArgs: Optional arguments for prompt builder
|
||||
operationId: Optional operation ID for progress tracking
|
||||
userPrompt: Optional user prompt for KPI definition
|
||||
contentParts: Optional content parts for first iteration
|
||||
useCaseId: REQUIRED: Explicit use case ID - no auto-detection, no fallback
|
||||
|
||||
Returns:
|
||||
Complete AI response after all iterations
|
||||
"""
|
||||
# REQUIRED: useCaseId must be provided - no auto-detection, no fallback
|
||||
if not useCaseId:
|
||||
errorMsg = (
|
||||
"useCaseId is REQUIRED for callAiWithLooping. "
|
||||
"No auto-detection - must explicitly specify use case ID. "
|
||||
f"Available use cases: {list(self.useCaseRegistry.useCases.keys())}"
|
||||
)
|
||||
logger.error(errorMsg)
|
||||
raise ValueError(errorMsg)
|
||||
|
||||
# Validate use case exists
|
||||
useCase = self.useCaseRegistry.get(useCaseId)
|
||||
if not useCase:
|
||||
errorMsg = (
|
||||
f"Use case '{useCaseId}' not found in registry. "
|
||||
f"Available use cases: {list(self.useCaseRegistry.useCases.keys())}"
|
||||
)
|
||||
logger.error(errorMsg)
|
||||
raise ValueError(errorMsg)
|
||||
|
||||
maxIterations = 50 # Prevent infinite loops
|
||||
iteration = 0
|
||||
allSections = [] # Accumulate all sections across iterations
|
||||
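lastRawResponse = None # Store last raw JSON response for continuation
|
||||
result = None # Initialized so the final return is safe if the first iteration fails before a response arrives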
|
||||
|
||||
# JSON Base Iteration System:
|
||||
# - jsonBase: the merged JSON string (replaces accumulatedDirectJson array)
|
||||
# - After each iteration, new response is merged with jsonBase
|
||||
# - On merge success: check if complete, store contexts for next iteration
|
||||
# - On merge fail: retry with same prompt, increment fails
|
||||
jsonBase = None # Merged JSON string (starts None, set on first response)
|
||||
|
||||
# Merge fail tracking - stop after 3 consecutive merge failures
|
||||
MAX_MERGE_FAILS = 3
|
||||
mergeFailCount = 0 # Global counter for merge failures across entire loop
|
||||
lastValidCompletePart = None # Store last successfully parsed completePart for fallback
|
||||
|
||||
# Get parent operation ID for iteration operations (parentId should be operationId, not log entry ID)
|
||||
parentOperationId = operationId # Use the parent's operationId directly
|
||||
|
||||
while iteration < maxIterations:
|
||||
iteration += 1
|
||||
|
||||
# Create separate operation for each iteration with parent reference
|
||||
iterationOperationId = None
|
||||
if operationId:
|
||||
iterationOperationId = f"{operationId}_iter_{iteration}"
|
||||
self.services.chat.progressLogStart(
|
||||
iterationOperationId,
|
||||
"AI Call",
|
||||
f"Iteration {iteration}",
|
||||
"",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
# Build iteration prompt
|
||||
# CRITICAL: Build continuation prompt if we have sections OR if we have a previous response (even if broken)
|
||||
# This ensures continuation prompts are built even when JSON is so broken that no sections can be extracted
|
||||
if (len(allSections) > 0 or lastRawResponse) and promptBuilder and promptArgs:
|
||||
# Extract templateStructure and basePrompt from promptArgs (REQUIRED)
|
||||
templateStructure = promptArgs.get("templateStructure")
|
||||
if not templateStructure:
|
||||
raise ValueError(
|
||||
f"templateStructure is REQUIRED in promptArgs for use case '{useCaseId}'. "
|
||||
"Prompt creation functions must return (prompt, templateStructure) tuple."
|
||||
)
|
||||
|
||||
basePrompt = promptArgs.get("basePrompt")
|
||||
if not basePrompt:
|
||||
# Fallback: use prompt parameter (should be the same)
|
||||
basePrompt = prompt
|
||||
logger.warning(
|
||||
f"basePrompt not found in promptArgs for use case '{useCaseId}', "
|
||||
"using prompt parameter instead. This may indicate a bug."
|
||||
)
|
||||
|
||||
# This is a continuation - build continuation context with raw JSON and rebuild prompt
|
||||
continuationContext = buildContinuationContext(
|
||||
allSections, lastRawResponse, useCaseId, templateStructure
|
||||
)
|
||||
if not lastRawResponse:
|
||||
logger.warning(f"Iteration {iteration}: No previous response available for continuation!")
|
||||
|
||||
# Store valid completePart from continuation context for fallback on merge failures
|
||||
# Use getContexts to check if completePart is parseable and store it
|
||||
if lastRawResponse and not lastValidCompletePart:
|
||||
try:
|
||||
contexts = getContexts(lastRawResponse)
|
||||
if contexts.jsonParsingSuccess and contexts.completePart:
|
||||
lastValidCompletePart = contexts.completePart
|
||||
logger.debug(f"Iteration {iteration}: Stored initial valid completePart ({len(lastValidCompletePart)} chars)")
|
||||
except Exception as e:
|
||||
logger.debug(f"Iteration {iteration}: Failed to extract completePart: {e}")
|
||||
|
||||
# Unified prompt builder call: Continuation builders only need continuationContext, templateStructure, and basePrompt
|
||||
# All initial context (section, userPrompt, etc.) is already in basePrompt, so promptArgs is not needed
|
||||
# Extract templateStructure and basePrompt from promptArgs (they're explicit parameters)
|
||||
iterationPrompt = await promptBuilder(
|
||||
continuationContext=continuationContext,
|
||||
templateStructure=templateStructure,
|
||||
basePrompt=basePrompt
|
||||
)
|
||||
else:
|
||||
# First iteration - use original prompt
|
||||
iterationPrompt = prompt
|
||||
|
||||
# Make AI call
|
||||
try:
|
||||
checkWorkflowStopped(self.services)
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogUpdate(iterationOperationId, 0.3, "Calling AI model")
|
||||
# ARCHITECTURE: Pass ContentParts directly to AiCallRequest
|
||||
# This allows model-aware chunking to handle large content properly
|
||||
# ContentParts are only passed in first iteration (continuations don't need them)
|
||||
request = AiCallRequest(
|
||||
prompt=iterationPrompt,
|
||||
context="",
|
||||
options=options,
|
||||
contentParts=contentParts if iteration == 1 else None # Only pass ContentParts in first iteration
|
||||
)
|
||||
|
||||
# Write the ACTUAL prompt sent to AI
|
||||
# For section content generation: write prompt for first iteration and continuation iterations
|
||||
# For document generation: write prompt for each iteration
|
||||
isSectionContent = "_section_" in debugPrefix
|
||||
if iteration == 1:
|
||||
self.services.utils.writeDebugFile(iterationPrompt, f"{debugPrefix}_prompt")
|
||||
elif isSectionContent:
|
||||
# Save continuation prompts for section_content debugging
|
||||
self.services.utils.writeDebugFile(iterationPrompt, f"{debugPrefix}_prompt_iteration_{iteration}")
|
||||
else:
|
||||
# Document generation - save all iteration prompts
|
||||
self.services.utils.writeDebugFile(iterationPrompt, f"{debugPrefix}_prompt_iteration_{iteration}")
|
||||
|
||||
response = await self.aiService.callAi(request)
|
||||
result = response.content
|
||||
|
||||
# Track bytes for progress reporting
|
||||
bytesReceived = len(result.encode('utf-8')) if result else 0
|
||||
totalBytesSoFar = sum(len(section.get('content', '').encode('utf-8')) if isinstance(section.get('content'), str) else 0 for section in allSections) + bytesReceived
|
||||
|
||||
# Update progress after AI call with byte information
|
||||
if iterationOperationId:
|
||||
# Format bytes for display (kB or MB)
|
||||
if totalBytesSoFar < 1024:
|
||||
bytesDisplay = f"{totalBytesSoFar}B"
|
||||
elif totalBytesSoFar < 1024 * 1024:
|
||||
bytesDisplay = f"{totalBytesSoFar / 1024:.1f}kB"
|
||||
else:
|
||||
bytesDisplay = f"{totalBytesSoFar / (1024 * 1024):.1f}MB"
|
||||
self.services.chat.progressLogUpdate(iterationOperationId, 0.6, f"AI response received ({bytesDisplay})")
|
||||
|
||||
# Write raw AI response to debug file
|
||||
# For section content generation: write response for first iteration and continuation iterations
|
||||
# For document generation: write response for each iteration
|
||||
if iteration == 1:
|
||||
self.services.utils.writeDebugFile(result, f"{debugPrefix}_response")
|
||||
elif isSectionContent:
|
||||
# Save continuation responses for section_content debugging
|
||||
self.services.utils.writeDebugFile(result, f"{debugPrefix}_response_iteration_{iteration}")
|
||||
else:
|
||||
# Document generation - save all iteration responses
|
||||
self.services.utils.writeDebugFile(result, f"{debugPrefix}_response_iteration_{iteration}")
|
||||
|
||||
# Note: Stats are now stored centrally in callAi() - no need to duplicate here
|
||||
|
||||
# Check for error response using generic error detection (errorCount > 0 or modelName == "error")
|
||||
if hasattr(response, 'errorCount') and response.errorCount > 0:
|
||||
errorMsg = f"Iteration {iteration}: Error response detected (errorCount={response.errorCount}), stopping loop: {result[:200] if result else 'empty'}"
|
||||
logger.error(errorMsg)
|
||||
break
|
||||
|
||||
if hasattr(response, 'modelName') and response.modelName == "error":
|
||||
errorMsg = f"Iteration {iteration}: Error response detected (modelName=error), stopping loop: {result[:200] if result else 'empty'}"
|
||||
logger.error(errorMsg)
|
||||
break
|
||||
|
||||
if not result or not result.strip():
|
||||
logger.warning(f"Iteration {iteration}: Empty response, stopping")
|
||||
break
|
||||
|
||||
# Check if this is a text response (not document generation)
|
||||
# Text responses don't need JSON parsing - return immediately after first successful response
|
||||
isTextResponse = (promptBuilder is None and promptArgs is None) or debugPrefix == "text"
|
||||
|
||||
if isTextResponse:
|
||||
# For text responses, return the text immediately - no JSON parsing needed
|
||||
logger.info(f"Iteration {iteration}: Text response received, returning immediately")
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogFinish(iterationOperationId, True)
|
||||
return result
|
||||
|
||||
# NOTE: Do NOT update lastRawResponse here!
|
||||
# lastRawResponse should only be updated after successful merge
|
||||
# This ensures retry iterations use the correct base context
|
||||
|
||||
# Handle use cases that return JSON directly (no section extraction needed)
|
||||
# Check if use case supports direct return (all registered use cases do)
|
||||
if useCase and not useCase.requiresExtraction:
|
||||
# =====================================================================
|
||||
# ITERATION FLOW (Simplified)
|
||||
# =====================================================================
|
||||
# Step 4: MERGE jsonBase + new response
|
||||
# - FAILS: repeat prompt, increment fails count (if >=3 return fallback)
|
||||
# - SUCCEEDS: try parse
|
||||
# - SUCCEEDS: FINISHED
|
||||
# - FAILS: proceed to Step 5
|
||||
# Step 5: GET CONTEXTS (merge OK, parse failed)
|
||||
# - getContexts() with repair
|
||||
# - If no cut point: overlapContext = ""
|
||||
# Step 6: DECIDE
|
||||
# - jsonParsingSuccess=true AND overlapContext="": FINISHED
|
||||
# - jsonParsingSuccess=true AND overlapContext!="": continue, fails=0
|
||||
# - ELSE: repeat prompt, increment fails count
|
||||
# =====================================================================
|
||||
|
||||
# STEP 4: MERGE jsonBase + new response
|
||||
# Use candidateJson to hold merged result until we confirm it's valid
|
||||
candidateJson = None
|
||||
|
||||
if jsonBase is None:
|
||||
# First iteration - candidate is the current result
|
||||
candidateJson = result
|
||||
logger.debug(f"Iteration {iteration}: First response, candidateJson ({len(candidateJson)} chars)")
|
||||
else:
|
||||
# Merge jsonBase with new response
|
||||
logger.info(f"Iteration {iteration}: Merging jsonBase ({len(jsonBase)} chars) with new response ({len(result)} chars)")
|
||||
mergedJsonString, hasOverlap = JsonResponseHandler.mergeJsonStringsWithOverlap(jsonBase, result)
|
||||
|
||||
if not hasOverlap:
|
||||
# MERGE FAILED - repeat prompt with unchanged jsonBase
|
||||
mergeFailCount += 1
|
||||
logger.warning(
|
||||
f"Iteration {iteration}: Merge failed, no overlap found "
|
||||
f"(fail {mergeFailCount}/{MAX_MERGE_FAILS})"
|
||||
)
|
||||
|
||||
if mergeFailCount >= MAX_MERGE_FAILS:
|
||||
# Max failures reached - return last valid completePart
|
||||
logger.error(
|
||||
f"Iteration {iteration}: Max merge failures ({MAX_MERGE_FAILS}) reached, "
|
||||
"returning last valid completePart"
|
||||
)
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogFinish(iterationOperationId, False)
|
||||
|
||||
if lastValidCompletePart:
|
||||
try:
|
||||
extracted = extractJsonString(lastValidCompletePart)
|
||||
parsed, parseErr, _ = tryParseJson(extracted)
|
||||
if parseErr is None and parsed:
|
||||
normalized = self._normalizeJsonStructure(parsed, useCase)
|
||||
return json.dumps(normalized, indent=2, ensure_ascii=False)
|
||||
except Exception:
|
||||
pass
|
||||
return lastValidCompletePart
|
||||
else:
|
||||
# No valid fallback - return whatever we have
|
||||
return jsonBase if jsonBase else ""
|
||||
|
||||
# Not at max failures - retry with same prompt (jsonBase unchanged)
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogUpdate(
|
||||
iterationOperationId, 0.7,
|
||||
f"Merge failed ({mergeFailCount}/{MAX_MERGE_FAILS}), retrying"
|
||||
)
|
||||
self.services.chat.progressLogFinish(iterationOperationId, True)
|
||||
continue
|
||||
|
||||
# MERGE SUCCEEDED - set candidate (don't update jsonBase yet!)
|
||||
candidateJson = mergedJsonString
|
||||
logger.debug(f"Iteration {iteration}: Merge succeeded, candidateJson ({len(candidateJson)} chars)")
|
||||
|
||||
# Update lastRawResponse ONLY after we have a valid candidateJson
|
||||
# (first iteration or successful merge - NOT on merge failure!)
|
||||
# This ensures retry iterations use the correct base context
|
||||
lastRawResponse = candidateJson
|
||||
|
||||
# Try direct parse of candidate
|
||||
try:
|
||||
extracted = extractJsonString(candidateJson)
|
||||
parsed, parseErr, _ = tryParseJson(extracted)
|
||||
if parseErr is None and parsed:
|
||||
# Direct parse succeeded - FINISHED
|
||||
# Commit candidate to jsonBase
|
||||
jsonBase = candidateJson
|
||||
logger.info(f"Iteration {iteration}: Direct parse succeeded, JSON is complete")
|
||||
normalized = self._normalizeJsonStructure(parsed, useCase)
|
||||
result = json.dumps(normalized, indent=2, ensure_ascii=False)
|
||||
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogFinish(iterationOperationId, True)
|
||||
|
||||
if not useCase.finalResultHandler:
|
||||
raise ValueError(
|
||||
f"Use case '{useCaseId}' is missing required 'finalResultHandler' callback."
|
||||
)
|
||||
return useCase.finalResultHandler(
|
||||
result, normalized, extracted, debugPrefix, self.services
|
||||
)
|
||||
except Exception as e:
|
||||
logger.debug(f"Iteration {iteration}: Direct parse failed: {e}")
|
||||
|
||||
# STEP 5: GET CONTEXTS (merge OK, parse failed = cut JSON)
|
||||
# Use candidateJson for context extraction
|
||||
contexts = getContexts(candidateJson)
|
||||
overlapInfo = "(empty=complete)" if contexts.overlapContext == "" else f"({len(contexts.overlapContext)} chars)"
|
||||
logger.debug(
|
||||
f"Iteration {iteration}: getContexts() -> "
|
||||
f"jsonParsingSuccess={contexts.jsonParsingSuccess}, "
|
||||
f"overlapContext={overlapInfo}"
|
||||
)
|
||||
|
||||
# STEP 6: DECIDE based on jsonParsingSuccess and overlapContext
|
||||
if contexts.jsonParsingSuccess and contexts.overlapContext == "":
|
||||
# JSON is complete (no cut point) - FINISHED
|
||||
# Use completePart for final result (closed, repaired JSON)
|
||||
# No more merging needed, so we don't need the cut version
|
||||
jsonBase = contexts.completePart
|
||||
logger.info(f"Iteration {iteration}: jsonParsingSuccess=true, overlapContext='', JSON complete")
|
||||
|
||||
# Store and parse completePart
|
||||
lastValidCompletePart = contexts.completePart
|
||||
|
||||
try:
|
||||
extracted = extractJsonString(contexts.completePart)
|
||||
parsed, parseErr, _ = tryParseJson(extracted)
|
||||
if parseErr is None and parsed:
|
||||
normalized = self._normalizeJsonStructure(parsed, useCase)
|
||||
result = json.dumps(normalized, indent=2, ensure_ascii=False)
|
||||
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogFinish(iterationOperationId, True)
|
||||
|
||||
if not useCase.finalResultHandler:
|
||||
raise ValueError(
|
||||
f"Use case '{useCaseId}' is missing required 'finalResultHandler' callback."
|
||||
)
|
||||
return useCase.finalResultHandler(
|
||||
result, normalized, extracted, debugPrefix, self.services
|
||||
)
|
||||
except Exception as e:
|
||||
logger.warning(f"Iteration {iteration}: Failed to parse completePart: {e}")
|
||||
|
||||
# Fallback: return completePart as-is
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogFinish(iterationOperationId, True)
|
||||
return contexts.completePart
|
||||
|
||||
elif contexts.jsonParsingSuccess and contexts.overlapContext != "":
|
||||
# JSON parseable but has cut point - CONTINUE to next iteration
|
||||
# CRITICAL: Use hierarchyContext (CUT json) as jsonBase for next merge!
|
||||
# - hierarchyContext = the truncated JSON at cut point (needed for overlap matching)
|
||||
# - completePart = closed JSON (for validation/fallback only)
|
||||
# The next AI fragment's overlap must match the CUT point, not closed structures
|
||||
jsonBase = contexts.hierarchyContext
|
||||
logger.info(
|
||||
f"Iteration {iteration}: jsonParsingSuccess=true, overlapContext not empty, "
|
||||
f"continuing iteration (jsonBase updated to hierarchyContext: {len(jsonBase)} chars)"
|
||||
)
|
||||
|
||||
# Store valid completePart as fallback (different from jsonBase!)
|
||||
lastValidCompletePart = contexts.completePart
|
||||
|
||||
# Reset fail counter on successful progress
|
||||
mergeFailCount = 0
|
||||
|
||||
# Update lastRawResponse for continuation prompt building
|
||||
# Use the CUT version for prompt context as well
|
||||
lastRawResponse = jsonBase
|
||||
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogUpdate(iterationOperationId, 0.7, "JSON incomplete, requesting continuation")
|
||||
self.services.chat.progressLogFinish(iterationOperationId, True)
|
||||
continue
|
||||
|
||||
else:
|
||||
# JSON not parseable after repair - repeat prompt, increment fails
|
||||
# Do NOT update jsonBase - keep previous valid state
|
||||
mergeFailCount += 1
|
||||
logger.warning(
|
||||
f"Iteration {iteration}: jsonParsingSuccess=false, "
|
||||
f"repeat prompt (fail {mergeFailCount}/{MAX_MERGE_FAILS})"
|
||||
)
|
||||
|
||||
if mergeFailCount >= MAX_MERGE_FAILS:
|
||||
# Max failures reached - return last valid completePart
|
||||
logger.error(
|
||||
f"Iteration {iteration}: Max failures ({MAX_MERGE_FAILS}) reached, "
|
||||
"returning last valid completePart"
|
||||
)
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogFinish(iterationOperationId, False)
|
||||
|
||||
if lastValidCompletePart:
|
||||
try:
|
||||
extracted = extractJsonString(lastValidCompletePart)
|
||||
parsed, parseErr, _ = tryParseJson(extracted)
|
||||
if parseErr is None and parsed:
|
||||
normalized = self._normalizeJsonStructure(parsed, useCase)
|
||||
return json.dumps(normalized, indent=2, ensure_ascii=False)
|
||||
except Exception:
|
||||
pass
|
||||
return lastValidCompletePart
|
||||
else:
|
||||
return jsonBase if jsonBase else ""
|
||||
|
||||
# Not at max - retry with same prompt
|
||||
# Do NOT update jsonBase or lastRawResponse - keep previous for retry
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogUpdate(
|
||||
iterationOperationId, 0.7,
|
||||
f"Parse failed ({mergeFailCount}/{MAX_MERGE_FAILS}), retrying"
|
||||
)
|
||||
self.services.chat.progressLogFinish(iterationOperationId, True)
|
||||
continue
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in AI call iteration {iteration}: {str(e)}")
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogFinish(iterationOperationId, False)
|
||||
break
|
||||
|
||||
if iteration >= maxIterations:
|
||||
logger.warning(f"AI call stopped after maximum iterations ({maxIterations})")
|
||||
|
||||
# Reached when the loop exits without returning: an error or empty response,
|
||||
# an exception during the AI call, maxIterations exhausted, or a use case that
|
||||
# requires section extraction (no such use cases are currently registered).
|
||||
logger.error(f"Unexpected code path: reached end of loop without return for use case '{useCaseId}'")
|
||||
return result if result else ""
|
||||
|
||||
def _isJsonStringIncomplete(self, jsonString: str) -> bool:
|
||||
"""
|
||||
Check if JSON string is incomplete (truncated) BEFORE closing/parsing.
|
||||
|
||||
This is critical because if JSON is truncated, closing it makes it appear complete,
|
||||
but we need to detect the truncation to continue iteration.
|
||||
|
||||
Args:
|
||||
jsonString: JSON string to check
|
||||
|
||||
Returns:
|
||||
True if JSON string appears incomplete/truncated, False otherwise
|
||||
"""
|
||||
if not jsonString or not jsonString.strip():
|
||||
return False
|
||||
|
||||
# Normalize JSON string
|
||||
normalized = stripCodeFences(normalizeJsonText(jsonString)).strip()
|
||||
if not normalized:
|
||||
return False
|
||||
|
||||
# Find first '{' or '[' to start
|
||||
startIdx = -1
|
||||
for i, char in enumerate(normalized):
|
||||
if char in '{[':
|
||||
startIdx = i
|
||||
break
|
||||
|
||||
if startIdx == -1:
|
||||
return False
|
||||
|
||||
jsonContent = normalized[startIdx:]
|
||||
|
||||
# Check if structures are balanced (all opened structures are closed)
|
||||
braceCount = 0
|
||||
bracketCount = 0
|
||||
inString = False
|
||||
escapeNext = False
|
||||
|
||||
for char in jsonContent:
|
||||
if escapeNext:
|
||||
escapeNext = False
|
||||
continue
|
||||
|
||||
if char == '\\':
|
||||
escapeNext = True
|
||||
continue
|
||||
|
||||
if char == '"':
|
||||
inString = not inString
|
||||
continue
|
||||
|
||||
if not inString:
|
||||
if char == '{':
|
||||
braceCount += 1
|
||||
elif char == '}':
|
||||
braceCount -= 1
|
||||
elif char == '[':
|
||||
bracketCount += 1
|
||||
elif char == ']':
|
||||
bracketCount -= 1
|
||||
|
||||
# If structures are unbalanced, JSON is incomplete
|
||||
if braceCount > 0 or bracketCount > 0:
|
||||
return True
|
||||
|
||||
# Check if JSON ends with incomplete value (e.g., unclosed string, incomplete number, trailing comma)
|
||||
trimmed = jsonContent.rstrip()
|
||||
if not trimmed:
|
||||
return False
|
||||
|
||||
# Check for trailing comma (might indicate incomplete)
|
||||
if trimmed.endswith(','):
|
||||
# Structures are balanced at this point, so a trailing comma is treated as complete;
|
||||
# no further look-ahead past the comma is performed
|
||||
return False # Trailing comma alone doesn't mean incomplete
|
||||
|
||||
# Check if ends with an unclosed string. Reuse inString from the scan above,
|
||||
# which skips escaped quotes (a raw count of '"' would also count \" inside strings)
|
||||
if inString:
|
||||
# The scan ended inside a string literal - string is not closed
|
||||
return True
|
||||
|
||||
# Check if ends mid-value (e.g., ends with "417 instead of "4170. 41719"])
|
||||
# Look for patterns that suggest truncation:
|
||||
# - Ends with incomplete number (e.g., "417)
|
||||
# - Ends with incomplete array element (e.g., ["417)
|
||||
# - Ends with incomplete object property (e.g., {"key": "val)
|
||||
|
||||
# If JSON parses successfully without closing, it's complete
|
||||
parsed, parseErr, _ = tryParseJson(jsonContent)
|
||||
if parseErr is None:
|
||||
# Parses successfully - it's complete
|
||||
return False
|
||||
|
||||
# If it doesn't parse, try closing it and see if that helps
|
||||
closed = closeJsonStructures(jsonContent)
|
||||
parsedClosed, parseErrClosed, _ = tryParseJson(closed)
|
||||
|
||||
if parseErrClosed is None:
|
||||
# Only parses after closing - it was incomplete
|
||||
return True
|
||||
|
||||
# Doesn't parse even after closing - might be malformed, but assume incomplete to be safe
|
||||
return True
|
||||
|
||||
def _normalizeJsonStructure(self, parsed: Any, useCase) -> Any:
|
||||
"""
|
||||
Normalize JSON structure to ensure consistent format before merging.
|
||||
Handles different response formats and converts them to expected structure.
|
||||
|
||||
Args:
|
||||
parsed: Parsed JSON object (can be dict, list, or primitive)
|
||||
useCase: LoopingUseCase instance with jsonNormalizer callback
|
||||
|
||||
Returns:
|
||||
Normalized JSON structure
|
||||
"""
|
||||
# Use callback to normalize JSON structure (REQUIRED - no fallback)
|
||||
if not useCase or not useCase.jsonNormalizer:
|
||||
raise ValueError(
|
||||
f"Use case '{useCase.useCaseId if useCase else 'unknown'}' is missing required 'jsonNormalizer' callback. "
|
||||
"All use cases must provide a jsonNormalizer function."
|
||||
)
|
||||
return useCase.jsonNormalizer(parsed, useCase.useCaseId)
|
||||
|
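Reviewer note: the merge-and-retry flow in `callAiWithLooping` (Steps 4 to 6 in the comment block) can be hard to follow inside the diff. The following is a minimal sketch of that control flow only, under stated assumptions: `mergeWithOverlap` is a hypothetical stand-in for `JsonResponseHandler.mergeJsonStringsWithOverlap` (whose real matching logic is not shown in this commit view), `callModel` is a hypothetical callback replacing the AI service, and the `getContexts()` repair step is omitted, so any unparseable merge is simply carried forward.

```python
import json
from typing import Callable, Optional, Tuple

MAX_MERGE_FAILS = 3  # same limit as in callAiWithLooping


def mergeWithOverlap(base: str, fragment: str, minOverlap: int = 4) -> Tuple[str, bool]:
    """Hypothetical merge: join where a suffix of base equals a prefix of fragment."""
    for size in range(min(len(base), len(fragment)), minOverlap - 1, -1):
        if base.endswith(fragment[:size]):
            return base + fragment[size:], True
    return base, False  # no overlap found, base is returned unchanged


def runLoop(callModel: Callable[[Optional[str]], str], maxIterations: int = 50) -> str:
    """Accumulate fragments until the merged text parses as JSON."""
    jsonBase: Optional[str] = None
    mergeFailCount = 0
    for _ in range(maxIterations):
        fragment = callModel(jsonBase)
        if jsonBase is None:
            candidate = fragment  # first response is the candidate as-is
        else:
            candidate, hasOverlap = mergeWithOverlap(jsonBase, fragment)
            if not hasOverlap:
                mergeFailCount += 1
                if mergeFailCount >= MAX_MERGE_FAILS:
                    return jsonBase  # give up and return what was collected
                continue  # retry with jsonBase unchanged
        try:
            json.loads(candidate)
            return candidate  # parses cleanly: finished
        except json.JSONDecodeError:
            jsonBase = candidate  # still cut: commit and request more
            mergeFailCount = 0  # progress was made, reset the counter
    return jsonBase or ""
```

The key design point mirrored here is that `jsonBase` is only replaced after a successful merge, so a failed iteration can be retried against the same base.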
||||
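Reviewer note: `_isJsonStringIncomplete` decides truncation by scanning for unbalanced braces and brackets while ignoring anything inside string literals. A minimal standalone sketch of that technique follows. It is an illustration, not the module's code: `looksTruncated` is a hypothetical name, it uses the standard library `json.loads` in place of the project's `tryParseJson` and `closeJsonStructures` helpers, and it uses a single depth counter instead of separate brace and bracket counters.

```python
import json


def looksTruncated(text: str) -> bool:
    """Return True if text looks like JSON that was cut off mid-stream."""
    if not text.strip():
        return False  # nothing to check, same convention as the original
    depth = 0
    inString = False
    escapeNext = False
    for char in text:
        if escapeNext:
            escapeNext = False  # the escaped character never toggles state
            continue
        if char == "\\":
            escapeNext = True
            continue
        if char == '"':
            inString = not inString
            continue
        if not inString:
            if char in "{[":
                depth += 1
            elif char in "}]":
                depth -= 1
    if inString or depth > 0:
        return True  # unclosed string or unclosed structure
    try:
        json.loads(text)
        return False  # balanced and parseable: complete
    except json.JSONDecodeError:
        return True  # balanced but unparseable: assume incomplete to be safe
```

Tracking `inString` during the scan, rather than counting quote characters afterwards, keeps escaped quotes such as `\"` from being mistaken for string delimiters.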
721
modules/serviceCenter/services/serviceAi/subContentExtraction.py
Normal file
|
|
@ -0,0 +1,721 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Content Extraction Module
|
||||
|
||||
Handles content extraction and preparation, including:
|
||||
- Extracting content from documents based on intents
|
||||
- Processing pre-extracted documents
|
||||
- Vision AI for image text extraction
|
||||
- AI processing of text content
|
||||
"""
|
||||
import json
|
||||
import logging
|
||||
import base64
|
||||
from typing import Dict, Any, List, Optional
|
||||
|
||||
from modules.datamodels.datamodelChat import ChatDocument
|
||||
from modules.datamodels.datamodelExtraction import ContentPart, DocumentIntent, ExtractionOptions, MergeStrategy
|
||||
from modules.workflows.processing.shared.stateTools import checkWorkflowStopped
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ContentExtractor:
|
||||
"""Handles content extraction and preparation."""
|
||||
|
||||
def __init__(self, services, aiService, intentAnalyzer):
|
||||
"""Initialize ContentExtractor with service center, AI service, and intent analyzer access."""
|
||||
self.services = services
|
||||
self.aiService = aiService
|
||||
self.intentAnalyzer = intentAnalyzer
|
||||
|
||||
async def extractAndPrepareContent(
|
||||
self,
|
||||
documents: List[ChatDocument],
|
||||
documentIntents: List[DocumentIntent],
|
||||
parentOperationId: str,
|
||||
getIntentForDocument: callable
|
||||
) -> List[ContentPart]:
|
||||
"""
|
||||
Phase 5B: Extracts content based on intents and prepares ContentParts with metadata.
|
||||
Returns a list of ContentParts in the matching format.
|
||||
|
||||
IMPORTANT: A document can produce multiple ContentParts when it has multiple intents.
|
||||
Example: an image with intents=["extract", "render"] produces:
|
||||
- ContentPart(contentFormat="object", ...) for rendering
|
||||
- ContentPart(contentFormat="extracted", ...) for text analysis
|
||||
|
||||
Args:
|
||||
documents: List of documents to process
|
||||
documentIntents: List of DocumentIntent objects
|
||||
parentOperationId: Parent operation ID for the ChatLog hierarchy
|
||||
getIntentForDocument: Callable to get intent for document ID
|
||||
|
||||
Returns:
|
||||
List of ContentParts with complete metadata
|
||||
"""
|
||||
# Create the operation ID for extraction
|
||||
extractionOperationId = f"{parentOperationId}_content_extraction"
|
||||
|
||||
# Start the ChatLog entry with a parent reference
|
||||
self.services.chat.progressLogStart(
|
||||
extractionOperationId,
|
||||
"Content Extraction",
|
||||
"Extraction",
|
||||
f"Extracting from {len(documents)} documents",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
try:
|
||||
allContentParts = []
|
||||
|
||||
for document in documents:
|
||||
checkWorkflowStopped(self.services)
|
||||
# Check if document is already a ContentExtracted document (pre-extracted JSON)
|
||||
logger.debug(f"Checking document {document.id} ({document.fileName}, mimeType={document.mimeType}) for pre-extracted content")
|
||||
preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(document)
|
||||
|
||||
if preExtracted:
|
||||
logger.info(f"✅ Found pre-extracted document: {document.fileName} -> Original: {preExtracted['originalDocument']['fileName']}")
|
||||
logger.info(f" Pre-extracted document ID: {document.id}, Original document ID: {preExtracted['originalDocument']['id']}")
|
||||
logger.info(f" ContentParts count: {len(preExtracted['contentExtracted'].parts) if preExtracted['contentExtracted'].parts else 0}")
|
||||
|
||||
# Use the already-extracted ContentParts directly
|
||||
contentExtracted = preExtracted["contentExtracted"]
|
||||
|
||||
# IMPORTANT: The intent must be looked up for the JSON document, not for the original
|
||||
# (intent analysis already maps back to the JSON document ID)
|
||||
intent = getIntentForDocument(document.id, documentIntents)
|
||||
logger.info(f" Intent lookup for document {document.id}: found={intent is not None}")
|
||||
if intent:
|
||||
logger.info(f" Intent: {intent.intents}, extractionPrompt: {intent.extractionPrompt[:100] if intent.extractionPrompt else None}...")
|
||||
else:
|
||||
logger.warning(f" ⚠️ No intent found for pre-extracted document {document.id}! Available intent documentIds: {[i.documentId for i in documentIntents]}")
|
||||
|
||||
if contentExtracted.parts:
|
||||
# CRITICAL: Process pre-extracted parts - analyze structure parts for nested content
|
||||
processedParts = []
|
||||
for part in contentExtracted.parts:
|
||||
# Skip empty parts (containers without data)
|
||||
if not part.data or (isinstance(part.data, str) and len(part.data.strip()) == 0):
|
||||
if part.typeGroup == "container":
|
||||
continue # Skip empty containers
|
||||
|
||||
# CRITICAL: Check if structure part contains nested parts (e.g., JSON with documentData.parts)
|
||||
if part.typeGroup == "structure" and part.mimeType == "application/json" and part.data:
|
||||
nestedParts = self._extractNestedPartsFromStructure(part, document, preExtracted, intent)
|
||||
if nestedParts:
|
||||
# Replace structure part with extracted nested parts
|
||||
processedParts.extend(nestedParts)
|
||||
logger.info(f"✅ Extracted {len(nestedParts)} nested parts from structure part {part.id}")
|
||||
continue # Skip original structure part
|
||||
|
||||
# Keep original part if no nested parts found
|
||||
processedParts.append(part)
|
||||
|
||||
# Use processed parts (with nested parts extracted)
|
||||
for part in processedParts:
|
||||
if not part.metadata:
|
||||
part.metadata = {}
|
||||
|
||||
# Ensure metadata is complete
|
||||
if "documentId" not in part.metadata:
|
||||
part.metadata["documentId"] = document.id
|
||||
|
||||
# IMPORTANT: Check the intent for this part
|
||||
partIntent = intent.intents if intent else ["extract"]
|
||||
|
||||
# Debug logging for intent processing
|
||||
logger.debug(f"Processing part {part.id}: typeGroup={part.typeGroup}, intents={partIntent}, hasData={bool(part.data)}, dataLength={len(str(part.data)) if part.data else 0}")
|
||||
|
||||
# IMPORTANT: A part can have multiple intents - create one ContentPart per intent
|
||||
# Generic intent processing for ALL content types
|
||||
hasReferenceIntent = "reference" in partIntent
|
||||
hasRenderIntent = "render" in partIntent
|
||||
hasExtractIntent = "extract" in partIntent
|
||||
hasPartData = bool(part.data) and (not isinstance(part.data, str) or len(part.data.strip()) > 0)
|
||||
|
||||
logger.debug(f"Part {part.id}: reference={hasReferenceIntent}, render={hasRenderIntent}, extract={hasExtractIntent}, hasData={hasPartData}")
|
||||
|
||||
# SAFETY: For images with any intent, always ensure render is included
|
||||
# This ensures the image object part is always available for later rendering
|
||||
isImage = part.typeGroup == "image" or (part.mimeType and part.mimeType.startswith("image/"))
|
||||
if isImage and hasPartData and not hasRenderIntent:
|
||||
logger.info(f"🖼️ Auto-adding render intent for image {part.id} (original intents: {partIntent})")
|
||||
hasRenderIntent = True
|
||||
|
||||
# Track whether the original part has already been added
|
||||
originalPartAdded = False
|
||||
|
||||
# 1. Reference intent: create a reference ContentPart
|
||||
if hasReferenceIntent:
|
||||
referencePart = ContentPart(
|
||||
id=f"ref_{document.id}_{part.id}",
|
||||
label=f"Reference: {part.label or 'Content'}",
|
||||
typeGroup="reference",
|
||||
mimeType=part.mimeType or "application/octet-stream",
|
||||
data="", # Empty - reference only
|
||||
metadata={
|
||||
"contentFormat": "reference",
|
||||
"documentId": document.id,
|
||||
"documentReference": f"docItem:{document.id}:{preExtracted['originalDocument']['fileName']}",
|
||||
"intent": "reference",
|
||||
"usageHint": f"Reference: {preExtracted['originalDocument']['fileName']}",
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"]
|
||||
}
|
||||
)
|
||||
allContentParts.append(referencePart)
|
||||
logger.debug(f"✅ Created reference ContentPart for {part.id}")
|
||||
|
||||
# 2. Render intent: create an object ContentPart (for binary/image rendering)
|
||||
if hasRenderIntent and hasPartData:
|
||||
# Check whether it is a binary/image part (i.e. can be rendered)
|
||||
isRenderable = (
|
||||
part.typeGroup == "image" or
|
||||
part.typeGroup == "binary" or
|
||||
(part.mimeType and (
|
||||
part.mimeType.startswith("image/") or
|
||||
part.mimeType.startswith("video/") or
|
||||
part.mimeType.startswith("audio/") or
|
||||
self._isBinary(part.mimeType)
|
||||
))
|
||||
)
|
||||
|
||||
if isRenderable:
|
||||
objectPart = ContentPart(
|
||||
id=f"obj_{document.id}_{part.id}",
|
||||
label=f"Object: {part.label or 'Content'}",
|
||||
typeGroup=part.typeGroup,
|
||||
mimeType=part.mimeType or "application/octet-stream",
|
||||
data=part.data, # Base64/binary data is already present
|
||||
metadata={
|
||||
"contentFormat": "object",
|
||||
"documentId": document.id,
|
||||
"intent": "render",
|
||||
"usageHint": f"Render as visual element: {preExtracted['originalDocument']['fileName']}",
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"],
|
||||
"relatedExtractedPartId": f"extracted_{document.id}_{part.id}" if hasExtractIntent else None
|
||||
}
|
||||
)
|
||||
allContentParts.append(objectPart)
|
||||
logger.debug(f"✅ Created object ContentPart for {part.id} (render intent)")
|
||||
else:
|
||||
logger.warning(f"⚠️ Part {part.id} has render intent but is not renderable (typeGroup={part.typeGroup}, mimeType={part.mimeType})")
|
||||
elif hasRenderIntent and not hasPartData:
|
||||
logger.warning(f"⚠️ Part {part.id} has render intent but no data, skipping render part")
|
||||
|
||||
# 3. Extract intent: create an extracted ContentPart (NO AI processing here - happens during section generation)
|
||||
if hasExtractIntent:
|
||||
# For images: Keep as image part with extract intent - Vision AI extraction happens during section generation
|
||||
if part.typeGroup == "image" and hasPartData:
|
||||
logger.info(f"📷 Image {part.id} with extract intent - will be processed with Vision AI during section generation")
|
||||
# Keep image part as-is, mark with extract intent
|
||||
part.metadata.update({
|
||||
"contentFormat": "extracted", # Marked for extraction, but not yet extracted
|
||||
"intent": "extract",
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"],
|
||||
"relatedObjectPartId": f"obj_{document.id}_{part.id}" if hasRenderIntent else None,
|
||||
"extractionPrompt": intent.extractionPrompt if intent and intent.extractionPrompt else "Extract all text content from this image.",
|
||||
"needsVisionExtraction": True # Flag to indicate Vision AI extraction needed
|
||||
})
|
||||
allContentParts.append(part)
|
||||
originalPartAdded = True
|
||||
else:
|
||||
# For text/table content: Use directly as extracted (no AI processing here)
|
||||
# AI processing with extractionPrompt happens during section generation
|
||||
if not originalPartAdded:
|
||||
part.metadata.update({
|
||||
"contentFormat": "extracted",
|
||||
"intent": "extract",
|
||||
"fromExtractContent": True,
|
||||
"skipExtraction": True, # Already extracted (raw extraction)
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"],
|
||||
"relatedObjectPartId": f"obj_{document.id}_{part.id}" if hasRenderIntent else None,
|
||||
"extractionPrompt": intent.extractionPrompt if intent and intent.extractionPrompt else None
|
||||
})
|
||||
# Make sure contentFormat is set
|
||||
if "contentFormat" not in part.metadata:
|
||||
part.metadata["contentFormat"] = "extracted"
|
||||
allContentParts.append(part)
|
||||
originalPartAdded = True
|
||||
logger.debug(f"✅ Using pre-extracted ContentPart {part.id} as extracted (no AI processing needed)")
|
||||
|
||||
# 4. Fallback: no intent present, or the part has not been added yet
|
||||
# (should not normally happen, since the default is "extract")
|
||||
if not hasReferenceIntent and not hasRenderIntent and not hasExtractIntent and not originalPartAdded:
|
||||
logger.warning(f"⚠️ Part {part.id} has no recognized intents, adding as extracted by default")
|
||||
part.metadata.update({
|
||||
"contentFormat": "extracted",
|
||||
"intent": "extract",
|
||||
"fromExtractContent": True,
|
||||
"skipExtraction": True,
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"]
|
||||
})
|
||||
allContentParts.append(part)
|
||||
originalPartAdded = True
|
||||
|
||||
logger.info(f"✅ Using {len([p for p in contentExtracted.parts if p.data and len(str(p.data)) > 0])} pre-extracted ContentParts from ContentExtracted document {document.fileName}")
|
||||
logger.info(f" Original document: {preExtracted['originalDocument']['fileName']}")
|
||||
continue # Skip normal extraction for this document
|
||||
|
||||
# Check if it's standardized JSON format (has "documents" or "sections")
|
||||
if document.mimeType == "application/json":
|
||||
try:
|
||||
docBytes = self.services.interfaceDbComponent.getFileData(document.fileId)
|
||||
if docBytes:
|
||||
docData = docBytes.decode('utf-8')
|
||||
jsonData = json.loads(docData)
|
||||
|
||||
if isinstance(jsonData, dict) and ("documents" in jsonData or "sections" in jsonData):
|
||||
logger.info("Document is already in standardized JSON format, using as reference")
|
||||
# Create reference ContentPart for structured JSON
|
||||
contentPart = ContentPart(
|
||||
id=f"ref_{document.id}",
|
||||
label=f"Reference: {document.fileName}",
|
||||
typeGroup="structure",
|
||||
mimeType="application/json",
|
||||
data=docData,
|
||||
metadata={
|
||||
"contentFormat": "reference",
|
||||
"documentId": document.id,
|
||||
"documentReference": f"docItem:{document.id}:{document.fileName}",
|
||||
"skipExtraction": True,
|
||||
"intent": "reference"
|
||||
}
|
||||
)
|
||||
allContentParts.append(contentPart)
|
||||
logger.info("✅ Using JSON document directly without extraction")
|
||||
continue # Skip normal extraction for this document
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not parse JSON document {document.fileName}, will extract normally: {str(e)}")
|
||||
# Continue with normal extraction
|
||||
|
||||
# Normal extraction path
|
||||
intent = getIntentForDocument(document.id, documentIntents)
|
||||
|
||||
if not intent:
|
||||
# Try to find intent by similar UUID (fix for AI UUID hallucination)
|
||||
correctedIntent = self._findIntentBySimilarId(document.id, documentIntents)
|
||||
if correctedIntent:
|
||||
logger.warning(f"Found intent for document {document.id} using UUID correction (original: {correctedIntent.documentId})")
|
||||
# Create new intent with correct document ID
|
||||
intent = DocumentIntent(
|
||||
documentId=document.id,
|
||||
intents=correctedIntent.intents,
|
||||
extractionPrompt=correctedIntent.extractionPrompt,
|
||||
reasoning=f"Intent matched by UUID similarity (original: {correctedIntent.documentId})"
|
||||
)
|
||||
else:
|
||||
# Default: extract for all documents without an intent
|
||||
logger.warning(f"No intent found for document {document.id}, using default 'extract'")
|
||||
intent = DocumentIntent(
|
||||
documentId=document.id,
|
||||
intents=["extract"],
|
||||
extractionPrompt="Extract all content from the document",
|
||||
reasoning="Default intent: no specific intent found"
|
||||
)
|
||||
|
||||
# IMPORTANT: Check all intents - one document can produce multiple ContentParts
|
||||
|
||||
if "reference" in intent.intents:
|
||||
# Create the reference ContentPart
|
||||
contentPart = ContentPart(
|
||||
id=f"ref_{document.id}",
|
||||
label=f"Reference: {document.fileName}",
|
||||
typeGroup="reference",
|
||||
mimeType=document.mimeType,
|
||||
data="",
|
||||
metadata={
|
||||
"contentFormat": "reference",
|
||||
"documentId": document.id,
|
||||
"documentReference": f"docItem:{document.id}:{document.fileName}",
|
||||
"intent": "reference",
|
||||
"usageHint": f"Reference document: {document.fileName}"
|
||||
}
|
||||
)
|
||||
allContentParts.append(contentPart)
|
||||
|
||||
# IMPORTANT: "render" and "extract" can both be present!
|
||||
# In that case we create BOTH ContentParts
|
||||
|
||||
# SAFETY: For images with any intent, always create object part for later rendering
|
||||
isImageDocument = document.mimeType and document.mimeType.startswith("image/")
|
||||
shouldAutoRender = isImageDocument and "render" not in intent.intents and ("extract" in intent.intents or "reference" in intent.intents)
|
||||
if shouldAutoRender:
|
||||
logger.info(f"🖼️ Auto-adding render for image document {document.id} (original intents: {intent.intents})")
|
||||
|
||||
if "render" in intent.intents or shouldAutoRender:
|
||||
# For images/binary: extract as an object
|
||||
if document.mimeType and (document.mimeType.startswith("image/") or self._isBinary(document.mimeType)):
|
||||
try:
|
||||
# Load the binary data (getFileData is not async - no await needed)
|
||||
binaryData = self.services.interfaceDbComponent.getFileData(document.fileId)
|
||||
if not binaryData:
|
||||
logger.warning(f"No binary data found for document {document.id}")
|
||||
continue
|
||||
base64Data = base64.b64encode(binaryData).decode('utf-8')
|
||||
|
||||
contentPart = ContentPart(
|
||||
id=f"obj_{document.id}",
|
||||
label=f"Object: {document.fileName}",
|
||||
typeGroup="image" if document.mimeType.startswith("image/") else "binary",
|
||||
mimeType=document.mimeType,
|
||||
data=base64Data,
|
||||
metadata={
|
||||
"contentFormat": "object",
|
||||
"documentId": document.id,
|
||||
"intent": "render",
|
||||
"usageHint": f"Render as visual element: {document.fileName}",
|
||||
"originalFileName": document.fileName,
|
||||
# Link to the extracted part (if present)
|
||||
"relatedExtractedPartId": f"ext_{document.id}" if "extract" in intent.intents else None
|
||||
}
|
||||
)
|
||||
allContentParts.append(contentPart)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to load binary data for document {document.id}: {str(e)}")
|
||||
|
||||
if "extract" in intent.intents:
|
||||
# Extract content with the extraction service
|
||||
extractionPrompt = intent.extractionPrompt or "Extract all content from the document"
|
||||
|
||||
# Debug log (harmonized)
|
||||
self.services.utils.writeDebugFile(
|
||||
extractionPrompt,
|
||||
f"content_extraction_prompt_{document.id}"
|
||||
)
|
||||
|
||||
# Run the extraction
|
||||
|
||||
extractionOptions = ExtractionOptions(
|
||||
prompt=extractionPrompt,
|
||||
mergeStrategy=MergeStrategy()
|
||||
)
|
||||
|
||||
# extractContent is not async - no await needed
|
||||
checkWorkflowStopped(self.services)
|
||||
extractedResults = self.services.extraction.extractContent(
|
||||
[document],
|
||||
extractionOptions,
|
||||
operationId=extractionOperationId,
|
||||
parentOperationId=extractionOperationId
|
||||
)
|
||||
|
||||
# Convert the extracted results to ContentParts with metadata
|
||||
# Check if object part exists (either explicit render or auto-render for images)
|
||||
hasObjectPart = "render" in intent.intents or shouldAutoRender
|
||||
|
||||
for extracted in extractedResults:
|
||||
for part in extracted.parts:
|
||||
# Mark as extracted format
|
||||
part.metadata.update({
|
||||
"contentFormat": "extracted",
|
||||
"documentId": document.id,
|
||||
"extractionPrompt": extractionPrompt,
|
||||
"intent": "extract",
|
||||
"usageHint": f"Use extracted content from {document.fileName}",
|
||||
# Link to the object part (if present - including auto-render for images)
|
||||
"relatedObjectPartId": f"obj_{document.id}" if hasObjectPart else None
|
||||
})
|
||||
|
||||
# For images: Mark that Vision AI extraction is needed during section generation
|
||||
if part.typeGroup == "image":
|
||||
part.metadata["needsVisionExtraction"] = True
|
||||
logger.info(f"📷 Image part {part.id} marked for Vision AI extraction during section generation")
|
||||
|
||||
# Make sure the ID is unique (in case an object part exists)
|
||||
if hasObjectPart:
|
||||
part.id = f"ext_{document.id}_{part.id}"
|
||||
allContentParts.append(part)
|
||||
|
||||
# Debug log (harmonized)
|
||||
self.services.utils.writeDebugFile(
|
||||
json.dumps([part.dict() for part in allContentParts], indent=2, default=str),
|
||||
"content_extraction_result"
|
||||
)
|
||||
|
||||
# State 2 Validation: Validate and auto-fix ContentParts
|
||||
validatedParts = []
|
||||
for part in allContentParts:
|
||||
# Validation 2.1: Skip ContentParts without documentId
|
||||
if not part.metadata.get("documentId"):
|
||||
logger.warning(f"Skipping ContentPart {part.id} - missing documentId in metadata")
|
||||
continue
|
||||
|
||||
# Validation 2.2: Skip ContentParts with invalid contentFormat
|
||||
contentFormat = part.metadata.get("contentFormat")
|
||||
if contentFormat not in ["extracted", "object", "reference"]:
|
||||
logger.warning(
|
||||
f"Skipping ContentPart {part.id} - invalid contentFormat: {contentFormat}"
|
||||
)
|
||||
continue
|
||||
|
||||
validatedParts.append(part)
|
||||
|
||||
# Finish the ChatLog
|
||||
self.services.chat.progressLogFinish(extractionOperationId, True)
|
||||
|
||||
return validatedParts
|
||||
|
||||
except Exception as e:
|
||||
self.services.chat.progressLogFinish(extractionOperationId, False)
|
||||
logger.error(f"Error in extractAndPrepareContent: {str(e)}")
|
||||
raise
|
||||
|
||||
async def extractTextFromImage(self, imagePart: ContentPart, extractionPrompt: str) -> Optional[str]:
|
||||
"""
|
||||
Extract text from an image part using Vision AI.
|
||||
|
||||
Args:
|
||||
imagePart: ContentPart with typeGroup="image"
|
||||
extractionPrompt: Prompt for the text extraction
|
||||
|
||||
Returns:
|
||||
Extracted text, or an "[ERROR: ...]" string on failure
|
||||
"""
|
||||
try:
|
||||
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
|
||||
|
||||
# Final extraction prompt
|
||||
finalPrompt = extractionPrompt or "Extract all text content from this image. Return only the extracted text, no additional formatting."
|
||||
|
||||
# Debug log (harmonized)
|
||||
self.services.utils.writeDebugFile(
|
||||
finalPrompt,
|
||||
f"content_extraction_prompt_image_{imagePart.id}"
|
||||
)
|
||||
|
||||
# Create the AI call request with the image part
|
||||
request = AiCallRequest(
|
||||
prompt=finalPrompt,
|
||||
context="",
|
||||
options=AiCallOptions(operationType=OperationTypeEnum.IMAGE_ANALYSE),
|
||||
contentParts=[imagePart]
|
||||
)
|
||||
|
||||
# Use the AI service for Vision AI processing
|
||||
checkWorkflowStopped(self.services)
|
||||
response = await self.aiService.callAi(request)
|
||||
|
||||
# Debug log for the response (harmonized)
|
||||
if response and response.content:
|
||||
self.services.utils.writeDebugFile(
|
||||
response.content,
|
||||
f"content_extraction_response_image_{imagePart.id}"
|
||||
)
|
||||
|
||||
if response and response.content:
|
||||
return response.content.strip()
|
||||
|
||||
# No content returned - return an error message for debugging
|
||||
errorMsg = f"Vision AI extraction failed: No content returned for image {imagePart.id}"
|
||||
logger.warning(errorMsg)
|
||||
return f"[ERROR: {errorMsg}]"
|
||||
except Exception as e:
|
||||
errorMsg = f"Vision AI extraction failed for image {imagePart.id}: {str(e)}"
|
||||
logger.error(errorMsg)
|
||||
import traceback
|
||||
logger.debug(f"Traceback: {traceback.format_exc()}")
|
||||
# Return an error message instead of None for debugging
|
||||
return f"[ERROR: {errorMsg}]"
|
||||
|
||||
async def processTextContentWithAi(self, textPart: ContentPart, extractionPrompt: str) -> Optional[str]:
|
||||
"""
|
||||
Process text content with AI based on extractionPrompt.
|
||||
|
||||
IMPORTANT: Pre-extracted ContentParts from context.extractContent contain RAW extracted text
|
||||
(e.g. from a PDF text layer). If the "extract" intent is present, this text must be processed
|
||||
with AI (transformation, structuring, etc.) based on extractionPrompt.
|
||||
|
||||
Args:
|
||||
textPart: ContentPart with typeGroup="text" (or another text-based type)
|
||||
extractionPrompt: Prompt for the AI processing of the text
|
||||
|
||||
Returns:
|
||||
AI-processed text, or an "[ERROR: ...]" string on failure
|
||||
"""
|
||||
try:
|
||||
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
|
||||
|
||||
# Final extraction prompt
|
||||
finalPrompt = extractionPrompt or "Process and extract the key information from the following text content."
|
||||
|
||||
# Debug log (harmonized) - log prompt with text preview
|
||||
textPreview = textPart.data[:500] + "..." if textPart.data and len(textPart.data) > 500 else (textPart.data or "")
|
||||
promptWithContext = f"{finalPrompt}\n\n--- Text Content (preview) ---\n{textPreview}"
|
||||
self.services.utils.writeDebugFile(
|
||||
promptWithContext,
|
||||
f"content_extraction_prompt_text_{textPart.id}"
|
||||
)
|
||||
|
||||
# Create a text ContentPart for AI processing
|
||||
# Use the existing text as input
|
||||
textContentPart = ContentPart(
|
||||
id=textPart.id,
|
||||
label=textPart.label,
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data=textPart.data if textPart.data else "",
|
||||
metadata=textPart.metadata.copy() if textPart.metadata else {}
|
||||
)
|
||||
|
||||
# Create the AI call request with the text part
|
||||
request = AiCallRequest(
|
||||
prompt=finalPrompt,
|
||||
context="",
|
||||
options=AiCallOptions(operationType=OperationTypeEnum.DATA_EXTRACT),
|
||||
contentParts=[textContentPart]
|
||||
)
|
||||
|
||||
# Use the AI service for text processing
|
||||
checkWorkflowStopped(self.services)
|
||||
response = await self.aiService.callAi(request)
|
||||
|
||||
# Debug log for the response (harmonized)
|
||||
if response and response.content:
|
||||
self.services.utils.writeDebugFile(
|
||||
response.content,
|
||||
f"content_extraction_response_text_{textPart.id}"
|
||||
)
|
||||
|
||||
if response and response.content:
|
||||
return response.content.strip()
|
||||
|
||||
# No content returned - return an error message for debugging
|
||||
errorMsg = f"AI text processing failed: No content returned for text part {textPart.id}"
|
||||
logger.warning(errorMsg)
|
||||
return f"[ERROR: {errorMsg}]"
|
||||
except Exception as e:
|
||||
errorMsg = f"AI text processing failed for text part {textPart.id}: {str(e)}"
|
||||
logger.error(errorMsg)
|
||||
import traceback
|
||||
logger.debug(f"Traceback: {traceback.format_exc()}")
|
||||
# Return an error message instead of None for debugging
|
||||
return f"[ERROR: {errorMsg}]"
|
||||
|
||||
def _isBinary(self, mimeType: str) -> bool:
|
||||
"""Check whether the MIME type is binary."""
|
||||
binaryTypes = [
|
||||
"application/octet-stream",
|
||||
"application/pdf",
|
||||
"application/zip",
|
||||
"application/x-zip-compressed"
|
||||
]
|
||||
return mimeType in binaryTypes or mimeType.startswith("image/") or mimeType.startswith("video/") or mimeType.startswith("audio/")
|
||||
|
||||
def _extractNestedPartsFromStructure(
|
||||
self,
|
||||
structurePart: ContentPart,
|
||||
document: ChatDocument,
|
||||
preExtracted: Dict[str, Any],
|
||||
intent: Optional[Any]
|
||||
) -> List[ContentPart]:
|
||||
"""
|
||||
Extract nested parts from a structure ContentPart (e.g., JSON with documentData.parts).
|
||||
|
||||
This is a generic function that analyzes pre-processed ContentParts and extracts
|
||||
any nested parts that are embedded in structure data (typically JSON).
|
||||
|
||||
Works with standard ContentExtracted format: documentData.parts array.
|
||||
Each nested part is extracted as a separate ContentPart with proper metadata.
|
||||
|
||||
Args:
|
||||
structurePart: ContentPart with typeGroup="structure" containing nested parts
|
||||
document: The document this part belongs to
|
||||
preExtracted: Pre-extracted document metadata
|
||||
intent: Document intent for nested parts
|
||||
|
||||
Returns:
|
||||
List of extracted ContentParts, empty if no nested parts found
|
||||
"""
|
||||
nestedParts = []
|
||||
|
||||
try:
|
||||
# Parse JSON structure
|
||||
jsonData = json.loads(structurePart.data)
|
||||
|
||||
# Check for standard ContentExtracted format: documentData.parts
|
||||
if isinstance(jsonData, dict):
|
||||
documentData = jsonData.get("documentData")
|
||||
if isinstance(documentData, dict):
|
||||
parts = documentData.get("parts", [])
|
||||
if isinstance(parts, list) and len(parts) > 0:
|
||||
# Extract each nested part
|
||||
for nestedPartData in parts:
|
||||
if not isinstance(nestedPartData, dict):
|
||||
continue
|
||||
|
||||
nestedPartId = nestedPartData.get("id") or f"nested_{len(nestedParts)}"
|
||||
nestedTypeGroup = nestedPartData.get("typeGroup", "text")
|
||||
nestedMimeType = nestedPartData.get("mimeType", "text/plain")
|
||||
nestedLabel = nestedPartData.get("label", structurePart.label)
|
||||
nestedData = nestedPartData.get("data", "")
|
||||
nestedMetadata = nestedPartData.get("metadata", {})
|
||||
|
||||
# Create ContentPart for nested part
|
||||
nestedPart = ContentPart(
|
||||
id=f"{structurePart.id}_{nestedPartId}",
|
||||
parentId=structurePart.id,
|
||||
label=nestedLabel,
|
||||
typeGroup=nestedTypeGroup,
|
||||
mimeType=nestedMimeType,
|
||||
data=nestedData,
|
||||
metadata={
|
||||
**nestedMetadata,
|
||||
"documentId": document.id,
|
||||
"fromNestedStructure": True,
|
||||
"parentStructurePartId": structurePart.id,
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"]
|
||||
}
|
||||
)
|
||||
|
||||
nestedParts.append(nestedPart)
|
||||
logger.debug(f"✅ Extracted nested part: {nestedPart.id} (typeGroup={nestedTypeGroup}, mimeType={nestedMimeType})")
|
||||
|
||||
# If no nested parts found, return empty list (original part will be kept)
|
||||
if not nestedParts:
|
||||
logger.debug(f"No nested parts found in structure part {structurePart.id}")
|
||||
|
||||
except json.JSONDecodeError as e:
|
||||
logger.warning(f"Could not parse structure part {structurePart.id} as JSON: {str(e)}")
|
||||
except Exception as e:
|
||||
logger.error(f"Error extracting nested parts from structure part {structurePart.id}: {str(e)}")
|
||||
|
||||
return nestedParts
|
||||
|
||||
def _findIntentBySimilarId(self, documentId: str, documentIntents: List[DocumentIntent]) -> Optional[DocumentIntent]:
|
||||
"""
|
||||
Tries to find an intent whose UUID is similar to the given document ID.
|
||||
This helps with AI UUID hallucinations (e.g. 4451 -> 4551).
|
||||
|
||||
Args:
|
||||
documentId: The document ID to find an intent for
|
||||
documentIntents: List of all available DocumentIntents
|
||||
|
||||
Returns:
|
||||
DocumentIntent with a similar UUID if found, otherwise None
|
||||
"""
|
||||
if not documentId or len(documentId) != 36: # UUID Format: 8-4-4-4-12
|
||||
return None
|
||||
|
||||
# Check that it is a UUID (format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
|
||||
if documentId.count('-') != 4:
|
||||
return None
|
||||
|
||||
for intent in documentIntents:
|
||||
intentId = intent.documentId
|
||||
if len(intentId) != 36:
|
||||
continue
|
||||
|
||||
# Count differing characters
|
||||
differences = sum(c1 != c2 for c1, c2 in zip(documentId, intentId))
|
||||
|
||||
# If at most two characters differ, it is probably a typo
|
||||
if differences <= 2:
|
||||
# Check that the structure is similar (same number of hyphens; positions are not compared)
|
||||
if documentId.count('-') == intentId.count('-'):
|
||||
return intent
|
||||
|
||||
return None
|
||||
|
||||
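A minimal, standalone sketch of the UUID-similarity matching that `_findIntentBySimilarId` describes. The names `looksLikeUuid`, `findSimilarId`, and `HYPHEN_POSITIONS` are hypothetical and not part of this commit. It differs from the method above in two ways, both assumptions rather than the author's design: it compares hyphen positions instead of only counting hyphens, and it returns the closest candidate rather than the first one within the threshold.

```python
from typing import Iterable, Optional

# Hyphen indices in the canonical 8-4-4-4-12 UUID layout (assumed format).
HYPHEN_POSITIONS = (8, 13, 18, 23)


def looksLikeUuid(value: str) -> bool:
    """Return True if value has the 36-character UUID layout with hyphens in place."""
    return (
        isinstance(value, str)
        and len(value) == 36
        and value.count("-") == 4
        and all(value[i] == "-" for i in HYPHEN_POSITIONS)
    )


def findSimilarId(targetId: str, candidateIds: Iterable[str], maxDifferences: int = 2) -> Optional[str]:
    """Return the candidate with the fewest differing characters, or None if none is close enough."""
    if not looksLikeUuid(targetId):
        return None
    best, bestDistance = None, maxDifferences + 1
    for candidate in candidateIds:
        if not looksLikeUuid(candidate):
            continue
        # Hamming distance: count positions where the two strings differ.
        distance = sum(a != b for a, b in zip(targetId, candidate))
        if distance < bestDistance:
            best, bestDistance = candidate, distance
    return best
```

Picking the closest candidate avoids depending on list order when two known IDs both fall within the threshold. Whether that matters in practice depends on how similar real document IDs are.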
369
modules/serviceCenter/services/serviceAi/subDocumentIntents.py
Normal file
|
|
@ -0,0 +1,369 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Document Intent Analysis Module
|
||||
|
||||
Handles analysis of document intents, including:
|
||||
- Clarifying which documents need extraction vs reference
|
||||
- Resolving pre-extracted documents
|
||||
- Building intent analysis prompts
|
||||
"""
|
||||
import json
|
||||
import logging
|
||||
from typing import Dict, Any, List, Optional
|
||||
|
||||
from modules.datamodels.datamodelChat import ChatDocument
|
||||
from modules.datamodels.datamodelExtraction import ContentExtracted, DocumentIntent  # ContentExtracted is used in resolvePreExtractedDocument
|
||||
from modules.workflows.processing.shared.stateTools import checkWorkflowStopped
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class DocumentIntentAnalyzer:
|
||||
"""Handles document intent analysis and resolution."""
|
||||
|
||||
def __init__(self, services, aiService):
|
||||
"""Initialize DocumentIntentAnalyzer with service center and AI service access."""
|
||||
self.services = services
|
||||
self.aiService = aiService
|
||||
|
||||
async def clarifyDocumentIntents(
|
||||
self,
|
||||
documents: List[ChatDocument],
|
||||
userPrompt: str,
|
||||
actionParameters: Dict[str, Any],
|
||||
parentOperationId: str
|
||||
) -> List[DocumentIntent]:
|
||||
"""
|
||||
Phase 5A: Analyzes which documents need extraction vs. reference.
|
||||
Returns a DocumentIntent for each document.
|
||||
|
||||
Args:
|
||||
documents: List of documents to process
|
||||
userPrompt: The user's request
|
||||
actionParameters: Action-specific parameters (e.g. resultType, outputFormat)
|
||||
parentOperationId: Parent operation ID for the ChatLog hierarchy
|
||||
|
||||
Returns:
|
||||
List of DocumentIntent objects
|
||||
"""
|
||||
# Create the operation ID for intent analysis
|
||||
intentOperationId = f"{parentOperationId}_intent_analysis"
|
||||
|
||||
# Start the ChatLog with a parent reference
|
||||
self.services.chat.progressLogStart(
|
||||
intentOperationId,
|
||||
"Document Intent Analysis",
|
||||
"Intent Analysis",
|
||||
f"Analyzing {len(documents)} documents",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
try:
|
||||
# Map pre-extracted JSONs to their original document IDs for intent analysis
|
||||
documentMapping = {} # Maps original doc ID -> JSON doc ID
|
||||
resolvedDocuments = []
|
||||
|
||||
for doc in documents:
|
||||
preExtracted = self.resolvePreExtractedDocument(doc)
|
||||
if preExtracted:
|
||||
originalDocId = preExtracted["originalDocument"]["id"]
|
||||
documentMapping[originalDocId] = doc.id
|
||||
# Create a temporary ChatDocument for the original document
|
||||
originalDoc = ChatDocument(
|
||||
id=originalDocId,
|
||||
fileName=preExtracted["originalDocument"]["fileName"],
|
||||
mimeType=preExtracted["originalDocument"]["mimeType"],
|
||||
fileSize=preExtracted["originalDocument"].get("fileSize", doc.fileSize),
|
||||
fileId=doc.fileId, # Keep the fileId of the JSON document
|
||||
messageId=doc.messageId if hasattr(doc, 'messageId') else None # Keep the messageId if present
|
||||
)
|
||||
resolvedDocuments.append(originalDoc)
|
||||
else:
|
||||
resolvedDocuments.append(doc)
|
||||
|
||||
# Build the intent analysis prompt from the original documents
|
||||
intentPrompt = self._buildIntentAnalysisPrompt(userPrompt, resolvedDocuments, actionParameters)
|
||||
|
||||
# AI call (use callAiPlanning for simple JSON responses)
|
||||
# Debug logs are already written by callAiPlanning
|
||||
checkWorkflowStopped(self.services)
|
||||
aiResponse = await self.aiService.callAiPlanning(
|
||||
prompt=intentPrompt,
|
||||
debugType="document_intent_analysis"
|
||||
)
|
||||
|
||||
# Parse the result and map back to JSON document IDs where needed
|
||||
intentsData = json.loads(self.services.utils.jsonExtractString(aiResponse))
|
||||
documentIntents = []
|
||||
for intent in intentsData.get("intents", []):
|
||||
docId = intent.get("documentId")
|
||||
# If the intent targets the original document, map it back to the JSON document ID
|
||||
if docId in documentMapping:
|
||||
intent["documentId"] = documentMapping[docId]
|
||||
documentIntents.append(DocumentIntent(**intent))
|
||||
|
||||
# Debug log (harmonized)
|
||||
self.services.utils.writeDebugFile(
|
||||
json.dumps([intent.dict() for intent in documentIntents], indent=2),
|
||||
"document_intent_analysis_result"
|
||||
)
|
||||
|
||||
# State 1 Validation: Validate and auto-fix document intents
|
||||
documentIds = {d.id for d in documents}
|
||||
validatedIntents = []
|
||||
|
||||
for intent in documentIntents:
|
||||
# Validation 1.2: Skip intents for unknown documents
|
||||
if intent.documentId not in documentIds:
|
||||
# Try to find similar UUID (fix AI hallucination/typo)
|
||||
correctedDocId = self._findSimilarDocumentId(intent.documentId, documentIds)
|
||||
if correctedDocId:
|
||||
logger.warning(f"Corrected UUID typo in AI response: {intent.documentId} -> {correctedDocId}")
|
||||
intent.documentId = correctedDocId
|
||||
else:
|
||||
logger.warning(f"Skipping intent for unknown document: {intent.documentId}")
|
||||
continue
|
||||
validatedIntents.append(intent)
|
||||
|
||||
# Validation 1.1: Documents without intents are OK (not needed)
|
||||
# Intents for non-existing documents are already filtered above
|
||||
documentIntents = validatedIntents
|
||||
|
||||
# Finish the ChatLog
|
||||
self.services.chat.progressLogFinish(intentOperationId, True)
|
||||
|
||||
return documentIntents
|
||||
|
||||
except Exception as e:
|
||||
self.services.chat.progressLogFinish(intentOperationId, False)
|
||||
logger.error(f"Error in clarifyDocumentIntents: {str(e)}")
|
||||
raise
|
||||
|
||||
def resolvePreExtractedDocument(self, document: ChatDocument) -> Optional[Dict[str, Any]]:
|
||||
"""
|
||||
Checks whether a JSON document already contains extracted ContentParts.
|
||||
Returns a dict with:
|
||||
- originalDocument: ChatDocument info of the original document
|
||||
- contentExtracted: ContentExtracted object with parts
|
||||
- parts: List of ContentParts
|
||||
|
||||
Returns None if no pre-extracted format is detected.
|
||||
"""
|
||||
if document.mimeType != "application/json":
|
||||
logger.debug(f"Document {document.id} is not JSON (mimeType={document.mimeType}), skipping pre-extracted check")
|
||||
return None
|
||||
|
||||
try:
|
||||
docBytes = self.services.interfaceDbComponent.getFileData(document.fileId)
|
||||
if not docBytes:
|
||||
return None
|
||||
|
||||
docData = docBytes.decode('utf-8')
|
||||
jsonData = json.loads(docData)
|
||||
|
||||
if not isinstance(jsonData, dict):
|
||||
return None
|
||||
|
||||
# Check for ContentExtracted format
|
||||
# Only format 1 (ActionDocument format with validationMetadata) is supported
|
||||
documentData = None
|
||||
|
||||
validationMetadata = jsonData.get("validationMetadata", {})
|
||||
actionType = validationMetadata.get("actionType")
|
||||
logger.debug(f"JSON document {document.id}: validationMetadata.actionType={actionType}, keys={list(jsonData.keys())}")
|
||||
|
||||
if actionType == "context.extractContent":
|
||||
# Format: {"validationMetadata": {"actionType": "context.extractContent"}, "documentData": {...}}
|
||||
documentData = jsonData.get("documentData")
|
||||
logger.debug(f"Found ContentExtracted via validationMetadata for {document.fileName}, documentData keys: {list(documentData.keys()) if documentData else None}")
|
||||
else:
|
||||
logger.debug(f"JSON document {document.id} does not have actionType='context.extractContent' (got: {actionType})")
|
||||
|
||||
if documentData:
|
||||
|
||||
try:
|
||||
# Make sure "id" is present
|
||||
if "id" not in documentData:
|
||||
documentData["id"] = document.id
|
||||
|
||||
contentExtracted = ContentExtracted(**documentData)
|
||||
|
||||
if contentExtracted.parts:
|
||||
# Extract the original document info from the parts
|
||||
originalDocId = None
|
||||
originalFileName = None
|
||||
originalMimeType = None
|
||||
|
||||
for part in contentExtracted.parts:
|
||||
if part.metadata:
|
||||
# Try to find the original document info
|
||||
if not originalDocId and part.metadata.get("documentId"):
|
||||
originalDocId = part.metadata.get("documentId")
|
||||
if not originalFileName and part.metadata.get("originalFileName"):
|
||||
originalFileName = part.metadata.get("originalFileName")
|
||||
if not originalMimeType and part.metadata.get("documentMimeType"):
|
||||
originalMimeType = part.metadata.get("documentMimeType")
|
||||
|
||||
# If not found, try to derive it from the file name
|
||||
if not originalFileName:
|
||||
# Derive from document.fileName (e.g. "B2025-02c_28_extracted_...json" -> "B2025-02c_28.pdf")
|
||||
if document.fileName and "_extracted_" in document.fileName:
|
||||
originalFileName = document.fileName.split("_extracted_")[0] + ".pdf"
|
||||
|
||||
return {
|
||||
"originalDocument": {
|
||||
"id": originalDocId or document.id,
|
||||
"fileName": originalFileName or document.fileName,
|
||||
"mimeType": originalMimeType or "application/pdf",
|
||||
"fileSize": document.fileSize
|
||||
},
|
||||
"contentExtracted": contentExtracted,
|
||||
"parts": contentExtracted.parts
|
||||
}
|
||||
except Exception as parseError:
|
||||
logger.warning(f"Could not parse ContentExtracted format from {document.fileName}: {str(parseError)}")
|
||||
logger.debug(f"JSON keys: {list(jsonData.keys())}, has parts: {'parts' in jsonData}")
|
||||
import traceback
|
||||
logger.debug(f"Parse error traceback: {traceback.format_exc()}")
|
||||
return None
|
||||
else:
|
||||
logger.debug(f"JSON document {document.id} has no documentData (actionType={actionType})")
|
||||
|
||||
return None
|
||||
except Exception as e:
|
||||
logger.debug(f"Error resolving pre-extracted document {document.fileName}: {str(e)}")
|
||||
return None
|
||||
|
||||
def _buildIntentAnalysisPrompt(
|
||||
self,
|
||||
userPrompt: str,
|
||||
documents: List[ChatDocument],
|
||||
actionParameters: Dict[str, Any]
|
||||
) -> str:
|
||||
"""Build the prompt for intent analysis."""
|
||||
# Build the document list; show the original documents for pre-extracted JSONs
|
||||
docListText = ""
|
||||
for i, doc in enumerate(documents, 1):
|
||||
# Check whether this is a pre-extracted JSON
|
||||
preExtracted = self.resolvePreExtractedDocument(doc)
|
||||
|
||||
if preExtracted:
|
||||
# Show the original document instead of the JSON
|
||||
originalDoc = preExtracted["originalDocument"]
|
||||
partsInfo = f" (contains {len(preExtracted['parts'])} pre-extracted parts: {', '.join([p.typeGroup for p in preExtracted['parts'] if p.data and len(str(p.data)) > 0])})"
|
||||
docListText += f"\n{i}. Document ID: {originalDoc['id']}\n"
|
||||
docListText += f" File Name: {originalDoc['fileName']}{partsInfo}\n"
|
||||
docListText += f" MIME Type: {originalDoc['mimeType']}\n"
|
||||
docListText += f" File Size: {originalDoc.get('fileSize', doc.fileSize)} bytes\n"
|
||||
else:
|
||||
# Regular document
|
||||
docListText += f"\n{i}. Document ID: {doc.id}\n"
|
||||
docListText += f" File Name: {doc.fileName}\n"
|
||||
docListText += f" MIME Type: {doc.mimeType}\n"
|
||||
docListText += f" File Size: {doc.fileSize} bytes\n"
|
||||
|
||||
outputFormat = actionParameters.get("outputFormat", "txt")
|
||||
|
||||
# FENCE user input to prevent prompt injection
|
||||
fencedUserPrompt = f"""```user_request
|
||||
{userPrompt}
|
||||
```"""
|
||||
|
||||
prompt = f"""USER REQUEST:
|
||||
{fencedUserPrompt}
|
||||
|
||||
DOCUMENTS TO ANALYZE:
|
||||
{docListText}
|
||||
|
||||
TASK: For each document, determine its intents (can be multiple):
|
||||
- "extract": Content extraction needed (text, structure, OCR, etc.)
|
||||
- "render": Image/binary should be rendered as-is (visual element)
|
||||
- "reference": Document reference/attachment (no extraction, just reference)
|
||||
|
||||
TASK: For each document, determine:
|
||||
1. Intents (can be multiple): "extract", "render", "reference"
|
||||
Note: Output format and language are NOT determined here - they will be
|
||||
determined during structure generation (Phase 3) in the chapter structure JSON
|
||||
|
||||
OUTPUT FORMAT: {outputFormat} (global fallback - for reference only)
|
||||
|
||||
RETURN JSON:
|
||||
{{
|
||||
"intents": [
|
||||
{{
|
||||
"documentId": "doc_1",
|
||||
"intents": ["extract"],
|
||||
"extractionPrompt": "Extract all text content, preserving structure",
|
||||
"reasoning": "User needs text content for document generation"
|
||||
}},
|
||||
{{
|
||||
"documentId": "doc_2",
|
||||
"intents": ["extract", "render"],
|
||||
"extractionPrompt": "Extract text content from image using vision AI",
|
||||
"reasoning": "Image contains text that needs extraction, but also should be rendered visually"
|
||||
}},
|
||||
{{
|
||||
"documentId": "doc_3",
|
||||
"intents": ["reference"],
|
||||
"extractionPrompt": null,
|
||||
"reasoning": "Document is only used as reference, no extraction needed"
|
||||
}}
|
||||
]
|
||||
}}
|
||||
|
||||
CRITICAL RULES:
|
||||
1. For images (mimeType starts with "image/"):
|
||||
- If user wants to "include" or "show" images → add "render"
|
||||
- If user wants to "analyze", "read text", or "extract text" from images → add "extract"
|
||||
- Can have BOTH "extract" and "render" if image needs both text extraction and visual rendering
|
||||
|
||||
2. For text documents:
|
||||
- If user mentions "template" or "structure" → "reference" or "extract" based on context
|
||||
- If user mentions "reference" or "context" → "reference"
|
||||
- Default → "extract"
|
||||
|
||||
3. Consider output format:
|
||||
- For formats like PDF, DOCX, PPTX: images usually need "render"
|
||||
- For formats like CSV, JSON: usually "extract" only
|
||||
- For HTML: can have both "extract" and "render"
|
||||
|
||||
Return ONLY valid JSON following the structure above.
|
||||
"""
|
||||
return prompt
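
The prompt above wraps the user request in a fixed three-backtick fence. If the user text itself contains a run of three or more backticks, that fence closes early. A minimal sketch of one way to guard against this, assuming a hypothetical helper `fenceUserInput` that is not part of this commit: it picks a fence that is one backtick longer than the longest backtick run in the input.

```python
def fenceUserInput(userPrompt: str, label: str = "user_request") -> str:
    """Wrap user text in a labelled fence that the text itself cannot close.

    Hypothetical helper for illustration; not part of the commit.
    """
    longestRun = 0
    currentRun = 0
    for char in userPrompt:
        # Track the longest consecutive run of backticks in the input
        currentRun = currentRun + 1 if char == "`" else 0
        longestRun = max(longestRun, currentRun)
    # Use at least three backticks, and always more than the input contains in a row
    fence = "`" * max(3, longestRun + 1)
    return f"{fence}{label}\n{userPrompt}\n{fence}"
```

For input without backticks the output matches the fixed fence used above, so the prompt layout would stay unchanged in the common case.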
|
||||
|
||||
def _findSimilarDocumentId(self, incorrectId: str, validIds: set) -> Optional[str]:
|
||||
"""
|
||||
Tries to find a similar document ID in case the AI altered the UUID.
|
||||
Checks for UUID typos (e.g. 4451 -> 4551).
|
||||
|
||||
Args:
|
||||
incorrectId: The incorrect UUID from the AI response
|
||||
validIds: Set of valid document IDs
|
||||
|
||||
Returns:
|
||||
Corrected UUID if found, otherwise None
|
||||
"""
|
||||
if not incorrectId or len(incorrectId) != 36: # UUID Format: 8-4-4-4-12
|
||||
return None
|
||||
|
||||
# Check whether it is a UUID (format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
|
||||
if incorrectId.count('-') != 4:
|
||||
return None
|
||||
|
||||
# Levenshtein-like search (position-wise comparison): check whether only 1-2 characters differ
|
||||
for validId in validIds:
|
||||
if len(validId) != 36:
|
||||
continue
|
||||
|
||||
# Count differing characters
|
||||
differences = sum(c1 != c2 for c1, c2 in zip(incorrectId, validId))
|
||||
|
||||
# If only 1-2 characters differ, it is probably a typo
|
||||
if differences <= 2:
|
||||
# Check that the structure is similar (same number of hyphens)
|
||||
if incorrectId.count('-') == validId.count('-'):
|
||||
return validId
|
||||
|
||||
return None
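
The typo recovery above compares two equal-length UUID strings position by position, which is a Hamming distance rather than a true Levenshtein distance. A minimal standalone sketch of the same idea, assuming a hypothetical function name `findClosestUuid` and a `maxDifferences` parameter that do not appear in the commit:

```python
from typing import Iterable, Optional

UUID_LENGTH = 36  # 8-4-4-4-12 hex digits plus four hyphens


def findClosestUuid(candidate: str, validIds: Iterable[str], maxDifferences: int = 2) -> Optional[str]:
    """Return the first valid ID differing from candidate in at most maxDifferences positions.

    Hypothetical helper for illustration; not part of the commit.
    """
    if not candidate or len(candidate) != UUID_LENGTH or candidate.count("-") != 4:
        return None
    for validId in validIds:
        if len(validId) != UUID_LENGTH:
            continue
        # Hamming distance: number of positions where the characters differ
        differences = sum(a != b for a, b in zip(candidate, validId))
        if differences <= maxDifferences:
            return validId
    return None
```

Because the first match within the threshold wins, the result can depend on iteration order when `validIds` is a set and two valid IDs are both within two characters of the candidate.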
|
||||
|
||||
2081
modules/serviceCenter/services/serviceAi/subJsonMerger.py
Normal file
File diff suppressed because it is too large
3121
modules/serviceCenter/services/serviceAi/subJsonResponseHandling.py
Normal file
File diff suppressed because it is too large
293
modules/serviceCenter/services/serviceAi/subLoopingUseCases.py
Normal file
|
|
@@ -0,0 +1,293 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Generic Looping Use Case System
|
||||
|
||||
Provides parametrized looping infrastructure supporting different JSON formats and use cases.
|
||||
"""
|
||||
|
||||
import logging
|
||||
from dataclasses import dataclass, field
|
||||
from typing import Dict, Any, List, Optional, Callable
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Callback functions for use-case-specific logic
|
||||
|
||||
def _handleSectionContentFinalResult(result: str, parsedJsonForUseCase: Any, extractedJsonForUseCase: str,
|
||||
debugPrefix: str, services: Any) -> str:
|
||||
"""Handle final result for section_content: return raw result to preserve all JSON blocks."""
|
||||
final_json = result # Return raw response to preserve all JSON blocks
|
||||
# Write final merged result for section_content (overwrites iteration 1 response with complete merged result)
|
||||
if services and hasattr(services, 'utils') and hasattr(services.utils, 'writeDebugFile'):
|
||||
services.utils.writeDebugFile(final_json, f"{debugPrefix}_response")
|
||||
return final_json
|
||||
|
||||
|
||||
def _handleChapterStructureFinalResult(result: str, parsedJsonForUseCase: Any, extractedJsonForUseCase: str,
|
||||
debugPrefix: str, services: Any) -> str:
|
||||
"""Handle final result for chapter_structure: format JSON and write debug file."""
|
||||
import json
|
||||
final_json = json.dumps(parsedJsonForUseCase, indent=2, ensure_ascii=False) if parsedJsonForUseCase else (extractedJsonForUseCase or result)
|
||||
# Write final result for chapter structure
|
||||
if services and hasattr(services, 'utils') and hasattr(services.utils, 'writeDebugFile'):
|
||||
services.utils.writeDebugFile(final_json, f"{debugPrefix}_final_result")
|
||||
return final_json
|
||||
|
||||
|
||||
def _handleCodeStructureFinalResult(result: str, parsedJsonForUseCase: Any, extractedJsonForUseCase: str,
|
||||
debugPrefix: str, services: Any) -> str:
|
||||
"""Handle final result for code_structure: format JSON and write debug file."""
|
||||
import json
|
||||
final_json = json.dumps(parsedJsonForUseCase, indent=2, ensure_ascii=False) if parsedJsonForUseCase else (extractedJsonForUseCase or result)
|
||||
# Write final result for code structure
|
||||
if services and hasattr(services, 'utils') and hasattr(services.utils, 'writeDebugFile'):
|
||||
services.utils.writeDebugFile(final_json, f"{debugPrefix}_final_result")
|
||||
return final_json
|
||||
|
||||
|
||||
def _handleCodeContentFinalResult(result: str, parsedJsonForUseCase: Any, extractedJsonForUseCase: str,
|
||||
debugPrefix: str, services: Any) -> str:
|
||||
"""Handle final result for code_content: format JSON."""
|
||||
import json
|
||||
final_json = json.dumps(parsedJsonForUseCase, indent=2, ensure_ascii=False) if parsedJsonForUseCase else (extractedJsonForUseCase or result)
|
||||
return final_json
|
||||
|
||||
|
||||
def _normalizeSectionContentJson(parsed: Any, useCaseId: str) -> Any:
|
||||
"""Normalize JSON structure for section_content use case."""
|
||||
# For section_content, expect {"elements": [...]} structure
|
||||
if isinstance(parsed, list):
|
||||
# Check if list contains strings (invalid format) or element objects
|
||||
if parsed and isinstance(parsed[0], str):
|
||||
# Invalid format - list of strings instead of elements
|
||||
# Try to convert strings to paragraph elements as fallback
|
||||
logger.debug("Received list of strings instead of elements array, converting to paragraph elements")
|
||||
elements = []
|
||||
for text in parsed:
|
||||
if isinstance(text, str) and text.strip():
|
||||
elements.append({
|
||||
"type": "paragraph",
|
||||
"content": {
|
||||
"text": text.strip()
|
||||
}
|
||||
})
|
||||
return {"elements": elements}
|
||||
else:
|
||||
# Convert plain list of elements to elements structure
|
||||
return {"elements": parsed}
|
||||
elif isinstance(parsed, dict):
|
||||
# If it already has "elements", return as-is
|
||||
if "elements" in parsed:
|
||||
return parsed
|
||||
# If it has "type" and looks like an element, wrap in elements array
|
||||
elif parsed.get("type"):
|
||||
return {"elements": [parsed]}
|
||||
# Otherwise, assume it's already in correct format
|
||||
else:
|
||||
return parsed
|
||||
|
||||
# For other use cases, return as-is (they have their own structures)
|
||||
return parsed
|
||||
|
||||
|
||||
def _normalizeDefaultJson(parsed: Any, useCaseId: str) -> Any:
|
||||
"""Default normalizer: return as-is."""
|
||||
return parsed
|
||||
|
||||
|
||||
@dataclass
|
||||
class LoopingUseCase:
|
||||
"""Configuration for a specific looping use case."""
|
||||
|
||||
# Identification
|
||||
useCaseId: str # "section_content", "chapter_structure", "code_structure", "code_content"
|
||||
|
||||
# JSON Format Detection
|
||||
jsonTemplate: Dict[str, Any] # Expected JSON structure template
|
||||
detectionKeys: List[str] # Keys to check for format detection (e.g., ["elements"], ["chapters"], ["files"])
|
||||
detectionPath: str # JSONPath to check (e.g., "documents[0].chapters", "files[0].content")
|
||||
|
||||
# Prompt Building
|
||||
initialPromptBuilder: Optional[Callable] = None # Function to build initial prompt
|
||||
continuationPromptBuilder: Optional[Callable] = None # Function to build continuation prompt
|
||||
|
||||
# Accumulation & Merging
|
||||
accumulator: Optional[Callable] = None # Function to accumulate fragments
|
||||
merger: Optional[Callable] = None # Function to merge accumulated data
|
||||
|
||||
# Continuation Context
|
||||
continuationContextBuilder: Optional[Callable] = None # Build continuation context for this format
|
||||
|
||||
# Result Building
|
||||
resultBuilder: Optional[Callable] = None # Build final result from accumulated data
|
||||
|
||||
# Use-case-specific handlers (callbacks to avoid if/elif chains in generic code)
|
||||
finalResultHandler: Optional[Callable] = None # Handle final result formatting and debug file writing
|
||||
jsonNormalizer: Optional[Callable] = None # Normalize JSON structure for this use case
|
||||
|
||||
# Metadata
|
||||
supportsAccumulation: bool = True # Whether this use case supports accumulation
|
||||
requiresExtraction: bool = False # Whether this requires extraction (like sections)
|
||||
|
||||
|
||||
class LoopingUseCaseRegistry:
|
||||
"""Registry of all looping use cases."""
|
||||
|
||||
def __init__(self):
|
||||
self.useCases: Dict[str, LoopingUseCase] = {}
|
||||
self._registerDefaultUseCases()
|
||||
|
||||
def register(self, useCase: LoopingUseCase):
|
||||
"""Register a new use case."""
|
||||
self.useCases[useCase.useCaseId] = useCase
|
||||
logger.debug(f"Registered looping use case: {useCase.useCaseId}")
|
||||
|
||||
def get(self, useCaseId: str) -> Optional[LoopingUseCase]:
|
||||
"""Get use case by ID."""
|
||||
return self.useCases.get(useCaseId)
|
||||
|
||||
def detectUseCase(self, parsedJson: Dict[str, Any]) -> Optional[str]:
|
||||
"""Detect which use case matches the JSON structure."""
|
||||
for useCaseId, useCase in self.useCases.items():
|
||||
if self._matchesFormat(parsedJson, useCase):
|
||||
return useCaseId
|
||||
return None
|
||||
|
||||
def _matchesFormat(self, json: Dict[str, Any], useCase: LoopingUseCase) -> bool:
|
||||
"""Check if JSON matches use case format."""
|
||||
# Check top-level keys
|
||||
for key in useCase.detectionKeys:
|
||||
if key in json:
|
||||
return True
|
||||
|
||||
# Check nested path using simple dictionary traversal (no jsonpath_ng needed)
|
||||
if useCase.detectionPath:
|
||||
try:
|
||||
# Simple path matching without jsonpath_ng
|
||||
# Format: "documents[0].chapters" or "files[0].content"
|
||||
pathParts = useCase.detectionPath.split(".")
|
||||
current = json
|
||||
|
||||
for part in pathParts:
|
||||
# Handle array indices like "documents[0]"
|
||||
if "[" in part and "]" in part:
|
||||
key = part.split("[")[0]
|
||||
index = int(part.split("[")[1].split("]")[0])
|
||||
if isinstance(current, dict) and key in current:
|
||||
if isinstance(current[key], list) and 0 <= index < len(current[key]):
|
||||
current = current[key][index]
|
||||
else:
|
||||
return False
|
||||
else:
|
||||
return False
|
||||
else:
|
||||
# Regular key access
|
||||
if isinstance(current, dict) and part in current:
|
||||
current = current[part]
|
||||
else:
|
||||
return False
|
||||
|
||||
# If we successfully traversed the path, it matches
|
||||
return True
|
||||
except Exception as e:
|
||||
logger.debug(f"Path matching failed for {useCase.useCaseId}: {e}")
|
||||
|
||||
return False
|
||||
|
||||
def _registerDefaultUseCases(self):
|
||||
"""Register default use cases."""
|
||||
|
||||
# Use Case 1: Section Content Generation
|
||||
# Returns JSON with "elements" array directly
|
||||
self.register(LoopingUseCase(
|
||||
useCaseId="section_content",
|
||||
jsonTemplate={"elements": []},
|
||||
detectionKeys=["elements"],
|
||||
detectionPath="",
|
||||
initialPromptBuilder=None, # Will use default prompt builder
|
||||
continuationPromptBuilder=None, # Will use default continuation builder
|
||||
accumulator=None, # Direct return, no accumulation
|
||||
merger=None,
|
||||
continuationContextBuilder=None, # Will use default continuation context
|
||||
resultBuilder=None, # Return JSON directly
|
||||
finalResultHandler=_handleSectionContentFinalResult,
|
||||
jsonNormalizer=_normalizeSectionContentJson,
|
||||
supportsAccumulation=False,
|
||||
requiresExtraction=False
|
||||
))
|
||||
|
||||
# Use Case 2: Chapter Structure Generation
|
||||
# Returns JSON with "documents[0].chapters" structure
|
||||
self.register(LoopingUseCase(
|
||||
useCaseId="chapter_structure",
|
||||
jsonTemplate={"documents": [{"chapters": []}]},
|
||||
detectionKeys=["chapters"],
|
||||
detectionPath="documents[0].chapters",
|
||||
initialPromptBuilder=None,
|
||||
continuationPromptBuilder=None,
|
||||
accumulator=None, # Direct return, no accumulation
|
||||
merger=None,
|
||||
continuationContextBuilder=None,
|
||||
resultBuilder=None, # Return JSON directly
|
||||
finalResultHandler=_handleChapterStructureFinalResult,
|
||||
jsonNormalizer=_normalizeDefaultJson,
|
||||
supportsAccumulation=False,
|
||||
requiresExtraction=False
|
||||
))
|
||||
|
||||
# Use Case 3: Code Structure Generation
|
||||
self.register(LoopingUseCase(
|
||||
useCaseId="code_structure",
|
||||
jsonTemplate={
|
||||
"metadata": {
|
||||
"language": "",
|
||||
"projectType": "single_file|multi_file",
|
||||
"projectName": ""
|
||||
},
|
||||
"files": [
|
||||
{
|
||||
"id": "",
|
||||
"filename": "",
|
||||
"fileType": "",
|
||||
"dependencies": [],
|
||||
"imports": [],
|
||||
"functions": [],
|
||||
"classes": []
|
||||
}
|
||||
]
|
||||
},
|
||||
detectionKeys=["files"],
|
||||
detectionPath="files",
|
||||
initialPromptBuilder=None,
|
||||
continuationPromptBuilder=None,
|
||||
accumulator=None, # Direct return
|
||||
merger=None,
|
||||
continuationContextBuilder=None,
|
||||
resultBuilder=None,
|
||||
finalResultHandler=_handleCodeStructureFinalResult,
|
||||
jsonNormalizer=_normalizeDefaultJson,
|
||||
supportsAccumulation=False,
|
||||
requiresExtraction=False
|
||||
))
|
||||
|
||||
# Use Case 4: Code Content Generation (NEW)
|
||||
self.register(LoopingUseCase(
|
||||
useCaseId="code_content",
|
||||
jsonTemplate={"files": [{"content": "", "functions": []}]},
|
||||
detectionKeys=["content", "functions"],
|
||||
detectionPath="files[0].content",
|
||||
initialPromptBuilder=None,
|
||||
continuationPromptBuilder=None,
|
||||
accumulator=None, # Will use default accumulator
|
||||
merger=None, # Will use default merger
|
||||
continuationContextBuilder=None,
|
||||
resultBuilder=None, # Will use default result builder
|
||||
finalResultHandler=_handleCodeContentFinalResult,
|
||||
jsonNormalizer=_normalizeDefaultJson,
|
||||
supportsAccumulation=True,
|
||||
requiresExtraction=False
|
||||
))
|
||||
|
||||
logger.info(f"Registered {len(self.useCases)} default looping use cases")
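
The nested-path check in `_matchesFormat` can be sketched as a standalone function. This is a minimal sketch, assuming a hypothetical helper `pathExists` and a regex-based segment parser, neither of which is in the commit; it accepts the same `key[index].key` path syntax used by `detectionPath`.

```python
import re
from typing import Any

# One path segment: a key, optionally followed by a list index, e.g. "documents[0]"
_SEGMENT = re.compile(r"^(\w+)(?:\[(\d+)\])?$")


def pathExists(data: Any, path: str) -> bool:
    """Return True if every segment of path can be resolved inside data.

    Hypothetical helper for illustration; not part of the commit.
    """
    current = data
    for segment in path.split("."):
        match = _SEGMENT.match(segment)
        if match is None:
            return False  # empty or malformed segment
        key, index = match.group(1), match.group(2)
        if not isinstance(current, dict) or key not in current:
            return False
        current = current[key]
        if index is not None:
            if not isinstance(current, list) or int(index) >= len(current):
                return False
            current = current[int(index)]
    return True
```

Note that `detectUseCase` returns the first registered use case that matches, and top-level `detectionKeys` are checked before the path. With the registrations above, a payload with a top-level `"files"` key matches `code_structure` before `code_content` is considered.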
|
||||
|
||||
275
modules/serviceCenter/services/serviceAi/subResponseParsing.py
Normal file
|
|
@@ -0,0 +1,275 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Response Parsing Module
|
||||
|
||||
Handles parsing of AI responses, including:
|
||||
- Section extraction from responses
|
||||
- JSON completeness detection
|
||||
- Loop detection
|
||||
- Document metadata extraction
|
||||
- Final result building
|
||||
"""
|
||||
import json
|
||||
import logging
|
||||
from typing import Dict, Any, List, Optional, Tuple
|
||||
|
||||
from modules.shared.jsonUtils import extractJsonString, repairBrokenJson, extractSectionsFromDocument
|
||||
from .subJsonResponseHandling import JsonResponseHandler
|
||||
from modules.datamodels.datamodelAi import JsonAccumulationState
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ResponseParser:
|
||||
"""Handles parsing of AI responses and completion detection."""
|
||||
|
||||
def __init__(self, services):
|
||||
"""Initialize ResponseParser with service center access."""
|
||||
self.services = services
|
||||
|
||||
def extractSectionsFromResponse(
|
||||
self,
|
||||
result: str,
|
||||
iteration: int,
|
||||
debugPrefix: str,
|
||||
allSections: Optional[List[Dict[str, Any]]] = None,
|
||||
accumulationState: Optional[JsonAccumulationState] = None
|
||||
) -> Tuple[List[Dict[str, Any]], bool, Optional[Dict[str, Any]], Optional[JsonAccumulationState]]:
|
||||
"""
|
||||
Extract sections from AI response, handling both valid and broken JSON.
|
||||
|
||||
NEW BEHAVIOR:
|
||||
- First iteration: Check if complete, if not start accumulation
|
||||
- Subsequent iterations: Accumulate strings, parse when complete
|
||||
|
||||
Returns:
|
||||
Tuple of:
|
||||
- sections: Extracted sections
|
||||
- wasJsonComplete: True if JSON is complete
|
||||
- parsedResult: Parsed JSON object
|
||||
- updatedAccumulationState: Updated accumulation state (None if not in accumulation mode)
|
||||
"""
|
||||
if allSections is None:
|
||||
allSections = []
|
||||
|
||||
if iteration == 1:
|
||||
# First iteration - check if complete
|
||||
parsed = None
|
||||
try:
|
||||
extracted = extractJsonString(result)
|
||||
parsed = json.loads(extracted)
|
||||
|
||||
# Check completeness
|
||||
if JsonResponseHandler.isJsonComplete(parsed):
|
||||
# Complete JSON - no accumulation needed
|
||||
sections = extractSectionsFromDocument(parsed)
|
||||
logger.info("Iteration 1: Complete JSON detected, no accumulation needed")
|
||||
return sections, True, parsed, None # No accumulation
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Incomplete - try to extract partial sections from broken JSON
|
||||
logger.info("Iteration 1: Incomplete JSON detected, attempting to extract partial sections")
|
||||
|
||||
partialSections = []
|
||||
if parsed:
|
||||
# Try to extract sections from parsed (even if incomplete)
|
||||
partialSections = extractSectionsFromDocument(parsed)
|
||||
else:
|
||||
# Try to repair broken JSON and extract sections
|
||||
try:
|
||||
repaired = repairBrokenJson(result)
|
||||
if repaired:
|
||||
partialSections = extractSectionsFromDocument(repaired)
|
||||
parsed = repaired # Use repaired version for accumulation state
|
||||
except Exception:
|
||||
pass # If repair fails, continue with empty sections
|
||||
|
||||
|
||||
# Define KPIs (async call - need to handle this)
|
||||
# For now, create accumulation state without KPIs, will be updated after async call
|
||||
accumulationState = JsonAccumulationState(
|
||||
accumulatedJsonString=result,
|
||||
isAccumulationMode=True,
|
||||
lastParsedResult=parsed,
|
||||
allSections=partialSections,
|
||||
kpis=[]
|
||||
)
|
||||
|
||||
# Note: KPI definition will be done in the caller (async context)
|
||||
return partialSections, False, parsed, accumulationState
|
||||
|
||||
else:
|
||||
# Subsequent iterations - accumulate
|
||||
if accumulationState and accumulationState.isAccumulationMode:
|
||||
accumulated, sections, isComplete, parsedResult = \
|
||||
JsonResponseHandler.accumulateAndParseJsonFragments(
|
||||
accumulationState.accumulatedJsonString,
|
||||
result,
|
||||
allSections,
|
||||
iteration
|
||||
)
|
||||
|
||||
# Update accumulation state
|
||||
accumulationState.accumulatedJsonString = accumulated
|
||||
accumulationState.lastParsedResult = parsedResult
|
||||
accumulationState.allSections = allSections + sections if sections else allSections
|
||||
accumulationState.isAccumulationMode = not isComplete
|
||||
|
||||
# Log accumulated JSON for debugging
|
||||
if parsedResult:
|
||||
accumulated_json_str = json.dumps(parsedResult, indent=2, ensure_ascii=False)
|
||||
self.services.utils.writeDebugFile(accumulated_json_str, f"{debugPrefix}_accumulated_json_iteration_{iteration}.json")
|
||||
|
||||
return sections, isComplete, parsedResult, accumulationState
|
||||
else:
|
||||
# No accumulation mode - process normally (shouldn't happen)
|
||||
logger.warning(f"Iteration {iteration}: No accumulation state but iteration > 1")
|
||||
return [], False, None, None
|
||||
|
||||
def shouldContinueGeneration(
|
||||
self,
|
||||
allSections: List[Dict[str, Any]],
|
||||
iteration: int,
|
||||
wasJsonComplete: bool,
|
||||
rawResponse: Optional[str] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Determine if AI generation loop should continue.
|
||||
|
||||
CRITICAL: This is ONLY about AI Loop Completion, NOT Action DoD!
|
||||
Action DoD is checked AFTER the AI Loop completes in _refineDecide.
|
||||
|
||||
Simple logic:
|
||||
- If JSON parsing failed or incomplete → continue (needs more content)
|
||||
- If JSON parses successfully and is complete → stop (all content delivered)
|
||||
- Loop detection prevents infinite loops
|
||||
|
||||
CRITICAL: JSON completeness is determined by parsing, NOT by last character check!
|
||||
Returns True if we should continue, False if AI Loop is done.
|
||||
"""
|
||||
if len(allSections) == 0:
|
||||
return True # No sections yet, continue
|
||||
|
||||
# CRITERION 1: If JSON was incomplete/broken (parsing failed or incomplete) - continue to repair/complete
|
||||
if not wasJsonComplete:
|
||||
logger.info(f"Iteration {iteration}: JSON incomplete/broken - continuing to complete")
|
||||
return True
|
||||
|
||||
# CRITERION 2: JSON is complete (parsed successfully) - check for loop detection
|
||||
if self._isStuckInLoop(allSections, iteration):
|
||||
logger.warning(f"Iteration {iteration}: Detected potential infinite loop - stopping AI loop")
|
||||
return False
|
||||
|
||||
# JSON is complete and not stuck in loop - done
|
||||
logger.info(f"Iteration {iteration}: JSON complete - AI loop done")
|
||||
return False
|
||||
|
||||
def _isStuckInLoop(
|
||||
self,
|
||||
allSections: List[Dict[str, Any]],
|
||||
iteration: int
|
||||
) -> bool:
|
||||
"""
|
||||
Detect if we're stuck in a loop (same content being repeated).
|
||||
|
||||
Generic approach: Check if recent iterations are adding minimal or duplicate content.
|
||||
"""
|
||||
if iteration < 3:
|
||||
return False # Need at least 3 iterations to detect a loop
|
||||
|
||||
if len(allSections) == 0:
|
||||
return False
|
||||
|
||||
# Check if last section is very small (might be stuck)
|
||||
lastSection = allSections[-1]
|
||||
elements = lastSection.get("elements", [])
|
||||
|
||||
if isinstance(elements, list) and elements:
|
||||
lastElem = elements[-1]
|
||||
else:
|
||||
lastElem = elements if isinstance(elements, dict) else {}
|
||||
|
||||
# Check content size of last section
|
||||
lastSectionSize = 0
|
||||
if isinstance(lastElem, dict):
|
||||
for key, value in lastElem.items():
|
||||
if isinstance(value, str):
|
||||
lastSectionSize += len(value)
|
||||
elif isinstance(value, list):
|
||||
lastSectionSize += len(str(value))
|
||||
|
||||
# If last section is very small and we've done many iterations, might be stuck
|
||||
if lastSectionSize < 100 and iteration > 10:
|
||||
logger.warning(f"Potential loop detected: iteration {iteration}, last section size {lastSectionSize}")
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def extractDocumentMetadata(
|
||||
self,
|
||||
parsedResult: Dict[str, Any]
|
||||
) -> Optional[Dict[str, Any]]:
|
||||
"""
|
||||
Extract document metadata (title, filename) from parsed AI response.
|
||||
Returns dict with 'title' and 'filename' keys if found, None otherwise.
|
||||
"""
|
||||
if not isinstance(parsedResult, dict):
|
||||
return None
|
||||
|
||||
# Try to get from documents array (preferred structure)
|
||||
if "documents" in parsedResult and isinstance(parsedResult["documents"], list) and len(parsedResult["documents"]) > 0:
|
||||
firstDoc = parsedResult["documents"][0]
|
||||
if isinstance(firstDoc, dict):
|
||||
title = firstDoc.get("title")
|
||||
filename = firstDoc.get("filename")
|
||||
if title or filename:
|
||||
return {
|
||||
"title": title,
|
||||
"filename": filename
|
||||
}
|
||||
|
||||
return None
|
||||
|
||||
def buildFinalResultFromSections(
|
||||
self,
|
||||
allSections: List[Dict[str, Any]],
|
||||
documentMetadata: Optional[Dict[str, Any]] = None
|
||||
) -> str:
|
||||
"""
|
||||
Build final JSON result from accumulated sections.
|
||||
Uses AI-provided metadata (title, filename) if available.
|
||||
"""
|
||||
if not allSections:
|
||||
return ""
|
||||
|
||||
# Extract metadata from AI response if available
|
||||
title = "Generated Document"
|
||||
filename = "document.json"
|
||||
if documentMetadata:
|
||||
if documentMetadata.get("title"):
|
||||
title = documentMetadata["title"]
|
||||
if documentMetadata.get("filename"):
|
||||
filename = documentMetadata["filename"]
|
||||
|
||||
# Build documents structure
|
||||
# Assuming single document for now
|
||||
documents = [{
|
||||
"id": "doc_1",
|
||||
"title": title,
|
||||
"filename": filename,
|
||||
"sections": allSections
|
||||
}]
|
||||
|
||||
result = {
|
||||
"metadata": {
|
||||
"split_strategy": "single_document",
|
||||
"source_documents": [],
|
||||
"extraction_method": "ai_generation"
|
||||
},
|
||||
"documents": documents
|
||||
}
|
||||
|
||||
return json.dumps(result, indent=2)
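
The accumulation mode in `extractSectionsFromResponse` keeps appending response fragments until the combined string parses as JSON. The real logic lives in `JsonResponseHandler.accumulateAndParseJsonFragments`, whose diff is suppressed here, so the following is only a minimal sketch of the general idea, assuming plain string concatenation and a hypothetical function name `accumulateUntilParsable`:

```python
import json
from typing import Any, List, Optional, Tuple


def accumulateUntilParsable(fragments: List[str]) -> Tuple[str, Optional[Any], int]:
    """Concatenate fragments in order until the accumulated string is valid JSON.

    Returns (accumulated string, parsed value or None, number of fragments consumed).
    Hypothetical helper for illustration; the commit's handler may work differently.
    """
    accumulated = ""
    for count, fragment in enumerate(fragments, 1):
        accumulated += fragment
        try:
            # Completeness is decided by parsing, not by inspecting the last character
            return accumulated, json.loads(accumulated), count
        except json.JSONDecodeError:
            continue  # still incomplete, wait for the next fragment
    return accumulated, None, len(fragments)
```

This mirrors the rule stated in `shouldContinueGeneration`: the loop continues while parsing fails and stops once the accumulated JSON parses.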
|
||||
|
||||
2593
modules/serviceCenter/services/serviceAi/subStructureFilling.py
Normal file
File diff suppressed because it is too large
|
|
@@ -0,0 +1,508 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Structure Generation Module
|
||||
|
||||
Handles document structure generation, including:
|
||||
- Generating document structure with sections
|
||||
- Building structure prompts
|
||||
"""
|
||||
import json
|
||||
import logging
|
||||
from typing import Dict, Any, List, Optional
|
||||
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum, PriorityEnum, ProcessingModeEnum
|
||||
from modules.workflows.processing.shared.stateTools import checkWorkflowStopped
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class StructureGenerator:
|
||||
"""Handles document structure generation."""
|
||||
|
||||
def __init__(self, services, aiService):
|
||||
"""Initialize StructureGenerator with service center and AI service access."""
|
||||
self.services = services
|
||||
self.aiService = aiService
|
||||
|
||||
def _getUserLanguage(self) -> str:
|
||||
"""Get user language for document generation"""
|
||||
try:
|
||||
if self.services:
|
||||
# Prefer detected language if available (from user intention analysis)
|
||||
if hasattr(self.services, 'currentUserLanguage') and self.services.currentUserLanguage:
|
||||
return self.services.currentUserLanguage
|
||||
# Fallback to user's preferred language
|
||||
elif hasattr(self.services, 'user') and self.services.user and hasattr(self.services.user, 'language'):
|
||||
return self.services.user.language
|
||||
except Exception:
|
||||
pass
|
||||
return 'en' # Default fallback
|
||||
|
||||
async def generateStructure(
|
||||
self,
|
||||
userPrompt: str,
|
||||
contentParts: List[ContentPart],
|
||||
outputFormat: Optional[str] = None,
|
||||
parentOperationId: Optional[str] = None
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Phase 5C: Generates the chapter structure (table of contents).
|
||||
Defines for each chapter:
|
||||
- Level, Title
|
||||
- contentParts (unified object with instruction and/or caption per part)
|
||||
- generationHint
|
||||
|
||||
Generate document structure with per-document format determination.
|
||||
Multiple documents can be produced with different formats (e.g., one PDF, one HTML).
|
||||
AI determines formats per-document from user prompt. The outputFormat parameter is
|
||||
only a validation fallback - used if AI doesn't return format per document.
|
||||
|
||||
Args:
|
||||
userPrompt: User request
|
||||
contentParts: All prepared ContentParts with metadata
|
||||
outputFormat: Optional global format fallback. If omitted, formats are determined
|
||||
from user prompt by AI. Used as validation fallback if AI doesn't
|
||||
return format per document. Defaults to "txt" if not provided.
|
||||
parentOperationId: Parent operation ID for the ChatLog hierarchy
|
||||
|
||||
Returns:
|
||||
Structure dict with documents and chapters (not sections!)
|
||||
"""
|
||||
# If outputFormat not provided, use "txt" as fallback for validation
|
||||
# AI will determine formats per document from user prompt
|
||||
if not outputFormat:
|
||||
outputFormat = "txt"
|
||||
logger.debug("outputFormat not provided - using 'txt' as validation fallback, formats determined from prompt")
|
||||
# Create operation ID for structure generation
|
||||
structureOperationId = f"{parentOperationId}_structure_generation"
|
||||
|
||||
# Start ChatLog with parent reference
|
||||
formatDisplay = outputFormat if outputFormat else "auto-determined"
|
||||
self.services.chat.progressLogStart(
|
||||
structureOperationId,
|
||||
"Chapter Structure Generation",
|
||||
"Structure",
|
||||
f"Generating chapter structure (format: {formatDisplay})",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
try:
|
||||
# Build chapter structure prompt with content index
|
||||
structurePrompt, templateStructure = self._buildChapterStructurePrompt(
|
||||
userPrompt=userPrompt,
|
||||
contentParts=contentParts,
|
||||
outputFormat=outputFormat
|
||||
)
|
||||
|
||||
# AI call for chapter structure generation with looping support
|
||||
# Use callAiWithLooping instead of callAiPlanning to support continuation if the response is cut off
|
||||
options = AiCallOptions(
|
||||
operationType=OperationTypeEnum.DATA_GENERATE,
|
||||
priority=PriorityEnum.QUALITY,
|
||||
processingMode=ProcessingModeEnum.DETAILED,
|
||||
compressPrompt=False,
|
||||
compressContext=False,
|
||||
resultFormat="json"
|
||||
)
|
||||
|
||||
structurePrompt, templateStructure = self._buildChapterStructurePrompt(
|
||||
userPrompt=userPrompt,
|
||||
contentParts=contentParts,
|
||||
outputFormat=outputFormat
|
||||
)
|
||||
|
||||
# Create prompt builder for continuation support
|
||||
async def buildChapterStructurePromptWithContinuation(
|
||||
continuationContext: Any,
|
||||
templateStructure: str,
|
||||
basePrompt: str
|
||||
) -> str:
|
||||
"""Build chapter structure prompt with continuation context. Uses unified signature.
|
||||
|
||||
Note: All initial context (userPrompt, contentParts, outputFormat, etc.) is already
|
||||
contained in basePrompt. This function only adds continuation-specific instructions.
|
||||
"""
|
||||
# Extract continuation context fields (only what's needed for continuation)
|
||||
incompletePart = continuationContext.incomplete_part
|
||||
lastRawJson = continuationContext.last_raw_json
|
||||
|
||||
# Generate both overlap context and hierarchy context using jsonContinuation
|
||||
overlapContext = ""
|
||||
unifiedContext = ""
|
||||
if lastRawJson:
|
||||
# Get contexts directly from jsonContinuation
|
||||
from modules.shared.jsonContinuation import getContexts
|
||||
contexts = getContexts(lastRawJson)
|
||||
overlapContext = contexts.overlapContext
|
||||
unifiedContext = contexts.hierarchyContextForPrompt
|
||||
elif incompletePart:
|
||||
unifiedContext = incompletePart
|
||||
else:
|
||||
unifiedContext = "Unable to extract context - response was completely broken"
|
||||
|
||||
# Build unified continuation prompt format
|
||||
continuationPrompt = f"""{basePrompt}
|
||||
|
||||
--- CONTINUATION REQUEST ---
|
||||
The previous JSON response was incomplete. Continue from where it stopped.
|
||||
|
||||
Context showing structure hierarchy with cut point:
|
||||
```
|
||||
{unifiedContext}
|
||||
```
|
||||
|
||||
Overlap Requirement:
|
||||
To ensure proper merging, your response MUST start EXACTLY with the overlap context shown below, then continue with new content.
|
||||
|
||||
Overlap context (start your response with this exact text):
|
||||
```json
|
||||
{overlapContext if overlapContext else "No overlap context available"}
|
||||
```
|
||||
|
||||
TASK:
|
||||
1. Start your response EXACTLY with the overlap context shown above (character by character)
|
||||
2. Continue seamlessly from where the overlap context ends
|
||||
3. Complete the remaining content following the JSON structure template above
|
||||
4. Return ONLY valid JSON following the structure template - no overlap/continuation wrapper objects
|
||||
|
||||
CRITICAL:
|
||||
- Your response MUST begin with the exact overlap context text (this enables automatic merging)
|
||||
- Continue seamlessly after the overlap context with new content
|
||||
- Your response must be valid JSON matching the structure template above"""
|
||||
return continuationPrompt
|
||||
|
||||
# Call AI with looping support
|
||||
# NOTE: Do NOT pass contentParts here - we only need metadata for structure generation
|
||||
# The contentParts metadata is already included in the prompt (contentPartsIndex)
|
||||
# Actual content extraction happens later during section generation
|
||||
checkWorkflowStopped(self.services)
|
||||
aiResponseJson = await self.aiService.callAiWithLooping(
|
||||
prompt=structurePrompt,
|
||||
options=options,
|
||||
debugPrefix="chapter_structure_generation",
|
||||
promptBuilder=buildChapterStructurePromptWithContinuation,
|
||||
promptArgs={
|
||||
"userPrompt": userPrompt,
|
||||
"outputFormat": outputFormat,
|
||||
"templateStructure": templateStructure,
|
||||
"basePrompt": structurePrompt
|
||||
},
|
||||
useCaseId="chapter_structure", # REQUIRED: Explicit use case ID
|
||||
operationId=structureOperationId,
|
||||
userPrompt=userPrompt,
|
||||
contentParts=None # Do not pass ContentParts - only metadata needed, not content extraction
|
||||
)
|
||||
|
||||
# Parse the complete JSON response (looping system already handles completion)
|
||||
extractedJson = self.services.utils.jsonExtractString(aiResponseJson)
|
||||
parsedJson, parseError, cleanedJson = self.services.utils.jsonTryParse(extractedJson)
|
||||
|
||||
if parseError is not None:
|
||||
# Even with looping, try repair as fallback
|
||||
logger.warning(f"JSON parsing failed after looping: {str(parseError)}. Attempting repair...")
|
||||
from modules.shared import jsonUtils
|
||||
repairedJson = jsonUtils.repairBrokenJson(extractedJson)
|
||||
if repairedJson:
|
||||
parsedJson, parseError, _ = self.services.utils.jsonTryParse(json.dumps(repairedJson))
|
||||
if parseError is None:
|
||||
logger.info("Successfully repaired and parsed JSON structure after looping")
|
||||
structure = parsedJson
|
||||
else:
|
||||
logger.error(f"Failed to parse repaired JSON: {str(parseError)}")
|
||||
raise ValueError(f"Failed to parse JSON structure after repair: {str(parseError)}")
|
||||
else:
|
||||
logger.error(f"Failed to repair JSON. Parse error: {str(parseError)}")
|
||||
logger.error(f"Cleaned JSON preview (first 500 chars): {cleanedJson[:500]}")
|
||||
raise ValueError(f"Failed to parse JSON structure: {str(parseError)}")
|
||||
else:
|
||||
structure = parsedJson
|
||||
|
||||
# State 3 Validation: Validate and auto-fix structure
|
||||
# Validation 3.1: Structure missing 'documents' field
|
||||
if "documents" not in structure:
|
||||
raise ValueError("Structure missing 'documents' field - cannot auto-fix")
|
||||
|
||||
documents = structure["documents"]
|
||||
|
||||
# Validation 3.2: Structure has no documents
|
||||
if not isinstance(documents, list) or len(documents) == 0:
|
||||
raise ValueError("Structure has no documents - cannot generate without documents")
|
||||
|
||||
# Import renderer registry for format validation (existing infrastructure)
|
||||
from modules.serviceCenter.services.serviceGeneration.renderers.registry import getRenderer
|
||||
|
||||
# Validate and fix each document
|
||||
for doc in documents:
|
||||
# Validation 3.3 & 3.4: Document outputFormat
|
||||
# outputFormat parameter is optional - if omitted, formats determined from prompt by AI
|
||||
# Use as fallback only if AI doesn't return format per document
|
||||
# Multiple documents can have different formats (e.g., one PDF, one HTML)
|
||||
globalFormatFallback = outputFormat or "txt" # Fallback for validation
|
||||
|
||||
if "outputFormat" not in doc or not doc["outputFormat"]:
|
||||
# AI didn't return format or returned empty - use global fallback
|
||||
doc["outputFormat"] = globalFormatFallback
|
||||
logger.warning(f"Document {doc.get('id')} missing outputFormat - using fallback: {doc['outputFormat']}")
|
||||
else:
|
||||
# AI returned format - validate using existing renderer registry
|
||||
formatName = str(doc["outputFormat"]).lower().strip()
|
||||
renderer = getRenderer(formatName) # Uses existing infrastructure
|
||||
|
||||
if not renderer:
|
||||
# Format doesn't match any renderer - use txt (simple approach)
|
||||
logger.warning(f"Document {doc.get('id')} has format without renderer: {formatName}, using 'txt'")
|
||||
doc["outputFormat"] = "txt"
|
||||
else:
|
||||
# Valid format with renderer - normalize and keep AI result
|
||||
doc["outputFormat"] = formatName
|
||||
logger.debug(f"Document {doc.get('id')} using AI-determined format: {formatName}")
|
||||
|
||||
# Validation 3.5 & 3.6: Document language
|
||||
# Use validated currentUserLanguage (always valid, validated during user intention analysis)
|
||||
# Access via _getUserLanguage() which uses self.services.currentUserLanguage
|
||||
userPromptLanguage = self._getUserLanguage() # Uses validated currentUserLanguage infrastructure
|
||||
|
||||
if "language" not in doc or not isinstance(doc["language"], str) or len(doc["language"]) != 2:
|
||||
# AI didn't return language or invalid format - use validated currentUserLanguage
|
||||
originalLanguage = doc.get("language")  # Capture the AI value before it is overwritten
|
||||
doc["language"] = userPromptLanguage
|
||||
if originalLanguage is None:
|
||||
logger.warning(f"Document {doc.get('id')} missing language - using currentUserLanguage: {userPromptLanguage}")
|
||||
else:
|
||||
logger.warning(f"Document {doc.get('id')} has invalid language format from AI: {originalLanguage}, using currentUserLanguage")
|
||||
else:
|
||||
# AI returned valid language format - normalize
|
||||
doc["language"] = doc["language"].lower().strip()[:2]
|
||||
logger.debug(f"Document {doc.get('id')} using AI-determined language: {doc['language']}")
|
||||
|
||||
# Validation 3.7: Document missing 'chapters' field
|
||||
if "chapters" not in doc:
|
||||
raise ValueError(f"Document {doc.get('id')} missing 'chapters' field - cannot auto-fix")
|
||||
|
||||
# Validation 3.8: Chapter missing 'contentParts' field
|
||||
for chapter in doc["chapters"]:
|
||||
if "contentParts" not in chapter:
|
||||
raise ValueError(f"Chapter {chapter.get('id')} missing 'contentParts' field - cannot auto-fix")
|
||||
|
||||
# Finish ChatLog
|
||||
self.services.chat.progressLogFinish(structureOperationId, True)
|
||||
|
||||
return structure
|
||||
|
||||
except Exception as e:
|
||||
self.services.chat.progressLogFinish(structureOperationId, False)
|
||||
logger.error(f"Error in generateStructure: {str(e)}")
|
||||
raise
|
||||
|
||||
def _buildChapterStructurePrompt(
|
||||
self,
|
||||
userPrompt: str,
|
||||
contentParts: List[ContentPart],
|
||||
outputFormat: str
|
||||
) -> tuple[str, str]:
|
||||
"""Build the prompt for chapter structure generation. Returns (prompt, templateStructure)."""
|
||||
# Build ContentParts index - filter out empty parts
|
||||
contentPartsIndex = ""
|
||||
validParts = []
|
||||
filteredParts = []
|
||||
|
||||
for part in contentParts:
|
||||
contentFormat = part.metadata.get("contentFormat", "unknown")
|
||||
|
||||
# IMPORTANT: Reference parts intentionally have empty data - always include them
|
||||
if contentFormat == "reference":
|
||||
validParts.append(part)
|
||||
logger.debug(f"Including reference ContentPart {part.id} (intentionally empty data)")
|
||||
continue
|
||||
|
||||
# Skip empty parts (no data, or containers without content)
|
||||
# NOTE: Reference parts were already handled above
|
||||
if not part.data or (isinstance(part.data, str) and len(part.data.strip()) == 0):
|
||||
# Skip container parts without data
|
||||
if part.typeGroup == "container" and not part.data:
|
||||
filteredParts.append((part.id, "container without data"))
|
||||
continue
|
||||
# Skip other empty parts (reference parts were already handled above)
|
||||
if not part.data or (isinstance(part.data, str) and not part.data.strip()):
|
||||
filteredParts.append((part.id, f"no data (format: {contentFormat})"))
|
||||
continue
|
||||
|
||||
validParts.append(part)
|
||||
logger.debug(f"Including ContentPart {part.id}: format={contentFormat}, type={part.typeGroup}, dataLength={len(str(part.data)) if part.data else 0}")
|
||||
|
||||
if filteredParts:
|
||||
logger.debug(f"Filtered out {len(filteredParts)} empty ContentParts: {filteredParts}")
|
||||
|
||||
logger.info(f"Building structure prompt with {len(validParts)} valid ContentParts (from {len(contentParts)} total)")
|
||||
|
||||
# Build index for valid parts only
|
||||
for i, part in enumerate(validParts, 1):
|
||||
contentFormat = part.metadata.get("contentFormat", "unknown")
|
||||
originalFileName = part.metadata.get('originalFileName', 'N/A')
|
||||
|
||||
contentPartsIndex += f"\n{i}. ContentPart ID: {part.id}\n"
|
||||
contentPartsIndex += f" Format: {contentFormat}\n"
|
||||
contentPartsIndex += f" Type: {part.typeGroup}\n"
|
||||
contentPartsIndex += f" MIME Type: {part.mimeType or 'N/A'}\n"
|
||||
contentPartsIndex += f" Source: {part.metadata.get('documentId', 'unknown')}\n"
|
||||
contentPartsIndex += f" Original file name: {originalFileName}\n"
|
||||
contentPartsIndex += f" Usage hint: {part.metadata.get('usageHint', 'N/A')}\n"
|
||||
|
||||
if not contentPartsIndex:
|
||||
contentPartsIndex = "\n(No content parts available)"
|
||||
|
||||
# Get language from services (user intention analysis)
|
||||
language = self._getUserLanguage()
|
||||
logger.debug(f"Using language from services (user intention analysis) for structure generation: {language}")
|
||||
|
||||
# Create template structure explicitly (not extracted from prompt)
|
||||
# This ensures exact identity between initial and continuation prompts
|
||||
templateStructure = f"""{{
|
||||
"metadata": {{
|
||||
"title": "Document Title",
|
||||
"language": "{language}"
|
||||
}},
|
||||
"documents": [{{
|
||||
"id": "doc_1",
|
||||
"title": "Document Title",
|
||||
"filename": "document.{outputFormat}",
|
||||
"outputFormat": "{outputFormat}",
|
||||
"language": "{language}",
|
||||
"chapters": [
|
||||
{{
|
||||
"id": "chapter_1",
|
||||
"level": 1,
|
||||
"title": "Chapter Title",
|
||||
"contentParts": {{
|
||||
"extracted_part_id": {{
|
||||
"instruction": "Use extracted content with ALL relevant details from user request"
|
||||
}}
|
||||
}},
|
||||
"generationHint": "Detailed description including ALL relevant details from user request for this chapter",
|
||||
"sections": []
|
||||
}}
|
||||
]
|
||||
}}]
|
||||
}}"""
|
||||
|
||||
prompt = f"""# TASK: Plan Document Structure (Documents + Chapters)
|
||||
|
||||
This is a STRUCTURE PLANNING task. You define which documents to create and which chapters each document will have.
|
||||
Chapter CONTENT will be generated in a later step - here you only plan the STRUCTURE and assign content references.
|
||||
Return EXACTLY ONE complete JSON object. Do not generate multiple JSON objects, alternatives, or variations. Do not use separators like "---" between JSON objects.
|
||||
|
||||
## USER REQUEST (for context)
|
||||
```
|
||||
{userPrompt}
|
||||
```
|
||||
|
||||
## AVAILABLE CONTENT PARTS
|
||||
{contentPartsIndex}
|
||||
|
||||
## CONTENT ASSIGNMENT RULE
|
||||
|
||||
CRITICAL: Every chapter MUST have contentParts assigned if it relates to documents/images/data from the user request.
|
||||
If the user request mentions documents/images/data, then EVERY chapter that generates content related to those references MUST assign the relevant ContentParts explicitly.
|
||||
|
||||
Assignment logic:
|
||||
- If chapter DISPLAYS a document/image → assign "object" format ContentPart with "caption"
|
||||
- If chapter generates text content ABOUT a document/image/data → assign ContentPart with "instruction":
|
||||
- Prefer "extracted" format if available (contains analyzed/extracted content)
|
||||
- If only "object" format is available, use "object" format with "instruction" (to write ABOUT the image/document)
|
||||
- If chapter's generationHint or purpose relates to a document/image/data mentioned in user request → it MUST have ContentParts assigned
|
||||
- Multiple chapters might assign the same ContentPart (e.g., one chapter displays image, another writes about it)
|
||||
- Use ContentPart IDs exactly as listed in AVAILABLE CONTENT PARTS above
|
||||
- Empty contentParts are only allowed if chapter generates content WITHOUT referencing any documents/images/data from the user request
|
||||
|
||||
CRITICAL RULE: If the user request mentions BOTH:
|
||||
a) Documents/images/data (listed in AVAILABLE CONTENT PARTS above), AND
|
||||
b) Generic content types (article text, main content, body text, etc.)
|
||||
Then chapters that generate those generic content types MUST assign the relevant ContentParts, because the content should relate to or be based on the provided documents/images/data.
|
||||
|
||||
## CONTENT EFFICIENCY PRINCIPLES
|
||||
- Generate COMPACT content: Focus on essential information only
|
||||
- AVOID verbose, lengthy, or repetitive text - be concise and direct
|
||||
- Prioritize FACTS over filler text - no introductions like "In this chapter..."
|
||||
- Minimize system resources: shorter content = faster processing
|
||||
- Quality over quantity: precise, meaningful content rather than padding
|
||||
|
||||
## CHAPTER STRUCTURE REQUIREMENTS
|
||||
- Generate chapters based on USER REQUEST - analyze what structure the user wants
|
||||
- Create ONLY the minimum chapters needed to cover the user's request - avoid over-structuring
|
||||
- IMPORTANT: Each chapter MUST have ALL these fields:
|
||||
- id: Unique identifier (e.g., "chapter_1")
|
||||
- level: Heading level (1, 2, 3, etc.)
|
||||
- title: Chapter title
|
||||
- contentParts: Object mapping ContentPart IDs to usage instructions (MUST assign if chapter relates to documents/data from user request)
|
||||
- generationHint: Description of what content to generate (including formatting/styling requirements)
|
||||
- sections: Empty array [] (REQUIRED - sections are generated in next phase)
|
||||
- contentParts: {{"partId": {{"instruction": "..."}} or {{"caption": "..."}} or both}} - Assign ContentParts as required by CONTENT ASSIGNMENT RULE above
|
||||
- The "instruction" field for each ContentPart MUST contain ALL relevant details from the USER REQUEST that apply to content extraction for this specific chapter. Include all formatting rules, data requirements, constraints, and specifications mentioned in the user request that are relevant for processing this ContentPart in this chapter.
|
||||
- generationHint: Keep CONCISE but include relevant details from the USER REQUEST. Focus on WHAT to generate, not HOW to phrase it verbosely.
|
||||
- The number of chapters depends on the user request - create only what is requested. Do NOT create chapters for topics without available data.
|
||||
|
||||
CRITICAL: Only create chapters for CONTENT sections, not for formatting/styling requirements. Include formatting/styling requirements in each generationHint if needed.
|
||||
|
||||
## DOCUMENT STRUCTURE
|
||||
|
||||
For each document, determine:
|
||||
- outputFormat: From USER REQUEST (explicit mention or infer from purpose/content type). Default: "{outputFormat}". Multiple documents can have different formats.
|
||||
- language: From USER REQUEST (map to ISO 639-1: de, en, fr, it...). Default: "{language}". Multiple documents can have different languages.
|
||||
- chapters: Structure appropriately for the format (e.g., pptx=slides, docx=sections, xlsx=worksheets). Match format capabilities and constraints.
|
||||
|
||||
Required JSON fields:
|
||||
- metadata: {{"title": "...", "language": "..."}}
|
||||
- documents: Array with id, title, filename, outputFormat, language, chapters[]
|
||||
- chapters: Array with id, level, title, contentParts, generationHint, sections[]
|
||||
|
||||
EXAMPLE STRUCTURE (for reference only - adapt to user request):
|
||||
{{
|
||||
"metadata": {{
|
||||
"title": "Document Title",
|
||||
"language": "{language}"
|
||||
}},
|
||||
"documents": [{{
|
||||
"id": "doc_1",
|
||||
"title": "Document Title",
|
||||
"filename": "document.{outputFormat}",
|
||||
"outputFormat": "{outputFormat}",
|
||||
"language": "{language}",
|
||||
"chapters": [
|
||||
{{
|
||||
"id": "chapter_1",
|
||||
"level": 1,
|
||||
"title": "Chapter Title",
|
||||
"contentParts": {{
|
||||
"extracted_part_id": {{
|
||||
"instruction": "Use extracted content with ALL relevant details from user request"
|
||||
}}
|
||||
}},
|
||||
"generationHint": "Detailed description including ALL relevant details from user request for this chapter",
|
||||
"sections": []
|
||||
}}
|
||||
]
|
||||
}}]
|
||||
}}
|
||||
|
||||
CRITICAL INSTRUCTIONS:
|
||||
- Generate chapters based on USER REQUEST, NOT based on the example above
|
||||
- The example shows the JSON structure format, NOT the required chapters
|
||||
- Create only the chapters that match the user's request
|
||||
- Adapt chapter titles and structure to match the user's specific request
|
||||
- Determine outputFormat and language for each document by analyzing the USER REQUEST above
|
||||
- The example shows placeholders "{outputFormat}" and "{language}" - YOU MUST REPLACE THESE with actual values determined from the USER REQUEST
|
||||
|
||||
MANDATORY CONTENT ASSIGNMENT CHECK:
|
||||
For each chapter, verify:
|
||||
1. Does the user request mention documents/images/data? (e.g., "photo", "image", "document", "data", "based on", "about")
|
||||
2. Does this chapter's generationHint, title, or purpose relate to those documents/images/data mentioned in step 1?
|
||||
- Examples: "article about the photo", "text describing the image", "analysis of the document", "content based on the data"
|
||||
- Even if chapter doesn't explicitly say "about the image", if user request mentions both the image AND this chapter's content type → relate them
|
||||
3. If YES to both → chapter MUST have contentParts assigned (cannot be empty {{}})
|
||||
4. If ContentPart is "object" format and chapter needs to write ABOUT it → assign with "instruction" field, not just "caption"
|
||||
|
||||
OUTPUT FORMAT: Start with {{ and end with }}. Do NOT use markdown code fences (```json). Do NOT add explanatory text before or after the JSON. Return ONLY the JSON object itself.
|
||||
"""
|
||||
return prompt, templateStructure
|
||||
|
||||
|
|
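The per-document checks in `generateStructure` (validations 3.3 to 3.6) amount to a small normalization step: fall back when a value is missing, normalize when it is valid, and replace it when no renderer matches. The sketch below illustrates that logic in isolation. It is not the module's code: `normalizeDocument` and `KNOWN_FORMATS` are hypothetical names, and the format set merely stands in for the `getRenderer` lookup, whose actual contents are not shown in this diff.

```python
from typing import Any, Dict

# Hypothetical stand-in for the renderer registry; the real format list is an assumption.
KNOWN_FORMATS = {"txt", "pdf", "html", "docx"}


def normalizeDocument(
    doc: Dict[str, Any],
    fallbackFormat: str = "txt",
    fallbackLanguage: str = "en",
) -> Dict[str, Any]:
    """Return a copy of doc with outputFormat and language normalized or replaced."""
    fixed = dict(doc)

    # Format: missing or empty -> global fallback; unknown -> "txt"; else normalized
    rawFormat = fixed.get("outputFormat")
    if not rawFormat:
        fixed["outputFormat"] = fallbackFormat
    else:
        formatName = str(rawFormat).lower().strip()
        fixed["outputFormat"] = formatName if formatName in KNOWN_FORMATS else "txt"

    # Language: must be a two-character string, otherwise use the fallback language
    rawLanguage = fixed.get("language")
    if isinstance(rawLanguage, str) and len(rawLanguage) == 2:
        fixed["language"] = rawLanguage.lower()
    else:
        fixed["language"] = fallbackLanguage

    return fixed


if __name__ == "__main__":
    print(normalizeDocument({"id": "doc_1", "outputFormat": " PDF ", "language": "DE"}))
    print(normalizeDocument({"id": "doc_2", "outputFormat": "xyz"}, fallbackLanguage="fr"))
```

Returning a copy instead of mutating the input is a choice made for the sketch; the code in this diff updates the parsed structure in place.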
@ -0,0 +1,7 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""Billing service."""
|
||||
|
||||
from .mainServiceBilling import BillingService, getService, InsufficientBalanceException, ProviderNotAllowedException, BillingContextError
|
||||
|
||||
__all__ = ["BillingService", "getService", "InsufficientBalanceException", "ProviderNotAllowedException", "BillingContextError"]
|
||||
|
|
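The billing service module in this commit applies a flat percentage markup to every recorded AI cost. As a rough standalone illustration of that arithmetic, assuming the 100 percent value that the module defines for `BILLING_MARKUP_PERCENT`, the calculation can be sketched as follows; `priceWithMarkup` is a hypothetical name, not the service's method.

```python
BILLING_MARKUP_PERCENT = 100  # Assumed to match the constant in the billing module


def priceWithMarkup(basePrice: float, markupPercent: float = BILLING_MARKUP_PERCENT) -> float:
    """Apply a percentage markup; non-positive prices yield 0.0."""
    if basePrice <= 0:
        return 0.0
    # 100 percent markup means a multiplier of 2.0, 50 percent means 1.5
    multiplier = 1 + (markupPercent / 100)
    return round(basePrice * multiplier, 6)


if __name__ == "__main__":
    print(priceWithMarkup(0.0125))
    print(priceWithMarkup(1.0, markupPercent=50))
```

Note that the sketch performs no currency conversion, only the markup multiplication and rounding to six decimal places.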
@ -0,0 +1,436 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Billing Service - Central service for billing operations.
|
||||
|
||||
Handles:
|
||||
- Balance checks before AI operations
|
||||
- Cost recording after AI operations
|
||||
- Provider permission checks via RBAC
|
||||
- Price calculation with markup
|
||||
"""
|
||||
|
||||
import logging
|
||||
from typing import Dict, Any, List, Optional
|
||||
from datetime import datetime
|
||||
|
||||
from modules.datamodels.datamodelUam import User
|
||||
from modules.datamodels.datamodelBilling import (
|
||||
BillingModelEnum,
|
||||
BillingCheckResult,
|
||||
TransactionTypeEnum,
|
||||
ReferenceTypeEnum,
|
||||
BillingTransaction,
|
||||
BillingBalanceResponse,
|
||||
)
|
||||
from modules.interfaces.interfaceDbBilling import getInterface as getBillingInterface
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Markup percentage for internal pricing (+50% for infrastructure and platform service + 50% for currency risk => factor 2.0)
|
||||
BILLING_MARKUP_PERCENT = 100
|
||||
|
||||
# Singleton cache
|
||||
_billingServices: Dict[str, "BillingService"] = {}
|
||||
|
||||
|
||||
def getService(currentUser: User, mandateId: str, featureInstanceId: str = None, featureCode: str = None) -> "BillingService":
|
||||
"""
|
||||
Factory function to get or create a BillingService instance.
|
||||
|
||||
Args:
|
||||
currentUser: Current user object
|
||||
mandateId: Mandate ID for context
|
||||
featureInstanceId: Optional feature instance ID
|
||||
featureCode: Optional feature code (e.g., 'chatplayground', 'automation')
|
||||
|
||||
Returns:
|
||||
BillingService instance
|
||||
"""
|
||||
cacheKey = f"{currentUser.id}_{mandateId}_{featureInstanceId}"
|
||||
|
||||
if cacheKey not in _billingServices:
|
||||
_billingServices[cacheKey] = BillingService(currentUser, mandateId, featureInstanceId, featureCode)
|
||||
else:
|
||||
_billingServices[cacheKey].setContext(currentUser, mandateId, featureInstanceId, featureCode)
|
||||
|
||||
return _billingServices[cacheKey]
|
||||
|
||||
|
||||
def _get_feature_code_from_context(context) -> Optional[str]:
|
||||
"""Extract featureCode from ServiceCenterContext."""
|
||||
if context.workflow and hasattr(context.workflow, "feature") and context.workflow.feature:
|
||||
return getattr(context.workflow.feature, "code", None)
|
||||
return getattr(context.workflow, "featureCode", None) if context.workflow else None
|
||||
|
||||
|
||||
class BillingService:
|
||||
"""
|
||||
Central billing service for AI operations.
|
||||
|
||||
Responsibilities:
|
||||
- Check balance before operations
|
||||
- Record usage costs
|
||||
- Apply pricing markup
|
||||
- Check provider permissions via RBAC
|
||||
|
||||
Supports both service center (context, get_service) and legacy (user, mandateId, ...) initialization.
|
||||
"""
|
||||
|
||||
def __init__(self, context_or_user, mandateId=None, featureInstanceId=None, featureCode=None, get_service=None):
|
||||
"""
|
||||
Initialize the billing service.
|
||||
|
||||
Service center: (context, get_service) - resolver passes exactly these two args
|
||||
Legacy: (currentUser, mandateId, featureInstanceId, featureCode) from getService() factory
|
||||
"""
|
||||
# Detect service center: second arg is callable (get_service)
|
||||
if mandateId is not None and callable(mandateId):
|
||||
ctx = context_or_user
|
||||
get_service = mandateId
|
||||
self.currentUser = ctx.user
|
||||
self.mandateId = ctx.mandate_id or ""
|
||||
self.featureInstanceId = ctx.feature_instance_id
|
||||
self.featureCode = _get_feature_code_from_context(ctx)
|
||||
elif get_service is not None and hasattr(context_or_user, "user"):
|
||||
ctx = context_or_user
|
||||
self.currentUser = ctx.user
|
||||
self.mandateId = ctx.mandate_id or ""
|
||||
self.featureInstanceId = ctx.feature_instance_id
|
||||
self.featureCode = _get_feature_code_from_context(ctx)
|
||||
else:
|
||||
self.currentUser = context_or_user
|
||||
self.mandateId = mandateId or ""
|
||||
self.featureInstanceId = featureInstanceId
|
||||
self.featureCode = featureCode
|
||||
|
||||
self._billingInterface = getBillingInterface(self.currentUser, self.mandateId)
|
||||
self._settingsCache = None
|
||||
|
||||
def setContext(
|
||||
self,
|
||||
currentUser: User,
|
||||
mandateId: str,
|
||||
featureInstanceId: str = None,
|
||||
featureCode: str = None
|
||||
):
|
||||
"""Update service context."""
|
||||
self.currentUser = currentUser
|
||||
self.mandateId = mandateId
|
||||
self.featureInstanceId = featureInstanceId
|
||||
self.featureCode = featureCode
|
||||
self._billingInterface = getBillingInterface(currentUser, mandateId)
|
||||
self._settingsCache = None
|
||||
|
||||
def _getSettings(self) -> Optional[Dict[str, Any]]:
|
||||
"""Get billing settings with caching."""
|
||||
if self._settingsCache is None:
|
||||
self._settingsCache = self._billingInterface.getSettings(self.mandateId)
|
||||
return self._settingsCache
|
||||
|
||||
# =========================================================================
|
||||
# Price Calculation
|
||||
# =========================================================================
|
||||
|
||||
def calculatePriceWithMarkup(self, basePriceCHF: float) -> float:
|
||||
"""
|
||||
Calculate final price with markup.
|
||||
|
||||
The AICore plugins return prices in their original currency (USD).
|
||||
This method applies the configured markup percentage.
|
||||
|
||||
Args:
|
||||
basePriceCHF: Base price from AI model (actually USD from provider)
|
||||
|
||||
Returns:
|
||||
Final price in CHF with markup applied
|
||||
"""
|
||||
if basePriceCHF <= 0:
|
||||
return 0.0
|
||||
|
||||
# Apply markup (e.g., 100% = multiply by 2.0)
|
||||
markup_multiplier = 1 + (BILLING_MARKUP_PERCENT / 100)
|
||||
return round(basePriceCHF * markup_multiplier, 6)
|
||||
|
||||
# =========================================================================
|
||||
# Balance Operations
|
||||
# =========================================================================
|
||||
|
||||
def checkBalance(self, estimatedCost: float = 0.0) -> BillingCheckResult:
|
||||
"""
|
||||
Check if the current user/mandate has sufficient balance.
|
||||
|
||||
Args:
|
||||
estimatedCost: Estimated cost of the operation (with markup applied)
|
||||
|
||||
Returns:
|
||||
BillingCheckResult indicating if operation is allowed
|
||||
"""
|
||||
return self._billingInterface.checkBalance(
|
||||
self.mandateId,
|
||||
self.currentUser.id,
|
||||
estimatedCost
|
||||
)
|
||||
|
||||
def hasBalance(self, estimatedCost: float = 0.0) -> bool:
|
||||
"""
|
||||
Quick check if balance is sufficient.
|
||||
|
||||
Args:
|
||||
estimatedCost: Estimated cost with markup
|
||||
|
||||
Returns:
|
||||
True if operation is allowed
|
||||
"""
|
||||
result = self.checkBalance(estimatedCost)
|
||||
return result.allowed
|
||||
|
||||
def getCurrentBalance(self) -> float:
|
||||
"""
|
||||
Get current balance for the user/mandate.
|
||||
|
||||
Returns:
|
||||
Current balance in CHF
|
||||
"""
|
||||
result = self.checkBalance(0.0)
|
||||
return result.currentBalance or 0.0
|
||||
    # =========================================================================
    # Usage Recording
    # =========================================================================

    def recordUsage(
        self,
        priceCHF: float,
        workflowId: str = None,
        aicoreProvider: str = None,
        aicoreModel: str = None,
        description: str = None
    ) -> Optional[Dict[str, Any]]:
        """
        Record AI usage cost as a billing transaction.

        This method:
        1. Applies the pricing markup
        2. Creates a DEBIT transaction
        3. Updates the account balance

        Args:
            priceCHF: Base price from AI model (before markup)
            workflowId: Optional workflow ID
            aicoreProvider: AICore provider name (e.g., 'anthropic', 'openai')
            aicoreModel: AICore model name (e.g., 'claude-4-sonnet', 'gpt-4o')
            description: Optional description

        Returns:
            Created transaction dict or None if not recorded
        """
        if priceCHF <= 0:
            return None

        # Apply markup
        finalPrice = self.calculatePriceWithMarkup(priceCHF)

        if finalPrice <= 0:
            return None

        # Build description
        if not description:
            description = f"AI Usage: {aicoreModel or aicoreProvider or 'unknown'}"

        return self._billingInterface.recordUsage(
            mandateId=self.mandateId,
            userId=self.currentUser.id,
            priceCHF=finalPrice,
            workflowId=workflowId,
            featureInstanceId=self.featureInstanceId,
            featureCode=self.featureCode,
            aicoreProvider=aicoreProvider,
            aicoreModel=aicoreModel,
            description=description
        )

    # =========================================================================
    # Provider Permission Check (via RBAC)
    # =========================================================================

    def isProviderAllowed(self, provider: str) -> bool:
        """
        Check if the user has permission to use an AICore provider.

        Uses RBAC to check for resource permission:
            resource.aicore.{provider}

        Args:
            provider: Provider name (e.g., 'anthropic', 'openai')

        Returns:
            True if provider is allowed
        """
        try:
            from modules.security.rbac import RbacClass
            from modules.datamodels.datamodelRbac import AccessRuleContext
            from modules.security.rootAccess import getRootDbAppConnector

            # Get database connector via established pattern
            dbApp = getRootDbAppConnector()

            rbac = RbacClass(dbApp, dbApp)
            resourceKey = f"resource.aicore.{provider}"

            # Check if user has view permission for this resource (view = use for RESOURCE context)
            permissions = rbac.getUserPermissions(
                self.currentUser,
                AccessRuleContext.RESOURCE,
                resourceKey,
                mandateId=self.mandateId
            )

            return permissions.view
        except Exception as e:
            logger.warning(f"Error checking provider permission: {e}")
            # Default to allowed if RBAC check fails
            return True

    def getallowedProviders(self) -> List[str]:
        """
        Get list of AICore providers the user is allowed to use.

        Returns:
            List of allowed provider names
        """
        try:
            from modules.aicore.aicoreModelRegistry import modelRegistry

            # Get all available providers
            connectors = modelRegistry.discoverConnectors()
            allProviders = [c.getConnectorType() for c in connectors]

            # Filter by RBAC permissions
            return [p for p in allProviders if self.isProviderAllowed(p)]
        except Exception as e:
            logger.warning(f"Error getting allowed providers: {e}")
            return []

    # =========================================================================
    # Admin Operations
    # =========================================================================

    def addCredit(
        self,
        amount: float,
        description: str = "Manual credit",
        referenceType: ReferenceTypeEnum = ReferenceTypeEnum.ADMIN
    ) -> Optional[Dict[str, Any]]:
        """
        Add credit to the account (admin operation).

        Args:
            amount: Amount to credit (positive)
            description: Transaction description
            referenceType: Reference type (ADMIN, PAYMENT, SYSTEM)

        Returns:
            Created transaction dict or None
        """
        if amount <= 0:
            return None

        settings = self._getSettings()
        if not settings:
            logger.warning(f"No billing settings for mandate {self.mandateId}")
            return None

        billingModel = BillingModelEnum(settings.get("billingModel", BillingModelEnum.UNLIMITED.value))

        # Get or create account
        if billingModel == BillingModelEnum.PREPAY_USER:
            account = self._billingInterface.getOrCreateUserAccount(
                self.mandateId,
                self.currentUser.id,
                initialBalance=0.0
            )
        else:
            account = self._billingInterface.getOrCreateMandateAccount(
                self.mandateId,
                initialBalance=0.0
            )

        # Create credit transaction
        transaction = BillingTransaction(
            accountId=account["id"],
            transactionType=TransactionTypeEnum.CREDIT,
            amount=amount,
            description=description,
            referenceType=referenceType
        )

        return self._billingInterface.createTransaction(transaction)

    # =========================================================================
    # Statistics & Reporting
    # =========================================================================

    def getBalancesForUser(self) -> List[BillingBalanceResponse]:
        """
        Get all billing balances for the current user.

        Returns:
            List of balance responses for each mandate
        """
        return self._billingInterface.getBalancesForUser(self.currentUser.id)

    def getTransactionHistory(self, limit: int = 100) -> List[Dict[str, Any]]:
        """
        Get transaction history for the user across all mandates.

        Args:
            limit: Maximum number of transactions

        Returns:
            List of transactions
        """
        return self._billingInterface.getTransactionsForUser(self.currentUser.id, limit=limit)


# ============================================================================
# Exception Classes
# ============================================================================

class InsufficientBalanceException(Exception):
    """Raised when there's insufficient balance for an operation."""

    def __init__(self, currentBalance: float, requiredAmount: float, message: str = None):
        self.currentBalance = currentBalance
        self.requiredAmount = requiredAmount
        self.message = message or f"Insufficient balance. Current: {currentBalance:.2f} CHF, Required: {requiredAmount:.2f} CHF"
        super().__init__(self.message)


class ProviderNotAllowedException(Exception):
    """Raised when a user doesn't have permission to use an AI provider."""

    def __init__(self, provider: str, message: str = None):
        self.provider = provider
        self.message = message or f"Provider '{provider}' is not allowed for your role"
        super().__init__(self.message)


class BillingContextError(Exception):
    """Raised when billing context is incomplete (missing mandateId, user, etc.).

    This is a FAIL-SAFE error: AI calls MUST NOT proceed without valid billing context.
    Acts like a 0 CHF credit card pre-authorization check - validates that billing
    CAN be recorded before any expensive AI operation starts.
    """

    def __init__(self, message: str = None):
        self.message = message or "Billing context incomplete - AI call blocked"
        super().__init__(self.message)


# Expose exception classes on BillingService so consumers can use service.InsufficientBalanceException
# instead of importing from this module
BillingService.InsufficientBalanceException = InsufficientBalanceException
BillingService.ProviderNotAllowedException = ProviderNotAllowedException
BillingService.BillingContextError = BillingContextError
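The markup arithmetic used by `calculatePriceWithMarkup` and `recordUsage` can be sketched in isolation. This is a minimal sketch, not the module's code: the value 50 for `BILLING_MARKUP_PERCENT` is an assumption taken from the inline comment ("50% = multiply by 1.5"), since the real constant is defined outside this excerpt, and `price_with_markup` is a hypothetical standalone name.

```python
# Assumed value, taken from the "50% = multiply by 1.5" comment; the real
# constant is defined elsewhere in the billing module.
BILLING_MARKUP_PERCENT = 50


def price_with_markup(base_price_chf: float) -> float:
    """Hypothetical standalone version of the markup calculation."""
    if base_price_chf <= 0:
        # Non-positive base prices are never billed
        return 0.0
    multiplier = 1 + (BILLING_MARKUP_PERCENT / 100)
    # Six decimals keeps sub-cent AI usage costs from rounding to zero
    return round(base_price_chf * multiplier, 6)


print(price_with_markup(2.0))
print(price_with_markup(-1.0))
```

Rounding to six decimal places rather than two matters here because a single model call can cost a small fraction of a Rappen.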
104
modules/serviceCenter/services/serviceBilling/stripeCheckout.py
Normal file
|
|
@ -0,0 +1,104 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""
Stripe Checkout service for billing credit top-ups.
Creates Checkout Sessions for redirect-based payment flow.
"""

import logging
from typing import Optional

from modules.shared.configuration import APP_CONFIG

logger = logging.getLogger(__name__)

# Server-side allowed amounts in CHF - never trust client
ALLOWED_AMOUNTS_CHF = [10, 25, 50, 100, 250, 500]


def create_checkout_session(
    mandate_id: str,
    user_id: Optional[str],
    amount_chf: float
) -> str:
    """
    Create a Stripe Checkout Session for credit top-up.

    Amount and currency are validated server-side. The client-provided amount
    must match an allowed preset.

    Args:
        mandate_id: Target mandate ID
        user_id: Target user ID (for PREPAY_USER) or None (for mandate pool)
        amount_chf: Amount in CHF (must be in ALLOWED_AMOUNTS_CHF)

    Returns:
        Stripe Checkout Session URL for redirect

    Raises:
        ValueError: If amount is invalid
    """
    import stripe

    # Validate amount server-side
    if amount_chf not in ALLOWED_AMOUNTS_CHF:
        raise ValueError(
            f"Invalid amount {amount_chf} CHF. Allowed: {ALLOWED_AMOUNTS_CHF}"
        )

    # Pin API version from config (match Stripe Dashboard)
    api_version = APP_CONFIG.get("STRIPE_API_VERSION")
    if api_version:
        stripe.api_version = api_version

    # Get secrets
    secret_key = APP_CONFIG.get("STRIPE_SECRET_KEY_SECRET") or APP_CONFIG.get("STRIPE_SECRET_KEY")
    if not secret_key:
        raise ValueError("STRIPE_SECRET_KEY_SECRET not configured")

    stripe.api_key = secret_key

    frontend_url = APP_CONFIG.get("APP_FRONTEND_URL", "https://nyla-int.poweron-center.net")
    base_path = "/admin/billing"
    # Doubled braces keep the literal {CHECKOUT_SESSION_ID} placeholder for Stripe to fill in
    success_url = f"{frontend_url.rstrip('/')}{base_path}?success=true&session_id={{CHECKOUT_SESSION_ID}}"
    cancel_url = f"{frontend_url.rstrip('/')}{base_path}?canceled=true"

    # Amount in cents for Stripe (CHF uses 2 decimal places)
    amount_cents = int(round(amount_chf * 100))

    metadata = {
        "mandateId": mandate_id,
        "amountChf": str(amount_chf),
    }
    if user_id:
        metadata["userId"] = user_id

    session = stripe.checkout.Session.create(
        mode="payment",
        line_items=[
            {
                "price_data": {
                    "currency": "chf",
                    "unit_amount": amount_cents,
                    "product_data": {
                        "name": "Guthaben aufladen",  # German: "Top up credit"
                        "description": "AI Service Guthaben (CHF)",  # German: "AI service credit (CHF)"
                    },
                },
                "quantity": 1,
            }
        ],
        success_url=success_url,
        cancel_url=cancel_url,
        metadata=metadata,
    )

    if not session or not session.url:
        raise ValueError("Stripe Checkout Session creation failed")

    logger.info(
        f"Created Stripe Checkout Session {session.id} for mandate {mandate_id}, "
        f"amount {amount_chf} CHF"
    )

    return session.url
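The server-side amount check, the conversion to minor units, and the redirect URL template can be exercised without calling Stripe. The following is a minimal sketch under stated assumptions: `to_minor_units` and `build_success_url` are hypothetical helper names that do not exist in the module, and the example host is a placeholder.

```python
ALLOWED_AMOUNTS_CHF = [10, 25, 50, 100, 250, 500]


def to_minor_units(amount_chf: float) -> int:
    """Validate against the preset list, then convert CHF to Rappen."""
    if amount_chf not in ALLOWED_AMOUNTS_CHF:
        raise ValueError(f"Invalid amount {amount_chf} CHF. Allowed: {ALLOWED_AMOUNTS_CHF}")
    # round() before int() guards against float artifacts such as 2499.9999
    return int(round(amount_chf * 100))


def build_success_url(frontend_url: str, base_path: str = "/admin/billing") -> str:
    """Build the redirect URL, leaving the session placeholder for Stripe to substitute."""
    # Doubled braces in the f-string produce a literal {CHECKOUT_SESSION_ID}
    return f"{frontend_url.rstrip('/')}{base_path}?success=true&session_id={{CHECKOUT_SESSION_ID}}"


print(to_minor_units(25.0))
print(build_success_url("https://example.invalid/"))
```

Because Python compares `25.0 == 25` as equal, a float sent by the client still matches the integer presets, while any value outside the list is rejected before Stripe is contacted.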
7
modules/serviceCenter/services/serviceChat/__init__.py
Normal file
|
|
@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
"""Chat service."""

from .mainServiceChat import ChatService

__all__ = ["ChatService"]
1086
modules/serviceCenter/services/serviceChat/mainServiceChat.py
Normal file
File diff suppressed because it is too large
|
|
@ -0,0 +1,7 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from .mainServiceExtraction import ExtractionService

__all__ = ["ExtractionService"]

|
|
@ -0,0 +1,4 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

|
|
@ -0,0 +1,184 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import base64
import io

from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Chunker


class ImageChunker(Chunker):
    """Chunker for reducing image size through resizing, compression, and tiling."""

    def chunk(self, part: ContentPart, options: Dict[str, Any]) -> list[Dict[str, Any]]:
        """
        Chunk an image by reducing its size through various strategies.

        Args:
            part: ContentPart containing image data (base64 encoded)
            options: Chunking options including:
                - imageChunkSize: Maximum size in bytes for each chunk
                - imageMaxPixels: Maximum pixels (width*height) for the image
                - imageQuality: JPEG quality (0-100, default 85)
                - imageTileSize: Size for tiling if image is still too large

        Returns:
            List of image chunks with reduced size
        """
        maxBytes = int(options.get("imageChunkSize", 1000000))  # 1MB default
        maxPixels = int(options.get("imageMaxPixels", 1024 * 1024))  # 1MP default
        quality = int(options.get("imageQuality", 85))
        tileSize = int(options.get("imageTileSize", 512))  # 512x512 tiles

        chunks: List[Dict[str, Any]] = []

        try:
            # Lazy import PIL to avoid hanging during module import
            from PIL import Image

            # Decode base64 image data
            imageData = base64.b64decode(part.data)
            image = Image.open(io.BytesIO(imageData))

            # Get original dimensions
            originalWidth, originalHeight = image.size
            originalPixels = originalWidth * originalHeight

            # Strategy 1: If image is small enough, return as-is
            if len(part.data) <= maxBytes and originalPixels <= maxPixels:
                chunks.append({
                    "data": part.data,
                    "size": len(part.data),
                    "order": 0,
                    "metadata": {
                        "originalSize": len(part.data),
                        "originalPixels": originalPixels,
                        "strategy": "original"
                    }
                })
                return chunks

            # JPEG cannot store alpha or palette modes; convert before any save
            if image.mode not in ("RGB", "L"):
                image = image.convert("RGB")

            # Strategy 2: Resize to fit within pixel limit
            if originalPixels > maxPixels:
                # Calculate new dimensions maintaining aspect ratio
                scale = (maxPixels / originalPixels) ** 0.5
                newWidth = int(originalWidth * scale)
                newHeight = int(originalHeight * scale)

                # Ensure minimum size
                newWidth = max(newWidth, 64)
                newHeight = max(newHeight, 64)

                image = image.resize((newWidth, newHeight), Image.Resampling.LANCZOS)

            # Strategy 3: Compress with quality reduction
            currentSize = len(part.data)
            currentQuality = quality

            # A resized image is re-encoded at least once, even if the original data fit in maxBytes
            while (currentSize > maxBytes or originalPixels > maxPixels) and currentQuality > 10:
                # Compress image
                output = io.BytesIO()
                image.save(output, format='JPEG', quality=currentQuality, optimize=True)
                compressedData = output.getvalue()
                compressedB64 = base64.b64encode(compressedData).decode('utf-8')
                currentSize = len(compressedB64)

                if currentSize <= maxBytes:
                    chunks.append({
                        "data": compressedB64,
                        "size": currentSize,
                        "order": 0,
                        "metadata": {
                            "originalSize": len(part.data),
                            "originalPixels": originalPixels,
                            "compressedSize": currentSize,
                            "quality": currentQuality,
                            "strategy": "compressed"
                        }
                    })
                    return chunks

                currentQuality -= 10

            # Strategy 4: Tile the image if still too large
            if currentSize > maxBytes:
                chunks = self._tileImage(image, maxBytes, tileSize, quality, originalPixels)
                return chunks

            # Fallback: Return compressed version even if over limit
            output = io.BytesIO()
            image.save(output, format='JPEG', quality=10, optimize=True)
            compressedData = output.getvalue()
            compressedB64 = base64.b64encode(compressedData).decode('utf-8')

            chunks.append({
                "data": compressedB64,
                "size": len(compressedB64),
                "order": 0,
                "metadata": {
                    "originalSize": len(part.data),
                    "originalPixels": originalPixels,
                    "compressedSize": len(compressedB64),
                    "quality": 10,
                    "strategy": "fallback_compressed"
                }
            })

        except Exception as e:
            # Fallback: Return original data with error metadata
            chunks.append({
                "data": part.data,
                "size": len(part.data),
                "order": 0,
                "metadata": {
                    "originalSize": len(part.data),
                    "strategy": "error_fallback",
                    "error": str(e)
                }
            })

        return chunks

    def _tileImage(self, image: Any, maxBytes: int, tileSize: int, quality: int, originalPixels: int) -> List[Dict[str, Any]]:
        """Split image into tiles if it's still too large after compression."""
        chunks = []
        width, height = image.size

        # Calculate tile grid
        tilesX = (width + tileSize - 1) // tileSize
        tilesY = (height + tileSize - 1) // tileSize

        for y in range(tilesY):
            for x in range(tilesX):
                # Calculate tile boundaries
                left = x * tileSize
                top = y * tileSize
                right = min(left + tileSize, width)
                bottom = min(top + tileSize, height)

                # Extract tile
                tile = image.crop((left, top, right, bottom))

                # Compress tile
                output = io.BytesIO()
                tile.save(output, format='JPEG', quality=quality, optimize=True)
                tileData = output.getvalue()
                tileB64 = base64.b64encode(tileData).decode('utf-8')

                chunks.append({
                    "data": tileB64,
                    "size": len(tileB64),
                    "order": y * tilesX + x,
                    "metadata": {
                        "originalSize": len(image.tobytes()),
                        "originalPixels": originalPixels,
                        "tileSize": tileSize,
                        "tilePosition": f"{x},{y}",
                        "tileBounds": f"{left},{top},{right},{bottom}",
                        "quality": quality,
                        "strategy": "tiled"
                    }
                })

        return chunks
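The tile grid computed in `_tileImage` can be checked without Pillow. A minimal sketch, assuming only the ceiling-division layout shown above; `tile_bounds` is a hypothetical helper name, not part of the module.

```python
from typing import List, Tuple


def tile_bounds(width: int, height: int, tile_size: int) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) crop boxes in row-major order."""
    tiles_x = (width + tile_size - 1) // tile_size  # ceiling division
    tiles_y = (height + tile_size - 1) // tile_size
    boxes = []
    for y in range(tiles_y):
        for x in range(tiles_x):
            left, top = x * tile_size, y * tile_size
            # Edge tiles are clipped to the image, so they may be smaller than tile_size
            boxes.append((left, top, min(left + tile_size, width), min(top + tile_size, height)))
    return boxes


boxes = tile_bounds(1200, 600, 512)
print(len(boxes))
print(boxes[-1])
```

The position of each box in the returned list matches the `order` value `y * tilesX + x` used for the chunk, so a consumer can reassemble the image from the `tileBounds` metadata.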
|
|
@ -0,0 +1,91 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import json

from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Chunker


class StructureChunker(Chunker):
    def chunk(self, part: ContentPart, options: Dict[str, Any]) -> list[Dict[str, Any]]:
        maxBytes = int(options.get("structureChunkSize", 40000))
        data = part.data or ""
        # best-effort: try JSON list/object bucketing; else fallback to line-based
        chunks: List[Dict[str, Any]] = []
        try:
            obj = json.loads(data)
            def emit(bucket: Any):
                text = json.dumps(bucket, ensure_ascii=False)
                chunks.append({"data": text, "size": len(text.encode('utf-8')), "order": len(chunks)})
            if isinstance(obj, list):
                bucket: list[Any] = []
                size = 0
                for item in obj:
                    text = json.dumps(item, ensure_ascii=False)
                    s = len(text.encode('utf-8'))
                    if size + s > maxBytes and bucket:
                        emit(bucket)
                        bucket = [item]
                        size = s
                    else:
                        bucket.append(item)
                        size += s
                if bucket:
                    emit(bucket)
            else:
                # JSON object (dict) - check if it fits
                text = json.dumps(obj, ensure_ascii=False)
                textSize = len(text.encode('utf-8'))
                if textSize <= maxBytes:
                    emit(obj)
                else:
                    # Object too large - try to split by keys if possible
                    # For large objects, we need to chunk by character boundaries
                    # since we can't split JSON objects arbitrarily
                    if isinstance(obj, dict) and len(obj) > 1:
                        # Try to split object into multiple chunks by keys
                        # This preserves JSON structure better than line-based chunking
                        currentChunk: Dict[str, Any] = {}
                        currentSize = 2  # Start with "{}" overhead
                        for key, value in obj.items():
                            itemText = json.dumps({key: value}, ensure_ascii=False)
                            itemSize = len(itemText.encode('utf-8'))
                            # Account for comma and spacing between items
                            if currentChunk:
                                itemSize += 2  # ", " separator

                            if currentSize + itemSize > maxBytes and currentChunk:
                                # Current chunk is full, emit it
                                emit(currentChunk)
                                currentChunk = {key: value}
                                currentSize = len(itemText.encode('utf-8'))
                            else:
                                currentChunk[key] = value
                                currentSize += itemSize

                        # Emit remaining chunk
                        if currentChunk:
                            emit(currentChunk)
                    else:
                        # Single large value or can't split - fallback to line chunking
                        raise ValueError("too large")
        except Exception:
            current: List[str] = []
            size = 0
            for line in data.split('\n'):
                s = len(line.encode('utf-8')) + 1
                if size + s > maxBytes and current:
                    text = '\n'.join(current)
                    chunks.append({"data": text, "size": len(text.encode('utf-8')), "order": len(chunks)})
                    current = [line]
                    size = s
                else:
                    current.append(line)
                    size += s
            if current:
                text = '\n'.join(current)
                chunks.append({"data": text, "size": len(text.encode('utf-8')), "order": len(chunks)})
        return chunks

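The list branch of `StructureChunker.chunk` is a greedy bucketing pass. A minimal standalone sketch of that pass follows; `bucket_json_list` is a hypothetical name and the tiny `max_bytes` value is chosen only to force several buckets.

```python
import json
from typing import Any, List


def bucket_json_list(items: List[Any], max_bytes: int) -> List[str]:
    """Greedily pack list items into JSON arrays of roughly max_bytes each."""
    chunks: List[str] = []
    bucket: List[Any] = []
    size = 0
    for item in items:
        item_size = len(json.dumps(item, ensure_ascii=False).encode("utf-8"))
        if size + item_size > max_bytes and bucket:
            # Bucket is full: emit it and start a new one with the current item
            chunks.append(json.dumps(bucket, ensure_ascii=False))
            bucket, size = [item], item_size
        else:
            bucket.append(item)
            size += item_size
    if bucket:
        chunks.append(json.dumps(bucket, ensure_ascii=False))
    return chunks


chunks = bucket_json_list([{"id": i} for i in range(5)], max_bytes=20)
print(chunks)
```

The running size counts only the serialized items, not the enclosing brackets or the `, ` separators, so an emitted chunk can exceed the limit by a few bytes. Each chunk remains valid JSON on its own, which is the property the line-based fallback cannot guarantee.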
|
|
@ -0,0 +1,30 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List

from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Chunker


class TableChunker(Chunker):
    def chunk(self, part: ContentPart, options: Dict[str, Any]) -> list[Dict[str, Any]]:
        maxBytes = int(options.get("tableChunkSize", 40000))
        chunks: List[Dict[str, Any]] = []
        current: List[str] = []
        size = 0
        for line in part.data.split('\n'):
            lineSize = len(line.encode('utf-8')) + 1
            if size + lineSize > maxBytes and current:
                data = '\n'.join(current)
                chunks.append({"data": data, "size": len(data.encode('utf-8')), "order": len(chunks)})
                current = [line]
                size = lineSize
            else:
                current.append(line)
                size += lineSize
        if current:
            data = '\n'.join(current)
            chunks.append({"data": data, "size": len(data.encode('utf-8')), "order": len(chunks)})
        return chunks

|
|
@ -0,0 +1,58 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import logging

from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Chunker

logger = logging.getLogger(__name__)


class TextChunker(Chunker):
    def chunk(self, part: ContentPart, options: Dict[str, Any]) -> list[Dict[str, Any]]:
        maxBytes = int(options.get("textChunkSize", 40000))
        logger.debug(f"TextChunker: textChunkSize from options: {options.get('textChunkSize', 'NOT_FOUND')}")
        logger.debug(f"TextChunker: using maxBytes: {maxBytes}")
        chunks: List[Dict[str, Any]] = []

        # Split by lines first (preferred method for text)
        lines = part.data.split('\n')
        current: List[str] = []
        size = 0

        for line in lines:
            lineSize = len(line.encode('utf-8')) + 1  # +1 for newline character
            if size + lineSize > maxBytes and current:
                # Current chunk is full, save it and start new one
                data = '\n'.join(current)
                chunks.append({"data": data, "size": len(data.encode('utf-8')), "order": len(chunks)})
                current = []
                size = 0

            # If a single line is larger than maxBytes, split it by character boundaries
            if lineSize > maxBytes:
                # Split the long line into chunks
                lineBytes = line.encode('utf-8')
                lineStart = 0
                while lineStart < len(lineBytes):
                    lineEnd = min(lineStart + maxBytes, len(lineBytes))
                    # Step back to a UTF-8 character boundary so no character is cut in half
                    while lineStart + 1 < lineEnd < len(lineBytes) and (lineBytes[lineEnd] & 0xC0) == 0x80:
                        lineEnd -= 1
                    chunkBytes = lineBytes[lineStart:lineEnd]
                    chunkText = chunkBytes.decode('utf-8', errors='ignore')
                    chunks.append({"data": chunkText, "size": len(chunkBytes), "order": len(chunks)})
                    lineStart = lineEnd
                # Don't add this line to current, it's already chunked
                continue

            # Add line to current chunk
            current.append(line)
            size += lineSize

        # Add remaining lines as final chunk
        if current:
            data = '\n'.join(current)
            chunks.append({"data": data, "size": len(data.encode('utf-8')), "order": len(chunks)})

        logger.debug(f"TextChunker: Created {len(chunks)} chunks, total input size: {len(part.data.encode('utf-8'))} bytes")
        return chunks

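Splitting an overlong line on raw byte offsets can cut a multi-byte UTF-8 character in half, and decoding with `errors='ignore'` then silently drops it. A minimal sketch of a boundary-aware split is shown here; `split_utf8` is a hypothetical helper name, and the requirement that `max_bytes` be at least 4 is an assumption made so that any single character fits in one piece.

```python
from typing import List


def split_utf8(text: str, max_bytes: int) -> List[str]:
    """Split text into pieces of at most max_bytes UTF-8 bytes without cutting a character."""
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 to fit any UTF-8 character")
    data = text.encode("utf-8")
    pieces: List[str] = []
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Continuation bytes have the bit pattern 10xxxxxx; step back until the
        # cut lands on the first byte of a character.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        pieces.append(data[start:end].decode("utf-8"))
        start = end
    return pieces


pieces = split_utf8("äöü" * 3, max_bytes=5)
print(pieces)
```

Because every cut falls on a character boundary, joining the pieces reproduces the input exactly, which a byte-offset split with ignored decode errors does not guarantee.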
|
|
@ -0,0 +1,4 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

|
|
@ -0,0 +1,47 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List
import base64

from ..subUtils import makeId
from modules.datamodels.datamodelExtraction import ContentPart
from ..subRegistry import Extractor


class BinaryExtractor(Extractor):
    """
    Fallback extractor for unsupported file types.

    This extractor handles any file type that doesn't match other extractors.
    It encodes the file as base64 and marks it as binary data.

    Supported formats:
    - All file types (fallback)
    - MIME types: application/octet-stream (default)
    - File extensions: All (fallback)
    """

    def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
        return True

    def getSupportedExtensions(self) -> list[str]:
        """Return list of supported file extensions (all)."""
        return []  # Accepts all extensions as fallback

    def getSupportedMimeTypes(self) -> list[str]:
        """Return list of supported MIME types (all)."""
        return []  # Accepts all MIME types as fallback

    def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
        mimeType = context.get("mimeType") or "application/octet-stream"
        return [ContentPart(
            id=makeId(),
            parentId=None,
            label="binary",
            typeGroup="binary",
            mimeType=mimeType,
            data=base64.b64encode(fileBytes).decode("utf-8"),
            metadata={"size": len(fileBytes), "warning": "Unsupported file type"}
        )]

|
|
@ -0,0 +1,45 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.
from typing import Any, Dict, List

from modules.datamodels.datamodelExtraction import ContentPart
from ..subUtils import makeId
from ..subRegistry import Extractor


class CsvExtractor(Extractor):
    """
    Extractor for CSV files.

    Supported formats:
    - MIME types: text/csv
    - File extensions: .csv
    - Special handling: Treats as table data
    """

    def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
        return mimeType == "text/csv" or (fileName or "").lower().endswith(".csv")

    def getSupportedExtensions(self) -> list[str]:
        """Return list of supported file extensions."""
        return [".csv"]

    def getSupportedMimeTypes(self) -> list[str]:
        """Return list of supported MIME types."""
        return ["text/csv"]

    def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
        fileName = context.get("fileName")
        mimeType = context.get("mimeType") or "text/csv"
        data = fileBytes.decode("utf-8", errors="replace")
        return [ContentPart(
            id=makeId(),
            parentId=None,
            label="main",
            typeGroup="table",
            mimeType=mimeType,
            data=data,
            metadata={"size": len(fileBytes)}
        )]

|
|
@ -0,0 +1,109 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, List
|
||||
import io
|
||||
|
||||
from ..subUtils import makeId
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from ..subRegistry import Extractor
|
||||
|
||||
|
||||
class DocxExtractor(Extractor):
|
||||
"""
|
||||
Extractor for Microsoft Word documents.
|
||||
|
||||
Supported formats:
|
||||
- MIME types: application/vnd.openxmlformats-officedocument.wordprocessingml.document
|
||||
- File extensions: .docx
|
||||
- Special handling: Extracts paragraphs and tables (converts tables to CSV)
|
||||
- Dependencies: python-docx
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
self._loaded = False
|
||||
self._haveLibs = False
|
||||
|
||||
def _load(self):
|
||||
if self._loaded:
|
||||
return
|
||||
self._loaded = True
|
||||
try:
|
||||
global docx
|
||||
import docx # python-docx
|
||||
self._haveLibs = True
|
||||
except Exception:
|
||||
self._haveLibs = False
|
||||
|
||||
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
|
||||
return mimeType == "application/vnd.openxmlformats-officedocument.wordprocessingml.document" or (fileName or "").lower().endswith(".docx")
|
||||
|
||||
def getSupportedExtensions(self) -> list[str]:
|
||||
"""Return list of supported file extensions."""
|
||||
return [".docx"]
|
||||
|
||||
def getSupportedMimeTypes(self) -> list[str]:
|
||||
"""Return list of supported MIME types."""
|
||||
return ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"]
|
||||
|
||||
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
|
||||
self._load()
|
||||
parts: List[ContentPart] = []
|
||||
rootId = makeId()
|
||||
parts.append(ContentPart(
|
||||
id=rootId,
|
||||
parentId=None,
|
||||
label="docx",
|
||||
typeGroup="container",
|
||||
mimeType="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
data="",
|
||||
metadata={"size": len(fileBytes)}
|
||||
))
|
||||
|
||||
if not self._haveLibs:
|
||||
parts.append(ContentPart(
|
||||
id=makeId(),
|
||||
parentId=rootId,
|
||||
label="binary",
|
||||
typeGroup="binary",
|
||||
mimeType="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
data="",
|
||||
metadata={"size": len(fileBytes), "warning": "DOCX lib not available"}
|
||||
))
|
||||
return parts
|
||||
|
||||
with io.BytesIO(fileBytes) as buf:
|
||||
d = docx.Document(buf)
|
||||
# paragraphs
|
||||
for i, para in enumerate(d.paragraphs):
|
||||
text = para.text or ""
|
||||
if text.strip():
|
||||
parts.append(ContentPart(
|
||||
id=makeId(),
|
||||
parentId=rootId,
|
||||
label=f"p_{i+1}",
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data=text,
|
||||
metadata={"size": len(text.encode('utf-8'))}
|
||||
))
|
||||
# tables → CSV rows
|
||||
for ti, table in enumerate(d.tables):
|
||||
rows: list[str] = []
|
||||
for row in table.rows:
|
||||
cells = [ (cell.text or "").replace('"', '""') for cell in row.cells ]
|
||||
rows.append(",".join([f'"{c}"' for c in cells]))
|
||||
csvData = "\n".join(rows)
|
||||
if csvData:
|
||||
parts.append(ContentPart(
|
||||
id=makeId(),
|
||||
parentId=rootId,
|
||||
label=f"table_{ti+1}",
|
||||
typeGroup="table",
|
||||
mimeType="text/csv",
|
||||
data=csvData,
|
||||
metadata={"size": len(csvData.encode('utf-8'))}
|
||||
))
|
||||
|
||||
return parts
|
||||
|
||||
|
||||
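The table loop in the DOCX extractor above quotes every cell and doubles embedded quotes before joining with commas. A minimal standalone sketch of that quoting rule follows; `rowsToCsv` is a hypothetical helper name used only for illustration (it is not part of the commit), and the round trip through the standard `csv` reader is just one way to check that the output is well-formed CSV.

```python
import csv
import io
from typing import List


def rowsToCsv(rows: List[List[str]]) -> str:
    """Quote every cell and double embedded quotes (hypothetical helper)."""
    lines = []
    for row in rows:
        # Same escaping as the extractor: '"' becomes '""', then wrap in quotes.
        cells = [(cell or "").replace('"', '""') for cell in row]
        lines.append(",".join(f'"{c}"' for c in cells))
    return "\n".join(lines)


def parseCsv(csvData: str) -> List[List[str]]:
    """Parse CSV text back into rows with the standard library reader."""
    return list(csv.reader(io.StringIO(csvData)))


if __name__ == "__main__":
    sample = [["name", "note"], ["Ada", 'said "hi"'], ["Bob", "a,b"]]
    print(parseCsv(rowsToCsv(sample)) == sample)
```

Quoting every cell unconditionally is slightly more verbose than minimal quoting, but it keeps the writer trivial and stays safe for cells that contain commas or quotes.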
|
|
@ -0,0 +1,50 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, List
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from ..subUtils import makeId
|
||||
from ..subRegistry import Extractor
|
||||
|
||||
|
||||
class HtmlExtractor(Extractor):
|
||||
"""
|
||||
Extractor for HTML files.
|
||||
|
||||
Supported formats:
|
||||
- MIME types: text/html
|
||||
- File extensions: .html, .htm
|
||||
- Special handling: Parses with BeautifulSoup as a best-effort check; the raw HTML text is returned unchanged
|
||||
- Dependencies: beautifulsoup4
|
||||
"""
|
||||
|
||||
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
|
||||
return mimeType == "text/html" or (fileName or "").lower().endswith((".html", ".htm"))
|
||||
|
||||
def getSupportedExtensions(self) -> list[str]:
|
||||
"""Return list of supported file extensions."""
|
||||
return [".html", ".htm"]
|
||||
|
||||
def getSupportedMimeTypes(self) -> list[str]:
|
||||
"""Return list of supported MIME types."""
|
||||
return ["text/html"]
|
||||
|
||||
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
|
||||
mimeType = context.get("mimeType") or "text/html"
|
||||
text = fileBytes.decode("utf-8", errors="replace")
|
||||
try:
|
||||
BeautifulSoup(text, "html.parser")
|
||||
except Exception:
|
||||
pass
|
||||
return [ContentPart(
|
||||
id=makeId(),
|
||||
parentId=None,
|
||||
label="main",
|
||||
typeGroup="structure",
|
||||
mimeType=mimeType,
|
||||
data=text,
|
||||
metadata={"size": len(fileBytes)}
|
||||
)]
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,77 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, List
|
||||
import base64
|
||||
import logging
|
||||
|
||||
from ..subUtils import makeId
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from ..subRegistry import Extractor
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ImageExtractor(Extractor):
|
||||
"""
|
||||
Extractor for image files.
|
||||
|
||||
Supported formats:
|
||||
- MIME types: image/jpeg, image/png, image/gif, image/webp, image/bmp, image/tiff
|
||||
- File extensions: .jpg, .jpeg, .png, .gif, .webp, .bmp, .tiff
|
||||
- Special handling: GIF files are converted to PNG during extraction
|
||||
"""
|
||||
|
||||
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
|
||||
return ((mimeType or "").startswith("image/") or
|
||||
(fileName or "").lower().endswith((".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp", ".tiff")))
|
||||
|
||||
def getSupportedExtensions(self) -> list[str]:
|
||||
"""Return list of supported file extensions."""
|
||||
return [".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp", ".tiff"]
|
||||
|
||||
def getSupportedMimeTypes(self) -> list[str]:
|
||||
"""Return list of supported MIME types."""
|
||||
return ["image/jpeg", "image/png", "image/gif", "image/webp", "image/bmp", "image/tiff"]
|
||||
|
||||
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
|
||||
mimeType = context.get("mimeType") or "image/unknown"
|
||||
fileName = context.get("fileName", "")
|
||||
|
||||
# Convert GIF to PNG during extraction
|
||||
if mimeType.lower() == "image/gif":
|
||||
try:
|
||||
from PIL import Image
|
||||
import io
|
||||
|
||||
# Open GIF and convert to PNG
|
||||
with Image.open(io.BytesIO(fileBytes)) as img:
|
||||
# Flatten palette/alpha modes to RGB (drops transparency); the PNG save below keeps only the current frame
|
||||
if img.mode in ('RGBA', 'LA', 'P'):
|
||||
img = img.convert('RGB')
|
||||
|
||||
# Save as PNG in memory
|
||||
png_buffer = io.BytesIO()
|
||||
img.save(png_buffer, format='PNG')
|
||||
png_data = png_buffer.getvalue()
|
||||
|
||||
# Log before reassigning fileBytes so the original size is reported correctly
|
||||
logger.info(f"GIF converted to PNG during extraction: {fileName}, original={len(fileBytes)} bytes, converted={len(png_data)} bytes")
|
||||
|
||||
# Update mimeType and fileBytes
|
||||
mimeType = "image/png"
|
||||
fileBytes = png_data
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"GIF conversion failed during extraction for {fileName}: {str(e)}, using original")
|
||||
# Keep original GIF data if conversion fails
|
||||
|
||||
return [ContentPart(
|
||||
id=makeId(),
|
||||
parentId=None,
|
||||
label="image",
|
||||
typeGroup="image",
|
||||
mimeType=mimeType,
|
||||
data=base64.b64encode(fileBytes).decode("utf-8"),
|
||||
metadata={"size": len(fileBytes)}
|
||||
)]
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,50 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, List
|
||||
import json
|
||||
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from ..subUtils import makeId
|
||||
from ..subRegistry import Extractor
|
||||
|
||||
|
||||
class JsonExtractor(Extractor):
|
||||
"""
|
||||
Extractor for JSON files.
|
||||
|
||||
Supported formats:
|
||||
- MIME types: application/json
|
||||
- File extensions: .json
|
||||
- Special handling: Checks that the JSON parses; the raw text is returned unchanged even if it does not
|
||||
"""
|
||||
|
||||
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
|
||||
return mimeType == "application/json" or (fileName or "").lower().endswith(".json")
|
||||
|
||||
def getSupportedExtensions(self) -> list[str]:
|
||||
"""Return list of supported file extensions."""
|
||||
return [".json"]
|
||||
|
||||
def getSupportedMimeTypes(self) -> list[str]:
|
||||
"""Return list of supported MIME types."""
|
||||
return ["application/json"]
|
||||
|
||||
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
|
||||
mimeType = context.get("mimeType") or "application/json"
|
||||
text = fileBytes.decode("utf-8", errors="replace")
|
||||
# Best-effort well-formedness check; the raw text is returned unchanged either way
|
||||
try:
|
||||
json.loads(text)
|
||||
except Exception:
|
||||
pass
|
||||
return [ContentPart(
|
||||
id=makeId(),
|
||||
parentId=None,
|
||||
label="main",
|
||||
typeGroup="structure",
|
||||
mimeType=mimeType,
|
||||
data=text,
|
||||
metadata={"size": len(fileBytes)}
|
||||
)]
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,156 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, List
|
||||
import base64
|
||||
import io
|
||||
|
||||
from ..subUtils import makeId
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from ..subRegistry import Extractor
|
||||
|
||||
|
||||
class PdfExtractor(Extractor):
|
||||
"""
|
||||
Extractor for PDF files.
|
||||
|
||||
Supported formats:
|
||||
- MIME types: application/pdf
|
||||
- File extensions: .pdf
|
||||
- Special handling: Extracts text per page and embedded images
|
||||
- Dependencies: PyPDF2, PyMuPDF (fitz)
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
self._loaded = False
|
||||
self._haveLibs = False
|
||||
|
||||
def _load(self):
|
||||
if self._loaded:
|
||||
return
|
||||
self._loaded = True
|
||||
try:
|
||||
global PyPDF2, fitz
|
||||
import PyPDF2
|
||||
import fitz # PyMuPDF
|
||||
self._haveLibs = True
|
||||
except Exception:
|
||||
self._haveLibs = False
|
||||
|
||||
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
|
||||
return mimeType == "application/pdf" or (fileName or "").lower().endswith(".pdf")
|
||||
|
||||
def getSupportedExtensions(self) -> list[str]:
|
||||
"""Return list of supported file extensions."""
|
||||
return [".pdf"]
|
||||
|
||||
def getSupportedMimeTypes(self) -> list[str]:
|
||||
"""Return list of supported MIME types."""
|
||||
return ["application/pdf"]
|
||||
|
||||
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
|
||||
self._load()
|
||||
parts: List[ContentPart] = []
|
||||
rootId = makeId()
|
||||
parts.append(ContentPart(
|
||||
id=rootId,
|
||||
parentId=None,
|
||||
label="pdf",
|
||||
typeGroup="container",
|
||||
mimeType="application/pdf",
|
||||
data="",
|
||||
metadata={"size": len(fileBytes)}
|
||||
))
|
||||
|
||||
if not self._haveLibs:
|
||||
parts.append(ContentPart(
|
||||
id=makeId(),
|
||||
parentId=rootId,
|
||||
label="binary",
|
||||
typeGroup="binary",
|
||||
mimeType="application/pdf",
|
||||
data=base64.b64encode(fileBytes).decode("utf-8"),
|
||||
metadata={"size": len(fileBytes), "warning": "PDF libs not available"}
|
||||
))
|
||||
return parts
|
||||
|
||||
# Extract text per page with PyMuPDF (same lib as in-place search - ensures extraction matches PDF text layer)
|
||||
try:
|
||||
with io.BytesIO(fileBytes) as buf:
|
||||
doc = fitz.open(stream=buf.getvalue(), filetype="pdf")
|
||||
for i in range(len(doc)):
|
||||
try:
|
||||
page = doc[i]
|
||||
text = page.get_text() or ""
|
||||
if text.strip():
|
||||
parts.append(ContentPart(
|
||||
id=makeId(),
|
||||
parentId=rootId,
|
||||
label=f"page_{i+1}",
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data=text,
|
||||
metadata={"pages": 1, "pageIndex": i, "size": len(text.encode('utf-8'))}
|
||||
))
|
||||
except Exception:
|
||||
continue
|
||||
doc.close()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Fallback to PyPDF2 if PyMuPDF text extraction failed or returned nothing
|
||||
has_text = any(getattr(p, 'typeGroup', '') == "text" for p in parts)
|
||||
if not has_text:
|
||||
try:
|
||||
with io.BytesIO(fileBytes) as buf:
|
||||
reader = PyPDF2.PdfReader(buf)
|
||||
for i, page in enumerate(reader.pages):
|
||||
try:
|
||||
text = page.extract_text() or ""
|
||||
if text.strip():
|
||||
parts.append(ContentPart(
|
||||
id=makeId(),
|
||||
parentId=rootId,
|
||||
label=f"page_{i+1}",
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data=text,
|
||||
metadata={"pages": 1, "pageIndex": i, "size": len(text.encode('utf-8'))}
|
||||
))
|
||||
except Exception:
|
||||
continue
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Extract images with PyMuPDF
|
||||
try:
|
||||
with io.BytesIO(fileBytes) as buf2:
|
||||
doc = fitz.open(stream=buf2.getvalue(), filetype="pdf")
|
||||
for i in range(len(doc)):
|
||||
page = doc[i]
|
||||
images = page.get_images(full=True)
|
||||
for j, img in enumerate(images):
|
||||
try:
|
||||
xref = img[0]
|
||||
baseImage = doc.extract_image(xref)
|
||||
if baseImage:
|
||||
imgBytes = baseImage.get("image", b"")
|
||||
ext = baseImage.get("ext", "png")
|
||||
if imgBytes:
|
||||
parts.append(ContentPart(
|
||||
id=makeId(),
|
||||
parentId=rootId,
|
||||
label=f"image_{i+1}_{j}",
|
||||
typeGroup="image",
|
||||
mimeType=f"image/{ext}",
|
||||
data=base64.b64encode(imgBytes).decode("utf-8"),
|
||||
metadata={"pageIndex": i, "size": len(imgBytes)}
|
||||
))
|
||||
except Exception:
|
||||
continue
|
||||
doc.close()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
return parts
|
||||
|
||||
|
||||
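The DOCX, PDF, PPTX and XLSX extractors in this commit all defer importing their optional libraries until the first `extract()` call and remember whether the import succeeded. The sketch below isolates that pattern under stated assumptions: `LazyDependency` is a hypothetical class name introduced here for illustration, and it uses `importlib.import_module` instead of the `global` + `import` statements in the commit.

```python
import importlib
from types import ModuleType
from typing import Optional


class LazyDependency:
    """Import an optional module on first use and cache the outcome (hypothetical helper)."""

    def __init__(self, moduleName: str) -> None:
        self._moduleName = moduleName
        self._loaded = False
        self.module: Optional[ModuleType] = None

    def load(self) -> bool:
        """Return True if the module is available; the import is attempted only once."""
        if not self._loaded:
            self._loaded = True
            try:
                self.module = importlib.import_module(self._moduleName)
            except ImportError:
                self.module = None
        return self.module is not None


if __name__ == "__main__":
    jsonDep = LazyDependency("json")
    missingDep = LazyDependency("module_that_does_not_exist_xyz")
    print(jsonDep.load(), missingDep.load())
```

Holding the module on the instance avoids module-level globals, at the cost of one attribute lookup per use; either approach keeps application start-up free of heavy imports.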
|
|
@ -0,0 +1,227 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
import logging
|
||||
import base64
|
||||
from typing import List, Dict, Any, Optional
|
||||
from modules.datamodels.datamodelExtraction import ContentPart, ContentExtracted
|
||||
from ..subRegistry import Extractor
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class PptxExtractor(Extractor):
|
||||
"""
|
||||
Extractor for PowerPoint files.
|
||||
|
||||
Supported formats:
|
||||
- MIME types: application/vnd.openxmlformats-officedocument.presentationml.presentation, application/vnd.ms-powerpoint
|
||||
- File extensions: .pptx, .ppt (legacy .ppt is detected, but python-pptx only parses .pptx, so .ppt files yield an error part)
|
||||
- Special handling: Extracts slide content, tables, and images
|
||||
- Dependencies: python-pptx
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
self._loaded = False
|
||||
self._haveLibs = False
|
||||
|
||||
def _load(self):
|
||||
if self._loaded:
|
||||
return
|
||||
self._loaded = True
|
||||
try:
|
||||
global Presentation
|
||||
from pptx import Presentation
|
||||
self._haveLibs = True
|
||||
except Exception:
|
||||
self._haveLibs = False
|
||||
|
||||
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
|
||||
return (mimeType in [
|
||||
"application/vnd.openxmlformats-officedocument.presentationml.presentation",
|
||||
"application/vnd.ms-powerpoint"
|
||||
]) or (fileName or "").lower().endswith((".pptx", ".ppt"))
|
||||
|
||||
def getSupportedExtensions(self) -> list[str]:
|
||||
"""Return list of supported file extensions."""
|
||||
return [".pptx", ".ppt"]
|
||||
|
||||
def getSupportedMimeTypes(self) -> list[str]:
|
||||
"""Return list of supported MIME types."""
|
||||
return [
|
||||
"application/vnd.openxmlformats-officedocument.presentationml.presentation",
|
||||
"application/vnd.ms-powerpoint"
|
||||
]
|
||||
|
||||
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
|
||||
"""
|
||||
Extract content from PowerPoint files.
|
||||
|
||||
Args:
|
||||
fileBytes: Raw file data as bytes
|
||||
context: Context dictionary with file information
|
||||
|
||||
Returns:
|
||||
List of ContentPart objects with extracted content
|
||||
"""
|
||||
self._load()
|
||||
|
||||
if not self._haveLibs:
|
||||
logger.error("python-pptx library not installed. Install with: pip install python-pptx")
|
||||
return [ContentPart(
|
||||
id="error",
|
||||
label="PowerPoint Extraction Error",
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data="Error: python-pptx library not installed",
|
||||
metadata={"error": True, "error_message": "python-pptx library not installed"}
|
||||
)]
|
||||
|
||||
try:
|
||||
import io
|
||||
|
||||
# Load presentation from bytes
|
||||
presentation = Presentation(io.BytesIO(fileBytes))
|
||||
|
||||
parts = []
|
||||
slide_index = 0
|
||||
|
||||
# Extract content from each slide
|
||||
for slide in presentation.slides:
|
||||
slide_index += 1
|
||||
slide_content = []
|
||||
|
||||
# Extract text from slide
|
||||
for shape in slide.shapes:
|
||||
if hasattr(shape, "text") and shape.text.strip():
|
||||
slide_content.append(shape.text.strip())
|
||||
|
||||
# Extract table data
|
||||
for shape in slide.shapes:
|
||||
if shape.has_table:
|
||||
table = shape.table
|
||||
table_data = []
|
||||
for row in table.rows:
|
||||
row_data = []
|
||||
for cell in row.cells:
|
||||
row_data.append(cell.text.strip())
|
||||
table_data.append(row_data)
|
||||
|
||||
if table_data:
|
||||
# Convert table to markdown format
|
||||
table_md = self._table_to_markdown(table_data)
|
||||
slide_content.append(table_md)
|
||||
|
||||
# Extract images
|
||||
for shape in slide.shapes:
|
||||
if shape.shape_type == 13: # MSO_SHAPE_TYPE.PICTURE
|
||||
try:
|
||||
image = shape.image
|
||||
image_bytes = image.blob
|
||||
image_b64 = base64.b64encode(image_bytes).decode('utf-8')
|
||||
|
||||
# Create image part
|
||||
image_part = ContentPart(
|
||||
id=f"slide_{slide_index}_image_{len(parts)}",
|
||||
label=f"Slide {slide_index} Image",
|
||||
typeGroup="image",
|
||||
mimeType=image.content_type or "image/png",  # real type of the embedded image, PNG as fallback
|
||||
data=image_b64,
|
||||
metadata={
|
||||
"slide_number": slide_index,
|
||||
"shape_type": "image",
|
||||
"extracted_from": "powerpoint"
|
||||
}
|
||||
)
|
||||
parts.append(image_part)
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to extract image from slide {slide_index}: {str(e)}")
|
||||
|
||||
# Create slide content part
|
||||
if slide_content:
|
||||
slide_text = f"# Slide {slide_index}\n\n" + "\n\n".join(slide_content)
|
||||
|
||||
slide_part = ContentPart(
|
||||
id=f"slide_{slide_index}",
|
||||
label=f"Slide {slide_index} Content",
|
||||
typeGroup="structure",
|
||||
mimeType="text/plain",
|
||||
data=slide_text,
|
||||
metadata={
|
||||
"slide_number": slide_index,
|
||||
"content_type": "slide",
|
||||
"extracted_from": "powerpoint",
|
||||
"text_length": len(slide_text)
|
||||
}
|
||||
)
|
||||
parts.append(slide_part)
|
||||
|
||||
# Create presentation overview
|
||||
file_name = context.get("fileName", "presentation.pptx")
|
||||
overview_text = f"# PowerPoint Presentation: {file_name}\n\n"
|
||||
overview_text += f"**Total Slides:** {len(presentation.slides)}\n\n"
|
||||
overview_text += f"**Content Parts:** {len(parts)}\n\n"
|
||||
|
||||
# Add slide summaries
|
||||
for i, slide in enumerate(presentation.slides, 1):
|
||||
slide_text_parts = []
|
||||
for shape in slide.shapes:
|
||||
if hasattr(shape, "text") and shape.text.strip():
|
||||
slide_text_parts.append(shape.text.strip())
|
||||
|
||||
if slide_text_parts:
|
||||
overview_text += f"## Slide {i}\n"
|
||||
overview_text += "\n".join(slide_text_parts[:3]) # First 3 text elements
|
||||
overview_text += "\n\n"
|
||||
|
||||
# Create overview part
|
||||
overview_part = ContentPart(
|
||||
id="presentation_overview",
|
||||
label="Presentation Overview",
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data=overview_text,
|
||||
metadata={
|
||||
"content_type": "overview",
|
||||
"extracted_from": "powerpoint",
|
||||
"total_slides": len(presentation.slides),
|
||||
"text_length": len(overview_text)
|
||||
}
|
||||
)
|
||||
parts.insert(0, overview_part) # Insert at beginning
|
||||
|
||||
return parts
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error extracting PowerPoint content: {str(e)}")
|
||||
return [ContentPart(
|
||||
id="error",
|
||||
label="PowerPoint Extraction Error",
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data=f"Error extracting PowerPoint content: {str(e)}",
|
||||
metadata={"error": True, "error_message": str(e)}
|
||||
)]
|
||||
|
||||
def _table_to_markdown(self, table_data: List[List[str]]) -> str:
|
||||
"""Convert table data to markdown format."""
|
||||
if not table_data:
|
||||
return ""
|
||||
|
||||
markdown_lines = []
|
||||
|
||||
# Header row
|
||||
if table_data:
|
||||
header = "| " + " | ".join(table_data[0]) + " |"
|
||||
markdown_lines.append(header)
|
||||
|
||||
# Separator row
|
||||
separator = "| " + " | ".join(["---"] * len(table_data[0])) + " |"
|
||||
markdown_lines.append(separator)
|
||||
|
||||
# Data rows
|
||||
for row in table_data[1:]:
|
||||
data_row = "| " + " | ".join(row) + " |"
|
||||
markdown_lines.append(data_row)
|
||||
|
||||
return "\n".join(markdown_lines)
|
||||
|
||||
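`_table_to_markdown` above joins cell text with `|` as-is, so a cell that itself contains a pipe or a line break would break the rendered table. The following is a minimal sketch of one possible hardening, not the commit's implementation: `tableToMarkdown` is a hypothetical standalone function that escapes pipes, flattens line breaks to spaces, and pads short rows to the widest row.

```python
from typing import List


def tableToMarkdown(tableData: List[List[str]]) -> str:
    """Render rows as a Markdown table, escaping characters that break cells."""
    if not tableData:
        return ""
    width = max(len(row) for row in tableData)

    def formatRow(row: List[str]) -> str:
        # Escape pipes and replace line breaks so each row stays on one line.
        cells = [c.replace("|", "\\|").replace("\n", " ") for c in row]
        # Pad short rows so every row has the same number of columns.
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    lines = [formatRow(tableData[0])]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines.extend(formatRow(row) for row in tableData[1:])
    return "\n".join(lines)


if __name__ == "__main__":
    print(tableToMarkdown([["Key", "Value"], ["a|b", "line1\nline2"], ["short"]]))
```

The first row is still treated as the header, matching the behavior of the method above.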
|
|
@ -0,0 +1,58 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, List
|
||||
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from ..subUtils import makeId
|
||||
from ..subRegistry import Extractor
|
||||
|
||||
|
||||
class SqlExtractor(Extractor):
|
||||
"""
|
||||
Extractor for SQL files.
|
||||
|
||||
Supported formats:
|
||||
- MIME types: text/x-sql, application/sql
|
||||
- File extensions: .sql, .ddl, .dml, .dcl, .tcl
|
||||
- Special handling: Treats as structured text with SQL syntax
|
||||
"""
|
||||
|
||||
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
|
||||
return (mimeType in ("text/x-sql", "application/sql") or
|
||||
(fileName or "").lower().endswith((".sql", ".ddl", ".dml", ".dcl", ".tcl")))
|
||||
|
||||
def getSupportedExtensions(self) -> list[str]:
|
||||
"""Return list of supported file extensions."""
|
||||
return [".sql", ".ddl", ".dml", ".dcl", ".tcl"]
|
||||
|
||||
def getSupportedMimeTypes(self) -> list[str]:
|
||||
"""Return list of supported MIME types."""
|
||||
return ["text/x-sql", "application/sql"]
|
||||
|
||||
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
|
||||
fileName = context.get("fileName")
|
||||
mimeType = context.get("mimeType") or "text/x-sql"
|
||||
data = fileBytes.decode("utf-8", errors="replace")
|
||||
|
||||
# Add SQL-specific metadata; uppercase once for the case-insensitive keyword checks
|
||||
upperData = data.upper()
|
||||
metadata = {
|
||||
"size": len(fileBytes),
|
||||
"file_type": "sql",
|
||||
"line_count": len(data.splitlines()),
|
||||
"has_select": "SELECT" in upperData,
|
||||
"has_insert": "INSERT" in upperData,
|
||||
"has_update": "UPDATE" in upperData,
|
||||
"has_delete": "DELETE" in upperData,
|
||||
"has_create": "CREATE" in upperData,
|
||||
"has_drop": "DROP" in upperData
|
||||
}
|
||||
|
||||
return [ContentPart(
|
||||
id=makeId(),
|
||||
parentId=None,
|
||||
label="main",
|
||||
typeGroup="structure",
|
||||
mimeType=mimeType,
|
||||
data=data,
|
||||
metadata=metadata
|
||||
)]
|
||||
|
|
@ -0,0 +1,105 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, List
|
||||
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from ..subUtils import makeId
|
||||
from ..subRegistry import Extractor
|
||||
|
||||
|
||||
class TextExtractor(Extractor):
|
||||
"""
|
||||
Extractor for plain text files and code files.
|
||||
|
||||
Supported formats:
|
||||
- MIME types: text/plain, text/markdown, text/x-python, text/x-java-source, text/javascript, etc.
|
||||
- File extensions: common text, source-code, configuration, and data extensions such as .txt, .md, .log, .py, .js, .ts, .java, .yaml, .json, .xml, .csv (see getSupportedExtensions() for the full list)
|
||||
"""
|
||||
|
||||
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
|
||||
# Check MIME types
|
||||
if mimeType and (mimeType.startswith("text/") or mimeType in self.getSupportedMimeTypes()):
|
||||
return True
|
||||
|
||||
# Check file extensions
|
||||
if fileName:
|
||||
ext = fileName.lower()
|
||||
return ext.endswith((
|
||||
# Basic text files
|
||||
".txt", ".md", ".log", ".rtf", ".tex", ".rst", ".adoc", ".org", ".pod",
|
||||
# Programming languages
|
||||
".java", ".js", ".jsx", ".ts", ".tsx", ".py", ".rb", ".go", ".rs", ".cpp", ".c", ".h", ".hpp", ".cc", ".cxx",
|
||||
".cs", ".php", ".swift", ".kt", ".scala", ".clj", ".hs", ".ml", ".fs", ".vb", ".dart", ".r", ".m", ".pl", ".sh",
|
||||
# Web technologies
|
||||
".html", ".htm", ".css", ".scss", ".sass", ".less", ".vue", ".svelte",
|
||||
# Configuration files
|
||||
".config", ".ini", ".cfg", ".conf", ".properties", ".yaml", ".yml", ".toml", ".json", ".xml",
|
||||
# Scripts and automation
|
||||
".bat", ".ps1", ".psm1", ".psd1", ".vbs", ".wsf", ".cmd", ".com",
|
||||
# Data files
|
||||
".csv", ".tsv", ".tab", ".dat", ".data",
|
||||
# Documentation
|
||||
".man", ".1", ".2", ".3", ".4", ".5", ".6", ".7", ".8", ".9", ".n", ".l", ".t", ".x", ".y", ".z",
|
||||
# Other text formats
|
||||
".diff", ".patch", ".gitignore", ".dockerignore", ".editorconfig", ".gitattributes",
|
||||
".env", ".env.local", ".env.development", ".env.production", ".env.test",
|
||||
".lock", ".lockb", ".lockfile", ".pkg-lock", ".yarn-lock"
|
||||
))
|
||||
|
||||
return False
|
||||
|
||||
def getSupportedExtensions(self) -> list[str]:
|
||||
"""Return list of supported file extensions."""
|
||||
return [
|
||||
# Basic text files
|
||||
".txt", ".md", ".log", ".rtf", ".tex", ".rst", ".adoc", ".org", ".pod",
|
||||
# Programming languages
|
||||
".java", ".js", ".jsx", ".ts", ".tsx", ".py", ".rb", ".go", ".rs", ".cpp", ".c", ".h", ".hpp", ".cc", ".cxx",
|
||||
".cs", ".php", ".swift", ".kt", ".scala", ".clj", ".hs", ".ml", ".fs", ".vb", ".dart", ".r", ".m", ".pl", ".sh",
|
||||
# Web technologies
|
||||
".html", ".htm", ".css", ".scss", ".sass", ".less", ".vue", ".svelte",
|
||||
# Configuration files
|
||||
".config", ".ini", ".cfg", ".conf", ".properties", ".yaml", ".yml", ".toml", ".json", ".xml",
|
||||
# Scripts and automation
|
||||
".bat", ".ps1", ".psm1", ".psd1", ".vbs", ".wsf", ".cmd", ".com",
|
||||
# Data files
|
||||
".csv", ".tsv", ".tab", ".dat", ".data",
|
||||
# Documentation
|
||||
".man", ".1", ".2", ".3", ".4", ".5", ".6", ".7", ".8", ".9", ".n", ".l", ".t", ".x", ".y", ".z",
|
||||
# Other text formats
|
||||
".diff", ".patch", ".gitignore", ".dockerignore", ".editorconfig", ".gitattributes",
|
||||
".env", ".env.local", ".env.development", ".env.production", ".env.test",
|
||||
".lock", ".lockb", ".lockfile", ".pkg-lock", ".yarn-lock"
|
||||
]
|
||||
|
||||
def getSupportedMimeTypes(self) -> list[str]:
|
||||
"""Return list of supported MIME types."""
|
||||
return [
|
||||
"text/plain", "text/markdown", "text/x-python", "text/x-java-source",
|
||||
"text/javascript", "text/x-javascript", "text/typescript", "text/x-typescript",
|
||||
"text/x-c", "text/x-c++", "text/x-csharp", "text/x-php", "text/x-ruby",
|
||||
"text/x-go", "text/x-rust", "text/x-scala", "text/x-swift", "text/x-kotlin",
|
||||
"text/x-sql", "text/x-sh", "text/x-shellscript", "text/x-yaml", "text/x-toml",
|
||||
"text/x-ini", "text/x-config", "text/x-properties", "text/x-log",
|
||||
"text/html", "text/css", "text/x-scss", "text/x-sass", "text/x-less",
|
||||
"text/xml", "text/csv", "text/tab-separated-values", "text/rtf",
|
||||
"text/x-tex", "text/x-rst", "text/x-asciidoc", "text/x-org",
|
||||
"application/x-yaml", "application/x-toml", "application/x-ini",
|
||||
"application/x-config", "application/x-properties", "application/x-log"
|
||||
]
|
||||
|
||||
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
|
||||
fileName = context.get("fileName")
|
||||
mimeType = context.get("mimeType") or "text/plain"
|
||||
data = fileBytes.decode("utf-8", errors="replace")
|
||||
return [ContentPart(
|
||||
id=makeId(),
|
||||
parentId=None,
|
||||
label="main",
|
||||
typeGroup="text",
|
||||
mimeType=mimeType,
|
||||
data=data,
|
||||
metadata={"size": len(fileBytes)}
|
||||
)]
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,114 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, List
|
||||
import io
|
||||
from datetime import datetime
|
||||
|
||||
from ..subUtils import makeId
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from ..subRegistry import Extractor
|
||||
|
||||
|
||||
class XlsxExtractor(Extractor):
|
||||
"""
|
||||
Extractor for Microsoft Excel spreadsheets.
|
||||
|
||||
Supported formats:
|
||||
- MIME types: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
|
||||
- File extensions: .xlsx, .xlsm
|
||||
- Special handling: Extracts all sheets as CSV data
|
||||
- Dependencies: openpyxl
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
self._loaded = False
|
||||
self._haveLibs = False
|
||||
|
||||
def _load(self):
|
||||
if self._loaded:
|
||||
return
|
||||
self._loaded = True
|
||||
try:
|
||||
global openpyxl
|
||||
import openpyxl
|
||||
self._haveLibs = True
|
||||
except Exception:
|
||||
self._haveLibs = False
|
||||
|
||||
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
|
||||
mt = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
|
||||
return mimeType == mt or (fileName or "").lower().endswith((".xlsx", ".xlsm"))
|
||||
|
||||
def getSupportedExtensions(self) -> list[str]:
|
||||
"""Return list of supported file extensions."""
|
||||
return [".xlsx", ".xlsm"]
|
||||
|
||||
def getSupportedMimeTypes(self) -> list[str]:
|
||||
"""Return list of supported MIME types."""
|
||||
return ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"]
|
||||
|
||||
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
|
||||
self._load()
|
||||
parts: List[ContentPart] = []
|
||||
rootId = makeId()
|
||||
parts.append(ContentPart(
|
||||
id=rootId,
|
||||
parentId=None,
|
||||
label="xlsx",
|
||||
typeGroup="container",
|
||||
mimeType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
||||
data="",
|
||||
metadata={"size": len(fileBytes)}
|
||||
))
|
||||
|
||||
if not self._haveLibs:
|
||||
parts.append(ContentPart(
|
||||
id=makeId(),
|
||||
parentId=rootId,
|
||||
label="binary",
|
||||
typeGroup="binary",
|
||||
mimeType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
||||
data="",
|
||||
metadata={"size": len(fileBytes), "warning": "openpyxl not available"}
|
||||
))
|
||||
return parts
|
||||
|
||||
with io.BytesIO(fileBytes) as buf:
|
||||
wb = openpyxl.load_workbook(buf, data_only=True)
|
||||
for ws in wb.worksheets:  # worksheets only; chartsheets have no cell grid to extract
|
||||
sheetName = ws.title
|
||||
# extract rectangular data region by min/max
|
||||
min_row = ws.min_row
|
||||
max_row = ws.max_row
|
||||
min_col = ws.min_column
|
||||
max_col = ws.max_column
|
||||
lines: list[str] = []
|
||||
for r in range(min_row, max_row + 1):
|
||||
cells: list[str] = []
|
||||
for c in range(min_col, max_col + 1):
|
||||
cell = ws.cell(row=r, column=c)
|
||||
v = cell.value
|
||||
if v is None:
|
||||
cells.append("")
|
||||
elif isinstance(v, (int, float)):
|
||||
cells.append(str(v))
|
||||
elif isinstance(v, datetime):
|
||||
cells.append(v.strftime("%Y-%m-%d %H:%M:%S"))
|
||||
else:
|
||||
escaped_value = str(v).replace('"', '""')
|
||||
cells.append(f'"{escaped_value}"')
|
||||
lines.append(",".join(cells))
|
||||
csvData = "\n".join(lines)
|
||||
parts.append(ContentPart(
|
||||
id=makeId(),
|
||||
parentId=rootId,
|
||||
label=f"sheet_{sheetName}",
|
||||
typeGroup="table",
|
||||
mimeType="text/csv",
|
||||
data=csvData,
|
||||
metadata={"sheet": sheetName, "size": len(csvData.encode('utf-8'))}
|
||||
))
|
||||
|
||||
return parts
|
||||
|
||||
|
||||
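The inner loop of the XLSX extractor above turns each cell value into a CSV field: empty for `None`, plain text for numbers, a fixed timestamp format for `datetime`, and a quoted string for everything else. A minimal sketch of that per-cell rule is shown below; `cellToCsv` is a hypothetical helper name for illustration and mirrors the branches in the commit without adding new ones.

```python
from datetime import datetime
from typing import Any


def cellToCsv(value: Any) -> str:
    """Convert one spreadsheet cell value to a CSV field (hypothetical helper)."""
    if value is None:
        return ""
    if isinstance(value, (int, float)):
        # Note: bool is a subclass of int, so True/False land here as "True"/"False".
        return str(value)
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    # Everything else is stringified, quoted, and has embedded quotes doubled.
    escaped = str(value).replace('"', '""')
    return f'"{escaped}"'


if __name__ == "__main__":
    row = [None, 3, 1.5, datetime(2025, 1, 2, 3, 4, 5), 'say "hi"']
    print(",".join(cellToCsv(v) for v in row))
```

Numbers and timestamps are left unquoted because they cannot contain commas or quotes in these formats, which keeps the CSV easy to load as typed columns.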
|
|
@ -0,0 +1,49 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, List
|
||||
import xml.etree.ElementTree as ET
|
||||
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from ..subUtils import makeId
|
||||
from ..subRegistry import Extractor
|
||||
|
||||
|
||||
class XmlExtractor(Extractor):
|
||||
"""
|
||||
Extractor for XML files.
|
||||
|
||||
Supported formats:
|
||||
- MIME types: application/xml
|
||||
- File extensions: .xml, .rss, .atom
|
||||
- Special handling: Parses with ElementTree as a best-effort check; the raw XML text is returned unchanged
|
||||
"""
|
||||
|
||||
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
|
||||
return mimeType == "application/xml" or (fileName or "").lower().endswith((".xml", ".rss", ".atom"))
|
||||
|
||||
def getSupportedExtensions(self) -> list[str]:
|
||||
"""Return list of supported file extensions."""
|
||||
return [".xml", ".rss", ".atom"]
|
||||
|
||||
def getSupportedMimeTypes(self) -> list[str]:
|
||||
"""Return list of supported MIME types."""
|
||||
return ["application/xml"]
|
||||
|
||||
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> List[ContentPart]:
|
||||
mimeType = context.get("mimeType") or "application/xml"
|
||||
text = fileBytes.decode("utf-8", errors="replace")
|
||||
try:
|
||||
ET.fromstring(text)
|
||||
except Exception:
|
||||
pass
|
||||
return [ContentPart(
|
||||
id=makeId(),
|
||||
parentId=None,
|
||||
label="main",
|
||||
typeGroup="structure",
|
||||
mimeType=mimeType,
|
||||
data=text,
|
||||
metadata={"size": len(fileBytes)}
|
||||
)]
|
||||
|
||||
|
||||
File diff suppressed because it is too large
|
|
@ -0,0 +1,2 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
|
@ -0,0 +1,13 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, List
|
||||
from modules.datamodels.datamodelExtraction import ContentPart, MergeStrategy
|
||||
|
||||
|
||||
class DefaultMerger:
|
||||
def merge(self, parts: List[ContentPart], strategy: MergeStrategy) -> List[ContentPart]:
|
||||
"""
|
||||
Default merger that passes through parts unchanged.
|
||||
Used for image, binary, metadata, container typeGroups.
|
||||
"""
|
||||
return parts
|
||||
|
|
@ -0,0 +1,154 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, List
|
||||
from modules.datamodels.datamodelExtraction import ContentPart, MergeStrategy
|
||||
from ..subUtils import makeId
|
||||
|
||||
|
||||
class TableMerger:
|
||||
def merge(self, parts: List[ContentPart], strategy: MergeStrategy) -> List[ContentPart]:
|
||||
"""
|
||||
Merge table parts based on strategy.
|
||||
Strategy options:
|
||||
- groupBy: "parentId" (default), "documentId", "sheet", "none"
|
||||
- maxSize: maximum size per merged part
|
||||
- combineSheets: bool - whether to combine multiple sheets into one table
|
||||
"""
|
||||
if not parts:
|
||||
return parts
|
||||
|
||||
groupBy = strategy.groupBy
|
||||
maxSize = strategy.maxSize or 0
|
||||
combineSheets = strategy.tableMerge.get("combineSheets", False) if strategy.tableMerge else False
|
||||
|
||||
# Group parts
|
||||
groups = self._groupParts(parts, groupBy, combineSheets)
|
||||
|
||||
merged: List[ContentPart] = []
|
||||
for groupKey, groupParts in groups.items():
|
||||
if maxSize > 0:
|
||||
merged.extend(self._mergeWithSizeLimit(groupParts, maxSize, groupKey))
|
||||
else:
|
||||
merged.extend(self._mergeGroup(groupParts, groupKey))
|
||||
|
||||
return merged
|
||||
|
||||
def _groupParts(self, parts: List[ContentPart], groupBy: str, combineSheets: bool) -> Dict[str, List[ContentPart]]:
|
||||
groups: Dict[str, List[ContentPart]] = {}
|
||||
|
||||
for part in parts:
|
||||
if part.typeGroup != "table":
|
||||
# Non-table parts go in their own group
|
||||
key = f"nontable_{part.id}"
|
||||
if key not in groups:
|
||||
groups[key] = []
|
||||
groups[key].append(part)
|
||||
continue
|
||||
|
||||
if groupBy == "parentId":
|
||||
key = part.parentId or "root"
|
||||
elif groupBy == "documentId":
|
||||
key = part.metadata.get("documentId", "unknown")
|
||||
elif groupBy == "sheet" and not combineSheets:
|
||||
key = part.metadata.get("sheet", "unknown")
|
||||
else: # "none" or combineSheets=True
|
||||
key = "all_tables"
|
||||
|
||||
if key not in groups:
|
||||
groups[key] = []
|
||||
groups[key].append(part)
|
||||
|
||||
return groups
|
||||
|
||||
def _mergeGroup(self, parts: List[ContentPart], groupKey: str) -> List[ContentPart]:
|
||||
if not parts:
|
||||
return []
|
||||
if len(parts) == 1:
|
||||
return parts
|
||||
|
||||
# For tables, we typically keep them separate unless explicitly combining
|
||||
# But we can add metadata about the group
|
||||
for i, part in enumerate(parts):
|
||||
part.metadata["groupKey"] = groupKey
|
||||
part.metadata["groupIndex"] = i
|
||||
part.metadata["groupSize"] = len(parts)
|
||||
|
||||
return parts
|
||||
|
||||
def _mergeWithSizeLimit(self, parts: List[ContentPart], maxSize: int, groupKey: str) -> List[ContentPart]:
|
||||
if not parts:
|
||||
return []
|
||||
|
||||
# For tables, we typically don't merge across different tables
|
||||
# Instead, we chunk individual large tables
|
||||
merged: List[ContentPart] = []
|
||||
|
||||
for part in parts:
|
||||
partSize = part.metadata.get("size", 0)
|
||||
|
||||
if partSize <= maxSize:
|
||||
# Part fits within limit
|
||||
part.metadata["groupKey"] = groupKey
|
||||
merged.append(part)
|
||||
else:
|
||||
# Chunk the large table
|
||||
chunks = self._chunkTable(part, maxSize)
|
||||
merged.extend(chunks)
|
||||
|
||||
return merged
|
||||
|
||||
def _chunkTable(self, part: ContentPart, maxSize: int) -> List[ContentPart]:
|
||||
"""Chunk a large table by rows. Rows are kept whole; the header row is not repeated in later chunks."""
|
||||
lines = part.data.split('\n')
|
||||
if not lines:
|
||||
return [part]
|
||||
|
||||
chunks: List[ContentPart] = []
|
||||
currentChunk: List[str] = []
|
||||
currentSize = 0
|
||||
|
||||
for line in lines:
|
||||
lineSize = len(line.encode('utf-8')) + 1 # +1 for newline
|
||||
|
||||
if currentSize + lineSize > maxSize and currentChunk:
|
||||
# Flush current chunk
|
||||
chunkData = '\n'.join(currentChunk)
|
||||
chunks.append(ContentPart(
|
||||
id=makeId(),
|
||||
parentId=part.parentId,
|
||||
label=f"{part.label}_chunk_{len(chunks)}",
|
||||
typeGroup="table",
|
||||
mimeType=part.mimeType,
|
||||
data=chunkData,
|
||||
metadata={
|
||||
"size": len(chunkData.encode('utf-8')),
|
||||
"chunk": True,
|
||||
"originalPart": part.id,
|
||||
"chunkIndex": len(chunks)
|
||||
}
|
||||
))
|
||||
currentChunk = [line]
|
||||
currentSize = lineSize
|
||||
else:
|
||||
currentChunk.append(line)
|
||||
currentSize += lineSize
|
||||
|
||||
# Flush remaining chunk
|
||||
if currentChunk:
|
||||
chunkData = '\n'.join(currentChunk)
|
||||
chunks.append(ContentPart(
|
||||
id=makeId(),
|
||||
parentId=part.parentId,
|
||||
label=f"{part.label}_chunk_{len(chunks)}",
|
||||
typeGroup="table",
|
||||
mimeType=part.mimeType,
|
||||
data=chunkData,
|
||||
metadata={
|
||||
"size": len(chunkData.encode('utf-8')),
|
||||
"chunk": True,
|
||||
"originalPart": part.id,
|
||||
"chunkIndex": len(chunks)
|
||||
}
|
||||
))
|
||||
|
||||
return chunks
|
||||
|
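Note on `_chunkTable` above: it splits on newlines and flushes a chunk once the byte budget would be exceeded, so rows stay intact, but only the first chunk carries the header row. The sketch below is a variation that repeats the header in every chunk. It is a standalone illustration under stated assumptions: `chunk_csv_rows` is a hypothetical name, it works on plain strings rather than `ContentPart`, and it assumes the first line of the data is the header.

```python
from typing import List


def chunk_csv_rows(data: str, max_size: int) -> List[str]:
    """Split CSV text into chunks of whole rows, repeating the header line.

    Assumption: the first line is the header. Sizes are UTF-8 bytes, with
    one extra byte per line for the newline, as in the diff above.
    """
    lines = data.split("\n")
    header, rows = lines[0], lines[1:]
    header_size = len(header.encode("utf-8")) + 1

    chunks: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for row in rows:
        row_size = len(row.encode("utf-8")) + 1
        # Flush before the budget is exceeded, but never emit an empty chunk.
        if current and header_size + current_size + row_size > max_size:
            chunks.append(current)
            current, current_size = [], 0
        current.append(row)
        current_size += row_size
    if current:
        chunks.append(current)

    if not chunks:
        return [header]  # header-only (or empty) input
    return ["\n".join([header] + chunk) for chunk in chunks]
```

A single row larger than `max_size` still ends up alone in an oversized chunk, which matches the behaviour of the original loop.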
|
@ -0,0 +1,138 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, List
|
||||
from modules.datamodels.datamodelExtraction import ContentPart, MergeStrategy
|
||||
from ..subUtils import makeId
|
||||
|
||||
|
||||
class TextMerger:
|
||||
def merge(self, parts: List[ContentPart], strategy: MergeStrategy) -> List[ContentPart]:
|
||||
"""
|
||||
Merge text parts based on strategy.
|
||||
Strategy options:
|
||||
- groupBy: "parentId" (default), "documentId", "none"
|
||||
- orderBy: "label", "pageIndex", "sheetIndex", "none"
|
||||
- maxSize: maximum size per merged part
|
||||
"""
|
||||
if not parts:
|
||||
return parts
|
||||
|
||||
groupBy = strategy.groupBy
|
||||
orderBy = strategy.orderBy
|
||||
maxSize = strategy.maxSize or 0
|
||||
|
||||
# Group parts
|
||||
groups = self._groupParts(parts, groupBy)
|
||||
|
||||
merged: List[ContentPart] = []
|
||||
for groupKey, groupParts in groups.items():
|
||||
# Sort within group
|
||||
sortedParts = self._sortParts(groupParts, orderBy)
|
||||
|
||||
# Merge respecting maxSize
|
||||
if maxSize > 0:
|
||||
merged.extend(self._mergeWithSizeLimit(sortedParts, maxSize))
|
||||
else:
|
||||
merged.extend(self._mergeGroup(sortedParts, groupKey))
|
||||
|
||||
return merged
|
||||
|
||||
def _groupParts(self, parts: List[ContentPart], groupBy: str) -> Dict[str, List[ContentPart]]:
|
||||
groups: Dict[str, List[ContentPart]] = {}
|
||||
|
||||
for part in parts:
|
||||
if part.typeGroup != "text":
|
||||
# Non-text parts go in their own group
|
||||
key = f"nontext_{part.id}"
|
||||
if key not in groups:
|
||||
groups[key] = []
|
||||
groups[key].append(part)
|
||||
continue
|
||||
|
||||
if groupBy == "parentId":
|
||||
key = part.parentId or "root"
|
||||
elif groupBy == "documentId":
|
||||
key = part.metadata.get("documentId", "unknown")
|
||||
else: # "none"
|
||||
key = "all"
|
||||
|
||||
if key not in groups:
|
||||
groups[key] = []
|
||||
groups[key].append(part)
|
||||
|
||||
return groups
|
||||
|
||||
def _sortParts(self, parts: List[ContentPart], orderBy: str) -> List[ContentPart]:
|
||||
if orderBy == "pageIndex":
|
||||
return sorted(parts, key=lambda p: p.metadata.get("pageIndex", 0))
|
||||
elif orderBy == "sheetIndex":
|
||||
return sorted(parts, key=lambda p: p.metadata.get("sheetIndex", 0))
|
||||
elif orderBy == "label":
|
||||
return sorted(parts, key=lambda p: p.label)
|
||||
else: # "none"
|
||||
return parts
|
||||
|
||||
def _mergeGroup(self, parts: List[ContentPart], groupKey: str) -> List[ContentPart]:
|
||||
if not parts:
|
||||
return []
|
||||
if len(parts) == 1:
|
||||
return parts
|
||||
|
||||
# Merge all text parts in group
|
||||
textParts = [p for p in parts if p.typeGroup == "text"]
|
||||
nonTextParts = [p for p in parts if p.typeGroup != "text"]
|
||||
|
||||
if not textParts:
|
||||
return nonTextParts
|
||||
|
||||
# Combine text data
|
||||
combinedData = "\n".join([p.data for p in textParts])
|
||||
totalSize = sum(p.metadata.get("size", 0) for p in textParts)
|
||||
|
||||
mergedPart = ContentPart(
|
||||
id=makeId(),
|
||||
parentId=textParts[0].parentId,
|
||||
label=f"merged_{groupKey}",
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data=combinedData,
|
||||
metadata={
|
||||
"size": totalSize,
|
||||
"merged": len(textParts),
|
||||
"originalParts": [p.id for p in textParts]
|
||||
}
|
||||
)
|
||||
|
||||
return [mergedPart] + nonTextParts
|
||||
|
||||
def _mergeWithSizeLimit(self, parts: List[ContentPart], maxSize: int) -> List[ContentPart]:
|
||||
if not parts:
|
||||
return []
|
||||
|
||||
textParts = [p for p in parts if p.typeGroup == "text"]
|
||||
nonTextParts = [p for p in parts if p.typeGroup != "text"]
|
||||
|
||||
if not textParts:
|
||||
return nonTextParts
|
||||
|
||||
merged: List[ContentPart] = []
|
||||
currentGroup: List[ContentPart] = []
|
||||
currentSize = 0
|
||||
|
||||
for part in textParts:
|
||||
partSize = part.metadata.get("size", 0)
|
||||
|
||||
if currentSize + partSize > maxSize and currentGroup:
|
||||
# Flush current group
|
||||
merged.extend(self._mergeGroup(currentGroup, f"chunk_{len(merged)}"))
|
||||
currentGroup = [part]
|
||||
currentSize = partSize
|
||||
else:
|
||||
currentGroup.append(part)
|
||||
currentSize += partSize
|
||||
|
||||
# Flush remaining group
|
||||
if currentGroup:
|
||||
merged.extend(self._mergeGroup(currentGroup, f"chunk_{len(merged)}"))
|
||||
|
||||
return merged + nonTextParts
|
||||
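Note on `TextMerger._mergeWithSizeLimit` above: it walks the sorted text parts and closes the current group as soon as adding the next part would exceed `maxSize`. The sketch below isolates that greedy grouping on plain integers so the boundary behaviour is easy to check. `group_by_size` is a hypothetical name, not a function from this commit.

```python
from typing import List


def group_by_size(sizes: List[int], max_size: int) -> List[List[int]]:
    """Greedily group consecutive items so each group's total stays <= max_size.

    Returns groups of indices into `sizes`. An item larger than max_size
    ends up alone in its own group, because a group is never flushed empty.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        if current and current_size + size > max_size:
            groups.append(current)
            current, current_size = [index], size
        else:
            current.append(index)
            current_size += size
    if current:
        groups.append(current)
    return groups
```

Because the merger reads sizes with `part.metadata.get("size", 0)`, parts without a `size` entry count as zero and never trigger a flush; a caller that needs a hard limit may want to fall back to the encoded length of `part.data`.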
211
modules/serviceCenter/services/serviceExtraction/subMerger.py
Normal file
|
|
@ -0,0 +1,211 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Intelligent Token-Aware Merger for optimizing AI calls based on LLM token limits.
|
||||
"""
|
||||
from typing import List, Dict, Any
|
||||
import logging
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from .subUtils import makeId
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class IntelligentTokenAwareMerger:
|
||||
"""
|
||||
Intelligent merger that groups chunks based on LLM token limits to minimize AI calls.
|
||||
|
||||
Strategy:
|
||||
1. Calculate token count for each chunk
|
||||
2. Group chunks to maximize token usage without exceeding limits
|
||||
3. Preserve document structure and semantic boundaries
|
||||
4. Minimize total number of AI calls
|
||||
"""
|
||||
|
||||
def __init__(self, modelCapabilities: Dict[str, Any]):
|
||||
self.maxTokens = modelCapabilities.get("maxTokens", 4000)
|
||||
self.safetyMargin = modelCapabilities.get("safetyMargin", 0.1)
|
||||
self.effectiveMaxTokens = int(self.maxTokens * (1 - self.safetyMargin))
|
||||
self.charsPerToken = modelCapabilities.get("charsPerToken", 4) # Rough estimation
|
||||
|
||||
def mergeChunksIntelligently(self, chunks: List[ContentPart], prompt: str = "") -> List[ContentPart]:
|
||||
"""
|
||||
Merge chunks intelligently based on token limits.
|
||||
|
||||
Args:
|
||||
chunks: List of ContentPart chunks to merge
|
||||
prompt: AI prompt to account for in token calculation
|
||||
|
||||
Returns:
|
||||
List of optimally merged ContentPart objects
|
||||
"""
|
||||
if not chunks:
|
||||
return chunks
|
||||
|
||||
logger.info(f"🧠 Intelligent merging: {len(chunks)} chunks, maxTokens={self.effectiveMaxTokens}")
|
||||
|
||||
# Calculate tokens for prompt
|
||||
promptTokens = self._estimateTokens(prompt)
|
||||
availableTokens = self.effectiveMaxTokens - promptTokens
|
||||
|
||||
logger.info(f"📊 Prompt tokens: {promptTokens}, Available for content: {availableTokens}")
|
||||
|
||||
# Group chunks by document and type for semantic coherence
|
||||
groupedChunks = self._groupChunksByDocumentAndType(chunks)
|
||||
|
||||
mergedParts = []
|
||||
|
||||
for groupKey, groupChunks in groupedChunks.items():
|
||||
logger.info(f"📁 Processing group: {groupKey} ({len(groupChunks)} chunks)")
|
||||
|
||||
# Merge chunks within this group optimally
|
||||
groupMerged = self._mergeGroupOptimally(groupChunks, availableTokens)
|
||||
mergedParts.extend(groupMerged)
|
||||
|
||||
logger.info(f"✅ Intelligent merging complete: {len(chunks)} → {len(mergedParts)} parts")
|
||||
return mergedParts
|
||||
|
||||
def _groupChunksByDocumentAndType(self, chunks: List[ContentPart]) -> Dict[str, List[ContentPart]]:
|
||||
"""Group chunks by document and type for semantic coherence."""
|
||||
groups = {}
|
||||
|
||||
for chunk in chunks:
|
||||
# Create group key: document_id + type_group
|
||||
docId = chunk.metadata.get("documentId", "unknown")
|
||||
typeGroup = chunk.typeGroup
|
||||
groupKey = f"{docId}_{typeGroup}"
|
||||
|
||||
if groupKey not in groups:
|
||||
groups[groupKey] = []
|
||||
groups[groupKey].append(chunk)
|
||||
|
||||
return groups
|
||||
|
||||
def _mergeGroupOptimally(self, chunks: List[ContentPart], availableTokens: int) -> List[ContentPart]:
|
||||
"""Merge chunks within a group optimally to minimize AI calls."""
|
||||
if not chunks:
|
||||
return []
|
||||
|
||||
# Sort chunks by size (smallest first for better packing)
|
||||
sortedChunks = sorted(chunks, key=lambda c: self._estimateTokens(c.data))
|
||||
|
||||
mergedParts = []
|
||||
currentGroup = []
|
||||
currentTokens = 0
|
||||
|
||||
for chunk in sortedChunks:
|
||||
chunkTokens = self._estimateTokens(chunk.data)
|
||||
|
||||
# Special case: If single chunk is already at max size, process it alone
|
||||
if chunkTokens >= availableTokens * 0.9: # 90% of available tokens
|
||||
# Finalize current group if it exists
|
||||
if currentGroup:
|
||||
mergedPart = self._createMergedPart(currentGroup, currentTokens)
|
||||
mergedParts.append(mergedPart)
|
||||
currentGroup = []
|
||||
currentTokens = 0
|
||||
|
||||
# Process large chunk individually
|
||||
mergedParts.append(chunk)
|
||||
logger.debug(f"🔍 Large chunk processed individually: {chunkTokens} tokens")
|
||||
continue
|
||||
|
||||
# If adding this chunk would exceed limit, finalize current group
|
||||
if currentTokens + chunkTokens > availableTokens and currentGroup:
|
||||
mergedPart = self._createMergedPart(currentGroup, currentTokens)
|
||||
mergedParts.append(mergedPart)
|
||||
currentGroup = [chunk]
|
||||
currentTokens = chunkTokens
|
||||
else:
|
||||
currentGroup.append(chunk)
|
||||
currentTokens += chunkTokens
|
||||
|
||||
# Finalize remaining group
|
||||
if currentGroup:
|
||||
mergedPart = self._createMergedPart(currentGroup, currentTokens)
|
||||
mergedParts.append(mergedPart)
|
||||
|
||||
logger.info(f"📦 Group merged: {len(chunks)} → {len(mergedParts)} parts")
|
||||
return mergedParts
|
||||
|
||||
def _createMergedPart(self, chunks: List[ContentPart], totalTokens: int) -> ContentPart:
|
||||
"""Create a merged ContentPart from multiple chunks."""
|
||||
if len(chunks) == 1:
|
||||
return chunks[0] # No need to merge single chunk
|
||||
|
||||
# Combine data with semantic separators
|
||||
combinedData = self._combineChunkData(chunks)
|
||||
|
||||
# Use metadata from first chunk as base
|
||||
baseChunk = chunks[0]
|
||||
mergedMetadata = baseChunk.metadata.copy()
|
||||
mergedMetadata.update({
|
||||
"merged": True,
|
||||
"originalChunkCount": len(chunks),
|
||||
"totalTokens": totalTokens,
|
||||
"originalChunkIds": [c.id for c in chunks],
|
||||
"size": len(combinedData.encode('utf-8'))
|
||||
})
|
||||
|
||||
mergedPart = ContentPart(
|
||||
id=makeId(),
|
||||
parentId=baseChunk.parentId,
|
||||
label=f"merged_{len(chunks)}_chunks",
|
||||
typeGroup=baseChunk.typeGroup,
|
||||
mimeType=baseChunk.mimeType,
|
||||
data=combinedData,
|
||||
metadata=mergedMetadata
|
||||
)
|
||||
|
||||
logger.debug(f"🔗 Created merged part: {len(chunks)} chunks, {totalTokens} tokens")
|
||||
return mergedPart
|
||||
|
||||
def _combineChunkData(self, chunks: List[ContentPart]) -> str:
|
||||
"""Combine chunk data with appropriate separators."""
|
||||
if not chunks:
|
||||
return ""
|
||||
|
||||
# Use different separators based on content type
|
||||
if chunks[0].typeGroup == "text":
|
||||
separator = "\n\n---\n\n" # Clear text separation
|
||||
elif chunks[0].typeGroup == "table":
|
||||
separator = "\n\n[TABLE BREAK]\n\n" # Table separation
|
||||
else:
|
||||
separator = "\n\n---\n\n" # Default separation
|
||||
|
||||
return separator.join([chunk.data for chunk in chunks])
|
||||
|
||||
def _estimateTokens(self, text: str) -> int:
|
||||
"""Estimate token count for text."""
|
||||
if not text:
|
||||
return 0
|
||||
return len(text) // self.charsPerToken
|
||||
|
||||
def calculateOptimizationStats(self, originalChunks: List[ContentPart], mergedParts: List[ContentPart]) -> Dict[str, Any]:
|
||||
"""Calculate optimization statistics with detailed analysis."""
|
||||
originalCalls = len(originalChunks)
|
||||
optimizedCalls = len(mergedParts)
|
||||
reductionPercent = ((originalCalls - optimizedCalls) / originalCalls * 100) if originalCalls > 0 else 0
|
||||
|
||||
# Analyze chunk sizes
|
||||
largeChunks = [c for c in originalChunks if self._estimateTokens(c.data) >= self.effectiveMaxTokens * 0.9]
|
||||
smallChunks = [c for c in originalChunks if self._estimateTokens(c.data) < self.effectiveMaxTokens * 0.9]
|
||||
|
||||
# Calculate theoretical maximum optimization (if all small chunks could be merged)
|
||||
theoreticalMinCalls = len(largeChunks) + max(1, len(smallChunks) // 3) # Assume 3 small chunks per call
|
||||
theoreticalReduction = ((originalCalls - theoreticalMinCalls) / originalCalls * 100) if originalCalls > 0 else 0
|
||||
|
||||
return {
|
||||
"original_ai_calls": originalCalls,
|
||||
"optimized_ai_calls": optimizedCalls,
|
||||
"reduction_percent": round(reductionPercent, 1),
|
||||
"cost_savings": f"{reductionPercent:.1f}%",
|
||||
"efficiency_gain": f"{originalCalls / optimizedCalls:.1f}x" if optimizedCalls > 0 else "∞",
|
||||
"analysis": {
|
||||
"large_chunks": len(largeChunks),
|
||||
"small_chunks": len(smallChunks),
|
||||
"theoretical_min_calls": theoreticalMinCalls,
|
||||
"theoretical_reduction": round(theoreticalReduction, 1),
|
||||
"optimization_potential": "high" if reductionPercent > 50 else "moderate" if reductionPercent > 20 else "low"
|
||||
}
|
||||
}
|
||||
|
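Note on `IntelligentTokenAwareMerger` above: the token budget is `int(maxTokens * (1 - safetyMargin))` minus the estimated prompt tokens, and chunks are packed smallest first, with any chunk at or above 90% of the budget sent on its own. The sketch below reproduces that packing on bare token counts. `estimate_tokens` and `pack_token_counts` are hypothetical names for illustration only.

```python
from typing import List


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: character count divided by chars_per_token."""
    return len(text) // chars_per_token if text else 0


def pack_token_counts(token_counts: List[int], available: int) -> List[List[int]]:
    """Pack token counts into groups that fit `available`, smallest first."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for tokens in sorted(token_counts):
        if tokens >= available * 0.9:
            # Near-full chunk: close the open group and send this one alone.
            if current:
                groups.append(current)
                current, current_tokens = [], 0
            groups.append([tokens])
            continue
        if current and current_tokens + tokens > available:
            groups.append(current)
            current, current_tokens = [tokens], tokens
        else:
            current.append(tokens)
            current_tokens += tokens
    if current:
        groups.append(current)
    return groups
```

Each inner list corresponds to one AI call. Sorting by size improves packing but also means merged parts no longer follow the original chunk order, which may matter when the order of text within a document carries meaning.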
|
@ -0,0 +1,48 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import List
|
||||
import logging
|
||||
|
||||
from modules.datamodels.datamodelExtraction import ContentExtracted, ContentPart, ExtractionOptions, MergeStrategy
|
||||
from .subUtils import makeId
|
||||
from .subRegistry import ExtractorRegistry, ChunkerRegistry
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# REMOVED: _mergeParts function - unused, functionality replaced by applyMerging in interfaceAiObjects.py
|
||||
|
||||
|
||||
def runExtraction(extractorRegistry: ExtractorRegistry, chunkerRegistry: ChunkerRegistry, documentBytes: bytes, fileName: str, mimeType: str, options: ExtractionOptions) -> ContentExtracted:
|
||||
|
||||
extractor = extractorRegistry.resolve(mimeType, fileName)
|
||||
if extractor is None:
|
||||
# fallback: single binary part
|
||||
part = ContentPart(
|
||||
id=makeId(),
|
||||
parentId=None,
|
||||
label="file",
|
||||
typeGroup="binary",
|
||||
mimeType=mimeType or "application/octet-stream",
|
||||
data="",
|
||||
metadata={"warning": "No extractor registered"}
|
||||
)
|
||||
return ContentExtracted(id=makeId(), parts=[part])
|
||||
|
||||
parts = extractor.extract(documentBytes, {"fileName": fileName, "mimeType": mimeType})
|
||||
|
||||
# REMOVED: poolAndLimit(parts, chunkerRegistry, options)
|
||||
# REMOVED: Chunking logic - now handled in AI call phase
|
||||
|
||||
# Apply merging strategy if provided (preserve existing logic)
|
||||
if options.mergeStrategy:
|
||||
# Use module-level applyMerging function
|
||||
from .mainServiceExtraction import applyMerging
|
||||
parts = applyMerging(parts, options.mergeStrategy)
|
||||
|
||||
return ContentExtracted(id=makeId(), parts=parts)
|
||||
|
||||
|
||||
# REMOVED: poolAndLimit function - chunking now handled in AI call phase
|
||||
# REMOVED: applyMerging function - runExtraction now imports it from mainServiceExtraction
|
||||
|
||||
|
|
@ -0,0 +1,214 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Prompt builder for document extraction.
|
||||
This module builds prompts for extracting content from documents.
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
from typing import Dict, Any, Optional
|
||||
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
|
||||
|
||||
# Type hint for renderer parameter
|
||||
from typing import TYPE_CHECKING
|
||||
if TYPE_CHECKING:
|
||||
from modules.serviceCenter.services.serviceGeneration.renderers.documentRendererBaseTemplate import BaseRenderer
|
||||
_RendererLike = BaseRenderer
|
||||
else:
|
||||
_RendererLike = Any
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
async def buildExtractionPrompt(
|
||||
outputFormat: str,
|
||||
userPrompt: str,
|
||||
title: str,
|
||||
aiService=None,
|
||||
services=None,
|
||||
renderer: _RendererLike = None
|
||||
) -> str:
|
||||
"""
|
||||
Build unified extraction prompt for extracting content from documents.
|
||||
Always uses multi-file format (single doc = multi with n=1).
|
||||
|
||||
Args:
|
||||
outputFormat: Target output format
|
||||
userPrompt: User's prompt describing what to extract
|
||||
title: Document title
|
||||
aiService: Optional AI service for intent parsing
|
||||
services: Services instance
|
||||
renderer: Optional renderer for format-specific guidelines
|
||||
|
||||
Returns:
|
||||
Complete extraction prompt string
|
||||
"""
|
||||
|
||||
# Flat extraction format - returns extracted content as structured data, not documents/sections
|
||||
# This format allows merging multiple contentParts into one response
|
||||
json_example = {
|
||||
"extracted_content": {
|
||||
"text": "Extracted text content from the document...",
|
||||
"tables": [
|
||||
{
|
||||
"headers": ["Column 1", "Column 2"],
|
||||
"rows": [
|
||||
["Value 1", "Value 2"],
|
||||
["Value 3", "Value 4"]
|
||||
]
|
||||
}
|
||||
],
|
||||
"headings": [
|
||||
{
|
||||
"level": 1,
|
||||
"text": "Main Heading"
|
||||
},
|
||||
{
|
||||
"level": 2,
|
||||
"text": "Subheading"
|
||||
}
|
||||
],
|
||||
"lists": [
|
||||
{
|
||||
"type": "bullet",
|
||||
"items": ["Item 1", "Item 2", "Item 3"]
|
||||
}
|
||||
],
|
||||
"images": [
|
||||
{
|
||||
"description": "Description of image content, including all visible text, tables, and visual elements"
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
|
||||
structure_instruction = """CRITICAL EXTRACTION REQUIREMENTS:
|
||||
1. Extract content from the provided ContentPart(s) - process what is provided in this call
|
||||
2. If this ContentPart contains tables, extract them with proper structure (headers and rows)
|
||||
3. If this ContentPart contains text, extract it as structured text
|
||||
4. Return ONE JSON object with extracted content from this ContentPart
|
||||
5. Preserve all original data - do not summarize or interpret
|
||||
6. The system will merge results from multiple ContentParts automatically - focus on extracting this ContentPart's content accurately"""
|
||||
|
||||
# Parse extraction intent if AI service is available
|
||||
extraction_intent = await _parseExtractionIntent(userPrompt, outputFormat, aiService, services) if aiService else userPrompt
|
||||
|
||||
# Extract user language for document language instruction
|
||||
userLanguage = 'en' # Default fallback
|
||||
if services:
|
||||
try:
|
||||
# Prefer detected language if available
|
||||
if hasattr(services, 'currentUserLanguage') and services.currentUserLanguage:
|
||||
userLanguage = services.currentUserLanguage
|
||||
elif hasattr(services, 'user') and services.user and hasattr(services.user, 'language'):
|
||||
userLanguage = services.user.language
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Build base prompt with clear user prompt markers
|
||||
sanitized_user_prompt = services.utils.sanitizePromptContent(userPrompt, 'userinput') if services else userPrompt
|
||||
adaptive_prompt = f"""
|
||||
{'='*80}
|
||||
USER REQUEST / USER PROMPT:
|
||||
{'='*80}
|
||||
{sanitized_user_prompt}
|
||||
{'='*80}
|
||||
END OF USER REQUEST / USER PROMPT
|
||||
{'='*80}
|
||||
|
||||
You are a document processing assistant that extracts content from documents. Your task is to analyze the provided ContentPart(s) and extract their content into a structured JSON format.
|
||||
|
||||
TASK: Extract content from the provided ContentPart(s). Extract all tables, text, headings, lists, and other content types accurately. The system processes ContentParts individually and merges results automatically.
|
||||
|
||||
LANGUAGE REQUIREMENT: All extracted content must be in the language '{userLanguage}'. Extract and preserve content in this language.
|
||||
|
||||
{extraction_intent}
|
||||
|
||||
{structure_instruction}
|
||||
|
||||
OUTPUT FORMAT: Return only valid JSON in this exact structure:
|
||||
{json.dumps(json_example, indent=2)}
|
||||
|
||||
CRITICAL EXTRACTION RULES:
|
||||
- Extract only content that is ACTUALLY PRESENT in the ContentPart - never create fake or placeholder data
|
||||
- Return empty arrays [] or empty strings "" when content is missing - this is normal and expected
|
||||
- Extract all tables, text, headings, lists accurately with proper structure
|
||||
- Preserve all original data - do not summarize or interpret
|
||||
- Return ONE JSON object per ContentPart (the system merges multiple ContentParts automatically)
|
||||
|
||||
Content Types to Extract:
|
||||
1. Tables: Extract all rows and columns with proper headers
|
||||
2. Lists: Extract all items with proper nesting
|
||||
3. Headings: Extract with appropriate levels
|
||||
4. Paragraphs: Extract as structured text
|
||||
5. Code: Extract code blocks with language identification
|
||||
6. Images: Analyze images and describe all visible content including text, tables, logos, graphics, layout, and visual elements
|
||||
|
||||
Image Analysis Requirements:
|
||||
- If you cannot analyze an image for any reason, explain why in the JSON response
|
||||
- Describe everything you see in the image
|
||||
- Include all text content, tables, logos, graphics, layout, and visual elements
|
||||
- If the image is too small, corrupted, or unclear, explain this
|
||||
- Always provide feedback - never return empty responses
|
||||
|
||||
Return only the JSON structure with actual data from the documents. Do not include any text before or after the JSON.
|
||||
|
||||
Extract only actual content from the ContentPart. Return empty arrays/strings when content is missing - never create fake data.
|
||||
""".strip()
|
||||
|
||||
# Add renderer-specific guidelines if provided
|
||||
if renderer:
|
||||
try:
|
||||
if hasattr(renderer, 'getExtractionGuidelines'):
|
||||
formatGuidelines = renderer.getExtractionGuidelines()
|
||||
adaptive_prompt = f"{adaptive_prompt}\n\n{formatGuidelines}".strip()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Save extraction prompt to debug file - only if debug enabled
|
||||
from modules.shared.debugLogger import writeDebugFile
|
||||
writeDebugFile(adaptive_prompt, "extraction_prompt")
|
||||
|
||||
return adaptive_prompt
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
async def _parseExtractionIntent(userPrompt: str, outputFormat: str, aiService=None, services=None) -> str:
|
||||
"""
|
||||
Parse user prompt to extract the core extraction intent.
|
||||
"""
|
||||
if not aiService:
|
||||
return f"Extract content from the provided documents and create a {outputFormat} report."
|
||||
|
||||
try:
|
||||
analysis_prompt = f"""
|
||||
Analyze this user request and extract the core extraction intent:
|
||||
|
||||
User request: "{userPrompt}"
|
||||
Target format: {outputFormat}
|
||||
|
||||
Extract the main intent and requirements for document processing. Focus on:
|
||||
1. What content needs to be extracted
|
||||
2. How it should be organized
|
||||
3. Any specific requirements or preferences
|
||||
|
||||
Respond with a clear, concise statement of the extraction intent.
|
||||
"""
|
||||
request_options = AiCallOptions()
|
||||
request_options.operationType = OperationTypeEnum.DATA_GENERATE
|
||||
|
||||
request = AiCallRequest(prompt=analysis_prompt, context="", options=request_options)
|
||||
response = await aiService.aiObjects.call(request)
|
||||
|
||||
if response and response.content:
|
||||
return response.content.strip()
|
||||
else:
|
||||
return f"Extract content from the provided documents and create a {outputFormat} report."
|
||||
|
||||
except Exception as e:
|
||||
if services: services.utils.debugLogToFile(f"Extraction intent analysis failed: {str(e)}", "PROMPT_BUILDER")  # services defaults to None
|
||||
return f"Extract content from the provided documents and create a {outputFormat} report."
|
||||
|
||||
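Note on the prompt builder above: the prompt asks the model for one JSON object with an `extracted_content` key holding `text`, `tables`, `headings`, `lists` and `images`, and tells it to return empty strings or arrays when content is missing. The sketch below shows how a consumer might normalise such a response so every key is present with the expected type. It is an assumption about the consuming side, which this diff does not show; `parse_extraction_response` and `DEFAULTS` are hypothetical names.

```python
import json
from typing import Any, Dict

# Expected keys and their empty values, taken from json_example in the diff.
DEFAULTS: Dict[str, Any] = {
    "text": "",
    "tables": [],
    "headings": [],
    "lists": [],
    "images": [],
}


def parse_extraction_response(raw: str) -> Dict[str, Any]:
    """Parse a model response and fill in missing or mistyped fields."""
    payload = json.loads(raw)
    content = payload.get("extracted_content", {}) if isinstance(payload, dict) else {}
    if not isinstance(content, dict):
        content = {}
    result: Dict[str, Any] = {}
    for key, default in DEFAULTS.items():
        value = content.get(key)
        # Keep the value only if it has the expected type, else use a fresh empty one.
        result[key] = value if isinstance(value, type(default)) else type(default)()
    return result
```

`json.loads` raises `json.JSONDecodeError` on invalid input; whether to retry, log, or fall back in that case is left to the caller.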
208
modules/serviceCenter/services/serviceExtraction/subRegistry.py
Normal file
|
|
@ -0,0 +1,208 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
from typing import Any, Dict, Optional
|
||||
import logging
|
||||
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class Extractor:
|
||||
"""
|
||||
Base class for all document extractors.
|
||||
|
||||
Each extractor should implement:
|
||||
- detect(): Check if this extractor can handle the given file
|
||||
- extract(): Extract content from the file
|
||||
- getSupportedExtensions(): Return supported file extensions
|
||||
- getSupportedMimeTypes(): Return supported MIME types
|
||||
"""
|
||||
|
||||
def detect(self, fileName: str, mimeType: str, headBytes: bytes) -> bool:
|
||||
"""Check if this extractor can handle the given file."""
|
||||
return False
|
||||
|
||||
def extract(self, fileBytes: bytes, context: Dict[str, Any]) -> list[ContentPart]:
|
||||
"""Extract content from the file bytes."""
|
||||
raise NotImplementedError
|
||||
|
||||
def getSupportedExtensions(self) -> list[str]:
|
||||
"""Return list of supported file extensions (including dots)."""
|
||||
return []
|
||||
|
||||
def getSupportedMimeTypes(self) -> list[str]:
|
||||
"""Return list of supported MIME types."""
|
||||
return []
|
||||
|
||||
|
||||
class Chunker:
|
||||
def chunk(self, part: ContentPart, options: Dict[str, Any]) -> list[Dict[str, Any]]:
|
||||
return []
|
||||
|
||||
|
||||
class ExtractorRegistry:
|
||||
def __init__(self):
|
||||
self._map: Dict[str, Extractor] = {}
|
||||
self._fallback: Optional[Extractor] = None
|
||||
self._auto_discover_extractors()
|
||||
|
||||
def _auto_discover_extractors(self):
|
||||
"""Auto-discover and register all extractors from the extractors directory."""
|
||||
try:
|
||||
import os
|
||||
import importlib
|
||||
from pathlib import Path
|
||||
|
||||
# Get the extractors directory
|
||||
current_dir = Path(__file__).parent
|
||||
extractors_dir = current_dir / "extractors"
|
||||
|
||||
if not extractors_dir.exists():
|
||||
logger.error(f"Extractors directory not found: {extractors_dir}")
|
||||
return
|
||||
|
||||
# Import all extractor modules
|
||||
extractor_modules = []
|
||||
for file_path in extractors_dir.glob("extractor*.py"):
|
||||
if file_path.name == "__init__.py":
|
||||
continue
|
||||
|
||||
module_name = file_path.stem
|
||||
try:
|
||||
# Import the module
|
||||
module = importlib.import_module(f".{module_name}", package="modules.serviceCenter.services.serviceExtraction.extractors")
|
||||
|
||||
# Find all extractor classes in the module
|
||||
for attr_name in dir(module):
|
||||
attr = getattr(module, attr_name)
|
||||
if (isinstance(attr, type) and
|
||||
issubclass(attr, Extractor) and
|
||||
attr != Extractor and
|
||||
not attr_name.startswith('_')):
|
||||
|
||||
# Create instance and auto-register
|
||||
extractor_instance = attr()
|
||||
self._auto_register_extractor(extractor_instance)
|
||||
extractor_modules.append(attr_name)
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to import {module_name}: {str(e)}")
|
||||
continue
|
||||
|
||||
# Set fallback extractor
|
||||
try:
|
||||
from .extractors.extractorBinary import BinaryExtractor
|
||||
self.setFallback(BinaryExtractor())
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to set fallback extractor: {str(e)}")
|
||||
|
||||
logger.info(f"ExtractorRegistry: Auto-discovered and registered {len(extractor_modules)} extractor classes: {', '.join(extractor_modules)}")
|
||||
logger.info(f"ExtractorRegistry: Total registered formats: {len(self._map)}")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"ExtractorRegistry: Failed to auto-discover extractors: {str(e)}")
|
||||
import traceback
|
||||
logger.debug(traceback.format_exc())
|
||||
|
||||
def _auto_register_extractor(self, extractor: Extractor):
|
||||
"""Auto-register an extractor based on its declared supported formats."""
|
||||
try:
|
||||
# Register MIME types
|
||||
mime_types = extractor.getSupportedMimeTypes()
|
||||
for mime_type in mime_types:
|
||||
self.register(mime_type, extractor)
|
||||
|
||||
# Register file extensions
|
||||
extensions = extractor.getSupportedExtensions()
|
||||
for ext in extensions:
|
||||
# Remove leading dot and lowercase the key, since resolve() lowercases the file name
|
||||
ext_key = ext.lstrip('.').lower()
|
||||
self.register(ext_key, extractor)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to auto-register {extractor.__class__.__name__}: {str(e)}")
|
||||
|
||||
def register(self, key: str, extractor: Extractor):
|
||||
self._map[key] = extractor
|
||||
|
||||
def setFallback(self, extractor: Extractor):
|
||||
self._fallback = extractor
|
||||
|
||||
def resolve(self, mimeType: str, fileName: str) -> Optional[Extractor]:
|
||||
if mimeType in self._map:
|
||||
return self._map[mimeType]
|
||||
# simple extension fallback
|
||||
if "." in fileName:
|
||||
ext = fileName.lower().rsplit(".", 1)[-1]
|
||||
if ext in self._map:
|
||||
return self._map[ext]
|
||||
return self._fallback
|
||||
|
||||
def getAllSupportedFormats(self) -> Dict[str, Dict[str, list[str]]]:
|
||||
"""
|
||||
Get all supported formats from all registered extractors.
|
||||
|
||||
Returns:
|
||||
Dictionary with format information, keyed by registry key
|
||||
(MIME type or extension without the dot), plus "fallback":
|
||||
{
|
||||
"extensions": {
|
||||
"registry_key": [".ext1", ".ext2", ...]
|
||||
},
|
||||
"mime_types": {
|
||||
"registry_key": ["mime/type1", "mime/type2", ...]
|
||||
}
|
||||
}
|
||||
"""
|
||||
formats = {"extensions": {}, "mime_types": {}}
|
||||
|
||||
# Get formats from registered extractors
|
||||
for key, extractor in self._map.items():
|
||||
if hasattr(extractor, 'getSupportedExtensions'):
|
||||
extensions = extractor.getSupportedExtensions()
|
||||
if extensions:
|
||||
formats["extensions"][key] = extensions
|
||||
|
||||
if hasattr(extractor, 'getSupportedMimeTypes'):
|
||||
mime_types = extractor.getSupportedMimeTypes()
|
||||
if mime_types:
|
||||
formats["mime_types"][key] = mime_types
|
||||
|
||||
# Add fallback extractor info
|
||||
if self._fallback and hasattr(self._fallback, 'getSupportedExtensions'):
|
||||
formats["extensions"]["fallback"] = self._fallback.getSupportedExtensions()
|
||||
if self._fallback and hasattr(self._fallback, 'getSupportedMimeTypes'):
|
||||
formats["mime_types"]["fallback"] = self._fallback.getSupportedMimeTypes()
|
||||
|
||||
return formats
|
||||
|
||||
|
||||
class ChunkerRegistry:
|
||||
def __init__(self):
|
||||
self._map: Dict[str, Chunker] = {}
|
||||
self._noop = Chunker()
|
||||
# Register default chunkers
|
||||
try:
|
||||
from .chunking.chunkerText import TextChunker
|
||||
from .chunking.chunkerTable import TableChunker
|
||||
from .chunking.chunkerStructure import StructureChunker
|
||||
from .chunking.chunkerImage import ImageChunker
|
||||
self.register("text", TextChunker())
|
||||
self.register("table", TableChunker())
|
||||
self.register("structure", StructureChunker())
|
||||
self.register("image", ImageChunker())
|
||||
# Use text chunker for container and binary content
|
||||
self.register("container", TextChunker())
|
||||
self.register("binary", TextChunker())
|
||||
except Exception as e:
|
||||
logger.error(f"ChunkerRegistry: Failed to register chunkers: {str(e)}")
|
||||
import traceback
|
||||
logger.debug(traceback.format_exc())
|
||||
|
||||
def register(self, typeGroup: str, chunker: Chunker):
|
||||
self._map[typeGroup] = chunker
|
||||
|
||||
def resolve(self, typeGroup: str) -> Chunker:
|
||||
return self._map.get(typeGroup, self._noop)
|
||||
|
||||
|
||||
|
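The lookup order used by `ExtractorRegistry.resolve` above (exact MIME type first, then the lowercased file extension, then the fallback) can be sketched in isolation. This is a minimal sketch, not the module's code: `MiniRegistry` is a hypothetical stand-in, and plain strings take the place of extractor instances.

```python
from typing import Dict, Optional


class MiniRegistry:
    """Hypothetical stand-in for ExtractorRegistry; values are plain strings."""

    def __init__(self) -> None:
        self._map: Dict[str, str] = {}
        self._fallback: Optional[str] = None

    def register(self, key: str, extractor: str) -> None:
        self._map[key] = extractor

    def setFallback(self, extractor: str) -> None:
        self._fallback = extractor

    def resolve(self, mimeType: str, fileName: str) -> Optional[str]:
        # 1. An exact MIME type match wins.
        if mimeType in self._map:
            return self._map[mimeType]
        # 2. Otherwise try the lowercased extension after the last dot.
        if "." in fileName:
            ext = fileName.lower().rsplit(".", 1)[-1]
            if ext in self._map:
                return self._map[ext]
        # 3. Otherwise use the fallback (None if none was set).
        return self._fallback


registry = MiniRegistry()
registry.register("application/pdf", "pdf-extractor")
registry.register("csv", "csv-extractor")
registry.setFallback("binary-extractor")
```

Because the extension is lowercased only at lookup time, keys registered with uppercase letters would never match; registering lowercase keys avoids that.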
|
@ -0,0 +1,7 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
import uuid
|
||||
|
||||
|
||||
def makeId() -> str:
|
||||
return str(uuid.uuid4())
|
||||
|
|
@ -0,0 +1,7 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""Generation service."""
|
||||
|
||||
from .mainServiceGeneration import GenerationService
|
||||
|
||||
__all__ = ["GenerationService"]
|
||||
|
|
@ -0,0 +1,616 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
import logging
|
||||
import uuid
|
||||
import base64
|
||||
import traceback
|
||||
from typing import Any, Dict, List, Optional, Callable
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from modules.datamodels.datamodelChat import ChatDocument
|
||||
from .subDocumentUtility import (
|
||||
getFileExtension,
|
||||
getMimeTypeFromExtension,
|
||||
detectMimeTypeFromContent,
|
||||
detectMimeTypeFromData,
|
||||
convertDocumentDataToString
|
||||
)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class _ServicesAdapter:
|
||||
"""Adapter providing Services-like interface from (context, get_service)."""
|
||||
|
||||
def __init__(self, context, get_service: Callable[[str], Any]):
|
||||
self._context = context
|
||||
self._get_service = get_service
|
||||
self.user = context.user
|
||||
self.mandateId = context.mandate_id
|
||||
self.featureInstanceId = context.feature_instance_id
|
||||
self.workflow = context.workflow
|
||||
chat = get_service("chat")
|
||||
self.interfaceDbComponent = chat.interfaceDbComponent
|
||||
self.interfaceDbChat = chat.interfaceDbChat
|
||||
|
||||
@property
|
||||
def chat(self):
|
||||
return self._get_service("chat")
|
||||
|
||||
@property
|
||||
def utils(self):
|
||||
return self._get_service("utils")
|
||||
|
||||
@property
|
||||
def ai(self):
|
||||
return self._get_service("ai")
|
||||
|
||||
|
||||
class GenerationService:
|
||||
def __init__(self, context, get_service: Callable[[str], Any]):
|
||||
"""Initialize with ServiceCenterContext and service resolver."""
|
||||
self.services = _ServicesAdapter(context, get_service)
|
||||
self._get_service = get_service
|
||||
self.interfaceDbComponent = self.services.interfaceDbComponent
|
||||
self.interfaceDbChat = self.services.interfaceDbChat
|
||||
|
||||
def processActionResultDocuments(self, actionResult, action) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Process documents produced by AI actions and convert them to ChatDocument format.
|
||||
This function handles AI-generated document data, not document references.
|
||||
Returns a list of processed document dictionaries.
|
||||
"""
|
||||
try:
|
||||
# Read documents from the standard documents field (not data.documents)
|
||||
documents = actionResult.documents if actionResult and hasattr(actionResult, 'documents') else []
|
||||
|
||||
if not documents:
|
||||
return []
|
||||
|
||||
# Process each document from the AI action result
|
||||
processedDocuments = []
|
||||
for doc in documents:
|
||||
processedDoc = self.processSingleDocument(doc, action)
|
||||
if processedDoc:
|
||||
processedDocuments.append(processedDoc)
|
||||
|
||||
return processedDocuments
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing action result documents: {str(e)}")
|
||||
return []
|
||||
|
||||
def processSingleDocument(self, doc: Any, action) -> Optional[Dict[str, Any]]:
|
||||
"""Process a single document from action result with simplified logic"""
|
||||
try:
|
||||
# ActionDocument objects have documentName, documentData, and mimeType
|
||||
mime_type = doc.mimeType
|
||||
if mime_type == "application/octet-stream":
|
||||
content = doc.documentData
|
||||
# Detect MIME without relying on a service center
|
||||
mime_type = detectMimeTypeFromContent(content, doc.documentName)
|
||||
|
||||
# IMPORTANT: For ActionDocuments with validationMetadata (e.g. context.extractContent),
|
||||
# we must serialize the entire ActionDocument, not just documentData
|
||||
document_data = doc.documentData
|
||||
if hasattr(doc, 'validationMetadata') and doc.validationMetadata:
|
||||
# If validationMetadata is present, serialize the entire ActionDocument format
|
||||
if mime_type == "application/json":
|
||||
# Build the ActionDocument format with validationMetadata and documentData
|
||||
if hasattr(document_data, 'model_dump'):
|
||||
# Pydantic v2
|
||||
document_data_dict = document_data.model_dump()
|
||||
elif hasattr(document_data, 'dict'):
|
||||
# Pydantic v1
|
||||
document_data_dict = document_data.dict()
|
||||
elif isinstance(document_data, dict):
|
||||
document_data_dict = document_data
|
||||
elif isinstance(document_data, str):
|
||||
# JSON string: parse and store as dict (e.g. from outlook.composeAndDraftEmailWithContext)
|
||||
import json
|
||||
try:
|
||||
document_data_dict = json.loads(document_data)
|
||||
except json.JSONDecodeError:
|
||||
# Not valid JSON - store as plain text
|
||||
document_data_dict = {"data": document_data}
|
||||
else:
|
||||
document_data_dict = {"data": str(document_data)}
|
||||
|
||||
# Build the ActionDocument format
|
||||
document_data = {
|
||||
"validationMetadata": doc.validationMetadata,
|
||||
"documentData": document_data_dict
|
||||
}
|
||||
|
||||
return {
|
||||
'fileName': doc.documentName,
|
||||
'fileSize': len(str(document_data)),
|
||||
'mimeType': mime_type,
|
||||
'content': document_data,
|
||||
'document': doc
|
||||
}
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing single document: {str(e)}")
|
||||
return None
|
||||
|
||||
def createDocumentsFromActionResult(self, actionResult, action, workflow, message_id=None) -> List[Any]:
|
||||
"""
|
||||
Create actual document objects from action result and store them in the system.
|
||||
Returns a list of created document objects with proper workflow context.
|
||||
"""
|
||||
try:
|
||||
processed_docs = self.processActionResultDocuments(actionResult, action)
|
||||
|
||||
createdDocuments = []
|
||||
for i, doc_data in enumerate(processed_docs):
|
||||
try:
|
||||
documentName = doc_data['fileName']
|
||||
documentData = doc_data['content']
|
||||
mimeType = doc_data['mimeType']
|
||||
|
||||
# Handle binary data (images, PDFs, Office docs) differently from text
|
||||
# Check if this is a binary MIME type
|
||||
binaryMimeTypes = {
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
||||
"application/vnd.openxmlformats-officedocument.presentationml.presentation",
|
||||
"application/pdf",
|
||||
"image/png", "image/jpeg", "image/jpg", "image/gif", "image/webp", "image/bmp", "image/svg+xml",
|
||||
}
|
||||
|
||||
isBinaryMimeType = mimeType in binaryMimeTypes
|
||||
base64encoded = False
|
||||
content = None
|
||||
|
||||
if isBinaryMimeType:
|
||||
# For binary data, handle bytes vs base64 string vs regular string
|
||||
if isinstance(documentData, bytes):
|
||||
# Already bytes - encode to base64 string for storage
|
||||
# base64 is already imported at module level
|
||||
content = base64.b64encode(documentData).decode('utf-8')
|
||||
base64encoded = True
|
||||
elif isinstance(documentData, str):
|
||||
# Check if it's already valid base64
|
||||
# base64 is already imported at module level
|
||||
try:
|
||||
# Try to decode to verify it's base64
|
||||
base64.b64decode(documentData, validate=True)
|
||||
# Valid base64 - use as is
|
||||
content = documentData
|
||||
base64encoded = True
|
||||
except Exception:
|
||||
# Not valid base64 - might be raw string, try encoding
|
||||
try:
|
||||
content = base64.b64encode(documentData.encode('utf-8')).decode('utf-8')
|
||||
base64encoded = True
|
||||
except Exception:
|
||||
logger.warning(f"Could not process binary data for {documentName}, skipping")
|
||||
continue
|
||||
else:
|
||||
# Other types - convert to string then base64
|
||||
# base64 is already imported at module level
|
||||
try:
|
||||
content = base64.b64encode(str(documentData).encode('utf-8')).decode('utf-8')
|
||||
base64encoded = True
|
||||
except Exception:
|
||||
logger.warning(f"Could not encode binary data for {documentName}, skipping")
|
||||
continue
|
||||
else:
|
||||
# Text data - convert to string
|
||||
content = convertDocumentDataToString(documentData, getFileExtension(documentName))
|
||||
|
||||
# Skip empty or minimal content
|
||||
minimalContentPatterns = ['{}', '[]', 'null', '""', "''"]
|
||||
if not content or content.strip() == "" or content.strip() in minimalContentPatterns:
|
||||
logger.warning(f"Empty or minimal content for document {documentName}, skipping")
|
||||
continue
|
||||
|
||||
# Normalize file extension based on mime type if missing or incorrect
|
||||
try:
|
||||
mime_to_ext = {
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
|
||||
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
|
||||
"application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
|
||||
"application/pdf": ".pdf",
|
||||
"text/html": ".html",
|
||||
"text/markdown": ".md",
|
||||
"text/plain": ".txt",
|
||||
"application/json": ".json",
|
||||
"image/png": ".png",
|
||||
"image/jpeg": ".jpg",
|
||||
"image/jpg": ".jpg",
|
||||
"image/gif": ".gif",
|
||||
"image/webp": ".webp",
|
||||
"image/bmp": ".bmp",
|
||||
"image/svg+xml": ".svg",
|
||||
}
|
||||
expectedExt = mime_to_ext.get(mimeType)
|
||||
if expectedExt:
|
||||
if not documentName.lower().endswith(expectedExt):
|
||||
# Append/replace extension to match mime type
|
||||
if "." in documentName:
|
||||
documentName = documentName.rsplit(".", 1)[0] + expectedExt
|
||||
else:
|
||||
documentName = documentName + expectedExt
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Create document with file in one step using interfaces directly
|
||||
document = self._createDocument(
|
||||
fileName=documentName,
|
||||
mimeType=mimeType,
|
||||
content=content,
|
||||
base64encoded=base64encoded,
|
||||
messageId=message_id
|
||||
)
|
||||
if document:
|
||||
# Set workflow context on the document if possible
|
||||
self._setDocumentWorkflowContext(document, action, workflow)
|
||||
createdDocuments.append(document)
|
||||
else:
|
||||
logger.error(f"Failed to create ChatDocument object for {documentName}")
|
||||
except Exception as e:
|
||||
logger.error(f"Error creating document {doc_data.get('fileName', 'unknown')}: {str(e)}")
|
||||
continue
|
||||
|
||||
return createdDocuments
|
||||
except Exception as e:
|
||||
logger.error(f"Error creating documents from action result: {str(e)}")
|
||||
return []
|
||||
|
||||
def _setDocumentWorkflowContext(self, document, action, workflow):
|
||||
"""Set workflow context on a document for proper routing and labeling"""
|
||||
try:
|
||||
# Get current workflow context directly from workflow object
|
||||
workflowContext = self._getWorkflowContext(workflow)
|
||||
workflowStats = self._getWorkflowStats(workflow)
|
||||
|
||||
currentRound = workflowContext.get('currentRound', 0)
|
||||
currentTask = workflowContext.get('currentTask', 0)
|
||||
currentAction = workflowContext.get('currentAction', 0)
|
||||
|
||||
# Try to set workflow context attributes if they exist
|
||||
if hasattr(document, 'roundNumber'):
|
||||
document.roundNumber = currentRound
|
||||
if hasattr(document, 'taskNumber'):
|
||||
document.taskNumber = currentTask
|
||||
if hasattr(document, 'actionNumber'):
|
||||
document.actionNumber = currentAction
|
||||
if hasattr(document, 'actionId'):
|
||||
document.actionId = action.id if hasattr(action, 'id') else None
|
||||
|
||||
# Set additional workflow metadata if available
|
||||
if hasattr(document, 'workflowId'):
|
||||
document.workflowId = workflowStats.get('workflowId', workflow.id if hasattr(workflow, 'id') else None)
|
||||
if hasattr(document, 'workflowStatus'):
|
||||
document.workflowStatus = workflowStats.get('workflowStatus', workflow.status if hasattr(workflow, 'status') else 'unknown')
|
||||
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not set workflow context on document: {str(e)}")
|
||||
|
||||
def _createDocument(self, fileName: str, mimeType: str, content: str, base64encoded: bool = True, messageId: Optional[str] = None) -> Optional[ChatDocument]:
|
||||
"""Create file and ChatDocument using interfaces without service indirection."""
|
||||
try:
|
||||
if not self.interfaceDbComponent:
|
||||
logger.error("Component interface not available for document creation")
|
||||
return None
|
||||
# Convert content to bytes
|
||||
if base64encoded:
|
||||
# base64 is already imported at module level
|
||||
content_bytes = base64.b64decode(content)
|
||||
else:
|
||||
content_bytes = content.encode('utf-8')
|
||||
# Create file and store data
|
||||
file_item = self.interfaceDbComponent.createFile(
|
||||
name=fileName,
|
||||
mimeType=mimeType,
|
||||
content=content_bytes
|
||||
)
|
||||
self.interfaceDbComponent.createFileData(file_item.id, content_bytes)
|
||||
# Collect file info
|
||||
file_info = self._getFileInfo(file_item.id)
|
||||
if not file_info:
|
||||
logger.error(f"Could not get file info for fileId: {file_item.id}")
|
||||
return None
|
||||
# Build ChatDocument
|
||||
document = ChatDocument(
|
||||
id=str(uuid.uuid4()),
|
||||
messageId=messageId or "",
|
||||
fileId=file_item.id,
|
||||
fileName=file_info.get("fileName", fileName),
|
||||
fileSize=file_info.get("size", 0),
|
||||
mimeType=file_info.get("mimeType", mimeType)
|
||||
)
|
||||
# Ensure document can access component interface later
|
||||
if hasattr(document, 'setComponentInterface') and self.interfaceDbComponent:
|
||||
try:
|
||||
document.setComponentInterface(self.interfaceDbComponent)
|
||||
except Exception:
|
||||
pass
|
||||
return document
|
||||
except Exception as e:
|
||||
logger.error(f"Error creating document: {str(e)}")
|
||||
return None
|
||||
|
||||
def _getFileInfo(self, fileId: str) -> Optional[Dict[str, Any]]:
|
||||
try:
|
||||
if not self.interfaceDbComponent:
|
||||
return None
|
||||
file_item = self.interfaceDbComponent.getFile(fileId)
|
||||
if file_item:
|
||||
return {
|
||||
"id": file_item.id,
|
||||
"fileName": file_item.fileName,
|
||||
"size": file_item.fileSize,
|
||||
"mimeType": file_item.mimeType,
|
||||
"fileHash": getattr(file_item, 'fileHash', None),
|
||||
"creationDate": getattr(file_item, 'creationDate', None)
|
||||
}
|
||||
return None
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting file info for {fileId}: {str(e)}")
|
||||
return None
|
||||
|
||||
def _getWorkflowContext(self, workflow) -> Dict[str, int]:
|
||||
try:
|
||||
return {
|
||||
'currentRound': getattr(workflow, 'currentRound', 0),
|
||||
'currentTask': getattr(workflow, 'currentTask', 0),
|
||||
'currentAction': getattr(workflow, 'currentAction', 0)
|
||||
}
|
||||
except Exception:
|
||||
return {'currentRound': 0, 'currentTask': 0, 'currentAction': 0}
|
||||
|
||||
def _getWorkflowStats(self, workflow) -> Dict[str, Any]:
|
||||
try:
|
||||
context = self._getWorkflowContext(workflow)
|
||||
return {
|
||||
'currentRound': context['currentRound'],
|
||||
'currentTask': context['currentTask'],
|
||||
'currentAction': context['currentAction'],
|
||||
'totalTasks': getattr(workflow, 'totalTasks', 0),
|
||||
'totalActions': getattr(workflow, 'totalActions', 0),
|
||||
'workflowStatus': getattr(workflow, 'status', 'unknown'),
|
||||
'workflowId': getattr(workflow, 'id', 'unknown')
|
||||
}
|
||||
except Exception:
|
||||
return {
|
||||
'currentRound': 0,
|
||||
'currentTask': 0,
|
||||
'currentAction': 0,
|
||||
'totalTasks': 0,
|
||||
'totalActions': 0,
|
||||
'workflowStatus': 'unknown',
|
||||
'workflowId': 'unknown'
|
||||
}
|
||||
|
||||
async def renderReport(self, extractedContent: Dict[str, Any], outputFormat: str, language: str, title: str, userPrompt: Optional[str] = None, aiService=None, parentOperationId: Optional[str] = None) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render extracted JSON content to the specified output format.
|
||||
Processes EACH document separately and calls renderer for each.
|
||||
Each renderer can return 1..n documents (e.g., HTML + images).
|
||||
|
||||
Per-document format and language are extracted from structure (validated in State 3).
|
||||
Multiple documents can have different formats and languages.
|
||||
|
||||
Args:
|
||||
extractedContent: Structured JSON document with documents array
|
||||
outputFormat: Target format (html, pdf, docx, txt, md, json, csv, xlsx) - Global fallback
|
||||
language: Language (global fallback) - Per-document language extracted from structure
|
||||
title: Report title
|
||||
userPrompt: User's original prompt for report generation
|
||||
aiService: AI service instance for generation prompt creation
|
||||
parentOperationId: Optional parent operation ID for hierarchical logging
|
||||
|
||||
Returns:
|
||||
List of RenderedDocument objects.
|
||||
Each RenderedDocument represents one rendered file (main document or supporting file)
|
||||
"""
|
||||
try:
|
||||
# Validate JSON input
|
||||
if not isinstance(extractedContent, dict):
|
||||
raise ValueError("extractedContent must be a JSON dictionary")
|
||||
|
||||
# Unified approach: Always expect "documents" array
|
||||
if "documents" not in extractedContent:
|
||||
raise ValueError("extractedContent must contain 'documents' array")
|
||||
|
||||
documents = extractedContent["documents"]
|
||||
if len(documents) == 0:
|
||||
raise ValueError("No documents found in 'documents' array")
|
||||
|
||||
metadata = extractedContent.get("metadata", {})
|
||||
allRenderedDocuments = []
|
||||
|
||||
# Process EACH document separately
|
||||
for docIndex, doc in enumerate(documents):
|
||||
if not isinstance(doc, dict):
|
||||
logger.warning(f"Skipping invalid document at index {docIndex}")
|
||||
continue
|
||||
|
||||
if "sections" not in doc:
|
||||
logger.warning(f"Document {doc.get('id', docIndex)} has no sections, skipping")
|
||||
continue
|
||||
|
||||
# Determine format for this document
|
||||
# Check outputFormat field first (per-document), then format field (legacy), then global fallback
|
||||
docFormat = doc.get("outputFormat") or doc.get("format") or outputFormat
|
||||
|
||||
# Determine language for this document
|
||||
# Extract per-document language from structure (validated in State 3), fallback to global
|
||||
docLanguage = doc.get("language") or language
|
||||
|
||||
# Validate language format (should be 2-character ISO code, validated in State 3)
|
||||
if not isinstance(docLanguage, str) or len(docLanguage) != 2:
|
||||
logger.warning(f"Document {doc.get('id')} has invalid language format: {docLanguage}, using fallback")
|
||||
docLanguage = language # Use global fallback
|
||||
|
||||
# Get renderer for this document's format
|
||||
renderer = self._getFormatRenderer(docFormat)
|
||||
if not renderer:
|
||||
logger.warning(f"Unsupported format '{docFormat}' for document {doc.get('id', docIndex)}, skipping")
|
||||
continue
|
||||
|
||||
# Check output style classification (code/document/image/etc.) from renderer
|
||||
from .renderers.registry import getOutputStyle
|
||||
outputStyle = getOutputStyle(docFormat)
|
||||
if outputStyle:
|
||||
logger.debug(f"Document {doc.get('id', docIndex)} format '{docFormat}' classified as '{outputStyle}' style")
|
||||
# Store style in document metadata for potential use in processing paths
|
||||
if "metadata" not in doc:
|
||||
doc["metadata"] = {}
|
||||
doc["metadata"]["outputStyle"] = outputStyle
|
||||
|
||||
# Create JSON structure with single document (preserving metadata)
|
||||
singleDocContent = {
|
||||
"metadata": {**metadata, "language": docLanguage}, # Add per-document language to metadata
|
||||
"documents": [doc] # Only this document
|
||||
}
|
||||
|
||||
# Use document title or fallback to provided title
|
||||
docTitle = doc.get("title", title)
|
||||
|
||||
# Render this document (can return multiple files, e.g., HTML + images)
|
||||
renderedDocs = await renderer.render(singleDocContent, docTitle, userPrompt, aiService)
|
||||
allRenderedDocuments.extend(renderedDocs)
|
||||
|
||||
logger.info(f"Rendered {len(documents)} document(s) into {len(allRenderedDocuments)} file(s)")
|
||||
return allRenderedDocuments
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error rendering JSON report to {outputFormat}: {str(e)}")
|
||||
raise
|
||||
|
||||
async def generateDocumentWithTwoPhases(
|
||||
self,
|
||||
userPrompt: str,
|
||||
cachedContent: Optional[Dict[str, Any]] = None,
|
||||
contentParts: Optional[List[Any]] = None,
|
||||
maxSectionLength: int = 500,
|
||||
parallelGeneration: bool = True,
|
||||
progressCallback: Optional[Callable] = None
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Generate document using two-phase approach:
|
||||
1. Generate structure skeleton with empty sections
|
||||
2. Generate content for each section iteratively
|
||||
|
||||
This is the core logic for document generation in AI calls.
|
||||
|
||||
Args:
|
||||
userPrompt: User's original prompt
|
||||
cachedContent: Optional extracted content cache (from extraction phase)
|
||||
contentParts: Optional list of ContentParts to use for structure generation
|
||||
maxSectionLength: Maximum words for simple sections
|
||||
parallelGeneration: Enable parallel section generation
|
||||
progressCallback: Optional callback function(progress, total, message) for progress updates
|
||||
|
||||
Returns:
|
||||
Complete document structure with populated elements ready for rendering
|
||||
"""
|
||||
try:
|
||||
from .subStructureGenerator import StructureGenerator
|
||||
from .subContentGenerator import ContentGenerator
|
||||
|
||||
# Phase 1: Generate structure skeleton
|
||||
if progressCallback:
|
||||
progressCallback(0, 100, "Generating document structure...")
|
||||
|
||||
structureGenerator = StructureGenerator(self.services)
|
||||
|
||||
# Extract imageDocuments from cachedContent if available
|
||||
existingImages = None
|
||||
if cachedContent and cachedContent.get("imageDocuments"):
|
||||
existingImages = cachedContent.get("imageDocuments")
|
||||
|
||||
structure = await structureGenerator.generateStructure(
|
||||
userPrompt=userPrompt,
|
||||
documentList=None, # Not used in current implementation
|
||||
cachedContent=cachedContent,
|
||||
contentParts=contentParts, # Pass ContentParts for structure generation
|
||||
maxSectionLength=maxSectionLength,
|
||||
existingImages=existingImages
|
||||
)
|
||||
|
||||
if progressCallback:
|
||||
progressCallback(30, 100, "Structure generated, starting content generation...")
|
||||
|
||||
# Phase 2: Generate content for each section
|
||||
contentGenerator = ContentGenerator(self.services)
|
||||
|
||||
# Create progress callback wrapper for content generation phase (30-90%)
|
||||
def contentProgressCallback(sectionIndex: int, totalSections: int, message: str):
|
||||
if progressCallback:
|
||||
# Map section progress to overall progress (30% to 90%)
|
||||
if totalSections > 0:
|
||||
overallProgress = 30 + int(60 * (sectionIndex / totalSections))
|
||||
else:
|
||||
overallProgress = 30
|
||||
progressCallback(overallProgress, 100, f"Section {sectionIndex}/{totalSections}: {message}")
|
||||
|
||||
completeStructure = await contentGenerator.generateContent(
|
||||
structure=structure,
|
||||
cachedContent=cachedContent,
|
||||
userPrompt=userPrompt,
|
||||
contentParts=contentParts, # Pass ContentParts for content generation
|
||||
progressCallback=contentProgressCallback,
|
||||
parallelGeneration=parallelGeneration
|
||||
)
|
||||
|
||||
if progressCallback:
|
||||
progressCallback(100, 100, "Document generation complete")
|
||||
|
||||
return completeStructure
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in two-phase document generation: {str(e)}")
|
||||
logger.debug(traceback.format_exc())
|
||||
raise
|
||||
|
||||
async def getAdaptiveExtractionPrompt(
|
||||
self,
|
||||
outputFormat: str,
|
||||
userPrompt: str,
|
||||
title: str,
|
||||
aiService=None
|
||||
) -> str:
|
||||
"""Get adaptive extraction prompt."""
|
||||
from modules.serviceCenter.services.serviceExtraction.subPromptBuilderExtraction import buildExtractionPrompt
|
||||
return await buildExtractionPrompt(
|
||||
outputFormat=outputFormat,
|
||||
userPrompt=userPrompt,
|
||||
title=title,
|
||||
aiService=aiService,
|
||||
services=self.services
|
||||
)
|
||||
|
||||
|
||||
def _getFormatRenderer(self, output_format: str):
|
||||
"""Get the appropriate document renderer for the specified format."""
|
||||
try:
|
||||
from .renderers.registry import getRenderer, getSupportedFormats
|
||||
renderer = getRenderer(output_format, services=self.services, outputStyle='document')
|
||||
|
||||
if renderer:
|
||||
return renderer
|
||||
|
||||
# Log available formats for debugging
|
||||
availableFormats = getSupportedFormats()
|
||||
logger.error(
|
||||
f"No renderer found for format '{output_format}'. "
|
||||
f"Available formats: {availableFormats}"
|
||||
)
|
||||
|
||||
# Fallback to text renderer if no specific renderer found
|
||||
logger.warning(f"Falling back to text renderer for format {output_format}")
|
||||
fallbackRenderer = getRenderer('text', services=self.services, outputStyle='document')
|
||||
if fallbackRenderer:
|
||||
return fallbackRenderer
|
||||
|
||||
logger.error("Even text renderer fallback failed")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting renderer for {output_format}: {str(e)}")
|
||||
# traceback is already imported at module level
|
||||
logger.debug(traceback.format_exc())
|
||||
return None
|
||||
|
|
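The `contentProgressCallback` wrapper in `generateDocumentWithTwoPhases` maps per-section progress onto the 30% to 90% band of the overall operation. The arithmetic can be sketched as a standalone function; `mapSectionProgress` and its `start`/`span` parameters are hypothetical names used only for illustration.

```python
def mapSectionProgress(sectionIndex: int, totalSections: int, start: int = 30, span: int = 60) -> int:
    """Hypothetical helper: map section progress onto the overall percentage."""
    if totalSections > 0:
        # The fraction of completed sections scales the span, offset by the start value.
        return start + int(span * (sectionIndex / totalSections))
    # No sections: stay at the start of the content phase.
    return start
```

With the defaults, the first section of four reports 30 and the last reports 90, which leaves the remaining 10% for the final "complete" update.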
@ -0,0 +1,939 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Code Generation Path
|
||||
|
||||
Handles code generation with multi-file project support, dependency handling,
|
||||
and proper cross-file references.
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
import re
|
||||
from typing import Dict, Any, List, Optional
|
||||
from modules.datamodels.datamodelWorkflow import AiResponse, AiResponseMetadata, DocumentData
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum
|
||||
from modules.shared.jsonUtils import extractJsonString
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class CodeGenerationPath:
|
||||
"""Code generation path."""
|
||||
|
||||
def __init__(self, services):
|
||||
self.services = services
|
||||
|
||||
|
||||
async def generateCode(
|
||||
self,
|
||||
userPrompt: str,
|
||||
outputFormat: Optional[str] = None,
|
||||
contentParts: Optional[List[ContentPart]] = None,
|
||||
title: str = "Generated Code",
|
||||
parentOperationId: Optional[str] = None
|
||||
) -> AiResponse:
|
||||
"""
|
||||
Generate code files with multi-file project support.
|
||||
|
||||
Returns: AiResponse with code files as documents
|
||||
"""
|
||||
# Create operation ID
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
codeOperationId = f"code_gen_{workflowId}_{int(time.time())}"
|
||||
|
||||
# Start progress tracking
|
||||
self.services.chat.progressLogStart(
|
||||
codeOperationId,
|
||||
"Code Generation",
|
||||
"Code Generation",
|
||||
f"Format: {outputFormat or 'txt'}",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
try:
|
||||
# Detect language and project type from prompt or outputFormat
|
||||
language, projectType = self._detectLanguageAndProjectType(userPrompt, outputFormat)
|
||||
|
||||
# Phase 1: Code structure generation (with looping)
|
||||
self.services.chat.progressLogUpdate(codeOperationId, 0.2, "Generating code structure")
|
||||
codeStructure = await self._generateCodeStructure(
|
||||
userPrompt=userPrompt,
|
||||
language=language,
|
||||
outputFormat=outputFormat,
|
||||
contentParts=contentParts
|
||||
)
|
||||
|
||||
# Phase 2: Code content generation (with dependency handling)
|
||||
self.services.chat.progressLogUpdate(codeOperationId, 0.5, "Generating code content")
|
||||
codeFiles = await self._generateCodeContent(
|
||||
codeStructure,
|
||||
codeOperationId,
|
||||
userPrompt=userPrompt,
|
||||
contentParts=contentParts
|
||||
)
|
||||
|
||||
# Phase 3: Code formatting & validation
|
||||
self.services.chat.progressLogUpdate(codeOperationId, 0.8, "Formatting code files")
|
||||
formattedFiles = await self._formatAndValidateCode(codeFiles)
|
||||
|
||||
# Phase 4: Code Rendering (Renderer-Based)
|
||||
self.services.chat.progressLogUpdate(codeOperationId, 0.9, "Rendering code files")
|
||||
|
||||
# Group files by format
|
||||
filesByFormat = {}
|
||||
for file in formattedFiles:
|
||||
fileType = file.get("fileType", outputFormat or "txt")
|
||||
if fileType not in filesByFormat:
|
||||
filesByFormat[fileType] = []
|
||||
filesByFormat[fileType].append(file)
|
||||
|
||||
# Render each format group using appropriate renderer
|
||||
allRenderedDocuments = []
|
||||
for fileType, files in filesByFormat.items():
|
||||
# Get renderer for this format
|
||||
renderer = self._getCodeRenderer(fileType)
|
||||
|
||||
if renderer:
|
||||
# Use code renderer
|
||||
renderedDocs = await renderer.renderCodeFiles(
|
||||
codeFiles=files,
|
||||
metadata=codeStructure.get("metadata", {}),
|
||||
userPrompt=userPrompt
|
||||
)
|
||||
allRenderedDocuments.extend(renderedDocs)
|
||||
else:
|
||||
# Fallback: output directly (for formats without renderers)
|
||||
for file in files:
|
||||
mimeType = self._getMimeType(file.get("fileType", "txt"))
|
||||
content = file.get("content", "")
|
||||
contentBytes = content.encode('utf-8') if isinstance(content, str) else content
|
||||
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
allRenderedDocuments.append(
|
||||
RenderedDocument(
|
||||
documentData=contentBytes,
|
||||
mimeType=mimeType,
|
||||
filename=file.get("filename", "generated.txt"),
|
||||
metadata=codeStructure.get("metadata", {})
|
||||
)
|
||||
)
|
||||
|
||||
# Convert RenderedDocument to DocumentData
|
||||
documents = []
|
||||
for renderedDoc in allRenderedDocuments:
|
||||
documents.append(DocumentData(
|
||||
documentName=renderedDoc.filename,
|
||||
documentData=renderedDoc.documentData,
|
||||
mimeType=renderedDoc.mimeType,
|
||||
sourceJson=getattr(renderedDoc, 'metadata', None)
|
||||
))
|
||||
|
||||
metadata = AiResponseMetadata(
|
||||
title=title,
|
||||
operationType=OperationTypeEnum.DATA_GENERATE.value
|
||||
)
|
||||
|
||||
# Create summary JSON for content field
|
||||
summaryContent = {
|
||||
"type": "code_generation",
|
||||
"metadata": codeStructure.get("metadata", {}),
|
||||
"files": [
|
||||
{
|
||||
"filename": doc.documentName,
|
||||
"mimeType": doc.mimeType
|
||||
}
|
||||
for doc in documents
|
||||
],
|
||||
"fileCount": len(documents)
|
||||
}
|
||||
|
||||
self.services.chat.progressLogFinish(codeOperationId, True)
|
||||
|
||||
return AiResponse(
|
||||
documents=documents,
|
||||
content=json.dumps(summaryContent, ensure_ascii=False),
|
||||
metadata=metadata
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in code generation: {str(e)}")
|
||||
self.services.chat.progressLogFinish(codeOperationId, False)
|
||||
raise
|
||||
|
||||
def _detectLanguageAndProjectType(self, userPrompt: str, outputFormat: Optional[str]) -> tuple:
|
||||
"""Detect programming language and project type from prompt or format."""
|
||||
promptLower = userPrompt.lower()
|
||||
|
||||
# Detect language
|
||||
language = None
|
||||
if outputFormat:
|
||||
if outputFormat == "py":
|
||||
language = "python"
|
||||
elif outputFormat in ["js", "ts"]:
|
||||
language = "javascript" if outputFormat == "js" else "typescript"  # same names as prompt-based detection
|
||||
elif outputFormat == "html":
|
||||
language = "html"
|
||||
|
||||
if not language:
|
||||
if "python" in promptLower or ".py" in promptLower:
|
||||
language = "python"
|
||||
elif "javascript" in promptLower or ".js" in promptLower:
|
||||
language = "javascript"
|
||||
elif "typescript" in promptLower or ".ts" in promptLower:
|
||||
language = "typescript"
|
||||
elif "html" in promptLower:
|
||||
language = "html"
|
||||
else:
|
||||
language = "python" # Default
|
||||
|
||||
# Detect project type
|
||||
projectType = "single_file"
|
||||
if "multi" in promptLower or "multiple files" in promptLower or "project" in promptLower:
|
||||
projectType = "multi_file"
|
||||
|
||||
return language, projectType
|
||||
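# Note: the substring checks above also fire on look-alike tokens; for example,
# ".js" is a substring of "config.json". The following is a minimal sketch, not part
# of this commit, of word-boundary matching that avoids this. The names
# _LANGUAGE_PATTERNS and detectLanguage are hypothetical.

```python
import re

# Hypothetical table; the order mirrors the if/elif chain in the method above.
_LANGUAGE_PATTERNS = [
    ("python", re.compile(r"\bpython\b|\.py\b")),
    ("javascript", re.compile(r"\bjavascript\b|\.js\b")),
    ("typescript", re.compile(r"\btypescript\b|\.ts\b")),
    ("html", re.compile(r"\bhtml\b")),
]


def detectLanguage(prompt, default="python"):
    """Return the first language whose pattern matches the prompt, else the default."""
    lowered = prompt.lower()
    for name, pattern in _LANGUAGE_PATTERNS:
        if pattern.search(lowered):
            return name
    return default
```

# With word boundaries, "config.json" no longer counts as a ".js" mention, while
# "app.js" still does.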
|
||||
async def _generateCodeStructure(
|
||||
self,
|
||||
userPrompt: str,
|
||||
language: str,
|
||||
outputFormat: Optional[str],
|
||||
contentParts: Optional[List[ContentPart]]
|
||||
) -> Dict[str, Any]:
|
||||
"""Generate code structure using looping system."""
|
||||
|
||||
# Build content parts index (similar to document generation)
|
||||
contentPartsIndex = ""
|
||||
if contentParts:
|
||||
validParts = []
|
||||
for part in contentParts:
|
||||
contentFormat = part.metadata.get("contentFormat", "unknown")
|
||||
originalFileName = part.metadata.get('originalFileName', 'N/A')
|
||||
|
||||
# Include reference parts and parts with data
|
||||
if contentFormat == "reference" or (part.data and len(str(part.data).strip()) > 0):
|
||||
validParts.append(part)
|
||||
|
||||
if validParts:
|
||||
contentPartsIndex = "\n## AVAILABLE CONTENT PARTS\n"
|
||||
for i, part in enumerate(validParts, 1):
|
||||
contentFormat = part.metadata.get("contentFormat", "unknown")
|
||||
originalFileName = part.metadata.get('originalFileName', 'N/A')
|
||||
|
||||
contentPartsIndex += f"\n{i}. ContentPart ID: {part.id}\n"
|
||||
contentPartsIndex += f" Format: {contentFormat}\n"
|
||||
contentPartsIndex += f" Type: {part.typeGroup}\n"
|
||||
contentPartsIndex += f" MIME Type: {part.mimeType or 'N/A'}\n"
|
||||
contentPartsIndex += f" Source: {part.metadata.get('documentId', 'unknown')}\n"
|
||||
contentPartsIndex += f" Original file name: {originalFileName}\n"
|
||||
contentPartsIndex += f" Usage hint: {part.metadata.get('usageHint', 'N/A')}\n"
|
||||
|
||||
if not contentPartsIndex:
|
||||
contentPartsIndex = "\n(No content parts available)"
|
||||
|
||||
# Create template structure explicitly (not extracted from prompt)
|
||||
templateStructure = f"""{{
|
||||
"metadata": {{
|
||||
"language": "{language}",
|
||||
"projectType": "single_file|multi_file",
|
||||
"projectName": ""
|
||||
}},
|
||||
"files": [
|
||||
{{
|
||||
"id": "",
|
||||
"filename": "",
|
||||
"fileType": "",
|
||||
"dependencies": [],
|
||||
"imports": [],
|
||||
"functions": [],
|
||||
"classes": []
|
||||
}}
|
||||
]
|
||||
}}"""
|
||||
|
||||
# Build structure generation prompt
|
||||
structurePrompt = f"""# TASK: Generate Code Project Structure
|
||||
|
||||
This is a PLANNING task. Return EXACTLY ONE complete JSON object. Do not generate multiple JSON objects, alternatives, or variations. Do not use separators like "---" between JSON objects.
|
||||
|
||||
## USER REQUEST (for context)
|
||||
```
|
||||
{userPrompt}
|
||||
```
|
||||
{contentPartsIndex}
|
||||
|
||||
## LANGUAGE
|
||||
{language}
|
||||
|
||||
## TASK DESCRIPTION
|
||||
Analyze the USER REQUEST above and create a project structure that fulfills ALL requirements mentioned in the request.
|
||||
|
||||
IMPORTANT: If the request mentions multiple files (e.g., "3 files", "config.json and customers.json", etc.), you MUST include ALL requested files in the files array. Set projectType to "multi_file" when multiple files are requested.
|
||||
|
||||
## CONTENT PARTS USAGE (if available)
|
||||
If AVAILABLE CONTENT PARTS are listed above, use them to inform the file structure:
|
||||
|
||||
**Analyzing Content Parts:**
|
||||
- Review each ContentPart's format, type, original file name, and usage hint
|
||||
- Content parts with "reference" format = documents/images that will be processed/extracted
|
||||
- Content parts with "extracted" format = pre-processed data ready to use
|
||||
- Content parts with "object" format = images/documents to be displayed or processed
|
||||
|
||||
**Mapping Content Parts to Files:**
|
||||
- If content parts contain data (e.g., expense receipts, customer lists), create data files (JSON/CSV) that will store/represent that data
|
||||
- If content parts are documents to be processed (e.g., PDFs), you may need code files that parse/process them
|
||||
- Use the original file names and usage hints to determine appropriate filenames and file types
|
||||
|
||||
**Populating File Structure Fields:**
|
||||
- **dependencies**: List file IDs that this file depends on (e.g., if a Python script reads a JSON config file, the script depends on the config file)
|
||||
- **imports**: For code files, list imports needed based on content parts (e.g., if processing PDFs: ["import PyPDF2"], if processing CSV: ["import csv"], if processing JSON: ["import json"])
|
||||
- **functions**: For CODE files only - list function signatures if the USER REQUEST specifies functionality (e.g., {{"name": "parseReceipt", "signature": "def parseReceipt(pdf_path: str) -> dict"}})
|
||||
- **classes**: For CODE files only - list class definitions if the USER REQUEST specifies OOP structure
|
||||
- **functions/classes for DATA files**: Leave as empty arrays [] - data files (JSON/CSV/XML) don't contain executable code
|
||||
|
||||
## FILE STRUCTURE REQUIREMENTS
|
||||
Create a JSON structure with:
|
||||
1. metadata: {{"language": "{language}", "projectType": "single_file|multi_file", "projectName": "..."}}
|
||||
- projectName: Derive from USER REQUEST or content parts (e.g., "expense-tracker", "customer-manager")
|
||||
|
||||
2. files: Array of file structures, each with:
|
||||
- id: Unique identifier (e.g., "file_1", "file_2")
|
||||
- filename: File name matching USER REQUEST requirements (e.g., "config.json", "customers.json", "expenses.csv")
|
||||
- fileType: File extension matching the requested format (e.g., "json", "py", "js", "csv", "xml")
|
||||
- dependencies: List of file IDs this file depends on (for multi-file projects where files reference each other)
|
||||
- imports: List of import statements that this file will need (e.g., ["import json", "import csv"] for Python files processing JSON/CSV)
|
||||
- functions: Array of function signatures {{"name": "...", "signature": "..."}} - ONLY if the file will contain executable code (not for pure data files like JSON/CSV)
|
||||
- classes: Array of class definitions {{"name": "...", "signature": "..."}} - ONLY if the file will contain executable code (not for pure data files like JSON/CSV)
|
||||
|
||||
IMPORTANT FOR DATA FILES (JSON, CSV, XML):
|
||||
- For pure data files (config.json, customers.json, expenses.csv), leave functions and classes as empty arrays []
|
||||
- These files contain structured data, not executable code
|
||||
- Use imports only if the file will be processed by code (e.g., a Python script that reads the CSV)
|
||||
|
||||
IMPORTANT FOR CODE FILES (Python, JavaScript, etc.):
|
||||
- Include functions/classes if the USER REQUEST specifies functionality
|
||||
- Use dependencies to indicate which data files this code file reads/processes
|
||||
- Use imports to specify what libraries/modules are needed
|
||||
|
||||
For single-file projects, return one file. For multi-file projects, include ALL requested files in the files array.
|
||||
|
||||
Return ONLY valid JSON matching the request above.
|
||||
"""
|
||||
|
||||
# Build continuation prompt builder
|
||||
async def buildCodeStructurePromptWithContinuation(
|
||||
continuationContext: Any,
|
||||
templateStructure: str,
|
||||
basePrompt: str
|
||||
) -> str:
|
||||
"""Build code structure prompt with continuation context. Uses unified signature.
|
||||
|
||||
Note: All initial context (userPrompt, contentParts, etc.) is already
|
||||
contained in basePrompt. This function only adds continuation-specific instructions.
|
||||
"""
|
||||
# Extract continuation context fields (only what's needed for continuation)
|
||||
incompletePart = continuationContext.incomplete_part
|
||||
lastRawJson = continuationContext.last_raw_json
|
||||
|
||||
# Generate both overlap context and hierarchy context using jsonContinuation
|
||||
overlapContext = ""
|
||||
unifiedContext = ""
|
||||
if lastRawJson:
|
||||
# Get contexts directly from jsonContinuation
|
||||
from modules.shared.jsonContinuation import getContexts
|
||||
contexts = getContexts(lastRawJson)
|
||||
overlapContext = contexts.overlapContext
|
||||
unifiedContext = contexts.hierarchyContextForPrompt
|
||||
elif incompletePart:
|
||||
unifiedContext = incompletePart
|
||||
else:
|
||||
unifiedContext = "Unable to extract context - response was completely broken"
|
||||
|
||||
# Build unified continuation prompt format
|
||||
continuationPrompt = f"""{basePrompt}
|
||||
|
||||
--- CONTINUATION REQUEST ---
|
||||
The previous JSON response was incomplete. Continue from where it stopped.
|
||||
|
||||
Context showing structure hierarchy with cut point:
|
||||
```
|
||||
{unifiedContext}
|
||||
```
|
||||
|
||||
Overlap Requirement:
|
||||
To ensure proper merging, your response MUST start EXACTLY with the overlap context shown below, then continue with new content.
|
||||
|
||||
Overlap context (start your response with this exact text):
|
||||
```json
|
||||
{overlapContext if overlapContext else "No overlap context available"}
|
||||
```
|
||||
|
||||
TASK:
|
||||
1. Start your response EXACTLY with the overlap context shown above (character by character)
|
||||
2. Continue seamlessly from where the overlap context ends
|
||||
3. Complete the remaining content following the JSON structure template above
|
||||
4. Return ONLY valid JSON following the structure template - no overlap/continuation wrapper objects
|
||||
|
||||
CRITICAL:
|
||||
- Your response MUST begin with the exact overlap context text (this enables automatic merging)
|
||||
- Continue seamlessly after the overlap context with new content
|
||||
- Your response must be valid JSON matching the structure template above"""
|
||||
return continuationPrompt
|
||||
|
||||
# Use generic looping system with code_structure use case
|
||||
options = AiCallOptions(
|
||||
operationType=OperationTypeEnum.DATA_GENERATE,
|
||||
resultFormat="json"
|
||||
)
|
||||
|
||||
structureJson = await self.services.ai.callAiWithLooping(
|
||||
prompt=structurePrompt,
|
||||
options=options,
|
||||
promptBuilder=buildCodeStructurePromptWithContinuation,
|
||||
promptArgs={
|
||||
"userPrompt": userPrompt,
|
||||
"contentParts": contentParts,
|
||||
"templateStructure": templateStructure,
|
||||
"basePrompt": structurePrompt
|
||||
},
|
||||
useCaseId="code_structure",
|
||||
debugPrefix="code_structure_generation",
|
||||
contentParts=contentParts
|
||||
)
|
||||
|
||||
# Extract JSON from markdown fences if present
|
||||
extractedJson = extractJsonString(structureJson)
|
||||
parsed = json.loads(extractedJson)
|
||||
return parsed
|
||||
|
||||
async def _generateCodeContent(
|
||||
self,
|
||||
codeStructure: Dict[str, Any],
|
||||
parentOperationId: str,
|
||||
userPrompt: Optional[str] = None,
|
||||
contentParts: Optional[List[ContentPart]] = None
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""Generate code content for each file with dependency handling."""
|
||||
files = codeStructure.get("files", [])
|
||||
metadata = codeStructure.get("metadata", {})
|
||||
|
||||
if not files:
|
||||
raise ValueError("No files found in code structure")
|
||||
|
||||
# Step 1: Resolve dependency order
|
||||
orderedFiles = self._resolveDependencyOrder(files)
|
||||
|
||||
# Step 2: Generate dependency files first (requirements.txt, package.json, etc.)
|
||||
dependencyFiles = await self._generateDependencyFiles(metadata, orderedFiles)
|
||||
|
||||
# Step 3: Generate code files sequentially in dependency order, so later files can reference earlier ones
|
||||
codeFiles = []
|
||||
generatedFileContext = {} # Track what's been generated for cross-file references
|
||||
|
||||
for idx, fileStructure in enumerate(orderedFiles):
|
||||
# Update progress
|
||||
progress = 0.5 + (0.3 * (idx / len(orderedFiles)))  # stay within 0.5-0.8; the caller reports 0.8 when formatting starts
|
||||
self.services.chat.progressLogUpdate(
|
||||
parentOperationId,
|
||||
progress,
|
||||
f"Generating {fileStructure.get('filename', 'file')}"
|
||||
)
|
||||
|
||||
# Provide context about already-generated files for proper imports
|
||||
fileContext = self._buildFileContext(generatedFileContext, fileStructure)
|
||||
|
||||
# Generate this file with context
|
||||
fileContent = await self._generateSingleFileContent(
|
||||
fileStructure,
|
||||
fileContext=fileContext,
|
||||
allFilesStructure=orderedFiles,
|
||||
metadata=metadata,
|
||||
userPrompt=userPrompt,
|
||||
contentParts=contentParts
|
||||
)
|
||||
|
||||
codeFiles.append(fileContent)
|
||||
|
||||
# Update context with generated file info (for next files)
|
||||
generatedFileContext[fileStructure["id"]] = {
|
||||
"filename": fileContent.get("filename", fileStructure.get("filename")),
|
||||
"functions": fileContent.get("functions", []),
|
||||
"classes": fileContent.get("classes", []),
|
||||
"exports": fileContent.get("exports", [])
|
||||
}
|
||||
|
||||
# Combine dependency files and code files
|
||||
return dependencyFiles + codeFiles
|
||||
|
||||
def _resolveDependencyOrder(self, files: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
|
||||
"""Resolve file generation order based on dependencies using topological sort."""
|
||||
# Build dependency graph
|
||||
fileMap = {f["id"]: f for f in files}
|
||||
dependencies = {}
|
||||
|
||||
for file in files:
|
||||
fileId = file["id"]
|
||||
deps = file.get("dependencies", []) # List of file IDs this file depends on
|
||||
dependencies[fileId] = deps
|
||||
|
||||
# Topological sort
|
||||
ordered = []
|
||||
visited = set()
|
||||
tempMark = set()
|
||||
|
||||
def visit(fileId: str):
|
||||
if fileId in tempMark:
|
||||
# Circular dependency detected - break it
|
||||
logger.warning(f"Circular dependency detected involving {fileId}")
|
||||
return
|
||||
if fileId in visited:
|
||||
return
|
||||
|
||||
tempMark.add(fileId)
|
||||
for depId in dependencies.get(fileId, []):
|
||||
if depId in fileMap:
|
||||
visit(depId)
|
||||
tempMark.remove(fileId)
|
||||
visited.add(fileId)
|
||||
ordered.append(fileMap[fileId])
|
||||
|
||||
for file in files:
|
||||
if file["id"] not in visited:
|
||||
visit(file["id"])
|
||||
|
||||
return ordered
|
||||
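# Note: a standalone sketch of the same depth-first topological sort, reduced to
# plain file IDs so the ordering is easy to check. topologicalOrder and its input
# shape (ID -> list of dependency IDs) are assumptions made for illustration.

```python
from typing import Dict, List


def topologicalOrder(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return IDs so that every ID appears after the IDs it depends on."""
    ordered: List[str] = []
    done = set()
    inProgress = set()

    def visit(node: str) -> None:
        if node in done or node in inProgress:
            return  # already emitted, or a back edge of a cycle that is skipped
        inProgress.add(node)
        for dep in dependencies.get(node, []):
            if dep in dependencies:  # ignore IDs that are not part of the project
                visit(dep)
        inProgress.discard(node)
        done.add(node)
        ordered.append(node)

    for node in dependencies:
        visit(node)
    return ordered
```

# As in the method above, a cycle is broken rather than reported as an error, so a
# file inside a cycle may be generated before one of its dependencies.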
|
||||
async def _generateDependencyFiles(
|
||||
self,
|
||||
metadata: Dict[str, Any],
|
||||
files: List[Dict[str, Any]]
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""Generate dependency files (requirements.txt, package.json, etc.)."""
|
||||
language = metadata.get("language", "").lower()
|
||||
dependencyFiles = []
|
||||
|
||||
# Generate requirements.txt for Python
|
||||
if language in ["python", "py"]:
|
||||
requirementsContent = await self._generateRequirementsTxt(files)
|
||||
if requirementsContent:
|
||||
dependencyFiles.append({
|
||||
"filename": "requirements.txt",
|
||||
"content": requirementsContent,
|
||||
"fileType": "txt",
|
||||
"id": "requirements_txt"
|
||||
})
|
||||
|
||||
# Generate package.json for JavaScript/TypeScript
|
||||
elif language in ["javascript", "typescript", "js", "ts"]:
|
||||
packageJson = await self._generatePackageJson(files, metadata)
|
||||
if packageJson:
|
||||
dependencyFiles.append({
|
||||
"filename": "package.json",
|
||||
"content": json.dumps(packageJson, indent=2),
|
||||
"fileType": "json",
|
||||
"id": "package_json"
|
||||
})
|
||||
|
||||
return dependencyFiles
|
||||
|
||||
async def _generateRequirementsTxt(
|
||||
self,
|
||||
files: List[Dict[str, Any]]
|
||||
) -> Optional[str]:
|
||||
"""Generate requirements.txt content from Python imports."""
|
||||
pythonPackages = set()
|
||||
|
||||
for file in files:
|
||||
imports = file.get("imports", [])
|
||||
if isinstance(imports, list):
|
||||
for imp in imports:
|
||||
if isinstance(imp, str):
|
||||
# Extract package name from import
|
||||
# Handle: "from flask import", "import flask", "from flask import Flask"
|
||||
imp = imp.strip()
|
||||
if "import" in imp:
|
||||
if "from" in imp:
|
||||
# "from package import ..."
|
||||
parts = imp.split("from")
|
||||
if len(parts) > 1:
|
||||
package = parts[1].split("import")[0].strip()
|
||||
if package and not package.startswith("."):
|
||||
pythonPackages.add(package.split(".")[0]) # Get root package
|
||||
else:
|
||||
# "import package" or "import package.module"
|
||||
parts = imp.split("import")
|
||||
if len(parts) > 1:
|
||||
package = (parts[1].split() or [""])[0].split(".")[0].rstrip(",")  # first module only; drops "as alias" and trailing commas
|
||||
if package and not package.startswith("."):
|
||||
pythonPackages.add(package)
|
||||
|
||||
if pythonPackages:
|
||||
return "\n".join(sorted(pythonPackages))
|
||||
return None
|
||||
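# Note: the method above also lists standard-library modules such as json or csv,
# which pip cannot install. A minimal sketch of one possible filter, assuming
# Python 3.10+ for sys.stdlib_module_names (on older versions nothing is filtered).
# rootPackages and _IMPORT_RE are hypothetical names.

```python
import re
import sys

# Matches "from pkg.mod import ..." and "import pkg.mod [as alias]".
_IMPORT_RE = re.compile(r"^\s*(?:from\s+([\w.]+)\s+import\b|import\s+([\w.]+))")


def rootPackages(importLines):
    """Return sorted root package names, skipping relative and stdlib imports."""
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    packages = set()
    for line in importLines:
        match = _IMPORT_RE.match(line)
        if not match:
            continue
        module = match.group(1) or match.group(2)
        root = module.split(".")[0]  # empty for relative imports such as ".utils"
        if root and root not in stdlib:
            packages.add(root)
    return sorted(packages)
```

# The import name is not always the distribution name on PyPI (for example, the
# module yaml is installed as PyYAML), so the result remains a best-effort list.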
|
||||
async def _generatePackageJson(
|
||||
self,
|
||||
files: List[Dict[str, Any]],
|
||||
metadata: Dict[str, Any]
|
||||
) -> Optional[Dict[str, Any]]:
|
||||
"""Generate package.json content from JavaScript/TypeScript imports."""
|
||||
npmPackages = {}
|
||||
|
||||
for file in files:
|
||||
imports = file.get("imports", [])
|
||||
if isinstance(imports, list):
|
||||
for imp in imports:
|
||||
if isinstance(imp, str):
|
||||
# Extract npm package from import
|
||||
# Handle: "import express from 'express'", "const express = require('express')"
|
||||
imp = imp.strip()
|
||||
if "from" in imp:
|
||||
# ES6 import: "import ... from 'package'"
|
||||
parts = imp.split("from")
|
||||
if len(parts) > 1:
|
||||
package = parts[1].strip().rstrip(";").strip("'\"")  # tolerate a trailing semicolon
|
||||
if package and not package.startswith(".") and not package.startswith("/"):
|
||||
npmPackages[package] = "*"
|
||||
elif "require" in imp:
|
||||
# CommonJS: "require('package')"
|
||||
match = re.search(r"require\(['\"]([^'\"]+)['\"]\)", imp)
|
||||
if match:
|
||||
package = match.group(1)
|
||||
if not package.startswith(".") and not package.startswith("/"):
|
||||
npmPackages[package] = "*"
|
||||
|
||||
if npmPackages:
|
||||
return {
|
||||
"name": metadata.get("projectName", "generated-project"),
|
||||
"version": "1.0.0",
|
||||
"dependencies": npmPackages
|
||||
}
|
||||
return None
|
||||
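# Note: an import specifier can include a subpath ("lodash/fp") or a scope
# ("@scope/pkg/sub"), and neither form is a valid key under "dependencies". The
# sketch below reduces a statement to its package name; npmPackageName and _SPEC_RE
# are hypothetical and not part of this commit.

```python
import re

# Captures the quoted specifier after "from", "require(" or a bare "import".
_SPEC_RE = re.compile(r"""(?:from\s+|require\(\s*|import\s+)['"]([^'"]+)['"]""")


def npmPackageName(statement):
    """Return the npm package a statement refers to, or None for local paths."""
    match = _SPEC_RE.search(statement)
    if not match:
        return None
    spec = match.group(1)
    if spec.startswith((".", "/")):
        return None  # relative or absolute path, not a package
    parts = spec.split("/")
    if spec.startswith("@"):
        # Scoped packages keep the scope and the name: "@scope/pkg".
        return "/".join(parts[:2]) if len(parts) >= 2 else None
    return parts[0]
```

# Node built-ins such as fs or path are not filtered here and would still need to
# be excluded before writing package.json.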
|
||||
def _buildFileContext(
|
||||
self,
|
||||
generatedFileContext: Dict[str, Dict[str, Any]],
|
||||
currentFile: Dict[str, Any]
|
||||
) -> Dict[str, Any]:
|
||||
"""Build context about other files for proper imports/references."""
|
||||
context = {
|
||||
"availableFiles": [],
|
||||
"availableFunctions": {},
|
||||
"availableClasses": {}
|
||||
}
|
||||
|
||||
# Add info about already-generated files
|
||||
for fileId, fileInfo in generatedFileContext.items():
|
||||
context["availableFiles"].append({
|
||||
"id": fileId,
|
||||
"filename": fileInfo["filename"],
|
||||
"functions": fileInfo.get("functions", []),
|
||||
"classes": fileInfo.get("classes", []),
|
||||
"exports": fileInfo.get("exports", [])
|
||||
})
|
||||
|
||||
# Build function/class maps for easy lookup
|
||||
for func in fileInfo.get("functions", []):
|
||||
funcName = func.get("name", "")
|
||||
if funcName:
|
||||
context["availableFunctions"][funcName] = {
|
||||
"file": fileInfo["filename"],
|
||||
"signature": func.get("signature", "")
|
||||
}
|
||||
|
||||
for cls in fileInfo.get("classes", []):
|
||||
className = cls.get("name", "")
|
||||
if className:
|
||||
context["availableClasses"][className] = {
|
||||
"file": fileInfo["filename"]
|
||||
}
|
||||
|
||||
return context
|
||||
|
||||
async def _generateSingleFileContent(
|
||||
self,
|
||||
fileStructure: Dict[str, Any],
|
||||
fileContext: Optional[Dict[str, Any]] = None,
|
||||
allFilesStructure: Optional[List[Dict[str, Any]]] = None,
|
||||
metadata: Optional[Dict[str, Any]] = None,
|
||||
userPrompt: Optional[str] = None,
|
||||
contentParts: Optional[List[ContentPart]] = None
|
||||
) -> Dict[str, Any]:
|
||||
"""Generate code content for a single file with context about other files."""
|
||||
|
||||
# Build prompt with context about other files for proper imports
|
||||
filename = fileStructure.get("filename", "generated.py")
|
||||
fileType = fileStructure.get("fileType", "py")
|
||||
dependencies = fileStructure.get("dependencies", [])
|
||||
functions = fileStructure.get("functions", [])
|
||||
classes = fileStructure.get("classes", [])
|
||||
|
||||
contextInfo = ""
|
||||
if fileContext and fileContext.get("availableFiles"):
|
||||
contextInfo = "\n\nAvailable files and their exports:\n"
|
||||
for fileInfo in fileContext["availableFiles"]:
|
||||
contextInfo += f"- {fileInfo['filename']}: "
|
||||
funcs = [f.get("name", "") for f in fileInfo.get("functions", [])]
|
||||
cls = [c.get("name", "") for c in fileInfo.get("classes", [])]
|
||||
exports = []
|
||||
if funcs:
|
||||
exports.extend(funcs)
|
||||
if cls:
|
||||
exports.extend(cls)
|
||||
if exports:
|
||||
contextInfo += ", ".join(exports)
|
||||
contextInfo += "\n"
|
||||
|
||||
# Build content parts section if available
|
||||
contentPartsSection = ""
|
||||
if contentParts:
|
||||
relevantParts = []
|
||||
for part in contentParts:
|
||||
# Include parts that might be relevant to this file
|
||||
usageHint = part.metadata.get('usageHint', '').lower()
|
||||
originalFileName = part.metadata.get('originalFileName', '').lower()
|
||||
filenameLower = filename.lower()
|
||||
|
||||
# Check if this content part is relevant to this file
|
||||
if (filenameLower in usageHint or
|
||||
filenameLower in originalFileName or
|
||||
part.metadata.get('contentFormat') == 'reference' or
|
||||
(part.data and len(str(part.data).strip()) > 0)):
|
||||
relevantParts.append(part)
|
||||
|
||||
if relevantParts:
|
||||
contentPartsSection = "\n## AVAILABLE CONTENT PARTS\n"
|
||||
for i, part in enumerate(relevantParts, 1):
|
||||
contentFormat = part.metadata.get("contentFormat", "unknown")
|
||||
originalFileName = part.metadata.get('originalFileName', 'N/A')
|
||||
contentPartsSection += f"\n{i}. ContentPart ID: {part.id}\n"
|
||||
contentPartsSection += f" Format: {contentFormat}\n"
|
||||
contentPartsSection += f" Type: {part.typeGroup}\n"
|
||||
contentPartsSection += f" Original file name: {originalFileName}\n"
|
||||
contentPartsSection += f" Usage hint: {part.metadata.get('usageHint', 'N/A')}\n"
|
||||
# Include actual content if it's small enough (for data files like CSV, JSON)
|
||||
if part.data and isinstance(part.data, str) and len(part.data) < 2000:
|
||||
contentPartsSection += f" Content preview: {part.data[:500]}...\n"
|
||||
|
||||
# Build user request section
|
||||
userRequestSection = ""
|
||||
if userPrompt:
|
||||
userRequestSection = f"""
|
||||
## ORIGINAL USER REQUEST
|
||||
```
|
||||
{userPrompt}
|
||||
```
|
||||
"""
|
||||
|
||||
# Create template structure explicitly (not extracted from prompt)
|
||||
templateStructure = f"""{{
|
||||
"files": [
|
||||
{{
|
||||
"filename": "{filename}",
|
||||
"content": "// Complete code here",
|
||||
"functions": {json.dumps(functions, indent=2) if functions else '[]'},
|
||||
"classes": {json.dumps(classes, indent=2) if classes else '[]'}
|
||||
}}
|
||||
]
|
||||
}}"""
|
||||
|
||||
# Build base prompt
|
||||
contentPrompt = f"""# TASK: Generate Code File Content
|
||||
|
||||
Generate complete, executable code for the file: {filename}
|
||||
{userRequestSection}## FILE SPECIFICATIONS
|
||||
|
||||
File Type: {fileType}
|
||||
Language: {metadata.get('language', 'python') if metadata else 'python'}
|
||||
{contentPartsSection}
|
||||
|
||||
Required functions:
|
||||
{json.dumps(functions, indent=2) if functions else 'None specified'}
|
||||
|
||||
Required classes:
|
||||
{json.dumps(classes, indent=2) if classes else 'None specified'}
|
||||
|
||||
Dependencies on other files: {', '.join(dependencies) if dependencies else 'None'}
|
||||
{contextInfo}
|
||||
|
||||
Generate complete, production-ready code with:
|
||||
1. Proper imports (including imports from other files in the project if dependencies exist)
|
||||
2. All required functions and classes
|
||||
3. Error handling
|
||||
4. Documentation/docstrings
|
||||
5. Type hints where appropriate
|
||||
|
||||
Return ONLY valid JSON in this format:
|
||||
{templateStructure}
|
||||
"""
|
||||
|
||||
# Build continuation prompt builder
|
||||
async def buildCodeContentPromptWithContinuation(
|
||||
continuationContext: Any,
|
||||
templateStructure: str,
|
||||
basePrompt: str
|
||||
) -> str:
|
||||
"""Build code content prompt with continuation context. Uses unified signature.
|
||||
|
||||
Note: All initial context (filename, fileType, functions, etc.) is already
|
||||
contained in basePrompt. This function only adds continuation-specific instructions.
|
||||
"""
|
||||
# Extract continuation context fields (only what's needed for continuation)
|
||||
incompletePart = continuationContext.incomplete_part
|
||||
lastRawJson = continuationContext.last_raw_json
|
||||
|
||||
# Generate both overlap context and hierarchy context using jsonContinuation
|
||||
overlapContext = ""
|
||||
unifiedContext = ""
|
||||
if lastRawJson:
|
||||
# Get contexts directly from jsonContinuation
|
||||
from modules.shared.jsonContinuation import getContexts
|
||||
contexts = getContexts(lastRawJson)
|
||||
overlapContext = contexts.overlapContext
|
||||
unifiedContext = contexts.hierarchyContextForPrompt
|
||||
elif incompletePart:
|
||||
unifiedContext = incompletePart
|
||||
else:
|
||||
unifiedContext = "Unable to extract context - response was completely broken"
|
||||
|
||||
# Build unified continuation prompt format
|
||||
continuationPrompt = f"""{basePrompt}
|
||||
|
||||
--- CONTINUATION REQUEST ---
|
||||
The previous JSON response was incomplete. Continue from where it stopped.
|
||||
|
||||
Context showing structure hierarchy with cut point:
|
||||
```
|
||||
{unifiedContext}
|
||||
```
|
||||
|
||||
Overlap Requirement:
|
||||
To ensure proper merging, your response MUST start EXACTLY with the overlap context shown below, then continue with new content.
|
||||
|
||||
Overlap context (start your response with this exact text):
|
||||
```json
|
||||
{overlapContext if overlapContext else "No overlap context available"}
|
||||
```
|
||||
|
||||
TASK:
|
||||
1. Start your response EXACTLY with the overlap context shown above (character by character)
|
||||
2. Continue seamlessly from where the overlap context ends
|
||||
3. Complete the remaining content following the JSON structure template above
|
||||
4. Return ONLY valid JSON following the structure template - no overlap/continuation wrapper objects
|
||||
|
||||
CRITICAL:
|
||||
- Your response MUST begin with the exact overlap context text (this enables automatic merging)
|
||||
- Continue seamlessly after the overlap context with new content
|
||||
- Your response must be valid JSON matching the structure template above"""
|
||||
return continuationPrompt
|
||||
|
||||
# Use generic looping system with code_content use case
|
||||
options = AiCallOptions(
|
||||
operationType=OperationTypeEnum.DATA_GENERATE,
|
||||
resultFormat="json"
|
||||
)
|
||||
|
||||
contentJson = await self.services.ai.callAiWithLooping(
|
||||
prompt=contentPrompt,
|
||||
options=options,
|
||||
promptBuilder=buildCodeContentPromptWithContinuation,
|
||||
promptArgs={
|
||||
"filename": filename,
|
||||
"fileType": fileType,
|
||||
"functions": functions,
|
||||
"classes": classes,
|
||||
"dependencies": dependencies,
|
||||
"metadata": metadata,
|
||||
"userPrompt": userPrompt,
|
||||
"contentParts": contentParts,
|
||||
"contextInfo": contextInfo,
|
||||
"templateStructure": templateStructure,
|
||||
"basePrompt": contentPrompt
|
||||
},
|
||||
useCaseId="code_content",
|
||||
debugPrefix=f"code_content_{fileStructure.get('id', 'file')}",
|
||||
)
|
||||
|
||||
# Extract JSON from markdown fences if present
|
||||
extractedJson = extractJsonString(contentJson)
|
||||
parsed = json.loads(extractedJson)
|
||||
|
||||
# Extract file content and metadata
|
||||
files = parsed.get("files", [])
|
||||
if files:
|
||||
fileData = files[0]
|
||||
return {
|
||||
"filename": fileData.get("filename", filename),
|
||||
"content": fileData.get("content", ""),
|
||||
"fileType": fileType,
|
||||
"functions": fileData.get("functions", functions),
|
||||
"classes": fileData.get("classes", classes),
|
||||
"id": fileStructure.get("id")
|
||||
}
|
||||
|
||||
# Fallback if structure is different
|
||||
return {
|
||||
"filename": filename,
|
||||
"content": parsed.get("content", ""),
|
||||
"fileType": fileType,
|
||||
"functions": functions,
|
||||
"classes": classes,
|
||||
"id": fileStructure.get("id")
|
||||
}
|
||||
|
||||
async def _formatAndValidateCode(self, codeFiles: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
|
||||
"""Format and validate generated code files."""
|
||||
# For now, only strip markdown code fences; files are otherwise returned as-is
|
||||
# TODO: Add code formatting (black, prettier, etc.) and validation
|
||||
formatted = []
|
||||
for file in codeFiles:
|
||||
content = file.get("content", "")
|
||||
# Basic cleanup: remove markdown code fences if present
|
||||
if isinstance(content, str):
|
||||
content = re.sub(r'^```[\w]*\n', '', content, flags=re.MULTILINE)
|
||||
content = re.sub(r'\n```$', '', content, flags=re.MULTILINE)
|
||||
file["content"] = content.strip()
|
||||
formatted.append(file)
|
||||
return formatted
|
||||
|
||||
def _getMimeType(self, fileType: str) -> str:
|
||||
"""Get MIME type for file type."""
|
||||
mimeTypes = {
|
||||
"py": "text/x-python",
|
||||
"js": "application/javascript",
|
||||
"ts": "application/typescript",
|
||||
"html": "text/html",
|
||||
"css": "text/css",
|
||||
"json": "application/json",
|
||||
"txt": "text/plain",
|
||||
"md": "text/markdown",
|
||||
"java": "text/x-java-source",
|
||||
"cpp": "text/x-c++src",
|
||||
"c": "text/x-csrc",
|
||||
"csv": "text/csv",
|
||||
"xml": "application/xml"
|
||||
}
|
||||
return mimeTypes.get(fileType.lower(), "text/plain")
|
||||
|
||||
def _getCodeRenderer(self, fileType: str):
|
||||
"""Get code renderer for file type."""
|
||||
from ..renderers.registry import getRenderer
|
||||
|
||||
# Map file types to renderer formats (code path)
|
||||
formatMap = {
|
||||
'json': 'json',
|
||||
'csv': 'csv',
|
||||
'xml': 'xml'
|
||||
}
|
||||
|
||||
rendererFormat = formatMap.get(fileType.lower())
|
||||
if rendererFormat:
|
||||
renderer = getRenderer(rendererFormat, self.services, outputStyle='code')
|
||||
# Check if renderer supports code rendering
|
||||
if renderer and hasattr(renderer, 'renderCodeFiles'):
|
||||
return renderer
|
||||
|
||||
return None
|
||||
|
|
@ -0,0 +1,214 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Document Generation Path
|
||||
|
||||
Handles document generation using existing chapter/section model.
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
import copy
|
||||
from typing import Dict, Any, List, Optional
|
||||
from modules.datamodels.datamodelWorkflow import AiResponse, AiResponseMetadata, DocumentData
|
||||
from modules.datamodels.datamodelExtraction import ContentPart, DocumentIntent
|
||||
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from modules.workflows.processing.shared.stateTools import checkWorkflowStopped
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class DocumentGenerationPath:
|
||||
"""Document generation path (existing functionality, refactored)."""
|
||||
|
||||
def __init__(self, services):
|
||||
self.services = services
|
||||
|
||||
async def generateDocument(
|
||||
self,
|
||||
userPrompt: str,
|
||||
documentList: Optional[Any] = None, # DocumentReferenceList
|
||||
documentIntents: Optional[List[DocumentIntent]] = None,
|
||||
contentParts: Optional[List[ContentPart]] = None,
|
||||
outputFormat: str = "txt",
|
||||
title: Optional[str] = None,
|
||||
parentOperationId: Optional[str] = None
|
||||
) -> AiResponse:
|
||||
"""
|
||||
Generate document using existing chapter/section model.
|
||||
|
||||
Returns: AiResponse with documents list
|
||||
"""
|
||||
# Create operation ID
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
docOperationId = f"doc_gen_{workflowId}_{int(time.time())}"
|
||||
|
||||
# Start progress tracking
|
||||
self.services.chat.progressLogStart(
|
||||
docOperationId,
|
||||
"Document Generation",
|
||||
"Document Generation",
|
||||
f"Format: {outputFormat}",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
try:
|
||||
# Step 5A: Clarify document intents
|
||||
documents = []
|
||||
if documentList:
|
||||
documents = self.services.chat.getChatDocumentsFromDocumentList(documentList)
|
||||
|
||||
# Filter: remove original documents when pre-extracted JSONs already exist
|
||||
# (to avoid duplicates - pre-extracted JSONs already contain the ContentParts)
|
||||
# Step 1: Identify all original document IDs covered by pre-extracted JSONs
|
||||
originalDocIdsCoveredByPreExtracted = set()
|
||||
for doc in documents:
|
||||
preExtracted = self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc)
|
||||
if preExtracted:
|
||||
originalDocId = preExtracted["originalDocument"]["id"]
|
||||
originalDocIdsCoveredByPreExtracted.add(originalDocId)
|
||||
logger.debug(f"Found pre-extracted JSON {doc.id} covering original document {originalDocId}")
|
||||
|
||||
# Step 2: Filter documents - remove original documents already covered by pre-extracted JSONs
|
||||
filteredDocuments = []
|
||||
for doc in documents:
|
||||
preExtracted = self.services.ai.intentAnalyzer.resolvePreExtractedDocument(doc)
|
||||
if preExtracted:
|
||||
# Keep the pre-extracted JSON
|
||||
filteredDocuments.append(doc)
|
||||
elif doc.id in originalDocIdsCoveredByPreExtracted:
|
||||
# Original document already covered by a pre-extracted JSON - skip it
|
||||
logger.info(f"Skipping original document {doc.id} ({doc.fileName}) - already covered by pre-extracted JSON")
|
||||
else:
|
||||
# Regular document without a pre-extracted JSON - keep it
|
||||
filteredDocuments.append(doc)
|
||||
|
||||
documents = filteredDocuments
|
||||
|
||||
checkWorkflowStopped(self.services)
|
||||
|
||||
if not documentIntents and documents:
|
||||
documentIntents = await self.services.ai.clarifyDocumentIntents(
|
||||
documents,
|
||||
userPrompt,
|
||||
{"outputFormat": outputFormat},
|
||||
docOperationId
|
||||
)
|
||||
|
||||
checkWorkflowStopped(self.services)
|
||||
|
||||
# Step 5B: Extract and prepare content
|
||||
if documents:
|
||||
preparedContentParts = await self.services.ai.extractAndPrepareContent(
|
||||
documents,
|
||||
documentIntents or [],
|
||||
docOperationId
|
||||
)
|
||||
|
||||
# Merge with provided contentParts (if any)
|
||||
if contentParts:
|
||||
# Check for pre-extracted content
|
||||
for part in contentParts:
|
||||
if part.metadata.get("skipExtraction", False):
|
||||
# Already extracted - use as-is, but make sure the metadata is complete
|
||||
part.metadata.setdefault("contentFormat", "extracted")
|
||||
part.metadata.setdefault("isPreExtracted", True)
|
||||
preparedContentParts.extend(contentParts)
|
||||
|
||||
contentParts = preparedContentParts
|
||||
|
||||
# Step 5B.5: Documents are converted to contentParts (like pre-processed JSON files)
|
||||
# No AI extraction here - AI extraction happens during section generation
|
||||
if contentParts:
|
||||
logger.info(f"Using {len(contentParts)} content parts for generation (no AI extraction at this stage)")
|
||||
|
||||
checkWorkflowStopped(self.services)
|
||||
|
||||
# Step 5C: Generate structure
|
||||
structure = await self.services.ai.generateStructure(
|
||||
userPrompt,
|
||||
contentParts or [],
|
||||
outputFormat,
|
||||
docOperationId
|
||||
)
|
||||
|
||||
checkWorkflowStopped(self.services)
|
||||
|
||||
# Step 5D: Fill structure
|
||||
# Language will be extracted from services (user intention analysis) in fillStructure
|
||||
filledStructure = await self.services.ai.fillStructure(
|
||||
structure,
|
||||
contentParts or [],
|
||||
userPrompt,
|
||||
docOperationId
|
||||
)
|
||||
|
||||
checkWorkflowStopped(self.services)
|
||||
|
||||
# Step 5E: Render result
|
||||
# Each document is rendered individually and can return 1..n files (e.g. HTML + images)
|
||||
# Language is already validated in structure (State 3) and preserved in filled structure (State 4)
|
||||
# Per-document language will be extracted in renderReport() from filledStructure
|
||||
# Use validated currentUserLanguage as global fallback (always valid infrastructure)
|
||||
language = getattr(self.services, 'currentUserLanguage', None) or "en"
|
||||
|
||||
# IMPORTANT: Create deep copy BEFORE renderResult to preserve filledStructure with elements
|
||||
# renderResult might modify the structure, so we need to preserve the original for sourceJson
|
||||
# This ensures sourceJson contains the complete structure with elements for validation
|
||||
filledStructureForSourceJson = copy.deepcopy(filledStructure) if filledStructure else None
|
||||
|
||||
renderedDocuments = await self.services.ai.renderResult(
|
||||
filledStructure,
|
||||
outputFormat,
|
||||
language, # Global fallback (per-document language extracted from structure in renderReport)
|
||||
title or "Generated Document",
|
||||
userPrompt,
|
||||
docOperationId
|
||||
)
|
||||
|
||||
# Build response: convert all rendered documents to DocumentData
|
||||
documentDataList = []
|
||||
for renderedDoc in renderedDocuments:
|
||||
try:
|
||||
# Create a DocumentData for each rendered document
|
||||
# Use the preserved filledStructureForSourceJson (with elements) for sourceJson
|
||||
docDataObj = DocumentData(
|
||||
documentName=renderedDoc.filename,
|
||||
documentData=renderedDoc.documentData,
|
||||
mimeType=renderedDoc.mimeType,
|
||||
sourceJson=filledStructureForSourceJson if len(documentDataList) == 0 else None # Only for the first document
|
||||
)
|
||||
documentDataList.append(docDataObj)
|
||||
logger.debug(f"Added rendered document: {renderedDoc.filename} ({len(renderedDoc.documentData)} bytes, {renderedDoc.mimeType})")
|
||||
except Exception as e:
|
||||
logger.warning(f"Error creating document {renderedDoc.filename}: {str(e)}")
|
||||
|
||||
if not documentDataList:
|
||||
raise ValueError("No documents were rendered")
|
||||
|
||||
metadata = AiResponseMetadata(
|
||||
title=title or filledStructure.get("metadata", {}).get("title", "Generated Document"),
|
||||
operationType=OperationTypeEnum.DATA_GENERATE.value
|
||||
)
|
||||
|
||||
# Debug log (harmonized)
|
||||
self.services.utils.writeDebugFile(
|
||||
json.dumps(filledStructure, indent=2, ensure_ascii=False, default=str),
|
||||
"document_generation_response"
|
||||
)
|
||||
|
||||
self.services.chat.progressLogFinish(docOperationId, True)
|
||||
|
||||
return AiResponse(
|
||||
content=json.dumps(filledStructure),
|
||||
metadata=metadata,
|
||||
documents=documentDataList
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in document generation: {str(e)}")
|
||||
self.services.chat.progressLogFinish(docOperationId, False)
|
||||
raise
|
||||
|
||||
|
|
@ -0,0 +1,128 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Image Generation Path
|
||||
|
||||
Handles image generation with support for single and batch generation.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import time
|
||||
from typing import List, Optional
|
||||
from modules.datamodels.datamodelWorkflow import AiResponse, AiResponseMetadata, DocumentData
|
||||
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum, AiCallRequest
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ImageGenerationPath:
|
||||
"""Image generation path."""
|
||||
|
||||
def __init__(self, services):
|
||||
self.services = services
|
||||
|
||||
async def generateImages(
|
||||
self,
|
||||
userPrompt: str,
|
||||
count: int = 1,
|
||||
style: Optional[str] = None,
|
||||
format: str = "png",
|
||||
title: Optional[str] = None,
|
||||
parentOperationId: Optional[str] = None
|
||||
) -> AiResponse:
|
||||
"""
|
||||
Generate image files.
|
||||
|
||||
Returns: AiResponse with image files as documents
|
||||
"""
|
||||
# Create operation ID
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
imageOperationId = f"image_gen_{workflowId}_{int(time.time())}"
|
||||
|
||||
# Start progress tracking
|
||||
self.services.chat.progressLogStart(
|
||||
imageOperationId,
|
||||
"Image Generation",
|
||||
"Image Generation",
|
||||
f"Format: {format}",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
try:
|
||||
self.services.chat.progressLogUpdate(imageOperationId, 0.4, "Calling AI for image generation")
|
||||
|
||||
# Build prompt with style if provided
|
||||
imagePrompt = userPrompt
|
||||
if style:
|
||||
imagePrompt = f"{userPrompt}\n\nStyle: {style}"
|
||||
|
||||
# Use IMAGE_GENERATE operation
|
||||
options = AiCallOptions(
|
||||
operationType=OperationTypeEnum.IMAGE_GENERATE,
|
||||
resultFormat=format
|
||||
)
|
||||
|
||||
request = AiCallRequest(
|
||||
prompt=imagePrompt,
|
||||
context="",
|
||||
options=options
|
||||
)
|
||||
|
||||
response = await self.services.ai.callAi(request)
|
||||
|
||||
if not response.content:
|
||||
errorMsg = f"No image data returned: {response.content}"
|
||||
logger.error(f"Error in AI image generation: {errorMsg}")
|
||||
self.services.chat.progressLogFinish(imageOperationId, False)
|
||||
raise ValueError(errorMsg)
|
||||
|
||||
# Handle response content (could be base64 string or bytes)
|
||||
imageData = response.content
|
||||
if isinstance(imageData, str):
|
||||
# Assume base64 encoded string
|
||||
import base64
|
||||
try:
|
||||
imageData = base64.b64decode(imageData)
|
||||
except Exception:
|
||||
# If not base64, try encoding as bytes
|
||||
imageData = imageData.encode('utf-8')
|
||||
elif not isinstance(imageData, bytes):
|
||||
imageData = bytes(imageData)
|
||||
|
||||
# Create document
|
||||
imageDoc = DocumentData(
|
||||
documentName=f"generated_image.{format}",
|
||||
documentData=imageData,
|
||||
mimeType=f"image/{format}"
|
||||
)
|
||||
|
||||
metadata = AiResponseMetadata(
|
||||
title=title or "Generated Image",
|
||||
operationType=OperationTypeEnum.IMAGE_GENERATE.value
|
||||
)
|
||||
|
||||
# Note: Stats are now stored centrally in callAi() - no need to duplicate here
|
||||
|
||||
self.services.chat.progressLogUpdate(imageOperationId, 0.9, "Image generated")
|
||||
self.services.chat.progressLogFinish(imageOperationId, True)
|
||||
|
||||
# Create content string describing the image generation
|
||||
import json
|
||||
contentJson = json.dumps({
|
||||
"type": "image",
|
||||
"format": format,
|
||||
"prompt": userPrompt,
|
||||
"filename": imageDoc.documentName
|
||||
}, ensure_ascii=False)
|
||||
|
||||
return AiResponse(
|
||||
content=contentJson, # JSON string describing the image generation
|
||||
metadata=metadata,
|
||||
documents=[imageDoc]
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in image generation: {str(e)}")
|
||||
self.services.chat.progressLogFinish(imageOperationId, False)
|
||||
raise
|
||||
|
||||
|
|
@ -0,0 +1,45 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Base renderer class for code format renderers.
|
||||
"""
|
||||
|
||||
from abc import abstractmethod
|
||||
from .documentRendererBaseTemplate import BaseRenderer
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List, Optional
|
||||
import logging
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class BaseCodeRenderer(BaseRenderer):
|
||||
"""Base class for code format renderers."""
|
||||
|
||||
@abstractmethod
|
||||
async def renderCodeFiles(
|
||||
self,
|
||||
codeFiles: List[Dict[str, Any]],
|
||||
metadata: Dict[str, Any],
|
||||
userPrompt: Optional[str] = None
|
||||
) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render code files to format-specific output.
|
||||
|
||||
Args:
|
||||
codeFiles: List of file dictionaries with:
|
||||
- filename: str
|
||||
- fileType: str (json, csv, xml, etc.)
|
||||
- content: str (generated code)
|
||||
- id: str (optional)
|
||||
metadata: Project metadata (language, projectType, etc.)
|
||||
userPrompt: Original user prompt
|
||||
|
||||
Returns:
|
||||
List of RenderedDocument objects (can be 1..n files)
|
||||
"""
|
||||
pass
|
||||
|
||||
def _validateCodeFile(self, codeFile: Dict[str, Any]) -> bool:
|
||||
"""Validate code file structure."""
|
||||
required = ['filename', 'fileType', 'content']
|
||||
return all(key in codeFile for key in required)
|
||||
|
|
@ -0,0 +1,484 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Base renderer class for all format renderers.
|
||||
"""
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Dict, Any, List, Tuple, Optional
|
||||
from modules.datamodels.datamodelJson import supportedSectionTypes
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
from datetime import datetime, UTC
|
||||
import base64
|
||||
import io
|
||||
from PIL import Image
|
||||
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class BaseRenderer(ABC):
|
||||
"""Base class for all format renderers."""
|
||||
|
||||
def __init__(self, services=None):
|
||||
self.logger = logger
|
||||
self.services = services # Add services attribute
|
||||
|
||||
@classmethod
|
||||
def getSupportedFormats(cls) -> List[str]:
|
||||
"""
|
||||
Return list of supported format names for this renderer.
|
||||
Override this method in subclasses to specify supported formats.
|
||||
"""
|
||||
return []
|
||||
|
||||
@classmethod
|
||||
def getFormatAliases(cls) -> List[str]:
|
||||
"""
|
||||
Return list of format aliases for this renderer.
|
||||
Override this method in subclasses to specify format aliases.
|
||||
"""
|
||||
return []
|
||||
|
||||
@classmethod
|
||||
def getPriority(cls) -> int:
|
||||
"""
|
||||
Return priority for this renderer (higher number = higher priority).
|
||||
Used when multiple renderers support the same format.
|
||||
"""
|
||||
return 0
|
||||
|
||||
@classmethod
|
||||
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
|
||||
"""
|
||||
Return the output style classification for this renderer.
|
||||
Returns: 'code', 'document', 'image', or other (e.g., 'video' for future use)
|
||||
Override this method in subclasses to specify the output style.
|
||||
|
||||
Args:
|
||||
formatName: Optional format name (e.g., 'txt', 'js', 'csv') - useful for renderers
|
||||
that handle multiple formats with different styles (e.g., RendererText)
|
||||
"""
|
||||
return 'document' # Default to document style
|
||||
|
||||
@classmethod
|
||||
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
|
||||
"""
|
||||
Return list of section content types that this renderer accepts.
|
||||
This allows renderers to declare which section types they can process.
|
||||
|
||||
Default implementation returns all supported section types.
|
||||
Override this method in subclasses to restrict accepted types.
|
||||
|
||||
Args:
|
||||
formatName: Optional format name (e.g., 'txt', 'js', 'csv') - useful for renderers
|
||||
that handle multiple formats with different accepted types (e.g., RendererText)
|
||||
|
||||
Returns:
|
||||
List of accepted section content types (e.g., ["table", "paragraph", "heading"])
|
||||
Valid types: "table", "bullet_list", "heading", "paragraph", "code_block", "image"
|
||||
"""
|
||||
# Default: accept all section types
|
||||
return list(supportedSectionTypes)
|
||||
|
||||
@abstractmethod
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: Optional[str] = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render extracted JSON content to multiple documents.
|
||||
Each renderer must implement this method.
|
||||
Can return 1..n documents (e.g., HTML + images).
|
||||
|
||||
Args:
|
||||
extractedContent: Structured JSON content with sections and metadata (contains single document)
|
||||
title: Report title
|
||||
userPrompt: Original user prompt for context
|
||||
aiService: AI service instance for additional processing
|
||||
|
||||
Returns:
|
||||
List of RenderedDocument objects.
|
||||
First document is the main document, additional documents are supporting files (e.g., images).
|
||||
Even if only one document is returned, it must be wrapped in a list.
|
||||
"""
|
||||
pass
|
||||
|
||||
def _determineFilename(self, title: str, mimeType: str) -> str:
|
||||
"""Determine filename from title and mimeType."""
|
||||
import re
|
||||
# Get extension from mimeType
|
||||
extensionMap = {
|
||||
"text/html": "html",
|
||||
"application/pdf": "pdf",
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
|
||||
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
|
||||
"text/plain": "txt",
|
||||
"text/markdown": "md",
|
||||
"application/json": "json",
|
||||
"text/csv": "csv"
|
||||
}
|
||||
extension = extensionMap.get(mimeType, "txt")
|
||||
|
||||
# Sanitize title for filename
|
||||
sanitized = re.sub(r"[^a-zA-Z0-9._-]", "_", title)
|
||||
sanitized = re.sub(r"_+", "_", sanitized).strip("_")
|
||||
if not sanitized:
|
||||
sanitized = "document"
|
||||
|
||||
return f"{sanitized}.{extension}"
|
||||
|
||||
def _extractSections(self, reportData: Dict[str, Any]) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Extract sections from standardized schema: {metadata: {...}, documents: [{sections: [...]}]}
|
||||
Phase 5: Supports multiple documents - extracts all sections from all documents.
|
||||
"""
|
||||
if "documents" not in reportData:
|
||||
raise ValueError("Report data must follow standardized schema with 'documents' array")
|
||||
|
||||
documents = reportData.get("documents", [])
|
||||
if not isinstance(documents, list) or len(documents) == 0:
|
||||
raise ValueError("Standardized schema must contain at least one document in 'documents' array")
|
||||
|
||||
# Phase 5: Extract sections from ALL documents
|
||||
all_sections = []
|
||||
for doc in documents:
|
||||
if isinstance(doc, dict) and "sections" in doc:
|
||||
sections = doc.get("sections", [])
|
||||
if isinstance(sections, list):
|
||||
all_sections.extend(sections)
|
||||
|
||||
if not all_sections:
|
||||
raise ValueError("No sections found in any document")
|
||||
|
||||
return all_sections
|
||||
|
||||
def _extractMetadata(self, reportData: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""
|
||||
Extract metadata from standardized schema: {metadata: {...}, documents: [{sections: [...]}]}
|
||||
"""
|
||||
if "metadata" not in reportData:
|
||||
raise ValueError("Report data must follow standardized schema with 'metadata' field")
|
||||
|
||||
metadata = reportData.get("metadata", {})
|
||||
if not isinstance(metadata, dict):
|
||||
raise ValueError("Metadata in standardized schema must be a dictionary")
|
||||
|
||||
return metadata
|
||||
|
||||
def _getTitle(self, reportData: Dict[str, Any], fallbackTitle: str) -> str:
|
||||
"""Get title from report data or use fallback."""
|
||||
metadata = reportData.get('metadata', {})
|
||||
return metadata.get('title', fallbackTitle)
|
||||
|
||||
def _validateJsonStructure(self, jsonContent: Dict[str, Any]) -> bool:
|
||||
"""
|
||||
Validate that JSON content follows standardized schema: {metadata: {...}, documents: [{sections: [...]}]}
|
||||
"""
|
||||
if not isinstance(jsonContent, dict):
|
||||
return False
|
||||
|
||||
# Validate metadata field exists
|
||||
if "metadata" not in jsonContent:
|
||||
return False
|
||||
|
||||
if not isinstance(jsonContent.get("metadata"), dict):
|
||||
return False
|
||||
|
||||
# Validate documents array exists and is not empty
|
||||
if "documents" not in jsonContent:
|
||||
return False
|
||||
|
||||
documents = jsonContent.get("documents", [])
|
||||
if not isinstance(documents, list) or len(documents) == 0:
|
||||
return False
|
||||
|
||||
# Validate first document has sections
|
||||
firstDoc = documents[0]
|
||||
if not isinstance(firstDoc, dict) or "sections" not in firstDoc:
|
||||
return False
|
||||
|
||||
sections = firstDoc.get("sections", [])
|
||||
if not isinstance(sections, list):
|
||||
return False
|
||||
|
||||
# Validate each section has content_type and elements
|
||||
for section in sections:
|
||||
if not isinstance(section, dict):
|
||||
return False
|
||||
if "content_type" not in section or "elements" not in section:
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
def _getSectionType(self, section: Dict[str, Any]) -> str:
|
||||
"""Get the type of a section; default to 'paragraph' for non-dict inputs."""
|
||||
if isinstance(section, dict):
|
||||
return section.get("content_type", "paragraph")
|
||||
# If section is a list or any other type, treat as paragraph elements
|
||||
return "paragraph"
|
||||
|
||||
def _getSectionData(self, section: Dict[str, Any]) -> List[Dict[str, Any]]:
|
||||
"""Get the elements of a section; if a list is provided directly, return it."""
|
||||
if isinstance(section, dict):
|
||||
return section.get("elements", [])
|
||||
if isinstance(section, list):
|
||||
return section
|
||||
return []
|
||||
|
||||
def _getSectionId(self, section: Dict[str, Any]) -> str:
|
||||
"""Get the ID of a section (if available)."""
|
||||
if isinstance(section, dict):
|
||||
return section.get("id", "unknown")
|
||||
return "unknown"
|
||||
|
||||
def _validateImageData(self, base64Data: str, altText: str) -> bool:
|
||||
"""Validate image data."""
|
||||
if not base64Data:
|
||||
self.logger.warning("Image section has no base64 data")
|
||||
return False
|
||||
|
||||
if not altText:
|
||||
self.logger.warning("Image section has no alt text")
|
||||
return False
|
||||
|
||||
# Basic base64 validation
|
||||
try:
|
||||
base64.b64decode(base64Data, validate=True)
|
||||
return True
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Invalid base64 image data: {str(e)}")
|
||||
return False
|
||||
|
||||
def _getImageDimensions(self, base64Data: str) -> Tuple[int, int]:
|
||||
"""
|
||||
Get image dimensions from base64 data.
|
||||
This is a helper method that format-specific renderers can use.
|
||||
"""
|
||||
try:
|
||||
# Decode base64 data
|
||||
imageData = base64.b64decode(base64Data)
|
||||
image = Image.open(io.BytesIO(imageData))
|
||||
|
||||
return image.size # Returns (width, height)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Could not determine image dimensions: {str(e)}")
|
||||
return (0, 0)
|
||||
|
||||
def _resizeImageIfNeeded(self, base64Data: str, maxWidth: int = 800, maxHeight: int = 600) -> str:
|
||||
"""
|
||||
Resize image if it exceeds maximum dimensions.
|
||||
Returns the resized image as base64 string.
|
||||
"""
|
||||
try:
|
||||
# Decode base64 data
|
||||
imageData = base64.b64decode(base64Data)
|
||||
image = Image.open(io.BytesIO(imageData))
|
||||
|
||||
# Check if resizing is needed
|
||||
width, height = image.size
|
||||
if width <= maxWidth and height <= maxHeight:
|
||||
return base64Data # No resizing needed
|
||||
|
||||
# Calculate new dimensions maintaining aspect ratio
|
||||
ratio = min(maxWidth / width, maxHeight / height)
|
||||
newWidth = int(width * ratio)
|
||||
newHeight = int(height * ratio)
|
||||
|
||||
# Resize image
|
||||
resizedImage = image.resize((newWidth, newHeight), Image.Resampling.LANCZOS)
|
||||
|
||||
# Convert back to base64
|
||||
buffer = io.BytesIO()
|
||||
resizedImage.save(buffer, format=image.format or 'PNG')
|
||||
resizedData = buffer.getvalue()
|
||||
|
||||
return base64.b64encode(resizedData).decode('utf-8')
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Could not resize image: {str(e)}")
|
||||
return base64Data # Return original if resize fails
|
||||
|
||||
def _getSupportedSectionTypes(self) -> List[str]:
|
||||
"""Return list of supported section types (from unified schema)."""
|
||||
return supportedSectionTypes
|
||||
|
||||
def _isValidSectionType(self, sectionType: str) -> bool:
|
||||
"""Check if a section type is valid."""
|
||||
return sectionType in self._getSupportedSectionTypes()
|
||||
|
||||
def _formatTimestamp(self, timestamp: Optional[str] = None) -> str:
|
||||
"""Format timestamp for display."""
|
||||
if timestamp:
|
||||
return timestamp
|
||||
return datetime.now(UTC).strftime("%Y-%m-%d %H:%M:%S UTC")
|
||||
|
||||
# ===== GENERIC AI STYLING HELPERS =====
|
||||
|
||||
async def _getAiStyles(self, aiService, styleTemplate: str, defaultStyles: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""
|
||||
Generic AI styling method that can be used by all renderers.
|
||||
|
||||
Args:
|
||||
aiService: AI service instance
|
||||
styleTemplate: Format-specific style template
|
||||
defaultStyles: Default styles to fall back to
|
||||
|
||||
Returns:
|
||||
Dict with styling definitions
|
||||
"""
|
||||
# DEBUG: Show which renderer is calling this method
|
||||
|
||||
if not aiService:
|
||||
return defaultStyles
|
||||
|
||||
try:
|
||||
|
||||
requestOptions = AiCallOptions()
|
||||
requestOptions.operationType = OperationTypeEnum.DATA_GENERATE
|
||||
|
||||
request = AiCallRequest(prompt=styleTemplate, context="", options=requestOptions)
|
||||
|
||||
# DEBUG: Show the actual prompt being sent to AI
|
||||
self.logger.debug("AI Style Template Prompt:")
|
||||
self.logger.debug("%s", styleTemplate)
|
||||
|
||||
response = await aiService.callAi(request)
|
||||
|
||||
# Save styling prompt and response to debug (fire and forget - don't block on slow file I/O)
|
||||
# The writeDebugFile calls os.listdir() which can be slow with many files
|
||||
# Run in background thread to avoid blocking rendering
|
||||
import threading
|
||||
def _writeDebugFiles():
|
||||
try:
|
||||
self.services.utils.writeDebugFile(styleTemplate, "renderer_styling_prompt")
|
||||
self.services.utils.writeDebugFile(response.content or '', "renderer_styling_response")
|
||||
except Exception:
|
||||
pass # Silently fail - debug writing should never block rendering
|
||||
|
||||
threading.Thread(target=_writeDebugFiles, daemon=True).start()
|
||||
|
||||
# Clean and parse JSON
|
||||
result = response.content.strip() if response and response.content else ""
|
||||
|
||||
# Check if result is empty
|
||||
if not result:
|
||||
self.logger.warning("AI styling returned empty response, using defaults")
|
||||
return defaultStyles
|
||||
|
||||
# Extract JSON from markdown if present
|
||||
jsonMatch = re.search(r'```json\s*\n(.*?)\n```', result, re.DOTALL)
|
||||
if jsonMatch:
|
||||
result = jsonMatch.group(1).strip()
|
||||
elif result.startswith('```json'):
|
||||
result = re.sub(r'^```json\s*', '', result)
|
||||
result = re.sub(r'\s*```$', '', result)
|
||||
elif result.startswith('```'):
|
||||
result = re.sub(r'^```\s*', '', result)
|
||||
result = re.sub(r'\s*```$', '', result)
|
||||
|
||||
# Try to parse JSON
|
||||
try:
|
||||
styles = json.loads(result)
|
||||
except json.JSONDecodeError as jsonError:
|
||||
self.logger.warning(f"AI styling returned invalid JSON: {jsonError}")
|
||||
|
||||
# Write the full response to the debug log file, since the regular logger may truncate it
|
||||
self.services.utils.debugLogToFile(f"FULL AI RESPONSE THAT FAILED TO PARSE: {result}", "RENDERER")
|
||||
self.services.utils.debugLogToFile(f"RESPONSE LENGTH: {len(result)} characters", "RENDERER")
|
||||
|
||||
self.logger.warning(f"Raw content that failed to parse: {result}")
|
||||
|
||||
# Try to fix incomplete JSON by adding missing closing braces
|
||||
openBraces = result.count('{')
|
||||
closeBraces = result.count('}')
|
||||
|
||||
if openBraces > closeBraces:
|
||||
# JSON is incomplete, add missing closing braces
|
||||
missingBraces = openBraces - closeBraces
|
||||
result = result + '}' * missingBraces
|
||||
self.logger.info(f"Added {missingBraces} missing closing brace(s)")
|
||||
self.logger.debug(f"Fixed JSON: {result}")
|
||||
|
||||
# Try parsing the fixed JSON
|
||||
try:
|
||||
styles = json.loads(result)
|
||||
self.logger.info("Successfully fixed incomplete JSON")
|
||||
except json.JSONDecodeError as fixError:
|
||||
self.logger.warning(f"Fixed JSON still invalid: {fixError}")
|
||||
self.logger.warning(f"Fixed JSON content: {result}")
|
||||
# Try to extract just the JSON part if it's embedded in text
|
||||
jsonStart = result.find('{')
|
||||
jsonEnd = result.rfind('}')
|
||||
if jsonStart != -1 and jsonEnd != -1 and jsonEnd > jsonStart:
|
||||
jsonPart = result[jsonStart:jsonEnd+1]
|
||||
try:
|
||||
styles = json.loads(jsonPart)
|
||||
self.logger.info("Successfully extracted JSON from explanatory text")
|
||||
except json.JSONDecodeError:
|
||||
self.logger.warning("Could not extract valid JSON from response, using defaults")
|
||||
return defaultStyles
|
||||
else:
|
||||
return defaultStyles
|
||||
else:
|
||||
# Try to extract just the JSON part if it's embedded in text
|
||||
jsonStart = result.find('{')
|
||||
jsonEnd = result.rfind('}')
|
||||
if jsonStart != -1 and jsonEnd != -1 and jsonEnd > jsonStart:
|
||||
jsonPart = result[jsonStart:jsonEnd+1]
|
||||
try:
|
||||
styles = json.loads(jsonPart)
|
||||
self.logger.info("Successfully extracted JSON from explanatory text")
|
||||
except json.JSONDecodeError:
|
||||
self.logger.warning("Could not extract valid JSON from response, using defaults")
|
||||
return defaultStyles
|
||||
else:
|
||||
return defaultStyles
|
||||
|
||||
# Convert colors to appropriate format
|
||||
styles = self._convertColorsFormat(styles)
|
||||
|
||||
return styles
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"AI styling failed: {str(e)}, using defaults")
|
||||
return defaultStyles
|
||||
|
||||
def _convertColorsFormat(self, styles: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""
|
||||
Convert colors to appropriate format based on renderer type.
|
||||
Override this method in subclasses for format-specific color handling.
|
||||
"""
|
||||
return styles
|
||||
|
||||
def _createAiStyleTemplate(self, formatName: str, userPrompt: str, styleSchema: Dict[str, Any]) -> str:
|
||||
"""
|
||||
Create a standardized AI style template for any format.
|
||||
|
||||
Args:
|
||||
formatName: Name of the format (e.g., "docx", "xlsx", "pptx")
|
||||
userPrompt: User's original prompt
|
||||
styleSchema: Format-specific style schema
|
||||
|
||||
Returns:
|
||||
Formatted prompt string
|
||||
"""
|
||||
schemaJson = json.dumps(styleSchema, indent=4)
|
||||
|
||||
# The serialized schema is embedded verbatim in the prompt below
|
||||
|
||||
return f"""You are a professional document styling expert. Generate a complete JSON styling configuration for {formatName.upper()} documents.
|
||||
|
||||
User request: {userPrompt}
|
||||
|
||||
Use this schema as a template:
|
||||
{schemaJson}
|
||||
|
||||
Requirements:
|
||||
- Return ONLY the complete JSON object (no markdown, no explanations)
|
||||
- If the user request contains style/formatting/design instructions (in any language), customize the styling accordingly (adapt styles and add styles if needed)
|
||||
- If the user request has NO style instructions, return the default schema values unchanged
|
||||
- Ensure all objects are properly closed with closing braces
|
||||
- Only modify styles if style instructions are present in the user request
|
||||
|
||||
Return the complete JSON:"""
|
||||
|
|
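Note on `_getAiStyles` above: the JSON recovery steps (strip a markdown fence, append missing closing braces, slice from the first `{` to the last `}`) could be collected into one standalone helper. The following is a minimal sketch under that assumption, not the code in this commit. `parseStyleJson` is a hypothetical name, and like the original it counts braces naively, so braces inside string values can mislead it.

```python
import json
import re
from typing import Any, Dict, Optional

# Matches an optional ```json ... ``` (or plain ``` ... ```) fence around the payload.
_FENCE = re.compile(r"```(?:json)?\s*\n?(.*?)\n?\s*```", re.DOTALL)


def parseStyleJson(raw: Optional[str]) -> Optional[Dict[str, Any]]:
    """Return the parsed JSON object, or None so the caller can use its defaults."""
    text = (raw or "").strip()
    if not text:
        return None

    # Step 1: unwrap a markdown code fence if the model added one.
    match = _FENCE.search(text)
    if match:
        text = match.group(1).strip()

    # Step 2: build repair candidates, cheapest first.
    candidates = [text]
    missing = text.count("{") - text.count("}")
    if missing > 0:
        # Truncated output: append the closing braces that are missing.
        candidates.append(text + "}" * missing)
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        # JSON embedded in explanatory text: keep only the braced part.
        candidates.append(text[start:end + 1])

    # Step 3: the first candidate that parses to an object wins.
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

A caller would then write something like `styles = parseStyleJson(response.content) or defaultStyles`, which keeps the fallback in one place instead of in each `except` branch.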
@ -0,0 +1,238 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Renderer registry for automatic discovery and registration of renderers.
|
||||
|
||||
Renderers are indexed by (format, outputStyle) so that document generation
|
||||
and code generation each get the correct renderer for the same format.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import importlib
|
||||
from typing import Dict, Type, List, Optional, Tuple
|
||||
from .documentRendererBaseTemplate import BaseRenderer
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class RendererRegistry:
|
||||
"""Registry for automatic renderer discovery and management.
|
||||
|
||||
Maintains separate renderer mappings per outputStyle ('document', 'code', etc.)
|
||||
so that document-generation and code-generation paths each resolve to the
|
||||
correct renderer, even when both support the same format (e.g. 'csv').
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
# Key: (formatName, outputStyle) -> rendererClass
|
||||
self._renderers: Dict[Tuple[str, str], Type[BaseRenderer]] = {}
|
||||
self._format_mappings: Dict[str, str] = {}
|
||||
self._discovered = False
|
||||
|
||||
def discoverRenderers(self) -> None:
|
||||
"""Automatically discover and register all renderers by scanning files."""
|
||||
if self._discovered:
|
||||
return
|
||||
|
||||
try:
|
||||
from pathlib import Path
|
||||
|
||||
currentDir = Path(__file__).parent
|
||||
packageName = __name__.rsplit('.', 1)[0]
|
||||
|
||||
for filePath in currentDir.glob("*.py"):
|
||||
if filePath.name in ['registry.py', 'documentRendererBaseTemplate.py', 'codeRendererBaseTemplate.py', '__init__.py']:
|
||||
continue
|
||||
|
||||
moduleName = filePath.stem
|
||||
|
||||
try:
|
||||
fullModuleName = f"{packageName}.{moduleName}"
|
||||
module = importlib.import_module(fullModuleName)
|
||||
|
||||
for attrName in dir(module):
|
||||
attr = getattr(module, attrName)
|
||||
if (isinstance(attr, type) and
|
||||
issubclass(attr, BaseRenderer) and
|
||||
attr != BaseRenderer and
|
||||
hasattr(attr, 'getSupportedFormats')):
|
||||
self._registerRendererClass(attr)
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not load renderer from {moduleName}: {str(e)}")
|
||||
continue
|
||||
|
||||
self._discovered = True
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error during renderer discovery: {str(e)}")
|
||||
self._discovered = True
|
||||
|
||||
def _registerRendererClass(self, rendererClass: Type[BaseRenderer]) -> None:
|
||||
"""Register a renderer class keyed by (format, outputStyle)."""
|
||||
try:
|
||||
supportedFormats = rendererClass.getSupportedFormats()
|
||||
outputStyle = rendererClass.getOutputStyle() if hasattr(rendererClass, 'getOutputStyle') else 'document'
|
||||
priority = rendererClass.getPriority() if hasattr(rendererClass, 'getPriority') else 0
|
||||
|
||||
for formatName in supportedFormats:
|
||||
formatKey = formatName.lower()
|
||||
registryKey = (formatKey, outputStyle)
|
||||
|
||||
if registryKey in self._renderers:
|
||||
existingRenderer = self._renderers[registryKey]
|
||||
existingPriority = existingRenderer.getPriority() if hasattr(existingRenderer, 'getPriority') else 0
|
||||
|
||||
if priority > existingPriority:
|
||||
logger.debug(f"Replacing {existingRenderer.__name__} with {rendererClass.__name__} for ({formatKey}, {outputStyle}) (priority {priority} > {existingPriority})")
|
||||
self._renderers[registryKey] = rendererClass
|
||||
else:
|
||||
logger.debug(f"Keeping {existingRenderer.__name__} for ({formatKey}, {outputStyle}) (priority {existingPriority} >= {priority})")
|
||||
else:
|
||||
self._renderers[registryKey] = rendererClass
|
||||
|
||||
# Register aliases (each alias maps to the last format in supportedFormats)
|
||||
if hasattr(rendererClass, 'getFormatAliases'):
|
||||
aliases = rendererClass.getFormatAliases()
|
||||
for alias in aliases:
|
||||
self._format_mappings[alias.lower()] = formatKey
|
||||
|
||||
logger.debug(f"Registered {rendererClass.__name__} for formats={supportedFormats}, style={outputStyle}, priority={priority}")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error registering renderer {rendererClass.__name__}: {str(e)}")
|
||||
|
||||
def getRenderer(self, outputFormat: str, services=None, outputStyle: Optional[str] = None) -> Optional[BaseRenderer]:
|
||||
"""Get a renderer instance for the specified format and style.
|
||||
|
||||
Args:
|
||||
outputFormat: Format name (e.g. 'csv', 'json', 'pdf')
|
||||
services: Services instance passed to renderer constructor
|
||||
outputStyle: 'document' or 'code'. If None, returns the first match
|
||||
with preference: document > code (most callers are document path).
|
||||
"""
|
||||
if not self._discovered:
|
||||
self.discoverRenderers()
|
||||
|
||||
formatName = outputFormat.lower().strip()
|
||||
if formatName in self._format_mappings:
|
||||
formatName = self._format_mappings[formatName]
|
||||
|
||||
rendererClass = None
|
||||
|
||||
if outputStyle:
|
||||
# Exact match by style
|
||||
rendererClass = self._renderers.get((formatName, outputStyle))
|
||||
else:
|
||||
# No style specified — prefer 'document', then 'code', then any
|
||||
for style in ['document', 'code']:
|
||||
rendererClass = self._renderers.get((formatName, style))
|
||||
if rendererClass:
|
||||
break
|
||||
# Fallback: check any registered style
|
||||
if not rendererClass:
|
||||
for key, cls in self._renderers.items():
|
||||
if key[0] == formatName:
|
||||
rendererClass = cls
|
||||
break
|
||||
|
||||
if rendererClass:
|
||||
try:
|
||||
return rendererClass(services=services)
|
||||
except Exception as e:
|
||||
logger.error(f"Error creating renderer instance for {formatName}: {str(e)}")
|
||||
return None
|
||||
|
||||
logger.warning(f"No renderer found for format={outputFormat}, style={outputStyle}")
|
||||
return None
|
||||
|
||||
def getSupportedFormats(self) -> List[str]:
|
||||
"""Get list of all supported formats."""
|
||||
if not self._discovered:
|
||||
self.discoverRenderers()
|
||||
|
||||
formats = set()
|
||||
for (fmt, _style) in self._renderers.keys():
|
||||
formats.add(fmt)
|
||||
formats.update(self._format_mappings.keys())
|
||||
return sorted(formats)
|
||||
|
||||
def getRendererInfo(self) -> Dict[str, Dict[str, str]]:
|
||||
"""Get information about all registered renderers."""
|
||||
if not self._discovered:
|
||||
self.discoverRenderers()
|
||||
|
||||
info = {}
|
||||
for (formatName, style), rendererClass in self._renderers.items():
|
||||
key = f"{formatName}:{style}"
|
||||
info[key] = {
|
||||
'class_name': rendererClass.__name__,
|
||||
'module': rendererClass.__module__,
|
||||
'outputStyle': style,
|
||||
'description': rendererClass.__doc__.strip().split('\n')[0] if rendererClass.__doc__ else 'No description'
|
||||
}
|
||||
|
||||
return info
|
||||
|
||||
def getOutputStyle(self, outputFormat: str) -> Optional[str]:
|
||||
"""
|
||||
Get the output style classification for a given format.
|
||||
When both 'document' and 'code' renderers exist for a format,
|
||||
returns the default ('document') since this is called during document generation.
|
||||
"""
|
||||
if not self._discovered:
|
||||
self.discoverRenderers()
|
||||
|
||||
formatName = outputFormat.lower().strip()
|
||||
if formatName in self._format_mappings:
|
||||
formatName = self._format_mappings[formatName]
|
||||
|
||||
# Check document first, then code
|
||||
for style in ['document', 'code']:
|
||||
rendererClass = self._renderers.get((formatName, style))
|
||||
if rendererClass:
|
||||
try:
|
||||
return rendererClass.getOutputStyle(formatName)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Fallback: any style
|
||||
for key, rendererClass in self._renderers.items():
|
||||
if key[0] == formatName:
|
||||
try:
|
||||
return rendererClass.getOutputStyle(formatName)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
logger.warning(f"No renderer found for format: {outputFormat}, cannot determine output style")
|
||||
return None
|
||||
|
||||
|
||||
# Global registry instance
|
||||
_registry = RendererRegistry()
|
||||
|
||||
|
||||
def getRenderer(outputFormat: str, services=None, outputStyle: Optional[str] = None) -> Optional[BaseRenderer]:
|
||||
"""Get a renderer instance for the specified format and style.
|
||||
|
||||
Args:
|
||||
outputFormat: Format name (e.g. 'csv', 'json', 'pdf')
|
||||
services: Services instance
|
||||
outputStyle: 'document' or 'code'. If None, prefers document renderer.
|
||||
"""
|
||||
return _registry.getRenderer(outputFormat, services, outputStyle=outputStyle)
|
||||
|
||||
|
||||
def getSupportedFormats() -> List[str]:
|
||||
"""Get list of all supported formats."""
|
||||
return _registry.getSupportedFormats()
|
||||
|
||||
|
||||
def getRendererInfo() -> Dict[str, Dict[str, str]]:
|
||||
"""Get information about all registered renderers."""
|
||||
return _registry.getRendererInfo()
|
||||
|
||||
|
||||
def getOutputStyle(outputFormat: str) -> Optional[str]:
|
||||
"""Get the output style classification for a given format."""
|
||||
return _registry.getOutputStyle(outputFormat)
|
||||
|
|
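Note on the registry above: the lookup rule (key by `(format, outputStyle)`, let the higher priority win within one key, and prefer `'document'` over `'code'` when no style is given) can be illustrated in isolation. This is a minimal sketch with hypothetical stand-in classes (`SketchRenderer`, `SketchRegistry`, `DocCsv`, `CodeCsv`, `BetterDocCsv`). It leaves out file discovery and aliases and is not the commit's `RendererRegistry`.

```python
from typing import Dict, Optional, Tuple, Type


class SketchRenderer:
    """Hypothetical stand-in for BaseRenderer: class attributes instead of classmethods."""
    formats: Tuple[str, ...] = ()
    style: str = "document"
    priority: int = 0


class DocCsv(SketchRenderer):
    formats, style, priority = ("csv",), "document", 70


class CodeCsv(SketchRenderer):
    formats, style, priority = ("csv",), "code", 75


class BetterDocCsv(SketchRenderer):
    formats, style, priority = ("csv",), "document", 90


class SketchRegistry:
    def __init__(self) -> None:
        # Key: (formatName, outputStyle) -> renderer class
        self._renderers: Dict[Tuple[str, str], Type[SketchRenderer]] = {}

    def register(self, cls: Type[SketchRenderer]) -> None:
        for fmt in cls.formats:
            key = (fmt.lower(), cls.style)
            existing = self._renderers.get(key)
            # Priority is only compared between renderers sharing the same key.
            if existing is None or cls.priority > existing.priority:
                self._renderers[key] = cls

    def resolve(self, fmt: str, style: Optional[str] = None) -> Optional[Type[SketchRenderer]]:
        key = fmt.lower().strip()
        if style:
            return self._renderers.get((key, style))
        for preferred in ("document", "code"):
            cls = self._renderers.get((key, preferred))
            if cls is not None:
                return cls
        # Fallback: any other registered style for this format.
        return next((c for (f, _s), c in self._renderers.items() if f == key), None)


registry = SketchRegistry()
for rendererClass in (DocCsv, CodeCsv, BetterDocCsv):
    registry.register(rendererClass)
```

Here `CodeCsv` (75) never displaces `DocCsv` (70) because the two live under different keys. Only `BetterDocCsv` (90) replaces `DocCsv`.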
@ -0,0 +1,159 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
CSV code renderer for code generation.
|
||||
"""
|
||||
|
||||
from .codeRendererBaseTemplate import BaseCodeRenderer
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List, Optional
|
||||
import csv
|
||||
import io
|
||||
|
||||
class RendererCodeCsv(BaseCodeRenderer):
|
||||
"""Renders CSV code files."""
|
||||
|
||||
@classmethod
|
||||
def getSupportedFormats(cls) -> List[str]:
|
||||
"""Return supported CSV formats."""
|
||||
return ['csv']
|
||||
|
||||
@classmethod
|
||||
def getFormatAliases(cls) -> List[str]:
|
||||
"""Return format aliases."""
|
||||
return []
|
||||
|
||||
@classmethod
|
||||
def getPriority(cls) -> int:
|
||||
"""Return priority for CSV code renderer."""
|
||||
return 75 # Compared only with other 'code' renderers for 'csv'; the document renderer (70) is registered under a separate (format, outputStyle) key
|
||||
|
||||
@classmethod
|
||||
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
|
||||
"""Return output style classification: CSV requires specific structure."""
|
||||
return 'code'
|
||||
|
||||
async def renderCodeFiles(
|
||||
self,
|
||||
codeFiles: List[Dict[str, Any]],
|
||||
metadata: Dict[str, Any],
|
||||
userPrompt: Optional[str] = None
|
||||
) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render CSV code files.
|
||||
For single file: output as-is (validate structure)
|
||||
For multiple files: output separately (each is independent CSV)
|
||||
"""
|
||||
renderedDocs = []
|
||||
|
||||
for codeFile in codeFiles:
|
||||
if not self._validateCodeFile(codeFile):
|
||||
self.logger.warning(f"Invalid code file: {codeFile.get('filename', 'unknown')}")
|
||||
continue
|
||||
|
||||
filename = codeFile['filename']
|
||||
content = codeFile['content']
|
||||
|
||||
# Validate CSV structure (header row, consistent columns)
|
||||
validatedContent = self._validateAndFixCsv(content)
|
||||
|
||||
# Extract CSV statistics for validation
|
||||
csvStats = self._extractCsvStatistics(validatedContent)
|
||||
|
||||
# Merge file-specific metadata with project metadata
|
||||
fileMetadata = dict(metadata) if metadata else {}
|
||||
fileMetadata.update({
|
||||
"filename": filename,
|
||||
"fileType": "csv",
|
||||
"statistics": csvStats
|
||||
})
|
||||
|
||||
renderedDocs.append(
|
||||
RenderedDocument(
|
||||
documentData=validatedContent.encode('utf-8'),
|
||||
mimeType="text/csv",
|
||||
filename=filename,
|
||||
metadata=fileMetadata
|
||||
)
|
||||
)
|
||||
|
||||
return renderedDocs
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: Optional[str] = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render method for document generation compatibility.
|
||||
Delegates to document renderer if needed, or handles code files directly.
|
||||
"""
|
||||
# Check if this is code generation (has files array) or document generation (has documents array)
|
||||
if "files" in extractedContent:
|
||||
# Code generation path - use renderCodeFiles
|
||||
files = extractedContent.get("files", [])
|
||||
metadata = extractedContent.get("metadata", {})
|
||||
return await self.renderCodeFiles(files, metadata, userPrompt)
|
||||
else:
|
||||
# Document generation path - delegate to document renderer
|
||||
from .rendererCsv import RendererCsv
|
||||
documentRenderer = RendererCsv(self.services)
|
||||
return await documentRenderer.render(extractedContent, title, userPrompt, aiService)
|
||||
|
||||
def _validateAndFixCsv(self, content: str) -> str:
|
||||
"""Validate CSV structure and fix common issues."""
|
||||
try:
|
||||
# Parse CSV to validate structure
|
||||
reader = csv.reader(io.StringIO(content))
|
||||
rows = list(reader)
|
||||
|
||||
if not rows:
|
||||
return content # Empty CSV
|
||||
|
||||
# Check header row exists
|
||||
headerRow = rows[0]
|
||||
headerCount = len(headerRow)
|
||||
|
||||
# Validate all rows have same column count
|
||||
fixedRows = [headerRow] # Start with header
|
||||
|
||||
for i, row in enumerate(rows[1:], 1):
|
||||
if len(row) != headerCount:
|
||||
self.logger.debug(f"Row {i} has {len(row)} columns, expected {headerCount}. Auto-fixing...")
|
||||
# Pad or truncate to match header
|
||||
if len(row) < headerCount:
|
||||
row.extend([''] * (headerCount - len(row)))
|
||||
else:
|
||||
row = row[:headerCount]
|
||||
fixedRows.append(row)
|
||||
|
||||
# Convert back to CSV string
|
||||
output = io.StringIO()
|
||||
writer = csv.writer(output)
|
||||
for row in fixedRows:
|
||||
writer.writerow(row)
|
||||
|
||||
return output.getvalue()
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"CSV validation failed: {e}, returning original content")
|
||||
return content
|
||||
|
||||
def _extractCsvStatistics(self, content: str) -> Dict[str, Any]:
|
||||
"""Extract CSV statistics for validation (row count, column count, headers)."""
|
||||
try:
|
||||
reader = csv.reader(io.StringIO(content))
|
||||
rows = list(reader)
|
||||
|
||||
if not rows:
|
||||
return {"rowCount": 0, "columnCount": 0, "headerRow": []}
|
||||
|
||||
headerRow = rows[0]
|
||||
columnCount = len(headerRow)
|
||||
rowCount = len(rows) - 1 # Exclude header
|
||||
|
||||
return {
|
||||
"rowCount": rowCount,
|
||||
"columnCount": columnCount,
|
||||
"headerRow": headerRow,
|
||||
"dataRowCount": rowCount
|
||||
}
|
||||
except Exception as e:
|
||||
self.logger.warning(f"CSV statistics extraction failed: {e}")
|
||||
return {}
|
||||
|
|
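Note on `_validateAndFixCsv` above: the pad-or-truncate rule can be shown as a small standalone function. This is a sketch, not the commit's method. `normalizeCsv` is a hypothetical name, and it passes `lineterminator="\n"` to `csv.writer` purely to make the output easy to compare, whereas the original keeps the writer's default line terminator.

```python
import csv
import io


def normalizeCsv(content: str) -> str:
    """Force every data row to the header's column count."""
    rows = list(csv.reader(io.StringIO(content)))
    if not rows:
        return content  # Empty input: nothing to normalize

    width = len(rows[0])
    fixed = [rows[0]]
    for row in rows[1:]:
        # Pad short rows with empty cells, then cut long rows to the header width.
        fixed.append((row + [""] * (width - len(row)))[:width])

    output = io.StringIO()
    csv.writer(output, lineterminator="\n").writerows(fixed)
    return output.getvalue()
```

Truncating a long row silently drops data, so a caller may prefer to log the affected row numbers, as the original does at debug level.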
@ -0,0 +1,141 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
JSON code renderer for code generation.
|
||||
"""
|
||||
|
||||
from .codeRendererBaseTemplate import BaseCodeRenderer
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List, Optional
|
||||
import json
|
||||
|
||||
class RendererCodeJson(BaseCodeRenderer):
|
||||
"""Renders JSON code files."""
|
||||
|
||||
@classmethod
|
||||
def getSupportedFormats(cls) -> List[str]:
|
||||
"""Return supported JSON formats."""
|
||||
return ['json']
|
||||
|
||||
@classmethod
|
||||
def getFormatAliases(cls) -> List[str]:
|
||||
"""Return format aliases."""
|
||||
return []
|
||||
|
||||
@classmethod
|
||||
def getPriority(cls) -> int:
|
||||
"""Return priority for JSON code renderer."""
|
||||
return 85 # Compared only with other 'code' renderers for 'json'; the document renderer (80) is registered under a separate (format, outputStyle) key
|
||||
|
||||
@classmethod
|
||||
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
|
||||
"""Return output style classification: JSON is structured data format."""
|
||||
return 'code'
|
||||
|
||||
async def renderCodeFiles(
|
||||
self,
|
||||
codeFiles: List[Dict[str, Any]],
|
||||
metadata: Dict[str, Any],
|
||||
userPrompt: Optional[str] = None
|
||||
) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render JSON code files.
|
||||
For single file: output as-is
|
||||
For multiple files: output separately (each file is independent JSON)
|
||||
"""
|
||||
renderedDocs = []
|
||||
|
||||
for codeFile in codeFiles:
|
||||
if not self._validateCodeFile(codeFile):
|
||||
self.logger.warning(f"Invalid code file: {codeFile.get('filename', 'unknown')}")
|
||||
continue
|
||||
|
||||
filename = codeFile['filename']
|
||||
content = codeFile['content']
|
||||
|
||||
# Validate JSON syntax and extract statistics
|
||||
parsed = None
|
||||
try:
|
||||
parsed = json.loads(content) # Validate JSON
|
||||
except json.JSONDecodeError as e:
|
||||
self.logger.warning(f"Invalid JSON in {filename}: {e}")
|
||||
# Could fix/format JSON here if needed
|
||||
|
||||
# Format JSON (pretty print)
|
||||
try:
|
||||
if parsed is None:
|
||||
parsed = json.loads(content)
|
||||
formattedContent = json.dumps(parsed, indent=2, ensure_ascii=False)
|
||||
except Exception:
|
||||
formattedContent = content # Use original if formatting fails
|
||||
|
||||
# Extract JSON statistics for validation
|
||||
jsonStats = self._extractJsonStatistics(parsed) if parsed is not None else {}
|
||||
|
||||
# Merge file-specific metadata with project metadata
|
||||
fileMetadata = dict(metadata) if metadata else {}
|
||||
fileMetadata.update({
|
||||
"filename": filename,
|
||||
"fileType": "json",
|
||||
"statistics": jsonStats
|
||||
})
|
||||
|
||||
renderedDocs.append(
|
||||
RenderedDocument(
|
||||
documentData=formattedContent.encode('utf-8'),
|
||||
mimeType="application/json",
|
||||
filename=filename,
|
||||
metadata=fileMetadata
|
||||
)
|
||||
)
|
||||
|
||||
return renderedDocs
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: Optional[str] = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render method for document generation compatibility.
|
||||
Delegates to document renderer if needed, or handles code files directly.
|
||||
"""
|
||||
# Check if this is code generation (has files array) or document generation (has documents array)
|
||||
if "files" in extractedContent:
|
||||
# Code generation path - use renderCodeFiles
|
||||
files = extractedContent.get("files", [])
|
||||
metadata = extractedContent.get("metadata", {})
|
||||
return await self.renderCodeFiles(files, metadata, userPrompt)
|
||||
else:
|
||||
# Document generation path - delegate to document renderer
|
||||
# Import here to avoid circular dependency
|
||||
from .rendererJson import RendererJson
|
||||
documentRenderer = RendererJson(self.services)
|
||||
return await documentRenderer.render(extractedContent, title, userPrompt, aiService)
|
||||
|
||||
def _extractJsonStatistics(self, parsed: Any) -> Dict[str, Any]:
|
||||
"""Extract JSON statistics for validation (object count, array count, key count)."""
|
||||
try:
|
||||
stats = {
|
||||
"isArray": isinstance(parsed, list),
|
||||
"isObject": isinstance(parsed, dict),
|
||||
"itemCount": 0,
|
||||
"keyCount": 0
|
||||
}
|
||||
|
||||
if isinstance(parsed, list):
|
||||
stats["itemCount"] = len(parsed)
|
||||
# Count nested objects/arrays
|
||||
objectCount = sum(1 for item in parsed if isinstance(item, dict))
|
||||
arrayCount = sum(1 for item in parsed if isinstance(item, list))
|
||||
stats["objectCount"] = objectCount
|
||||
stats["arrayCount"] = arrayCount
|
||||
elif isinstance(parsed, dict):
|
||||
stats["keyCount"] = len(parsed)
|
||||
stats["keys"] = list(parsed.keys())
|
||||
# Count nested objects/arrays
|
||||
objectCount = sum(1 for v in parsed.values() if isinstance(v, dict))
|
||||
arrayCount = sum(1 for v in parsed.values() if isinstance(v, list))
|
||||
stats["objectCount"] = objectCount
|
||||
stats["arrayCount"] = arrayCount
|
||||
|
||||
return stats
|
||||
except Exception as e:
|
||||
self.logger.warning(f"JSON statistics extraction failed: {e}")
|
||||
return {}
|
||||
|
|
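Note on `renderCodeFiles` above: a truthiness test on the parsed value treats valid but empty JSON (`[]`, `{}`, `0`, `false`) the same as a parse failure. The sketch below separates the two cases by parsing once and branching on the exception instead. `formatJson` is a hypothetical helper and collects fewer statistics than `_extractJsonStatistics`.

```python
import json
from typing import Any, Dict, Tuple


def formatJson(content: str) -> Tuple[str, Dict[str, Any]]:
    """Pretty-print valid JSON and return basic statistics; pass invalid input through."""
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        # Invalid JSON: keep the original text and report no statistics.
        return content, {}

    formatted = json.dumps(parsed, indent=2, ensure_ascii=False)
    stats: Dict[str, Any] = {
        "isArray": isinstance(parsed, list),
        "isObject": isinstance(parsed, dict),
    }
    if isinstance(parsed, (list, dict)):
        # An empty container still yields statistics, with a count of 0.
        stats["itemCount"] = len(parsed)
    return formatted, stats
```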
@ -0,0 +1,148 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
XML code renderer for code generation.
|
||||
"""
|
||||
|
||||
from .codeRendererBaseTemplate import BaseCodeRenderer
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List, Optional
|
||||
import xml.etree.ElementTree as ET
|
||||
from xml.dom import minidom
|
||||
|
||||
class RendererCodeXml(BaseCodeRenderer):
|
||||
"""Renders XML code files."""
|
||||
|
||||
@classmethod
|
||||
def getSupportedFormats(cls) -> List[str]:
|
||||
"""Return supported XML formats."""
|
||||
return ['xml']
|
||||
|
||||
@classmethod
|
||||
def getFormatAliases(cls) -> List[str]:
|
||||
"""Return format aliases."""
|
||||
return []
|
||||
|
||||
@classmethod
|
||||
def getPriority(cls) -> int:
|
||||
"""Return priority for XML code renderer."""
|
||||
return 80
|
||||
|
||||
@classmethod
|
||||
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
|
||||
"""Return output style classification: XML is structured data format."""
|
||||
return 'code'
|
||||
|
||||
async def renderCodeFiles(
|
||||
self,
|
||||
codeFiles: List[Dict[str, Any]],
|
||||
metadata: Dict[str, Any],
|
||||
userPrompt: Optional[str] = None
|
||||
) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render XML code files.
|
||||
Validates XML syntax and formats (pretty print).
|
||||
"""
|
||||
renderedDocs = []
|
||||
|
||||
for codeFile in codeFiles:
|
||||
if not self._validateCodeFile(codeFile):
|
||||
self.logger.warning(f"Invalid code file: {codeFile.get('filename', 'unknown')}")
|
||||
continue
|
||||
|
||||
filename = codeFile['filename']
|
||||
content = codeFile['content']
|
||||
|
||||
# Validate and format XML
|
||||
formattedContent = self._validateAndFormatXml(content)
|
||||
|
||||
# Extract XML statistics for validation
|
||||
xmlStats = self._extractXmlStatistics(formattedContent)
|
||||
|
||||
# Merge file-specific metadata with project metadata
|
||||
fileMetadata = dict(metadata) if metadata else {}
|
||||
fileMetadata.update({
|
||||
"filename": filename,
|
||||
"fileType": "xml",
|
||||
"statistics": xmlStats
|
||||
})
|
||||
|
||||
renderedDocs.append(
|
||||
RenderedDocument(
|
||||
documentData=formattedContent.encode('utf-8'),
|
||||
mimeType="application/xml",
|
||||
filename=filename,
|
||||
metadata=fileMetadata
|
||||
)
|
||||
)
|
||||
|
||||
return renderedDocs
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: Optional[str] = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render method for document generation compatibility.
|
||||
For XML, we only support code generation (no document renderer exists yet).
|
||||
"""
|
||||
# Check if this is code generation (has files array)
|
||||
if "files" in extractedContent:
|
||||
# Code generation path - use renderCodeFiles
|
||||
files = extractedContent.get("files", [])
|
||||
metadata = extractedContent.get("metadata", {})
|
||||
return await self.renderCodeFiles(files, metadata, userPrompt)
|
||||
else:
|
||||
# Document generation path - not supported yet, return error
|
||||
self.logger.warning("XML document generation not supported, only code generation")
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData="XML document generation not yet supported".encode('utf-8'),
|
||||
mimeType="text/plain",
|
||||
filename="error.txt",
|
||||
metadata={}
|
||||
)
|
||||
]
|
||||
|
||||
def _validateAndFormatXml(self, content: str) -> str:
|
||||
"""Validate XML syntax and format (pretty print)."""
|
||||
try:
|
||||
# Parse XML to validate
|
||||
root = ET.fromstring(content)
|
||||
|
||||
# Format XML (pretty print)
|
||||
rough_string = ET.tostring(root, encoding='unicode')
|
||||
reparsed = minidom.parseString(rough_string)
|
||||
formatted = reparsed.toprettyxml(indent=" ")
|
||||
|
||||
# Remove extra blank lines
|
||||
lines = [line for line in formatted.split('\n') if line.strip()]
|
||||
return '\n'.join(lines)
|
||||
|
||||
except ET.ParseError as e:
|
||||
self.logger.warning(f"Invalid XML: {e}, returning original content")
|
||||
return content
|
||||
except Exception as e:
|
||||
self.logger.warning(f"XML formatting failed: {e}, returning original content")
|
||||
return content
|
||||
|
||||
def _extractXmlStatistics(self, content: str) -> Dict[str, Any]:
|
||||
"""Extract XML statistics for validation (element count, attribute count, root element)."""
|
||||
try:
|
||||
root = ET.fromstring(content)
|
||||
|
||||
# Count all elements recursively
|
||||
elementCount = len(list(root.iter()))
|
||||
|
||||
# Count attributes
|
||||
attributeCount = sum(len(elem.attrib) for elem in root.iter())
|
||||
|
||||
# Get root element name
|
||||
rootElement = root.tag
|
||||
|
||||
return {
|
||||
"elementCount": elementCount,
|
||||
"attributeCount": attributeCount,
|
||||
"rootElement": rootElement,
|
||||
"hasRoot": True
|
||||
}
|
||||
except Exception as e:
|
||||
self.logger.warning(f"XML statistics extraction failed: {e}")
|
||||
return {}
|
||||
|
|
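Note on `_validateAndFormatXml` and `_extractXmlStatistics` above: both rely only on the standard library, so their behavior can be checked in isolation. The sketch below uses the hypothetical names `prettyXml` and `xmlStats` and mirrors the same calls (`ET.fromstring`, `minidom.parseString(...).toprettyxml`).

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict
from xml.dom import minidom


def prettyXml(content: str) -> str:
    """Pretty-print well-formed XML; return malformed input unchanged."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        return content
    rough = ET.tostring(root, encoding="unicode")
    pretty = minidom.parseString(rough).toprettyxml(indent="  ")
    # toprettyxml can emit whitespace-only lines; drop them.
    return "\n".join(line for line in pretty.split("\n") if line.strip())


def xmlStats(content: str) -> Dict[str, Any]:
    """Count elements and attributes, including the root element itself."""
    root = ET.fromstring(content)
    elements = list(root.iter())
    return {
        "elementCount": len(elements),
        "attributeCount": sum(len(element.attrib) for element in elements),
        "rootElement": root.tag,
    }
```

Two side effects are worth keeping in mind: `toprettyxml` prepends its own XML declaration, and `ET.fromstring` returns only the root element, so a declaration or comments placed before the root in the input do not survive the round trip.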
@ -0,0 +1,415 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
CSV renderer for report generation.
|
||||
"""
|
||||
|
||||
from .documentRendererBaseTemplate import BaseRenderer
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List, Optional
|
||||
|
||||
class RendererCsv(BaseRenderer):
|
||||
"""Renders content to CSV format with format-specific extraction."""
|
||||
|
||||
@classmethod
|
||||
def getSupportedFormats(cls) -> List[str]:
|
||||
"""Return supported CSV formats."""
|
||||
return ['csv']
|
||||
|
||||
@classmethod
|
||||
def getFormatAliases(cls) -> List[str]:
|
||||
"""Return format aliases."""
|
||||
return ['spreadsheet', 'table']
|
||||
|
||||
@classmethod
|
||||
def getPriority(cls) -> int:
|
||||
"""Return priority for CSV renderer."""
|
||||
return 70
|
||||
|
||||
@classmethod
|
||||
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
|
||||
"""Return output style classification: CSV document renderer converts structured document content to CSV."""
|
||||
return 'document'
|
||||
|
||||
@classmethod
|
||||
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
|
||||
"""
|
||||
Return list of section content types that CSV renderer accepts.
|
||||
CSV renderer accepts table sections and code_block sections (for raw CSV content).
|
||||
"""
|
||||
return ["table", "code_block"]
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: Optional[str] = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to CSV format. Produces one CSV file per table section."""
|
||||
try:
|
||||
# Validate JSON structure
|
||||
if not self._validateJsonStructure(extractedContent):
|
||||
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
# Extract sections and metadata
|
||||
sections = self._extractSections(extractedContent)
|
||||
metadata = self._extractMetadata(extractedContent)
|
||||
|
||||
# Determine base filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
baseFilename = None
|
||||
if documents and isinstance(documents[0], dict):
|
||||
baseFilename = documents[0].get("filename")
|
||||
if not baseFilename:
|
||||
baseFilename = self._determineFilename(title, "text/csv")
|
||||
|
||||
# Remove extension from base filename if present
|
||||
if baseFilename.endswith('.csv'):
|
||||
baseFilename = baseFilename[:-4]
|
||||
|
||||
# Collect CSV-producing sections: table sections AND code_block sections with CSV language
|
||||
tableSections = []
|
||||
codeBlockCsvSections = []
|
||||
for section in sections:
|
||||
sectionType = section.get("content_type", "paragraph")
|
||||
if sectionType == "table":
|
||||
tableSections.append(section)
|
||||
elif sectionType == "code_block":
|
||||
# Check if any element is a code_block with language "csv"
|
||||
for element in section.get("elements", []):
|
||||
content = element.get("content", {})
|
||||
if isinstance(content, dict) and content.get("language", "").lower() == "csv":
|
||||
codeBlockCsvSections.append(section)
|
||||
break
|
||||
|
||||
# If no usable sections found, return empty CSV
|
||||
if not tableSections and not codeBlockCsvSections:
|
||||
self.logger.warning("No table or CSV code_block sections found in CSV document - returning empty CSV")
|
||||
emptyCsv = self._convertRowsToCsv([["No table data available"]])
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=emptyCsv.encode('utf-8'),
|
||||
mimeType="text/csv",
|
||||
filename=self._determineFilename(title, "text/csv"),
|
||||
documentType=metadata.get("documentType") if isinstance(metadata, dict) else None,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
allCsvSections = tableSections + codeBlockCsvSections
|
||||
|
||||
# Generate one CSV file per section
|
||||
renderedDocuments = []
|
||||
for i, csvSection in enumerate(allCsvSections):
|
||||
sectionType = csvSection.get("content_type", "paragraph")
|
||||
sectionTitle = csvSection.get("title")
|
||||
csvContent = ""
|
||||
|
||||
if sectionType == "code_block":
|
||||
# Extract raw CSV content directly from code_block elements
|
||||
rawCsvParts = []
|
||||
for element in csvSection.get("elements", []):
|
||||
content = element.get("content", {})
|
||||
if isinstance(content, dict) and content.get("language", "").lower() == "csv":
|
||||
code = content.get("code", "")
|
||||
if code:
|
||||
rawCsvParts.append(code)
|
||||
csvContent = "\n".join(rawCsvParts)
|
||||
else:
|
||||
# Table section — render via table logic
|
||||
csvRows = []
|
||||
if sectionTitle:
|
||||
csvRows.append([sectionTitle])
|
||||
csvRows.append([]) # Empty row after title
|
||||
|
||||
elements = csvSection.get("elements", [])
|
||||
for element in elements:
|
||||
tableRows = self._renderJsonTableToCsv(element)
|
||||
if tableRows:
|
||||
csvRows.extend(tableRows)
|
||||
|
||||
csvContent = self._convertRowsToCsv(csvRows)
|
||||
|
||||
# Determine filename
|
||||
if len(allCsvSections) == 1:
|
||||
filename = f"{baseFilename}.csv"
|
||||
else:
|
||||
sectionId = csvSection.get("id", f"csv_{i+1}")
|
||||
if sectionTitle:
|
||||
safeTitle = "".join(c for c in sectionTitle if c.isalnum() or c in (' ', '-', '_')).strip()
|
||||
safeTitle = safeTitle.replace(' ', '_')[:30]
|
||||
filename = f"{baseFilename}_{safeTitle}.csv"
|
||||
else:
|
||||
filename = f"{baseFilename}_{sectionId}.csv"
|
||||
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
renderedDocuments.append(
|
||||
RenderedDocument(
|
||||
documentData=csvContent.encode('utf-8'),
|
||||
mimeType="text/csv",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
)
|
||||
|
||||
return renderedDocuments
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering CSV: {str(e)}")
|
||||
# Return minimal CSV fallback
|
||||
fallbackCsv = self._convertRowsToCsv([["Title", "Content"], [title, f"Error rendering report: {str(e)}"]])
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=fallbackCsv.encode('utf-8'),
|
||||
mimeType="text/csv",
|
||||
filename=self._determineFilename(title, "text/csv"),
|
||||
metadata=extractedContent.get("metadata", {}) if extractedContent else None
|
||||
)
|
||||
]
|
||||
|
||||
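The per-section filename rule in `render()` above can be isolated as a small pure function, which makes the sanitising step easier to test. The following is a minimal sketch under that assumption; `buildSectionFilename` is a hypothetical helper name and is not part of the renderer.

```python
from typing import Optional


def buildSectionFilename(baseFilename: str, sectionTitle: Optional[str],
                         sectionId: str, sectionCount: int) -> str:
    """Mirror the naming rule used in render() (hypothetical standalone helper)."""
    if sectionCount == 1:
        # A single CSV section keeps the plain base name
        return f"{baseFilename}.csv"
    if sectionTitle:
        # Keep only alphanumerics, spaces, hyphens and underscores
        safeTitle = "".join(
            c for c in sectionTitle if c.isalnum() or c in (" ", "-", "_")
        ).strip()
        # Spaces become underscores; the title part is capped at 30 characters
        safeTitle = safeTitle.replace(" ", "_")[:30]
        return f"{baseFilename}_{safeTitle}.csv"
    # No title: fall back to the section id
    return f"{baseFilename}_{sectionId}.csv"
```

Note that a title made up only of punctuation sanitises to an empty string, so the result would be `report_.csv`, and two sections with the same title would share one filename. Falling back to the section id in those cases may be worth considering.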
async def _generateCsvFromJson(self, jsonContent: Dict[str, Any], title: str) -> str:
|
||||
"""Generate CSV content from structured JSON document. DEPRECATED: Use render() method instead."""
|
||||
# This method is kept for backward compatibility but is no longer used
|
||||
# The render() method now handles CSV generation directly
|
||||
try:
|
||||
# Validate JSON structure (standardized schema: {metadata: {...}, documents: [{sections: [...]}]})
|
||||
if not self._validateJsonStructure(jsonContent):
|
||||
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
# Extract sections and metadata from standardized schema
|
||||
sections = self._extractSections(jsonContent)
|
||||
metadata = self._extractMetadata(jsonContent)
|
||||
|
||||
# Use provided title (which comes from documents[].title) as primary source
|
||||
# Fallback to metadata.title only if title parameter is empty
|
||||
documentTitle = title if title else metadata.get("title", "Generated Document")
|
||||
|
||||
# Generate CSV content
|
||||
csvRows = []
|
||||
|
||||
# Add title row
|
||||
if documentTitle:
|
||||
csvRows.append([documentTitle])
|
||||
csvRows.append([]) # Empty row
|
||||
|
||||
# Process each section in order - only table sections
|
||||
for section in sections:
|
||||
sectionType = section.get("content_type", "paragraph")
|
||||
if sectionType == "table":
|
||||
sectionCsv = self._renderJsonSectionToCsv(section)
|
||||
if sectionCsv:
|
||||
csvRows.extend(sectionCsv)
|
||||
csvRows.append([]) # Empty row between sections
|
||||
|
||||
# Convert to CSV string
|
||||
csvContent = self._convertRowsToCsv(csvRows)
|
||||
|
||||
return csvContent
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error generating CSV from JSON: {str(e)}")
|
||||
raise Exception(f"CSV generation failed: {str(e)}") from e
|
||||
|
||||
def _renderJsonSectionToCsv(self, section: Dict[str, Any]) -> List[List[str]]:
|
||||
"""Render a single JSON section to CSV rows."""
|
||||
try:
|
||||
sectionType = section.get("content_type", "paragraph")
|
||||
elements = section.get("elements", [])
|
||||
|
||||
csvRows = []
|
||||
|
||||
# Add section title if available
|
||||
sectionTitle = section.get("title")
|
||||
if sectionTitle:
|
||||
csvRows.append([f"# {sectionTitle}"])
|
||||
|
||||
# Process each element in the section
|
||||
for element in elements:
|
||||
if sectionType == "table":
|
||||
csvRows.extend(self._renderJsonTableToCsv(element))
|
||||
elif sectionType in ("list", "bullet_list"):
|
||||
csvRows.extend(self._renderJsonListToCsv(element))
|
||||
elif sectionType == "heading":
|
||||
csvRows.extend(self._renderJsonHeadingToCsv(element))
|
||||
elif sectionType == "paragraph":
|
||||
csvRows.extend(self._renderJsonParagraphToCsv(element))
|
||||
elif sectionType in ("code", "code_block"):
|
||||
csvRows.extend(self._renderJsonCodeToCsv(element))
|
||||
else:
|
||||
# Fallback to paragraph for unknown types
|
||||
csvRows.extend(self._renderJsonParagraphToCsv(element))
|
||||
|
||||
return csvRows
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering section {section.get('id', 'unknown')}: {str(e)}")
|
||||
return [["[Error rendering section]"]]
|
||||
|
||||
def _renderJsonTableToCsv(self, tableData: Dict[str, Any]) -> List[List[str]]:
|
||||
"""Render a JSON table to CSV rows."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = tableData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
headers = content.get("headers", [])
|
||||
rows = content.get("rows", [])
|
||||
|
||||
csvRows = []
|
||||
|
||||
if headers:
|
||||
csvRows.append(headers)
|
||||
|
||||
if rows:
|
||||
csvRows.extend(rows)
|
||||
|
||||
return csvRows
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering table: {str(e)}")
|
||||
return [["[Error rendering table]"]]
|
||||
|
||||
def _renderJsonListToCsv(self, listData: Dict[str, Any]) -> List[List[str]]:
|
||||
"""Render a JSON list to CSV rows."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = listData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
items = content.get("items", [])
|
||||
csvRows = []
|
||||
|
||||
for item in items:
|
||||
if isinstance(item, dict):
|
||||
text = item.get("text", "")
|
||||
subitems = item.get("subitems", [])
|
||||
csvRows.append([text])
|
||||
|
||||
# Add subitems as indented rows
|
||||
for subitem in subitems:
|
||||
if isinstance(subitem, dict):
|
||||
csvRows.append([f" - {subitem.get('text', '')}"])
|
||||
else:
|
||||
csvRows.append([f" - {subitem}"])
|
||||
else:
|
||||
csvRows.append([str(item)])
|
||||
|
||||
return csvRows
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering list: {str(e)}")
|
||||
return [["[Error rendering list]"]]
|
||||
|
||||
def _renderJsonHeadingToCsv(self, headingData: Dict[str, Any]) -> List[List[str]]:
|
||||
"""Render a JSON heading to CSV rows."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = headingData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
text = content.get("text", "")
|
||||
level = content.get("level", 1)
|
||||
|
||||
if text:
|
||||
# Use # symbols for heading levels
|
||||
headingText = f"{'#' * level} {text}"
|
||||
return [[headingText]]
|
||||
|
||||
return []
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering heading: {str(e)}")
|
||||
return [["[Error rendering heading]"]]
|
||||
|
||||
def _renderJsonParagraphToCsv(self, paragraphData: Dict[str, Any]) -> List[List[str]]:
|
||||
"""Render a JSON paragraph to CSV rows."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = paragraphData.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
|
||||
if text:
|
||||
# Split long paragraphs into multiple rows if needed
|
||||
if len(text) > 100:
|
||||
words = text.split()
|
||||
rows = []
|
||||
currentRow = []
|
||||
currentLength = 0
|
||||
|
||||
for word in words:
|
||||
if currentLength + len(word) > 100 and currentRow:
|
||||
rows.append([" ".join(currentRow)])
|
||||
currentRow = [word]
|
||||
currentLength = len(word)
|
||||
else:
|
||||
currentRow.append(word)
|
||||
currentLength += len(word) + 1
|
||||
|
||||
if currentRow:
|
||||
rows.append([" ".join(currentRow)])
|
||||
|
||||
return rows
|
||||
else:
|
||||
return [[text]]
|
||||
|
||||
return []
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering paragraph: {str(e)}")
|
||||
return [["[Error rendering paragraph]"]]
|
||||
|
||||
def _renderJsonCodeToCsv(self, codeData: Dict[str, Any]) -> List[List[str]]:
|
||||
"""Render a JSON code block to CSV rows."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = codeData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
code = content.get("code", "")
|
||||
language = content.get("language", "")
|
||||
|
||||
csvRows = []
|
||||
|
||||
if language:
|
||||
csvRows.append([f"Code ({language}):"])
|
||||
|
||||
if code:
|
||||
# Split code into lines
|
||||
codeLines = code.split('\n')
|
||||
for line in codeLines:
|
||||
csvRows.append([f" {line}"])
|
||||
|
||||
return csvRows
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering code block: {str(e)}")
|
||||
return [["[Error rendering code block]"]]
|
||||
|
||||
def _convertRowsToCsv(self, rows: List[List[str]]) -> str:
|
||||
"""Convert rows to CSV string."""
|
||||
import csv
|
||||
import io
|
||||
|
||||
output = io.StringIO()
|
||||
writer = csv.writer(output)
|
||||
|
||||
for row in rows:
|
||||
if row: # Skip empty rows; blank separator rows appended by callers are dropped here
|
||||
writer.writerow(row)
|
||||
|
||||
return output.getvalue()
|
||||
|
||||
def _cleanCsvContent(self, content: str, title: str) -> str:
|
||||
"""Clean and validate CSV content from AI."""
|
||||
content = content.strip()
|
||||
|
||||
# Remove markdown code blocks if present
|
||||
if content.startswith("```") and content.endswith("```"):
|
||||
lines = content.split('\n')
|
||||
if len(lines) > 2:
|
||||
content = '\n'.join(lines[1:-1]).strip()
|
||||
|
||||
return content
|
||||
|
||||
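The behaviour of `_convertRowsToCsv` above is easiest to see on a small input. The sketch below is a standalone copy of the same logic for illustration only; `convertRowsToCsv` is a hypothetical free function, not the renderer's method. It relies on the default `csv.writer` dialect, which quotes only fields that need it and ends rows with `\r\n`.

```python
import csv
import io
from typing import List


def convertRowsToCsv(rows: List[List[str]]) -> str:
    """Standalone copy of the row-to-CSV conversion (illustration only)."""
    output = io.StringIO()
    writer = csv.writer(output)  # default dialect: minimal quoting, "\r\n" line endings
    for row in rows:
        if row:  # empty lists are skipped, so blank separator rows disappear
            writer.writerow(row)
    return output.getvalue()


# The empty list in the middle is dropped; the field containing a comma is quoted
csvText = convertRowsToCsv([["Name", "Note"], [], ["Ada", "likes a, b"]])
```

One consequence is that the `csvRows.append([])` calls meant to add a blank line after a title or between sections have no effect on the output.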
File diff suppressed because it is too large
|
|
@ -0,0 +1,841 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
HTML renderer for report generation.
|
||||
"""
|
||||
|
||||
from .documentRendererBaseTemplate import BaseRenderer
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List, Optional
|
||||
|
||||
class RendererHtml(BaseRenderer):
|
||||
"""Renders content to HTML format with format-specific extraction."""
|
||||
|
||||
@classmethod
|
||||
def getSupportedFormats(cls) -> List[str]:
|
||||
"""Return supported HTML formats."""
|
||||
return ['html', 'htm']
|
||||
|
||||
@classmethod
|
||||
def getFormatAliases(cls) -> List[str]:
|
||||
"""Return format aliases."""
|
||||
return ['web', 'webpage']
|
||||
|
||||
@classmethod
|
||||
def getPriority(cls) -> int:
|
||||
"""Return priority for HTML renderer."""
|
||||
return 100
|
||||
|
||||
@classmethod
|
||||
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
|
||||
"""Return output style classification: HTML web pages are rendered documents."""
|
||||
return 'document'
|
||||
|
||||
@classmethod
|
||||
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
|
||||
"""
|
||||
Return list of section content types that HTML renderer accepts.
|
||||
HTML renderer accepts all section types (HTML pages can contain all content types including images).
|
||||
"""
|
||||
from modules.datamodels.datamodelJson import supportedSectionTypes
|
||||
return list(supportedSectionTypes)
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render HTML document with images as separate files.
|
||||
Returns list of documents: [HTML document, image1, image2, ...]
|
||||
"""
|
||||
import base64
|
||||
|
||||
# Extract images first
|
||||
images = self._extractImages(extractedContent)
|
||||
|
||||
# Store images in instance for later retrieval
|
||||
self._renderedImages = images
|
||||
|
||||
# Generate HTML using AI-analyzed styling
|
||||
htmlContent = await self._generateHtmlFromJson(extractedContent, title, userPrompt, aiService)
|
||||
|
||||
# Replace base64 data URIs with relative file paths if images exist
|
||||
if images:
|
||||
htmlContent = self._replaceImageDataUris(htmlContent, images)
|
||||
|
||||
# Determine HTML filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
htmlFilename = documents[0].get("filename")
|
||||
if not htmlFilename:
|
||||
htmlFilename = self._determineFilename(title, "text/html")
|
||||
else:
|
||||
htmlFilename = self._determineFilename(title, "text/html")
|
||||
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
# Start with HTML document
|
||||
resultDocuments = [
|
||||
RenderedDocument(
|
||||
documentData=htmlContent.encode('utf-8'),
|
||||
mimeType="text/html",
|
||||
filename=htmlFilename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
# Add images as separate documents
|
||||
for img in images:
|
||||
base64Data = img.get("base64Data", "")
|
||||
filename = img.get("filename", f"image_{len(resultDocuments)}.png")
|
||||
mimeType = img.get("mimeType", "image/png")
|
||||
|
||||
if base64Data:
|
||||
try:
|
||||
# Decode base64 to bytes
|
||||
imageBytes = base64.b64decode(base64Data)
|
||||
resultDocuments.append(
|
||||
RenderedDocument(
|
||||
documentData=imageBytes,
|
||||
mimeType=mimeType,
|
||||
filename=filename
|
||||
)
|
||||
)
|
||||
self.logger.debug(f"Added image file: {filename} ({len(imageBytes)} bytes)")
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error creating image file {filename}: {str(e)}")
|
||||
|
||||
return resultDocuments
|
||||
|
||||
async def _generateHtmlFromJson(self, jsonContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> str:
|
||||
"""Generate HTML content from structured JSON document using AI-generated styling."""
|
||||
try:
|
||||
# Get style set: use styles from metadata if available, otherwise enhance with AI
|
||||
styles = await self._getStyleSet(jsonContent, userPrompt, aiService)
|
||||
|
||||
# Validate JSON structure
|
||||
if not self._validateJsonStructure(jsonContent):
|
||||
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
# Extract sections and metadata from standardized schema
|
||||
sections = self._extractSections(jsonContent)
|
||||
metadata = self._extractMetadata(jsonContent)
|
||||
|
||||
# Use provided title (which comes from documents[].title) as primary source
|
||||
# Fallback to metadata.title only if title parameter is empty
|
||||
documentTitle = title if title else metadata.get("title", "Generated Document")
|
||||
|
||||
# Build HTML document
|
||||
htmlParts = []
|
||||
|
||||
# HTML document structure
|
||||
htmlParts.append('<!DOCTYPE html>')
|
||||
htmlParts.append('<html lang="en">')
|
||||
htmlParts.append('<head>')
|
||||
htmlParts.append('<meta charset="UTF-8">')
|
||||
htmlParts.append('<meta name="viewport" content="width=device-width, initial-scale=1.0">')
|
||||
htmlParts.append(f'<title>{documentTitle}</title>')
|
||||
htmlParts.append('<style>')
|
||||
htmlParts.append(self._generateCssStyles(styles))
|
||||
htmlParts.append('</style>')
|
||||
htmlParts.append('</head>')
|
||||
htmlParts.append('<body>')
|
||||
|
||||
# Document header
|
||||
htmlParts.append(f'<header><h1 class="document-title">{documentTitle}</h1></header>')
|
||||
|
||||
# Main content
|
||||
htmlParts.append('<main>')
|
||||
|
||||
# Process each section
|
||||
for section in sections:
|
||||
sectionHtml = self._renderJsonSection(section, styles)
|
||||
if sectionHtml:
|
||||
htmlParts.append(sectionHtml)
|
||||
|
||||
htmlParts.append('</main>')
|
||||
|
||||
# Footer
|
||||
htmlParts.append('<footer>')
|
||||
htmlParts.append(f'<p class="generated-info">Generated: {self._formatTimestamp()}</p>')
|
||||
htmlParts.append('</footer>')
|
||||
|
||||
htmlParts.append('</body>')
|
||||
htmlParts.append('</html>')
|
||||
|
||||
return '\n'.join(htmlParts)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error generating HTML from JSON: {str(e)}")
|
||||
raise Exception(f"HTML generation failed: {str(e)}") from e
|
||||
|
||||
async def _getStyleSet(self, extractedContent: Dict[str, Any] = None, userPrompt: str = None, aiService=None, templateName: str = None) -> Dict[str, Any]:
|
||||
"""Get style set - use styles from document generation metadata if available,
|
||||
otherwise enhance default styles with AI if userPrompt provided.
|
||||
|
||||
IMPORTANT: In a dynamic, scalable AI system, styling should come from document generation,
|
||||
not be generated separately by renderers. Only fall back to AI if styles not provided.
|
||||
|
||||
Args:
|
||||
extractedContent: Document content with metadata (may contain styles)
|
||||
userPrompt: User's prompt (AI will detect style instructions in any language)
|
||||
aiService: AI service (used only if styles not in metadata and userPrompt provided)
|
||||
templateName: Name of template style set (None = default)
|
||||
|
||||
Returns:
|
||||
Dict with style definitions for all document styles
|
||||
"""
|
||||
# Get default style set
|
||||
defaultStyleSet = self._getDefaultStyleSet()
|
||||
|
||||
# FIRST: Check if styles are provided in document generation metadata (preferred approach)
|
||||
if extractedContent:
|
||||
metadata = extractedContent.get("metadata", {})
|
||||
if isinstance(metadata, dict):
|
||||
styles = metadata.get("styles")
|
||||
if styles and isinstance(styles, dict):
|
||||
self.logger.debug("Using styles from document generation metadata")
|
||||
return self._validateStylesContrast(styles)
|
||||
|
||||
# FALLBACK: Enhance with AI if userPrompt provided (only if styles not in metadata)
|
||||
if userPrompt and aiService:
|
||||
self.logger.info("Styles not in metadata, enhancing with AI based on user prompt...")
|
||||
enhancedStyleSet = await self._enhanceStylesWithAI(userPrompt, defaultStyleSet, aiService)
|
||||
return self._validateStylesContrast(enhancedStyleSet)
|
||||
else:
|
||||
# Use default styles only
|
||||
return defaultStyleSet
|
||||
|
||||
async def _enhanceStylesWithAI(self, userPrompt: str, defaultStyleSet: Dict[str, Any], aiService) -> Dict[str, Any]:
|
||||
"""Enhance default styles with AI based on user prompt."""
|
||||
try:
|
||||
style_template = self._createAiStyleTemplate("html", userPrompt, defaultStyleSet)
|
||||
enhanced_styles = await self._getAiStyles(aiService, style_template, defaultStyleSet)
|
||||
return enhanced_styles
|
||||
except Exception as e:
|
||||
self.logger.warning(f"AI style enhancement failed: {str(e)}, using default styles")
|
||||
return defaultStyleSet
|
||||
|
||||
def _validateStylesContrast(self, styles: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Validate and fix contrast issues in AI-generated styles."""
|
||||
try:
|
||||
# Fix table header contrast
|
||||
if "table_header" in styles:
|
||||
header = styles["table_header"]
|
||||
bgColor = header.get("background", "#FFFFFF")
|
||||
textColor = header.get("color", "#000000")
|
||||
|
||||
# If background and text are both white or both black, reset to a readable pair
|
||||
if bgColor.upper() == "#FFFFFF" and textColor.upper() == "#FFFFFF":
|
||||
header["background"] = "#4F4F4F"
|
||||
header["color"] = "#FFFFFF"
|
||||
elif bgColor.upper() == "#000000" and textColor.upper() == "#000000":
|
||||
header["background"] = "#4F4F4F"
|
||||
header["color"] = "#FFFFFF"
|
||||
|
||||
# Fix table cell contrast
|
||||
if "table_cell" in styles:
|
||||
cell = styles["table_cell"]
|
||||
bgColor = cell.get("background", "#FFFFFF")
|
||||
textColor = cell.get("color", "#000000")
|
||||
|
||||
# If background and text are both white or both black, reset to a readable pair
|
||||
if bgColor.upper() == "#FFFFFF" and textColor.upper() == "#FFFFFF":
|
||||
cell["background"] = "#FFFFFF"
|
||||
cell["color"] = "#2F2F2F"
|
||||
elif bgColor.upper() == "#000000" and textColor.upper() == "#000000":
|
||||
cell["background"] = "#FFFFFF"
|
||||
cell["color"] = "#2F2F2F"
|
||||
|
||||
return styles
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Style validation failed: {str(e)}")
|
||||
return self._getDefaultStyleSet()
|
||||
|
||||
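`_validateStylesContrast` above only catches exact white-on-white and black-on-black pairs. A broader check could compare the two colours by their WCAG 2.x contrast ratio. The sketch below shows that swapped-in technique and is not what the renderer currently does. The function names are hypothetical, and colours are assumed to be six-digit `#RRGGBB` strings.

```python
def relativeLuminance(hexColor: str) -> float:
    """WCAG 2.x relative luminance of a '#RRGGBB' colour (assumed input format)."""
    value = hexColor.lstrip("#")
    channels = [int(value[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Convert each sRGB channel to linear light
    linear = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrastRatio(foreground: str, background: str) -> float:
    """Contrast ratio between two colours, from 1.0 (identical) up to about 21.0."""
    lighter, darker = sorted(
        (relativeLuminance(foreground), relativeLuminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def hasReadableContrast(foreground: str, background: str, minimum: float = 4.5) -> bool:
    """True if the pair meets the given ratio (4.5 is the WCAG AA level for body text)."""
    return contrastRatio(foreground, background) >= minimum
```

With a check like this, any AI-generated header or cell pair that falls below the threshold could be replaced with the default colours, rather than only the two exact-match cases.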
def _getDefaultStyleSet(self) -> Dict[str, Any]:
|
||||
"""Default HTML style set - used when no style instructions present."""
|
||||
return {
|
||||
"title": {"font_size": "2.5em", "color": "#1F4E79", "font_weight": "bold", "text_align": "center", "margin": "0 0 1em 0"},
|
||||
"heading1": {"font_size": "2em", "color": "#2F2F2F", "font_weight": "bold", "text_align": "left", "margin": "1.5em 0 0.5em 0"},
|
||||
"heading2": {"font_size": "1.5em", "color": "#4F4F4F", "font_weight": "bold", "text_align": "left", "margin": "1em 0 0.5em 0"},
|
||||
"paragraph": {"font_size": "1em", "color": "#2F2F2F", "font_weight": "normal", "text_align": "left", "margin": "0 0 1em 0", "line_height": "1.6"},
|
||||
"table": {"border": "1px solid #ddd", "border_collapse": "collapse", "width": "100%", "margin": "1em 0"},
|
||||
"table_header": {"background": "#4F4F4F", "color": "#FFFFFF", "font_weight": "bold", "text_align": "center", "padding": "12px"},
|
||||
"table_cell": {"background": "#FFFFFF", "color": "#2F2F2F", "font_weight": "normal", "text_align": "left", "padding": "8px", "border": "1px solid #ddd"},
|
||||
"bullet_list": {"font_size": "1em", "color": "#2F2F2F", "margin": "0 0 1em 0", "padding_left": "20px"},
|
||||
"code_block": {"font_family": "Courier New, monospace", "font_size": "0.9em", "color": "#2F2F2F", "background": "#F5F5F5", "padding": "1em", "border": "1px solid #ddd", "border_radius": "4px", "margin": "1em 0"},
|
||||
"image": {"max_width": "100%", "height": "auto", "margin": "1em 0", "border_radius": "4px"},
|
||||
"body": {"font_family": "Arial, sans-serif", "background": "#FFFFFF", "color": "#2F2F2F", "margin": "0", "padding": "20px"}
|
||||
}
|
||||
|
||||
|
||||
def _generateCssStyles(self, styles: Dict[str, Any]) -> str:
|
||||
"""Generate CSS from style definitions."""
|
||||
css_parts = []
|
||||
|
||||
# Body styles
|
||||
body_style = styles.get("body", {})
|
||||
css_parts.append("body {")
|
||||
for property_name, value in body_style.items():
|
||||
css_property = property_name.replace("_", "-")
|
||||
css_parts.append(f" {css_property}: {value};")
|
||||
css_parts.append("}")
|
||||
|
||||
# Document title
|
||||
title_style = styles.get("title", {})
|
||||
css_parts.append(".document-title {")
|
||||
for property_name, value in title_style.items():
|
||||
css_property = property_name.replace("_", "-")
|
||||
css_parts.append(f" {css_property}: {value};")
|
||||
css_parts.append("}")
|
||||
|
||||
# Headings
|
||||
for heading_level in ["heading1", "heading2"]:
|
||||
heading_style = styles.get(heading_level, {})
|
||||
css_class = f"h{heading_level[-1]}"
|
||||
css_parts.append(f"{css_class} {{")
|
||||
for property_name, value in heading_style.items():
|
||||
css_property = property_name.replace("_", "-")
|
||||
css_parts.append(f" {css_property}: {value};")
|
||||
css_parts.append("}")
|
||||
|
||||
# Paragraphs
|
||||
paragraph_style = styles.get("paragraph", {})
|
||||
css_parts.append("p {")
|
||||
for property_name, value in paragraph_style.items():
|
||||
css_property = property_name.replace("_", "-")
|
||||
css_parts.append(f" {css_property}: {value};")
|
||||
css_parts.append("}")
|
||||
|
||||
# Tables
|
||||
table_style = styles.get("table", {})
|
||||
css_parts.append("table {")
|
||||
for property_name, value in table_style.items():
|
||||
css_property = property_name.replace("_", "-")
|
||||
css_parts.append(f" {css_property}: {value};")
|
||||
css_parts.append("}")
|
||||
|
||||
# Table headers
|
||||
table_header_style = styles.get("table_header", {})
|
||||
css_parts.append("th {")
|
||||
for property_name, value in table_header_style.items():
|
||||
css_property = property_name.replace("_", "-")
|
||||
css_parts.append(f" {css_property}: {value};")
|
||||
css_parts.append("}")
|
||||
|
||||
# Table cells
|
||||
table_cell_style = styles.get("table_cell", {})
|
||||
css_parts.append("td {")
|
||||
for property_name, value in table_cell_style.items():
|
||||
css_property = property_name.replace("_", "-")
|
||||
css_parts.append(f" {css_property}: {value};")
|
||||
css_parts.append("}")
|
||||
|
||||
# Lists
|
||||
bullet_list_style = styles.get("bullet_list", {})
|
||||
css_parts.append("ul {")
|
||||
for property_name, value in bullet_list_style.items():
|
||||
css_property = property_name.replace("_", "-")
|
||||
css_parts.append(f" {css_property}: {value};")
|
||||
css_parts.append("}")
|
||||
|
||||
# Code blocks
|
||||
code_block_style = styles.get("code_block", {})
|
||||
css_parts.append("pre {")
|
||||
for property_name, value in code_block_style.items():
|
||||
css_property = property_name.replace("_", "-")
|
||||
css_parts.append(f" {css_property}: {value};")
|
||||
css_parts.append("}")
|
||||
|
||||
# Images
|
||||
image_style = styles.get("image", {})
|
||||
css_parts.append("img {")
|
||||
for property_name, value in image_style.items():
|
||||
css_property = property_name.replace("_", "-")
|
||||
css_parts.append(f" {css_property}: {value};")
|
||||
css_parts.append("}")
|
||||
|
||||
# Generated info
|
||||
css_parts.append(".generated-info {")
|
||||
css_parts.append(" font-size: 0.9em;")
|
||||
css_parts.append(" color: #666;")
|
||||
css_parts.append(" text-align: center;")
|
||||
css_parts.append(" margin-top: 2em;")
|
||||
css_parts.append(" padding-top: 1em;")
|
||||
css_parts.append(" border-top: 1px solid #ddd;")
|
||||
css_parts.append("}")
|
||||
|
||||
return '\n'.join(css_parts)
|
||||
|
||||
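`_generateCssStyles` above repeats the same property loop for every selector. One possible way to condense it is a lookup table from style key to CSS selector plus a single rule-rendering helper. This is a sketch only; `SELECTOR_BY_STYLE_KEY`, `renderCssRule` and `generateCss` are hypothetical names, and the fixed `.generated-info` rule is left out for brevity.

```python
from typing import Any, Dict

# CSS selector for each style key, in the order the original method emits them
SELECTOR_BY_STYLE_KEY = {
    "body": "body",
    "title": ".document-title",
    "heading1": "h1",
    "heading2": "h2",
    "paragraph": "p",
    "table": "table",
    "table_header": "th",
    "table_cell": "td",
    "bullet_list": "ul",
    "code_block": "pre",
    "image": "img",
}


def renderCssRule(selector: str, style: Dict[str, Any]) -> str:
    """Render one CSS rule; underscores in property names become hyphens."""
    lines = [f"{selector} {{"]
    for propertyName, value in style.items():
        lines.append(f"    {propertyName.replace('_', '-')}: {value};")
    lines.append("}")
    return "\n".join(lines)


def generateCss(styles: Dict[str, Any]) -> str:
    """Render every known style key; missing keys produce an empty rule."""
    return "\n".join(
        renderCssRule(selector, styles.get(key, {}))
        for key, selector in SELECTOR_BY_STYLE_KEY.items()
    )
```

Adding a new style key then only requires a new table entry instead of another copy of the loop.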
def _renderJsonSection(self, section: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a single JSON section to HTML using AI-generated styles.
|
||||
Supports three content formats: reference, object (base64), extracted_text.
|
||||
IMPORTANT: Respects sectionType (content_type) for correct rendering logic.
|
||||
"""
|
||||
try:
|
||||
sectionType = self._getSectionType(section)
|
||||
sectionData = self._getSectionData(section)
|
||||
|
||||
# IMPORTANT: Respect sectionType (content_type) FIRST, then process elements accordingly
|
||||
# Process elements according to section's content_type, not just element types
|
||||
|
||||
if sectionType == "table":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonTable(element, styles)
|
||||
return ""
|
||||
elif sectionType == "bullet_list":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonBulletList(element, styles)
|
||||
return ""
|
||||
elif sectionType == "heading":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonHeading(element, styles)
|
||||
return ""
|
||||
elif sectionType == "paragraph":
|
||||
# Process paragraph elements, including extracted_text
|
||||
if isinstance(sectionData, list):
|
||||
htmlParts = []
|
||||
for element in sectionData:
|
||||
element_type = element.get("type", "") if isinstance(element, dict) else ""
|
||||
|
||||
if element_type == "reference":
|
||||
doc_ref = element.get("documentReference", "")
|
||||
label = element.get("label", "Reference")
|
||||
htmlParts.append(f'<p class="reference"><em>[Reference: {label}]</em></p>')
|
||||
elif element_type == "extracted_text":
|
||||
content = element.get("content", "")
|
||||
source = element.get("source", "")
|
||||
if content:
|
||||
source_text = f' <small><em>(Source: {source})</em></small>' if source else ''
|
||||
htmlParts.append(f'<p>{content}{source_text}</p>')
|
||||
elif isinstance(element, dict):
|
||||
# Regular paragraph element - extract from nested content structure (standard JSON format)
|
||||
content = element.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
|
||||
if text:
|
||||
htmlParts.append(f'<p>{text}</p>')
|
||||
elif isinstance(element, str):
|
||||
htmlParts.append(f'<p>{element}</p>')
|
||||
|
||||
if htmlParts:
|
||||
return '\n'.join(htmlParts)
|
||||
# If sectionData is not a list, treat it as a dict
|
||||
if isinstance(sectionData, dict):
|
||||
return self._renderJsonParagraph(sectionData, styles)
|
||||
return ""
|
||||
elif sectionType == "code_block":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonCodeBlock(element, styles)
|
||||
return ""
|
||||
elif sectionType == "image":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonImage(element, styles)
|
||||
return ""
|
||||
else:
|
||||
# Fallback: Check for special element types first
|
||||
if isinstance(sectionData, list):
|
||||
htmlParts = []
|
||||
for element in sectionData:
|
||||
element_type = element.get("type", "") if isinstance(element, dict) else ""
|
||||
|
||||
if element_type == "reference":
|
||||
doc_ref = element.get("documentReference", "")
|
||||
label = element.get("label", "Reference")
|
||||
htmlParts.append(f'<p class="reference"><em>[Reference: {label}]</em></p>')
|
||||
elif element_type == "extracted_text":
|
||||
content = element.get("content", "")
|
||||
source = element.get("source", "")
|
||||
if content:
|
||||
source_text = f' <small><em>(Source: {source})</em></small>' if source else ''
|
||||
htmlParts.append(f'<p>{content}{source_text}</p>')
|
||||
|
||||
if htmlParts:
|
||||
return '\n'.join(htmlParts)
|
||||
# Fallback to paragraph for unknown types
|
||||
if isinstance(sectionData, dict):
|
||||
return self._renderJsonParagraph(sectionData, styles)
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering section {self._getSectionId(section)}: {str(e)}")
|
||||
return f'<div class="error">[Error rendering section: {str(e)}]</div>'
|
||||
|
||||
def _renderJsonTable(self, tableData: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a JSON table to HTML using AI-generated styles."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{headers, rows}
|
||||
content = tableData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
headers = content.get("headers", [])
|
||||
rows = content.get("rows", [])
|
||||
|
||||
if not headers or not rows:
|
||||
return ""
|
||||
|
||||
htmlParts = ['<table>']
|
||||
|
||||
# Table header
|
||||
htmlParts.append('<thead><tr>')
|
||||
for header in headers:
|
||||
htmlParts.append(f'<th>{header}</th>')
|
||||
htmlParts.append('</tr></thead>')
|
||||
|
||||
# Table body
|
||||
htmlParts.append('<tbody>')
|
||||
for row in rows:
|
||||
htmlParts.append('<tr>')
|
||||
for cellData in row:
|
||||
htmlParts.append(f'<td>{cellData}</td>')
|
||||
htmlParts.append('</tr>')
|
||||
htmlParts.append('</tbody>')
|
||||
|
||||
htmlParts.append('</table>')
|
||||
return '\n'.join(htmlParts)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering table: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonBulletList(self, listData: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a JSON bullet list to HTML using AI-generated styles."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{items}
|
||||
content = listData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
items = content.get("items", [])
|
||||
|
||||
if not items:
|
||||
return ""
|
||||
|
||||
htmlParts = ['<ul>']
|
||||
for item in items:
|
||||
if isinstance(item, str):
|
||||
htmlParts.append(f'<li>{item}</li>')
|
||||
elif isinstance(item, dict) and "text" in item:
|
||||
htmlParts.append(f'<li>{item["text"]}</li>')
|
||||
htmlParts.append('</ul>')
|
||||
|
||||
return '\n'.join(htmlParts)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering bullet list: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonHeading(self, headingData: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a JSON heading to HTML using AI-generated styles."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{text, level}
|
||||
content = headingData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
text = content.get("text", "")
|
||||
level = content.get("level", 1)
|
||||
|
||||
if text:
|
||||
level = max(1, min(6, level))
|
||||
return f'<h{level}>{text}</h{level}>'
|
||||
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering heading: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonParagraph(self, paragraphData: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a JSON paragraph to HTML using AI-generated styles."""
|
||||
try:
|
||||
# Normalize inputs - paragraphData is typically a list of elements from _getSectionData
|
||||
if isinstance(paragraphData, list):
|
||||
# Extract text from all paragraph elements (expects nested content structure)
|
||||
texts = []
|
||||
for el in paragraphData:
|
||||
if isinstance(el, dict):
|
||||
content = el.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
if text:
|
||||
texts.append(text)
|
||||
elif isinstance(el, str):
|
||||
texts.append(el)
|
||||
if texts:
|
||||
# Join multiple paragraphs with <p> tags
|
||||
return '\n'.join(f'<p>{text}</p>' for text in texts)
|
||||
return ""
|
||||
elif isinstance(paragraphData, str):
|
||||
return f'<p>{paragraphData}</p>'
|
||||
elif isinstance(paragraphData, dict):
|
||||
# Handle nested content structure: element.content vs element.text
|
||||
# Extract from nested content structure
|
||||
content = paragraphData.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
if text:
|
||||
return f'<p>{text}</p>'
|
||||
return ""
|
||||
else:
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering paragraph: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonCodeBlock(self, codeData: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a JSON code block to HTML using AI-generated styles."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{code, language}
|
||||
content = codeData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
code = content.get("code", "")
|
||||
language = content.get("language", "")
|
||||
|
||||
if code:
|
||||
if language:
|
||||
return f'<pre><code class="language-{language}">{code}</code></pre>'
|
||||
else:
|
||||
return f'<pre><code>{code}</code></pre>'
|
||||
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering code block: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonImage(self, imageData: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a JSON image to HTML with placeholder for later replacement. Expects nested content structure."""
|
||||
try:
|
||||
import html
|
||||
# Extract from nested content structure (standard JSON format)
|
||||
content = imageData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
|
||||
base64Data = content.get("base64Data", "")
|
||||
altText = content.get("altText", "Image")
|
||||
caption = content.get("caption", "")
|
||||
|
||||
# Escape HTML in altText and caption to prevent injection
|
||||
altTextEscaped = html.escape(str(altText))
|
||||
captionEscaped = html.escape(str(caption)) if caption else ""
|
||||
|
||||
if base64Data:
|
||||
# Use data URI as placeholder - will be replaced with file path in _replaceImageDataUris
|
||||
# Include a marker so we can find and replace it
|
||||
imageMarker = f"<!--IMAGE_MARKER:{len(base64Data)}:{altTextEscaped[:50]}-->"
|
||||
# Add max-width and max-height to ensure image fits within page dimensions
|
||||
# Typical page width is ~800-1200px, height varies but we limit to 600px for readability
|
||||
imgTag = f'<img src="data:image/png;base64,{base64Data}" alt="{altTextEscaped}" style="max-width: 100%; max-height: 600px; width: auto; height: auto;">'
|
||||
|
||||
if captionEscaped:
|
||||
return f'{imageMarker}<figure>{imgTag}<figcaption>{captionEscaped}</figcaption></figure>'
|
||||
else:
|
||||
return f'{imageMarker}{imgTag}'
|
||||
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error embedding image in HTML: {str(e)}")
|
||||
altText = imageData.get("altText", "Image") if isinstance(imageData, dict) else "Image"
|
||||
errorMsg = html.escape(f"[Error: Could not embed image '{altText}'. {str(e)}]")
|
||||
return f'<div class="error" style="color: red; padding: 10px; border: 1px solid red;">{errorMsg}</div>'
|
||||
|
||||
def _extractImages(self, jsonContent: Dict[str, Any]) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Extract all images from JSON structure.
|
||||
|
||||
Returns:
|
||||
List of image data dictionaries with base64Data, altText, caption, sectionId, filename, mimeType
|
||||
"""
|
||||
images = []
|
||||
|
||||
try:
|
||||
# Extract from standardized schema: {metadata: {...}, documents: [{sections: [...]}]}
|
||||
documents = jsonContent.get("documents", [])
|
||||
if not documents or not isinstance(documents, list):
|
||||
return images
|
||||
|
||||
for doc in documents:
|
||||
if not isinstance(doc, dict):
|
||||
continue
|
||||
sections = doc.get("sections", [])
|
||||
for section in sections:
|
||||
if section.get("content_type") == "image":
|
||||
elements = section.get("elements", [])
|
||||
for element in elements:
|
||||
# Extract from nested content structure
|
||||
content = element.get("content", {})
|
||||
base64Data = ""
|
||||
|
||||
if isinstance(content, dict):
|
||||
base64Data = content.get("base64Data", "")
|
||||
elif isinstance(content, str):
|
||||
# Content might be base64 string directly (shouldn't happen)
|
||||
pass
|
||||
|
||||
# If base64Data not found in content, try direct element fields (fallback)
|
||||
if not base64Data:
|
||||
base64Data = element.get("base64Data", "")
|
||||
|
||||
# If base64Data still not found, try extracting from url data URI
|
||||
if not base64Data:
|
||||
url = element.get("url", "") or (content.get("url", "") if isinstance(content, dict) else "")
|
||||
if url and isinstance(url, str) and url.startswith("data:image/"):
|
||||
# Extract base64 from data URI: data:image/png;base64,<base64>
|
||||
import re
|
||||
match = re.match(r'data:image/[^;]+;base64,(.+)', url)
|
||||
if match:
|
||||
base64Data = match.group(1)
|
||||
|
||||
if base64Data:
|
||||
sectionId = section.get("id", "unknown")
|
||||
|
||||
# Determine MIME type and extension
|
||||
mimeType = element.get("mimeType", "") or (content.get("mimeType", "") if isinstance(content, dict) else "")
|
||||
if not mimeType or mimeType == "unknown":
|
||||
# Try to detect the MIME type from the base64 prefix
|
||||
if base64Data.startswith("/9j/"):
|
||||
mimeType = "image/jpeg"
|
||||
elif base64Data.startswith("iVBORw0KGgo"):
|
||||
mimeType = "image/png"
|
||||
else:
|
||||
mimeType = "image/png" # Default
|
||||
|
||||
# Determine extension based on MIME type
|
||||
extension = "png"
|
||||
if mimeType == "image/jpeg" or mimeType == "image/jpg":
|
||||
extension = "jpg"
|
||||
elif mimeType == "image/png":
|
||||
extension = "png"
|
||||
elif mimeType == "image/gif":
|
||||
extension = "gif"
|
||||
elif mimeType == "image/webp":
|
||||
extension = "webp"
|
||||
|
||||
# Generate filename from section ID
|
||||
filename = f"{sectionId}.{extension}"
|
||||
# Clean filename (remove invalid characters)
|
||||
filename = "".join(c if c.isalnum() or c in "._-" else "_" for c in filename)
|
||||
|
||||
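images.append({
|
||||
"base64Data": base64Data,
|
||||
# Prefer the nested content fields (the structure _renderJsonImage reads), fall back to element fields
"altText": (content.get("altText") if isinstance(content, dict) else None) or element.get("altText", "Image"),
|
||||
"caption": (content.get("caption") if isinstance(content, dict) else None) or element.get("caption"),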
|
||||
"sectionId": sectionId,
|
||||
"filename": filename,
|
||||
"mimeType": mimeType
|
||||
})
|
||||
self.logger.debug(f"Extracted image from section {sectionId}: {filename}")
|
||||
|
||||
self.logger.info(f"Extracted {len(images)} image(s) from JSON structure")
|
||||
return images
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error extracting images: {str(e)}")
|
||||
return []
|
||||
|
||||
def _replaceImageDataUris(self, htmlContent: str, images: List[Dict[str, Any]]) -> str:
|
||||
"""
|
||||
Replace base64 data URIs in HTML with relative file paths.
|
||||
|
||||
Args:
|
||||
htmlContent: HTML content with data URIs
|
||||
images: List of image data dictionaries
|
||||
|
||||
Returns:
|
||||
HTML content with relative file paths
|
||||
"""
|
||||
try:
|
||||
import base64
|
||||
import re
|
||||
|
||||
# Find entire img tags with data URIs and replace them
|
||||
# Pattern: <img src="data:image/[type];base64,<base64>" [other attributes]>
|
||||
imgTagPattern = r'<img\s+src="data:image/[^"]+"[^>]*>'
|
||||
|
||||
def replaceImgTag(match):
|
||||
imgTag = match.group(0)
|
||||
|
||||
# Extract base64 data from the img tag
|
||||
base64Match = re.search(r'data:image/[^;]+;base64,([A-Za-z0-9+/=]+)', imgTag)
|
||||
if not base64Match:
|
||||
return imgTag # Return original if no base64 found
|
||||
|
||||
base64Data = base64Match.group(1)
|
||||
|
||||
# Find matching image in images list
|
||||
matchingImage = None
|
||||
for img in images:
|
||||
imgBase64 = img.get("base64Data", "")
|
||||
# Compare base64 data (lengths may differ because of padding)
|
||||
if imgBase64 == base64Data or imgBase64.startswith(base64Data[:100]) or base64Data.startswith(imgBase64[:100]):
|
||||
matchingImage = img
|
||||
break
|
||||
|
||||
if matchingImage:
|
||||
import html
|
||||
# Use filename from image data (generated from section ID)
|
||||
filename = matchingImage.get("filename", f"image_{images.index(matchingImage) + 1}.png")
|
||||
|
||||
# Extract existing alt text or use from matchingImage
|
||||
altMatch = re.search(r'alt="([^"]*)"', imgTag)
|
||||
existingAlt = altMatch.group(1) if altMatch else ""
|
||||
altText = html.escape(str(matchingImage.get("altText", existingAlt or "Image")))
|
||||
caption = html.escape(str(matchingImage.get("caption", ""))) if matchingImage.get("caption") else ""
|
||||
|
||||
# Create new img tag with filename
|
||||
imgTag = f'<img src="{filename}" alt="{altText}">'
|
||||
|
||||
if caption:
|
||||
return f'<figure>{imgTag}<figcaption>{caption}</figcaption></figure>'
|
||||
else:
|
||||
return imgTag
|
||||
else:
|
||||
# Keep original if no match found
|
||||
return match.group(0)
|
||||
|
||||
# Replace all img tags that carry data URIs (leftover IMAGE_MARKER comments are removed afterwards)
|
||||
updatedHtml = re.sub(imgTagPattern, replaceImgTag, htmlContent)
|
||||
# Remove any leftover IMAGE_MARKER comments
|
||||
updatedHtml = re.sub(r'<!--IMAGE_MARKER:[^>]+-->', '', updatedHtml)
|
||||
|
||||
return updatedHtml
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error replacing image data URIs: {str(e)}")
|
||||
return htmlContent # Return original if replacement fails
|
||||
|
||||
def getRenderedImages(self) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Get images that were extracted during rendering.
|
||||
Returns list of image dicts with base64Data, altText, caption, and filename.
|
||||
"""
|
||||
if not hasattr(self, '_renderedImages'):
|
||||
return []
|
||||
return self._renderedImages
|
||||
|
|
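Reviewer note: the image handling in `_extractImages` above (data-URI parsing, MIME detection from the base64 prefix, filename cleaning) can be read in isolation as the minimal sketch below. The helper names `sniffMimeType`, `splitDataUri` and `imageFilename` are hypothetical and not part of this commit. The base64 prefixes and the extension mapping are the ones the diff checks; capturing the MIME type from the data URI is an added assumption.

```python
import re
from typing import Optional, Tuple

# Hypothetical helpers, not part of the commit; they mirror the checks in _extractImages.
_DATA_URI = re.compile(r"data:(image/[^;]+);base64,(.+)", re.DOTALL)

_EXTENSIONS = {
    "image/jpeg": "jpg",
    "image/jpg": "jpg",
    "image/png": "png",
    "image/gif": "gif",
    "image/webp": "webp",
}


def sniffMimeType(base64Data: str) -> str:
    """Guess the MIME type from the first base64 characters; default to PNG."""
    if base64Data.startswith("/9j/"):
        return "image/jpeg"
    if base64Data.startswith("iVBORw0KGgo"):
        return "image/png"
    return "image/png"


def splitDataUri(url: str) -> Optional[Tuple[str, str]]:
    """Return (mimeType, base64Data) for an image data URI, or None for any other URL."""
    match = _DATA_URI.match(url)
    if not match:
        return None
    return match.group(1), match.group(2)


def imageFilename(sectionId: str, mimeType: str) -> str:
    """Build a filename from the section ID, replacing unsafe characters with '_'."""
    extension = _EXTENSIONS.get(mimeType, "png")
    raw = f"{sectionId}.{extension}"
    return "".join(c if c.isalnum() or c in "._-" else "_" for c in raw)
```

Taking the MIME type from the data URI when one is present would avoid the prefix guess, which only distinguishes JPEG and PNG.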
@ -0,0 +1,355 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Image renderer for report generation using AI image generation.
|
||||
"""
|
||||
|
||||
from .documentRendererBaseTemplate import BaseRenderer
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List, Optional
|
||||
import logging
|
||||
import base64
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class RendererImage(BaseRenderer):
|
||||
"""Renders content to image format using AI image generation."""
|
||||
|
||||
@classmethod
|
||||
def getSupportedFormats(cls) -> List[str]:
|
||||
"""Return supported image formats."""
|
||||
return ['png', 'jpg', 'jpeg', 'image']
|
||||
|
||||
@classmethod
|
||||
def getFormatAliases(cls) -> List[str]:
|
||||
"""Return format aliases."""
|
||||
return ['img', 'picture', 'photo', 'graphic']
|
||||
|
||||
@classmethod
|
||||
def getPriority(cls) -> int:
|
||||
"""Return priority for image renderer."""
|
||||
return 90
|
||||
|
||||
@classmethod
|
||||
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
|
||||
"""Return output style classification: Images are visual media."""
|
||||
return 'image'
|
||||
|
||||
@classmethod
|
||||
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
|
||||
"""
|
||||
Return list of section content types that Image renderer accepts.
|
||||
Image renderer only accepts image sections (images are generated from image sections).
|
||||
"""
|
||||
return ["image"]
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to image format using AI image generation."""
|
||||
try:
|
||||
# Generate AI image from content
|
||||
imageContent = await self._generateAiImage(extractedContent, title, userPrompt, aiService)
|
||||
|
||||
# Determine filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
filename = documents[0].get("filename")
|
||||
if not filename:
|
||||
filename = self._determineFilename(title, "image/png")
|
||||
else:
|
||||
filename = self._determineFilename(title, "image/png")
|
||||
|
||||
# Convert image content to bytes (base64 string or bytes)
|
||||
if isinstance(imageContent, str):
|
||||
try:
|
||||
imageBytes = base64.b64decode(imageContent)
|
||||
except Exception:
|
||||
imageBytes = imageContent.encode('utf-8')
|
||||
else:
|
||||
imageBytes = imageContent
|
||||
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=imageBytes,
|
||||
mimeType="image/png",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering image: {str(e)}")
|
||||
# Re-raise the exception instead of using fallback
|
||||
raise Exception(f"Image rendering failed: {str(e)}") from e
|
||||
|
||||
async def _generateAiImage(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> str:
|
||||
"""Generate AI image from extracted content."""
|
||||
try:
|
||||
if not aiService:
|
||||
raise ValueError("AI service is required for image generation")
|
||||
|
||||
# Validate JSON structure (standardized schema: {metadata: {...}, documents: [{sections: [...]}]})
|
||||
if not self._validateJsonStructure(extractedContent):
|
||||
raise ValueError("Extracted content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
# Extract metadata from standardized schema
|
||||
metadata = self._extractMetadata(extractedContent)
|
||||
|
||||
# Use provided title (which comes from documents[].title) as primary source
|
||||
# Fallback to metadata.title only if title parameter is empty
|
||||
documentTitle = title if title else metadata.get("title", "Generated Document")
|
||||
|
||||
# Create AI prompt for image generation
|
||||
imagePrompt = await self._createImageGeneratePrompt(extractedContent, documentTitle, userPrompt, aiService)
|
||||
|
||||
# Save image generation prompt to debug
|
||||
aiService.services.utils.writeDebugFile(imagePrompt, "image_generation_prompt")
|
||||
|
||||
# Format prompt as JSON with image generation parameters
|
||||
from modules.datamodels.datamodelAi import AiCallPromptImage, AiCallOptions, OperationTypeEnum
|
||||
import json
|
||||
|
||||
promptModel = AiCallPromptImage(
|
||||
prompt=imagePrompt,
|
||||
size="1024x1024",
|
||||
quality="standard",
|
||||
style="vivid"
|
||||
)
|
||||
promptJson = promptModel.model_dump_json(exclude_none=True, indent=2)
|
||||
|
||||
# Build call options for image generation
|
||||
options = AiCallOptions(
|
||||
operationType=OperationTypeEnum.IMAGE_GENERATE,
|
||||
resultFormat="base64"
|
||||
)
|
||||
|
||||
# Use unified callAiContent method
|
||||
imageResponse = await aiService.callAiContent(
|
||||
prompt=promptJson,
|
||||
options=options,
|
||||
outputFormat="base64"
|
||||
)
|
||||
|
||||
# Save image generation response to debug
|
||||
aiService.services.utils.writeDebugFile(str(imageResponse.content), "image_generation_response")
|
||||
|
||||
# Extract base64 image data from AiResponse
|
||||
# AiResponse.documents contains DocumentData objects
|
||||
if imageResponse.documents and len(imageResponse.documents) > 0:
|
||||
imageData = imageResponse.documents[0].documentData
|
||||
if imageData:
|
||||
return imageData
|
||||
|
||||
# Fallback: check content field (might be base64 string)
|
||||
if imageResponse.content:
|
||||
return imageResponse.content
|
||||
|
||||
raise ValueError("No image data returned from AI")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error generating AI image: {str(e)}")
|
||||
raise Exception(f"AI image generation failed: {str(e)}") from e
|
||||
|
||||
async def _createImageGeneratePrompt(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> str:
|
||||
"""Create a detailed prompt for AI image generation based on the content."""
|
||||
try:
|
||||
# Start with base prompt
|
||||
promptParts = []
|
||||
|
||||
# Add user's original intent if available
|
||||
if userPrompt:
|
||||
sanitized_prompt = aiService.services.utils.sanitizePromptContent(userPrompt, 'userinput') if aiService else userPrompt
|
||||
promptParts.append(f"User Request: {sanitized_prompt}")
|
||||
|
||||
# Add document title
|
||||
promptParts.append(f"Document Title: {title}")
|
||||
|
||||
# Analyze content and create visual description
|
||||
sections = self._extractSections(extractedContent)
|
||||
contentDescription = self._analyzeContentForVisualDescription(sections)
|
||||
|
||||
if contentDescription:
|
||||
promptParts.append(f"Content to Visualize: {contentDescription}")
|
||||
|
||||
# Add style guidance
|
||||
styleGuidance = self._getStyleGuidanceFromContent(extractedContent, userPrompt)
|
||||
if styleGuidance:
|
||||
promptParts.append(f"Visual Style: {styleGuidance}")
|
||||
|
||||
# Combine all parts
|
||||
fullPrompt = "Create a professional, informative image that visualizes the following content:\n\n" + "\n\n".join(promptParts)
|
||||
|
||||
# Add technical requirements
|
||||
fullPrompt += "\n\nTechnical Requirements:"
|
||||
fullPrompt += "\n- High quality, professional appearance"
|
||||
fullPrompt += "\n- Clear, readable text if any text is included"
|
||||
fullPrompt += "\n- Appropriate colors and layout"
|
||||
fullPrompt += "\n- Suitable for business/professional use"
|
||||
|
||||
# Truncate prompt if it exceeds DALL-E's 4000 character limit
|
||||
if len(fullPrompt) > 4000:
|
||||
# Use AI to compress the prompt intelligently
|
||||
compressedPrompt = await self._compressPromptWithAi(fullPrompt, aiService)
|
||||
if compressedPrompt and len(compressedPrompt) <= 4000:
|
||||
return compressedPrompt
|
||||
|
||||
# Fallback to minimal prompt if AI compression fails or is still too long
|
||||
minimalPrompt = f"Create a professional image representing: {title}"
|
||||
if userPrompt:
|
||||
sanitized_prompt = aiService.services.utils.sanitizePromptContent(userPrompt, 'userinput') if aiService else userPrompt
|
||||
minimalPrompt += f" - {sanitized_prompt}"
|
||||
|
||||
# If even the minimal prompt is too long, truncate it
|
||||
if len(minimalPrompt) > 4000:
|
||||
minimalPrompt = minimalPrompt[:3997] + "..."
|
||||
|
||||
return minimalPrompt
|
||||
|
||||
return fullPrompt
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error creating image prompt: {str(e)}")
|
||||
# Fallback to simple prompt
|
||||
return f"Create a professional image representing: {title}"
|
||||
|
||||
async def _compressPromptWithAi(self, longPrompt: str, aiService=None) -> Optional[str]:
|
||||
"""Use AI to intelligently compress a long prompt while preserving key information."""
|
||||
try:
|
||||
if not aiService:
|
||||
return None
|
||||
|
||||
compressionPrompt = f"""
|
||||
You are an expert at creating concise, effective prompts for AI image generation.
|
||||
|
||||
The following prompt is too long for DALL-E (4000 character limit) and needs to be compressed to under 4000 characters while preserving the most important visual information.
|
||||
|
||||
Original prompt ({len(longPrompt)} characters):
|
||||
{longPrompt}
|
||||
|
||||
Please create a compressed version that:
|
||||
1. Keeps the most important visual elements and requirements
|
||||
2. Maintains the core intent and style guidance
|
||||
3. Preserves technical requirements
|
||||
4. Stays under 4000 characters
|
||||
5. Is optimized for DALL-E image generation
|
||||
|
||||
Return only the compressed prompt, no explanations.
|
||||
"""
|
||||
|
||||
# Use AI to compress the prompt - call the AI service correctly
|
||||
# The ai_service has an aiObjects attribute that contains the actual AI interface
|
||||
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
|
||||
|
||||
request = AiCallRequest(
|
||||
prompt=compressionPrompt,
|
||||
options=AiCallOptions(
|
||||
operationType=OperationTypeEnum.DATA_GENERATE,
|
||||
maxTokens=None, # Let the model use its full context length
|
||||
temperature=0.3 # Lower temperature for more consistent compression
|
||||
)
|
||||
)
|
||||
|
||||
response = await aiService.callAi(request)
|
||||
compressed = response.content.strip()
|
||||
|
||||
# Validate the compressed prompt
|
||||
if compressed and len(compressed) <= 4000 and len(compressed) > 50:
|
||||
self.logger.info(f"Successfully compressed prompt from {len(longPrompt)} to {len(compressed)} characters")
|
||||
return compressed
|
||||
else:
|
||||
self.logger.warning(f"AI compression failed or produced invalid result: {len(compressed) if compressed else 0} chars")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error compressing prompt with AI: {str(e)}")
|
||||
return None
|
||||
|
||||
def _analyzeContentForVisualDescription(self, sections: List[Dict[str, Any]]) -> str:
|
||||
"""Analyze content sections and create a visual description for AI."""
|
||||
try:
|
||||
descriptions = []
|
||||
|
||||
for section in sections:
|
||||
sectionType = self._getSectionType(section)
|
||||
sectionData = self._getSectionData(section)
|
||||
|
||||
if sectionType == "table":
|
||||
headers = sectionData.get("headers", [])
|
||||
rows = sectionData.get("rows", [])
|
||||
if headers and rows:
|
||||
descriptions.append(f"Data table with {len(headers)} columns and {len(rows)} rows: {', '.join(headers)}")
|
||||
|
||||
elif sectionType == "bullet_list":
|
||||
items = sectionData.get("items", [])
|
||||
if items:
|
||||
descriptions.append(f"List with {len(items)} items")
|
||||
|
||||
elif sectionType == "heading":
|
||||
text = sectionData.get("text", "")
|
||||
level = sectionData.get("level", 1)
|
||||
if text:
|
||||
descriptions.append(f"Heading {level}: {text}")
|
||||
|
||||
elif sectionType == "paragraph":
|
||||
text = sectionData.get("text", "")
|
||||
if text and len(text) > 10: # Only include substantial paragraphs
|
||||
# Truncate long text
|
||||
truncated = text[:100] + "..." if len(text) > 100 else text
|
||||
descriptions.append(f"Text content: {truncated}")
|
||||
|
||||
elif sectionType == "code_block":
|
||||
code = sectionData.get("code", "")
|
||||
language = sectionData.get("language", "")
|
||||
if code:
|
||||
descriptions.append(f"Code block ({language}): {code[:50]}...")
|
||||
|
||||
return "; ".join(descriptions) if descriptions else "General document content"
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error analyzing content: {str(e)}")
|
||||
return "Document content"
|
||||
|
||||
def _getStyleGuidanceFromContent(self, extractedContent: Dict[str, Any], userPrompt: str = None) -> str:
|
||||
"""Determine visual style guidance based on content and user prompt."""
|
||||
try:
|
||||
styleElements = []
|
||||
|
||||
# Analyze user prompt for style hints
|
||||
if userPrompt:
|
||||
promptLower = userPrompt.lower()
|
||||
|
||||
if any(word in promptLower for word in ["modern", "contemporary", "sleek"]):
|
||||
styleElements.append("modern, clean design")
|
||||
elif any(word in promptLower for word in ["classic", "traditional", "formal"]):
|
||||
styleElements.append("classic, formal design")
|
||||
elif any(word in promptLower for word in ["creative", "artistic", "colorful"]):
|
||||
styleElements.append("creative, artistic design")
|
||||
elif any(word in promptLower for word in ["corporate", "business", "professional"]):
|
||||
styleElements.append("corporate, professional design")
|
||||
|
||||
# Analyze content type for additional style hints
|
||||
sections = self._extractSections(extractedContent)
|
||||
hasTables = any(self._getSectionType(s) == "table" for s in sections)
|
||||
hasLists = any(self._getSectionType(s) == "bullet_list" for s in sections)
|
||||
hasCode = any(self._getSectionType(s) == "code_block" for s in sections)
|
||||
|
||||
if hasTables:
|
||||
styleElements.append("data-focused layout")
|
||||
if hasLists:
|
||||
styleElements.append("organized, structured presentation")
|
||||
if hasCode:
|
||||
styleElements.append("technical, developer-friendly")
|
||||
|
||||
# Default style if no specific guidance
|
||||
if not styleElements:
|
||||
styleElements.append("professional, clean design")
|
||||
|
||||
return ", ".join(styleElements)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error determining style guidance: {str(e)}")
|
||||
return "professional design"
|
||||
|
|
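Reviewer note: the prompt-length handling in `_createImageGeneratePrompt` above follows a three-step fallback: use the full prompt if it fits, otherwise try a compressed version, otherwise fall back to a minimal title-based prompt truncated to the limit. A minimal synchronous sketch follows, assuming a hypothetical `capPrompt` helper and a plain callable in place of the async `_compressPromptWithAi`. The 4000-character limit is the value used in the diff, and the user-prompt suffix of the minimal prompt is omitted for brevity.

```python
from typing import Callable, Optional

PROMPT_LIMIT = 4000  # character limit used by the renderer in this diff


def capPrompt(
    fullPrompt: str,
    title: str,
    compress: Optional[Callable[[str], Optional[str]]] = None,
    limit: int = PROMPT_LIMIT,
) -> str:
    """Return a prompt no longer than `limit` characters (hypothetical helper)."""
    # Step 1: the full prompt already fits.
    if len(fullPrompt) <= limit:
        return fullPrompt

    # Step 2: try the compressor, accept its result only if it fits.
    if compress is not None:
        compressed = compress(fullPrompt)
        if compressed and len(compressed) <= limit:
            return compressed

    # Step 3: minimal prompt built from the title, hard-truncated if needed.
    minimal = f"Create a professional image representing: {title}"
    if len(minimal) > limit:
        minimal = minimal[: limit - 3] + "..."
    return minimal
```

Keeping the fallback chain in one pure function makes the length guarantee easy to unit-test without an AI service.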
@ -0,0 +1,129 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
JSON renderer for report generation.
|
||||
"""
|
||||
|
||||
from .documentRendererBaseTemplate import BaseRenderer
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List, Optional
|
||||
import json
|
||||
|
||||
class RendererJson(BaseRenderer):
|
||||
"""Renders content to JSON format with format-specific extraction."""
|
||||
|
||||
@classmethod
|
||||
def getSupportedFormats(cls) -> List[str]:
|
||||
"""Return supported JSON formats."""
|
||||
return ['json']
|
||||
|
||||
@classmethod
|
||||
def getFormatAliases(cls) -> List[str]:
|
||||
"""Return format aliases."""
|
||||
return ['data']
|
||||
|
||||
@classmethod
|
||||
def getPriority(cls) -> int:
|
||||
"""Return priority for JSON renderer."""
|
||||
return 80
|
||||
|
||||
@classmethod
|
||||
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
|
||||
"""Return output style classification: JSON document renderer converts structured document content to JSON."""
|
||||
return 'document'
|
||||
|
||||
@classmethod
|
||||
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
|
||||
"""
|
||||
Return list of section content types that JSON renderer accepts.
|
||||
JSON renderer accepts all section types except images (images cannot be serialized to JSON).
|
||||
"""
|
||||
from modules.datamodels.datamodelJson import supportedSectionTypes
|
||||
# Return all types except image
|
||||
return [st for st in supportedSectionTypes if st != "image"]
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: Optional[str] = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to JSON format."""
|
||||
try:
|
||||
# The extracted content should already be JSON from the AI
|
||||
# Just validate and format it
|
||||
jsonContent = self._cleanJsonContent(extractedContent, title)
|
||||
|
||||
# Determine filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
filename = documents[0].get("filename")
|
||||
if not filename:
|
||||
filename = self._determineFilename(title, "application/json")
|
||||
else:
|
||||
filename = self._determineFilename(title, "application/json")
|
||||
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=jsonContent.encode('utf-8'),
|
||||
mimeType="application/json",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering JSON: {str(e)}")
|
||||
# Return minimal JSON fallback
|
||||
fallbackData = {
|
||||
"title": title,
|
||||
"sections": [{"content_type": "paragraph", "elements": [{"text": f"Error rendering report: {str(e)}"}]}],
|
||||
"metadata": {"error": str(e)}
|
||||
}
|
||||
fallbackContent = json.dumps(fallbackData, indent=2)
|
||||
metadata = extractedContent.get("metadata", {}) if isinstance(extractedContent, dict) else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=fallbackContent.encode('utf-8'),
|
||||
mimeType="application/json",
|
||||
filename=self._determineFilename(title, "application/json"),
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
def _cleanJsonContent(self, content: Dict[str, Any], title: str) -> str:
|
||||
"""Clean and validate JSON content from AI."""
|
||||
try:
|
||||
# Validate JSON structure
|
||||
if not isinstance(content, dict):
|
||||
raise ValueError("Content must be a dictionary")
|
||||
|
||||
# Ensure it has the expected structure
|
||||
if "sections" not in content:
|
||||
# Convert old format to new format
|
||||
content = {
|
||||
"sections": [{"content_type": "paragraph", "elements": [{"text": str(content)}]}],
|
||||
"metadata": {"title": title}
|
||||
}
|
||||
|
||||
# Ensure metadata exists
|
||||
if "metadata" not in content:
|
||||
content["metadata"] = {}
|
||||
|
||||
# Set title in metadata if not present
|
||||
if "title" not in content["metadata"]:
|
||||
content["metadata"]["title"] = title
|
||||
|
||||
# Re-format with proper indentation
|
||||
return json.dumps(content, indent=2, ensure_ascii=False)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error cleaning JSON content: {str(e)}")
|
||||
# Return minimal valid JSON
|
||||
fallbackData = {
|
||||
"sections": [{"content_type": "paragraph", "elements": [{"text": str(content)}]}],
|
||||
"metadata": {"title": title, "error": str(e)}
|
||||
}
|
||||
return json.dumps(fallbackData, indent=2, ensure_ascii=False)
|
||||
|
|
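The `_cleanJsonContent` method above normalizes whatever the AI returned into a `sections` plus `metadata` shape before serializing it. The sketch below reproduces that normalization as a standalone function so it can be tried in isolation. The name `normalize_report_json` is hypothetical, and unlike the original it wraps non-dict input directly instead of raising and recording an `error` entry in the metadata.

```python
import json
from typing import Any


def normalize_report_json(content: Any, title: str) -> str:
    """Hypothetical standalone version of the normalization step."""
    if not isinstance(content, dict) or "sections" not in content:
        # Wrap anything without a "sections" key into one paragraph section.
        content = {
            "sections": [
                {"content_type": "paragraph", "elements": [{"text": str(content)}]}
            ],
            "metadata": {"title": title},
        }

    # Copy the metadata so the caller's dict is not mutated.
    metadata = dict(content.get("metadata") or {})
    metadata.setdefault("title", title)
    normalized = {**content, "metadata": metadata}

    # ensure_ascii=False keeps non-ASCII text readable in the output.
    return json.dumps(normalized, indent=2, ensure_ascii=False)


legacy = json.loads(normalize_report_json({"summary": "ok"}, "Weekly Report"))
```

An existing `metadata.title` wins over the `title` argument, which matches the order of checks in the method above.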
@ -0,0 +1,349 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Markdown renderer for report generation.
|
||||
"""
|
||||
|
||||
from .documentRendererBaseTemplate import BaseRenderer
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List, Optional
|
||||
|
||||
class RendererMarkdown(BaseRenderer):
|
||||
"""Renders content to Markdown format with format-specific extraction."""
|
||||
|
||||
@classmethod
|
||||
def getSupportedFormats(cls) -> List[str]:
|
||||
"""Return supported Markdown formats."""
|
||||
return ['md', 'markdown']
|
||||
|
||||
@classmethod
|
||||
def getFormatAliases(cls) -> List[str]:
|
||||
"""Return format aliases."""
|
||||
return ['mdown', 'mkd']
|
||||
|
||||
@classmethod
|
||||
def getPriority(cls) -> int:
|
||||
"""Return priority for markdown renderer."""
|
||||
return 95
|
||||
|
||||
@classmethod
|
||||
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
|
||||
"""Return output style classification: Markdown documents are formatted documents."""
|
||||
return 'document'
|
||||
|
||||
@classmethod
|
||||
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
|
||||
"""
|
||||
Return list of section content types that Markdown renderer accepts.
|
||||
Markdown renderer accepts all section types except images.
|
||||
"""
|
||||
from modules.datamodels.datamodelJson import supportedSectionTypes
|
||||
return [st for st in supportedSectionTypes if st != "image"]
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: Optional[str] = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to Markdown format."""
|
||||
try:
|
||||
# Generate markdown from JSON structure
|
||||
markdownContent = self._generateMarkdownFromJson(extractedContent, title)
|
||||
|
||||
# Determine filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
filename = documents[0].get("filename")
|
||||
if not filename:
|
||||
filename = self._determineFilename(title, "text/markdown")
|
||||
else:
|
||||
filename = self._determineFilename(title, "text/markdown")
|
||||
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=markdownContent.encode('utf-8'),
|
||||
mimeType="text/markdown",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering markdown: {str(e)}")
|
||||
# Return minimal markdown fallback
|
||||
fallbackContent = f"# {title}\n\nError rendering report: {str(e)}"
|
||||
metadata = extractedContent.get("metadata", {}) if isinstance(extractedContent, dict) else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=fallbackContent.encode('utf-8'),
|
||||
mimeType="text/markdown",
|
||||
filename=self._determineFilename(title, "text/markdown"),
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
def _generateMarkdownFromJson(self, jsonContent: Dict[str, Any], title: str) -> str:
|
||||
"""Generate markdown content from structured JSON document."""
|
||||
try:
|
||||
# Validate JSON structure (standardized schema: {metadata: {...}, documents: [{sections: [...]}]})
|
||||
if not self._validateJsonStructure(jsonContent):
|
||||
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
# Extract sections and metadata from standardized schema
|
||||
sections = self._extractSections(jsonContent)
|
||||
metadata = self._extractMetadata(jsonContent)
|
||||
|
||||
# Use provided title (which comes from documents[].title) as primary source
|
||||
# Fallback to metadata.title only if title parameter is empty
|
||||
documentTitle = title if title else metadata.get("title", "Generated Document")
|
||||
|
||||
# Build markdown content
|
||||
markdownParts = []
|
||||
|
||||
# Document title
|
||||
markdownParts.append(f"# {documentTitle}")
|
||||
markdownParts.append("")
|
||||
|
||||
# Process each section
|
||||
for section in sections:
|
||||
sectionMarkdown = self._renderJsonSection(section)
|
||||
if sectionMarkdown:
|
||||
markdownParts.append(sectionMarkdown)
|
||||
markdownParts.append("") # Add spacing between sections
|
||||
|
||||
# Add generation info
|
||||
markdownParts.append("---")
|
||||
markdownParts.append(f"*Generated: {self._formatTimestamp()}*")
|
||||
|
||||
return '\n'.join(markdownParts)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error generating markdown from JSON: {str(e)}")
|
||||
raise Exception(f"Markdown generation failed: {str(e)}") from e
|
||||
|
||||
def _renderJsonSection(self, section: Dict[str, Any]) -> str:
|
||||
"""Render a single JSON section to markdown.
|
||||
Supports three content formats: reference, object (base64), extracted_text.
|
||||
"""
|
||||
try:
|
||||
sectionType = self._getSectionType(section)
|
||||
sectionData = self._getSectionData(section)
|
||||
|
||||
# Check for three content formats from Phase 5D in elements
|
||||
if isinstance(sectionData, list):
|
||||
markdownParts = []
|
||||
for element in sectionData:
|
||||
element_type = element.get("type", "") if isinstance(element, dict) else ""
|
||||
|
||||
# Handle reference and extracted_text elements here; other elements (including base64 objects) fall through to the type-specific rendering below
|
||||
if element_type == "reference":
|
||||
# Document reference format
|
||||
doc_ref = element.get("documentReference", "")
|
||||
label = element.get("label", "Reference")
|
||||
markdownParts.append(f"*[Reference: {label}]*")
|
||||
continue
|
||||
elif element_type == "extracted_text":
|
||||
# Extracted text format
|
||||
content = element.get("content", "")
|
||||
source = element.get("source", "")
|
||||
if content:
|
||||
source_text = f" *(Source: {source})*" if source else ""
|
||||
markdownParts.append(f"{content}{source_text}")
|
||||
continue
|
||||
|
||||
# If we processed reference/extracted_text elements, return them
|
||||
if markdownParts:
|
||||
return '\n\n'.join(markdownParts)
|
||||
|
||||
if sectionType == "table":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonTable(element)
|
||||
return ""
|
||||
elif sectionType == "bullet_list":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonBulletList(element)
|
||||
return ""
|
||||
elif sectionType == "heading":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonHeading(element)
|
||||
return ""
|
||||
elif sectionType == "paragraph":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonParagraph(element)
|
||||
elif isinstance(sectionData, dict):
|
||||
return self._renderJsonParagraph(sectionData)
|
||||
return ""
|
||||
elif sectionType == "code_block":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonCodeBlock(element)
|
||||
return ""
|
||||
elif sectionType == "image":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonImage(element)
|
||||
return ""
|
||||
else:
|
||||
# Fallback to paragraph for unknown types
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonParagraph(element)
|
||||
elif isinstance(sectionData, dict):
|
||||
return self._renderJsonParagraph(sectionData)
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering section {self._getSectionId(section)}: {str(e)}")
|
||||
return f"*[Error rendering section: {str(e)}]*"
|
||||
|
||||
def _renderJsonTable(self, tableData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON table to markdown."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{headers, rows}
|
||||
content = tableData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
headers = content.get("headers", [])
|
||||
rows = content.get("rows", [])
|
||||
|
||||
if not headers or not rows:
|
||||
return ""
|
||||
|
||||
markdownParts = []
|
||||
|
||||
# Create table header
|
||||
headerLine = " | ".join(str(header) for header in headers)
|
||||
markdownParts.append(headerLine)
|
||||
|
||||
# Add separator line
|
||||
separatorLine = " | ".join("---" for _ in headers)
|
||||
markdownParts.append(separatorLine)
|
||||
|
||||
# Add data rows
|
||||
for row in rows:
|
||||
rowLine = " | ".join(str(cellData) for cellData in row)
|
||||
markdownParts.append(rowLine)
|
||||
|
||||
return '\n'.join(markdownParts)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering table: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonBulletList(self, listData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON bullet list to markdown."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{items}
|
||||
content = listData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
items = content.get("items", [])
|
||||
|
||||
if not items:
|
||||
return ""
|
||||
|
||||
markdownParts = []
|
||||
for item in items:
|
||||
if isinstance(item, str):
|
||||
markdownParts.append(f"- {item}")
|
||||
elif isinstance(item, dict) and "text" in item:
|
||||
markdownParts.append(f"- {item['text']}")
|
||||
|
||||
return '\n'.join(markdownParts)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering bullet list: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonHeading(self, headingData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON heading to markdown."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{text, level}
|
||||
content = headingData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
text = content.get("text", "")
|
||||
level = content.get("level", 1)
|
||||
|
||||
if text:
|
||||
level = max(1, min(6, level))
|
||||
return f"{'#' * level} {text}"
|
||||
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering heading: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonParagraph(self, paragraphData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON paragraph to markdown."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = paragraphData.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
return text if text else ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering paragraph: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonCodeBlock(self, codeData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON code block to markdown."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = codeData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
code = content.get("code", "")
|
||||
language = content.get("language", "")
|
||||
|
||||
if code:
|
||||
if language:
|
||||
return f"```{language}\n{code}\n```"
|
||||
else:
|
||||
return f"```\n{code}\n```"
|
||||
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering code block: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonImage(self, imageData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON image to markdown."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{base64Data, altText, caption}
|
||||
content = imageData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
altText = content.get("altText", "Image")
|
||||
base64Data = content.get("base64Data", "")
|
||||
|
||||
if base64Data:
|
||||
# For base64 images, we can't embed them directly in markdown
|
||||
# So we'll use a placeholder with the alt text
|
||||
return f"*[Image: {altText}]*"
|
||||
else:
|
||||
return f"*[Image: {altText}]*"
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering image: {str(e)}")
|
||||
return "*[Image]*"
|
||||
|
|
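The `_renderJsonTable` method above joins cells with `" | "` and emits no outer pipes. That works for multi-column tables, but a one-column separator row of `---` reads as a horizontal rule, and a literal `|` inside a cell splits the cell. The following is a hedged sketch of a variant that adds outer pipes and escapes cell content. The helper name `render_markdown_table` is hypothetical and it is not part of the commit.

```python
from typing import Any, List


def render_markdown_table(headers: List[Any], rows: List[List[Any]]) -> str:
    """Hypothetical helper: build a pipe-delimited Markdown table."""
    if not headers or not rows:
        return ""

    def cell(value: Any) -> str:
        # Escape literal pipes and flatten newlines so a cell stays on one row.
        return str(value).replace("|", "\\|").replace("\n", " ")

    def line(cells: List[str]) -> str:
        # Outer pipes keep single-column tables unambiguous.
        return "| " + " | ".join(cells) + " |"

    lines = [line([cell(h) for h in headers]), line(["---"] * len(headers))]
    for row in rows:
        lines.append(line([cell(c) for c in row]))
    return "\n".join(lines)


table = render_markdown_table(["Name", "Notes"], [["a|b", "first"], ["c", "second"]])
print(table)
```

Like the original, the sketch returns an empty string when either the headers or the rows are missing, so callers can skip empty sections the same way.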
@ -0,0 +1,944 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
PDF renderer for report generation using reportlab.
|
||||
"""
|
||||
|
||||
from .documentRendererBaseTemplate import BaseRenderer
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List, Optional
|
||||
import io
|
||||
import base64
|
||||
|
||||
try:
|
||||
from reportlab.lib.pagesizes import letter, A4
|
||||
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle, PageBreak
|
||||
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
|
||||
from reportlab.lib.units import inch
|
||||
from reportlab.lib import colors
|
||||
from reportlab.lib.enums import TA_CENTER, TA_LEFT, TA_JUSTIFY
|
||||
REPORTLAB_AVAILABLE = True
|
||||
except ImportError:
|
||||
REPORTLAB_AVAILABLE = False
|
||||
|
||||
class RendererPdf(BaseRenderer):
|
||||
"""Renders content to PDF format using reportlab."""
|
||||
|
||||
@classmethod
|
||||
def getSupportedFormats(cls) -> List[str]:
|
||||
"""Return supported PDF formats."""
|
||||
return ['pdf']
|
||||
|
||||
@classmethod
|
||||
def getFormatAliases(cls) -> List[str]:
|
||||
"""Return format aliases."""
|
||||
return ['document', 'print']
|
||||
|
||||
@classmethod
|
||||
def getPriority(cls) -> int:
|
||||
"""Return priority for PDF renderer."""
|
||||
return 120
|
||||
|
||||
@classmethod
|
||||
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
|
||||
"""Return output style classification: PDF documents are formatted documents."""
|
||||
return 'document'
|
||||
|
||||
@classmethod
|
||||
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
|
||||
"""
|
||||
Return list of section content types that PDF renderer accepts.
|
||||
PDF renderer accepts all section types (PDF documents can contain all content types).
|
||||
"""
|
||||
from modules.datamodels.datamodelJson import supportedSectionTypes
|
||||
return list(supportedSectionTypes)
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: Optional[str] = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to PDF format using AI-analyzed styling."""
|
||||
try:
|
||||
if not REPORTLAB_AVAILABLE:
|
||||
# Fallback to HTML if reportlab not available
|
||||
from .rendererHtml import RendererHtml
|
||||
html_renderer = RendererHtml()
|
||||
return await html_renderer.render(extractedContent, title, userPrompt, aiService)
|
||||
|
||||
# Generate PDF using AI-analyzed styling
|
||||
pdf_content = await self._generatePdfFromJson(extractedContent, title, userPrompt, aiService)
|
||||
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
# Determine filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
filename = documents[0].get("filename")
|
||||
if not filename:
|
||||
filename = self._determineFilename(title, "application/pdf")
|
||||
else:
|
||||
filename = self._determineFilename(title, "application/pdf")
|
||||
|
||||
# Convert PDF content to bytes if it's a string (base64)
|
||||
if isinstance(pdf_content, str):
|
||||
# Try to decode as base64, otherwise encode as UTF-8
|
||||
try:
|
||||
pdf_bytes = base64.b64decode(pdf_content)
|
||||
except Exception:
|
||||
pdf_bytes = pdf_content.encode('utf-8')
|
||||
else:
|
||||
pdf_bytes = pdf_content
|
||||
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=pdf_bytes,
|
||||
mimeType="application/pdf",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering PDF: {str(e)}")
|
||||
# Return minimal fallback
|
||||
fallbackContent = f"PDF Generation Error: {str(e)}"
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=fallbackContent.encode('utf-8'),
|
||||
mimeType="text/plain",
|
||||
filename=self._determineFilename(title, "text/plain")
|
||||
)
|
||||
]
|
||||
|
||||
async def _generatePdfFromJson(self, json_content: Dict[str, Any], title: str, userPrompt: Optional[str] = None, aiService=None) -> str:
|
||||
"""Generate PDF content from structured JSON document using AI-generated styling."""
|
||||
try:
|
||||
# Get style set: use styles from metadata if available, otherwise enhance with AI
|
||||
styles = await self._getStyleSet(json_content, userPrompt, aiService)
|
||||
|
||||
# Validate JSON structure
|
||||
if not self._validateJsonStructure(json_content):
|
||||
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
# Extract sections and metadata from standardized schema
|
||||
sections = self._extractSections(json_content)
|
||||
metadata = self._extractMetadata(json_content)
|
||||
|
||||
# Use provided title (which comes from documents[].title) as primary source
|
||||
# Fallback to metadata.title only if title parameter is empty
|
||||
document_title = title if title else metadata.get("title", "Generated Document")
|
||||
|
||||
# Make title shorter to prevent wrapping/overlapping
|
||||
if len(document_title) > 40:
|
||||
document_title = document_title[:37] + "..."
|
||||
|
||||
# Create a buffer to hold the PDF
|
||||
buffer = io.BytesIO()
|
||||
|
||||
# Create PDF document
|
||||
doc = SimpleDocTemplate(
|
||||
buffer,
|
||||
pagesize=A4,
|
||||
rightMargin=72,
|
||||
leftMargin=72,
|
||||
topMargin=72,
|
||||
bottomMargin=18
|
||||
)
|
||||
|
||||
# Build PDF content
|
||||
story = []
|
||||
|
||||
# Title page
|
||||
title_style = self._createTitleStyle(styles)
|
||||
story.append(Paragraph(document_title, title_style))
|
||||
story.append(Spacer(1, 50)) # Increased spacing to prevent overlap
|
||||
story.append(Paragraph(f"Generated: {self._formatTimestamp()}", self._createNormalStyle(styles)))
|
||||
story.append(Spacer(1, 30)) # Add spacing before page break
|
||||
story.append(PageBreak())
|
||||
|
||||
# Process each section (sections already extracted above)
|
||||
self.services.utils.debugLogToFile(f"PDF SECTIONS TO PROCESS: {len(sections)} sections", "PDF_RENDERER")
|
||||
for i, section in enumerate(sections):
|
||||
self.services.utils.debugLogToFile(f"PDF SECTION {i}: content_type={section.get('content_type', 'unknown')}, id={section.get('id', 'unknown')}", "PDF_RENDERER")
|
||||
section_elements = self._renderJsonSection(section, styles)
|
||||
self.services.utils.debugLogToFile(f"PDF SECTION {i} ELEMENTS: {len(section_elements)} elements", "PDF_RENDERER")
|
||||
story.extend(section_elements)
|
||||
|
||||
# Build PDF
|
||||
doc.build(story)
|
||||
|
||||
# Get PDF content as base64
|
||||
buffer.seek(0)
|
||||
pdf_bytes = buffer.getvalue()
|
||||
pdf_base64 = base64.b64encode(pdf_bytes).decode('utf-8')
|
||||
|
||||
return pdf_base64
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error generating PDF from JSON: {str(e)}")
|
||||
raise Exception(f"PDF generation failed: {str(e)}") from e
|
||||
|
||||
async def _getStyleSet(self, extractedContent: Optional[Dict[str, Any]] = None, userPrompt: Optional[str] = None, aiService=None, templateName: Optional[str] = None) -> Dict[str, Any]:
|
||||
"""Get style set - use styles from document generation metadata if available,
|
||||
otherwise enhance default styles with AI if userPrompt provided.
|
||||
|
||||
IMPORTANT: In a dynamic, scalable AI system, styling should come from document generation,
|
||||
not be generated separately by renderers. Only fall back to AI if styles not provided.
|
||||
|
||||
Args:
|
||||
extractedContent: Document content with metadata (may contain styles)
|
||||
userPrompt: User's prompt (AI will detect style instructions in any language)
|
||||
aiService: AI service (used only if styles not in metadata and userPrompt provided)
|
||||
templateName: Name of template style set (None = default)
|
||||
|
||||
Returns:
|
||||
Dict with style definitions for all document styles
|
||||
"""
|
||||
# Get default style set
|
||||
defaultStyleSet = self._getDefaultStyleSet()
|
||||
|
||||
# FIRST: Check if styles are provided in document generation metadata (preferred approach)
|
||||
if extractedContent:
|
||||
metadata = extractedContent.get("metadata", {})
|
||||
if isinstance(metadata, dict):
|
||||
styles = metadata.get("styles")
|
||||
if styles and isinstance(styles, dict):
|
||||
self.logger.debug("Using styles from document generation metadata")
|
||||
enhancedStyleSet = self._convertColorsFormat(styles)
|
||||
return self._validateStylesContrast(enhancedStyleSet)
|
||||
|
||||
# FALLBACK: Enhance with AI if userPrompt provided (only if styles not in metadata)
|
||||
if userPrompt and aiService:
|
||||
self.logger.info("Styles not in metadata, enhancing with AI based on user prompt...")
|
||||
enhancedStyleSet = await self._enhanceStylesWithAI(userPrompt, defaultStyleSet, aiService)
|
||||
# Convert colors to PDF format after getting styles
|
||||
enhancedStyleSet = self._convertColorsFormat(enhancedStyleSet)
|
||||
return self._validateStylesContrast(enhancedStyleSet)
|
||||
else:
|
||||
# Use default styles only
|
||||
return defaultStyleSet
|
||||
|
||||
async def _enhanceStylesWithAI(self, userPrompt: str, defaultStyleSet: Dict[str, Any], aiService) -> Dict[str, Any]:
|
||||
"""Enhance default styles with AI based on user prompt."""
|
||||
try:
|
||||
style_template = self._createAiStyleTemplate("pdf", userPrompt, defaultStyleSet)
|
||||
enhanced_styles = await self._getAiStyles(aiService, style_template, defaultStyleSet)
|
||||
return enhanced_styles
|
||||
except Exception as e:
|
||||
self.logger.warning(f"AI style enhancement failed: {str(e)}, using default styles")
|
||||
return defaultStyleSet
|
||||
|
||||
def _validateStylesContrast(self, styles: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Validate and fix contrast issues in AI-generated styles."""
|
||||
try:
|
||||
# Fix table header contrast
|
||||
if "table_header" in styles:
|
||||
header = styles["table_header"]
|
||||
bg_color = header.get("background", "#FFFFFF")
|
||||
text_color = header.get("text_color", "#000000")
|
||||
|
||||
# If background and text are both white or both black, fix it
|
||||
if bg_color.upper() == "#FFFFFF" and text_color.upper() == "#FFFFFF":
|
||||
header["background"] = "#4F4F4F"
|
||||
header["text_color"] = "#FFFFFF"
|
||||
elif bg_color.upper() == "#000000" and text_color.upper() == "#000000":
|
||||
header["background"] = "#4F4F4F"
|
||||
header["text_color"] = "#FFFFFF"
|
||||
|
||||
# Fix table cell contrast
|
||||
if "table_cell" in styles:
|
||||
cell = styles["table_cell"]
|
||||
bg_color = cell.get("background", "#FFFFFF")
|
||||
text_color = cell.get("text_color", "#000000")
|
||||
|
||||
# If background and text are both white or both black, fix it
|
||||
if bg_color.upper() == "#FFFFFF" and text_color.upper() == "#FFFFFF":
|
||||
cell["background"] = "#FFFFFF"
|
||||
cell["text_color"] = "#2F2F2F"
|
||||
elif bg_color.upper() == "#000000" and text_color.upper() == "#000000":
|
||||
cell["background"] = "#FFFFFF"
|
||||
cell["text_color"] = "#2F2F2F"
|
||||
|
||||
return styles
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Style validation failed: {str(e)}")
|
||||
return self._getDefaultStyleSet()
|
||||
|
||||
def _getDefaultStyleSet(self) -> Dict[str, Any]:
|
||||
"""Default PDF style set - used when no style instructions present."""
|
||||
return {
|
||||
"title": {"font_size": 24, "color": "#1F4E79", "bold": True, "align": "center", "space_after": 30},
|
||||
"heading1": {"font_size": 18, "color": "#2F2F2F", "bold": True, "align": "left", "space_after": 12, "space_before": 12},
|
||||
"heading2": {"font_size": 14, "color": "#4F4F4F", "bold": True, "align": "left", "space_after": 8, "space_before": 8},
|
||||
"paragraph": {"font_size": 11, "color": "#2F2F2F", "bold": False, "align": "left", "space_after": 6, "line_height": 1.2},
|
||||
"table_header": {"background": "#4F4F4F", "text_color": "#FFFFFF", "bold": True, "align": "center", "font_size": 12},
|
||||
"table_cell": {"background": "#FFFFFF", "text_color": "#2F2F2F", "bold": False, "align": "left", "font_size": 10},
|
||||
"bullet_list": {"font_size": 11, "color": "#2F2F2F", "space_after": 3},
|
||||
"code_block": {"font": "Courier", "font_size": 9, "color": "#2F2F2F", "background": "#F5F5F5", "space_after": 6}
|
||||
}
|
||||
|
||||
async def _getAiStylesWithPdfColors(self, ai_service, style_template: str, default_styles: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Get AI styles with proper PDF color conversion."""
|
||||
if not ai_service:
|
||||
return default_styles
|
||||
|
||||
try:
|
||||
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
|
||||
|
||||
request_options = AiCallOptions()
|
||||
request_options.operationType = OperationTypeEnum.DATA_GENERATE
|
||||
|
||||
request = AiCallRequest(prompt=style_template, context="", options=request_options)
|
||||
|
||||
# Check if AI service is properly configured
|
||||
if not hasattr(ai_service, 'aiObjects') or not ai_service.aiObjects:
|
||||
self.logger.warning("AI service not properly configured, using defaults")
|
||||
return default_styles
|
||||
|
||||
response = await ai_service.callAi(request)
|
||||
|
||||
# Check if response is valid
|
||||
if not response:
|
||||
self.logger.warning("AI service returned no response, using defaults")
|
||||
return default_styles
|
||||
|
||||
import json
|
||||
import re
|
||||
|
||||
# Clean and parse JSON
|
||||
result = response.content.strip() if response and response.content else ""
|
||||
|
||||
# Check if result is empty
|
||||
if not result:
|
||||
self.logger.warning("AI styling returned empty response, using defaults")
|
||||
return default_styles
|
||||
|
||||
# Log the raw response for debugging
|
||||
self.logger.debug(f"AI styling raw response: {result[:200]}...")
|
||||
|
||||
# Extract JSON from various formats
|
||||
json_match = re.search(r'```json\s*\n(.*?)\n```', result, re.DOTALL)
|
||||
if json_match:
|
||||
result = json_match.group(1).strip()
|
||||
elif result.startswith('```json'):
|
||||
result = re.sub(r'^```json\s*', '', result)
|
||||
result = re.sub(r'\s*```$', '', result)
|
||||
elif result.startswith('```'):
|
||||
result = re.sub(r'^```\s*', '', result)
|
||||
result = re.sub(r'\s*```$', '', result)
|
||||
|
||||
# Try to extract JSON from explanatory text
|
||||
json_patterns = [
|
||||
r'\{[^{}]*"title"[^{}]*\}', # Simple JSON object
|
||||
r'\{.*?"title".*?\}', # JSON with title field
|
||||
r'\{.*?"font_size".*?\}', # JSON with font_size field
|
||||
]
|
||||
|
||||
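for pattern in ([] if result.lstrip().startswith('{') else json_patterns):  # Skip when the result already starts as a JSON object: the non-greedy patterns would cut a nested object at its first "}"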
|
||||
json_match = re.search(pattern, result, re.DOTALL)
|
||||
if json_match:
|
||||
result = json_match.group(0)
|
||||
break
|
||||
|
||||
# Additional cleanup - remove any leading/trailing whitespace and newlines
|
||||
result = result.strip()
|
||||
|
||||
# Check if result is still empty after cleanup
|
||||
if not result:
|
||||
self.logger.warning("AI styling returned empty content after cleanup, using defaults")
|
||||
return default_styles
|
||||
|
||||
# Try to parse JSON
|
||||
try:
|
||||
styles = json.loads(result)
|
||||
self.logger.debug(f"Successfully parsed AI styles: {list(styles.keys())}")
|
||||
except json.JSONDecodeError as json_error:
|
||||
self.logger.warning(f"AI styling returned invalid JSON: {json_error}")
|
||||
|
||||
# Write the full response to the debug log file, since the logger output may be truncated
|
||||
self.services.utils.debugLogToFile(f"FULL AI RESPONSE THAT FAILED TO PARSE: {result}", "PDF_RENDERER")
|
||||
self.services.utils.debugLogToFile(f"RESPONSE LENGTH: {len(result)} characters", "PDF_RENDERER")
|
||||
|
||||
self.logger.warning(f"Raw content that failed to parse: {result}")
|
||||
|
||||
# Try to fix incomplete JSON by adding missing closing braces
|
||||
open_braces = result.count('{')
|
||||
close_braces = result.count('}')
|
||||
|
||||
if open_braces > close_braces:
|
||||
# JSON is incomplete, add missing closing braces
|
||||
missing_braces = open_braces - close_braces
|
||||
result = result + '}' * missing_braces
|
||||
self.logger.info(f"Added {missing_braces} missing closing brace(s)")
|
||||
|
||||
# Try parsing the fixed JSON
|
||||
try:
|
||||
styles = json.loads(result)
|
||||
self.logger.info("Successfully fixed incomplete JSON")
|
||||
except json.JSONDecodeError as fix_error:
|
||||
self.logger.warning(f"Fixed JSON still invalid: {fix_error}")
|
||||
# Try to extract just the JSON part if it's embedded in text
|
||||
json_start = result.find('{')
|
||||
json_end = result.rfind('}')
|
||||
if json_start != -1 and json_end != -1 and json_end > json_start:
|
||||
json_part = result[json_start:json_end+1]
|
||||
try:
|
||||
styles = json.loads(json_part)
|
||||
self.logger.info("Successfully extracted JSON from explanatory text")
|
||||
except json.JSONDecodeError:
|
||||
self.logger.warning("Could not extract valid JSON from response, using defaults")
|
||||
return default_styles
|
||||
else:
|
||||
return default_styles
|
||||
else:
|
||||
# Try to extract just the JSON part if it's embedded in text
|
||||
json_start = result.find('{')
|
||||
json_end = result.rfind('}')
|
||||
if json_start != -1 and json_end != -1 and json_end > json_start:
|
||||
json_part = result[json_start:json_end+1]
|
||||
try:
|
||||
styles = json.loads(json_part)
|
||||
self.logger.info("Successfully extracted JSON from explanatory text")
|
||||
except json.JSONDecodeError:
|
||||
self.logger.warning("Could not extract valid JSON from response, using defaults")
|
||||
return default_styles
|
||||
else:
|
||||
return default_styles
|
||||
|
||||
# Normalize colors to AARRGGBB hex strings; _hexToColor turns them into ReportLab colors at render time
|
||||
styles = self._convertColorsFormat(styles)
|
||||
|
||||
return styles
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"AI styling failed: {str(e)}, using defaults")
|
||||
return default_styles
|
||||
|
||||
def _convertColorsFormat(self, styles: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Convert colors to proper format for PDF compatibility."""
|
||||
try:
|
||||
for style_name, style_config in styles.items():
|
||||
if isinstance(style_config, dict):
|
||||
for prop, value in style_config.items():
|
||||
if isinstance(value, str) and value.startswith('#') and len(value) == 7:
|
||||
# Convert #RRGGBB to AARRGGBB (prefix an FF alpha channel and drop the leading '#') for consistency
|
||||
styles[style_name][prop] = f"FF{value[1:]}"
|
||||
elif isinstance(value, str) and value.startswith('#') and len(value) == 9:
|
||||
# Already aRGB format, keep as is
|
||||
pass
|
||||
return styles
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Color conversion failed: {str(e)}")
|
||||
return styles
|
||||
|
||||
def _getSafeColor(self, color_value: str, default: str = "#000000") -> str:
|
||||
"""Get a safe hex color value for PDF."""
|
||||
if isinstance(color_value, str) and color_value.startswith('#'):
|
||||
if len(color_value) == 7:
|
||||
return f"FF{color_value[1:]}"
|
||||
elif len(color_value) == 9:
|
||||
return color_value
|
||||
return default
|
||||
|
||||
|
||||
def _createTitleStyle(self, styles: Dict[str, Any]) -> ParagraphStyle:
|
||||
"""Create title style from style definitions."""
|
||||
title_style_def = styles.get("title", {})
|
||||
|
||||
# DEBUG: Show what color and spacing is being used for title
|
||||
title_color = title_style_def.get("color", "#1F4E79")
|
||||
title_space_after = title_style_def.get("space_after", 30)
|
||||
self.services.utils.debugLogToFile(f"PDF TITLE COLOR: {title_color} -> {self._hexToColor(title_color)}", "PDF_RENDERER")
|
||||
self.services.utils.debugLogToFile(f"PDF TITLE SPACE_AFTER: {title_space_after}", "PDF_RENDERER")
|
||||
|
||||
return ParagraphStyle(
|
||||
'CustomTitle',
|
||||
fontSize=title_style_def.get("font_size", 20), # Reduced from 24 to 20
|
||||
spaceAfter=title_style_def.get("space_after", 30),
|
||||
alignment=self._getAlignment(title_style_def.get("align", "center")),
|
||||
textColor=self._hexToColor(title_color),
|
||||
leading=title_style_def.get("font_size", 20) * 1.4, # Add line spacing for multi-line titles
|
||||
spaceBefore=0 # Ensure no space before title
|
||||
)
|
||||
|
||||
def _createHeadingStyle(self, styles: Dict[str, Any], level: int) -> ParagraphStyle:
|
||||
"""Create heading style from style definitions."""
|
||||
heading_key = f"heading{level}"
|
||||
heading_style_def = styles.get(heading_key, styles.get("heading1", {}))
|
||||
|
||||
return ParagraphStyle(
|
||||
f'CustomHeading{level}',
|
||||
fontSize=heading_style_def.get("font_size", 18 - level * 2),
|
||||
spaceAfter=heading_style_def.get("space_after", 12),
|
||||
spaceBefore=heading_style_def.get("space_before", 12),
|
||||
alignment=self._getAlignment(heading_style_def.get("align", "left")),
|
||||
textColor=self._hexToColor(heading_style_def.get("color", "#2F2F2F"))
|
||||
)
|
||||
|
||||
def _createNormalStyle(self, styles: Dict[str, Any]) -> ParagraphStyle:
|
||||
"""Create normal paragraph style from style definitions."""
|
||||
paragraph_style_def = styles.get("paragraph", {})
|
||||
|
||||
return ParagraphStyle(
|
||||
'CustomNormal',
|
||||
fontSize=paragraph_style_def.get("font_size", 11),
|
||||
spaceAfter=paragraph_style_def.get("space_after", 6),
|
||||
alignment=self._getAlignment(paragraph_style_def.get("align", "left")),
|
||||
textColor=self._hexToColor(paragraph_style_def.get("color", "#2F2F2F")),
|
||||
leading=paragraph_style_def.get("line_height", 1.2) * paragraph_style_def.get("font_size", 11)
|
||||
)
|
||||
|
||||
def _getAlignment(self, align: str) -> int:
|
||||
"""Convert alignment string to reportlab alignment constant."""
|
||||
if not align or not isinstance(align, str):
|
||||
return TA_LEFT
|
||||
|
||||
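from reportlab.lib.enums import TA_RIGHT  # Local import: TA_RIGHT may not be among this module's top-level imports
|
||||
align_map = {
|
||||
"center": TA_CENTER,
|
||||
"left": TA_LEFT,
|
||||
"justify": TA_JUSTIFY,
|
||||
"right": TA_RIGHT, # reportlab.lib.enums does define TA_RIGHT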
|
||||
"0": TA_LEFT, # Handle numeric strings
|
||||
"1": TA_CENTER,
|
||||
"2": TA_JUSTIFY
|
||||
}
|
||||
return align_map.get(align.lower().strip(), TA_LEFT)
|
||||
|
||||
def _getTableAlignment(self, align: str) -> str:
|
||||
"""Convert alignment string to ReportLab table alignment string."""
|
||||
if not align or not isinstance(align, str):
|
||||
return 'LEFT'
|
||||
|
||||
align_map = {
|
||||
"center": 'CENTER',
|
||||
"left": 'LEFT',
|
||||
"justify": 'LEFT', # Tables don't support justify, use LEFT
|
||||
"right": 'RIGHT',
|
||||
"0": 'LEFT', # Handle numeric strings
|
||||
"1": 'CENTER',
|
||||
"2": 'LEFT' # Tables don't support justify, use LEFT
|
||||
}
|
||||
return align_map.get(align.lower().strip(), 'LEFT')
|
||||
|
||||
def _hexToColor(self, hex_color: str) -> colors.Color:
|
||||
"""Convert hex color to reportlab color."""
|
||||
try:
|
||||
hex_color = hex_color.lstrip('#')
|
||||
|
||||
# Handle aRGB format (8 characters: FF + RGB)
|
||||
if len(hex_color) == 8:
|
||||
# Skip the alpha channel (first 2 characters)
|
||||
hex_color = hex_color[2:]
|
||||
|
||||
# Handle RGB format (6 characters)
|
||||
if len(hex_color) == 6:
|
||||
r = int(hex_color[0:2], 16) / 255.0
|
||||
g = int(hex_color[2:4], 16) / 255.0
|
||||
b = int(hex_color[4:6], 16) / 255.0
|
||||
return colors.Color(r, g, b)
|
||||
|
||||
# Fallback for other formats
|
||||
return colors.black
|
||||
except Exception:
|
||||
return colors.black
|
||||
|
||||
def _renderJsonSection(self, section: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a single JSON section to PDF elements using AI-generated styles.
|
||||
Supports three content formats: reference, object (base64), extracted_text.
|
||||
"""
|
||||
try:
|
||||
section_type = self._getSectionType(section)
|
||||
elements = self._getSectionData(section)
|
||||
|
||||
# Process each element in the section
|
||||
all_elements = []
|
||||
for element in elements:
|
||||
element_type = element.get("type", "") if isinstance(element, dict) else ""
|
||||
|
||||
# Support three content formats from Phase 5D
|
||||
if element_type == "reference":
|
||||
# Document reference format
|
||||
doc_ref = element.get("documentReference", "")
|
||||
label = element.get("label", "Reference")
|
||||
ref_style = ParagraphStyle(
|
||||
'Reference',
|
||||
parent=self._createNormalStyle(styles),
|
||||
fontName='Helvetica-Oblique',  # ParagraphStyle has no fontStyle attribute; the oblique face gives italics
|
||||
textColor=colors.grey
|
||||
)
|
||||
all_elements.append(Paragraph(f"[Reference: {label}]", ref_style))
|
||||
all_elements.append(Spacer(1, 6))
|
||||
continue
|
||||
elif element_type == "extracted_text":
|
||||
# Extracted text format
|
||||
content = element.get("content", "")
|
||||
source = element.get("source", "")
|
||||
if content:
|
||||
source_text = f" <i>(Source: {source})</i>" if source else ""
|
||||
all_elements.append(Paragraph(f"{content}{source_text}", self._createNormalStyle(styles)))
|
||||
all_elements.append(Spacer(1, 6))
|
||||
continue
|
||||
|
||||
# Check element type, not section type (elements can have different types than section)
|
||||
if element_type == "table":
|
||||
all_elements.extend(self._renderJsonTable(element, styles))
|
||||
elif element_type == "bullet_list":
|
||||
all_elements.extend(self._renderJsonBulletList(element, styles))
|
||||
elif element_type == "heading":
|
||||
all_elements.extend(self._renderJsonHeading(element, styles))
|
||||
elif element_type == "paragraph":
|
||||
all_elements.extend(self._renderJsonParagraph(element, styles))
|
||||
elif element_type == "code_block":
|
||||
all_elements.extend(self._renderJsonCodeBlock(element, styles))
|
||||
elif element_type == "image":
|
||||
all_elements.extend(self._renderJsonImage(element, styles))
|
||||
else:
|
||||
# Fallback: if element_type not set, use section_type as fallback
|
||||
if section_type == "table":
|
||||
all_elements.extend(self._renderJsonTable(element, styles))
|
||||
elif section_type == "bullet_list":
|
||||
all_elements.extend(self._renderJsonBulletList(element, styles))
|
||||
elif section_type == "heading":
|
||||
all_elements.extend(self._renderJsonHeading(element, styles))
|
||||
elif section_type == "paragraph":
|
||||
all_elements.extend(self._renderJsonParagraph(element, styles))
|
||||
elif section_type == "code_block":
|
||||
all_elements.extend(self._renderJsonCodeBlock(element, styles))
|
||||
elif section_type == "image":
|
||||
all_elements.extend(self._renderJsonImage(element, styles))
|
||||
else:
|
||||
# Final fallback to paragraph for unknown types
|
||||
all_elements.extend(self._renderJsonParagraph(element, styles))
|
||||
|
||||
return all_elements
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering section {self._getSectionId(section)}: {str(e)}")
|
||||
return [Paragraph(f"[Error rendering section: {str(e)}]", self._createNormalStyle(styles))]
|
||||
|
||||
def _renderJsonTable(self, table_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a JSON table to PDF elements using AI-generated styles."""
|
||||
try:
|
||||
# Handle nested content structure: element.content.headers vs element.headers
|
||||
# Extract from nested content structure
|
||||
content = table_data.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
headers = content.get("headers", [])
|
||||
rows = content.get("rows", [])
|
||||
|
||||
if not headers or not rows:
|
||||
return []
|
||||
|
||||
# Prepare table data
|
||||
table_data_list = [headers] + rows
|
||||
|
||||
# Create table
|
||||
table = Table(table_data_list)
|
||||
|
||||
# Apply styling
|
||||
table_header_style = styles.get("table_header", {})
|
||||
table_cell_style = styles.get("table_cell", {})
|
||||
|
||||
table_style = [
|
||||
('BACKGROUND', (0, 0), (-1, 0), self._hexToColor(table_header_style.get("background", "#4F4F4F"))),
|
||||
('TEXTCOLOR', (0, 0), (-1, 0), self._hexToColor(table_header_style.get("text_color", "#FFFFFF"))),
|
||||
('ALIGN', (0, 0), (-1, -1), self._getTableAlignment(table_cell_style.get("align", "left"))),
|
||||
('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold' if table_header_style.get("bold", True) else 'Helvetica'),
|
||||
('FONTSIZE', (0, 0), (-1, 0), table_header_style.get("font_size", 12)),
|
||||
('BOTTOMPADDING', (0, 0), (-1, 0), 12),
|
||||
('BACKGROUND', (0, 1), (-1, -1), self._hexToColor(table_cell_style.get("background", "#FFFFFF"))),
|
||||
('FONTSIZE', (0, 1), (-1, -1), table_cell_style.get("font_size", 10)),
|
||||
('GRID', (0, 0), (-1, -1), 1, colors.black)
|
||||
]
|
||||
|
||||
table.setStyle(TableStyle(table_style))
|
||||
|
||||
return [table, Spacer(1, 12)]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering table: {str(e)}")
|
||||
return []
|
||||
|
||||
def _renderJsonBulletList(self, list_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a JSON bullet list to PDF elements using AI-generated styles."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = list_data.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
items = content.get("items", [])
|
||||
bullet_style_def = styles.get("bullet_list", {})
|
||||
|
||||
elements = []
|
||||
for item in items:
|
||||
if isinstance(item, str):
|
||||
elements.append(Paragraph(f"• {item}", self._createNormalStyle(styles)))
|
||||
elif isinstance(item, dict) and "text" in item:
|
||||
elements.append(Paragraph(f"• {item['text']}", self._createNormalStyle(styles)))
|
||||
|
||||
if elements:
|
||||
elements.append(Spacer(1, bullet_style_def.get("space_after", 3)))
|
||||
|
||||
return elements
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering bullet list: {str(e)}")
|
||||
return []
|
||||
|
||||
def _renderJsonHeading(self, heading_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a JSON heading to PDF elements using AI-generated styles."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = heading_data.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
text = content.get("text", "")
|
||||
level = content.get("level", 1)
|
||||
|
||||
if text:
|
||||
level = max(1, min(6, level))
|
||||
heading_style = self._createHeadingStyle(styles, level)
|
||||
return [Paragraph(text, heading_style)]
|
||||
|
||||
return []
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering heading: {str(e)}")
|
||||
return []
|
||||
|
||||
def _renderJsonParagraph(self, paragraph_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a JSON paragraph to PDF elements using AI-generated styles."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = paragraph_data.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
|
||||
if text:
|
||||
return [Paragraph(text, self._createNormalStyle(styles))]
|
||||
|
||||
return []
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering paragraph: {str(e)}")
|
||||
return []
|
||||
|
||||
def _renderJsonCodeBlock(self, code_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a JSON code block to PDF elements using AI-generated styles."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = code_data.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
code = content.get("code", "")
|
||||
language = content.get("language", "")
|
||||
code_style_def = styles.get("code_block", {})
|
||||
|
||||
if code:
|
||||
elements = []
|
||||
|
||||
if language:
|
||||
lang_style = ParagraphStyle(
|
||||
'CodeLanguage',
|
||||
fontSize=code_style_def.get("font_size", 9),
|
||||
textColor=self._hexToColor(code_style_def.get("color", "#2F2F2F")),
|
||||
fontName='Helvetica-Bold'
|
||||
)
|
||||
elements.append(Paragraph(f"Code ({language}):", lang_style))
|
||||
|
||||
code_style = ParagraphStyle(
|
||||
'CodeBlock',
|
||||
fontSize=code_style_def.get("font_size", 9),
|
||||
textColor=self._hexToColor(code_style_def.get("color", "#2F2F2F")),
|
||||
fontName=code_style_def.get("font", "Courier"),
|
||||
backColor=self._hexToColor(code_style_def.get("background", "#F5F5F5")),
|
||||
spaceAfter=code_style_def.get("space_after", 6)
|
||||
)
|
||||
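from xml.sax.saxutils import escape  # Local import; Paragraph parses its text as XML-like markup
|
||||
elements.append(Paragraph(escape(code).replace("\n", "<br/>"), code_style))  # Escape &, <, > and keep line breaks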
|
||||
|
||||
return elements
|
||||
|
||||
return []
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering code block: {str(e)}")
|
||||
return []
|
||||
|
||||
def _renderJsonImage(self, image_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a JSON image to PDF elements using reportlab."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = image_data.get("content", {})
|
||||
base64_data = ""
|
||||
alt_text = "Image"
|
||||
caption = ""
|
||||
|
||||
if isinstance(content, dict):
|
||||
# Nested content structure
|
||||
base64_data = content.get("base64Data", "")
|
||||
alt_text = content.get("altText", "Image")
|
||||
caption = content.get("caption", "")
|
||||
elif isinstance(content, str):
|
||||
# Content might be base64 string directly (shouldn't happen, but handle it)
|
||||
self.logger.warning("Image content is a string, not a dict. This should not happen.")
|
||||
return [Paragraph("[Image: Invalid format]", self._createNormalStyle(styles))]
|
||||
|
||||
# If base64Data not found in content, try direct element fields (fallback)
|
||||
if not base64_data:
|
||||
base64_data = image_data.get("base64Data", "")
|
||||
if not alt_text or alt_text == "Image":
|
||||
alt_text = image_data.get("altText", "Image")
|
||||
if not caption:
|
||||
caption = image_data.get("caption", "")
|
||||
|
||||
# If base64Data still not found, try extracting from url data URI
|
||||
if not base64_data:
|
||||
url = image_data.get("url", "") or (content.get("url", "") if isinstance(content, dict) else "")
|
||||
if url and isinstance(url, str) and url.startswith("data:image/"):
|
||||
# Extract base64 from data URI: data:image/png;base64,<base64>
|
||||
import re
|
||||
match = re.match(r'data:image/[^;]+;base64,(.+)', url)
|
||||
if match:
|
||||
base64_data = match.group(1)
|
||||
|
||||
if not base64_data:
|
||||
self.logger.warning(f"No base64 data found for image. Alt text: {alt_text}")
|
||||
return [Paragraph(f"[Image: {alt_text}]", self._createNormalStyle(styles))]
|
||||
|
||||
# Validate that base64_data is actually base64 (not the entire element rendered as text)
|
||||
if len(base64_data) > 10000: # Very long string might be entire element JSON
|
||||
self.logger.warning(f"Base64 data seems too long ({len(base64_data)} chars), might be incorrectly extracted")
|
||||
|
||||
# Ensure base64_data is a string, not bytes or other type
|
||||
if not isinstance(base64_data, str):
|
||||
self.logger.warning(f"Base64 data is not a string: {type(base64_data)}")
|
||||
return [Paragraph(f"[Image: {alt_text} - Invalid data type]", self._createNormalStyle(styles))]
|
||||
|
||||
try:
|
||||
from reportlab.platypus import Image as ReportLabImage
|
||||
from reportlab.lib.units import inch
|
||||
import base64
|
||||
import io
|
||||
|
||||
# Decode base64 image data
|
||||
imageBytes = base64.b64decode(base64_data)
|
||||
imageStream = io.BytesIO(imageBytes)
|
||||
|
||||
# Create reportlab Image element
|
||||
# Try to get image dimensions from PIL
|
||||
try:
|
||||
from PIL import Image as PILImage
|
||||
from reportlab.lib.pagesizes import A4
|
||||
|
||||
pilImage = PILImage.open(imageStream)
|
||||
originalWidth, originalHeight = pilImage.size
|
||||
|
||||
# Calculate available page dimensions (A4 with margins: 72pt left/right, 72pt top, 18pt bottom)
|
||||
pageWidth = A4[0] # 595.27 points
|
||||
pageHeight = A4[1] # 841.89 points
|
||||
leftMargin = 72
|
||||
rightMargin = 72
|
||||
topMargin = 72
|
||||
bottomMargin = 18
|
||||
|
||||
# Use actual frame dimensions from SimpleDocTemplate
|
||||
# Frame is smaller than page minus margins due to internal spacing
|
||||
# From error message: frame is 439.27559055118115 x 739.8897637795277
|
||||
# Use conservative values with safety margin
|
||||
availableWidth = 430.0 # Slightly smaller than frame width for safety
|
||||
availableHeight = 730.0 # Slightly smaller than frame height for safety
|
||||
|
||||
# Convert original image size from pixels to points
|
||||
# PIL provides size in pixels, need to convert to points
|
||||
# Standard conversion: 1 inch = 72 points, typical screen DPI = 96 pixels/inch
|
||||
# So: pixels * (72/96) = points, or pixels * 0.75 = points
|
||||
# But for images, we should use the image's actual DPI if available
|
||||
dpi = pilImage.info.get('dpi', (96, 96))[0] # Default to 96 DPI if not specified
|
||||
if dpi <= 0:
|
||||
dpi = 96 # Fallback to 96 DPI
|
||||
|
||||
# Convert pixels to points: 1 point = 1/72 inch, so pixels * (72/dpi) = points
|
||||
imgWidthPoints = originalWidth * (72.0 / dpi)
|
||||
imgHeightPoints = originalHeight * (72.0 / dpi)
|
||||
|
||||
# Scale to fit within available page dimensions while maintaining aspect ratio
|
||||
widthScale = availableWidth / imgWidthPoints if imgWidthPoints > 0 else 1.0
|
||||
heightScale = availableHeight / imgHeightPoints if imgHeightPoints > 0 else 1.0
|
||||
|
||||
# Use the smaller scale to ensure image fits both width and height
|
||||
scale = min(widthScale, heightScale, 1.0) # Don't scale up, only down
|
||||
|
||||
imgWidth = imgWidthPoints * scale
|
||||
imgHeight = imgHeightPoints * scale
|
||||
|
||||
# Additional safety check: ensure dimensions don't exceed available space
|
||||
if imgWidth > availableWidth:
|
||||
scale = availableWidth / imgWidth
|
||||
imgWidth = availableWidth
|
||||
imgHeight = imgHeight * scale
|
||||
|
||||
if imgHeight > availableHeight:
|
||||
scale = availableHeight / imgHeight
|
||||
imgHeight = availableHeight
|
||||
imgWidth = imgWidth * scale
|
||||
|
||||
# Reset stream for reportlab
|
||||
imageStream.seek(0)
|
||||
except Exception as e:
|
||||
# Fallback: use default size that fits page
|
||||
self.logger.warning(f"Error calculating image size: {str(e)}, using safe default")
|
||||
# Use a fixed 4 x 3 inch default, which fits inside the available frame
|
||||
imgWidth = 4 * inch # 288 points, within the 430pt available width
|
||||
imgHeight = 3 * inch # 216 points, within the 730pt available height
|
||||
imageStream.seek(0)
|
||||
|
||||
# Create reportlab Image
|
||||
reportlabImage = ReportLabImage(imageStream, width=imgWidth, height=imgHeight)
|
||||
|
||||
elements = [reportlabImage]
|
||||
|
||||
# Add caption if available
|
||||
if caption:
|
||||
captionStyle = self._createNormalStyle(styles)
|
||||
captionStyle.fontSize = 10
|
||||
captionStyle.textColor = self._hexToColor(styles.get("paragraph", {}).get("color", "#666666"))
|
||||
elements.append(Paragraph(f"<i>{caption}</i>", captionStyle))
|
||||
elif alt_text and alt_text != "Image":
|
||||
# Use alt text as caption if no caption provided, but avoid usageHint format
|
||||
if "Render as visual element:" in alt_text:
|
||||
# Extract filename from usageHint if possible
|
||||
parts = alt_text.split("Render as visual element:")
|
||||
if len(parts) > 1:
|
||||
filename = parts[1].strip()
|
||||
caption_text = f"Figure: {filename}"
|
||||
else:
|
||||
caption_text = alt_text
|
||||
else:
|
||||
caption_text = f"Figure: {alt_text}"
|
||||
captionStyle = self._createNormalStyle(styles)
|
||||
captionStyle.fontSize = 10
|
||||
captionStyle.textColor = self._hexToColor(styles.get("paragraph", {}).get("color", "#666666"))
|
||||
elements.append(Paragraph(f"<i>{caption_text}</i>", captionStyle))
|
||||
|
||||
return elements
|
||||
|
||||
except Exception as imgError:
|
||||
self.logger.error(f"Error embedding image in PDF: {str(imgError)}")
|
||||
# Return error message instead of placeholder
|
||||
errorStyle = self._createNormalStyle(styles)
|
||||
errorStyle.textColor = self._hexToColor("#FF0000") # Red color for error
|
||||
errorMsg = f"[Error: Could not embed image '{alt_text}'. {str(imgError)}]"
|
||||
return [Paragraph(errorMsg, errorStyle)]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering image: {str(e)}")
|
||||
errorStyle = self._createNormalStyle(styles)
|
||||
errorStyle.textColor = self._hexToColor("#FF0000") # Red color for error
|
||||
errorMsg = f"[Error: Could not render image '{image_data.get('altText', 'Image')}'. {str(e)}]"
|
||||
return [Paragraph(errorMsg, errorStyle)]
|
||||
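The scale-to-fit arithmetic in `_renderJsonImage` above can be sketched on its own, outside the renderer. This is a minimal illustration, not the renderer's actual code: the function name `fit_within` and its parameter names are hypothetical, and the 430 x 730 point box simply mirrors the conservative frame size used above. It converts pixels to points through the image DPI, then scales down (never up) while keeping the aspect ratio.

```python
from typing import Tuple


def fit_within(
    width_px: int,
    height_px: int,
    dpi: float = 96.0,
    max_width_pt: float = 430.0,
    max_height_pt: float = 730.0,
) -> Tuple[float, float]:
    """Hypothetical helper: return (width, height) in points that fit the box."""
    # Fall back to 96 DPI when the image reports an unusable value.
    if dpi <= 0:
        dpi = 96.0

    # 1 point = 1/72 inch, so pixels * (72 / dpi) gives points.
    width_pt = width_px * 72.0 / dpi
    height_pt = height_px * 72.0 / dpi
    if width_pt <= 0 or height_pt <= 0:
        return (0.0, 0.0)

    # Take the smaller ratio so both dimensions fit; cap at 1.0 to avoid upscaling.
    scale = min(max_width_pt / width_pt, max_height_pt / height_pt, 1.0)
    return (width_pt * scale, height_pt * scale)
```

Because a single `min(...)` already bounds both dimensions, a second pass that re-clamps width and height separately should not be needed in this sketch.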
File diff suppressed because it is too large
|
|
@ -0,0 +1,380 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Text renderer for report generation.
|
||||
"""
|
||||
|
||||
from .documentRendererBaseTemplate import BaseRenderer
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List, Optional
|
||||
|
||||
class RendererText(BaseRenderer):
|
||||
"""Renders content to plain text format with format-specific extraction."""
|
||||
|
||||
@classmethod
|
||||
def getSupportedFormats(cls) -> List[str]:
|
||||
"""Return supported text formats (excluding formats with dedicated renderers)."""
|
||||
return [
|
||||
'txt', 'text', 'plain',
|
||||
# Programming languages
|
||||
'js', 'javascript', 'ts', 'typescript', 'jsx', 'tsx',
|
||||
'py', 'python', 'java', 'cpp', 'c', 'h', 'hpp',
|
||||
'cs', 'csharp', 'php', 'rb', 'ruby', 'go', 'rs', 'rust',
|
||||
'swift', 'kt', 'kotlin', 'scala', 'r', 'm', 'objc',
|
||||
'sh', 'bash', 'zsh', 'fish', 'ps1', 'bat', 'cmd',
|
||||
# Web technologies (excluding html/htm which have dedicated renderer)
|
||||
'css', 'scss', 'sass', 'less', 'xml', 'yaml', 'yml', 'toml', 'ini', 'cfg',
|
||||
# Data formats (excluding csv, md/markdown which have dedicated renderers)
|
||||
'tsv', 'log', 'rst', 'sql', 'dockerfile', 'dockerignore', 'gitignore',
|
||||
# Configuration files
|
||||
'env', 'properties', 'conf', 'config', 'rc',
|
||||
'gitattributes', 'editorconfig', 'eslintrc',
|
||||
# Documentation
|
||||
'readme', 'changelog', 'license', 'authors',
|
||||
'contributing', 'todo', 'notes', 'docs'
|
||||
]
|
||||
|
||||
@classmethod
|
||||
def getFormatAliases(cls) -> List[str]:
|
||||
"""Return format aliases."""
|
||||
return [
|
||||
'ascii', 'utf8', 'utf-8', 'code', 'source',
|
||||
'script', 'program', 'file', 'document',
|
||||
'raw', 'unformatted', 'plaintext'
|
||||
]
|
||||
|
||||
@classmethod
|
||||
def getPriority(cls) -> int:
|
||||
"""Return priority for text renderer."""
|
||||
return 90
|
||||
|
||||
@classmethod
|
||||
def getOutputStyle(cls, formatName: Optional[str] = None) -> str:
|
||||
"""
|
||||
Return output style classification based on format.
|
||||
For txt/text/plain: 'document' (unstructured text)
|
||||
For all other formats: 'code' (structured formats with rules/syntax)
|
||||
|
||||
Note: formatName parameter is provided by registry when calling this method.
|
||||
"""
|
||||
# Plain text formats are document style
|
||||
if formatName and formatName.lower() in ['txt', 'text', 'plain']:
|
||||
return 'document'
|
||||
# All other formats handled by RendererText are code style
|
||||
return 'code'
|
||||
|
||||
@classmethod
|
||||
def getAcceptedSectionTypes(cls, formatName: Optional[str] = None) -> List[str]:
|
||||
"""
|
||||
Return list of section content types that Text renderer accepts.
|
||||
Text renderer accepts all section types except images (text formats cannot display images).
|
||||
"""
|
||||
from modules.datamodels.datamodelJson import supportedSectionTypes
|
||||
|
||||
# Text renderer accepts all types except images
|
||||
return [st for st in supportedSectionTypes if st != "image"]
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: Optional[str] = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to plain text format."""
|
||||
try:
|
||||
# Generate text from JSON structure
|
||||
textContent = self._generateTextFromJson(extractedContent, title)
|
||||
|
||||
# Determine filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
filename = documents[0].get("filename")
|
||||
if not filename:
|
||||
filename = self._determineFilename(title, "text/plain")
|
||||
else:
|
||||
filename = self._determineFilename(title, "text/plain")
|
||||
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=textContent.encode('utf-8'),
|
||||
mimeType="text/plain",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering text: {str(e)}")
|
||||
# Return minimal text fallback
|
||||
fallbackContent = f"{title}\n\nError rendering report: {str(e)}"
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=fallbackContent.encode('utf-8'),
|
||||
mimeType="text/plain",
|
||||
filename=self._determineFilename(title, "text/plain"),
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
def _generateTextFromJson(self, jsonContent: Dict[str, Any], title: str) -> str:
|
||||
"""Generate text content from structured JSON document."""
|
||||
try:
|
||||
# Validate JSON structure
|
||||
if not self._validateJsonStructure(jsonContent):
|
||||
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
# Extract sections and metadata from standardized schema
|
||||
sections = self._extractSections(jsonContent)
|
||||
metadata = self._extractMetadata(jsonContent)
|
||||
|
||||
# Use provided title (which comes from documents[].title) as primary source
|
||||
# Fallback to metadata.title only if title parameter is empty
|
||||
documentTitle = title if title else metadata.get("title", "Generated Document")
|
||||
|
||||
# Build text content
|
||||
textParts = []
|
||||
|
||||
# Document title
|
||||
textParts.append(documentTitle)
|
||||
textParts.append("=" * len(documentTitle))
|
||||
textParts.append("")
|
||||
|
||||
# Process each section
|
||||
for section in sections:
|
||||
sectionText = self._renderJsonSection(section)
|
||||
if sectionText:
|
||||
textParts.append(sectionText)
|
||||
textParts.append("") # Add spacing between sections
|
||||
|
||||
# Add generation info
|
||||
textParts.append("")
|
||||
textParts.append(f"Generated: {self._formatTimestamp()}")
|
||||
|
||||
return '\n'.join(textParts)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error generating text from JSON: {str(e)}")
|
||||
raise Exception(f"Text generation failed: {str(e)}") from e
|
||||
|
||||
def _renderJsonSection(self, section: Dict[str, Any]) -> str:
|
||||
"""Render a single JSON section to text.
|
||||
Supports three content formats: reference, object (base64), extracted_text.
|
||||
"""
|
||||
try:
|
||||
sectionType = self._getSectionType(section)
|
||||
sectionData = self._getSectionData(section)
|
||||
|
||||
# Check for three content formats from Phase 5D in elements
|
||||
if isinstance(sectionData, list):
|
||||
textParts = []
|
||||
for element in sectionData:
|
||||
element_type = element.get("type", "") if isinstance(element, dict) else ""
|
||||
|
||||
# Support three content formats from Phase 5D
|
||||
if element_type == "reference":
|
||||
# Document reference format
|
||||
doc_ref = element.get("documentReference", "")
|
||||
label = element.get("label", "Reference")
|
||||
textParts.append(f"[Reference: {label}]")
|
||||
continue
|
||||
elif element_type == "extracted_text":
|
||||
# Extracted text format
|
||||
content = element.get("content", "")
|
||||
source = element.get("source", "")
|
||||
if content:
|
||||
source_text = f" (Source: {source})" if source else ""
|
||||
textParts.append(f"{content}{source_text}")
|
||||
continue
|
||||
|
||||
# If we processed reference/extracted_text elements, return them
|
||||
if textParts:
|
||||
return '\n\n'.join(textParts)
|
||||
|
||||
if sectionType == "table":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonTable(element)
|
||||
return ""
|
||||
elif sectionType == "bullet_list":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonBulletList(element)
|
||||
return ""
|
||||
elif sectionType == "heading":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonHeading(element)
|
||||
return ""
|
||||
elif sectionType == "paragraph":
|
||||
# Render each paragraph element in the elements array
|
||||
renderedElements = []
|
||||
for element in sectionData:
|
||||
renderedElements.append(self._renderJsonParagraph(element))
|
||||
return "\n".join(renderedElements)
|
||||
elif sectionType == "code_block":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonCodeBlock(element)
|
||||
return ""
|
||||
elif sectionType == "image":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonImage(element)
|
||||
return ""
|
||||
else:
|
||||
# Fallback to paragraph for unknown types - render each element
|
||||
# sectionData is already the elements array from _getSectionData
|
||||
renderedElements = []
|
||||
for element in sectionData:
|
||||
renderedElements.append(self._renderJsonParagraph(element))
|
||||
return "\n".join(renderedElements)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering section {self._getSectionId(section)}: {str(e)}")
|
||||
return f"[Error rendering section: {str(e)}]"
|
||||
|
||||
def _renderJsonTable(self, tableData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON table to text."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{headers, rows}
|
||||
content = tableData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
headers = content.get("headers", [])
|
||||
rows = content.get("rows", [])
|
||||
|
||||
if not headers or not rows:
|
||||
return ""
|
||||
|
||||
textParts = []
|
||||
|
||||
# Create table header
|
||||
headerLine = " | ".join(str(header) for header in headers)
|
||||
textParts.append(headerLine)
|
||||
|
||||
# Add separator line
|
||||
separatorLine = " | ".join("-" * len(str(header)) for header in headers)
|
||||
textParts.append(separatorLine)
|
||||
|
||||
# Add data rows
|
||||
for row in rows:
|
||||
rowLine = " | ".join(str(cellData) for cellData in row)
|
||||
textParts.append(rowLine)
|
||||
|
||||
return '\n'.join(textParts)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering table: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonBulletList(self, listData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON bullet list to text."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{items}
|
||||
content = listData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
items = content.get("items", [])
|
||||
|
||||
if not items:
|
||||
return ""
|
||||
|
||||
textParts = []
|
||||
for item in items:
|
||||
if isinstance(item, str):
|
||||
textParts.append(f"- {item}")
|
||||
elif isinstance(item, dict) and "text" in item:
|
||||
textParts.append(f"- {item['text']}")
|
||||
|
||||
return '\n'.join(textParts)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering bullet list: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonHeading(self, headingData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON heading to text."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{text, level}
|
||||
content = headingData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
text = content.get("text", "")
|
||||
level = content.get("level", 1)
|
||||
|
||||
if text:
|
||||
level = max(1, min(6, level))
|
||||
if level == 1:
|
||||
return f"{text}\n{'=' * len(text)}"
|
||||
elif level == 2:
|
||||
return f"{text}\n{'-' * len(text)}"
|
||||
else:
|
||||
return f"{'#' * level} {text}"
|
||||
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering heading: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonParagraph(self, paragraphData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON paragraph to text."""
|
||||
try:
|
||||
# Extract from nested content structure
|
||||
content = paragraphData.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
return text if text else ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering paragraph: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonCodeBlock(self, codeData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON code block to text."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{code, language}
|
||||
content = codeData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
code = content.get("code", "")
|
||||
language = content.get("language", "")
|
||||
|
||||
if code:
|
||||
if language:
|
||||
return f"Code ({language}):\n{code}"
|
||||
else:
|
||||
return code
|
||||
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering code block: {str(e)}")
|
||||
return ""
|
||||
|
||||
def _renderJsonImage(self, imageData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON image to text."""
|
||||
try:
|
||||
# Extract from nested content structure: element.content.{base64Data, altText, caption}
|
||||
content = imageData.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
altText = content.get("altText", "Image")
|
||||
else:
|
||||
altText = imageData.get("altText", "Image")
|
||||
return f"[Image: {altText}]"
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering image: {str(e)}")
|
||||
return "[Image: Image]"
|
||||
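The text renderer above maps each element's nested `content` dictionary to plain text. The following is a minimal standalone sketch of the table and heading rules, assuming the same `element.content.{headers, rows}` and `element.content.{text, level}` layout; the function names `render_table` and `render_heading` are hypothetical and are not part of the commit.

```python
from typing import Any, Dict


def render_table(element: Dict[str, Any]) -> str:
    """Render a table element as pipe-separated plain text."""
    content = element.get("content", {})
    if not isinstance(content, dict):
        return ""
    headers = content.get("headers", [])
    rows = content.get("rows", [])
    if not headers or not rows:
        return ""
    # Header line, then a dash separator sized to each header, then the data rows.
    lines = [" | ".join(str(h) for h in headers)]
    lines.append(" | ".join("-" * len(str(h)) for h in headers))
    lines.extend(" | ".join(str(cell) for cell in row) for row in rows)
    return "\n".join(lines)


def render_heading(element: Dict[str, Any]) -> str:
    """Render a heading: underlined for levels 1 and 2, '#'-prefixed otherwise."""
    content = element.get("content", {})
    if not isinstance(content, dict):
        return ""
    text = content.get("text", "")
    if not text:
        return ""
    level = max(1, min(6, content.get("level", 1)))  # clamp to 1..6
    if level == 1:
        return f"{text}\n{'=' * len(text)}"
    if level == 2:
        return f"{text}\n{'-' * len(text)}"
    return f"{'#' * level} {text}"


table = {"content": {"headers": ["Name", "Qty"], "rows": [["Bolt", 4]]}}
print(render_table(table))
print(render_heading({"content": {"text": "Intro", "level": 3}}))
```

Separator widths follow the header text only, so columns are not padded to the widest cell; this matches the behaviour of the renderer shown above.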
File diff suppressed because it is too large
File diff suppressed because it is too large
|
|
@ -0,0 +1,163 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Content Integrator for hierarchical document generation.
|
||||
Merges generated content into document structure and validates completeness.
|
||||
"""
|
||||
|
||||
import logging
|
||||
from typing import Dict, Any, List, Tuple
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ContentIntegrator:
|
||||
"""Integrates generated content into document structure"""
|
||||
|
||||
def __init__(self, services: Any = None):
|
||||
self.services = services
|
||||
|
||||
def integrateContent(
|
||||
self,
|
||||
structure: Dict[str, Any],
|
||||
generatedSections: List[Dict[str, Any]]
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Merge generated sections into document structure.
|
||||
|
||||
Args:
|
||||
structure: Original document structure
|
||||
generatedSections: List of sections with populated elements
|
||||
|
||||
Returns:
|
||||
Complete document structure ready for rendering
|
||||
"""
|
||||
try:
|
||||
# Create mapping of section IDs to generated sections
|
||||
sectionMap = {section.get("id"): section for section in generatedSections}
|
||||
|
||||
# Process each document
|
||||
for doc in structure.get("documents", []):
|
||||
sections = doc.get("sections", [])
|
||||
|
||||
for idx, section in enumerate(sections):
|
||||
sectionId = section.get("id")
|
||||
|
||||
# Find corresponding generated section
|
||||
if sectionId in sectionMap:
|
||||
generatedSection = sectionMap[sectionId]
|
||||
|
||||
# Merge elements into structure section
|
||||
if "elements" in generatedSection:
|
||||
section["elements"] = generatedSection["elements"]
|
||||
|
||||
# Preserve error information if present
|
||||
if generatedSection.get("error"):
|
||||
section["error"] = True
|
||||
section["errorMessage"] = generatedSection.get("errorMessage")
|
||||
section["originalContentType"] = generatedSection.get("originalContentType")
|
||||
else:
|
||||
# Section not generated - create error section
|
||||
logger.warning(f"Section {sectionId} not found in generated sections")
|
||||
section = self.createErrorSection(
|
||||
section,
|
||||
f"Section {sectionId} was not generated"
|
||||
)
|
||||
sections[idx] = section
|
||||
|
||||
# Debug: Write final merged structure to debug file (harmonized - no checks needed)
|
||||
import json
|
||||
structureJson = json.dumps(structure, indent=2, ensure_ascii=False)
|
||||
self.services.utils.writeDebugFile(
|
||||
structureJson,
|
||||
"document_generation_final_merged_json"
|
||||
)
|
||||
logger.debug(f"Logged final merged JSON structure ({len(structureJson)} chars)")
|
||||
|
||||
return structure
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error integrating content: {str(e)}")
|
||||
raise
|
||||
|
||||
def validateCompleteness(
|
||||
self,
|
||||
document: Dict[str, Any]
|
||||
) -> Tuple[bool, List[str]]:
|
||||
"""
|
||||
Validate that all sections have content.
|
||||
|
||||
Args:
|
||||
document: Document structure to validate
|
||||
|
||||
Returns:
|
||||
(is_complete, list_of_missing_sections)
|
||||
"""
|
||||
missingSections = []
|
||||
|
||||
try:
|
||||
for doc in document.get("documents", []):
|
||||
sections = doc.get("sections", [])
|
||||
|
||||
for section in sections:
|
||||
sectionId = section.get("id", "unknown")
|
||||
elements = section.get("elements", [])
|
||||
|
||||
# Check if section has content
|
||||
if not elements:
|
||||
# Skip error sections (they have error text)
|
||||
if not section.get("error"):
|
||||
missingSections.append(sectionId)
|
||||
else:
|
||||
# Validate elements have actual content
|
||||
hasContent = False
|
||||
for element in elements:
|
||||
# Check different content types
|
||||
if element.get("text") or element.get("base64Data") or \
|
||||
element.get("headers") or element.get("items") or \
|
||||
element.get("code"):
|
||||
hasContent = True
|
||||
break
|
||||
|
||||
if not hasContent and not section.get("error"):
|
||||
missingSections.append(sectionId)
|
||||
|
||||
return len(missingSections) == 0, missingSections
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error validating completeness: {str(e)}")
|
||||
return False, [f"Validation error: {str(e)}"]
|
||||
|
||||
def createErrorSection(
|
||||
self,
|
||||
originalSection: Dict[str, Any],
|
||||
errorMessage: str
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Create error placeholder section.
|
||||
|
||||
Args:
|
||||
originalSection: Original section that failed
|
||||
errorMessage: Error message to display
|
||||
|
||||
Returns:
|
||||
Error section with placeholder content
|
||||
"""
|
||||
contentType = originalSection.get("content_type", "content")
|
||||
sectionId = originalSection.get("id", "unknown")
|
||||
|
||||
return {
|
||||
"id": sectionId,
|
||||
"content_type": "paragraph", # Change to paragraph for error display
|
||||
"elements": [{
|
||||
"text": f"[ERROR: Failed to generate {contentType} for section '{sectionId}'. Error: {errorMessage}]"
|
||||
}],
|
||||
"order": originalSection.get("order", 0),
|
||||
"error": True,
|
||||
"errorMessage": errorMessage,
|
||||
"originalContentType": contentType,
|
||||
"title": originalSection.get("title"),
|
||||
"generation_hint": originalSection.get("generation_hint"),
|
||||
"complexity": originalSection.get("complexity")
|
||||
}
|
||||
|
||||
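The merge step in `ContentIntegrator.integrateContent` can be illustrated with a reduced sketch: generated sections are indexed by `id`, copied into the matching structure sections, and any section without a match receives an error placeholder. The function name `merge_sections` and the shortened placeholder text are assumptions made for this illustration; debug-file output and the additional error fields of the original are omitted.

```python
from typing import Any, Dict, List


def merge_sections(
    structure: Dict[str, Any], generated: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Copy generated elements into the structure, keyed by section id."""
    by_id = {section.get("id"): section for section in generated}

    for doc in structure.get("documents", []):
        for section in doc.get("sections", []):
            section_id = section.get("id")
            if section_id in by_id:
                match = by_id[section_id]
                if "elements" in match:
                    section["elements"] = match["elements"]
            else:
                # No generated counterpart: mark the section with a visible placeholder.
                section["elements"] = [
                    {"text": f"[ERROR: section {section_id} was not generated]"}
                ]
                section["error"] = True
    return structure


structure = {"documents": [{"sections": [{"id": "s1"}, {"id": "s2"}]}]}
generated = [{"id": "s1", "elements": [{"text": "Hello"}]}]
merged = merge_sections(structure, generated)
print(merged["documents"][0]["sections"])
```

Like the original, this sketch mutates `structure` in place and returns it, so callers that need the unmerged structure would have to copy it first.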
|
|
@ -0,0 +1,253 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
from typing import Any, Dict
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def getFileExtension(fileName: str) -> str:
|
||||
"""Extract file extension from fileName (without dot, lowercased)."""
|
||||
if '.' in fileName:
|
||||
return fileName.rsplit('.', 1)[-1].lower()
|
||||
return ''
|
||||
|
||||
def getMimeTypeFromExtension(extension: str) -> str:
|
||||
"""
|
||||
Get MIME type based on file extension.
|
||||
Consolidates the extension-to-MIME-type mapping in a single function.
|
||||
|
||||
Args:
|
||||
extension: File extension (with or without dot)
|
||||
|
||||
Returns:
|
||||
str: MIME type for the extension
|
||||
"""
|
||||
# Normalize extension (remove dot if present)
|
||||
if extension.startswith('.'):
|
||||
extension = extension[1:]
|
||||
|
||||
# Map extensions to MIME types
|
||||
mime_types = {
|
||||
'txt': 'text/plain',
|
||||
'json': 'application/json',
|
||||
'xml': 'application/xml',
|
||||
'csv': 'text/csv',
|
||||
'html': 'text/html',
|
||||
'htm': 'text/html',
|
||||
'md': 'text/markdown',
|
||||
'py': 'text/x-python',
|
||||
'js': 'application/javascript',
|
||||
'css': 'text/css',
|
||||
'pdf': 'application/pdf',
|
||||
'doc': 'application/msword',
|
||||
'docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
|
||||
'xls': 'application/vnd.ms-excel',
|
||||
'xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
|
||||
'ppt': 'application/vnd.ms-powerpoint',
|
||||
'pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
|
||||
'svg': 'image/svg+xml',
|
||||
'jpg': 'image/jpeg',
|
||||
'jpeg': 'image/jpeg',
|
||||
'png': 'image/png',
|
||||
'gif': 'image/gif',
|
||||
'bmp': 'image/bmp',
|
||||
'webp': 'image/webp',
|
||||
'zip': 'application/zip',
|
||||
'rar': 'application/x-rar-compressed',
|
||||
'7z': 'application/x-7z-compressed',
|
||||
'tar': 'application/x-tar',
|
||||
'gz': 'application/gzip'
|
||||
}
|
||||
return mime_types.get(extension.lower(), 'application/octet-stream')
|
||||
|
||||
def detectContentTypeFromData(fileData: bytes, fileName: str) -> str:
|
||||
"""
|
||||
Detect content type from file data and fileName.
|
||||
Exposes MIME type detection through the service center: the file extension is checked first, then the leading bytes of the data.
|
||||
|
||||
Args:
|
||||
fileData: Raw file data as bytes
|
||||
fileName: Name of the file
|
||||
|
||||
Returns:
|
||||
str: Detected MIME type
|
||||
"""
|
||||
try:
|
||||
# Check file extension first
|
||||
ext = os.path.splitext(fileName)[1].lower()
|
||||
if ext:
|
||||
# Map common extensions to MIME types
|
||||
extToMime = {
|
||||
'.txt': 'text/plain',
|
||||
'.md': 'text/markdown',
|
||||
'.csv': 'text/csv',
|
||||
'.json': 'application/json',
|
||||
'.xml': 'application/xml',
|
||||
'.js': 'application/javascript',
|
||||
'.py': 'text/x-python',
|
||||
'.svg': 'image/svg+xml',
|
||||
'.jpg': 'image/jpeg',
|
||||
'.jpeg': 'image/jpeg',
|
||||
'.png': 'image/png',
|
||||
'.gif': 'image/gif',
|
||||
'.bmp': 'image/bmp',
|
||||
'.webp': 'image/webp',
|
||||
'.pdf': 'application/pdf',
|
||||
'.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
|
||||
'.doc': 'application/msword',
|
||||
'.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
|
||||
'.xls': 'application/vnd.ms-excel',
|
||||
'.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
|
||||
'.ppt': 'application/vnd.ms-powerpoint',
|
||||
'.html': 'text/html',
|
||||
'.htm': 'text/html',
|
||||
'.css': 'text/css',
|
||||
'.zip': 'application/zip',
|
||||
'.rar': 'application/x-rar-compressed',
|
||||
'.7z': 'application/x-7z-compressed',
|
||||
'.tar': 'application/x-tar',
|
||||
'.gz': 'application/gzip'
|
||||
}
|
||||
if ext in extToMime:
|
||||
return extToMime[ext]
|
||||
|
||||
# Try to detect from content
|
||||
if fileData.startswith(b'%PDF'):
|
||||
return 'application/pdf'
|
||||
elif fileData.startswith(b'PK\x03\x04'):
|
||||
# ZIP-based formats (docx, xlsx, pptx)
|
||||
return 'application/zip'
|
||||
elif fileData.startswith(b'<'):
|
||||
# XML-based formats
|
||||
try:
|
||||
text = fileData.decode('utf-8', errors='ignore')
|
||||
if '<svg' in text.lower():
|
||||
return 'image/svg+xml'
|
||||
elif '<html' in text.lower():
|
||||
return 'text/html'
|
||||
else:
|
||||
return 'application/xml'
|
||||
except Exception:
|
||||
pass
|
||||
elif fileData.startswith(b'\x89PNG\r\n\x1a\n'):
|
||||
return 'image/png'
|
||||
elif fileData.startswith(b'\xff\xd8\xff'):
|
||||
return 'image/jpeg'
|
||||
elif fileData.startswith(b'GIF87a') or fileData.startswith(b'GIF89a'):
|
||||
return 'image/gif'
|
||||
elif fileData.startswith(b'BM'):
|
||||
return 'image/bmp'
|
||||
elif fileData.startswith(b'RIFF') and fileData[8:12] == b'WEBP':
|
||||
return 'image/webp'
|
||||
|
||||
return 'application/octet-stream'
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error detecting content type from data: {str(e)}")
|
||||
return 'application/octet-stream'
|
||||
|
||||
def detectMimeTypeFromData(file_bytes: bytes, fileName: str, service=None) -> str:
|
||||
"""Detect MIME type from file bytes and fileName using a service if provided."""
|
||||
try:
|
||||
if service and hasattr(service, 'detectContentTypeFromData'):
|
||||
detected = service.detectContentTypeFromData(file_bytes, fileName)
|
||||
if detected and detected != 'application/octet-stream':
|
||||
return detected
|
||||
# Fallback: use our consolidated function
|
||||
return detectContentTypeFromData(file_bytes, fileName)
|
||||
except Exception as e:
|
||||
logger.warning(f"Error in MIME type detection for {fileName}: {str(e)}")
|
||||
return 'application/octet-stream'
|
||||
|
||||
def detectMimeTypeFromContent(content: Any, fileName: str, service=None) -> str:
|
||||
"""Detect MIME type from content and fileName using a service if provided."""
|
||||
try:
|
||||
if isinstance(content, str):
|
||||
file_bytes = content.encode('utf-8')
|
||||
elif isinstance(content, dict):
|
||||
file_bytes = json.dumps(content, ensure_ascii=False).encode('utf-8')
|
||||
else:
|
||||
file_bytes = str(content).encode('utf-8')
|
||||
return detectMimeTypeFromData(file_bytes, fileName, service)
|
||||
except Exception as e:
|
||||
logger.warning(f"Error in MIME type detection for {fileName}: {str(e)}")
|
||||
return 'application/octet-stream'
|
||||
|
||||
def convertDocumentDataToString(document_data: Any, file_extension: str) -> str:
|
||||
"""Convert document data to string content based on file type with enhanced processing."""
|
||||
try:
|
||||
if document_data is None:
|
||||
return ""
|
||||
if isinstance(document_data, bytes):
|
||||
# IMPORTANT: Decode bytes to string for text files (HTML, text, etc.)
|
||||
try:
|
||||
return document_data.decode('utf-8')
|
||||
except UnicodeDecodeError:
|
||||
# Fallback: try latin1 or return with error replacement
|
||||
try:
|
||||
return document_data.decode('latin1')
|
||||
except Exception:
|
||||
return document_data.decode('utf-8', errors='replace')
|
||||
if isinstance(document_data, str):
|
||||
return document_data
|
||||
if isinstance(document_data, dict):
|
||||
if file_extension == 'json':
|
||||
return json.dumps(document_data, indent=2, ensure_ascii=False)
|
||||
elif file_extension in ['txt', 'md', 'html', 'css', 'js', 'py']:
|
||||
text_fields = ['content', 'text', 'data', 'result', 'summary', 'extracted_content', 'table_data']
|
||||
for field in text_fields:
|
||||
if field in document_data:
|
||||
content = document_data[field]
|
||||
if isinstance(content, str):
|
||||
return content
|
||||
elif isinstance(content, (dict, list)):
|
||||
return json.dumps(content, indent=2, ensure_ascii=False)
|
||||
return json.dumps(document_data, indent=2, ensure_ascii=False)
|
||||
elif file_extension == 'csv':
|
||||
csv_fields = ['table_data', 'csv_data', 'rows', 'data', 'content', 'text']
|
||||
for field in csv_fields:
|
||||
if field in document_data:
|
||||
content = document_data[field]
|
||||
if isinstance(content, str):
|
||||
return content
|
||||
elif isinstance(content, list):
|
||||
if content and isinstance(content[0], (list, dict)):
|
||||
import csv
|
||||
import io
|
||||
output = io.StringIO()
|
||||
if isinstance(content[0], dict):
|
||||
if content:
|
||||
fieldnames = content[0].keys()
|
||||
writer = csv.DictWriter(output, fieldnames=fieldnames)
|
||||
writer.writeheader()
|
||||
writer.writerows(content)
|
||||
else:
|
||||
writer = csv.writer(output)
|
||||
writer.writerows(content)
|
||||
return output.getvalue()
|
||||
return json.dumps(document_data, indent=2, ensure_ascii=False)
|
||||
else:
|
||||
return json.dumps(document_data, indent=2, ensure_ascii=False)
|
||||
elif isinstance(document_data, list):
|
||||
if file_extension == 'csv':
|
||||
import csv
|
||||
import io
|
||||
output = io.StringIO()
|
||||
if document_data and isinstance(document_data[0], dict):
|
||||
fieldnames = document_data[0].keys()
|
||||
writer = csv.DictWriter(output, fieldnames=fieldnames)
|
||||
writer.writeheader()
|
||||
writer.writerows(document_data)
|
||||
else:
|
||||
writer = csv.writer(output)
|
||||
writer.writerows(document_data)
|
||||
return output.getvalue()
|
||||
else:
|
||||
return json.dumps(document_data, indent=2, ensure_ascii=False)
|
||||
else:
|
||||
return str(document_data)
|
||||
except Exception as e:
|
||||
logger.error(f"Error converting document data to string: {str(e)}")
|
||||
return str(document_data)
|
||||
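The content-based fallback in `detectContentTypeFromData` compares the first bytes of the data against known file signatures. A table-driven variant of that idea might look like the sketch below; `SIGNATURES` and `sniff_mime` are hypothetical names, and the XML/HTML/SVG text inspection of the original is left out for brevity.

```python
from typing import List, Tuple

# (leading bytes, MIME type) pairs, checked in order.
SIGNATURES: List[Tuple[bytes, str]] = [
    (b"%PDF", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # also the container for docx/xlsx/pptx
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"BM", "image/bmp"),
]


def sniff_mime(data: bytes) -> str:
    """Guess a MIME type from leading magic bytes, defaulting to octet-stream."""
    for prefix, mime in SIGNATURES:
        if data.startswith(prefix):
            return mime
    # WebP is a RIFF container whose form type sits at byte offsets 8..11.
    if data.startswith(b"RIFF") and data[8:12] == b"WEBP":
        return "image/webp"
    return "application/octet-stream"


print(sniff_mime(b"%PDF-1.7 example"))
print(sniff_mime(b"plain text"))
```

Keeping the signatures in one list means adding a format is a one-line change, at the cost of handling offset-based checks such as WebP separately.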
|
|
@ -0,0 +1,560 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
JSON Schema definitions for AI-generated document structures (unified).
|
||||
This module provides schemas that guide AI to generate structured JSON output
|
||||
that matches the master template in modules.datamodels.datamodelJson.
|
||||
"""
|
||||
|
||||
from typing import Dict, Any
|
||||
|
||||
|
||||
def getMultiDocumentSchema() -> Dict[str, Any]:
|
||||
"""Get the JSON schema for multi-document generation (unified)."""
|
||||
return {
|
||||
"type": "object",
|
||||
"required": ["metadata", "documents"],
|
||||
"properties": {
|
||||
"metadata": {
|
||||
"type": "object",
|
||||
"required": ["split_strategy"],
|
||||
"properties": {
|
||||
"split_strategy": {
|
||||
"type": "string",
|
||||
"enum": [
|
||||
"single_document",
|
||||
"per_entity",
|
||||
"by_section",
|
||||
"by_criteria",
|
||||
"by_data_type",
|
||||
"custom"
|
||||
],
|
||||
"description": "Strategy for splitting content into multiple files"
|
||||
},
|
||||
"splitCriteria": {
|
||||
"type": "object",
|
||||
"description": "Custom criteria for splitting (e.g., entity_id, category, etc.)"
|
||||
},
|
||||
"fileNamingPattern": {
|
||||
"type": "string",
|
||||
"description": "Pattern for generating filenames (e.g., '{entity_name}_data.docx')"
|
||||
},
|
||||
"source_documents": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"},
|
||||
"description": "List of source document IDs"
|
||||
},
|
||||
"extraction_method": {
|
||||
"type": "string",
|
||||
"default": "ai_generation",
|
||||
"description": "Method used for extraction"
|
||||
}
|
||||
}
|
||||
},
|
||||
"documents": {
|
||||
"type": "array",
|
||||
"description": "Array of individual documents to generate",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"required": ["id", "title", "sections", "filename"],
|
||||
"properties": {
|
||||
"id": {"type": "string", "description": "Unique document identifier"},
|
||||
"title": {"type": "string", "description": "Document title"},
|
||||
"filename": {"type": "string", "description": "Generated filename"},
|
||||
"sections": {
|
||||
"type": "array",
|
||||
"description": "Document sections containing structured content",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"required": ["id", "content_type", "elements", "order"],
|
||||
"properties": {
|
||||
"id": {"type": "string", "description": "Unique section identifier"},
|
||||
"title": {"type": "string", "description": "Section title (optional)"},
|
||||
"content_type": {
|
||||
"type": "string",
|
||||
"enum": [
|
||||
"table",
|
||||
"bullet_list",
|
||||
"paragraph",
|
||||
"heading",
|
||||
"code_block",
|
||||
"image",
|
||||
"mixed"
|
||||
],
|
||||
"description": "Primary content type of this section"
|
||||
},
|
||||
"elements": {
|
||||
"type": "array",
|
||||
"description": "Content elements in this section",
|
||||
"items": {
|
||||
"oneOf": [
|
||||
{"$ref": "#/definitions/table"},
|
||||
{"$ref": "#/definitions/bullet_list"},
|
||||
{"$ref": "#/definitions/paragraph"},
|
||||
{"$ref": "#/definitions/heading"},
|
||||
{"$ref": "#/definitions/code_block"},
|
||||
{"$ref": "#/definitions/image"}
|
||||
]
|
||||
}
|
||||
},
|
||||
"order": {"type": "integer", "description": "Section order in document"},
|
||||
"metadata": {
|
||||
"type": "object",
|
||||
"description": "Additional section metadata"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"metadata": {
|
||||
"type": "object",
|
||||
"description": "Document-specific metadata"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"definitions": {
|
||||
"table": {
|
||||
"type": "object",
|
||||
"required": ["headers", "rows"],
|
||||
"properties": {
|
||||
"headers": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"},
|
||||
"description": "Table column headers"
|
||||
},
|
||||
"rows": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"}
|
||||
},
|
||||
"description": "Table data rows"
|
||||
},
|
||||
"caption": {
|
||||
"type": "string",
|
||||
"description": "Table caption (optional)"
|
||||
}
|
||||
}
|
||||
},
|
||||
"bullet_list": {
|
||||
"type": "object",
|
||||
"required": ["items"],
|
||||
"properties": {
|
||||
"items": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"required": ["text"],
|
||||
"properties": {
|
||||
"text": {"type": "string", "description": "List item text"},
|
||||
"subitems": {
|
||||
"type": "array",
|
||||
"items": {"$ref": "#/definitions/list_item"},
|
||||
"description": "Nested sub-items (optional)"
|
||||
}
|
||||
}
|
||||
},
|
||||
"description": "List items"
|
||||
},
|
||||
"list_type": {
|
||||
"type": "string",
|
||||
"enum": ["bullet", "numbered", "checklist"],
|
||||
"default": "bullet",
|
||||
"description": "Type of list"
|
||||
}
|
||||
}
|
||||
},
|
||||
"list_item": {
|
||||
"type": "object",
|
||||
"required": ["text"],
|
||||
"properties": {
|
||||
"text": {"type": "string", "description": "List item text"},
|
||||
"subitems": {
|
||||
"type": "array",
|
||||
"items": {"$ref": "#/definitions/list_item"},
|
||||
"description": "Nested sub-items (optional)"
|
||||
}
|
||||
}
|
||||
},
|
||||
"paragraph": {
|
||||
"type": "object",
|
||||
"required": ["text"],
|
||||
"properties": {
|
||||
"text": {"type": "string", "description": "Paragraph text"},
|
||||
"formatting": {
|
||||
"type": "object",
|
||||
"description": "Text formatting (bold, italic, etc.)"
|
||||
}
|
||||
}
|
||||
},
|
||||
"heading": {
|
||||
"type": "object",
|
||||
"required": ["text", "level"],
|
||||
"properties": {
|
||||
"text": {"type": "string", "description": "Heading text"},
|
||||
"level": {
|
||||
"type": "integer",
|
||||
"minimum": 1,
|
||||
"maximum": 6,
|
||||
"description": "Heading level (1-6)"
|
||||
}
|
||||
}
|
||||
},
|
||||
"code_block": {
|
||||
"type": "object",
|
||||
"required": ["code"],
|
||||
"properties": {
|
||||
"code": {"type": "string", "description": "Code content"},
|
||||
"language": {"type": "string", "description": "Programming language (optional)"}
|
||||
}
|
||||
},
|
||||
"image": {
|
||||
"type": "object",
|
||||
"required": ["url"],
|
||||
"properties": {
|
||||
"url": {"type": "string", "description": "Image URL or data URI"},
|
||||
"caption": {"type": "string", "description": "Image caption (optional)"},
|
||||
"alt": {"type": "string", "description": "Alt text (optional)"}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
def getDocumentSchema() -> Dict[str, Any]:
|
||||
"""Get the JSON schema for structured document generation (single document)."""
|
||||
return {
|
||||
"type": "object",
|
||||
"required": ["metadata", "sections"],
|
||||
"properties": {
|
||||
"metadata": {
|
||||
"type": "object",
|
||||
"required": ["title"],
|
||||
"properties": {
|
||||
"title": {"type": "string", "description": "Document title"},
|
||||
"source_documents": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"},
|
||||
"description": "List of source document IDs"
|
||||
},
|
||||
"extraction_method": {
|
||||
"type": "string",
|
||||
"default": "ai_generation",
|
||||
"description": "Method used for extraction"
|
||||
}
|
||||
}
|
||||
},
|
||||
"sections": {
|
||||
"type": "array",
|
||||
"description": "Document sections containing structured content",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"required": ["id", "content_type", "elements", "order"],
|
||||
"properties": {
|
||||
"id": {"type": "string", "description": "Unique section identifier"},
|
||||
"title": {"type": "string", "description": "Section title (optional)"},
|
||||
"content_type": {
|
||||
"type": "string",
|
||||
"enum": [
|
||||
"table",
|
||||
"bullet_list",
|
||||
"paragraph",
|
||||
"heading",
|
||||
"code_block",
|
||||
"image",
|
||||
"mixed"
|
||||
],
|
||||
"description": "Primary content type of this section"
|
||||
},
|
||||
"elements": {
|
||||
"type": "array",
|
||||
"description": "Content elements in this section",
|
||||
"items": {
|
||||
"oneOf": [
|
||||
{"$ref": "#/definitions/table"},
|
||||
{"$ref": "#/definitions/bullet_list"},
|
||||
{"$ref": "#/definitions/paragraph"},
|
||||
{"$ref": "#/definitions/heading"},
|
||||
{"$ref": "#/definitions/code_block"},
|
||||
{"$ref": "#/definitions/image"}
|
||||
]
|
||||
}
|
||||
},
|
||||
"order": {"type": "integer", "description": "Section order in document"},
|
||||
"metadata": {
|
||||
"type": "object",
|
||||
"description": "Additional section metadata"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"summary": {
|
||||
"type": "string",
|
||||
"description": "Document summary (optional)"
|
||||
},
|
||||
"tags": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"},
|
||||
"description": "Document tags for categorization"
|
||||
}
|
||||
},
|
||||
"definitions": {
|
||||
"table": {
|
||||
"type": "object",
|
||||
"required": ["headers", "rows"],
|
||||
"properties": {
|
||||
"headers": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"},
|
||||
"description": "Table column headers"
|
||||
},
|
||||
"rows": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"}
|
||||
},
|
||||
"description": "Table data rows"
|
||||
},
|
||||
"caption": {
|
||||
"type": "string",
|
||||
"description": "Table caption (optional)"
|
||||
}
|
||||
}
|
||||
},
|
||||
"bullet_list": {
|
||||
"type": "object",
|
||||
"required": ["items"],
|
||||
"properties": {
|
||||
"items": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"required": ["text"],
|
||||
"properties": {
|
||||
"text": {"type": "string", "description": "List item text"},
|
||||
"subitems": {
|
||||
"type": "array",
|
||||
"items": {"$ref": "#/definitions/list_item"},
|
||||
"description": "Nested sub-items (optional)"
|
||||
}
|
||||
}
|
||||
},
|
||||
"description": "List items"
|
||||
},
|
||||
"list_type": {
|
||||
"type": "string",
|
||||
"enum": ["bullet", "numbered", "checklist"],
|
||||
"default": "bullet",
|
||||
"description": "Type of list"
|
||||
}
|
||||
}
|
||||
},
|
||||
"list_item": {
|
||||
"type": "object",
|
||||
"required": ["text"],
|
||||
"properties": {
|
||||
"text": {"type": "string", "description": "List item text"},
|
||||
"subitems": {
|
||||
"type": "array",
|
||||
"items": {"$ref": "#/definitions/list_item"},
|
||||
"description": "Nested sub-items (optional)"
|
||||
}
|
||||
}
|
||||
},
|
||||
"paragraph": {
|
||||
"type": "object",
|
||||
"required": ["text"],
|
||||
"properties": {
|
||||
"text": {"type": "string", "description": "Paragraph text"},
|
||||
"formatting": {
|
||||
"type": "object",
|
||||
"description": "Text formatting (bold, italic, etc.)"
|
||||
}
|
||||
}
|
||||
},
|
||||
"heading": {
|
||||
"type": "object",
|
||||
"required": ["text", "level"],
|
||||
"properties": {
|
||||
"text": {"type": "string", "description": "Heading text"},
|
||||
"level": {
|
||||
"type": "integer",
|
||||
"minimum": 1,
|
||||
"maximum": 6,
|
||||
"description": "Heading level (1-6)"
|
||||
}
|
||||
}
|
||||
},
|
||||
"code_block": {
|
||||
"type": "object",
|
||||
"required": ["code"],
|
||||
"properties": {
|
||||
"code": {"type": "string", "description": "Code content"},
|
||||
"language": {"type": "string", "description": "Programming language (optional)"}
|
||||
}
|
||||
},
|
||||
"image": {
|
||||
"type": "object",
|
||||
"required": ["url"],
|
||||
"properties": {
|
||||
"url": {"type": "string", "description": "Image URL or data URI"},
|
||||
"caption": {"type": "string", "description": "Image caption (optional)"},
|
||||
"alt": {"type": "string", "description": "Alt text (optional)"}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
def getExtractionPromptTemplate() -> str:
|
||||
"""Get the template for AI extraction prompts that request JSON output."""
|
||||
return """
|
||||
You are extracting structured content from documents. Your task is to analyze the provided content and generate a structured JSON document.
|
||||
|
||||
IMPORTANT: You must respond with valid JSON only. No additional text, explanations, or formatting outside the JSON structure.
|
||||
|
||||
JSON Schema Requirements:
|
||||
- Extract the actual data from the source documents
|
||||
- If content is a table, extract it as a table with headers and rows
|
||||
- If content is a list, extract it as a structured list with items
|
||||
- If content is text, extract it as paragraphs or headings
|
||||
- Preserve the original structure and data - do not summarize or interpret
|
||||
- Use the exact JSON schema provided
|
||||
|
||||
Content Types to Extract:
|
||||
1. Tables: Extract all rows and columns with proper headers
|
||||
2. Lists: Extract all items with proper nesting
|
||||
3. Headings: Extract with appropriate levels
|
||||
4. Paragraphs: Extract as structured text
|
||||
5. Code: Extract code blocks with language identification
|
||||
|
||||
Return only the JSON structure following the schema. Do not include any text before or after the JSON.
|
||||
"""
|
||||
|
||||
|
||||
def getGenerationPromptTemplate() -> str:
|
||||
"""Get the template for AI generation prompts that work with JSON input."""
|
||||
return """
|
||||
You are generating a document from structured JSON data. Your task is to create a well-formatted document based on the provided structured content.
|
||||
|
||||
IMPORTANT: You must respond with valid JSON only, following the document schema.
|
||||
|
||||
Generation Guidelines:
|
||||
- Use the provided JSON structure as the foundation
|
||||
- Enhance the content with proper formatting and organization
|
||||
- Ensure logical flow and readability
|
||||
- Maintain the original data integrity
|
||||
- Add appropriate headings and sections
|
||||
- Organize content in a logical sequence
|
||||
|
||||
Content Enhancement:
|
||||
- Tables: Ensure proper headers and data alignment
|
||||
- Lists: Use appropriate list types (bullet, numbered, checklist)
|
||||
- Headings: Use appropriate heading levels for hierarchy
|
||||
- Paragraphs: Ensure proper text flow and formatting
|
||||
- Code: Preserve code blocks with proper language identification
|
||||
|
||||
Return only the enhanced JSON structure following the schema. Do not include any text before or after the JSON.
|
||||
"""
|
||||
|
||||
|
||||
def getAdaptiveJsonSchema(promptAnalysis: Dict[str, Any] = None) -> Dict[str, Any]:
|
||||
"""Automatically select appropriate schema based on prompt analysis."""
|
||||
if promptAnalysis and promptAnalysis.get("is_multi_file", False):
|
||||
return getMultiDocumentSchema()
|
||||
else:
|
||||
return getDocumentSchema()
|
||||
|
||||
def validateJsonDocument(jsonData: Dict[str, Any]) -> bool:
|
||||
"""Validate that the JSON data follows the unified document schema."""
|
||||
try:
|
||||
# Basic validation - check required fields
|
||||
if not isinstance(jsonData, dict):
|
||||
return False
|
||||
|
||||
# Check if it's multi-document or single-document structure
|
||||
if "documents" in jsonData:
|
||||
# Multi-document structure
|
||||
if "metadata" not in jsonData:
|
||||
return False
|
||||
|
||||
metadata = jsonData["metadata"]
|
||||
if not isinstance(metadata, dict) or "split_strategy" not in metadata:
|
||||
return False
|
||||
|
||||
documents = jsonData["documents"]
|
||||
if not isinstance(documents, list):
|
||||
return False
|
||||
|
||||
# Validate each document
|
||||
for doc in documents:
|
||||
if not isinstance(doc, dict):
|
||||
return False
|
||||
|
||||
required_fields = ["id", "title", "sections", "filename"]
|
||||
for field in required_fields:
|
||||
if field not in doc:
|
||||
return False
|
||||
|
||||
# Validate sections in each document
|
||||
sections = doc.get("sections", [])
|
||||
if not isinstance(sections, list):
|
||||
return False
|
||||
|
||||
for section in sections:
|
||||
if not isinstance(section, dict):
|
||||
return False
|
||||
|
||||
section_required = ["id", "content_type", "elements", "order"]
|
||||
for field in section_required:
|
||||
if field not in section:
|
||||
return False
|
||||
|
||||
# Validate content_type
|
||||
valid_types = ["table", "bullet_list", "paragraph", "heading", "code_block", "image", "mixed"]
|
||||
if section["content_type"] not in valid_types:
|
||||
return False
|
||||
|
||||
# Validate elements
|
||||
if not isinstance(section["elements"], list):
|
||||
return False
|
||||
|
||||
elif "sections" in jsonData:
|
||||
# Single-document structure (existing validation)
|
||||
if "metadata" not in jsonData:
|
||||
return False
|
||||
|
||||
metadata = jsonData["metadata"]
|
||||
if not isinstance(metadata, dict) or "title" not in metadata:
|
||||
return False
|
||||
|
||||
sections = jsonData["sections"]
|
||||
if not isinstance(sections, list):
|
||||
return False
|
||||
|
||||
# Validate each section
|
||||
for section in sections:
|
||||
if not isinstance(section, dict):
|
||||
return False
|
||||
|
||||
required_fields = ["id", "content_type", "elements", "order"]
|
||||
for field in required_fields:
|
||||
if field not in section:
|
||||
return False
|
||||
|
||||
# Validate content_type
|
||||
valid_types = ["table", "bullet_list", "paragraph", "heading", "code_block", "image", "mixed"]
|
||||
if section["content_type"] not in valid_types:
|
||||
return False
|
||||
|
||||
# Validate elements
|
||||
if not isinstance(section["elements"], list):
|
||||
return False
|
||||
else:
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
except Exception:
|
||||
return False
|
||||
|
|
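Note on the schema file above: the single-document branch of `validateJsonDocument` checks only the top-level shape (`metadata.title`, a `sections` list, the four required section fields, and a known `content_type`). It does not inspect what is inside `elements`. The following is a minimal sketch of that behavior, not the module's code: `is_valid_single_document`, `minimal_document`, and `unchecked_document` are hypothetical names introduced here for illustration.

```python
from typing import Any

# Values copied from the validator's valid_types list in the diff above.
VALID_CONTENT_TYPES = ("table", "bullet_list", "paragraph", "heading",
                       "code_block", "image", "mixed")
REQUIRED_SECTION_FIELDS = ("id", "content_type", "elements", "order")


def is_valid_single_document(data: Any) -> bool:
    """Simplified stand-in (assumption) for the single-document branch."""
    if not isinstance(data, dict):
        return False
    metadata = data.get("metadata")
    if not isinstance(metadata, dict) or "title" not in metadata:
        return False
    sections = data.get("sections")
    if not isinstance(sections, list):
        return False
    for section in sections:
        if not isinstance(section, dict):
            return False
        if any(field not in section for field in REQUIRED_SECTION_FIELDS):
            return False
        if section["content_type"] not in VALID_CONTENT_TYPES:
            return False
        if not isinstance(section["elements"], list):
            return False
    return True


# A small document that follows the single-document schema.
minimal_document = {
    "metadata": {"title": "Quarterly Report"},
    "sections": [
        {"id": "s1", "content_type": "heading", "order": 1,
         "elements": [{"text": "Overview", "level": 1}]},
        {"id": "s2", "content_type": "table", "order": 2,
         "elements": [{"headers": ["Region", "Revenue"],
                       "rows": [["EMEA", "1200"]]}]},
    ],
}

# Element payloads are not inspected, so a heading level outside 1-6
# still passes this structural check.
unchecked_document = {
    "metadata": {"title": "Draft"},
    "sections": [
        {"id": "s1", "content_type": "heading", "order": 1,
         "elements": [{"text": "Too deep", "level": 9}]},
    ],
}

print(is_valid_single_document(minimal_document))
print(is_valid_single_document(unchecked_document))
print(is_valid_single_document({"sections": []}))
```

If element-level guarantees such as the heading `level` range matter, they would need a full JSON Schema validation pass against `getDocumentSchema()`; the structural check alone does not provide them.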
@ -0,0 +1,200 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Prompt builder for document generation.
|
||||
This module builds prompts for generating documents from extracted content.
|
||||
"""
|
||||
|
||||
import logging
|
||||
from typing import Dict, Any
|
||||
from modules.datamodels.datamodelJson import jsonTemplateDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
|
||||
async def buildGenerationPrompt(
|
||||
outputFormat: str,
|
||||
userPrompt: str,
|
||||
title: str,
|
||||
extracted_content: str = None,
|
||||
continuationContext: Dict[str, Any] = None,
|
||||
services: Any = None,
|
||||
useContentParts: bool = False # ARCHITECTURE: If True, don't include full content in prompt (ContentParts will be used directly)
|
||||
) -> str:
|
||||
"""
|
||||
Build the unified generation prompt using a single JSON template.
|
||||
Generic solution that works for any user request.
|
||||
|
||||
Args:
|
||||
outputFormat: Target output format (html, pdf, docx, etc.) - not used in prompt
|
||||
userPrompt: User's original prompt for document generation
|
||||
title: Title for the document
|
||||
extracted_content: Optional extracted content from documents to prepend to prompt
|
||||
continuationContext: Optional context from previous generation for continuation
|
||||
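services: Optional services instance for accessing user language
|
||||
useContentParts: If True, omit the full extracted content from the prompt (ContentParts are passed to the AI call directly)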
|
||||
|
||||
Returns:
|
||||
Complete generation prompt string
|
||||
"""
|
||||
# Extract user language for document language instruction
|
||||
userLanguage = 'en' # Default fallback
|
||||
if services:
|
||||
try:
|
||||
# Prefer detected language if available
|
||||
if hasattr(services, 'currentUserLanguage') and services.currentUserLanguage:
|
||||
userLanguage = services.currentUserLanguage
|
||||
elif hasattr(services, 'user') and services.user and hasattr(services.user, 'language'):
|
||||
userLanguage = services.user.language
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Create a template - fall back to a generic title if none is provided
|
||||
titleValue = title if title else "Generated Document"
|
||||
jsonTemplate = jsonTemplateDocument.replace("{{DOCUMENT_TITLE}}", titleValue)
|
||||
|
||||
# Build prompt based on whether this is a continuation or first call
|
||||
# Check if we have valid continuation context with actual JSON fragment
|
||||
# CRITICAL: Allow continuation even if section_count is 0 (broken JSON that couldn't be parsed)
|
||||
# as long as we have last_raw_json - this handles cases where JSON is too broken to extract sections
|
||||
hasContinuation = (
|
||||
continuationContext
|
||||
and continuationContext.get("last_raw_json", "")
|
||||
and continuationContext.get("last_raw_json", "").strip() != "{}"
|
||||
)
|
||||
|
||||
if hasContinuation:
|
||||
# CONTINUATION PROMPT - use centralized jsonContinuation system
|
||||
delivered_summary = continuationContext.get("delivered_summary", "")
|
||||
|
||||
# Use centralized system: overlap_context and hierarchy_context from jsonContinuation.getContexts()
|
||||
overlap_context = continuationContext.get("overlap_context")
|
||||
hierarchy_context = continuationContext.get("hierarchy_context")
|
||||
|
||||
# Build continuation text with delivered summary and cut-off information
|
||||
# CRITICAL: Always include cut-off information if available (per loop_plan.md)
|
||||
continuationText = f"{delivered_summary}\n\n"
|
||||
continuationText += "⚠️ CONTINUATION: Response was cut off. Generate ONLY the remaining content that comes AFTER the reference elements below.\n\n"
|
||||
|
||||
# Add cut-off point information using centralized jsonContinuation contexts
|
||||
# These are shown ONLY as REFERENCE to know where generation stopped
|
||||
if hierarchy_context:
|
||||
continuationText += "# REFERENCE: Structure context (already delivered - DO NOT repeat):\n"
|
||||
continuationText += f"{hierarchy_context}\n\n"
|
||||
|
||||
if overlap_context:
|
||||
continuationText += "# REFERENCE: Overlap context - incomplete element at cut point (DO NOT repeat):\n"
|
||||
continuationText += f"{overlap_context}\n\n"
|
||||
|
||||
continuationText += "⚠️ CRITICAL: The elements above are REFERENCE ONLY. They are already delivered.\n"
|
||||
continuationText += "Generate ONLY what comes AFTER these elements. DO NOT regenerate the entire JSON structure.\n"
|
||||
continuationText += "Start directly with the next element/section that should follow.\n\n"
|
||||
|
||||
# PROMPT FOR CONTINUATION
|
||||
generationPrompt = f"""{'='*80}
|
||||
USER REQUEST / USER PROMPT:
|
||||
{'='*80}
|
||||
{userPrompt}
|
||||
{'='*80}
|
||||
END OF USER REQUEST / USER PROMPT
|
||||
{'='*80}
|
||||
|
||||
⚠️ CONTINUATION MODE: Response was incomplete. Generate ONLY the remaining content.
|
||||
|
||||
LANGUAGE REQUIREMENT: All generated content must be in the language '{userLanguage}'. Generate all text, headings, paragraphs, and content in this language.
|
||||
|
||||
{continuationText}
|
||||
|
||||
JSON structure template:
|
||||
{jsonTemplate}
|
||||
|
||||
Rules:
|
||||
- Return ONLY valid JSON (no comments, no trailing commas, double quotes only).
|
||||
- Reference elements shown above are ALREADY DELIVERED - DO NOT repeat them.
|
||||
- Generate ONLY the remaining content that comes AFTER the reference elements.
|
||||
- DO NOT regenerate the entire JSON structure - start directly with what comes next.
|
||||
- All content must be in the language '{userLanguage}'.
|
||||
- Output JSON only; no markdown fences or extra text.
|
||||
|
||||
Continue generating the remaining content now.
|
||||
"""
|
||||
else:
|
||||
|
||||
# PROMPT FOR FIRST CALL
|
||||
# Structure: User request + Extracted content FIRST (if available), then JSON template, then instructions
|
||||
|
||||
# ARCHITECTURE: If useContentParts=True, don't include full content in prompt
|
||||
# ContentParts will be passed directly to callAi for model-aware chunking
|
||||
if extracted_content and not useContentParts:
|
||||
# If we have extracted content, put it FIRST and make it very clear it's the source data
|
||||
generationPrompt = f"""{'='*80}
|
||||
USER REQUEST / USER PROMPT:
|
||||
{'='*80}
|
||||
{userPrompt}
|
||||
{'='*80}
|
||||
END OF USER REQUEST / USER PROMPT
|
||||
{'='*80}
|
||||
|
||||
{'='*80}
|
||||
⚠️ CRITICAL: USE THIS EXTRACTED CONTENT AS YOUR DATA SOURCE ⚠️
|
||||
{'='*80}
|
||||
The content below contains the ACTUAL DATA extracted from the source documents.
|
||||
You MUST use this data - DO NOT generate fake or example data.
|
||||
{'='*80}
|
||||
EXTRACTED CONTENT FROM DOCUMENTS:
|
||||
{'='*80}
|
||||
{extracted_content}
|
||||
{'='*80}
|
||||
END OF EXTRACTED CONTENT
|
||||
{'='*80}
|
||||
|
||||
LANGUAGE REQUIREMENT: All generated content must be in the language '{userLanguage}'. Generate all text, headings, paragraphs, and content in this language. If the extracted content is in a different language, translate it to '{userLanguage}' while preserving the structure and meaning.
|
||||
|
||||
Generate a VALID JSON response using the EXTRACTED CONTENT above as your data source.
|
||||
The JSON structure template below shows ONLY the structure pattern - the example values are NOT real data.
|
||||
You MUST use the actual data from EXTRACTED CONTENT above, NOT the example values from the template.
|
||||
|
||||
JSON structure template (structure only - use data from EXTRACTED CONTENT above):
|
||||
{jsonTemplate}
|
||||
|
||||
Instructions:
|
||||
- Return ONLY valid JSON (strict). No comments. No trailing commas. Use double quotes.
|
||||
- Do NOT reuse example section IDs; create your own.
|
||||
- CRITICAL: Use the ACTUAL DATA from EXTRACTED CONTENT above, NOT the example values from the template.
|
||||
- Generate complete content based on the user request and the extracted content. Do NOT just give an instruction or comments. Deliver the complete response.
|
||||
- All content must be in the language '{userLanguage}'.
|
||||
- IMPORTANT: Set a meaningful "filename" in each document with appropriate file extension (e.g., "prime_numbers.txt", "report.docx", "data.json"). The filename should reflect the content and task objective.
|
||||
- Output JSON only; no markdown fences or extra text.
|
||||
|
||||
Generate your complete response using the extracted content data.
|
||||
"""
|
||||
else:
|
||||
# No extracted content - generate from scratch
|
||||
generationPrompt = f"""{'='*80}
|
||||
USER REQUEST / USER PROMPT:
|
||||
{'='*80}
|
||||
{userPrompt}
|
||||
{'='*80}
|
||||
END OF USER REQUEST / USER PROMPT
|
||||
{'='*80}
|
||||
|
||||
LANGUAGE REQUIREMENT: All generated content must be in the language '{userLanguage}'. Generate all text, headings, paragraphs, and content in this language.
|
||||
|
||||
Generate a VALID JSON response for the user request. The template below shows ONLY the structure pattern - it is NOT existing content.
|
||||
|
||||
JSON structure template:
|
||||
{jsonTemplate}
|
||||
|
||||
Instructions:
|
||||
- Return ONLY valid JSON (strict). No comments. No trailing commas. Use double quotes.
|
||||
- Do NOT reuse example section IDs; create your own.
|
||||
- Generate complete content based on the user request. Do NOT just give an instruction or comments. Deliver the complete response.
|
||||
- All content must be in the language '{userLanguage}'.
|
||||
- IMPORTANT: Set a meaningful "filename" in each document with appropriate file extension (e.g., "prime_numbers.txt", "report.docx", "data.json"). The filename should reflect the content and task objective.
|
||||
- Output JSON only; no markdown fences or extra text.
|
||||
|
||||
Generate your complete response.
|
||||
"""
|
||||
|
||||
return generationPrompt.strip()
|
||||
|
||||
|
|
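Note on the prompt builder above: `buildGenerationPrompt` picks one of three prompt variants. It builds a continuation prompt when the context carries a non-empty `last_raw_json` other than `"{}"`, a first-call prompt with the extracted content embedded when content is present and `useContentParts` is false, and a from-scratch prompt otherwise. The sketch below restates only that branching. `has_continuation` and `select_prompt_mode` are hypothetical helper names, and the returned mode strings are labels chosen for this example.

```python
from typing import Any, Dict, Optional


def has_continuation(context: Optional[Dict[str, Any]]) -> bool:
    """True when there is a raw JSON fragment to continue from.

    Mirrors the hasContinuation expression in the diff: the fragment must be
    non-empty and must not be just "{}" after stripping whitespace.
    """
    if not context:
        return False
    raw = context.get("last_raw_json", "")
    return bool(raw) and raw.strip() != "{}"


def select_prompt_mode(
    context: Optional[Dict[str, Any]],
    extracted_content: Optional[str],
    use_content_parts: bool = False,
) -> str:
    """Return a label (assumed names) for the prompt variant that would be built."""
    if has_continuation(context):
        return "continuation"
    if extracted_content and not use_content_parts:
        return "first_call_with_content"
    return "first_call_from_scratch"


# A truncated response still counts as continuation material.
truncated = {"last_raw_json": '{"documents": [{"sections": ['}
print(select_prompt_mode(truncated, None))
print(select_prompt_mode(None, "table data"))
print(select_prompt_mode(None, "table data", use_content_parts=True))
```

The continuation check deliberately does not require the fragment to parse, which matches the comment in the diff about allowing continuation when the JSON is too broken to extract sections.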
@ -0,0 +1,540 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Structure Generator for hierarchical document generation.
|
||||
Generates document skeleton with section placeholders.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import json
|
||||
from typing import Dict, Any, Optional, List
|
||||
from modules.datamodels.datamodelJson import jsonTemplateDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class StructureGenerator:
|
||||
"""Generates document structure with section placeholders"""
|
||||
|
||||
def __init__(self, services: Any):
|
||||
self.services = services
|
||||
|
||||
async def generateStructure(
|
||||
self,
|
||||
userPrompt: str,
|
||||
documentList: Optional[Any] = None,
|
||||
cachedContent: Optional[Dict[str, Any]] = None,
|
||||
contentParts: Optional[List[Any]] = None,
|
||||
maxSectionLength: int = 500,
|
||||
existingImages: Optional[List[Dict[str, Any]]] = None
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Generate document structure with sections.
|
||||
|
||||
Args:
|
||||
userPrompt: User's original prompt
|
||||
documentList: Optional document references
|
||||
cachedContent: Optional extracted content cache
|
||||
contentParts: Optional list of ContentParts to analyze for structure generation
|
||||
maxSectionLength: Maximum words for simple sections
|
||||
existingImages: Optional list of existing images to include
|
||||
|
||||
Returns:
|
||||
Document structure with empty elements arrays and contentPartIds per section
|
||||
"""
|
||||
try:
|
||||
# Create structure generation prompt
|
||||
structurePrompt = self._createStructurePrompt(
|
||||
userPrompt=userPrompt,
|
||||
cachedContent=cachedContent,
|
||||
contentParts=contentParts,
|
||||
maxSectionLength=maxSectionLength,
|
||||
existingImages=existingImages or []
|
||||
)
|
||||
|
||||
# Debug: Log structure generation prompt (harmonized - no checks needed)
|
||||
self.services.utils.writeDebugFile(
|
||||
structurePrompt,
|
||||
"document_generation_structure_prompt"
|
||||
)
|
||||
|
||||
# Call AI to generate structure
|
||||
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum
|
||||
|
||||
options = AiCallOptions(
|
||||
operationType=OperationTypeEnum.DATA_GENERATE,
|
||||
resultFormat="json"
|
||||
)
|
||||
|
||||
aiResponse = await self.services.ai.callAiContent(
|
||||
prompt=structurePrompt,
|
||||
options=options,
|
||||
outputFormat="json"
|
||||
)
|
||||
|
||||
# Debug: Log structure generation response (harmonized - no checks needed)
|
||||
self.services.utils.writeDebugFile(
|
||||
aiResponse.content if aiResponse and aiResponse.content else '',
|
||||
"document_generation_structure_response"
|
||||
)
|
||||
|
||||
if not aiResponse or not aiResponse.content:
|
||||
raise ValueError("AI structure generation returned empty response")
|
||||
|
||||
# Extract and parse JSON
|
||||
extractedJson = self.services.utils.jsonExtractString(aiResponse.content)
|
||||
if not extractedJson:
|
||||
raise ValueError("No JSON found in AI structure response")
|
||||
|
||||
structure = json.loads(extractedJson)
|
||||
|
||||
# Validate and enhance structure
|
||||
structure = self._validateAndEnhanceStructure(structure, maxSectionLength)
|
||||
|
||||
return structure
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating structure: {str(e)}")
|
||||
raise
|
||||
|
||||
def _createStructurePrompt(
|
||||
self,
|
||||
userPrompt: str,
|
||||
cachedContent: Optional[Dict[str, Any]] = None,
|
||||
contentParts: Optional[List[Any]] = None,
|
||||
maxSectionLength: int = 500,
|
||||
existingImages: Optional[List[Dict[str, Any]]] = None
|
||||
) -> str:
|
||||
"""
|
||||
Create prompt for structure generation.
|
||||
"""
|
||||
# Get user language
|
||||
userLanguage = self._getUserLanguage()
|
||||
|
||||
# Format cached content if available
|
||||
cachedContentText = ""
|
||||
if cachedContent and cachedContent.get("extractedContent"):
|
||||
cachedContentText = self._formatCachedContent(cachedContent)
|
||||
|
||||
# Use provided existingImages or extract from cachedContent
|
||||
if existingImages is None:
|
||||
existingImages = []
|
||||
if cachedContent and cachedContent.get("imageDocuments"):
|
||||
existingImages = cachedContent.get("imageDocuments", [])
|
||||
|
||||
# Format ContentParts as JSON for structure generation
|
||||
contentPartsJson = ""
|
||||
if contentParts:
|
||||
try:
|
||||
# Convert ContentParts to dict format for JSON serialization
|
||||
contentPartsList = []
|
||||
for part in contentParts:
|
||||
if hasattr(part, 'dict'):
|
||||
partDict = part.dict()
|
||||
elif isinstance(part, dict):
|
||||
partDict = part
|
||||
else:
|
||||
# Try to convert to dict
|
||||
partDict = {
|
||||
"id": getattr(part, 'id', ''),
|
||||
"typeGroup": getattr(part, 'typeGroup', ''),
|
||||
"mimeType": getattr(part, 'mimeType', ''),
|
||||
"label": getattr(part, 'label', ''),
|
||||
"metadata": getattr(part, 'metadata', {})
|
||||
}
|
||||
# Only include essential fields for structure generation (not full data)
|
||||
contentPartsList.append({
|
||||
"id": partDict.get("id", ""),
|
||||
"typeGroup": partDict.get("typeGroup", ""),
|
||||
"mimeType": partDict.get("mimeType", ""),
|
||||
"label": partDict.get("label", ""),
|
||||
"metadata": partDict.get("metadata", {})
|
||||
})
|
||||
|
||||
contentPartsJson = json.dumps(contentPartsList, indent=2, ensure_ascii=False)
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not format ContentParts as JSON: {str(e)}")
|
||||
contentPartsJson = ""
|
||||
|
||||
# Create structure template
|
||||
structureTemplate = jsonTemplateDocument.replace("{{DOCUMENT_TITLE}}", "Document Title")
|
||||
|
||||
prompt = f"""{'='*80}
|
||||
USER REQUEST:
|
||||
{'='*80}
|
||||
{userPrompt}
|
||||
{'='*80}
|
||||
|
||||
TASK: Generate a document STRUCTURE (skeleton) with sections.
|
||||
Do NOT generate actual content yet - only the structure.
|
||||
|
||||
{'='*80}
|
||||
EXTRACTED CONTENT (if available):
|
||||
{'='*80}
|
||||
{cachedContentText if cachedContentText else "No source documents provided."}
|
||||
{'='*80}
|
||||
|
||||
INSTRUCTIONS:
|
||||
1. Analyze the user request, extracted content, and available ContentParts
|
||||
2. Create a document structure with CONTENT sections only
|
||||
3. For each section, specify:
|
||||
- id: Unique identifier (e.g., "section_title_1", "section_image_1")
|
||||
- content_type: "heading" | "paragraph" | "image" | "table" | "bullet_list" | "code_block"
|
||||
- complexity: "simple" (can generate directly) or "complex" (needs sub-prompt)
|
||||
- generation_hint: Brief description of what content should be generated
|
||||
- contentPartIds: Array of ContentPart IDs that should be used for this section (e.g., ["part_1", "part_2"]) - can be empty []
|
||||
- extractionPrompt: (optional) Specific prompt for extracting/processing ContentParts for this section
|
||||
- image_prompt: (only for image sections) Detailed prompt for image generation
|
||||
- order: Section order number (starting from 1)
|
||||
- elements: [] (empty array - will be populated later)
|
||||
|
||||
4. Identify image sections:
|
||||
- If user requests illustrations/images, create image sections
|
||||
- If existing images are provided in documentList (check EXISTING IMAGES section below), create image sections that reference them
|
||||
- Add image_prompt field with detailed description for image generation (only for new images)
|
||||
- Set complexity to "complex" for new images, "simple" for existing/render images
|
||||
- For existing images: Set image_source to "existing" and image_reference_id to the image document ID
|
||||
- For images to render (from input documents): Set image_source to "render" and image_reference_id to the image document ID
|
||||
- Example for new image: {{"id": "section_image_1", "content_type": "image", "complexity": "complex", "generation_hint": "Illustration for chapter 1", "image_prompt": "A detailed description for image generation", "order": 2, "elements": []}}
|
||||
- Example for existing image: {{"id": "section_image_1", "content_type": "image", "complexity": "simple", "generation_hint": "Include provided image", "image_source": "existing", "image_reference_id": "doc_id_here", "order": 2, "elements": []}}
|
||||
- Example for render image: {{"id": "section_image_1", "content_type": "image", "complexity": "simple", "generation_hint": "Render input image", "image_source": "render", "image_reference_id": "doc_id_here", "order": 2, "elements": []}}
|
||||
|
||||
{'='*80}
|
||||
EXISTING IMAGES (to include in document):
|
||||
{'='*80}
|
||||
{self._formatExistingImages(existingImages) if existingImages else "No existing images provided."}
|
||||
{'='*80}
|
||||
|
||||
5. Identify complex text sections:
|
||||
- Long chapters (>{maxSectionLength} words expected) should be marked as "complex"
|
||||
- Short paragraphs/headings should be "simple"
|
||||
|
||||
6. Return ONLY valid JSON following this structure:
|
||||
{structureTemplate}
|
||||
|
||||
7. CRITICAL RULES FOR CONTENT PARTS:
|
||||
- Analyze available ContentParts and determine which ones are needed for each section
|
||||
- For image sections (content_type == "image"): Include image ContentParts in contentPartIds - images will be integrated as visual elements
|
||||
- For other sections (heading, paragraph, etc.): If image ContentParts are referenced, they will be referenced as text in the document language (not integrated as images)
|
||||
- Each section can reference multiple ContentParts via contentPartIds array
|
||||
- If specific extraction/processing is needed for ContentParts, provide extractionPrompt
|
||||
- Image references in non-image sections should be automatically derived in the document language (e.g., "siehe Bild 1" in German, "see Image 1" in English)
|
||||
|
||||
8. CRITICAL RULES:
|
||||
- Return ONLY valid JSON (no comments, no trailing commas, double quotes only)
|
||||
- Follow the exact JSON schema structure provided
|
||||
- IMPORTANT: All sections MUST have empty elements arrays: "elements": [] (the template shows examples with content, but you must use empty arrays)
|
||||
- ALL sections MUST include "generation_hint" field with a brief description of what content should be generated
|
||||
- ALL sections MUST include "complexity" field: "simple" for short content, "complex" for long chapters/images
|
||||
- ALL sections MUST include "contentPartIds" field (can be empty array [] if no ContentParts needed)
|
||||
- Image sections MUST include "image_prompt" field with detailed description for image generation
|
||||
- Order numbers MUST start from 1 (not 0)
|
||||
- All content must be in the language '{userLanguage}'
|
||||
- Do NOT generate actual content - only structure (skeleton)
|
||||
- Use only supported content_type values: "heading", "paragraph", "image", "table", "bullet_list", "code_block"
|
||||
|
||||
Return ONLY the JSON structure. No explanations.
|
||||
"""
|
||||
return prompt
|
||||
|
||||
def _validateAndEnhanceStructure(
|
||||
self,
|
||||
structure: Dict[str, Any],
|
||||
maxSectionLength: int
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Validate structure and enhance with complexity identification.
|
||||
"""
|
||||
try:
|
||||
# Ensure structure has required fields
|
||||
if "documents" not in structure:
|
||||
if "sections" in structure:
|
||||
# Convert single-document format to multi-document format
|
||||
structure = {
|
||||
"metadata": structure.get("metadata", {}),
|
||||
"documents": [{
|
||||
"id": "doc_1",
|
||||
"title": structure.get("metadata", {}).get("title", "Document"),
|
||||
"filename": "document.json",
|
||||
"sections": structure.get("sections", [])
|
||||
}]
|
||||
}
|
||||
else:
|
||||
raise ValueError("Structure missing 'documents' or 'sections' field")
|
||||
|
||||
# Process each document
|
||||
for doc in structure.get("documents", []):
|
||||
sections = doc.get("sections", [])
|
||||
|
||||
# Process and validate sections according to standardized schema
|
||||
for idx, section in enumerate(sections):
|
||||
# Ensure required fields
|
||||
if "id" not in section:
|
||||
section["id"] = f"section_{idx + 1}"
|
||||
|
||||
sectionId = section.get("id", "")
|
||||
section["order"] = idx + 1
|
||||
|
||||
if "elements" not in section:
|
||||
section["elements"] = []
|
||||
|
||||
# Ensure contentPartIds field exists (can be empty array)
|
||||
if "contentPartIds" not in section:
|
||||
section["contentPartIds"] = []
|
||||
|
||||
# Ensure extractionPrompt field exists (optional)
|
||||
if "extractionPrompt" not in section:
|
||||
section["extractionPrompt"] = None
|
||||
|
||||
# Identify complexity if not set
|
||||
if "complexity" not in section:
|
||||
section["complexity"] = self._identifySectionComplexity(
|
||||
section,
|
||||
maxSectionLength
|
||||
)
|
||||
|
||||
# Ensure generation_hint exists (required for content generation)
|
||||
if "generation_hint" not in section or not section.get("generation_hint"):
|
||||
# Create meaningful generation hint from section id or content type
|
||||
contentType = section.get("content_type", "")
|
||||
|
||||
# Extract meaningful hint from section ID
|
||||
meaningfulHint = self._extractMeaningfulHint(sectionId, contentType, section.get("elements", []))
|
||||
section["generation_hint"] = meaningfulHint
|
||||
|
||||
# Ensure image sections have proper configuration
|
||||
if section.get("content_type") == "image":
|
||||
imageSource = section.get("image_source", "generate")
|
||||
|
||||
if imageSource in ("existing", "render"):
|
||||
# Existing or render image - ensure image_reference_id is set
|
||||
if "image_reference_id" not in section:
|
||||
logger.warning(f"Image section {sectionId} has image_source='{imageSource}' but no image_reference_id")
|
||||
# Existing/render images are simple (no generation needed, code integration)
|
||||
section["complexity"] = "simple"
|
||||
else:
|
||||
# New image generation - ensure image_prompt
|
||||
if "image_prompt" not in section or not section.get("image_prompt"):
|
||||
# Try to extract from generation_hint
|
||||
generationHint = section.get("generation_hint", "")
|
||||
if generationHint:
|
||||
# Enhance generation_hint to be a proper image prompt
|
||||
section["image_prompt"] = self._enhanceImagePrompt(generationHint)
|
||||
else:
|
||||
# Create default based on document context
|
||||
docTitle = doc.get("title", "Document")
|
||||
section["image_prompt"] = f"Generate an illustration for: {docTitle}"
|
||||
|
||||
# Ensure complexity is set to complex for new image generation
|
||||
section["complexity"] = "complex"
|
||||
|
||||
return structure
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error validating structure: {str(e)}")
|
||||
raise
|
||||
|
||||
def _identifySectionComplexity(
|
||||
self,
|
||||
section: Dict[str, Any],
|
||||
maxSectionLength: int
|
||||
) -> str:
|
||||
"""
|
||||
Identify if section is simple or complex.
|
||||
|
||||
Rules:
|
||||
- Images: always complex
|
||||
- Long content (generation_hint contains a length keyword such as 'chapter' or 'detailed'): complex; maxSectionLength is currently not evaluated
|
||||
- Others: simple
|
||||
"""
|
||||
contentType = section.get("content_type", "")
|
||||
|
||||
# Images are always complex
|
||||
if contentType == "image":
|
||||
return "complex"
|
||||
|
||||
# Check generation_hint for length indicators
|
||||
generationHint = (section.get("generation_hint") or "").lower()  # tolerate a missing or None hint
|
||||
|
||||
# Keywords indicating long content
|
||||
longContentKeywords = [
|
||||
"chapter", "long", "detailed", "comprehensive",
|
||||
"extensive", "full", "complete story"
|
||||
]
|
||||
|
||||
if any(keyword in generationHint for keyword in longContentKeywords):
|
||||
return "complex"
|
||||
|
||||
# Default to simple
|
||||
return "simple"
|
||||
|
||||
def _extractMeaningfulHint(
|
||||
self,
|
||||
sectionId: str,
|
||||
contentType: str,
|
||||
elements: List[Any]
|
||||
) -> str:
|
||||
"""
|
||||
Extract meaningful generation hint from section ID, content type, or elements.
|
||||
|
||||
Args:
|
||||
sectionId: Section identifier (e.g., "section_heading_current_state")
|
||||
contentType: Content type (e.g., "heading", "paragraph")
|
||||
elements: Existing elements if any
|
||||
|
||||
Returns:
|
||||
Meaningful generation hint string
|
||||
"""
|
||||
sectionIdLower = sectionId.lower()
|
||||
|
||||
# Try to extract text from existing elements first (most accurate)
|
||||
if elements and isinstance(elements, list) and len(elements) > 0:
|
||||
firstElement = elements[0]
|
||||
if isinstance(firstElement, dict):
|
||||
if "text" in firstElement and firstElement["text"]:
|
||||
if contentType == "heading":
|
||||
return firstElement["text"]
|
||||
elif contentType == "paragraph":
|
||||
return f"Content paragraph: {firstElement['text'][:50]}..."
|
||||
|
||||
# Extract meaningful text from section ID
|
||||
# Remove common prefixes: "section_", "section_heading_", "section_paragraph_", etc.
|
||||
meaningfulPart = sectionId
|
||||
for prefix in ["section_heading_", "section_paragraph_", "section_bullet_list_",
|
||||
"section_code_block_", "section_image_", "section_"]:
|
||||
if meaningfulPart.lower().startswith(prefix):
|
||||
meaningfulPart = meaningfulPart[len(prefix):]
|
||||
break
|
||||
|
||||
# Convert snake_case to Title Case
|
||||
# e.g., "current_state" -> "Current State"
|
||||
words = meaningfulPart.replace("_", " ").split()
|
||||
titleCase = " ".join(word.capitalize() for word in words if word)
|
||||
|
||||
# Handle special cases
|
||||
if "introduction" in sectionIdLower or "intro" in sectionIdLower:
|
||||
return "Introduction paragraph"
|
||||
elif "conclusion" in sectionIdLower:
|
||||
return "Conclusion paragraph"
|
||||
elif "footer" in sectionIdLower or "copyright" in sectionIdLower:
|
||||
return "Footer content"
|
||||
elif "title" in sectionIdLower and "main" in sectionIdLower:
|
||||
# Main title - try to get from document title or use generic
|
||||
return "Main document title"
|
||||
|
||||
# Create hint based on content type and extracted text
|
||||
if contentType == "heading":
|
||||
if titleCase:
|
||||
return titleCase
|
||||
else:
|
||||
return "Section heading"
|
||||
elif contentType == "paragraph":
|
||||
if titleCase:
|
||||
return f"Content paragraph about {titleCase.lower()}"
|
||||
else:
|
||||
return "Content paragraph"
|
||||
elif contentType == "bullet_list":
|
||||
if titleCase:
|
||||
return f"Bullet list: {titleCase.lower()}"
|
||||
else:
|
||||
return "Bullet list items"
|
||||
elif contentType == "code_block":
|
||||
return "Code content"
|
||||
else:
|
||||
if titleCase:
|
||||
return f"Content for {titleCase.lower()}"
|
||||
else:
|
||||
return f"Content for {contentType} section"
|
||||
|
||||
def _extractImagePrompts(
|
||||
self,
|
||||
structure: Dict[str, Any]
|
||||
) -> Dict[str, str]:
|
||||
"""
|
||||
Extract image generation prompts from structure.
|
||||
Maps section_id -> image_prompt
|
||||
"""
|
||||
imagePrompts = {}
|
||||
|
||||
for doc in structure.get("documents", []):
|
||||
for section in doc.get("sections", []):
|
||||
if section.get("content_type") == "image":
|
||||
sectionId = section.get("id")
|
||||
imagePrompt = section.get("image_prompt")
|
||||
if sectionId and imagePrompt:
|
||||
imagePrompts[sectionId] = imagePrompt
|
||||
|
||||
return imagePrompts
|
||||
|
||||
def _formatCachedContent(
|
||||
self,
|
||||
cachedContent: Dict[str, Any]
|
||||
) -> str:
|
||||
"""
|
||||
Format cached content for prompt inclusion.
|
||||
"""
|
||||
try:
|
||||
extractedContent = cachedContent.get("extractedContent", [])
|
||||
if not extractedContent:
|
||||
return "No content extracted."
|
||||
|
||||
# Format ContentPart objects
|
||||
formattedParts = []
|
||||
for extracted in extractedContent:
|
||||
if hasattr(extracted, 'parts'):
|
||||
for part in extracted.parts:
|
||||
if hasattr(part, 'content'):
|
||||
formattedParts.append(part.content)
|
||||
elif isinstance(extracted, dict):
|
||||
formattedParts.append(str(extracted))
|
||||
else:
|
||||
formattedParts.append(str(extracted))
|
||||
|
||||
return "\n\n".join(formattedParts) if formattedParts else "No content extracted."
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Error formatting cached content: {str(e)}")
|
||||
return "Error formatting cached content."
|
||||
|
||||
def _enhanceImagePrompt(self, generationHint: str) -> str:
|
||||
"""
|
||||
Enhance generation hint to be a proper image generation prompt.
|
||||
Adds visual details and style guidance if missing.
|
||||
"""
|
||||
# If hint already contains visual details, use as-is
|
||||
visualKeywords = ["illustration", "image", "picture", "visual", "depict", "show", "drawing"]
|
||||
if any(keyword.lower() in generationHint.lower() for keyword in visualKeywords):
|
||||
return generationHint
|
||||
|
||||
# Enhance with visual description
|
||||
enhanced = f"Create a professional illustration: {generationHint}"
|
||||
return enhanced
|
||||
|
||||
def _formatExistingImages(self, imageDocuments: List[Dict[str, Any]]) -> str:
|
||||
"""Format existing images list for prompt inclusion"""
|
||||
if not imageDocuments:
|
||||
return "No existing images provided."
|
||||
|
||||
formatted = []
|
||||
for i, imgDoc in enumerate(imageDocuments, 1):
|
||||
formatted.append(f"{i}. Image ID: {imgDoc.get('id')}")
|
||||
formatted.append(f" File Name: {imgDoc.get('fileName', 'Unknown')}")
|
||||
formatted.append(f" MIME Type: {imgDoc.get('mimeType', 'Unknown')}")
|
||||
formatted.append(f" Alt Text: {imgDoc.get('altText', 'Image')}")
|
||||
formatted.append("")
|
||||
|
||||
return "\n".join(formatted)
|
||||
|
||||
def _getUserLanguage(self) -> str:
|
||||
"""Get user language for document generation"""
|
||||
try:
|
||||
if self.services:
|
||||
if hasattr(self.services, 'currentUserLanguage') and self.services.currentUserLanguage:
|
||||
return self.services.currentUserLanguage
|
||||
elif hasattr(self.services, 'user') and self.services.user and hasattr(self.services.user, 'language'):
|
||||
return self.services.user.language
|
||||
except Exception:
|
||||
pass
|
||||
return 'en' # Default fallback
|
||||
|
||||
|
|
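The normalization performed by `_validateAndEnhanceStructure` (wrapping a single-document structure into the multi-document format and filling per-section defaults) can be sketched in isolation. This is a minimal standalone sketch, not the service's code: `normalizeStructure` is a hypothetical name, and complexity, generation hints and image handling are deliberately left out.

```python
from typing import Any, Dict


def normalizeStructure(structure: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a single-document structure and fill per-section defaults."""
    if "documents" not in structure:
        if "sections" not in structure:
            raise ValueError("Structure missing 'documents' or 'sections' field")
        metadata = structure.get("metadata", {})
        # Convert the single-document format to the multi-document format.
        structure = {
            "metadata": metadata,
            "documents": [{
                "id": "doc_1",
                "title": metadata.get("title", "Document"),
                "filename": "document.json",
                "sections": structure["sections"],
            }],
        }

    for doc in structure["documents"]:
        for idx, section in enumerate(doc.get("sections", [])):
            section.setdefault("id", f"section_{idx + 1}")
            section["order"] = idx + 1  # order numbers start at 1, not 0
            section.setdefault("elements", [])
            section.setdefault("contentPartIds", [])
            section.setdefault("extractionPrompt", None)
    return structure


example = normalizeStructure({
    "metadata": {"title": "Report"},
    "sections": [
        {"content_type": "heading"},
        {"id": "section_intro", "content_type": "paragraph"},
    ],
})
```

Using `setdefault` keeps values the model already supplied and only fills gaps, while `order` is always overwritten so numbering stays contiguous.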
@ -0,0 +1,7 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""Messaging service for the service center."""
|
||||
|
||||
from .mainServiceMessaging import MessagingService
|
||||
|
||||
__all__ = ["MessagingService"]
|
||||
|
|
@ -0,0 +1,368 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Messaging service for sending messages across different channels.
|
||||
Provides subscription-based messaging functionality.
|
||||
|
||||
Supports both service center (context, get_service) and legacy (services) initialization.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import re
|
||||
from typing import List, Optional, Callable, Any
|
||||
from modules.datamodels.datamodelMessaging import (
|
||||
MessagingSubscription,
|
||||
MessagingSubscriptionRegistration,
|
||||
MessagingDelivery,
|
||||
MessagingChannel,
|
||||
MessagingEventParameters,
|
||||
MessagingSendResult,
|
||||
MessagingSubscriptionExecutionResult,
|
||||
DeliveryStatus
|
||||
)
|
||||
from modules.interfaces.interfaceMessaging import getInterface as getMessagingInterface
|
||||
from modules.shared.timeUtils import getUtcTimestamp
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class _ServicesAdapter:
|
||||
"""Minimal adapter providing interfaceDbComponent for service center mode."""
|
||||
|
||||
def __init__(self, context: Any):
|
||||
from modules.interfaces.interfaceDbManagement import getInterface as getComponentInterface
|
||||
self.interfaceDbComponent = getComponentInterface(
|
||||
context.user,
|
||||
mandateId=context.mandate_id
|
||||
)
|
||||
|
||||
|
||||
class MessagingService:
|
||||
"""
|
||||
Messaging service providing subscription-based messaging functionality.
|
||||
"""
|
||||
|
||||
def __init__(self, context_or_services: Any, get_service: Optional[Callable[[str], Any]] = None):
|
||||
"""Initialize messaging service.
|
||||
|
||||
Args:
|
||||
context_or_services: ServiceCenterContext (when get_service is callable) or legacy Services hub
|
||||
get_service: Callable to resolve services (service center mode only)
|
||||
"""
|
||||
if get_service is not None and callable(get_service):
|
||||
# Service center: (context, get_service)
|
||||
self.services = _ServicesAdapter(context_or_services)
|
||||
else:
|
||||
# Legacy: (services,)
|
||||
self.services = context_or_services
|
||||
self._messagingInterface = None
|
||||
|
||||
def sendMessage(
|
||||
self,
|
||||
subject: str,
|
||||
message: str,
|
||||
registration: MessagingSubscriptionRegistration
|
||||
) -> MessagingSendResult:
|
||||
"""
|
||||
Send a message to a user over a channel.
|
||||
Creates a MessagingDelivery record.
|
||||
|
||||
Args:
|
||||
subject: Message subject (used for email, empty for SMS)
|
||||
message: Message text
|
||||
registration: MessagingSubscriptionRegistration with channel info and userId
|
||||
|
||||
Returns:
|
||||
MessagingSendResult with status and delivery ID
|
||||
"""
|
||||
# Create the delivery record
|
||||
delivery = MessagingDelivery(
|
||||
subscriptionId=registration.subscriptionId,
|
||||
userId=registration.userId,
|
||||
channel=registration.channel,
|
||||
status=DeliveryStatus.PENDING
|
||||
)
|
||||
|
||||
# Persist the delivery record
|
||||
try:
|
||||
deliveryRecord = self.services.interfaceDbComponent.createDelivery(delivery)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to create delivery record: {str(e)}")
|
||||
return MessagingSendResult(
|
||||
success=False,
|
||||
errorMessage=f"Failed to create delivery record: {str(e)}"
|
||||
)
|
||||
|
||||
try:
|
||||
# Convert plain text to HTML for email channel
|
||||
messageToSend = message
|
||||
if registration.channel == MessagingChannel.EMAIL:
|
||||
messageToSend = self._textToHtml(message)
|
||||
|
||||
# Send via interfaceMessaging
|
||||
success = self._getMessagingInterface().send(
|
||||
channel=registration.channel,
|
||||
recipient=registration.channelConfig,
|
||||
subject=subject,
|
||||
message=messageToSend
|
||||
)
|
||||
|
||||
if success:
|
||||
# Update Delivery Record
|
||||
self.services.interfaceDbComponent.updateDelivery(
|
||||
deliveryRecord["id"],
|
||||
{
|
||||
"status": DeliveryStatus.SENT,
|
||||
"sentAt": getUtcTimestamp()
|
||||
}
|
||||
)
|
||||
return MessagingSendResult(
|
||||
success=True,
|
||||
deliveryId=deliveryRecord["id"]
|
||||
)
|
||||
else:
|
||||
# Mark the delivery record as failed
|
||||
self.services.interfaceDbComponent.updateDelivery(
|
||||
deliveryRecord["id"],
|
||||
{
|
||||
"status": DeliveryStatus.FAILED,
|
||||
"errorMessage": "Failed to send message"
|
||||
}
|
||||
)
|
||||
return MessagingSendResult(
|
||||
success=False,
|
||||
deliveryId=deliveryRecord["id"],
|
||||
errorMessage="Failed to send message"
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"Error sending message: {str(e)}")
|
||||
# Mark the delivery record as failed
|
||||
try:
|
||||
self.services.interfaceDbComponent.updateDelivery(
|
||||
deliveryRecord["id"],
|
||||
{
|
||||
"status": DeliveryStatus.FAILED,
|
||||
"errorMessage": str(e)
|
||||
}
|
||||
)
|
||||
except Exception as updateError:
|
||||
logger.error(f"Failed to update delivery record: {str(updateError)}")
|
||||
|
||||
return MessagingSendResult(
|
||||
success=False,
|
||||
deliveryId=deliveryRecord["id"],
|
||||
errorMessage=str(e)
|
||||
)
|
||||
|
||||
def _textToHtml(self, text: str) -> str:
|
||||
"""
|
||||
Convert plain text to simple HTML for email display.
|
||||
|
||||
- Escapes HTML special characters
|
||||
- Converts newlines to <br> tags
|
||||
- Wraps URLs in clickable links
|
||||
- Wraps in a basic HTML structure with nice styling
|
||||
|
||||
Args:
|
||||
text: Plain text message
|
||||
|
||||
Returns:
|
||||
HTML formatted message
|
||||
"""
|
||||
import html
|
||||
|
||||
# Check if already HTML (contains HTML tags)
|
||||
if re.search(r'<[^>]+>', text):
|
||||
return text
|
||||
|
||||
# Escape HTML special characters
|
||||
escaped = html.escape(text)
|
||||
|
||||
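# Convert URLs to clickable links (before converting newlines).
|
||||
# html.escape() has already turned quotes and angle brackets into entities, so the URL must stop at those entities as well.
|
||||
urlPattern = r'(https?://(?:(?!&(?:quot|#x27|lt|gt);)[^\s<>"\'])+)'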
|
||||
escaped = re.sub(urlPattern, r'<a href="\1" style="color: #0066cc;">\1</a>', escaped)
|
||||
|
||||
# Convert newlines to <br> tags
|
||||
escaped = escaped.replace('\n', '<br>\n')
|
||||
|
||||
# Wrap in a nice HTML structure
|
||||
htmlContent = f"""<!DOCTYPE html>
|
||||
<html>
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<style>
|
||||
body {{
|
||||
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
|
||||
font-size: 14px;
|
||||
line-height: 1.6;
|
||||
color: #333333;
|
||||
max-width: 600px;
|
||||
margin: 0 auto;
|
||||
padding: 20px;
|
||||
}}
|
||||
a {{
|
||||
color: #0066cc;
|
||||
}}
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
{escaped}
|
||||
</body>
|
||||
</html>"""
|
||||
|
||||
return htmlContent
|
||||
|
||||
def sendEmailDirect(
|
||||
self,
|
||||
recipient: str,
|
||||
subject: str,
|
||||
message: str,
|
||||
userId: Optional[str] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Send email directly without requiring a subscription.
|
||||
Used for authentication flows (registration, password reset).
|
||||
|
||||
Plain text messages are automatically converted to HTML format.
|
||||
|
||||
Args:
|
||||
recipient: Email address of the recipient
|
||||
subject: Email subject
|
||||
message: Email body (can be HTML or plain text - plain text is auto-converted)
|
||||
userId: Optional user ID for logging/audit purposes
|
||||
|
||||
Returns:
|
||||
bool: True if email was sent successfully, False otherwise
|
||||
"""
|
||||
try:
|
||||
# Convert plain text to HTML if needed
|
||||
htmlMessage = self._textToHtml(message)
|
||||
|
||||
messagingInterface = self._getMessagingInterface()
|
||||
success = messagingInterface.send(
|
||||
channel=MessagingChannel.EMAIL,
|
||||
recipient=recipient,
|
||||
subject=subject,
|
||||
message=htmlMessage
|
||||
)
|
||||
|
||||
if success:
|
||||
logger.info(f"Email sent successfully to {recipient} (userId: {userId})")
|
||||
else:
|
||||
logger.warning(f"Failed to send email to {recipient} (userId: {userId})")
|
||||
|
||||
return success
|
||||
except Exception as e:
|
||||
logger.error(f"Error sending email to {recipient}: {str(e)}", exc_info=True)
|
||||
return False
|
||||
|
||||
def executeSubscription(
|
||||
self,
|
||||
subscriptionId: str,
|
||||
eventParameters: MessagingEventParameters
|
||||
) -> MessagingSubscriptionExecutionResult:
|
||||
"""
|
||||
Execute a subscription function.
|
||||
|
||||
Args:
|
||||
subscriptionId: ID of the subscription
|
||||
eventParameters: Parameters from the trigger (as a Pydantic model)
|
||||
|
||||
Returns:
|
||||
MessagingSubscriptionExecutionResult (a failed result with messagesSent=0 if the subscription is disabled)
|
||||
|
||||
Raises:
|
||||
ValueError: If the subscription does not exist
|
||||
FileNotFoundError: If the subscription function is not found
|
||||
"""
|
||||
# Check that the subscription exists and is enabled
|
||||
subscription = self.services.interfaceDbComponent.getSubscription(subscriptionId)
|
||||
if not subscription:
|
||||
raise ValueError(f"Subscription {subscriptionId} not found")
|
||||
if not subscription.enabled:
|
||||
logger.warning(f"Subscription {subscriptionId} is disabled, skipping execution")
|
||||
return MessagingSubscriptionExecutionResult(
|
||||
success=False,
|
||||
messagesSent=0,
|
||||
errorMessage="Subscription is disabled"
|
||||
)
|
||||
|
||||
# Fetch all active registrations for this subscription
|
||||
registrations = self._getSubscribers(subscriptionId)
|
||||
|
||||
if not registrations:
|
||||
logger.info(f"No active registrations for subscription {subscriptionId}")
|
||||
return MessagingSubscriptionExecutionResult(
|
||||
success=True,
|
||||
messagesSent=0
|
||||
)
|
||||
|
||||
# Load the subscription function dynamically
|
||||
subscriptionFunction = self._loadSubscriptionFunction(subscriptionId)
|
||||
if not subscriptionFunction:
|
||||
errorMsg = f"Subscription function not found for {subscriptionId}"
|
||||
logger.error(errorMsg)
|
||||
raise FileNotFoundError(errorMsg)
|
||||
|
||||
# Run the function with the registrations
|
||||
try:
|
||||
return subscriptionFunction.execute(eventParameters, registrations, self)
|
||||
except Exception as e:
|
||||
logger.error(f"Error executing subscription {subscriptionId}: {str(e)}", exc_info=True)
|
||||
return MessagingSubscriptionExecutionResult(
|
||||
success=False,
|
||||
messagesSent=0,
|
||||
errorMessage=str(e)
|
||||
)
|
||||
|
||||
def _getSubscribers(
|
||||
self,
|
||||
subscriptionId: str,
|
||||
channel: Optional[MessagingChannel] = None
|
||||
) -> List[MessagingSubscriptionRegistration]:
|
||||
"""Return all active subscribers of a subscription, optionally filtered by channel."""
|
||||
registrations = self.services.interfaceDbComponent.getAllRegistrations(
|
||||
subscriptionId=subscriptionId
|
||||
)
|
||||
|
||||
# Filter by enabled flag and channel
|
||||
filteredRegistrations = []
|
||||
for reg in registrations:
|
||||
if reg.enabled and (not channel or reg.channel == channel):
|
||||
filteredRegistrations.append(reg)
|
||||
|
||||
return filteredRegistrations
|
||||
|
||||
def _loadSubscriptionFunction(self, subscriptionId: str) -> Optional[Callable]:
|
||||
"""
|
||||
Load the subscription function dynamically.
|
||||
|
||||
Returns:
|
||||
Module exposing an execute() function, or None if the module is not found
|
||||
|
||||
Note:
|
||||
subscriptionId is used directly as part of the module name (e.g. "SystemErrors" -> subSubscriptionSystemErrors.py)
|
||||
"""
|
||||
# Format: subSubscription{subscriptionId}.py
|
||||
functionName = f"subSubscription{subscriptionId}"
|
||||
moduleName = f"modules.serviceCenter.services.serviceMessaging.subscriptions.{functionName}"
|
||||
|
||||
try:
|
||||
# Dynamic import
|
||||
import importlib
|
||||
subscriptionModule = importlib.import_module(moduleName)
|
||||
return subscriptionModule
|
||||
except ImportError:
|
||||
# Function does not exist yet; that is OK
|
||||
logger.debug(f"Subscription function {moduleName} not found (this is OK if not yet implemented)")
|
||||
return None
|
||||
|
||||
def _getMessagingInterface(self):
|
||||
"""Return the messaging interface (interfaceMessaging), creating it on first use."""
|
||||
if not self._messagingInterface:
|
||||
self._messagingInterface = getMessagingInterface()
|
||||
return self._messagingInterface
|
||||
|
|
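The escape-then-linkify order used by `_textToHtml` can be illustrated on its own. The following is a hedged sketch rather than the service's implementation: `textToHtmlFragment` and `URL_PATTERN` are hypothetical names, the HTML wrapper and styling are omitted, and the pattern is assumed to stop at the entities that `html.escape` produces for quotes and angle brackets.

```python
import html
import re

# Assumption: a URL ends at whitespace, at a raw quote/bracket, or at the
# escaped form of a quote/bracket (&quot; &#x27; &lt; &gt;).
URL_PATTERN = re.compile(r'(https?://(?:(?!&(?:quot|#x27|lt|gt);)[^\s<>"\'])+)')


def textToHtmlFragment(text: str) -> str:
    """Escape plain text, link URLs, then convert newlines to <br>."""
    escaped = html.escape(text)  # must run first so user text cannot inject tags
    linked = URL_PATTERN.sub(r'<a href="\1">\1</a>', escaped)
    return linked.replace("\n", "<br>\n")


fragment = textToHtmlFragment('Docs: "https://example.com/a?x=1&y=2"\nDone <ok>')
```

Escaping before linkifying means the only markup in the result is what the function itself adds; the `&amp;` inside the `href` is valid HTML for a literal ampersand.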
@ -0,0 +1,3 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""Subscription functions for the messaging service."""
|
||||
|
|
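Modules in this package are resolved by name at runtime (`subSubscription{subscriptionId}`). A minimal sketch of that lookup follows, under stated assumptions: `loadSubscriptionModule` and the package name `demo_subscriptions` are hypothetical, the in-memory module stands in for a real file, and the identifier check is an added precaution that the original loader does not perform.

```python
import importlib
import re
import sys
import types
from typing import Optional

# Hypothetical package name used only for this sketch.
PACKAGE = "demo_subscriptions"


def loadSubscriptionModule(subscriptionId: str) -> Optional[types.ModuleType]:
    """Return the module for a subscription ID, or None if it is missing."""
    # Accept plain identifiers only, so the ID cannot address other modules.
    if not re.fullmatch(r"[A-Za-z0-9_]+", subscriptionId):
        return None
    try:
        return importlib.import_module(f"{PACKAGE}.subSubscription{subscriptionId}")
    except ImportError:
        return None


# Register an in-memory package and module so the sketch runs without files.
package = types.ModuleType(PACKAGE)
package.__path__ = []  # an empty __path__ marks the module as a package
sys.modules[PACKAGE] = package

demo = types.ModuleType(f"{PACKAGE}.subSubscriptionSystemErrors")
demo.execute = lambda eventParameters, registrations, messagingService: len(registrations)
sys.modules[demo.__name__] = demo

found = loadSubscriptionModule("SystemErrors")
missing = loadSubscriptionModule("NotImplementedYet")
rejected = loadSubscriptionModule("../evil")
```

Returning `None` for a missing module lets the caller decide whether that is an error; in the service above, `executeSubscription` turns it into a `FileNotFoundError`.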
@ -0,0 +1,72 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Example subscription function for System Errors.
|
||||
This is a template that can be used as a reference for creating other subscription functions.
|
||||
"""
|
||||
|
||||
from typing import List
|
||||
from modules.datamodels.datamodelMessaging import (
|
||||
MessagingEventParameters,
|
||||
MessagingSubscriptionExecutionResult,
|
||||
MessagingSubscriptionRegistration,
|
||||
MessagingChannel
|
||||
)
|
||||
|
||||
|
||||
def execute(
|
||||
eventParameters: MessagingEventParameters,
|
||||
registrations: List[MessagingSubscriptionRegistration],
|
||||
messagingService
|
||||
) -> MessagingSubscriptionExecutionResult:
|
||||
"""
|
||||
Subscription function for system errors.
|
||||
Receives eventParameters from the trigger and the already fetched registrations.
|
||||
|
||||
Args:
|
||||
eventParameters: Event parameters from the trigger
|
||||
registrations: List of active registrations for this subscription
|
||||
messagingService: MessagingService instance
|
||||
|
||||
Returns:
|
||||
MessagingSubscriptionExecutionResult with status and number of messages sent
|
||||
"""
|
||||
# Group by channel
|
||||
emailRegistrations = [r for r in registrations if r.channel == MessagingChannel.EMAIL]
|
||||
smsRegistrations = [r for r in registrations if r.channel == MessagingChannel.SMS]
|
||||
|
||||
# Prepare messages (may differ per channel)
|
||||
triggerData = eventParameters.triggerData
|
||||
errors = triggerData.get('errors', [])
|
||||
timestamp = triggerData.get('timestamp', 'Unknown')
|
||||
|
||||
emailSubject = "System Error Report"
|
||||
emailMessage = f"System errors detected at {timestamp}:\n\n{errors}"
|
||||
|
||||
smsMessage = f"System Error: {len(errors)} errors detected at {timestamp}"
|
||||
|
||||
messagesSent = 0
|
||||
|
||||
# Send via sendMessage
|
||||
for reg in emailRegistrations:
|
||||
sendResult = messagingService.sendMessage(
|
||||
subject=emailSubject,
|
||||
message=emailMessage,
|
||||
registration=reg
|
||||
)
|
||||
if sendResult.success:
|
||||
messagesSent += 1
|
||||
|
||||
for reg in smsRegistrations:
|
||||
sendResult = messagingService.sendMessage(
|
||||
subject="",  # SMS has no subject
|
||||
message=smsMessage,
|
||||
registration=reg
|
||||
)
|
||||
if sendResult.success:
|
||||
messagesSent += 1
|
||||
|
||||
return MessagingSubscriptionExecutionResult(
|
||||
success=True,
|
||||
messagesSent=messagesSent
|
||||
)
|
||||
|
|
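The template above can be exercised without the real messaging stack. The sketch below is an assumption-laden illustration: `Channel`, `Registration`, `SendResult` and `FakeMessagingService` are hypothetical stand-ins for the real data models, `execute` takes the trigger data as plain arguments, and errors are joined with newlines rather than formatted as a list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Channel(Enum):  # stand-in for MessagingChannel
    EMAIL = "email"
    SMS = "sms"


@dataclass
class Registration:  # stand-in for MessagingSubscriptionRegistration
    userId: str
    channel: Channel


@dataclass
class SendResult:  # stand-in for MessagingSendResult
    success: bool


class FakeMessagingService:
    """Records calls instead of sending anything."""

    def __init__(self) -> None:
        self.sent: List[Tuple[Channel, str, str]] = []

    def sendMessage(self, subject: str, message: str, registration: Registration) -> SendResult:
        self.sent.append((registration.channel, subject, message))
        return SendResult(success=True)


def execute(errors: List[str], timestamp: str,
            registrations: List[Registration], messagingService) -> int:
    """Send one message per registration and return the number sent."""
    # One (subject, message) pair per channel; SMS carries no subject.
    payloads: Dict[Channel, Tuple[str, str]] = {
        Channel.EMAIL: ("System Error Report",
                        f"System errors detected at {timestamp}:\n\n" + "\n".join(errors)),
        Channel.SMS: ("", f"System Error: {len(errors)} errors detected at {timestamp}"),
    }
    messagesSent = 0
    for reg in registrations:
        if reg.channel not in payloads:
            continue  # no message defined for this channel
        subject, message = payloads[reg.channel]
        result = messagingService.sendMessage(subject=subject, message=message, registration=reg)
        if result.success:
            messagesSent += 1
    return messagesSent


service = FakeMessagingService()
count = execute(
    ["disk full", "db timeout"],
    "2025-01-01T00:00:00Z",
    [Registration("u1", Channel.EMAIL), Registration("u2", Channel.SMS)],
    service,
)
```

Keying the payloads by channel keeps the send loop identical for every channel, so supporting a further channel would only need one more dictionary entry.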
@ -0,0 +1,7 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""SharePoint service."""
|
||||
|
||||
from .mainServiceSharepoint import SharepointService
|
||||
|
||||
__all__ = ["SharepointService"]
|
||||
|
|
@ -0,0 +1,825 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""Connector for SharePoint operations using Microsoft Graph API."""
|
||||
|
||||
import logging
|
||||
import aiohttp
|
||||
import asyncio
|
||||
import time
|
||||
from typing import Dict, Any, List, Optional, Callable
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Cache for discoverSites() to avoid hitting Graph API on every folder-options call (e.g. when UI loads site list).
|
||||
# Key: token prefix (per user), Value: (expiry_ts, sites). TTL 5 minutes.
|
||||
_discoverSitesCache: Dict[str, tuple] = {}
|
||||
_DISCOVER_SITES_TTL_SEC = 300
|
||||
|
||||
|
||||
class SharepointService:
|
||||
"""SharePoint connector using Microsoft Graph API for reliable authentication."""
|
||||
|
||||
def __init__(self, context, get_service: Callable[[str], Any]):
|
||||
"""Initialize SharePoint service without access token.
|
||||
|
||||
Args:
|
||||
context: ServiceCenterContext with user, mandate_id, etc.
|
||||
get_service: Service resolver for dependency injection (e.g. security)
|
||||
|
||||
Use setAccessTokenFromConnection() method to configure the access token before making API calls.
|
||||
"""
|
||||
self._context = context
|
||||
self._get_service = get_service
|
||||
self.accessToken = None
|
||||
self.baseUrl = "https://graph.microsoft.com/v1.0"
|
||||
|
||||
def setAccessTokenFromConnection(self, userConnection) -> bool:
|
||||
"""Set access token from UserConnection.
|
||||
|
||||
Args:
|
||||
userConnection: UserConnection object or dict containing token information
|
||||
|
||||
Returns:
|
||||
bool: True if token was set successfully, False otherwise
|
||||
"""
|
||||
try:
|
||||
if not userConnection:
|
||||
logger.error("UserConnection is required to set access token")
|
||||
return False
|
||||
|
||||
# Handle both dict and UserConnection object
|
||||
if isinstance(userConnection, dict):
|
||||
connectionId = userConnection.get('id')
|
||||
else:
|
||||
connectionId = getattr(userConnection, 'id', None)
|
||||
|
||||
if not connectionId:
|
||||
logger.error("UserConnection must have an 'id' field")
|
||||
return False
|
||||
|
||||
# Get a fresh token for this specific connection via security service
|
||||
security = self._get_service("security")
|
||||
if not security:
|
||||
logger.error("Security service not available for token access")
|
||||
return False
|
||||
|
||||
token = security.getFreshToken(connectionId)
|
||||
if not token:
|
||||
logger.error(f"No token found for connection {connectionId}")
|
||||
return False
|
||||
|
||||
self.accessToken = token.tokenAccess
|
||||
logger.info(f"Access token set for connection {connectionId}")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error setting access token: {str(e)}")
|
||||
return False
|
||||
|
||||
async def _makeGraphApiCall(self, endpoint: str, method: str = "GET", data: Optional[bytes] = None) -> Dict[str, Any]:
|
||||
"""Make a Microsoft Graph API call with proper error handling."""
|
||||
try:
|
||||
if self.accessToken is None:
|
||||
logger.error("Access token is not set. Please call setAccessTokenFromConnection() before using the SharePoint service.")
|
||||
return {"error": "Access token is not set. Please call setAccessTokenFromConnection() before using the SharePoint service."}
|
||||
|
||||
headers = {
|
||||
"Authorization": f"Bearer {self.accessToken}",
|
||||
# Binary uploads (PUT with a body) are sent as octet-stream; everything else as JSON
|
||||
"Content-Type": "application/octet-stream" if data and method == "PUT" else "application/json"
|
||||
}
|
||||
|
||||
# Remove leading slash from endpoint to avoid double slash
|
||||
cleanEndpoint = endpoint.lstrip('/')
|
||||
url = f"{self.baseUrl}/{cleanEndpoint}"
|
||||
logger.debug(f"Making Graph API call: {method} {url}")
|
||||
|
||||
timeout = aiohttp.ClientTimeout(total=30)
|
||||
|
||||
async with aiohttp.ClientSession(timeout=timeout) as session:
|
||||
if method == "GET":
|
||||
async with session.get(url, headers=headers) as response:
|
||||
if response.status == 200:
|
||||
return await response.json()
|
||||
else:
|
||||
error_text = await response.text()
|
||||
logger.error(f"Graph API call failed: {response.status} - {error_text}")
|
||||
return {"error": f"API call failed: {response.status} - {error_text}"}
|
||||
|
||||
elif method == "PUT":
|
||||
async with session.put(url, headers=headers, data=data) as response:
|
||||
if response.status in [200, 201]:
|
||||
return await response.json()
|
||||
else:
|
||||
error_text = await response.text()
|
||||
logger.error(f"Graph API call failed: {response.status} - {error_text}")
|
||||
return {"error": f"API call failed: {response.status} - {error_text}"}
|
||||
|
||||
elif method == "POST":
|
||||
async with session.post(url, headers=headers, data=data) as response:
|
||||
if response.status in [200, 201]:
|
||||
return await response.json()
|
||||
else:
|
||||
error_text = await response.text()
|
||||
logger.error(f"Graph API call failed: {response.status} - {error_text}")
|
||||
return {"error": f"API call failed: {response.status} - {error_text}"}
|
||||
|
||||
elif method == "DELETE":
|
||||
async with session.delete(url, headers=headers) as response:
|
||||
if response.status in [200, 204]:
|
||||
return {}
|
||||
else:
|
||||
error_text = await response.text()
|
||||
logger.error(f"Graph API call failed: {response.status} - {error_text}")
|
||||
return {"error": f"API call failed: {response.status} - {error_text}"}
|
||||
|
||||
except asyncio.TimeoutError:
|
||||
logger.error(f"Graph API call timed out after 30 seconds: {endpoint}")
|
||||
return {"error": f"API call timed out after 30 seconds: {endpoint}"}
|
||||
except Exception as e:
|
||||
logger.error(f"Error making Graph API call: {str(e)}")
|
||||
return {"error": f"Error making Graph API call: {str(e)}"}
|
||||
|
||||
async def discoverSites(self) -> List[Dict[str, Any]]:
|
||||
"""Discover all SharePoint sites accessible to the user."""
|
||||
try:
|
||||
result = await self._makeGraphApiCall("sites?search=*")
|
||||
|
||||
if "error" in result:
|
||||
logger.error(f"Error discovering SharePoint sites: {result['error']}")
|
||||
return []
|
||||
|
||||
sites = result.get("value", [])
|
||||
logger.info(f"Discovered {len(sites)} SharePoint sites")
|
||||
|
||||
processedSites = []
|
||||
for site in sites:
|
||||
siteInfo = {
|
||||
"id": site.get("id"),
|
||||
"displayName": site.get("displayName"),
|
||||
"name": site.get("name"),
|
||||
"webUrl": site.get("webUrl"),
|
||||
"description": site.get("description"),
|
||||
"createdDateTime": site.get("createdDateTime"),
|
||||
"lastModifiedDateTime": site.get("lastModifiedDateTime")
|
||||
}
|
||||
processedSites.append(siteInfo)
|
||||
logger.debug(f"Site: {siteInfo['displayName']} - {siteInfo['webUrl']}")
|
||||
|
||||
return processedSites
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error discovering SharePoint sites: {str(e)}")
|
||||
return []
|
||||
|
||||
async def findSiteByName(self, siteName: str) -> Optional[Dict[str, Any]]:
|
||||
"""Find a specific SharePoint site by name using direct Graph API call."""
|
||||
try:
|
||||
# Try to get the site directly by name using Graph API
|
||||
endpoint = f"sites/{siteName}"
|
||||
result = await self._makeGraphApiCall(endpoint)
|
||||
|
||||
if result and "error" not in result:
|
||||
siteInfo = {
|
||||
"id": result.get("id"),
|
||||
"displayName": result.get("displayName"),
|
||||
"name": result.get("name"),
|
||||
"webUrl": result.get("webUrl"),
|
||||
"description": result.get("description"),
|
||||
"createdDateTime": result.get("createdDateTime"),
|
||||
"lastModifiedDateTime": result.get("lastModifiedDateTime")
|
||||
}
|
||||
logger.info(f"Found site directly: {siteInfo['displayName']} - {siteInfo['webUrl']}")
|
||||
return siteInfo
|
||||
|
||||
except Exception as e:
|
||||
logger.debug(f"Direct site lookup failed for '{siteName}': {str(e)}")
|
||||
|
||||
# Fallback to discovery if direct lookup fails
|
||||
logger.info(f"Direct lookup failed, trying discovery for site: {siteName}")
|
||||
sites = await self.discoverSites()
|
||||
if not sites:
|
||||
logger.warning("No sites discovered")
|
||||
return None
|
||||
|
||||
logger.info(f"Discovered {len(sites)} SharePoint sites:")
|
||||
for site in sites:
|
||||
logger.info(f" - {site.get('displayName', 'Unknown')} (ID: {site.get('id', 'Unknown')})")
|
||||
|
||||
# Try exact match first
|
||||
for site in sites:
|
||||
if site.get("displayName", "").strip().lower() == siteName.strip().lower():
|
||||
logger.info(f"Found exact match: {site.get('displayName')}")
|
||||
return site
|
||||
|
||||
# Try partial match
|
||||
for site in sites:
|
||||
if siteName.lower() in site.get("displayName", "").lower():
|
||||
logger.info(f"Found partial match: {site.get('displayName')}")
|
||||
return site
|
||||
|
||||
logger.warning(f"No site found matching: {siteName}")
|
||||
return None
|
||||
|
||||
async def findSiteByWebUrl(self, webUrl: str) -> Optional[Dict[str, Any]]:
|
||||
"""Find a SharePoint site using its web URL (useful for guest sites)."""
|
||||
try:
|
||||
# Use the web URL format: sites/{hostname}:/sites/{site-path}
|
||||
# Extract hostname and site path from the web URL
|
||||
if not webUrl.startswith("https://"):
|
||||
webUrl = f"https://{webUrl}"
|
||||
|
||||
# Parse the URL to extract hostname and site path
|
||||
from urllib.parse import urlparse
|
||||
parsed = urlparse(webUrl)
|
||||
hostname = parsed.hostname
|
||||
pathParts = parsed.path.strip('/').split('/')
|
||||
|
||||
if len(pathParts) >= 2 and pathParts[0] == 'sites':
|
||||
sitePath = '/'.join(pathParts[1:]) # Everything after 'sites/'
|
||||
else:
|
||||
logger.error(f"Invalid SharePoint URL format: {webUrl}")
|
||||
return None
|
||||
|
||||
endpoint = f"sites/{hostname}:/sites/{sitePath}"
|
||||
logger.debug(f"Trying web URL format: {endpoint}")
|
||||
|
||||
result = await self._makeGraphApiCall(endpoint)
|
||||
|
||||
if result and "error" not in result:
|
||||
siteInfo = {
|
||||
"id": result.get("id"),
|
||||
"displayName": result.get("displayName"),
|
||||
"name": result.get("name"),
|
||||
"webUrl": result.get("webUrl"),
|
||||
"description": result.get("description"),
|
||||
"createdDateTime": result.get("createdDateTime"),
|
||||
"lastModifiedDateTime": result.get("lastModifiedDateTime")
|
||||
}
|
||||
logger.info(f"Found site by web URL: {siteInfo['displayName']} - {siteInfo['webUrl']} (ID: {siteInfo['id']})")
|
||||
return siteInfo
|
||||
else:
|
||||
logger.warning(f"Site not found using web URL: {webUrl}")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error finding site by web URL: {str(e)}")
|
||||
return None
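Review note: the URL handling in `findSiteByWebUrl` can be exercised without a network call. The sketch below isolates the step that turns a web URL into the `sites/{hostname}:/sites/{site-path}` endpoint. The function name `buildSiteEndpoint` and the `contoso` host are hypothetical and used only for illustration.

```python
from typing import Optional
from urllib.parse import urlparse


def buildSiteEndpoint(webUrl: str) -> Optional[str]:
    """Hypothetical helper: turn a SharePoint web URL into a Graph site endpoint."""
    if not webUrl.startswith("https://"):
        webUrl = f"https://{webUrl}"
    parsed = urlparse(webUrl)
    pathParts = parsed.path.strip("/").split("/")
    # Only URLs shaped like https://<host>/sites/<site-path> are accepted.
    if len(pathParts) < 2 or pathParts[0] != "sites":
        return None
    sitePath = "/".join(pathParts[1:])
    return f"sites/{parsed.hostname}:/sites/{sitePath}"


print(buildSiteEndpoint("contoso.sharepoint.com/sites/Team"))
```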
|
||||
|
||||
async def findSiteByUrl(self, hostname: str, sitePath: str) -> Optional[Dict[str, Any]]:
|
||||
"""Find a SharePoint site using the site URL format."""
|
||||
try:
|
||||
# For guest sites, try different URL formats
|
||||
urlFormats = [
|
||||
f"sites/{hostname}:/sites/{sitePath}", # Standard format
|
||||
f"sites/{hostname}:/sites/{sitePath}/", # With trailing slash
|
||||
f"sites/{hostname}:/sites/{sitePath.lower()}", # Lowercase
|
||||
f"sites/{hostname}:/sites/{sitePath.lower()}/", # Lowercase with slash
|
||||
]
|
||||
|
||||
for endpoint in urlFormats:
|
||||
logger.debug(f"Trying URL format: {endpoint}")
|
||||
result = await self._makeGraphApiCall(endpoint)
|
||||
|
||||
if result and "error" not in result:
|
||||
siteInfo = {
|
||||
"id": result.get("id"),
|
||||
"displayName": result.get("displayName"),
|
||||
"name": result.get("name"),
|
||||
"webUrl": result.get("webUrl"),
|
||||
"description": result.get("description"),
|
||||
"createdDateTime": result.get("createdDateTime"),
|
||||
"lastModifiedDateTime": result.get("lastModifiedDateTime")
|
||||
}
|
||||
logger.info(f"Found site by URL: {siteInfo['displayName']} - {siteInfo['webUrl']} (ID: {siteInfo['id']})")
|
||||
return siteInfo
|
||||
else:
|
||||
logger.debug(f"URL format failed: {endpoint} - {result.get('error', 'Unknown error')}")
|
||||
|
||||
logger.warning(f"Site not found using any URL format for: {hostname}:/sites/{sitePath}")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error finding site by URL: {str(e)}")
|
||||
return None
|
||||
|
||||
async def getFolderByPath(self, siteId: str, folderPath: str) -> Optional[Dict[str, Any]]:
|
||||
"""Get folder information by path within a site."""
|
||||
try:
|
||||
# Clean the path
|
||||
cleanPath = folderPath.lstrip('/')
|
||||
|
||||
# If path is empty, get root directly
|
||||
if not cleanPath:
|
||||
endpoint = f"sites/{siteId}/drive/root"
|
||||
else:
|
||||
endpoint = f"sites/{siteId}/drive/root:/{cleanPath}"
|
||||
|
||||
result = await self._makeGraphApiCall(endpoint)
|
||||
|
||||
if "error" in result:
|
||||
logger.warning(f"Folder not found at path {folderPath}: {result['error']}")
|
||||
return None
|
||||
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting folder by path: {str(e)}")
|
||||
return None
|
||||
|
||||
async def uploadFile(self, siteId: str, folderPath: str, fileName: str, content: bytes) -> Dict[str, Any]:
|
||||
"""Upload a file to SharePoint."""
|
||||
try:
|
||||
# Clean the path
|
||||
cleanPath = folderPath.lstrip('/')
|
||||
uploadPath = f"{cleanPath.rstrip('/')}/{fileName}"
|
||||
endpoint = f"sites/{siteId}/drive/root:/{uploadPath}:/content"
|
||||
|
||||
logger.info(f"Uploading file to: {endpoint}")
|
||||
|
||||
result = await self._makeGraphApiCall(endpoint, method="PUT", data=content)
|
||||
|
||||
if "error" in result:
|
||||
logger.error(f"Upload failed: {result['error']}")
|
||||
return result
|
||||
|
||||
logger.info(f"File uploaded successfully: {fileName}")
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error uploading file: {str(e)}")
|
||||
return {"error": f"Error uploading file: {str(e)}"}
|
||||
|
||||
async def downloadFile(self, siteId: str, fileId: str) -> Optional[bytes]:
|
||||
"""Download a file from SharePoint."""
|
||||
try:
|
||||
if self.accessToken is None:
|
||||
logger.error("Access token is not set. Please call setAccessTokenFromConnection() before using the SharePoint service.")
|
||||
return None
|
||||
|
||||
endpoint = f"sites/{siteId}/drive/items/{fileId}/content"
|
||||
|
||||
headers = {"Authorization": f"Bearer {self.accessToken}"}
|
||||
timeout = aiohttp.ClientTimeout(total=30)
|
||||
|
||||
async with aiohttp.ClientSession(timeout=timeout) as session:
|
||||
async with session.get(f"{self.baseUrl}/{endpoint}", headers=headers) as response:
|
||||
if response.status == 200:
|
||||
return await response.read()
|
||||
else:
|
||||
logger.error(f"Download failed: {response.status}")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error downloading file: {str(e)}")
|
||||
return None
|
||||
|
||||
async def listFolderContents(self, siteId: str, folderPath: str = "") -> List[Dict[str, Any]]:
|
||||
"""List contents of a folder."""
|
||||
try:
|
||||
if not folderPath or folderPath == "/":
|
||||
endpoint = f"sites/{siteId}/drive/root/children"
|
||||
else:
|
||||
cleanPath = folderPath.lstrip('/')
|
||||
endpoint = f"sites/{siteId}/drive/root:/{cleanPath}:/children"
|
||||
|
||||
result = await self._makeGraphApiCall(endpoint)
|
||||
|
||||
if "error" in result:
|
||||
logger.warning(f"Failed to list folder contents: {result['error']}")
|
||||
return []
|
||||
|
||||
items = result.get("value", [])
|
||||
processedItems = []
|
||||
|
||||
for item in items:
|
||||
# Determine if it's a folder or file
|
||||
isFolder = 'folder' in item
|
||||
|
||||
itemInfo = {
|
||||
"id": item.get("id"),
|
||||
"name": item.get("name"),
|
||||
"type": "folder" if isFolder else "file",
|
||||
"size": item.get("size", 0),
|
||||
"createdDateTime": item.get("createdDateTime"),
|
||||
"lastModifiedDateTime": item.get("lastModifiedDateTime"),
|
||||
"webUrl": item.get("webUrl")
|
||||
}
|
||||
|
||||
if "file" in item:
|
||||
itemInfo["mimeType"] = item["file"].get("mimeType")
|
||||
itemInfo["downloadUrl"] = item.get("@microsoft.graph.downloadUrl")
|
||||
|
||||
if "folder" in item:
|
||||
itemInfo["childCount"] = item["folder"].get("childCount", 0)
|
||||
|
||||
processedItems.append(itemInfo)
|
||||
|
||||
return processedItems
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error listing folder contents: {str(e)}")
|
||||
return []
|
||||
|
||||
async def searchFiles(self, siteId: str, query: str) -> List[Dict[str, Any]]:
|
||||
"""Search for files in a site."""
|
||||
try:
|
||||
searchQuery = query.replace("'", "''") # Escape single quotes for OData
|
||||
endpoint = f"sites/{siteId}/drive/root/search(q='{searchQuery}')"
|
||||
|
||||
result = await self._makeGraphApiCall(endpoint)
|
||||
|
||||
if "error" in result:
|
||||
logger.warning(f"Search failed: {result['error']}")
|
||||
return []
|
||||
|
||||
items = result.get("value", [])
|
||||
processedItems = []
|
||||
|
||||
for item in items:
|
||||
isFolder = 'folder' in item
|
||||
|
||||
itemInfo = {
|
||||
"id": item.get("id"),
|
||||
"name": item.get("name"),
|
||||
"type": "folder" if isFolder else "file",
|
||||
"size": item.get("size", 0),
|
||||
"createdDateTime": item.get("createdDateTime"),
|
||||
"lastModifiedDateTime": item.get("lastModifiedDateTime"),
|
||||
"webUrl": item.get("webUrl"),
|
||||
"parentPath": item.get("parentReference", {}).get("path", "")
|
||||
}
|
||||
|
||||
if "file" in item:
|
||||
itemInfo["mimeType"] = item["file"].get("mimeType")
|
||||
itemInfo["downloadUrl"] = item.get("@microsoft.graph.downloadUrl")
|
||||
|
||||
processedItems.append(itemInfo)
|
||||
|
||||
return processedItems
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error searching files: {str(e)}")
|
||||
return []
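Review note: `searchFiles` doubles single quotes but otherwise places the raw query in the URL, so characters such as `#`, `?` or `&` could change how the request URL is parsed. A possible hardening, sketched under the assumption that percent-encoding the search term is acceptable to the Graph search endpoint (not verified here), is shown below. The helper name `buildSearchEndpoint` is hypothetical.

```python
from urllib.parse import quote


def buildSearchEndpoint(siteId: str, query: str) -> str:
    """Hypothetical helper: build the search endpoint with an escaped, encoded term."""
    # Double single quotes first, as the original code does for OData literals.
    escaped = query.replace("'", "''")
    # Then percent-encode everything that is not an unreserved URL character.
    encoded = quote(escaped, safe="")
    return f"sites/{siteId}/drive/root/search(q='{encoded}')"
```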
|
||||
|
||||
async def copyFileAsync(self, siteId: str, sourceFolder: str, sourceFile: str, destFolder: str, destFile: str) -> None:
|
||||
"""Copy a file from source to destination folder (like original synchronizer)."""
|
||||
try:
|
||||
# First, download the source file
|
||||
sourcePath = f"{sourceFolder}/{sourceFile}"
|
||||
fileContent = await self.downloadFileByPath(siteId=siteId, filePath=sourcePath)
|
||||
|
||||
if not fileContent:
|
||||
raise Exception(f"Failed to download source file: {sourcePath}")
|
||||
|
||||
# Upload to destination
|
||||
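uploadResult = await self.uploadFile(
siteId=siteId,
folderPath=destFolder,
fileName=destFile,
content=fileContent
)
# uploadFile reports failures as a dict with an "error" key instead of raising
if "error" in uploadResult:
raise Exception(f"Failed to upload destination file: {uploadResult['error']}")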
|
||||
|
||||
logger.info(f"File copied: {sourceFile} -> {destFile}")
|
||||
|
||||
except Exception as e:
|
||||
# Provide more specific error information
|
||||
errorMsg = str(e)
|
||||
if "itemNotFound" in errorMsg or "404" in errorMsg:
|
||||
raise Exception(f"Source file not found (404): {sourcePath} - {errorMsg}") from e
|
||||
else:
|
||||
raise Exception(f"Error copying file: {errorMsg}") from e
|
||||
|
||||
async def deleteFile(self, siteId: str, itemId: str) -> bool:
|
||||
"""Delete a file (or folder) from SharePoint by item ID. Returns True on success."""
|
||||
try:
|
||||
if not siteId or not itemId:
|
||||
logger.warning("deleteFile: siteId and itemId are required")
|
||||
return False
|
||||
endpoint = f"sites/{siteId}/drive/items/{itemId}"
|
||||
result = await self._makeGraphApiCall(endpoint, method="DELETE")
|
||||
if result and "error" in result:
|
||||
logger.warning(f"deleteFile failed: {result.get('error')}")
|
||||
return False
|
||||
return True
|
||||
except Exception as e:
|
||||
logger.error(f"Error deleting file: {str(e)}")
|
||||
return False
|
||||
|
||||
async def downloadFileByPath(self, siteId: str, filePath: str) -> Optional[bytes]:
|
||||
"""Download a file by its path within a site."""
|
||||
try:
|
||||
if self.accessToken is None:
|
||||
logger.error("Access token is not set. Please call setAccessTokenFromConnection() before using the SharePoint service.")
|
||||
return None
|
||||
|
||||
# Clean the path
|
||||
cleanPath = filePath.strip('/')
|
||||
endpoint = f"sites/{siteId}/drive/root:/{cleanPath}:/content"
|
||||
|
||||
# Use direct HTTP call for file downloads (binary content)
|
||||
headers = {
|
||||
"Authorization": f"Bearer {self.accessToken}",
|
||||
}
|
||||
|
||||
# Remove leading slash from endpoint to avoid double slash
|
||||
cleanEndpoint = endpoint.lstrip('/')
|
||||
url = f"{self.baseUrl}/{cleanEndpoint}"
|
||||
logger.debug(f"Downloading file: GET {url}")
|
||||
|
||||
timeout = aiohttp.ClientTimeout(total=30)
|
||||
|
||||
async with aiohttp.ClientSession(timeout=timeout) as session:
|
||||
async with session.get(url, headers=headers) as response:
|
||||
if response.status == 200:
|
||||
return await response.read()
|
||||
else:
|
||||
error_text = await response.text()
|
||||
logger.error(f"File download failed: {response.status} - {error_text}")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error downloading file by path: {str(e)}")
|
||||
return None
|
||||
|
||||
async def _getItemById(self, siteId: str, driveId: str, itemId: str) -> Optional[Dict[str, Any]]:
|
||||
"""Verify that an item exists by getting it by ID."""
|
||||
try:
|
||||
endpoint = f"sites/{siteId}/drives/{driveId}/items/{itemId}"
|
||||
result = await self._makeGraphApiCall(endpoint)
|
||||
|
||||
if "error" in result:
|
||||
logger.warning(f"Item {itemId} not found: {result['error']}")
|
||||
return None
|
||||
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Error verifying item {itemId}: {str(e)}")
|
||||
return None
|
||||
|
||||
async def _findDriveForItem(self, siteId: str, itemId: str) -> Optional[str]:
|
||||
"""Find which drive contains a specific item by trying to get it from all drives."""
|
||||
try:
|
||||
endpoint = f"sites/{siteId}/drives"
|
||||
drivesResult = await self._makeGraphApiCall(endpoint)
|
||||
|
||||
if "error" in drivesResult:
|
||||
logger.warning(f"Could not get drives for site {siteId}: {drivesResult['error']}")
|
||||
return None
|
||||
|
||||
drives = drivesResult.get("value", [])
|
||||
if not drives:
|
||||
logger.warning(f"No drives found for site {siteId}")
|
||||
return None
|
||||
|
||||
for drive in drives:
|
||||
driveId = drive.get("id")
|
||||
if not driveId:
|
||||
continue
|
||||
|
||||
itemInfo = await self._getItemById(siteId, driveId, itemId)
|
||||
if itemInfo:
|
||||
logger.info(f"Found item {itemId} in drive {drive.get('name', driveId)}")
|
||||
return driveId
|
||||
|
||||
logger.warning(f"Item {itemId} not found in any drive for site {siteId}")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Error finding drive for item {itemId}: {str(e)}")
|
||||
return None
|
||||
|
||||
async def getFolderUsageAnalytics(self, siteId: str, driveId: str, itemId: str, startDateTime: Optional[str] = None, endDateTime: Optional[str] = None, interval: str = "day") -> Dict[str, Any]:
|
||||
"""Get usage analytics for a folder or file."""
|
||||
try:
|
||||
from datetime import datetime, timedelta, timezone
|
||||
|
||||
if not endDateTime:
|
||||
endDateTime = datetime.now(timezone.utc).isoformat().replace('+00:00', 'Z')
|
||||
if not startDateTime:
|
||||
startDate = datetime.now(timezone.utc) - timedelta(days=30)
|
||||
startDateTime = startDate.isoformat().replace('+00:00', 'Z')
|
||||
|
||||
endpoint = f"sites/{siteId}/drives/{driveId}/items/{itemId}/getActivitiesByInterval"
|
||||
endpoint += f"?startDateTime={startDateTime}&endDateTime={endDateTime}&interval={interval}"
|
||||
|
||||
result = await self._makeGraphApiCall(endpoint)
|
||||
|
||||
if "error" in result:
|
||||
errorMsg = result.get('error', '')
|
||||
if isinstance(errorMsg, str) and '404' in errorMsg:
|
||||
itemInfo = await self._getItemById(siteId, driveId, itemId)
|
||||
if not itemInfo:
|
||||
correctDriveId = await self._findDriveForItem(siteId, itemId)
|
||||
if correctDriveId and correctDriveId != driveId:
|
||||
endpoint = f"sites/{siteId}/drives/{correctDriveId}/items/{itemId}/getActivitiesByInterval"
|
||||
endpoint += f"?startDateTime={startDateTime}&endDateTime={endDateTime}&interval={interval}"
|
||||
result = await self._makeGraphApiCall(endpoint)
|
||||
if "error" not in result:
|
||||
return result
|
||||
itemInfo = await self._getItemById(siteId, correctDriveId, itemId)
|
||||
|
||||
if itemInfo:
|
||||
return {
|
||||
"value": [],
|
||||
"note": "No analytics data available for this item. The item exists but may not have activity data or analytics may not be supported for this item type."
|
||||
}
|
||||
else:
|
||||
return result
|
||||
else:
|
||||
return result
|
||||
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting folder usage analytics: {str(e)}")
|
||||
return {"error": f"Error getting folder usage analytics: {str(e)}"}
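Review note: the default time window in `getFolderUsageAnalytics` is built from two separate `datetime.now()` calls, so the start and end differ by slightly more than 30 days. A small sketch that derives both bounds from one timestamp is shown below. The helper name `defaultAnalyticsWindow` and its parameters are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def defaultAnalyticsWindow(now: Optional[datetime] = None, days: int = 30) -> Tuple[str, str]:
    """Hypothetical helper: return (startDateTime, endDateTime) as UTC strings ending in 'Z'."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=days)

    def toZulu(value: datetime) -> str:
        # isoformat() renders UTC as '+00:00'; the original code swaps that for 'Z'.
        return value.isoformat().replace("+00:00", "Z")

    return toZulu(start), toZulu(end)
```

Passing `now` explicitly also makes the window easy to test.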
|
||||
|
||||
async def getDriveId(self, siteId: str, driveName: Optional[str] = None) -> Optional[str]:
|
||||
"""Get drive ID for a site."""
|
||||
try:
|
||||
endpoint = f"sites/{siteId}/drives"
|
||||
result = await self._makeGraphApiCall(endpoint)
|
||||
|
||||
if "error" in result:
|
||||
logger.error(f"Error getting drives: {result['error']}")
|
||||
return None
|
||||
|
||||
drives = result.get("value", [])
|
||||
|
||||
if not driveName:
|
||||
for drive in drives:
|
||||
if drive.get("name") == "Documents" or drive.get("name") == "Shared Documents":
|
||||
return drive.get("id")
|
||||
if drives:
|
||||
return drives[0].get("id")
|
||||
return None
|
||||
|
||||
for drive in drives:
|
||||
if drive.get("name", "").lower() == driveName.lower():
|
||||
return drive.get("id")
|
||||
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting drive ID: {str(e)}")
|
||||
return None
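Review note: the selection rule in `getDriveId` (named drive if requested, otherwise "Documents" or "Shared Documents", otherwise the first drive) can be checked in isolation. The sketch below assumes a hypothetical pure function `pickDriveId` that receives the already fetched drive list; the sample drive data is invented.

```python
from typing import Any, Dict, List, Optional

DEFAULT_LIBRARY_NAMES = ("Documents", "Shared Documents")


def pickDriveId(drives: List[Dict[str, Any]], driveName: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: choose a drive ID using the same precedence as getDriveId."""
    if driveName:
        # An explicit name must match (case-insensitive); there is no fallback.
        for drive in drives:
            if drive.get("name", "").lower() == driveName.lower():
                return drive.get("id")
        return None
    for drive in drives:
        if drive.get("name") in DEFAULT_LIBRARY_NAMES:
            return drive.get("id")
    # No default library found: fall back to the first drive, if any.
    return drives[0].get("id") if drives else None


drives = [{"id": "d1", "name": "Archive"}, {"id": "d2", "name": "Documents"}]
```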
|
||||
|
||||
def extractSiteFromStandardPath(self, pathQuery: str) -> Optional[Dict[str, str]]:
|
||||
"""
|
||||
Extract site name from Microsoft-standard server-relative path:
|
||||
/sites/company-share/Freigegebene Dokumente/...
|
||||
|
||||
Returns dict with keys: siteName, innerPath (no leading slash) on success, else None.
|
||||
"""
|
||||
try:
|
||||
if not pathQuery or not pathQuery.startswith('/sites/'):
|
||||
return None
|
||||
|
||||
remainder = pathQuery[len('/sites/'):]
|
||||
if '/' not in remainder:
|
||||
return {"siteName": remainder, "innerPath": ""}
|
||||
|
||||
siteName, inner = remainder.split('/', 1)
|
||||
siteName = siteName.strip()
|
||||
innerPath = inner.strip()
|
||||
|
||||
if not siteName:
|
||||
return None
|
||||
|
||||
return {"siteName": siteName, "innerPath": innerPath}
|
||||
except Exception as e:
|
||||
logger.error(f"Error extracting site from standard path '{pathQuery}': {str(e)}")
|
||||
return None
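Review note: the path split performed by `extractSiteFromStandardPath` can be written more compactly with `str.partition`. The sketch below uses a hypothetical function name, `splitStandardPath`, and differs from the method above in one respect that is a deliberate assumption: a bare `/sites/` with no site name returns `None` instead of an empty `siteName`.

```python
from typing import Dict, Optional

SITES_PREFIX = "/sites/"


def splitStandardPath(pathQuery: str) -> Optional[Dict[str, str]]:
    """Hypothetical helper: split '/sites/<SiteName>/<inner path>' into its two parts."""
    if not pathQuery or not pathQuery.startswith(SITES_PREFIX):
        return None
    remainder = pathQuery[len(SITES_PREFIX):]
    # partition() returns an empty inner path when there is no further '/'.
    siteName, _, innerPath = remainder.partition("/")
    siteName = siteName.strip()
    if not siteName:
        return None
    return {"siteName": siteName, "innerPath": innerPath.strip()}
```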
|
||||
|
||||
async def getSiteByStandardPath(self, sitePath: str, allSites: Optional[List[Dict[str, Any]]] = None) -> Optional[Dict[str, Any]]:
|
||||
"""Get SharePoint site directly by Microsoft-standard path (/sites/SiteName)."""
|
||||
try:
|
||||
from urllib.parse import urlparse
|
||||
hostname = None
|
||||
|
||||
if allSites and len(allSites) > 0:
|
||||
webUrl = allSites[0].get("webUrl", "")
|
||||
hostname = urlparse(webUrl).hostname if webUrl else None
|
||||
|
||||
if not hostname:
|
||||
rootSite = await self._makeGraphApiCall("sites/root")
|
||||
if rootSite and "webUrl" in rootSite and "error" not in rootSite:
|
||||
hostname = urlparse(rootSite.get("webUrl", "")).hostname
|
||||
|
||||
if not hostname:
|
||||
minimalSites = await self.discoverSites()
|
||||
if not minimalSites:
|
||||
return None
|
||||
hostname = urlparse(minimalSites[0].get("webUrl", "")).hostname
|
||||
|
||||
if not hostname:
|
||||
return None
|
||||
|
||||
endpoint = f"sites/{hostname}:/sites/{sitePath}"
|
||||
result = await self._makeGraphApiCall(endpoint)
|
||||
|
||||
if "error" in result:
|
||||
return None
|
||||
|
||||
return {
|
||||
"id": result.get("id"),
|
||||
"displayName": result.get("displayName"),
|
||||
"name": result.get("name"),
|
||||
"webUrl": result.get("webUrl"),
|
||||
"description": result.get("description"),
|
||||
"createdDateTime": result.get("createdDateTime"),
|
||||
"lastModifiedDateTime": result.get("lastModifiedDateTime")
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting site by standard path '{sitePath}': {str(e)}")
|
||||
return None
|
||||
|
||||
def filterSitesByHint(self, sites: List[Dict[str, Any]], siteHint: str) -> List[Dict[str, Any]]:
|
||||
"""Filter discovered sites by a human-entered site hint (case-insensitive substring)."""
|
||||
try:
|
||||
if not siteHint:
|
||||
return sites
|
||||
hint = siteHint.strip().lower()
|
||||
filtered: List[Dict[str, Any]] = []
|
||||
for site in sites:
|
||||
name = (site.get("displayName") or "").lower()
|
||||
webUrl = (site.get("webUrl") or "").lower()
|
||||
if hint in name or hint in webUrl:
|
||||
filtered.append(site)
|
||||
return filtered if filtered else sites
|
||||
except Exception as e:
|
||||
logger.error(f"Error filtering sites by hint '{siteHint}': {str(e)}")
|
||||
return sites
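Review note: `filterSitesByHint` returns the full list when nothing matches, so callers cannot tell "no match" from "no filter". As far as this hunk shows, that also means the `if not sites:` check in `resolveSitesFromPathQuery` only fires when the input list is already empty. The standalone copy below, without logging or error handling and with invented sample data, demonstrates the fallback.

```python
from typing import Any, Dict, List


def filterSitesByHint(sites: List[Dict[str, Any]], siteHint: str) -> List[Dict[str, Any]]:
    """Standalone copy of the filtering rule, for illustration only."""
    if not siteHint:
        return sites
    hint = siteHint.strip().lower()
    matched = [
        site for site in sites
        if hint in (site.get("displayName") or "").lower()
        or hint in (site.get("webUrl") or "").lower()
    ]
    # An empty match falls back to the unfiltered list.
    return matched if matched else sites


sites = [
    {"displayName": "Company Share", "webUrl": "https://contoso.sharepoint.com/sites/company-share"},
    {"displayName": "HR", "webUrl": "https://contoso.sharepoint.com/sites/hr"},
]
```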
|
||||
|
||||
async def resolveSitesFromPathQuery(self, pathQuery: str, allSites: Optional[List[Dict[str, Any]]] = None) -> List[Dict[str, Any]]:
|
||||
"""Resolve sites from pathQuery. Handles both Microsoft-standard paths and regular paths."""
|
||||
try:
|
||||
if pathQuery.startswith('/sites/'):
|
||||
parsedPath = self.extractSiteFromStandardPath(pathQuery)
|
||||
if parsedPath:
|
||||
siteName = parsedPath.get("siteName")
|
||||
directSite = await self.getSiteByStandardPath(siteName, allSites)
|
||||
if directSite:
|
||||
logger.info("Got site directly by standard path - no need to discover all sites")
|
||||
return [directSite]
|
||||
else:
|
||||
logger.warning("Could not get site directly, falling back to site discovery")
|
||||
|
||||
if not allSites:
|
||||
allSites = await self.discoverSites()
|
||||
if not allSites:
|
||||
logger.warning("No SharePoint sites found or accessible")
|
||||
return []
|
||||
|
||||
if pathQuery.startswith('/sites/'):
|
||||
parsedPath = self.extractSiteFromStandardPath(pathQuery)
|
||||
if parsedPath:
|
||||
siteName = parsedPath.get("siteName")
|
||||
sites = self.filterSitesByHint(allSites, siteName)
|
||||
if not sites:
|
||||
logger.warning(f"No SharePoint site found matching '{siteName}'")
|
||||
return []
|
||||
logger.info(f"Filtered to site(s) matching '{siteName}': {[s['displayName'] for s in sites]}")
|
||||
return sites
|
||||
else:
|
||||
return allSites
|
||||
else:
|
||||
return allSites
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error resolving sites from pathQuery '{pathQuery}': {str(e)}")
|
||||
return []
|
||||
|
||||
def validatePathQuery(self, pathQuery: str) -> tuple[bool, Optional[str]]:
|
||||
"""Validate pathQuery format. Returns (isValid, errorMessage)."""
|
||||
try:
|
||||
if not pathQuery or pathQuery.strip() == "" or pathQuery.strip() == "*":
|
||||
return False, "pathQuery cannot be empty or '*'"
|
||||
|
||||
if not pathQuery.startswith('/'):
|
||||
return False, "pathQuery must start with '/' and include site name with Microsoft-standard syntax /sites/<SiteName>/... e.g. /sites/company-share/Freigegebene Dokumente/Work"
|
||||
|
||||
validPathPrefixes = ['/sites/', '/Documents', '/documents', '/Shared Documents', '/shared documents']
|
||||
if not any(pathQuery.startswith(prefix) for prefix in validPathPrefixes):
|
||||
return False, f"Invalid pathQuery '{pathQuery}'. This appears to be search terms, not a valid SharePoint path. Use findDocumentPath action first to search for folders, then use the returned folder path as pathQuery."
|
||||
|
||||
return True, None
|
||||
except Exception as e:
|
||||
logger.error(f"Error validating pathQuery '{pathQuery}': {str(e)}")
|
||||
return False, f"Error validating pathQuery: {str(e)}"
|
||||
|
||||
def detectFolderType(self, item: Dict[str, Any]) -> bool:
|
||||
"""Detect if an item is a folder using improved detection logic."""
|
||||
try:
|
||||
if 'folder' in item:
|
||||
return True
|
||||
webUrl = item.get('webUrl', '')
|
||||
name = item.get('name', '')
|
||||
if '.' not in name and ('/' in webUrl or '\\' in webUrl):
|
||||
return True
|
||||
return False
|
||||
except Exception as e:
|
||||
logger.error(f"Error detecting folder type: {str(e)}")
|
||||
return False
|
||||