Merge pull request #78 from valueonag/feat/user-access-process
Feat/user access process
commit
b205a2c5ad
123 changed files with 22473 additions and 2675 deletions
|
|
@ -1,292 +0,0 @@
|
|||
# Module Dependencies Analysis
|
||||
|
||||
This document provides a comprehensive analysis of import dependencies between modules in the `modules` directory.
|
||||
|
||||
## Overview
|
||||
|
||||
The codebase is organized into the following top-level modules:
|
||||
- **aicore** - AI core functionality and model management
|
||||
- **auth** - High-level authentication and token management
|
||||
- **connectors** - External service connectors
|
||||
- **datamodels** - Data models and schemas
|
||||
- **features** - Feature modules (workflow, dynamicOptions, etc.)
|
||||
- **interfaces** - Database and service interfaces
|
||||
- **routes** - API route handlers
|
||||
- **security** - Low-level core security (RBAC and root access)
|
||||
- **services** - Business logic services
|
||||
- **shared** - Shared utilities and helpers
|
||||
- **workflows** - Workflow processing and management
|
||||
|
||||
## Bidirectional Dependency Matrix
|
||||
|
||||
This table shows all module pairs with dependencies, displaying imports in both directions.
|
||||
|
||||
| Module X | Module Y | X → Y | Y → X | Total |
|----------|----------|-------|-------|-------|
| aicore | connectors | 1 | 0 | 1 |
| aicore | datamodels | 13 | 0 | 13 |
| aicore | interfaces | 0 | 2 | 2 |
| aicore | security | 2 | 0 | 2 |
| aicore | services | 0 | 2 | 2 |
| aicore | shared | 5 | 0 | 5 |
| auth | datamodels | 5 | 0 | 5 |
| auth | interfaces | 4 | 0 | 4 |
| auth | routes | 0 | 32 | 32 |
| auth | security | 4 | 0 | 4 |
| auth | services | 0 | 1 | 1 |
| auth | shared | 8 | 0 | 8 |
| connectors | datamodels | 4 | 0 | 4 |
| connectors | interfaces | 0 | 10 | 10 |
| connectors | shared | 5 | 0 | 5 |
| datamodels | features | 0 | 6 | 6 |
| datamodels | interfaces | 0 | 27 | 27 |
| datamodels | routes | 0 | 48 | 48 |
| datamodels | security | 0 | 5 | 5 |
| datamodels | services | 0 | 52 | 52 |
| datamodels | shared | 19 | 0 | 19 |
| datamodels | workflows | 0 | 72 | 72 |
| features | interfaces | 0 | 0 | 0 |
| features | routes | 0 | 6 | 6 |
| features | services | 4 | 0 | 4 |
| features | shared | 3 | 0 | 3 |
| features | workflows | 1 | 0 | 1 |
| interfaces | routes | 0 | 29 | 29 |
| interfaces | security | 9 | 0 | 9 |
| interfaces | services | 0 | 8 | 8 |
| interfaces | shared | 11 | 0 | 11 |
| routes | interfaces | 29 | 0 | 29 |
| routes | services | 5 | 0 | 5 |
| routes | shared | 21 | 0 | 21 |
| security | connectors | 2 | 0 | 2 |
| security | datamodels | 5 | 0 | 5 |
| services | shared | 16 | 0 | 16 |
| services | workflows | 0 | 1 | 1 |
| shared | workflows | 0 | 9 | 9 |
|
||||
|
||||
**Legend:**
|
||||
- **X → Y**: Number of imports from Module X to Module Y
|
||||
- **Y → X**: Number of imports from Module Y to Module X
|
||||
- **Total**: Sum of imports in both directions
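
The document does not say which tool produced these counts. As a rough sketch of how such a matrix could be derived, the snippet below walks Python sources with the standard `ast` module and counts absolute imports that cross a top-level module boundary. The function name `countCrossModuleImports` and the in-memory `exampleSources` mapping are assumptions made for illustration; a real run would read the files under `modules/` instead.

```python
import ast
from collections import Counter


def countCrossModuleImports(sources, root="modules"):
    """Count static imports between top-level modules.

    sources maps a top-level module name to a list of Python source strings
    (one per file). Returns a Counter keyed by (importer, imported).
    """
    counts = Counter()
    for importer, fileSources in sources.items():
        for source in fileSources:
            for node in ast.walk(ast.parse(source)):
                if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                    targets = [node.module]
                elif isinstance(node, ast.Import):
                    targets = [alias.name for alias in node.names]
                else:
                    continue
                for target in targets:
                    parts = target.split(".")
                    # Only "modules.<name>..." imports that leave the importing module count
                    if len(parts) >= 2 and parts[0] == root and parts[1] != importer:
                        counts[(importer, parts[1])] += 1
    return counts


# Hypothetical file contents, used only to exercise the counter
exampleSources = {
    "aicore": [
        "from modules.datamodels.datamodelChat import ActionResult\n"
        "from modules.shared.attributeUtils import registerModelLabels\n"
        "import os\n",
    ],
    "shared": [
        "import importlib\n"
        "model = importlib.import_module('modules.datamodels.datamodelChat')\n",
    ],
}
importCounts = countCrossModuleImports(exampleSources)
```

In this sketch the `shared` entry contributes nothing to `importCounts`, because a call to `importlib.import_module()` is not an import statement and a purely static scan does not see it.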
|
||||
|
||||
## Bidirectional Dependencies Only (Circular Dependencies)
|
||||
|
||||
This table shows only module pairs where imports exist in **both directions**, indicating potential circular dependencies that should be monitored.
|
||||
|
||||
| Module X | Module Y | X → Y | Y → X | Total |
|----------|----------|-------|-------|-------|
|
||||
|
||||
**Total bidirectional dependencies: 0**
|
||||
|
||||
**Note:** All circular dependencies have been eliminated. The architecture now has clean one-way dependencies.
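
A check for bidirectional pairs can be sketched in a few lines, assuming the import counts are available as a mapping from `(importer, imported)` to a count. The helper name `findBidirectionalPairs` is hypothetical, and the `connectors → security` count of 1 in the second example is invented to show a positive hit; the document does not give the pre-refactoring number.

```python
def findBidirectionalPairs(importCounts):
    """Return sorted (x, y, xToY, yToX) tuples where imports exist in both directions."""
    pairs = []
    for (src, dst), count in importCounts.items():
        reverse = importCounts.get((dst, src), 0)
        # src < dst reports each unordered pair once
        if src < dst and count > 0 and reverse > 0:
            pairs.append((src, dst, count, reverse))
    return sorted(pairs)


# Counts taken from the matrix above
currentCounts = {
    ("security", "connectors"): 2,
    ("interfaces", "connectors"): 10,
    ("interfaces", "security"): 9,
}

# Same counts plus one invented reverse edge, to show what a hit looks like
countsWithCycle = dict(currentCounts)
countsWithCycle[("connectors", "security")] = 1
```

This only finds two-module cycles. Longer cycles (A → B → C → A) would need a graph search such as a depth-first traversal over the same edges.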
|
||||
|
||||
**Key Improvements:**
|
||||
1. **Eliminated `connectors ↔ security` circular dependency**: After moving RBAC logic from `connectorDbPostgre.py` to `interfaces/interfaceRbac.py`, connectors no longer import from security. Security still imports from connectors (for `rootAccess` to create `DatabaseConnector` instances), but this is a one-way dependency (security → connectors: 2, connectors → security: 0).
|
||||
2. **Eliminated `shared ↔ security` circular dependency**: Moved `rbacHelpers.py` from `shared` to `security` module since it was only used in `aicore` and `aicore` already imports from `security`. This eliminates the architectural violation where `shared` imported from `security`.
|
||||
3. **Eliminated `datamodels ↔ shared` circular dependency**: `shared` no longer has any static imports from `datamodels`. The only reference is a dynamic import in `attributeUtils.py` using `importlib.import_module()` for runtime model discovery, which is not detected by static analysis. This is acceptable as it's a runtime-only dependency.
|
||||
4. **New `interfaces/interfaceRbac.py` module**: Created to handle RBAC filtering for interfaces, importing from both `security` and `connectors`. This maintains proper architectural layering where connectors remain generic.
|
||||
5. **Updated dependency counts**:
|
||||
- `interfaces` → `connectors`: increased from 9 to 10 (interfaceRbac imports connectorDbPostgre)
|
||||
- `interfaces` → `security`: increased from 7 to 9 (interfaceRbac imports rbac and rootAccess)
|
||||
- `features` → `interfaces`: increased from 1 to 2 (mainWorkflow imports interfaceRbac)
|
||||
- `routes` → `interfaces`: increased from 28 to 29 (routeWorkflows imports interfaceRbac)
|
||||
- `aicore` → `security`: increased from 1 to 2 (now imports rbacHelpers from security)
|
||||
- `security` → `datamodels`: increased from 3 to 5 (rbacHelpers adds datamodel imports)
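
Item 3 above relies on a runtime import that static analysis does not see. The pattern can be sketched as follows; `loadAttribute` is a hypothetical helper, not the code in `attributeUtils.py`, and a standard-library module stands in for a datamodel module so the snippet runs on its own.

```python
import importlib


def loadAttribute(modulePath, attributeName):
    """Import a module by its dotted path at runtime and return one attribute from it."""
    module = importlib.import_module(modulePath)
    return getattr(module, attributeName)


# Standard-library stand-in for a datamodel lookup
orderedDictClass = loadAttribute("collections", "OrderedDict")
```

Because the module path is an ordinary string, no `import` statement ties the calling module to the target at parse time. The dependency still exists at runtime, so a missing or renamed target only fails when the lookup is executed.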
|
||||
|
||||
## Dependency Graph (Mermaid)
|
||||
|
||||
```mermaid
graph TD
    aicore[aicore]
    auth[auth]
    connectors[connectors]
    datamodels[datamodels]
    features[features]
    interfaces[interfaces]
    routes[routes]
    security[security]
    services[services]
    shared[shared]
    workflows[workflows]

    aicore -->|13| datamodels
    aicore -->|1| connectors
    aicore -->|2| security
    aicore -->|5| shared

    auth -->|5| datamodels
    auth -->|4| interfaces
    auth -->|4| security
    auth -->|8| shared

    connectors -->|4| datamodels
    connectors -->|5| shared

    datamodels -->|19| shared

    features -->|6| datamodels
    features -->|0| interfaces
    features -->|4| services
    features -->|3| shared
    features -->|1| workflows

    interfaces -->|29| datamodels
    interfaces -->|10| connectors
    interfaces -->|2| aicore
    interfaces -->|9| security
    interfaces -->|11| shared

    routes -->|48| datamodels
    routes -->|29| interfaces
    routes -->|32| auth
    routes -->|21| shared
    routes -->|6| features
    routes -->|5| services

    security -->|5| datamodels
    security -->|2| connectors
    security -->|1| shared

    services -->|52| datamodels
    services -->|8| interfaces
    services -->|2| aicore
    services -->|1| auth
    services -->|16| shared

    workflows -->|72| datamodels
    workflows -->|1| services
    workflows -->|9| shared
```
|
||||
|
||||
## Detailed Module Dependencies
|
||||
|
||||
### aicore
|
||||
**Imports from:**
|
||||
- `connectors` (1 import)
|
||||
- `datamodels` (13 imports)
|
||||
- `security` (2 imports: rbac, rbacHelpers)
|
||||
- `shared` (4 imports)
|
||||
|
||||
**Dependencies:** Low-level AI functionality, depends on data models and connectors.
|
||||
|
||||
### auth
|
||||
**Imports from:**
|
||||
- `datamodels` (5 imports)
|
||||
- `interfaces` (4 imports)
|
||||
- `security` (4 imports)
|
||||
- `shared` (8 imports)
|
||||
|
||||
**Dependencies:** High-level authentication and token management, used by routes and services.
|
||||
|
||||
### connectors
|
||||
**Imports from:**
|
||||
- `datamodels` (4 imports)
|
||||
- `shared` (5 imports)
|
||||
|
||||
**Dependencies:** External service connectors, minimal dependencies. No longer imports from security or interfaces. Connectors are now fully generic and do not depend on security modules.
|
||||
|
||||
### datamodels
|
||||
**Imports from:**
|
||||
- `shared` (19 imports)
|
||||
|
||||
**Dependencies:** Core data models, only depends on shared utilities.
|
||||
|
||||
### features
|
||||
**Imports from:**
|
||||
- `datamodels` (6 imports)
|
||||
- `services` (4 imports)
|
||||
- `shared` (3 imports)
|
||||
- `workflows` (1 import)
|
||||
|
||||
**Dependencies:** Feature modules that orchestrate workflows and services. Features now use services exclusively, not interfaces directly, maintaining proper architectural layering.
|
||||
|
||||
### interfaces
|
||||
**Imports from:**
|
||||
- `aicore` (2 imports)
|
||||
- `connectors` (10 imports)
|
||||
- `datamodels` (29 imports)
|
||||
- `security` (9 imports)
|
||||
- `shared` (11 imports)
|
||||
|
||||
**Dependencies:** Database and service interfaces, heavily depends on data models. Includes `interfaceRbac.py` which handles RBAC filtering for all interfaces. No longer creates circular dependency with connectors.
|
||||
|
||||
### routes
|
||||
**Imports from:**
|
||||
- `auth` (32 imports)
|
||||
- `datamodels` (48 imports)
|
||||
- `features` (6 imports)
|
||||
- `interfaces` (29 imports)
|
||||
- `services` (5 imports)
|
||||
- `shared` (21 imports)
|
||||
|
||||
**Dependencies:** API endpoints, highest dependency count, orchestrates all layers. Now imports from `auth` instead of `security` for authentication. Increased use of services (from 2 to 5 imports) after architectural refactoring to use services instead of direct interface access in features.
|
||||
|
||||
### security
|
||||
**Imports from:**
|
||||
- `connectors` (2 imports)
|
||||
- `datamodels` (5 imports: rbac uses 3, rbacHelpers uses 2)
|
||||
- `shared` (1 import: rootAccess uses configuration)
|
||||
|
||||
**Dependencies:** Low-level core security (RBAC, root access, and RBAC helper functions). Used by interfaces (including `interfaceRbac.py`), auth, and aicore. The `rbacHelpers` module was moved from `shared` to `security` to eliminate the architectural violation where `shared` imported from `security`. Security imports from connectors only for `rootAccess` to create `DatabaseConnector` instances - this is acceptable as it's a one-way dependency (security → connectors).
|
||||
|
||||
### services
|
||||
**Imports from:**
|
||||
- `aicore` (2 imports)
|
||||
- `auth` (1 import)
|
||||
- `datamodels` (52 imports)
|
||||
- `interfaces` (8 imports)
|
||||
- `shared` (16 imports)
|
||||
|
||||
**Dependencies:** Business logic services, heavily depends on data models.
|
||||
|
||||
### shared
|
||||
**Imports from:**
|
||||
- None (0 imports)
|
||||
|
||||
**Dependencies:** Shared utilities, completely self-contained with no dependencies on other modules. No longer imports from security (rbacHelpers was moved to security module) or datamodels (only uses dynamic imports at runtime for model discovery in `attributeUtils.py`), maintaining proper architectural layering.
|
||||
|
||||
### workflows
|
||||
**Imports from:**
|
||||
- `datamodels` (72 imports)
|
||||
- `services` (1 import)
|
||||
- `shared` (9 imports)
|
||||
|
||||
**Dependencies:** Workflow processing, heavily depends on data models (highest count). Reduced from 74 to 72 imports after removing unused imports from `contentValidator.py`.
|
||||
|
||||
## Key Observations
|
||||
|
||||
1. **datamodels** is the most imported module (used by 9 out of 11 modules)
|
||||
2. **shared** is widely used but has no static dependencies on other modules (good design)
|
||||
3. **routes** has the most diverse dependencies (imports from 6 different modules)
|
||||
4. **workflows** has the highest number of imports from datamodels (72)
|
||||
5. **auth** is now a separate module, used exclusively by routes and services
|
||||
6. **security** is now a low-level module, used by interfaces (including `interfaceRbac.py`), auth, and aicore
|
||||
7. **connectors** are now fully generic - no dependencies on security or interfaces
|
||||
8. **Circular dependencies eliminated**: Reduced from 3 to 0 after RBAC refactoring and `rbacHelpers` move (eliminated `connectors ↔ security`, `shared ↔ security`, and `datamodels ↔ shared`)
|
||||
9. **New `interfaceRbac.py` module** centralizes RBAC filtering logic for all interfaces
|
||||
10. **`shared` module is now completely self-contained** - no static imports from any other module
|
||||
11. **Features architectural improvements**: Features no longer import directly from interfaces (reduced from 2 to 0). All features now use services exclusively, maintaining proper layering: Features → Services → Interfaces → Connectors
|
||||
12. **Routes increased services usage**: Routes now import from services 5 times (up from 2) after refactoring features to use services instead of direct interface access
|
||||
|
||||
## Dependency Layers
|
||||
|
||||
Based on the analysis, the architecture follows these layers:
|
||||
|
||||
1. **Foundation Layer**: `shared`, `datamodels`
|
||||
2. **Core Layer**: `aicore`, `connectors`, `security`
|
||||
3. **Interface Layer**: `interfaces`
|
||||
4. **Authentication Layer**: `auth`
|
||||
5. **Business Logic Layer**: `services`, `workflows`
|
||||
6. **Feature Layer**: `features`
|
||||
7. **API Layer**: `routes`
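
One way to keep this layering honest is to check that no module imports from a layer above its own. The sketch below encodes the seven layers listed above and tests a handful of edges from the dependency graph. `findUpwardImports` is a hypothetical helper, and treating imports within the same layer as allowed is an assumption; the document does not state a rule for them.

```python
LAYERS = [
    ["shared", "datamodels"],              # 1. Foundation
    ["aicore", "connectors", "security"],  # 2. Core
    ["interfaces"],                        # 3. Interface
    ["auth"],                              # 4. Authentication
    ["services", "workflows"],             # 5. Business logic
    ["features"],                          # 6. Feature
    ["routes"],                            # 7. API
]
LAYER_OF = {name: index for index, names in enumerate(LAYERS) for name in names}


def findUpwardImports(edges):
    """Return (importer, imported) edges where the importer sits below the imported module."""
    return [(src, dst) for src, dst in edges if LAYER_OF[src] < LAYER_OF[dst]]


# A subset of the edges from the dependency graph
graphEdges = [
    ("aicore", "connectors"), ("security", "connectors"), ("datamodels", "shared"),
    ("interfaces", "aicore"), ("services", "auth"), ("workflows", "services"),
    ("features", "workflows"), ("routes", "features"),
]
```

With these edges the check reports nothing, while an edge such as `("shared", "security")`, the violation the refactoring removed, would be flagged.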
|
||||
|
||||
## Recommendations
|
||||
|
||||
1. **datamodels** should remain stable as it's a core dependency
|
||||
2. **shared** is excellently designed - completely self-contained with zero dependencies (perfect foundation layer)
|
||||
3. **security** split and RBAC refactoring were successful - eliminated the `connectors ↔ security` and `shared ↔ security` circular dependencies
|
||||
4. **connectors** are now fully generic and maintainable - keep them free of security/interface dependencies
|
||||
5. **interfaceRbac.py** successfully centralizes RBAC logic - consider this pattern for other cross-cutting concerns
|
||||
6. Consider breaking down **workflows** if it continues to grow
|
||||
7. **routes** could benefit from further abstraction to reduce direct dependencies
|
||||
8. **Architecture is now clean** - no circular dependencies remain, maintaining clear separation of concerns
|
||||
|
|
@ -67,7 +67,7 @@ class AiAnthropic(BaseConnectorAi):
|
|||
processingMode=ProcessingModeEnum.DETAILED,
|
||||
operationTypes=createOperationTypeRatings(
|
||||
(OperationTypeEnum.PLAN, 9),
|
||||
(OperationTypeEnum.DATA_ANALYSE, 10),
|
||||
(OperationTypeEnum.DATA_ANALYSE, 9),
|
||||
(OperationTypeEnum.DATA_GENERATE, 9),
|
||||
(OperationTypeEnum.DATA_EXTRACT, 8)
|
||||
),
|
||||
|
|
|
|||
|
|
@ -59,16 +59,16 @@ class AiOpenai(BaseConnectorAi):
|
|||
contextLength=128000,
|
||||
costPer1kTokensInput=0.03,
|
||||
costPer1kTokensOutput=0.06,
|
||||
speedRating=7, # Good speed for complex tasks
|
||||
qualityRating=9, # High quality
|
||||
speedRating=8, # Good speed for complex tasks
|
||||
qualityRating=10, # High quality
|
||||
# capabilities removed (not used in business logic)
|
||||
functionCall=self.callAiBasic,
|
||||
priority=PriorityEnum.BALANCED,
|
||||
processingMode=ProcessingModeEnum.ADVANCED,
|
||||
operationTypes=createOperationTypeRatings(
|
||||
(OperationTypeEnum.PLAN, 8),
|
||||
(OperationTypeEnum.DATA_ANALYSE, 9),
|
||||
(OperationTypeEnum.DATA_GENERATE, 9),
|
||||
(OperationTypeEnum.PLAN, 9),
|
||||
(OperationTypeEnum.DATA_ANALYSE, 10),
|
||||
(OperationTypeEnum.DATA_GENERATE, 10),
|
||||
(OperationTypeEnum.DATA_EXTRACT, 7)
|
||||
),
|
||||
version="gpt-4o",
|
||||
|
|
@ -354,10 +354,11 @@ class AiOpenai(BaseConnectorAi):
|
|||
|
||||
if response.status_code != 200:
|
||||
logger.error(f"DALL-E API error: {response.status_code} - {response.text}")
|
||||
return {
|
||||
"success": False,
|
||||
"error": f"DALL-E API error: {response.status_code} - {response.text}"
|
||||
}
|
||||
return AiModelResponse(
|
||||
content="",
|
||||
success=False,
|
||||
error=f"DALL-E API error: {response.status_code} - {response.text}"
|
||||
)
|
||||
|
||||
responseJson = response.json()
|
||||
|
||||
|
|
|
|||
|
|
@ -13,6 +13,8 @@ class DocumentMetadata(BaseModel):
|
|||
sourceDocuments: List[str] = Field(default_factory=list, description="Source document IDs")
|
||||
extractionMethod: str = Field(default="ai_extraction", description="Method used for extraction")
|
||||
version: str = Field(default="1.0", description="Document version")
|
||||
documentType: Optional[str] = Field(default=None, description="Type of document (e.g., 'report', 'invoice', 'analysis')")
|
||||
styles: Optional[Dict[str, Any]] = Field(default=None, description="Document styling configuration")
|
||||
|
||||
|
||||
class TableData(BaseModel):
|
||||
|
|
@ -107,5 +109,19 @@ class StructuredDocument(BaseModel):
|
|||
|
||||
|
||||
|
||||
class RenderedDocument(BaseModel):
|
||||
"""A single rendered document from a renderer."""
|
||||
documentData: bytes = Field(description="Document content as bytes")
|
||||
mimeType: str = Field(description="MIME type of the document (e.g., 'text/html', 'application/pdf')")
|
||||
filename: str = Field(description="Filename for the document (e.g., 'report.html', 'image.png')")
|
||||
documentType: Optional[str] = Field(default=None, description="Type of document (e.g., 'report', 'invoice', 'analysis')")
|
||||
metadata: Optional[Dict[str, Any]] = Field(default=None, description="Document metadata (title, author, etc.)")
|
||||
|
||||
class Config:
|
||||
json_encoders = {
|
||||
bytes: lambda v: v.decode('utf-8', errors='replace') if isinstance(v, bytes) else v
|
||||
}
|
||||
|
||||
|
||||
# Update forward references
|
||||
ListItem.model_rebuild()
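
The `json_encoders` entry in `RenderedDocument` decodes bytes as UTF-8 with `errors='replace'`. The sketch below, using only the standard library, shows what that choice means for text and for binary payloads. The function names are hypothetical, and the Base64 variant is one possible alternative shown for comparison, not something this commit uses.

```python
import base64


def encodeBytesLossy(value: bytes) -> str:
    """Same approach as the json_encoders lambda: undecodable bytes become U+FFFD."""
    return value.decode("utf-8", errors="replace")


def encodeBytesBase64(value: bytes) -> str:
    """Alternative that keeps binary payloads recoverable after JSON serialization."""
    return base64.b64encode(value).decode("ascii")


htmlBytes = "<p>Grüezi</p>".encode("utf-8")  # valid UTF-8 text
pngHeader = b"\x89PNG\r\n\x1a\n"             # first bytes of a PNG file, not valid UTF-8

lossyHtml = encodeBytesLossy(htmlBytes)
lossyPng = encodeBytesLossy(pngHeader)
safePng = encodeBytesBase64(pngHeader)
```

Text formats such as `text/html` survive the lossy path unchanged. For binary formats such as `image.png` or `application/pdf`, the replacement character stands in for every invalid byte, so the original bytes cannot be rebuilt from the JSON output.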
|
||||
|
|
|
|||
|
|
@ -61,6 +61,14 @@ class MergeStrategy(BaseModel):
|
|||
capabilities: Optional[Dict[str, Any]] = Field(default=None, description="Model capabilities for intelligent merging")
|
||||
|
||||
|
||||
class DocumentIntent(BaseModel):
    """Intent analysis for a single document."""
    documentId: str = Field(description="ID of the document")
    intents: List[str] = Field(description="List of intents: ['extract', 'render', 'reference'] - multiple allowed")
    extractionPrompt: Optional[str] = Field(default=None, description="Specific prompt for extraction (e.g., 'Extract text from images for legends')")
    reasoning: str = Field(description="Explanation for debugging/transparency: why was this intent chosen?")
|
||||
|
||||
|
||||
class ExtractionOptions(BaseModel):
|
||||
"""Options for document extraction and processing with clear data structures."""
|
||||
|
||||
|
|
|
|||
|
|
@ -19,12 +19,14 @@ supportedSectionTypes: List[str] = [
|
|||
]
|
||||
|
||||
# Canonical JSON template used for AI generation (documents array + sections)
|
||||
# Rendering pipelines can select the first document and read its sections.
|
||||
# This template is used for STRUCTURE generation - sections have empty elements arrays.
|
||||
# For content generation, elements arrays will be populated later.
|
||||
jsonTemplateDocument: str = """{
|
||||
"metadata": {
|
||||
"split_strategy": "single_document",
|
||||
"source_documents": [],
|
||||
"extraction_method": "ai_generation"
|
||||
"extraction_method": "ai_generation",
|
||||
"title": "{{DOCUMENT_TITLE}}"
|
||||
},
|
||||
"documents": [
|
||||
{
|
||||
|
|
@ -33,56 +35,77 @@ jsonTemplateDocument: str = """{
|
|||
"filename": "document.json",
|
||||
"sections": [
|
||||
{
|
||||
"id": "section_heading_example",
|
||||
"id": "section_heading_main_title",
|
||||
"content_type": "heading",
|
||||
"elements": [
|
||||
{"level": 1, "text": "Heading Text"}
|
||||
],
|
||||
"order": 0
|
||||
"complexity": "simple",
|
||||
"generation_hint": "Main document title heading",
|
||||
"order": 1,
|
||||
"elements": []
|
||||
},
|
||||
{
|
||||
"id": "section_paragraph_example",
|
||||
"id": "section_paragraph_introduction",
|
||||
"content_type": "paragraph",
|
||||
"elements": [
|
||||
{"text": "Paragraph text content"}
|
||||
],
|
||||
"order": 0
|
||||
"complexity": "simple",
|
||||
"generation_hint": "Introduction paragraph",
|
||||
"order": 2,
|
||||
"elements": []
|
||||
},
|
||||
{
|
||||
"id": "section_heading_section_1",
|
||||
"content_type": "heading",
|
||||
"complexity": "simple",
|
||||
"generation_hint": "Section heading for topic 1",
|
||||
"order": 3,
|
||||
"elements": []
|
||||
},
|
||||
{
|
||||
"id": "section_paragraph_section_1",
|
||||
"content_type": "paragraph",
|
||||
"complexity": "simple",
|
||||
"generation_hint": "Content paragraph for section 1",
|
||||
"order": 4,
|
||||
"elements": []
|
||||
},
|
||||
{
|
||||
"id": "section_bullet_list_example",
|
||||
"content_type": "bullet_list",
|
||||
"elements": [
|
||||
{
|
||||
"items": ["Item 1", "Item 2"]
|
||||
}
|
||||
],
|
||||
"order": 0
|
||||
"complexity": "simple",
|
||||
"generation_hint": "Bullet list items",
|
||||
"order": 5,
|
||||
"elements": []
|
||||
},
|
||||
{
|
||||
"id": "section_image_example",
|
||||
"content_type": "image",
|
||||
"complexity": "complex",
|
||||
"generation_hint": "Illustration for document",
|
||||
"image_prompt": "A detailed description for image generation",
|
||||
"order": 6,
|
||||
"elements": []
|
||||
},
|
||||
{
|
||||
"id": "section_table_example",
|
||||
"content_type": "table",
|
||||
"elements": [
|
||||
{
|
||||
"headers": ["Column 1", "Column 2"],
|
||||
"rows": [
|
||||
["Row 1 Col 1", "Row 1 Col 2"],
|
||||
["Row 2 Col 1", "Row 2 Col 2"]
|
||||
],
|
||||
"caption": "Table caption"
|
||||
}
|
||||
],
|
||||
"order": 0
|
||||
"complexity": "simple",
|
||||
"generation_hint": "Data table with relevant information",
|
||||
"order": 7,
|
||||
"elements": []
|
||||
},
|
||||
{
|
||||
"id": "section_code_example",
|
||||
"content_type": "code_block",
|
||||
"elements": [
|
||||
{
|
||||
"code": "function example() { return true; }",
|
||||
"language": "javascript"
|
||||
}
|
||||
],
|
||||
"order": 0
|
||||
"complexity": "simple",
|
||||
"generation_hint": "Code example or snippet",
|
||||
"order": 8,
|
||||
"elements": []
|
||||
},
|
||||
{
|
||||
"id": "section_paragraph_conclusion",
|
||||
"content_type": "paragraph",
|
||||
"complexity": "simple",
|
||||
"generation_hint": "Conclusion paragraph",
|
||||
"order": 9,
|
||||
"elements": []
|
||||
}
|
||||
]
|
||||
}
|
||||
|
|
|
|||
88
modules/datamodels/datamodelWorkflowActions.py
Normal file
|
|
@ -0,0 +1,88 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""Workflow Action models: WorkflowActionParameter, WorkflowActionDefinition."""
|
||||
|
||||
from typing import Optional, Any, Union, List, Dict, Callable, Awaitable
|
||||
from pydantic import BaseModel, Field
|
||||
from modules.datamodels.datamodelChat import ActionResult
|
||||
from modules.shared.frontendTypes import FrontendType
|
||||
from modules.shared.attributeUtils import registerModelLabels
|
||||
|
||||
|
||||
class WorkflowActionParameter(BaseModel):
|
||||
"""
|
||||
Parameter schema definition for a workflow action.
|
||||
|
||||
This defines the structure and UI rendering for a single action parameter,
|
||||
NOT the actual parameter values (those are in ActionDefinition.parameters).
|
||||
"""
|
||||
name: str = Field(description="Parameter name")
|
||||
type: str = Field(description="Python type as string: 'str', 'int', 'bool', 'List[str]', etc.")
|
||||
frontendType: FrontendType = Field(description="UI rendering type (from global FrontendType enum)")
|
||||
frontendOptions: Optional[Union[str, List[str]]] = Field(
|
||||
None,
|
||||
description="Options for select/multiselect/custom types. String reference (e.g., 'user.connection') or list of strings (e.g., ['txt', 'json']). For custom types, this is automatically set to the API endpoint."
|
||||
)
|
||||
required: bool = Field(False, description="Whether parameter is required")
|
||||
default: Optional[Any] = Field(None, description="Default value")
|
||||
description: str = Field("", description="Parameter description")
|
||||
validation: Optional[Dict[str, Any]] = Field(
|
||||
None,
|
||||
description="Validation rules (e.g., {'min': 1, 'max': 100})"
|
||||
)
|
||||
|
||||
|
||||
class WorkflowActionDefinition(BaseModel):
|
||||
"""
|
||||
Complete schema definition of a workflow action.
|
||||
|
||||
This defines the metadata, parameters, and execution function for an action.
|
||||
This is different from datamodelWorkflow.ActionDefinition which contains
|
||||
actual execution values (action, actionObjective, parameters with values).
|
||||
|
||||
This class defines the ACTION SCHEMA, not the execution plan.
|
||||
"""
|
||||
actionId: str = Field(
|
||||
description="Unique action identifier for RBAC (format: 'module.actionName', e.g., 'outlook.readEmails')"
|
||||
)
|
||||
description: str = Field(description="Action description")
|
||||
parameters: Dict[str, WorkflowActionParameter] = Field(
|
||||
default_factory=dict,
|
||||
description="Parameter schema definitions"
|
||||
)
|
||||
execute: Optional[Callable] = Field(
|
||||
None,
|
||||
description="Execution function - async function that takes parameters dict and returns ActionResult. Set dynamically."
|
||||
)
|
||||
category: Optional[str] = Field(None, description="Action category for grouping")
|
||||
tags: List[str] = Field(default_factory=list, description="Tags for search/filtering")
|
||||
|
||||
|
||||
# Register model labels for UI
|
||||
registerModelLabels(
|
||||
"WorkflowActionDefinition",
|
||||
{"en": "Workflow Action Definition", "fr": "Définition d'action de workflow"},
|
||||
{
|
||||
"actionId": {"en": "Action ID", "fr": "ID d'action"},
|
||||
"description": {"en": "Description", "fr": "Description"},
|
||||
"parameters": {"en": "Parameters", "fr": "Paramètres"},
|
||||
"category": {"en": "Category", "fr": "Catégorie"},
|
||||
"tags": {"en": "Tags", "fr": "Étiquettes"},
|
||||
},
|
||||
)
|
||||
|
||||
registerModelLabels(
|
||||
"WorkflowActionParameter",
|
||||
{"en": "Workflow Action Parameter", "fr": "Paramètre d'action de workflow"},
|
||||
{
|
||||
"name": {"en": "Name", "fr": "Nom"},
|
||||
"type": {"en": "Type", "fr": "Type"},
|
||||
"frontendType": {"en": "Frontend Type", "fr": "Type frontend"},
|
||||
"frontendOptions": {"en": "Frontend Options", "fr": "Options frontend"},
|
||||
"required": {"en": "Required", "fr": "Requis"},
|
||||
"default": {"en": "Default", "fr": "Par défaut"},
|
||||
"description": {"en": "Description", "fr": "Description"},
|
||||
"validation": {"en": "Validation", "fr": "Validation"},
|
||||
},
|
||||
)
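
To illustrate how a parameter schema like `WorkflowActionParameter` might be applied to incoming values, the sketch below resolves defaults and checks the `min`/`max` keys named in the `validation` description. It uses a plain dataclass as a stand-in so it runs without Pydantic. `ParameterSchema`, `resolveParameters`, and the `folder` and `limit` parameters are hypothetical and do not come from this commit.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ParameterSchema:
    """Stand-in for WorkflowActionParameter, reduced to the fields used here."""
    name: str
    required: bool = False
    default: Optional[Any] = None
    validation: Optional[Dict[str, Any]] = None


def resolveParameters(
    schema: Dict[str, ParameterSchema], values: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[str]]:
    """Apply defaults and min/max rules; return (resolvedValues, errors)."""
    resolved: Dict[str, Any] = {}
    errors: List[str] = []
    for name, param in schema.items():
        value = values.get(name, param.default)
        if value is None:
            if param.required:
                errors.append(f"{name}: missing required parameter")
            continue
        rules = param.validation or {}
        problems = []
        if "min" in rules and value < rules["min"]:
            problems.append(f"{name}: {value} is below min {rules['min']}")
        if "max" in rules and value > rules["max"]:
            problems.append(f"{name}: {value} is above max {rules['max']}")
        if problems:
            errors.extend(problems)
        else:
            resolved[name] = value
    return resolved, errors


# Hypothetical schema for an action such as 'outlook.readEmails'
readEmailsSchema = {
    "folder": ParameterSchema(name="folder", required=True),
    "limit": ParameterSchema(name="limit", default=10, validation={"min": 1, "max": 100}),
}
```

Keeping the schema separate from the values mirrors the split described in the docstrings: the definition describes what an action accepts, and the execution plan carries the actual values.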
|
||||
|
||||
|
|
@ -233,6 +233,9 @@ def initRbacRules(db: DatabaseConnector) -> None:
|
|||
# Create RESOURCE context rules
|
||||
createResourceContextRules(db)
|
||||
|
||||
# Create Action-specific RBAC rules
|
||||
createActionRules(db)
|
||||
|
||||
logger.info("RBAC rules initialization completed")
|
||||
|
||||
|
||||
|
|
@ -785,6 +788,108 @@ def createResourceContextRules(db: DatabaseConnector) -> None:
|
|||
logger.info(f"Created {len(resourceRules)} RESOURCE context rules")
|
||||
|
||||
|
||||
def createActionRules(db: DatabaseConnector) -> None:
|
||||
"""
|
||||
Create default RBAC rules for workflow actions.
|
||||
|
||||
This function dynamically discovers all available actions from all methods
|
||||
and creates RBAC rules for them. Actions are protected via RESOURCE context
|
||||
with actionId as the item identifier (format: 'module.actionName').
|
||||
|
||||
Args:
|
||||
db: Database connector instance
|
||||
"""
|
||||
try:
|
||||
# Import method discovery to get all actions
|
||||
from modules.workflows.processing.shared.methodDiscovery import discoverMethods
|
||||
from modules.services import getInterface as getServices
|
||||
from modules.datamodels.datamodelUam import User
|
||||
|
||||
# Create a temporary user context for discovery (will be filtered by RBAC later)
|
||||
# We need to discover methods, but we'll use a minimal user context
|
||||
# In production, this should use a system user or admin user
|
||||
try:
|
||||
# Try to get an admin user for discovery
|
||||
adminUsers = db.getRecordset("User", recordFilter={"roleLabel": "sysadmin"}, limit=1)
|
||||
if adminUsers:
|
||||
tempUser = User(**adminUsers[0])
|
||||
else:
|
||||
# Fallback: create minimal user context
|
||||
tempUser = User(id="system", roleLabel="sysadmin")
|
||||
except Exception:
|
||||
# Fallback: create minimal user context
|
||||
tempUser = User(id="system", roleLabel="sysadmin")
|
||||
|
||||
# Get services and discover methods
|
||||
services = getServices(tempUser, None)
|
||||
discoverMethods(services)
|
||||
|
||||
# Import methods catalog
|
||||
from modules.workflows.processing.shared.methodDiscovery import methods
|
||||
|
||||
# Collect all action IDs
|
||||
allActionIds = []
|
||||
for methodName, methodInfo in methods.items():
|
||||
# Skip duplicate entries (same method stored with full and short name)
|
||||
if methodName.startswith('Method'):
|
||||
continue
|
||||
|
||||
methodInstance = methodInfo['instance']
|
||||
methodActions = methodInstance.actions
|
||||
|
||||
for actionName in methodActions.keys():
|
||||
actionId = f"{methodInstance.name}.{actionName}"
|
||||
allActionIds.append(actionId)
|
||||
|
||||
logger.info(f"Discovered {len(allActionIds)} actions for RBAC rule creation")
|
||||
|
||||
# Define default action access by role
# SysAdmin, Admin, and User: access to all actions
# No role-specific restrictions (e.g., a read-only viewer role) are created here
|
||||
|
||||
actionRules = []
|
||||
|
||||
# All roles: Generic access to all actions
|
||||
# Using item=None grants access to all resources (all actions) in RESOURCE context
|
||||
|
||||
# SysAdmin: Access to all actions
|
||||
actionRules.append(AccessRule(
|
||||
roleLabel="sysadmin",
|
||||
context=AccessRuleContext.RESOURCE,
|
||||
item=None, # All resources (covers all actions)
|
||||
view=True
|
||||
))
|
||||
|
||||
# Admin: Access to all actions
|
||||
actionRules.append(AccessRule(
|
||||
roleLabel="admin",
|
||||
context=AccessRuleContext.RESOURCE,
|
||||
item=None, # All resources (covers all actions)
|
||||
view=True
|
||||
))
|
||||
|
||||
# User: Access to all actions (generic rights)
|
||||
actionRules.append(AccessRule(
|
||||
roleLabel="user",
|
||||
context=AccessRuleContext.RESOURCE,
|
||||
item=None, # All resources (covers all actions)
|
||||
view=True
|
||||
))
|
||||
|
||||
|
||||
# Create all action rules
|
||||
for rule in actionRules:
|
||||
db.recordCreate(AccessRule, rule)
|
||||
|
||||
logger.info(f"Created {len(actionRules)} action RBAC rules")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error creating action RBAC rules: {str(e)}", exc_info=True)
|
||||
# Don't fail bootstrap if action rules can't be created
|
||||
# They can be created manually or via migration script
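
The rules created above all use `item=None` in the RESOURCE context. A minimal sketch of how such rules could be evaluated for a single action is shown below, assuming that `item=None` matches every resource and that any other value must equal the `actionId` exactly. The `Rule` dataclass, `canViewAction`, and the `outlook.sendEmail` identifier are hypothetical; the project's actual matching logic is not shown in this diff.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    """Stand-in for AccessRule, limited to the fields set in createActionRules."""
    roleLabel: str
    context: str
    item: Optional[str]
    view: bool


def canViewAction(rules: List[Rule], roleLabel: str, actionId: str) -> bool:
    """Return True if any RESOURCE rule for the role covers actionId."""
    for rule in rules:
        if rule.roleLabel != roleLabel or rule.context != "RESOURCE":
            continue
        # Assumed semantics: item=None is a wildcard, otherwise exact match on actionId
        if rule.view and (rule.item is None or rule.item == actionId):
            return True
    return False


# The three wildcard rules created by createActionRules
defaultRules = [
    Rule("sysadmin", "RESOURCE", None, True),
    Rule("admin", "RESOURCE", None, True),
    Rule("user", "RESOURCE", None, True),
]

# A narrower, hypothetical rule that grants a single action
viewerRules = [Rule("viewer", "RESOURCE", "outlook.readEmails", True)]
```

Under these assumptions a role without any rule, such as `viewer` in `defaultRules`, gets no access, while a per-action rule like the one in `viewerRules` would grant exactly one `actionId`.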
|
||||
|
||||
|
||||
def _addMissingTableRules(db: DatabaseConnector, existingRules: List[Dict[str, Any]]) -> None:
|
||||
"""
|
||||
Add missing RBAC rules for tables that were added after initial bootstrap.
|
||||
|
|
|
|||
|
|
@ -1574,18 +1574,21 @@ class AppObjects:
|
|||
self,
|
||||
roleLabel: Optional[str] = None,
|
||||
context: Optional[AccessRuleContext] = None,
|
||||
item: Optional[str] = None
|
||||
) -> List[AccessRule]:
|
||||
item: Optional[str] = None,
|
||||
pagination: Optional[PaginationParams] = None
|
||||
) -> Union[List[AccessRule], PaginatedResult]:
|
||||
"""
|
||||
Get access rules with optional filters.
|
||||
Get access rules with optional filters and pagination.
|
||||
|
||||
Args:
|
||||
roleLabel: Optional role label filter
|
||||
context: Optional context filter
|
||||
item: Optional item filter
|
||||
pagination: Optional pagination parameters. If None, returns all items.
|
||||
|
||||
Returns:
|
||||
List of AccessRule objects
|
||||
If pagination is None: List[AccessRule]
|
||||
If pagination is provided: PaginatedResult with items and metadata
|
||||
"""
|
||||
try:
|
||||
recordFilter = {}
|
||||
|
|
@ -1596,11 +1599,55 @@ class AppObjects:
|
|||
if item:
|
||||
recordFilter["item"] = item
|
||||
|
||||
rules = self.db.getRecordset(AccessRule, recordFilter=recordFilter if recordFilter else None)
|
||||
return [AccessRule(**rule) for rule in rules]
|
||||
# Use RBAC filtering
|
||||
rules = getRecordsetWithRBAC(
|
||||
self.db,
|
||||
AccessRule,
|
||||
self.currentUser,
|
||||
recordFilter=recordFilter if recordFilter else None
|
||||
)
|
||||
|
||||
# Filter out database-specific fields
|
||||
filteredRules = []
|
||||
for rule in rules:
|
||||
cleanedRule = {k: v for k, v in rule.items() if not k.startswith("_")}
|
||||
filteredRules.append(cleanedRule)
|
||||
|
||||
# If no pagination requested, return all items
|
||||
if pagination is None:
|
||||
return [AccessRule(**rule) for rule in filteredRules]
|
||||
|
||||
# Apply filtering (if filters provided)
|
||||
if pagination.filters:
|
||||
filteredRules = self._applyFilters(filteredRules, pagination.filters)
|
||||
|
||||
# Apply sorting (in order of sortFields)
|
||||
if pagination.sort:
|
||||
filteredRules = self._applySorting(filteredRules, pagination.sort)
|
||||
|
||||
# Count total items after filters
|
||||
totalItems = len(filteredRules)
|
||||
totalPages = math.ceil(totalItems / pagination.pageSize) if totalItems > 0 else 0
|
||||
|
||||
# Apply pagination (skip/limit)
|
||||
startIdx = (pagination.page - 1) * pagination.pageSize
|
||||
endIdx = startIdx + pagination.pageSize
|
||||
pagedRules = filteredRules[startIdx:endIdx]
|
||||
|
||||
# Convert to model objects
|
||||
items = [AccessRule(**rule) for rule in pagedRules]
|
||||
|
||||
return PaginatedResult(
|
||||
items=items,
|
||||
totalItems=totalItems,
|
||||
totalPages=totalPages
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting access rules: {str(e)}")
|
||||
return []
|
||||
if pagination is None:
|
||||
return []
|
||||
else:
|
||||
return PaginatedResult(items=[], totalItems=0, totalPages=0)
|
||||
|
||||
def getAccessRulesForRoles(
|
||||
self,
|
||||
|
|
@ -1701,19 +1748,62 @@ class AppObjects:
|
|||
logger.error(f"Error getting role by label {roleLabel}: {str(e)}")
|
||||
return None
|
||||
|
||||
def getAllRoles(self) -> List[Role]:
|
||||
def getAllRoles(self, pagination: Optional[PaginationParams] = None) -> Union[List[Role], PaginatedResult]:
|
||||
"""
|
||||
Get all roles.
|
||||
Get all roles with optional pagination, sorting, and filtering.
|
||||
|
||||
Args:
|
||||
pagination: Optional pagination parameters. If None, returns all items.
|
||||
|
||||
Returns:
|
||||
List of Role objects
|
||||
If pagination is None: List[Role]
|
||||
If pagination is provided: PaginatedResult with items and metadata
|
||||
"""
|
||||
try:
|
||||
# Get all roles from database
|
||||
roles = self.db.getRecordset(Role)
|
||||
return [Role(**role) for role in roles]
|
||||
|
||||
# Filter out database-specific fields
|
||||
filteredRoles = []
|
||||
for role in roles:
|
||||
cleanedRole = {k: v for k, v in role.items() if not k.startswith("_")}
|
||||
filteredRoles.append(cleanedRole)
|
||||
|
||||
# If no pagination requested, return all items
|
||||
if pagination is None:
|
||||
return [Role(**role) for role in filteredRoles]
|
||||
|
||||
# Apply filtering (if filters provided)
|
||||
if pagination.filters:
|
||||
filteredRoles = self._applyFilters(filteredRoles, pagination.filters)
|
||||
|
||||
# Apply sorting (in order of sortFields)
|
||||
if pagination.sort:
|
||||
filteredRoles = self._applySorting(filteredRoles, pagination.sort)
|
||||
|
||||
# Count total items after filters
|
||||
totalItems = len(filteredRoles)
|
||||
totalPages = math.ceil(totalItems / pagination.pageSize) if totalItems > 0 else 0
|
||||
|
||||
# Apply pagination (skip/limit)
|
||||
startIdx = (pagination.page - 1) * pagination.pageSize
|
||||
endIdx = startIdx + pagination.pageSize
|
||||
pagedRoles = filteredRoles[startIdx:endIdx]
|
||||
|
||||
# Convert to model objects
|
||||
items = [Role(**role) for role in pagedRoles]
|
||||
|
||||
return PaginatedResult(
|
||||
items=items,
|
||||
totalItems=totalItems,
|
||||
totalPages=totalPages
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting all roles: {str(e)}")
|
||||
return []
|
||||
if pagination is None:
|
||||
return []
|
||||
else:
|
||||
return PaginatedResult(items=[], totalItems=0, totalPages=0)
|
||||
|
||||
def updateRole(self, roleId: str, role: Role) -> Role:
|
||||
"""
|
||||
|
|
|
|||
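The page arithmetic shared by `getAccessRules` and `getAllRoles` above can be looked at in isolation. The following is a sketch only: `paginate` is a hypothetical helper that returns a plain dict instead of the project's `PaginatedResult`, and the explicit range check is an assumption, since the hunks appear to rely on model validation for that.

```python
import math
from typing import Any, Dict, List


def paginate(items: List[Any], page: int, pageSize: int) -> Dict[str, Any]:
    """Slice an already filtered and sorted list into one 1-based page."""
    if page < 1 or pageSize < 1:
        # Assumption: the real code leaves this to PaginationParams validation;
        # the sketch checks explicitly so it is safe on its own.
        raise ValueError("page and pageSize must be >= 1")

    totalItems = len(items)
    totalPages = math.ceil(totalItems / pageSize) if totalItems > 0 else 0

    startIdx = (page - 1) * pageSize
    endIdx = startIdx + pageSize

    return {
        "items": items[startIdx:endIdx],
        "totalItems": totalItems,
        "totalPages": totalPages,
    }
```

Because Python slicing tolerates out-of-range indices, a page past the end yields an empty `items` list rather than an error.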
|
|
@ -8,10 +8,13 @@ Implements endpoints for role-based access control permissions.
|
|||
from fastapi import APIRouter, HTTPException, Depends, Query, Body, Path, Request
|
||||
from typing import Optional, List, Dict, Any
|
||||
import logging
|
||||
import json
|
||||
import math
|
||||
|
||||
from modules.auth import getCurrentUser, limiter
|
||||
from modules.datamodels.datamodelUam import User, UserPermissions, AccessLevel
|
||||
from modules.datamodels.datamodelRbac import AccessRuleContext, AccessRule, Role
|
||||
from modules.datamodels.datamodelPagination import PaginationParams, PaginatedResponse, PaginationMetadata
|
||||
from modules.interfaces.interfaceDbAppObjects import getInterface
|
||||
|
||||
# Configure logger
|
||||
|
|
@ -86,15 +89,16 @@ async def getPermissions(
|
|||
)
|
||||
|
||||
|
||||
@router.get("/rules", response_model=list)
|
||||
@router.get("/rules", response_model=PaginatedResponse)
|
||||
@limiter.limit("30/minute")
|
||||
async def getAccessRules(
|
||||
request: Request,
|
||||
roleLabel: Optional[str] = Query(None, description="Filter by role label"),
|
||||
context: Optional[str] = Query(None, description="Filter by context (DATA, UI, RESOURCE)"),
|
||||
item: Optional[str] = Query(None, description="Filter by item identifier"),
|
||||
pagination: Optional[str] = Query(None, description="JSON-encoded PaginationParams object"),
|
||||
currentUser: User = Depends(getCurrentUser)
|
||||
) -> list:
|
||||
) -> PaginatedResponse:
|
||||
"""
|
||||
Get access rules with optional filters.
|
||||
Only returns rules that the current user has permission to view.
|
||||
|
|
@ -143,15 +147,45 @@ async def getAccessRules(
|
|||
detail=f"Invalid context '{context}'. Must be one of: DATA, UI, RESOURCE"
|
||||
)
|
||||
|
||||
# Get rules
|
||||
rules = interface.getAccessRules(
|
||||
# Parse pagination parameter
|
||||
paginationParams = None
|
||||
if pagination:
|
||||
try:
|
||||
paginationDict = json.loads(pagination)
|
||||
paginationParams = PaginationParams(**paginationDict) if paginationDict else None
|
||||
except (json.JSONDecodeError, ValueError, TypeError) as e:  # TypeError: JSON payload is not an object
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=f"Invalid pagination parameter: {str(e)}"
|
||||
)
|
||||
|
||||
# Get rules with optional pagination
|
||||
result = interface.getAccessRules(
|
||||
roleLabel=roleLabel,
|
||||
context=accessContext,
|
||||
item=item
|
||||
item=item,
|
||||
pagination=paginationParams
|
||||
)
|
||||
|
||||
# Convert to dict for JSON serialization
|
||||
return [rule.model_dump() for rule in rules]
|
||||
# If pagination was requested, result is PaginatedResult
|
||||
# If no pagination, result is List[AccessRule]
|
||||
if paginationParams:
|
||||
return PaginatedResponse(
|
||||
items=[rule.model_dump() for rule in result.items],
|
||||
pagination=PaginationMetadata(
|
||||
currentPage=paginationParams.page,
|
||||
pageSize=paginationParams.pageSize,
|
||||
totalItems=result.totalItems,
|
||||
totalPages=result.totalPages,
|
||||
sort=paginationParams.sort,
|
||||
filters=paginationParams.filters
|
||||
)
|
||||
)
|
||||
else:
|
||||
return PaginatedResponse(
|
||||
items=[rule.model_dump() for rule in result],
|
||||
pagination=None
|
||||
)
|
||||
|
||||
except HTTPException:
|
||||
raise
|
||||
|
|
@ -489,12 +523,13 @@ def _ensureAdminAccess(currentUser: User) -> None:
|
|||
)
|
||||
|
||||
|
||||
@router.get("/roles", response_model=List[Dict[str, Any]])
|
||||
@router.get("/roles", response_model=PaginatedResponse)
|
||||
@limiter.limit("60/minute")
|
||||
async def listRoles(
|
||||
request: Request,
|
||||
pagination: Optional[str] = Query(None, description="JSON-encoded PaginationParams object"),
|
||||
currentUser: User = Depends(getCurrentUser)
|
||||
) -> List[Dict[str, Any]]:
|
||||
) -> PaginatedResponse:
|
||||
"""
|
||||
Get list of all available roles with metadata.
|
||||
|
||||
|
|
@ -506,14 +541,27 @@ async def listRoles(
|
|||
|
||||
interface = getInterface(currentUser)
|
||||
|
||||
# Get all roles from database
|
||||
dbRoles = interface.getAllRoles()
|
||||
# Parse pagination parameter
|
||||
paginationParams = None
|
||||
if pagination:
|
||||
try:
|
||||
paginationDict = json.loads(pagination)
|
||||
paginationParams = PaginationParams(**paginationDict) if paginationDict else None
|
||||
except (json.JSONDecodeError, ValueError, TypeError) as e:  # TypeError: JSON payload is not an object
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=f"Invalid pagination parameter: {str(e)}"
|
||||
)
|
||||
|
||||
# Get all roles from database (without pagination) to enrich with user counts and add custom roles
|
||||
# Note: We get all roles first because we need to add custom roles before pagination
|
||||
dbRoles = interface.getAllRoles(pagination=None)
|
||||
|
||||
# Get all users to count role assignments
|
||||
# Since _ensureAdminAccess ensures user is sysadmin or admin,
|
||||
# and getUsersByMandate returns all users for sysadmin regardless of mandateId,
|
||||
# we can pass the current user's mandateId (for sysadmin it will be ignored by RBAC)
|
||||
allUsers = interface.getUsersByMandate(currentUser.mandateId or "")
|
||||
allUsers = interface.getUsersByMandate(currentUser.mandateId or "", pagination=None)
|
||||
|
||||
# Count users per role
|
||||
roleCounts: Dict[str, int] = {}
|
||||
|
|
@ -544,7 +592,45 @@ async def listRoles(
|
|||
"isSystemRole": False
|
||||
})
|
||||
|
||||
return result
|
||||
# Apply filtering and sorting if pagination requested
|
||||
if paginationParams:
|
||||
# Apply filtering (if filters provided)
|
||||
if paginationParams.filters:
|
||||
# Use the interface's filter method
|
||||
filteredResult = interface._applyFilters(result, paginationParams.filters)
|
||||
else:
|
||||
filteredResult = result
|
||||
|
||||
# Apply sorting (in order of sortFields)
|
||||
if paginationParams.sort:
|
||||
sortedResult = interface._applySorting(filteredResult, paginationParams.sort)
|
||||
else:
|
||||
sortedResult = filteredResult
|
||||
|
||||
# Apply pagination
|
||||
totalItems = len(sortedResult)
|
||||
totalPages = math.ceil(totalItems / paginationParams.pageSize) if totalItems > 0 else 0
|
||||
startIdx = (paginationParams.page - 1) * paginationParams.pageSize
|
||||
endIdx = startIdx + paginationParams.pageSize
|
||||
paginatedResult = sortedResult[startIdx:endIdx]
|
||||
|
||||
return PaginatedResponse(
|
||||
items=paginatedResult,
|
||||
pagination=PaginationMetadata(
|
||||
currentPage=paginationParams.page,
|
||||
pageSize=paginationParams.pageSize,
|
||||
totalItems=totalItems,
|
||||
totalPages=totalPages,
|
||||
sort=paginationParams.sort,
|
||||
filters=paginationParams.filters
|
||||
)
|
||||
)
|
||||
else:
|
||||
# No pagination - return all roles
|
||||
return PaginatedResponse(
|
||||
items=result,
|
||||
pagination=None
|
||||
)
|
||||
|
||||
except HTTPException:
|
||||
raise
|
||||
|
|
|
|||
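Both routes above accept `pagination` as a JSON-encoded query string and turn parse failures into a 400 response. A minimal standard-library sketch of that parsing step follows. `PageRequest` is a hypothetical stand-in for `PaginationParams` with only two fields, and `parsePagination` is an illustrative name, not a function from the codebase.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRequest:
    # Hypothetical stand-in for PaginationParams with only two fields.
    page: int = 1
    pageSize: int = 20


def parsePagination(raw: Optional[str]) -> Optional[PageRequest]:
    """Return None for a missing or empty parameter, else a PageRequest.

    Raises ValueError for malformed JSON or a payload that is not an object
    with the expected keys, so a caller can map it to a 400 response.
    """
    if not raw:
        return None
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not data:
        return None
    if not isinstance(data, dict):
        raise ValueError("pagination must be a JSON object")
    try:
        return PageRequest(**data)
    except TypeError as e:
        # Unknown keys surface as TypeError from the dataclass constructor.
        raise ValueError(str(e)) from e
```

Funnelling every failure into `ValueError` means the route needs only one `except` clause to produce the 400.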
|
|
@ -572,3 +572,247 @@ async def delete_file_from_message(
|
|||
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
||||
detail=f"Error deleting file reference: {str(e)}"
|
||||
)
|
||||
|
||||
|
||||
# Action Discovery Endpoints
|
||||
|
||||
@router.get("/actions", response_model=Dict[str, Any])
|
||||
@limiter.limit("120/minute")
|
||||
async def get_all_actions(
|
||||
request: Request,
|
||||
currentUser: User = Depends(getCurrentUser)
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Get all available workflow actions for the current user (filtered by RBAC).
|
||||
|
||||
Returns:
|
||||
- Dictionary with actions grouped by module, filtered by RBAC permissions
|
||||
|
||||
Example response:
|
||||
{
|
||||
"actions": [
|
||||
{
|
||||
"module": "outlook",
|
||||
"actionId": "outlook.readEmails",
|
||||
"name": "readEmails",
|
||||
"description": "Read emails and metadata from a mailbox folder",
|
||||
"parameters": {...}
|
||||
},
|
||||
...
|
||||
]
|
||||
}
|
||||
"""
|
||||
try:
|
||||
from modules.services import getInterface as getServices
|
||||
from modules.workflows.processing.shared.methodDiscovery import discoverMethods
|
||||
|
||||
# Get services and discover methods
|
||||
services = getServices(currentUser, None)
|
||||
discoverMethods(services)
|
||||
|
||||
# Import methods catalog
|
||||
from modules.workflows.processing.shared.methodDiscovery import methods
|
||||
|
||||
# Collect all actions from all methods
|
||||
allActions = []
|
||||
for methodName, methodInfo in methods.items():
|
||||
# Skip duplicate entries (same method stored with full and short name)
|
||||
if methodName.startswith('Method'):
|
||||
continue
|
||||
|
||||
methodInstance = methodInfo['instance']
|
||||
methodActions = methodInstance.actions
|
||||
|
||||
for actionName, actionInfo in methodActions.items():
|
||||
# Build action response
|
||||
actionResponse = {
|
||||
"module": methodInstance.name,
|
||||
"actionId": f"{methodInstance.name}.{actionName}",
|
||||
"name": actionName,
|
||||
"description": actionInfo.get('description', ''),
|
||||
"parameters": actionInfo.get('parameters', {})
|
||||
}
|
||||
allActions.append(actionResponse)
|
||||
|
||||
return {
|
||||
"actions": allActions
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting all actions: {str(e)}", exc_info=True)
|
||||
raise HTTPException(
|
||||
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
||||
detail=f"Failed to get actions: {str(e)}"
|
||||
)
|
||||
|
||||
|
||||
@router.get("/actions/{method}", response_model=Dict[str, Any])
|
||||
@limiter.limit("120/minute")
|
||||
async def get_method_actions(
|
||||
request: Request,
|
||||
method: str = Path(..., description="Method name (e.g., 'outlook', 'sharepoint')"),
|
||||
currentUser: User = Depends(getCurrentUser)
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Get all available actions for a specific method (filtered by RBAC).
|
||||
|
||||
Path Parameters:
|
||||
- method: Method name (e.g., 'outlook', 'sharepoint', 'ai')
|
||||
|
||||
Returns:
|
||||
- Dictionary with actions for the specified method
|
||||
|
||||
Example response:
|
||||
{
|
||||
"module": "outlook",
|
||||
"actions": [
|
||||
{
|
||||
"actionId": "outlook.readEmails",
|
||||
"name": "readEmails",
|
||||
"description": "Read emails and metadata from a mailbox folder",
|
||||
"parameters": {...}
|
||||
},
|
||||
...
|
||||
]
|
||||
}
|
||||
"""
|
||||
try:
|
||||
from modules.services import getInterface as getServices
|
||||
from modules.workflows.processing.shared.methodDiscovery import discoverMethods
|
||||
|
||||
# Get services and discover methods
|
||||
services = getServices(currentUser, None)
|
||||
discoverMethods(services)
|
||||
|
||||
# Import methods catalog
|
||||
from modules.workflows.processing.shared.methodDiscovery import methods
|
||||
|
||||
# Find method instance
|
||||
methodInstance = None
|
||||
for methodName, methodInfo in methods.items():
|
||||
if methodInfo['instance'].name == method:
|
||||
methodInstance = methodInfo['instance']
|
||||
break
|
||||
|
||||
if not methodInstance:
|
||||
raise HTTPException(
|
||||
status_code=status.HTTP_404_NOT_FOUND,
|
||||
detail=f"Method '{method}' not found"
|
||||
)
|
||||
|
||||
# Collect actions for this method
|
||||
actions = []
|
||||
methodActions = methodInstance.actions
|
||||
|
||||
for actionName, actionInfo in methodActions.items():
|
||||
actionResponse = {
|
||||
"actionId": f"{methodInstance.name}.{actionName}",
|
||||
"name": actionName,
|
||||
"description": actionInfo.get('description', ''),
|
||||
"parameters": actionInfo.get('parameters', {})
|
||||
}
|
||||
actions.append(actionResponse)
|
||||
|
||||
return {
|
||||
"module": methodInstance.name,
|
||||
"description": methodInstance.description,
|
||||
"actions": actions
|
||||
}
|
||||
|
||||
except HTTPException:
|
||||
raise
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting actions for method {method}: {str(e)}", exc_info=True)
|
||||
raise HTTPException(
|
||||
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
||||
detail=f"Failed to get actions for method {method}: {str(e)}"
|
||||
)
|
||||
|
||||
|
||||
@router.get("/actions/{method}/{action}", response_model=Dict[str, Any])
|
||||
@limiter.limit("120/minute")
|
||||
async def get_action_schema(
|
||||
request: Request,
|
||||
method: str = Path(..., description="Method name (e.g., 'outlook', 'sharepoint')"),
|
||||
action: str = Path(..., description="Action name (e.g., 'readEmails', 'uploadDocument')"),
|
||||
currentUser: User = Depends(getCurrentUser)
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Get action schema with parameter definitions for a specific action.
|
||||
|
||||
Path Parameters:
|
||||
- method: Method name (e.g., 'outlook', 'sharepoint', 'ai')
|
||||
- action: Action name (e.g., 'readEmails', 'uploadDocument')
|
||||
|
||||
Returns:
|
||||
- Action schema with full parameter definitions
|
||||
|
||||
Example response:
|
||||
{
|
||||
"method": "outlook",
|
||||
"action": "readEmails",
|
||||
"actionId": "outlook.readEmails",
|
||||
"description": "Read emails and metadata from a mailbox folder",
|
||||
"parameters": {
|
||||
"connectionReference": {
|
||||
"name": "connectionReference",
|
||||
"type": "str",
|
||||
"frontendType": "userConnection",
|
||||
"frontendOptions": "user.connection",
|
||||
"required": true,
|
||||
"description": "Microsoft connection label"
|
||||
},
|
||||
...
|
||||
}
|
||||
}
|
||||
"""
|
||||
try:
|
||||
from modules.services import getInterface as getServices
|
||||
from modules.workflows.processing.shared.methodDiscovery import discoverMethods
|
||||
|
||||
# Get services and discover methods
|
||||
services = getServices(currentUser, None)
|
||||
discoverMethods(services)
|
||||
|
||||
# Import methods catalog
|
||||
from modules.workflows.processing.shared.methodDiscovery import methods
|
||||
|
||||
# Find method instance
|
||||
methodInstance = None
|
||||
for methodName, methodInfo in methods.items():
|
||||
if methodInfo['instance'].name == method:
|
||||
methodInstance = methodInfo['instance']
|
||||
break
|
||||
|
||||
if not methodInstance:
|
||||
raise HTTPException(
|
||||
status_code=status.HTTP_404_NOT_FOUND,
|
||||
detail=f"Method '{method}' not found"
|
||||
)
|
||||
|
||||
# Get action
|
||||
methodActions = methodInstance.actions
|
||||
if action not in methodActions:
|
||||
raise HTTPException(
|
||||
status_code=status.HTTP_404_NOT_FOUND,
|
||||
detail=f"Action '{action}' not found in method '{method}'"
|
||||
)
|
||||
|
||||
actionInfo = methodActions[action]
|
||||
|
||||
return {
|
||||
"method": methodInstance.name,
|
||||
"action": action,
|
||||
"actionId": f"{methodInstance.name}.{action}",
|
||||
"description": actionInfo.get('description', ''),
|
||||
"parameters": actionInfo.get('parameters', {})
|
||||
}
|
||||
|
||||
except HTTPException:
|
||||
raise
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting action schema for {method}.{action}: {str(e)}", exc_info=True)
|
||||
raise HTTPException(
|
||||
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
||||
detail=f"Failed to get action schema: {str(e)}"
|
||||
)
|
||||
|
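The flattening loop in `get_all_actions` above can be exercised without the service layer. The sketch below assumes, as the comment in the hunk states, that each method is stored under both a full name starting with `Method` and a short name. `FakeMethod` and `collectActions` are hypothetical names introduced for the example.

```python
from typing import Any, Dict, List


class FakeMethod:
    # Hypothetical stand-in for a discovered method instance.
    def __init__(self, name: str, actions: Dict[str, Dict[str, Any]]):
        self.name = name
        self.actions = actions


def collectActions(methods: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten a methods catalog into a list of action descriptors."""
    allActions = []
    for methodName, methodInfo in methods.items():
        # Each method is assumed to be registered twice (full and short name);
        # skip the full-name entry so actions are not duplicated.
        if methodName.startswith("Method"):
            continue
        instance = methodInfo["instance"]
        for actionName, actionInfo in instance.actions.items():
            allActions.append({
                "module": instance.name,
                "actionId": f"{instance.name}.{actionName}",
                "name": actionName,
                "description": actionInfo.get("description", ""),
                "parameters": actionInfo.get("parameters", {}),
            })
    return allActions


outlook = FakeMethod("outlook", {"readEmails": {"description": "Read emails"}})
catalog = {"MethodOutlook": {"instance": outlook}, "outlook": {"instance": outlook}}
actions = collectActions(catalog)
```

The two detail endpoints match on `instance.name` and stop at the first hit, so they do not need the prefix check.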
|
@ -57,6 +57,9 @@ class Services:
|
|||
from modules.interfaces.interfaceDbComponentObjects import getInterface as getComponentInterface
|
||||
self.interfaceDbComponent = getComponentInterface(user)
|
||||
|
||||
# Expose RBAC directly on services for convenience
|
||||
self.rbac = self.interfaceDbApp.rbac if self.interfaceDbApp else None
|
||||
|
||||
# Initialize service packages
|
||||
|
||||
from .serviceExtraction.mainServiceExtraction import ExtractionService
|
||||
|
|
|
|||
78
modules/services/serviceAi/README_MODULE_STRUCTURE.md
Normal file
|
|
@ -0,0 +1,78 @@
|
|||
# Module Structure - serviceAi
|
||||
|
||||
## Overview

The `mainServiceAi.py` module has been split into several submodules to improve readability.
|
||||
|
||||
## Module Structure

### Main Module

- **mainServiceAi.py** (~800 lines)
  - Initialization (`__init__`, `create`, `ensureAiObjectsInitialized`)
  - Public API (`callAiPlanning`, `callAiContent`)
  - Routing to submodules
  - Helper methods
|
||||
|
||||
### Submodules

1. **subJsonResponseHandling.py** (already existed)
   - JSON Response Merging
   - Section Merging
   - Fragment Detection

2. **subResponseParsing.py** (~200 lines)
   - `ResponseParser.extractSectionsFromResponse()` - Extracts sections from AI responses
   - `ResponseParser.shouldContinueGeneration()` - Decides whether generation should continue
   - `ResponseParser._isStuckInLoop()` - Loop detection
   - `ResponseParser.extractDocumentMetadata()` - Extracts metadata
   - `ResponseParser.buildFinalResultFromSections()` - Builds the final JSON

3. **subDocumentIntents.py** (~300 lines)
   - `DocumentIntentAnalyzer.clarifyDocumentIntents()` - Analyzes document intents
   - `DocumentIntentAnalyzer.resolvePreExtractedDocument()` - Resolves pre-extracted documents
   - `DocumentIntentAnalyzer._buildIntentAnalysisPrompt()` - Builds the intent analysis prompt

4. **subContentExtraction.py** (~600 lines)
   - `ContentExtractor.extractAndPrepareContent()` - Extracts and prepares content
   - `ContentExtractor.extractTextFromImage()` - Vision AI for images
   - `ContentExtractor.processTextContentWithAi()` - AI processing of text
   - `ContentExtractor._isBinary()` - Helper for the binary check

5. **subStructureGeneration.py** (~200 lines)
   - `StructureGenerator.generateStructure()` - Generates the document structure
   - `StructureGenerator._buildStructurePrompt()` - Builds the structure prompt

6. **subStructureFilling.py** (~400 lines)
   - `StructureFiller.fillStructure()` - Fills the structure with content
   - `StructureFiller._buildSectionGenerationPrompt()` - Builds the section generation prompt
   - `StructureFiller._findContentPartById()` - Helper for ContentPart lookup
   - `StructureFiller._needsAggregation()` - Decides whether aggregation is needed

7. **subAiCallLooping.py** (~400 lines)
   - `AiCallLooper.callAiWithLooping()` - Main looping logic
   - `AiCallLooper._defineKpisFromPrompt()` - KPI definition
|
||||
|
||||
## Usage

All submodules are used through the main `AiService` module:

```python
# Initialization
aiService = await AiService.create(serviceCenter)

# Submodules are initialized automatically
# aiService.responseParser
# aiService.intentAnalyzer
# aiService.contentExtractor
# etc.
```
|
||||
|
||||
## Migration

The public API remains unchanged. Internal methods have been moved into submodules:

- `_extractSectionsFromResponse` → `responseParser.extractSectionsFromResponse`
- `_clarifyDocumentIntents` → `intentAnalyzer.clarifyDocumentIntents`
- `_extractAndPrepareContent` → `contentExtractor.extractAndPrepareContent`
- etc.
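
A runnable sketch of one such move, assuming the old private name is kept as a thin wrapper. Both classes are simplified and hypothetical: the splitting logic is invented for illustration and is not the real `ResponseParser` behaviour.

```python
from typing import List


class ResponseParser:
    # Hypothetical, simplified submodule: splits a response into sections.
    def extractSectionsFromResponse(self, response: str) -> List[str]:
        return [part.strip() for part in response.split("\n\n") if part.strip()]


class AiService:
    def __init__(self):
        self.responseParser = ResponseParser()

    def _extractSectionsFromResponse(self, response: str) -> List[str]:
        # Old private name kept as a thin wrapper; logic lives in the submodule.
        return self.responseParser.extractSectionsFromResponse(response)


sections = AiService()._extractSectionsFromResponse("intro\n\nbody\n\n")
```

Callers that still use the old name keep working, while new code can call the submodule directly.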
|
||||
|
||||
126
modules/services/serviceAi/REFACTORING_PLAN.md
Normal file
|
|
@ -0,0 +1,126 @@
|
|||
# Refactoring Plan for mainServiceAi.py

## Goal

Split the 3000-line module into manageable submodules (~300-600 lines per module).
|
||||
|
||||
## Proposed Structure

### Already created:

1. ✅ `subResponseParsing.py` - ResponseParser class
2. ✅ `subDocumentIntents.py` - DocumentIntentAnalyzer class

### Still to be created:

3. `subContentExtraction.py` - ContentExtractor class
   - `extractAndPrepareContent()` (~490 lines)
   - `extractTextFromImage()` (~55 lines)
   - `processTextContentWithAi()` (~72 lines)
   - `_isBinary()` (~10 lines)

4. `subStructureGeneration.py` - StructureGenerator class
   - `generateStructure()` (~60 lines)
   - `_buildStructurePrompt()` (~130 lines)

5. `subStructureFilling.py` - StructureFiller class
   - `fillStructure()` (~290 lines)
   - `_buildSectionGenerationPrompt()` (~185 lines)
   - `_findContentPartById()` (~5 lines)
   - `_needsAggregation()` (~20 lines)

6. `subAiCallLooping.py` - AiCallLooper class
   - `callAiWithLooping()` (~405 lines)
   - `_defineKpisFromPrompt()` (~92 lines)
|
||||
|
||||
## Refactoring Steps for mainServiceAi.py

### Step 1: Extend Submodule Initialization
|
||||
|
||||
```python
|
||||
def _initializeSubmodules(self):
|
||||
"""Initialize all submodules after aiObjects is ready."""
|
||||
if self.aiObjects is None:
|
||||
raise RuntimeError("aiObjects must be initialized before initializing submodules")
|
||||
|
||||
if self.extractionService is None:
|
||||
logger.info("Initializing ExtractionService...")
|
||||
self.extractionService = ExtractionService(self.services)
|
||||
|
||||
# Initialize the new submodules
|
||||
from modules.services.serviceAi.subResponseParsing import ResponseParser
|
||||
from modules.services.serviceAi.subDocumentIntents import DocumentIntentAnalyzer
|
||||
from modules.services.serviceAi.subContentExtraction import ContentExtractor
|
||||
from modules.services.serviceAi.subStructureGeneration import StructureGenerator
|
||||
from modules.services.serviceAi.subStructureFilling import StructureFiller
|
||||
|
||||
if not hasattr(self, 'responseParser'):
|
||||
self.responseParser = ResponseParser(self.services)
|
||||
|
||||
if not hasattr(self, 'intentAnalyzer'):
|
||||
self.intentAnalyzer = DocumentIntentAnalyzer(self.services, self)
|
||||
|
||||
if not hasattr(self, 'contentExtractor'):
|
||||
self.contentExtractor = ContentExtractor(self.services, self)
|
||||
|
||||
if not hasattr(self, 'structureGenerator'):
|
||||
self.structureGenerator = StructureGenerator(self.services, self)
|
||||
|
||||
if not hasattr(self, 'structureFiller'):
|
||||
self.structureFiller = StructureFiller(self.services, self)
|
||||
```
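
One detail of the guard above: `hasattr` only reports a missing attribute if `__init__` never assigned it. Whether `mainServiceAi.py` pre-declares these attributes is not visible here, so the following is a sketch under that assumption, using a hypothetical `Service` class that is not part of the codebase. It shows a variant that also works when attributes start out as `None`.

```python
class Service:
    def __init__(self):
        # Pre-declared as None, so hasattr(self, "responseParser") is already True.
        self.responseParser = None

    def initializeSubmodules(self):
        # getattr(..., None) treats "missing" and "set to None" the same way.
        if getattr(self, "responseParser", None) is None:
            self.responseParser = object()  # placeholder for a real submodule


svc = Service()
hadAttrBefore = hasattr(svc, "responseParser")
svc.initializeSubmodules()
firstParser = svc.responseParser
svc.initializeSubmodules()  # second call keeps the existing instance
```

With this check the initializer stays idempotent regardless of how `__init__` declares the attributes.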
|
||||
|
||||
### Step 2: Replace Methods with Delegation

**Example for Response Parsing:**

```python
# OLD:
def _extractSectionsFromResponse(self, ...):
    # 100 lines of code
    ...

# NEW:
def _extractSectionsFromResponse(self, ...):
    return self.responseParser.extractSectionsFromResponse(...)
```
|
||||
|
||||
**Example for Document Intents:**

```python
# OLD:
async def _clarifyDocumentIntents(self, ...):
    # 100 lines of code
    ...

# NEW:
async def _clarifyDocumentIntents(self, ...):
    return await self.intentAnalyzer.clarifyDocumentIntents(...)
```
|
||||
|
||||
### Step 3: Keep Helper Methods

Small helper methods stay in the main module:

- `_buildPromptWithPlaceholders()`
- `_getIntentForDocument()`
- `_shouldSkipContentPart()`
- `_determineDocumentName()`

### Step 4: Leave the Public API Unchanged

The public API (`callAiPlanning`, `callAiContent`) remains unchanged.
|
||||
|
||||
## Expected Resulting Sizes

- `mainServiceAi.py`: ~800-1000 lines (down from 3016)
- `subResponseParsing.py`: ~200 lines ✅
- `subDocumentIntents.py`: ~300 lines ✅
- `subContentExtraction.py`: ~600 lines
- `subStructureGeneration.py`: ~200 lines
- `subStructureFilling.py`: ~400 lines
- `subAiCallLooping.py`: ~500 lines

**Total: ~3000 lines** (the same, but better organized)
|
||||
|
||||
## Benefits

1. **Clarity**: Each module has a clear responsibility
2. **Maintainability**: Changes are localized
3. **Testability**: Modules can be tested individually
4. **Reusability**: Modules can be used in other contexts
|
||||
|
||||
File diff suppressed because it is too large
533
modules/services/serviceAi/subAiCallLooping.py
Normal file
|
|
@ -0,0 +1,533 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
AI Call Looping Module
|
||||
|
||||
Handles AI calls with looping and repair logic, including:
|
||||
- Looping with JSON repair and continuation
|
||||
- KPI definition and tracking
|
||||
- Progress tracking and iteration management
|
||||
"""
|
||||
import json
|
||||
import logging
|
||||
from typing import Dict, Any, List, Optional, Callable
|
||||
|
||||
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum, PriorityEnum, ProcessingModeEnum, JsonAccumulationState
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from modules.shared.jsonUtils import buildContinuationContext, extractJsonString
|
||||
from modules.services.serviceAi.subJsonResponseHandling import JsonResponseHandler
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class AiCallLooper:
|
||||
"""Handles AI calls with looping and repair logic."""
|
||||
|
||||
def __init__(self, services, aiService, responseParser):
|
||||
"""Initialize AiCallLooper with service center, AI service, and response parser access."""
|
||||
self.services = services
|
||||
self.aiService = aiService
|
||||
self.responseParser = responseParser
|
||||
|
||||
async def callAiWithLooping(
|
||||
self,
|
||||
prompt: str,
|
||||
options: AiCallOptions,
|
||||
debugPrefix: str = "ai_call",
|
||||
promptBuilder: Optional[Callable] = None,
|
||||
promptArgs: Optional[Dict[str, Any]] = None,
|
||||
operationId: Optional[str] = None,
|
||||
userPrompt: Optional[str] = None,
|
||||
contentParts: Optional[List[ContentPart]] = None # ARCHITECTURE: Support ContentParts for large content
|
||||
) -> str:
|
||||
"""
|
||||
Shared core function for AI calls with repair-based looping system.
|
||||
Automatically repairs broken JSON and continues generation seamlessly.
|
||||
|
||||
Args:
|
||||
prompt: The prompt to send to AI
|
||||
options: AI call configuration options
|
||||
debugPrefix: Prefix for debug file names
|
||||
promptBuilder: Optional function to rebuild prompts for continuation
|
||||
promptArgs: Optional arguments for prompt builder
|
||||
operationId: Optional operation ID for progress tracking
|
||||
userPrompt: Optional user prompt for KPI definition
|
||||
contentParts: Optional content parts for first iteration
|
||||
|
||||
Returns:
|
||||
Complete AI response after all iterations
|
||||
"""
|
||||
maxIterations = 50 # Prevent infinite loops
|
||||
iteration = 0
|
||||
allSections = [] # Accumulate all sections across iterations
|
||||
lastRawResponse = None # Store last raw JSON response for continuation
|
||||
documentMetadata = None # Store document metadata (title, filename) from first iteration
|
||||
accumulationState = None # Track accumulation state for string accumulation
|
||||
|
||||
# Get parent operation ID for iteration operations (parentId should be operationId, not log entry ID)
|
||||
parentOperationId = operationId # Use the parent's operationId directly
|
||||
|
||||
while iteration < maxIterations:
|
||||
iteration += 1
|
||||
|
||||
# Create separate operation for each iteration with parent reference
|
||||
iterationOperationId = None
|
||||
if operationId:
|
||||
iterationOperationId = f"{operationId}_iter_{iteration}"
|
||||
self.services.chat.progressLogStart(
|
||||
iterationOperationId,
|
||||
"AI Call",
|
||||
f"Iteration {iteration}",
|
||||
"",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
# Build iteration prompt
|
||||
# CRITICAL: Build continuation prompt if we have sections OR if we have a previous response (even if broken)
|
||||
# This ensures continuation prompts are built even when JSON is so broken that no sections can be extracted
|
||||
if (len(allSections) > 0 or lastRawResponse) and promptBuilder and promptArgs:
|
||||
# This is a continuation - build continuation context with raw JSON and rebuild prompt
|
||||
continuationContext = buildContinuationContext(allSections, lastRawResponse)
|
||||
if not lastRawResponse:
|
||||
logger.warning(f"Iteration {iteration}: No previous response available for continuation!")
|
||||
|
||||
# Filter promptArgs to only include parameters that buildGenerationPrompt accepts
|
||||
# buildGenerationPrompt accepts: outputFormat, userPrompt, title, extracted_content, continuationContext, services
|
||||
filteredPromptArgs = {
|
||||
k: v for k, v in promptArgs.items()
|
||||
if k in ['outputFormat', 'userPrompt', 'title', 'extracted_content', 'services']
|
||||
}
|
||||
# Always include services if available
|
||||
if not filteredPromptArgs.get('services') and hasattr(self, 'services'):
|
||||
filteredPromptArgs['services'] = self.services
|
||||
|
||||
# Rebuild prompt with continuation context using the provided prompt builder
|
||||
iterationPrompt = await promptBuilder(**filteredPromptArgs, continuationContext=continuationContext)
|
||||
else:
|
||||
# First iteration - use original prompt
|
||||
iterationPrompt = prompt
|
||||
|
||||
# Make AI call
|
||||
try:
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogUpdate(iterationOperationId, 0.3, "Calling AI model")
|
||||
# ARCHITECTURE: Pass ContentParts directly to AiCallRequest
|
||||
# This allows model-aware chunking to handle large content properly
|
||||
# ContentParts are only passed in first iteration (continuations don't need them)
|
||||
request = AiCallRequest(
|
||||
prompt=iterationPrompt,
|
||||
context="",
|
||||
options=options,
|
||||
contentParts=contentParts if iteration == 1 else None # Only pass ContentParts in first iteration
|
||||
)
|
||||
|
||||
# Write the ACTUAL prompt sent to AI
|
||||
if iteration == 1:
|
||||
self.services.utils.writeDebugFile(iterationPrompt, f"{debugPrefix}_prompt")
|
||||
else:
|
||||
self.services.utils.writeDebugFile(iterationPrompt, f"{debugPrefix}_prompt_iteration_{iteration}")
|
||||
|
||||
response = await self.aiService.callAi(request)
|
||||
result = response.content
|
||||
|
||||
# Track bytes for progress reporting
|
||||
bytesReceived = len(result.encode('utf-8')) if result else 0
|
||||
totalBytesSoFar = sum(len(section.get('content', '').encode('utf-8')) if isinstance(section.get('content'), str) else 0 for section in allSections) + bytesReceived
|
||||
|
||||
# Update progress after AI call with byte information
|
||||
if iterationOperationId:
|
||||
# Format bytes for display (B, kB, or MB)
|
||||
if totalBytesSoFar < 1024:
|
||||
bytesDisplay = f"{totalBytesSoFar}B"
|
||||
elif totalBytesSoFar < 1024 * 1024:
|
||||
bytesDisplay = f"{totalBytesSoFar / 1024:.1f}kB"
|
||||
else:
|
||||
bytesDisplay = f"{totalBytesSoFar / (1024 * 1024):.1f}MB"
|
||||
self.services.chat.progressLogUpdate(iterationOperationId, 0.6, f"AI response received ({bytesDisplay})")
|
||||
|
||||
# Write raw AI response to debug file
|
||||
if iteration == 1:
|
||||
self.services.utils.writeDebugFile(result, f"{debugPrefix}_response")
|
||||
else:
|
||||
self.services.utils.writeDebugFile(result, f"{debugPrefix}_response_iteration_{iteration}")
|
||||
|
||||
# Emit stats for this iteration (only if workflow exists and has id)
|
||||
if self.services.workflow and hasattr(self.services.workflow, 'id') and self.services.workflow.id:
|
||||
try:
|
||||
self.services.chat.storeWorkflowStat(
|
||||
self.services.workflow,
|
||||
response,
|
||||
f"ai.call.{debugPrefix}.iteration_{iteration}"
|
||||
)
|
||||
except Exception as statError:
|
||||
# Don't break the main loop if stat storage fails
|
||||
logger.warning(f"Failed to store workflow stat: {str(statError)}")
|
||||
|
||||
# Check for error response using generic error detection (errorCount > 0 or modelName == "error")
|
||||
if hasattr(response, 'errorCount') and response.errorCount > 0:
|
||||
errorMsg = f"Iteration {iteration}: Error response detected (errorCount={response.errorCount}), stopping loop: {result[:200] if result else 'empty'}"
|
||||
logger.error(errorMsg)
|
||||
break
|
||||
|
||||
if hasattr(response, 'modelName') and response.modelName == "error":
|
||||
errorMsg = f"Iteration {iteration}: Error response detected (modelName=error), stopping loop: {result[:200] if result else 'empty'}"
|
||||
logger.error(errorMsg)
|
||||
break
|
||||
|
||||
if not result or not result.strip():
|
||||
logger.warning(f"Iteration {iteration}: Empty response, stopping")
|
||||
break
|
||||
|
||||
# Check if this is a text response (not document generation)
|
||||
# Text responses don't need JSON parsing - return immediately after first successful response
|
||||
isTextResponse = (promptBuilder is None and promptArgs is None) or debugPrefix == "text"
|
||||
|
||||
if isTextResponse:
|
||||
# For text responses, return the text immediately - no JSON parsing needed
|
||||
logger.info(f"Iteration {iteration}: Text response received, returning immediately")
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogFinish(iterationOperationId, True)
|
||||
return result
|
||||
|
||||
# Store raw response for continuation (even if broken)
|
||||
lastRawResponse = result
|
||||
|
||||
# Extract sections from response (handles both valid and broken JSON)
|
||||
# Only for document generation (JSON responses)
|
||||
# CRITICAL: Pass allSections and accumulationState to enable string accumulation
|
||||
extractedSections, wasJsonComplete, parsedResult, accumulationState = self.responseParser.extractSectionsFromResponse(
|
||||
result, iteration, debugPrefix, allSections, accumulationState
|
||||
)
|
||||
|
||||
# CRITICAL: Merge sections BEFORE KPI validation
|
||||
# This ensures sections are preserved even if KPI validation fails
|
||||
if extractedSections:
|
||||
allSections = JsonResponseHandler.mergeSectionsIntelligently(allSections, extractedSections, iteration)
|
||||
|
||||
# Define KPIs if we just entered accumulation mode (iteration 1, incomplete JSON)
|
||||
if accumulationState and accumulationState.isAccumulationMode and iteration == 1 and not accumulationState.kpis:
|
||||
logger.info(f"Iteration {iteration}: Defining KPIs for accumulation tracking")
|
||||
continuationContext = buildContinuationContext(allSections, result)
|
||||
# Pass raw response string from first iteration for KPI definition
|
||||
kpiDefinitions = await self._defineKpisFromPrompt(
|
||||
userPrompt or prompt,
|
||||
result, # Pass raw JSON string from first iteration
|
||||
continuationContext,
|
||||
debugPrefix
|
||||
)
|
||||
# Initialize KPIs with currentValue = 0
|
||||
accumulationState.kpis = [{**kpi, "currentValue": 0} for kpi in kpiDefinitions]
|
||||
logger.info(f"Defined {len(accumulationState.kpis)} KPIs: {[kpi.get('id') for kpi in accumulationState.kpis]}")
|
||||
|
||||
# Extract and validate KPIs (if in accumulation mode with KPIs defined)
|
||||
if accumulationState and accumulationState.isAccumulationMode and accumulationState.kpis:
|
||||
# For KPI extraction, prefer accumulated JSON string over repaired JSON
|
||||
# because repairBrokenJson may lose data (e.g., empty rows array when JSON is incomplete)
|
||||
updatedKpis = []
|
||||
|
||||
# First try to extract from parsedResult (repaired JSON)
|
||||
if parsedResult:
|
||||
try:
|
||||
updatedKpis = JsonResponseHandler.extractKpiValuesFromJson(
|
||||
parsedResult,
|
||||
accumulationState.kpis
|
||||
)
|
||||
# Check if we got meaningful values (non-zero)
|
||||
hasValidValues = any(kpi.get("currentValue", 0) > 0 for kpi in updatedKpis)
|
||||
if not hasValidValues and accumulationState.accumulatedJsonString:
|
||||
# Repaired JSON has empty values, try accumulated string
|
||||
logger.debug("Repaired JSON has empty KPI values, trying accumulated JSON string")
|
||||
updatedKpis = JsonResponseHandler.extractKpiValuesFromIncompleteJson(
|
||||
accumulationState.accumulatedJsonString,
|
||||
accumulationState.kpis
|
||||
)
|
||||
except Exception as e:
|
||||
logger.debug(f"Error extracting KPIs from parsedResult: {e}")
|
||||
updatedKpis = []
|
||||
|
||||
# If no parsedResult or extraction failed, try accumulated string
|
||||
if not updatedKpis and accumulationState.accumulatedJsonString:
|
||||
try:
|
||||
updatedKpis = JsonResponseHandler.extractKpiValuesFromIncompleteJson(
|
||||
accumulationState.accumulatedJsonString,
|
||||
accumulationState.kpis
|
||||
)
|
||||
except Exception as e:
|
||||
logger.debug(f"Error extracting KPIs from accumulated JSON string: {e}")
|
||||
updatedKpis = []
|
||||
|
||||
if updatedKpis:
|
||||
shouldProceed, reason = JsonResponseHandler.validateKpiProgression(
|
||||
accumulationState,
|
||||
updatedKpis
|
||||
)
|
||||
|
||||
if not shouldProceed:
|
||||
logger.warning(f"Iteration {iteration}: KPI validation failed: {reason}")
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogFinish(iterationOperationId, False)
|
||||
if operationId:
|
||||
self.services.chat.progressLogUpdate(operationId, 0.9, f"KPI validation failed: {reason} ({iteration} iterations)")
|
||||
break
|
||||
|
||||
# Update KPIs in accumulation state
|
||||
accumulationState.kpis = updatedKpis
|
||||
logger.info(f"Iteration {iteration}: KPIs updated: {[(kpi.get('id'), kpi.get('currentValue')) for kpi in updatedKpis]}")
|
||||
|
||||
# Check if all KPIs completed
|
||||
allCompleted = True
|
||||
for kpi in updatedKpis:
|
||||
targetValue = kpi.get("targetValue", 0)
|
||||
currentValue = kpi.get("currentValue", 0)
|
||||
if currentValue < targetValue:
|
||||
allCompleted = False
|
||||
break
|
||||
|
||||
if allCompleted:
|
||||
logger.info(f"Iteration {iteration}: All KPIs completed, finishing accumulation")
|
||||
wasJsonComplete = True # Mark as complete to exit loop
|
||||
|
||||
# CRITICAL: Handle JSON fragments (continuation content)
|
||||
# Fragment merging happens inside extractSectionsFromResponse
|
||||
# If merge fails (returns wasJsonComplete=True), stop iterations and complete JSON
|
||||
if not extractedSections and allSections:
|
||||
if wasJsonComplete:
|
||||
# Merge failed - stop iterations, complete JSON with available data
|
||||
logger.error(f"Iteration {iteration}: ❌ MERGE FAILED - Stopping iterations, completing JSON with available data")
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogFinish(iterationOperationId, False)
|
||||
if operationId:
|
||||
self.services.chat.progressLogUpdate(operationId, 0.9, f"Merge failed, completing JSON ({iteration} iterations)")
|
||||
break
|
||||
|
||||
# Fragment was detected and merged successfully
|
||||
logger.info(f"Iteration {iteration}: JSON fragment detected and merged, continuing")
|
||||
# Don't break - fragment was merged, continue to get more content if needed
|
||||
# Check if we should continue based on JSON completeness
|
||||
shouldContinue = self.responseParser.shouldContinueGeneration(
|
||||
allSections,
|
||||
iteration,
|
||||
wasJsonComplete,
|
||||
result
|
||||
)
|
||||
if shouldContinue:
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogUpdate(iterationOperationId, 0.8, "Fragment merged, continuing")
|
||||
self.services.chat.progressLogFinish(iterationOperationId, True)
|
||||
continue
|
||||
else:
|
||||
# Done - fragment was merged and JSON is complete
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogFinish(iterationOperationId, True)
|
||||
if operationId:
|
||||
self.services.chat.progressLogUpdate(operationId, 0.95, f"Generation complete ({iteration} iterations, fragment merged)")
|
||||
logger.info(f"Generation complete after {iteration} iterations: fragment merged")
|
||||
break
|
||||
|
||||
# Extract document metadata from first iteration if available
|
||||
if iteration == 1 and parsedResult and not documentMetadata:
|
||||
documentMetadata = self.responseParser.extractDocumentMetadata(parsedResult)
|
||||
|
||||
# Update progress after parsing
|
||||
if iterationOperationId:
|
||||
if extractedSections:
|
||||
self.services.chat.progressLogUpdate(iterationOperationId, 0.8, f"Extracted {len(extractedSections)} sections")
|
||||
|
||||
if not extractedSections:
|
||||
# CRITICAL: If JSON was incomplete/broken, continue even if no sections extracted
|
||||
# This allows the AI to retry and complete the broken JSON
|
||||
if not wasJsonComplete:
|
||||
logger.warning(f"Iteration {iteration}: No sections extracted from broken JSON, continuing for another attempt")
|
||||
continue
|
||||
# If JSON was complete but no sections extracted - check if it was a fragment
|
||||
# Fragments are handled above, so if we get here and it's complete, it's an error
|
||||
logger.warning(f"Iteration {iteration}: No sections extracted from complete JSON, stopping")
|
||||
break
|
||||
|
||||
# NOTE: Section merging now happens BEFORE KPI validation (see above)
|
||||
# This ensures sections are preserved even if KPI validation fails
|
||||
|
||||
# Calculate total bytes in merged content for progress display
|
||||
merged_json_str = json.dumps(allSections, indent=2, ensure_ascii=False)
|
||||
totalBytesGenerated = len(merged_json_str.encode('utf-8'))
|
||||
|
||||
# Update main operation with byte progress
|
||||
if operationId:
|
||||
# Format bytes for display
|
||||
if totalBytesGenerated < 1024:
|
||||
bytesDisplay = f"{totalBytesGenerated}B"
|
||||
elif totalBytesGenerated < 1024 * 1024:
|
||||
bytesDisplay = f"{totalBytesGenerated / 1024:.1f}kB"
|
||||
else:
|
||||
bytesDisplay = f"{totalBytesGenerated / (1024 * 1024):.1f}MB"
|
||||
# Estimate progress based on iterations (rough estimate)
|
||||
estimatedProgress = min(0.9, 0.4 + (iteration * 0.1))
|
||||
self.services.chat.progressLogUpdate(operationId, estimatedProgress, f"Pipeline: {bytesDisplay} (iteration {iteration})")
|
||||
|
||||
# Log merged sections for debugging
|
||||
self.services.utils.writeDebugFile(merged_json_str, f"{debugPrefix}_merged_sections_iteration_{iteration}")
|
||||
|
||||
# Check if we should continue (completion detection)
|
||||
# Simple logic: JSON completeness determines continuation
|
||||
shouldContinue = self.responseParser.shouldContinueGeneration(
|
||||
allSections,
|
||||
iteration,
|
||||
wasJsonComplete,
|
||||
result
|
||||
)
|
||||
|
||||
if shouldContinue:
|
||||
# Finish iteration operation (will continue with next iteration)
|
||||
if iterationOperationId:
|
||||
# Show byte progress in iteration completion
|
||||
iterBytes = len(result.encode('utf-8')) if result else 0
|
||||
if iterBytes < 1024:
|
||||
iterBytesDisplay = f"{iterBytes}B"
|
||||
elif iterBytes < 1024 * 1024:
|
||||
iterBytesDisplay = f"{iterBytes / 1024:.1f}kB"
|
||||
else:
|
||||
iterBytesDisplay = f"{iterBytes / (1024 * 1024):.1f}MB"
|
||||
self.services.chat.progressLogUpdate(iterationOperationId, 0.95, f"Completed ({iterBytesDisplay})")
|
||||
self.services.chat.progressLogFinish(iterationOperationId, True)
|
||||
continue
|
||||
else:
|
||||
# Done - finish iteration and update main operation
|
||||
if iterationOperationId:
|
||||
# Show final byte count
|
||||
finalBytes = len(merged_json_str.encode('utf-8'))
|
||||
if finalBytes < 1024:
|
||||
finalBytesDisplay = f"{finalBytes}B"
|
||||
elif finalBytes < 1024 * 1024:
|
||||
finalBytesDisplay = f"{finalBytes / 1024:.1f}kB"
|
||||
else:
|
||||
finalBytesDisplay = f"{finalBytes / (1024 * 1024):.1f}MB"
|
||||
self.services.chat.progressLogUpdate(iterationOperationId, 0.95, f"Complete ({finalBytesDisplay})")
|
||||
self.services.chat.progressLogFinish(iterationOperationId, True)
|
||||
if operationId:
|
||||
# Show final size in main operation
|
||||
finalBytes = len(merged_json_str.encode('utf-8'))
|
||||
if finalBytes < 1024:
|
||||
finalBytesDisplay = f"{finalBytes}B"
|
||||
elif finalBytes < 1024 * 1024:
|
||||
finalBytesDisplay = f"{finalBytes / 1024:.1f}kB"
|
||||
else:
|
||||
finalBytesDisplay = f"{finalBytes / (1024 * 1024):.1f}MB"
|
||||
self.services.chat.progressLogUpdate(operationId, 0.95, f"Generation complete: {finalBytesDisplay} ({iteration} iterations, {len(allSections)} sections)")
|
||||
logger.info(f"Generation complete after {iteration} iterations: {len(allSections)} sections")
|
||||
break
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in AI call iteration {iteration}: {str(e)}")
|
||||
if iterationOperationId:
|
||||
self.services.chat.progressLogFinish(iterationOperationId, False)
|
||||
break
|
||||
|
||||
if iteration >= maxIterations:
|
||||
logger.warning(f"AI call stopped after maximum iterations ({maxIterations})")
|
||||
|
||||
# CRITICAL: Complete any incomplete structures in sections before building final result
|
||||
# This ensures JSON is properly closed even if merge failed or iterations stopped early
|
||||
allSections = JsonResponseHandler.completeIncompleteStructures(allSections)
|
||||
|
||||
# Build final result from accumulated sections
|
||||
final_result = self.responseParser.buildFinalResultFromSections(allSections, documentMetadata)
|
||||
|
||||
# Write final result to debug file
|
||||
self.services.utils.writeDebugFile(final_result, f"{debugPrefix}_final_result")
|
||||
|
||||
return final_result
|
||||
|
||||
async def _defineKpisFromPrompt(
|
||||
self,
|
||||
userPrompt: str,
|
||||
rawJsonString: Optional[str],
|
||||
continuationContext: Dict[str, Any],
|
||||
debugPrefix: str = "kpi"
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Make separate AI call to define KPIs based on user prompt and incomplete JSON.
|
||||
|
||||
Args:
|
||||
userPrompt: Original user prompt
|
||||
rawJsonString: Raw JSON string from first iteration response
|
||||
continuationContext: Continuation context (not used for JSON, kept for compatibility)
|
||||
debugPrefix: Prefix for debug file names
|
||||
|
||||
Returns:
|
||||
List of KPI definitions: [{"id": str, "description": str, "jsonPath": str, "targetValue": int}, ...]
|
||||
"""
|
||||
# Use raw JSON string from first iteration response
|
||||
if rawJsonString:
|
||||
# Remove markdown code fences if present
|
||||
from modules.shared.jsonUtils import stripCodeFences
|
||||
incompleteJson = stripCodeFences(rawJsonString.strip())
|
||||
else:
|
||||
incompleteJson = "Not available"
|
||||
|
||||
kpiDefinitionPrompt = f"""Analyze the user request and incomplete JSON to define KPIs (Key Performance Indicators) for tracking progress.
|
||||
|
||||
User Request:
|
||||
{userPrompt}
|
||||
|
||||
Delivered JSON part:
|
||||
{incompleteJson}
|
||||
|
||||
Task: Define which JSON items should be tracked to measure completion progress.
|
||||
|
||||
IMPORTANT: Analyze the Delivered JSON part structure to understand what is being tracked:
|
||||
1. Identify the structure type (table with rows, list with items, etc.)
|
||||
2. Determine what the jsonPath actually counts (number of rows, number of items, etc.)
|
||||
3. Calculate targetValue based on what is being tracked, NOT the total quantity requested
|
||||
|
||||
For each trackable item, provide:
|
||||
- id: Unique identifier (use descriptive name)
|
||||
- description: What this KPI measures (be specific about what is counted)
|
||||
- jsonPath: Path to extract value from JSON (use dot notation with array indices, e.g., "documents[0].sections[1].elements[0].rows")
|
||||
- targetValue: Target value to reach (integer) - MUST match what jsonPath actually tracks (rows count, items count, etc.)
|
||||
|
||||
Return ONLY valid JSON in this format:
|
||||
{{
|
||||
"kpis": [
|
||||
{{
|
||||
"id": "unique_id",
|
||||
"description": "Description of what is measured",
|
||||
"jsonPath": "path.to.value",
|
||||
"targetValue": 0
|
||||
}}
|
||||
]
|
||||
}}
|
||||
|
||||
If no trackable items can be identified, return: {{"kpis": []}}
|
||||
"""
|
||||
|
||||
try:
|
||||
request = AiCallRequest(
|
||||
prompt=kpiDefinitionPrompt,
|
||||
options=AiCallOptions(
|
||||
operationType=OperationTypeEnum.DATA_ANALYSE,
|
||||
priority=PriorityEnum.SPEED,
|
||||
processingMode=ProcessingModeEnum.BASIC
|
||||
)
|
||||
)
|
||||
|
||||
# Write KPI definition prompt to debug file
|
||||
self.services.utils.writeDebugFile(kpiDefinitionPrompt, f"{debugPrefix}_kpi_definition_prompt")
|
||||
|
||||
response = await self.aiService.callAi(request)
|
||||
|
||||
# Write KPI definition response to debug file
|
||||
self.services.utils.writeDebugFile(response.content, f"{debugPrefix}_kpi_definition_response")
|
||||
|
||||
# Parse response
|
||||
extracted = extractJsonString(response.content)
|
||||
kpiResponse = json.loads(extracted)
|
||||
|
||||
kpiDefinitions = kpiResponse.get("kpis", [])
|
||||
logger.info(f"Defined {len(kpiDefinitions)} KPIs for tracking")
|
||||
|
||||
return kpiDefinitions
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to define KPIs: {e}, continuing without KPI tracking")
|
||||
return []
|
||||
|
||||
670
modules/services/serviceAi/subContentExtraction.py
Normal file
|
|
@ -0,0 +1,670 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Content Extraction Module
|
||||
|
||||
Handles content extraction and preparation, including:
|
||||
- Extracting content from documents based on intents
|
||||
- Processing pre-extracted documents
|
||||
- Vision AI for image text extraction
|
||||
- AI processing of text content
|
||||
"""
|
||||
import json
|
||||
import logging
|
||||
import base64
|
||||
from typing import Dict, Any, List, Optional
|
||||
|
||||
from modules.datamodels.datamodelChat import ChatDocument
|
||||
from modules.datamodels.datamodelExtraction import ContentPart, DocumentIntent
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ContentExtractor:
|
||||
"""Handles content extraction and preparation."""
|
||||
|
||||
def __init__(self, services, aiService, intentAnalyzer):
|
||||
"""Initialize ContentExtractor with service center, AI service, and intent analyzer access."""
|
||||
self.services = services
|
||||
self.aiService = aiService
|
||||
self.intentAnalyzer = intentAnalyzer
|
||||
|
||||
async def extractAndPrepareContent(
|
||||
self,
|
||||
documents: List[ChatDocument],
|
||||
documentIntents: List[DocumentIntent],
|
||||
parentOperationId: str,
|
||||
getIntentForDocument: callable
|
||||
) -> List[ContentPart]:
|
||||
"""
|
||||
Phase 5B: Extracts content based on intents and prepares ContentParts with metadata.
|
||||
Returns a list of ContentParts in the appropriate format.
|
||||
|
||||
IMPORTANT: A document can produce multiple ContentParts when it has multiple intents.
|
||||
Example: An image with intents=["extract", "render"] produces:
|
||||
- ContentPart(contentFormat="object", ...) for rendering
|
||||
- ContentPart(contentFormat="extracted", ...) for text analysis
|
||||
|
||||
Args:
|
||||
documents: List of documents to process
|
||||
documentIntents: List of DocumentIntent objects
|
||||
parentOperationId: Parent operation ID for the ChatLog hierarchy
|
||||
getIntentForDocument: Callable to get intent for document ID
|
||||
|
||||
Returns:
|
||||
List of ContentParts with complete metadata
|
||||
"""
|
||||
# Create the operation ID for extraction
|
||||
extractionOperationId = f"{parentOperationId}_content_extraction"
|
||||
|
||||
# Start the ChatLog with a parent reference
|
||||
self.services.chat.progressLogStart(
|
||||
extractionOperationId,
|
||||
"Content Extraction",
|
||||
"Extraction",
|
||||
f"Extracting from {len(documents)} documents",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
try:
|
||||
allContentParts = []
|
||||
|
||||
for document in documents:
|
||||
# Check if document is already a ContentExtracted document (pre-extracted JSON)
|
||||
logger.debug(f"Checking document {document.id} ({document.fileName}, mimeType={document.mimeType}) for pre-extracted content")
|
||||
preExtracted = self.intentAnalyzer.resolvePreExtractedDocument(document)
|
||||
|
||||
if preExtracted:
|
||||
logger.info(f"✅ Found pre-extracted document: {document.fileName} -> Original: {preExtracted['originalDocument']['fileName']}")
|
||||
logger.info(f" Pre-extracted document ID: {document.id}, Original document ID: {preExtracted['originalDocument']['id']}")
|
||||
logger.info(f" ContentParts count: {len(preExtracted['contentExtracted'].parts) if preExtracted['contentExtracted'].parts else 0}")
|
||||
|
||||
# Use the already extracted ContentParts directly
|
||||
contentExtracted = preExtracted["contentExtracted"]
|
||||
|
||||
# IMPORTANT: The intent must be looked up for the JSON document, not for the original
|
||||
# (intent analysis already maps back to the JSON document ID)
|
||||
intent = getIntentForDocument(document.id, documentIntents)
|
||||
logger.info(f" Intent lookup for document {document.id}: found={intent is not None}")
|
||||
if intent:
|
||||
logger.info(f" Intent: {intent.intents}, extractionPrompt: {intent.extractionPrompt[:100] if intent.extractionPrompt else None}...")
|
||||
else:
|
||||
logger.warning(f" ⚠️ No intent found for pre-extracted document {document.id}! Available intent documentIds: {[i.documentId for i in documentIntents]}")
|
||||
|
||||
if contentExtracted.parts:
|
||||
for part in contentExtracted.parts:
|
||||
# Skip empty parts (containers without data)
|
||||
if not part.data or (isinstance(part.data, str) and len(part.data.strip()) == 0):
|
||||
if part.typeGroup == "container":
|
||||
continue # Skip empty containers
|
||||
|
||||
if not part.metadata:
|
||||
part.metadata = {}
|
||||
|
||||
# Ensure metadata is complete
|
||||
if "documentId" not in part.metadata:
|
||||
part.metadata["documentId"] = document.id
|
||||
|
||||
# IMPORTANT: Check the intent for this part
|
||||
partIntent = intent.intents if intent else ["extract"]
|
||||
|
||||
# Debug logging for intent processing
|
||||
logger.debug(f"Processing part {part.id}: typeGroup={part.typeGroup}, intents={partIntent}, hasData={bool(part.data)}, dataLength={len(str(part.data)) if part.data else 0}")
|
||||
|
||||
# IMPORTANT: A part can have multiple intents - create one ContentPart per intent
|
||||
# Generic intent handling for ALL content types
|
||||
hasReferenceIntent = "reference" in partIntent
|
||||
hasRenderIntent = "render" in partIntent
|
||||
hasExtractIntent = "extract" in partIntent
|
||||
hasPartData = bool(part.data) and (not isinstance(part.data, str) or len(part.data.strip()) > 0)
|
||||
|
||||
logger.debug(f"Part {part.id}: reference={hasReferenceIntent}, render={hasRenderIntent}, extract={hasExtractIntent}, hasData={hasPartData}")
|
||||
|
||||
# Track whether the original part has already been added
|
||||
originalPartAdded = False
|
||||
|
||||
# 1. Reference intent: create a reference ContentPart
|
||||
if hasReferenceIntent:
|
||||
referencePart = ContentPart(
|
||||
id=f"ref_{document.id}_{part.id}",
|
||||
label=f"Reference: {part.label or 'Content'}",
|
||||
typeGroup="reference",
|
||||
mimeType=part.mimeType or "application/octet-stream",
|
||||
data="", # Empty - reference only
|
||||
metadata={
|
||||
"contentFormat": "reference",
|
||||
"documentId": document.id,
|
||||
"documentReference": f"docItem:{document.id}:{preExtracted['originalDocument']['fileName']}",
|
||||
"intent": "reference",
|
||||
"usageHint": f"Reference: {preExtracted['originalDocument']['fileName']}",
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"]
|
||||
}
|
||||
)
|
||||
allContentParts.append(referencePart)
|
||||
logger.debug(f"✅ Created reference ContentPart for {part.id}")
|
||||
|
||||
# 2. Render intent: create an object ContentPart (for binary/image rendering)
|
||||
if hasRenderIntent and hasPartData:
|
||||
# Check whether it is a binary/image (i.e. can be rendered)
|
||||
isRenderable = (
|
||||
part.typeGroup == "image" or
|
||||
part.typeGroup == "binary" or
|
||||
(part.mimeType and (
|
||||
part.mimeType.startswith("image/") or
|
||||
part.mimeType.startswith("video/") or
|
||||
part.mimeType.startswith("audio/") or
|
||||
self._isBinary(part.mimeType)
|
||||
))
|
||||
)
|
||||
|
||||
if isRenderable:
|
||||
objectPart = ContentPart(
|
||||
id=f"obj_{document.id}_{part.id}",
|
||||
label=f"Object: {part.label or 'Content'}",
|
||||
typeGroup=part.typeGroup,
|
||||
mimeType=part.mimeType or "application/octet-stream",
|
||||
data=part.data, # Base64/binary data is already present
|
||||
metadata={
|
||||
"contentFormat": "object",
|
||||
"documentId": document.id,
|
||||
"intent": "render",
|
||||
"usageHint": f"Render as visual element: {preExtracted['originalDocument']['fileName']}",
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"],
|
||||
"relatedExtractedPartId": f"extracted_{document.id}_{part.id}" if hasExtractIntent else None
|
||||
}
|
||||
)
|
||||
allContentParts.append(objectPart)
|
||||
logger.debug(f"✅ Created object ContentPart for {part.id} (render intent)")
|
||||
else:
|
||||
logger.warning(f"⚠️ Part {part.id} has render intent but is not renderable (typeGroup={part.typeGroup}, mimeType={part.mimeType})")
|
||||
elif hasRenderIntent and not hasPartData:
|
||||
logger.warning(f"⚠️ Part {part.id} has render intent but no data, skipping render part")
|
||||
|
||||
# 3. Extract intent: create an extracted ContentPart (possibly with additional processing)
|
||||
if hasExtractIntent:
|
||||
# Special handling for images: Vision AI for text extraction
|
||||
if part.typeGroup == "image" and hasPartData:
|
||||
logger.info(f"🔄 Processing image {part.id} with Vision AI (extract intent)")
|
||||
try:
|
||||
extractionPrompt = intent.extractionPrompt if intent and intent.extractionPrompt else "Extract all text content from this image. Return only the extracted text, no additional formatting."
|
||||
extractedText = await self.extractTextFromImage(part, extractionPrompt)
|
||||
if extractedText:
|
||||
# Check whether it is an error message
|
||||
isError = extractedText.startswith("[ERROR:")
|
||||
|
||||
# Create a new text part with the extracted text or error message
|
||||
textPart = ContentPart(
|
||||
id=f"extracted_{document.id}_{part.id}",
|
||||
label=f"Extracted text from {part.label or 'Image'}" if not isError else f"Error extracting from {part.label or 'Image'}",
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data=extractedText,
|
||||
metadata={
|
||||
"contentFormat": "extracted",
|
||||
"documentId": document.id,
|
||||
"intent": "extract",
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"],
|
||||
"relatedObjectPartId": f"obj_{document.id}_{part.id}" if hasRenderIntent else None,
|
||||
"extractionPrompt": extractionPrompt,
|
||||
"extractionMethod": "vision",
|
||||
"isError": isError
|
||||
}
|
||||
)
|
||||
allContentParts.append(textPart)
|
||||
if isError:
|
||||
logger.error(f"❌ Vision AI extraction failed for image {part.id}: {extractedText}")
|
||||
else:
|
||||
logger.info(f"✅ Extracted text from image {part.id} using Vision AI: {len(extractedText)} chars")
|
||||
else:
|
||||
# Should not happen (the function now always returns an error message)
|
||||
errorMsg = f"Vision AI extraction failed: Unexpected empty response for image {part.id}"
|
||||
logger.error(errorMsg)
|
||||
errorPart = ContentPart(
|
||||
id=f"extracted_{document.id}_{part.id}",
|
||||
label=f"Error extracting from {part.label or 'Image'}",
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data=f"[ERROR: {errorMsg}]",
|
||||
metadata={
|
||||
"contentFormat": "extracted",
|
||||
"documentId": document.id,
|
||||
"intent": "extract",
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"],
|
||||
"extractionPrompt": extractionPrompt,
|
||||
"extractionMethod": "vision",
|
||||
"isError": True
|
||||
}
|
||||
)
|
||||
allContentParts.append(errorPart)
|
||||
except Exception as e:
|
||||
logger.error(f"❌ Failed to extract text from image {part.id}: {str(e)}")
|
||||
import traceback
|
||||
logger.debug(f"Traceback: {traceback.format_exc()}")
|
||||
# No fallback: if a render intent is present, we already have the object part
# If there is only an extract intent: the original part is not text, so do not add it as extracted
|
||||
if not hasRenderIntent:
|
||||
logger.debug(f"Image {part.id} has only extract intent, Vision AI failed - no extracted text available")
|
||||
else:
|
||||
# For all other content types: check whether AI processing is needed
# IMPORTANT: Pre-extracted ContentParts from context.extractContent contain RAW extracted content
# (e.g. text from the PDF text layer, tables, etc.). When the "extract" intent is present,
# this content must be processed with AI based on extractionPrompt.
|
||||
|
||||
# Check whether the part has text content (which can be processed with AI)
|
||||
isTextContent = (
|
||||
part.typeGroup == "text" or
|
||||
part.typeGroup == "table" or
|
||||
(part.data and isinstance(part.data, str) and len(part.data.strip()) > 0)
|
||||
)
|
||||
|
||||
if isTextContent and intent and intent.extractionPrompt:
|
||||
# Text content with an extractionPrompt: process with AI
|
||||
logger.info(f"🔄 Processing text content {part.id} with AI (extract intent with prompt)")
|
||||
try:
|
||||
extractionPrompt = intent.extractionPrompt
|
||||
processedText = await self.processTextContentWithAi(part, extractionPrompt)
|
||||
if processedText:
|
||||
# Check whether the result is an error message
|
||||
isError = processedText.startswith("[ERROR:")
|
||||
|
||||
# Create a new text part holding the AI-processed text or the error message
|
||||
processedPart = ContentPart(
|
||||
id=f"extracted_{document.id}_{part.id}",
|
||||
label=f"AI-processed: {part.label or 'Content'}" if not isError else f"Error processing {part.label or 'Content'}",
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data=processedText,
|
||||
metadata={
|
||||
"contentFormat": "extracted",
|
||||
"documentId": document.id,
|
||||
"intent": "extract",
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"],
|
||||
"relatedObjectPartId": f"obj_{document.id}_{part.id}" if hasRenderIntent else None,
|
||||
"extractionPrompt": extractionPrompt,
|
||||
"extractionMethod": "ai",
|
||||
"sourcePartId": part.id,
|
||||
"fromExtractContent": True,
|
||||
"isError": isError
|
||||
}
|
||||
)
|
||||
allContentParts.append(processedPart)
|
||||
originalPartAdded = True
|
||||
if isError:
|
||||
logger.error(f"❌ AI text processing failed for part {part.id}: {processedText}")
|
||||
else:
|
||||
logger.info(f"✅ Processed text content {part.id} with AI: {len(processedText)} chars")
|
||||
else:
|
||||
# Should not happen (the function now always returns an error message)
|
||||
errorMsg = f"AI text processing failed: Unexpected empty response for part {part.id}"
|
||||
logger.error(errorMsg)
|
||||
errorPart = ContentPart(
|
||||
id=f"extracted_{document.id}_{part.id}",
|
||||
label=f"Error processing {part.label or 'Content'}",
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data=f"[ERROR: {errorMsg}]",
|
||||
metadata={
|
||||
"contentFormat": "extracted",
|
||||
"documentId": document.id,
|
||||
"intent": "extract",
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"],
|
||||
"extractionPrompt": extractionPrompt,
|
||||
"extractionMethod": "ai",
|
||||
"sourcePartId": part.id,
|
||||
"isError": True
|
||||
}
|
||||
)
|
||||
allContentParts.append(errorPart)
|
||||
originalPartAdded = True
|
||||
except Exception as e:
|
||||
logger.error(f"❌ Failed to process text content {part.id} with AI: {str(e)}")
|
||||
import traceback
|
||||
logger.debug(f"Traceback: {traceback.format_exc()}")
|
||||
# Fallback: use the original part
|
||||
if not originalPartAdded:
|
||||
part.metadata.update({
|
||||
"contentFormat": "extracted",
|
||||
"intent": "extract",
|
||||
"fromExtractContent": True,
|
||||
"skipExtraction": True,
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"],
|
||||
"relatedObjectPartId": f"obj_{document.id}_{part.id}" if hasRenderIntent else None
|
||||
})
|
||||
allContentParts.append(part)
|
||||
originalPartAdded = True
|
||||
else:
|
||||
# No extractionPrompt or no text content: use the part directly as extracted
# (content was already extracted by context.extractContent, no further AI processing needed)
# IMPORTANT: Only add it if it has not been added yet (e.g. by the render intent)
|
||||
if not originalPartAdded:
|
||||
part.metadata.update({
|
||||
"contentFormat": "extracted",
|
||||
"intent": "extract",
|
||||
"fromExtractContent": True,
"skipExtraction": True,  # Already extracted
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"],
|
||||
"relatedObjectPartId": f"obj_{document.id}_{part.id}" if hasRenderIntent else None
|
||||
})
|
||||
# Ensure that contentFormat is set
|
||||
if "contentFormat" not in part.metadata:
|
||||
part.metadata["contentFormat"] = "extracted"
|
||||
allContentParts.append(part)
|
||||
originalPartAdded = True
|
||||
logger.debug(f"✅ Using pre-extracted ContentPart {part.id} as extracted (no AI processing needed)")
|
||||
|
||||
# 4. Fallback: no intent is present or the part has not been added yet
# (should not normally happen, because the default is "extract")
|
||||
if not hasReferenceIntent and not hasRenderIntent and not hasExtractIntent and not originalPartAdded:
|
||||
logger.warning(f"⚠️ Part {part.id} has no recognized intents, adding as extracted by default")
|
||||
part.metadata.update({
|
||||
"contentFormat": "extracted",
|
||||
"intent": "extract",
|
||||
"fromExtractContent": True,
|
||||
"skipExtraction": True,
|
||||
"originalFileName": preExtracted["originalDocument"]["fileName"]
|
||||
})
|
||||
allContentParts.append(part)
|
||||
originalPartAdded = True
|
||||
|
||||
logger.info(f"✅ Using {len([p for p in contentExtracted.parts if p.data and len(str(p.data)) > 0])} pre-extracted ContentParts from ContentExtracted document {document.fileName}")
|
||||
logger.info(f" Original document: {preExtracted['originalDocument']['fileName']}")
|
||||
continue # Skip normal extraction for this document
|
||||
|
||||
# Check if it's standardized JSON format (has "documents" or "sections")
|
||||
if document.mimeType == "application/json":
|
||||
try:
|
||||
docBytes = self.services.interfaceDbComponent.getFileData(document.fileId)
|
||||
if docBytes:
|
||||
docData = docBytes.decode('utf-8')
|
||||
jsonData = json.loads(docData)
|
||||
|
||||
if isinstance(jsonData, dict) and ("documents" in jsonData or "sections" in jsonData):
|
||||
logger.info("Document is already in standardized JSON format, using as reference")
|
||||
# Create reference ContentPart for structured JSON
|
||||
contentPart = ContentPart(
|
||||
id=f"ref_{document.id}",
|
||||
label=f"Reference: {document.fileName}",
|
||||
typeGroup="structure",
|
||||
mimeType="application/json",
|
||||
data=docData,
|
||||
metadata={
|
||||
"contentFormat": "reference",
|
||||
"documentId": document.id,
|
||||
"documentReference": f"docItem:{document.id}:{document.fileName}",
|
||||
"skipExtraction": True,
|
||||
"intent": "reference"
|
||||
}
|
||||
)
|
||||
allContentParts.append(contentPart)
|
||||
logger.info("✅ Using JSON document directly without extraction")
|
||||
continue # Skip normal extraction for this document
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not parse JSON document {document.fileName}, will extract normally: {str(e)}")
|
||||
# Continue with normal extraction
|
||||
|
||||
# Normal extraction path
|
||||
intent = getIntentForDocument(document.id, documentIntents)
|
||||
|
||||
if not intent:
|
||||
# Default: "extract" for all documents without an intent
|
||||
logger.warning(f"No intent found for document {document.id}, using default 'extract'")
|
||||
intent = DocumentIntent(
|
||||
documentId=document.id,
|
||||
intents=["extract"],
|
||||
extractionPrompt="Extract all content from the document",
|
||||
reasoning="Default intent: no specific intent found"
|
||||
)
|
||||
|
||||
# IMPORTANT: Check all intents, because one document can produce several ContentParts
|
||||
|
||||
if "reference" in intent.intents:
|
||||
# Create the reference ContentPart
|
||||
contentPart = ContentPart(
|
||||
id=f"ref_{document.id}",
|
||||
label=f"Reference: {document.fileName}",
|
||||
typeGroup="reference",
|
||||
mimeType=document.mimeType,
|
||||
data="",
|
||||
metadata={
|
||||
"contentFormat": "reference",
|
||||
"documentId": document.id,
|
||||
"documentReference": f"docItem:{document.id}:{document.fileName}",
|
||||
"intent": "reference",
|
||||
"usageHint": f"Reference document: {document.fileName}"
|
||||
}
|
||||
)
|
||||
allContentParts.append(contentPart)
|
||||
|
||||
# IMPORTANT: "render" and "extract" can both be present!
# In that case we produce BOTH ContentParts
|
||||
|
||||
if "render" in intent.intents:
|
||||
# For images/binary: extract as object
|
||||
if document.mimeType.startswith("image/") or self._isBinary(document.mimeType):
|
||||
try:
|
||||
# Load the binary data (getFileData is not async, so no await is needed)
|
||||
binaryData = self.services.interfaceDbComponent.getFileData(document.fileId)
|
||||
if not binaryData:
|
||||
logger.warning(f"No binary data found for document {document.id}")
|
||||
continue
|
||||
base64Data = base64.b64encode(binaryData).decode('utf-8')
|
||||
|
||||
contentPart = ContentPart(
|
||||
id=f"obj_{document.id}",
|
||||
label=f"Object: {document.fileName}",
|
||||
typeGroup="image" if document.mimeType.startswith("image/") else "binary",
|
||||
mimeType=document.mimeType,
|
||||
data=base64Data,
|
||||
metadata={
|
||||
"contentFormat": "object",
|
||||
"documentId": document.id,
|
||||
"intent": "render",
|
||||
"usageHint": f"Render as visual element: {document.fileName}",
|
||||
"originalFileName": document.fileName,
|
||||
# Link to the extracted part (if present)
|
||||
"relatedExtractedPartId": f"ext_{document.id}" if "extract" in intent.intents else None
|
||||
}
|
||||
)
|
||||
allContentParts.append(contentPart)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to load binary data for document {document.id}: {str(e)}")
|
||||
|
||||
if "extract" in intent.intents:
|
||||
# Extract content with the extraction service
|
||||
extractionPrompt = intent.extractionPrompt or "Extract all content from the document"
|
||||
|
||||
# Debug log (harmonized)
|
||||
self.services.utils.writeDebugFile(
|
||||
extractionPrompt,
|
||||
f"content_extraction_prompt_{document.id}"
|
||||
)
|
||||
|
||||
# Run the extraction
|
||||
from modules.datamodels.datamodelExtraction import ExtractionOptions, MergeStrategy
|
||||
|
||||
extractionOptions = ExtractionOptions(
|
||||
prompt=extractionPrompt,
|
||||
mergeStrategy=MergeStrategy()
|
||||
)
|
||||
|
||||
# extractContent is not async, so no await is needed
|
||||
extractedResults = self.services.extraction.extractContent(
|
||||
[document],
|
||||
extractionOptions,
|
||||
operationId=extractionOperationId,
|
||||
parentOperationId=extractionOperationId
|
||||
)
|
||||
|
||||
# Convert the extracted results to ContentParts with metadata
|
||||
for extracted in extractedResults:
|
||||
for part in extracted.parts:
|
||||
# Mark as extracted format
|
||||
part.metadata.update({
|
||||
"contentFormat": "extracted",
|
||||
"documentId": document.id,
|
||||
"extractionPrompt": extractionPrompt,
|
||||
"intent": "extract",
|
||||
"usageHint": f"Use extracted content from {document.fileName}",
|
||||
# Link to the object part (if present)
|
||||
"relatedObjectPartId": f"obj_{document.id}" if "render" in intent.intents else None
|
||||
})
|
||||
# Ensure the ID is unique (in case an object part exists)
|
||||
if "render" in intent.intents:
|
||||
part.id = f"ext_{document.id}_{part.id}"
|
||||
allContentParts.append(part)
|
||||
|
||||
# Debug log (harmonized)
|
||||
self.services.utils.writeDebugFile(
|
||||
json.dumps([part.dict() for part in allContentParts], indent=2, default=str),
|
||||
"content_extraction_result"
|
||||
)
|
||||
|
||||
# Finish the ChatLog
|
||||
self.services.chat.progressLogFinish(extractionOperationId, True)
|
||||
|
||||
return allContentParts
|
||||
|
||||
except Exception as e:
|
||||
self.services.chat.progressLogFinish(extractionOperationId, False)
|
||||
logger.error(f"Error in extractAndPrepareContent: {str(e)}")
|
||||
raise
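The part IDs assigned above follow a prefix scheme: `ref_` for references, `obj_` for rendered objects, and `ext_{document.id}_{part.id}` for extracted parts when a render intent is also present. The object part's `relatedExtractedPartId`, however, is set to `ext_{document.id}` without a part suffix, so an exact-match lookup would not find any extracted part. The sketch below shows one way to resolve that link by prefix. The helper names `extractedPartId` and `findRelatedExtractedIds` are hypothetical and not part of this commit.

```python
from typing import List


def extractedPartId(documentId: str, partId: str, hasRenderIntent: bool) -> str:
    """Mirror the renaming above: the part ID is prefixed only when an object part exists."""
    return f"ext_{documentId}_{partId}" if hasRenderIntent else partId


def findRelatedExtractedIds(relatedExtractedPartId: str, allPartIds: List[str]) -> List[str]:
    """Resolve a link such as 'ext_<docId>' to every part ID of the form 'ext_<docId>_<partId>'."""
    prefix = relatedExtractedPartId + "_"
    return [partId for partId in allPartIds if partId.startswith(prefix)]
```

The trailing underscore in the prefix keeps a document ID such as `d1` from matching parts that belong to `d10`.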
|
||||
|
||||
async def extractTextFromImage(self, imagePart: ContentPart, extractionPrompt: str) -> Optional[str]:
"""
Extract text from an image part using Vision AI.

Args:
imagePart: ContentPart with typeGroup="image"
extractionPrompt: Prompt for the text extraction

Returns:
Extracted text, or an "[ERROR: ...]" message string on failure (never None)
"""
|
||||
try:
|
||||
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
|
||||
|
||||
# Final extraction prompt
|
||||
finalPrompt = extractionPrompt or "Extract all text content from this image. Return only the extracted text, no additional formatting."
|
||||
|
||||
# Debug log (harmonized)
|
||||
self.services.utils.writeDebugFile(
|
||||
finalPrompt,
|
||||
f"content_extraction_prompt_image_{imagePart.id}"
|
||||
)
|
||||
|
||||
# Create the AI call request with the image part
|
||||
request = AiCallRequest(
|
||||
prompt=finalPrompt,
|
||||
context="",
|
||||
options=AiCallOptions(operationType=OperationTypeEnum.IMAGE_ANALYSE),
|
||||
contentParts=[imagePart]
|
||||
)
|
||||
|
||||
# Use the AI service for Vision AI processing
|
||||
response = await self.aiService.callAi(request)
|
||||
|
||||
# Debug log for the response (harmonized)
|
||||
if response and response.content:
|
||||
self.services.utils.writeDebugFile(
|
||||
response.content,
|
||||
f"content_extraction_response_image_{imagePart.id}"
|
||||
)
|
||||
|
||||
if response and response.content:
|
||||
return response.content.strip()
|
||||
|
||||
# No content returned: return an error message for debugging
|
||||
errorMsg = f"Vision AI extraction failed: No content returned for image {imagePart.id}"
|
||||
logger.warning(errorMsg)
|
||||
return f"[ERROR: {errorMsg}]"
|
||||
except Exception as e:
|
||||
errorMsg = f"Vision AI extraction failed for image {imagePart.id}: {str(e)}"
|
||||
logger.error(errorMsg)
|
||||
import traceback
|
||||
logger.debug(f"Traceback: {traceback.format_exc()}")
|
||||
# Return an error message instead of None for debugging
|
||||
return f"[ERROR: {errorMsg}]"
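The function above signals failure in-band: it returns a string of the form `[ERROR: ...]` instead of raising or returning `None`, and callers detect it with `startswith("[ERROR:")`. A minimal sketch of that convention as two small helpers follows. The names `wrapError` and `isErrorResult` are hypothetical and do not exist in this commit.

```python
from typing import Optional

# Prefix that callers in this diff test for with startswith("[ERROR:")
ERROR_PREFIX = "[ERROR:"


def wrapError(message: str) -> str:
    """Wrap a failure message in the sentinel format used by the extraction helpers."""
    return f"[ERROR: {message}]"


def isErrorResult(text: Optional[str]) -> bool:
    """Return True when text carries the error sentinel rather than real content."""
    return bool(text) and text.startswith(ERROR_PREFIX)
```

One trade-off of an in-band sentinel is that genuine content beginning with `[ERROR:` would be misclassified; a result object with an explicit error flag would avoid that, at the cost of changing the return type.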
|
||||
|
||||
async def processTextContentWithAi(self, textPart: ContentPart, extractionPrompt: str) -> Optional[str]:
"""
Process text content with AI based on extractionPrompt.

IMPORTANT: Pre-extracted ContentParts from context.extractContent contain RAW extracted text
(e.g. from the PDF text layer). When the "extract" intent is present, this text must be
processed with AI (transformation, structuring, etc.) based on extractionPrompt.

Args:
textPart: ContentPart with typeGroup="text" (or another text-based type)
extractionPrompt: Prompt for the AI processing of the text

Returns:
AI-processed text, or an "[ERROR: ...]" message string on failure (never None)
"""
|
||||
try:
|
||||
from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationTypeEnum
|
||||
|
||||
# Final extraction prompt
|
||||
finalPrompt = extractionPrompt or "Process and extract the key information from the following text content."
|
||||
|
||||
# Debug log (harmonized): log the prompt with a text preview
|
||||
textPreview = textPart.data[:500] + "..." if textPart.data and len(textPart.data) > 500 else (textPart.data or "")
|
||||
promptWithContext = f"{finalPrompt}\n\n--- Text Content (preview) ---\n{textPreview}"
|
||||
self.services.utils.writeDebugFile(
|
||||
promptWithContext,
|
||||
f"content_extraction_prompt_text_{textPart.id}"
|
||||
)
|
||||
|
||||
# Create a text ContentPart for AI processing
# Use the existing text as input
|
||||
textContentPart = ContentPart(
|
||||
id=textPart.id,
|
||||
label=textPart.label,
|
||||
typeGroup="text",
|
||||
mimeType="text/plain",
|
||||
data=textPart.data if textPart.data else "",
|
||||
metadata=textPart.metadata.copy() if textPart.metadata else {}
|
||||
)
|
||||
|
||||
# Create the AI call request with the text part
|
||||
request = AiCallRequest(
|
||||
prompt=finalPrompt,
|
||||
context="",
|
||||
options=AiCallOptions(operationType=OperationTypeEnum.DATA_EXTRACT),
|
||||
contentParts=[textContentPart]
|
||||
)
|
||||
|
||||
# Use the AI service for text processing
|
||||
response = await self.aiService.callAi(request)
|
||||
|
||||
# Debug log for the response (harmonized)
|
||||
if response and response.content:
|
||||
self.services.utils.writeDebugFile(
|
||||
response.content,
|
||||
f"content_extraction_response_text_{textPart.id}"
|
||||
)
|
||||
|
||||
if response and response.content:
|
||||
return response.content.strip()
|
||||
|
||||
# No content returned: return an error message for debugging
|
||||
errorMsg = f"AI text processing failed: No content returned for text part {textPart.id}"
|
||||
logger.warning(errorMsg)
|
||||
return f"[ERROR: {errorMsg}]"
|
||||
except Exception as e:
|
||||
errorMsg = f"AI text processing failed for text part {textPart.id}: {str(e)}"
|
||||
logger.error(errorMsg)
|
||||
import traceback
|
||||
logger.debug(f"Traceback: {traceback.format_exc()}")
|
||||
# Return an error message instead of None for debugging
|
||||
return f"[ERROR: {errorMsg}]"
|
||||
|
||||
def _isBinary(self, mimeType: str) -> bool:
"""Check whether the MIME type is binary."""
|
||||
binaryTypes = [
|
||||
"application/octet-stream",
|
||||
"application/pdf",
|
||||
"application/zip",
|
||||
"application/x-zip-compressed"
|
||||
]
|
||||
return mimeType in binaryTypes or mimeType.startswith("image/") or mimeType.startswith("video/") or mimeType.startswith("audio/")
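The MIME check above combines an exact-match list with three prefix tests. A standalone sketch of the same rule is shown below, assuming it were pulled out of the class. The function name `isBinaryMimeType` and the two module-level constants are hypothetical; the MIME values are the ones listed in `_isBinary`.

```python
# Exact MIME types treated as binary (taken from the list in _isBinary)
BINARY_MIME_TYPES = frozenset({
    "application/octet-stream",
    "application/pdf",
    "application/zip",
    "application/x-zip-compressed",
})

# Media families treated as binary regardless of subtype
BINARY_MIME_PREFIXES = ("image/", "video/", "audio/")


def isBinaryMimeType(mimeType: str) -> bool:
    """Return True for the exact binary types or any image/video/audio subtype."""
    # str.startswith accepts a tuple of prefixes and matches if any one applies
    return mimeType in BINARY_MIME_TYPES or mimeType.startswith(BINARY_MIME_PREFIXES)
```

Because `image/` is already covered here, the separate `document.mimeType.startswith("image/")` test in the render branch is redundant with this check, though harmless.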
|
||||
|
||||
302
modules/services/serviceAi/subDocumentIntents.py
Normal file
|
|
@ -0,0 +1,302 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Document Intent Analysis Module
|
||||
|
||||
Handles analysis of document intents, including:
|
||||
- Clarifying which documents need extraction vs reference
|
||||
- Resolving pre-extracted documents
|
||||
- Building intent analysis prompts
|
||||
"""
|
||||
import json
|
||||
import logging
|
||||
from typing import Dict, Any, List, Optional
|
||||
|
||||
from modules.datamodels.datamodelChat import ChatDocument
|
||||
from modules.datamodels.datamodelExtraction import DocumentIntent
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class DocumentIntentAnalyzer:
|
||||
"""Handles document intent analysis and resolution."""
|
||||
|
||||
def __init__(self, services, aiService):
|
||||
"""Initialize DocumentIntentAnalyzer with service center and AI service access."""
|
||||
self.services = services
|
||||
self.aiService = aiService
|
||||
|
||||
async def clarifyDocumentIntents(
|
||||
self,
|
||||
documents: List[ChatDocument],
|
||||
userPrompt: str,
|
||||
actionParameters: Dict[str, Any],
|
||||
parentOperationId: str
|
||||
) -> List[DocumentIntent]:
"""
Phase 5A: Analyze which documents need extraction vs. reference.
Returns a DocumentIntent for each document.

Args:
documents: List of documents to process
userPrompt: User request
actionParameters: Action-specific parameters (e.g. resultType, outputFormat)
parentOperationId: Parent operation ID for the ChatLog hierarchy

Returns:
List of DocumentIntent objects
"""
|
||||
# Create the operation ID for the intent analysis
|
||||
intentOperationId = f"{parentOperationId}_intent_analysis"
|
||||
|
||||
# Start the ChatLog with a parent reference
|
||||
self.services.chat.progressLogStart(
|
||||
intentOperationId,
|
||||
"Document Intent Analysis",
|
||||
"Intent Analysis",
|
||||
f"Analyzing {len(documents)} documents",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
try:
|
||||
# Map pre-extracted JSONs to the original document IDs for the intent analysis
|
||||
documentMapping = {} # Maps original doc ID -> JSON doc ID
|
||||
resolvedDocuments = []
|
||||
|
||||
for doc in documents:
|
||||
preExtracted = self.resolvePreExtractedDocument(doc)
|
||||
if preExtracted:
|
||||
originalDocId = preExtracted["originalDocument"]["id"]
|
||||
documentMapping[originalDocId] = doc.id
|
||||
# Create a temporary ChatDocument for the original document
|
||||
originalDoc = ChatDocument(
|
||||
id=originalDocId,
|
||||
fileName=preExtracted["originalDocument"]["fileName"],
|
||||
mimeType=preExtracted["originalDocument"]["mimeType"],
|
||||
fileSize=preExtracted["originalDocument"].get("fileSize", doc.fileSize),
|
||||
fileId=doc.fileId,  # Keep the fileId of the JSON document
|
||||
messageId=getattr(doc, 'messageId', None)  # Keep messageId if present
|
||||
)
|
||||
resolvedDocuments.append(originalDoc)
|
||||
else:
|
||||
resolvedDocuments.append(doc)
|
||||
|
||||
# Build the intent analysis prompt with the original documents
|
||||
intentPrompt = self._buildIntentAnalysisPrompt(userPrompt, resolvedDocuments, actionParameters)
|
||||
|
||||
# AI call (use callAiPlanning for simple JSON responses)
# Debug logs are already written by callAiPlanning
|
||||
aiResponse = await self.aiService.callAiPlanning(
|
||||
prompt=intentPrompt,
|
||||
debugType="document_intent_analysis"
|
||||
)
|
||||
|
||||
# Parse the result and map back to JSON document IDs if necessary
|
||||
intentsData = json.loads(self.services.utils.jsonExtractString(aiResponse))
|
||||
documentIntents = []
|
||||
for intent in intentsData.get("intents", []):
|
||||
docId = intent.get("documentId")
|
||||
# If the intent refers to the original document, map it back to the JSON document ID
|
||||
if docId in documentMapping:
|
||||
intent["documentId"] = documentMapping[docId]
|
||||
documentIntents.append(DocumentIntent(**intent))
|
||||
|
||||
# Debug log (harmonized)
|
||||
self.services.utils.writeDebugFile(
|
||||
json.dumps([intent.dict() for intent in documentIntents], indent=2),
|
||||
"document_intent_analysis_result"
|
||||
)
|
||||
|
||||
# Finish the ChatLog
|
||||
self.services.chat.progressLogFinish(intentOperationId, True)
|
||||
|
||||
return documentIntents
|
||||
|
||||
except Exception as e:
|
||||
self.services.chat.progressLogFinish(intentOperationId, False)
|
||||
logger.error(f"Error in clarifyDocumentIntents: {str(e)}")
|
||||
raise
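`clarifyDocumentIntents` shows the model the original document IDs, then maps each returned intent back to the ID of the JSON document that actually holds the pre-extracted content. A minimal sketch of that round trip on plain dicts follows, assuming the response has already been reduced to a JSON string. The function name `remapIntentDocumentIds` is hypothetical; the real method builds `DocumentIntent` objects instead of dicts.

```python
import json
from typing import Any, Dict, List


def remapIntentDocumentIds(rawJson: str, documentMapping: Dict[str, str]) -> List[Dict[str, Any]]:
    """Parse an {"intents": [...]} response and translate original IDs back to JSON document IDs."""
    intentsData = json.loads(rawJson)
    remapped = []
    for intent in intentsData.get("intents", []):
        entry = dict(intent)  # copy so the parsed response is left untouched
        docId = entry.get("documentId")
        # IDs without a mapping belong to normal documents and pass through unchanged
        entry["documentId"] = documentMapping.get(docId, docId)
        remapped.append(entry)
    return remapped
```

Note that the mapping is keyed by original document ID, so two pre-extracted JSON files derived from the same original would collide, with the later one winning.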
|
||||
|
||||
def resolvePreExtractedDocument(self, document: ChatDocument) -> Optional[Dict[str, Any]]:
"""
Check whether a JSON document already contains extracted ContentParts.
Returns a dict with:
- originalDocument: ChatDocument info of the original document
- contentExtracted: ContentExtracted object with parts
- parts: List of ContentParts

Returns None if no pre-extracted format is recognized.
"""
|
||||
if document.mimeType != "application/json":
|
||||
logger.debug(f"Document {document.id} is not JSON (mimeType={document.mimeType}), skipping pre-extracted check")
|
||||
return None
|
||||
|
||||
try:
|
||||
docBytes = self.services.interfaceDbComponent.getFileData(document.fileId)
|
||||
if not docBytes:
|
||||
return None
|
||||
|
||||
docData = docBytes.decode('utf-8')
|
||||
jsonData = json.loads(docData)
|
||||
|
||||
if not isinstance(jsonData, dict):
|
||||
return None
|
||||
|
||||
# Check for ContentExtracted format
|
||||
# Only format 1 (ActionDocument format with validationMetadata) is supported
|
||||
documentData = None
|
||||
|
||||
validationMetadata = jsonData.get("validationMetadata", {})
|
||||
actionType = validationMetadata.get("actionType")
|
||||
logger.debug(f"JSON document {document.id}: validationMetadata.actionType={actionType}, keys={list(jsonData.keys())}")
|
||||
|
||||
if actionType == "context.extractContent":
|
||||
# Format: {"validationMetadata": {"actionType": "context.extractContent"}, "documentData": {...}}
|
||||
documentData = jsonData.get("documentData")
|
||||
logger.debug(f"Found ContentExtracted via validationMetadata for {document.fileName}, documentData keys: {list(documentData.keys()) if documentData else None}")
|
||||
else:
|
||||
logger.debug(f"JSON document {document.id} does not have actionType='context.extractContent' (got: {actionType})")
|
||||
|
||||
if documentData:
|
||||
from modules.datamodels.datamodelExtraction import ContentExtracted
|
||||
|
||||
try:
|
||||
# Ensure that "id" is present
|
||||
if "id" not in documentData:
|
||||
documentData["id"] = document.id
|
||||
|
||||
contentExtracted = ContentExtracted(**documentData)
|
||||
|
||||
if contentExtracted.parts:
|
||||
# Extract the original document info from the parts
|
||||
originalDocId = None
|
||||
originalFileName = None
|
||||
originalMimeType = None
|
||||
|
||||
for part in contentExtracted.parts:
|
||||
if part.metadata:
|
||||
# Try to find the original document info
|
||||
if not originalDocId and part.metadata.get("documentId"):
|
||||
originalDocId = part.metadata.get("documentId")
|
||||
if not originalFileName and part.metadata.get("originalFileName"):
|
||||
originalFileName = part.metadata.get("originalFileName")
|
||||
if not originalMimeType and part.metadata.get("documentMimeType"):
|
||||
originalMimeType = part.metadata.get("documentMimeType")
|
||||
|
||||
# If not found, try to derive it from documentName
|
||||
if not originalFileName:
|
||||
# Try to derive it from documentName (e.g. "B2025-02c_28_extracted_...json" -> "B2025-02c_28.pdf")
|
||||
if document.fileName and "_extracted_" in document.fileName:
|
||||
originalFileName = document.fileName.split("_extracted_")[0] + ".pdf"
|
||||
|
||||
return {
|
||||
"originalDocument": {
|
||||
"id": originalDocId or document.id,
|
||||
"fileName": originalFileName or document.fileName,
|
||||
"mimeType": originalMimeType or "application/pdf",
|
||||
"fileSize": document.fileSize
|
||||
},
|
||||
"contentExtracted": contentExtracted,
|
||||
"parts": contentExtracted.parts
|
||||
}
|
||||
except Exception as parseError:
|
||||
logger.warning(f"Could not parse ContentExtracted format from {document.fileName}: {str(parseError)}")
|
||||
logger.debug(f"JSON keys: {list(jsonData.keys())}, has parts: {'parts' in jsonData}")
|
||||
import traceback
|
||||
logger.debug(f"Parse error traceback: {traceback.format_exc()}")
|
||||
return None
|
||||
else:
|
||||
logger.debug(f"JSON document {document.id} has no documentData (actionType={actionType})")
|
||||
|
||||
return None
|
||||
except Exception as e:
|
||||
logger.debug(f"Error resolving pre-extracted document {document.fileName}: {str(e)}")
|
||||
return None
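The detection step in `resolvePreExtractedDocument` comes down to one envelope check: the file must be a JSON object whose `validationMetadata.actionType` equals `"context.extractContent"` and which carries a non-empty `documentData`. The sketch below isolates that check on raw bytes, without the `ContentExtracted` model or the database lookup. The function name `findPreExtractedPayload` is hypothetical.

```python
import json
from typing import Any, Dict, Optional


def findPreExtractedPayload(rawBytes: bytes) -> Optional[Dict[str, Any]]:
    """Return documentData if the bytes hold a context.extractContent envelope, else None."""
    try:
        jsonData = json.loads(rawBytes.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None  # not UTF-8 or not JSON at all
    if not isinstance(jsonData, dict):
        return None
    validationMetadata = jsonData.get("validationMetadata") or {}
    if not isinstance(validationMetadata, dict):
        return None
    if validationMetadata.get("actionType") != "context.extractContent":
        return None
    documentData = jsonData.get("documentData")
    # An empty or non-dict payload is treated the same as a missing one
    return documentData if isinstance(documentData, dict) and documentData else None
```

Keeping the check separate from model construction would also let `_buildIntentAnalysisPrompt` reuse a cached result instead of re-reading and re-parsing each file, which the current code does once per call site.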
|
||||
|
||||
def _buildIntentAnalysisPrompt(
|
||||
self,
|
||||
userPrompt: str,
|
||||
documents: List[ChatDocument],
|
||||
actionParameters: Dict[str, Any]
|
||||
) -> str:
"""Build the prompt for the intent analysis."""
|
||||
# Build the document list, showing the original documents for pre-extracted JSONs
|
||||
docListText = ""
|
||||
for i, doc in enumerate(documents, 1):
|
||||
# Check whether it is a pre-extracted JSON
|
||||
preExtracted = self.resolvePreExtractedDocument(doc)
|
||||
|
||||
if preExtracted:
|
||||
# Show the original document instead of the JSON
|
||||
originalDoc = preExtracted["originalDocument"]
|
||||
partsInfo = f" (contains {len(preExtracted['parts'])} pre-extracted parts: {', '.join([p.typeGroup for p in preExtracted['parts'] if p.data and len(str(p.data)) > 0])})"
|
||||
docListText += f"\n{i}. Document ID: {originalDoc['id']}\n"
|
||||
docListText += f" File Name: {originalDoc['fileName']}{partsInfo}\n"
|
||||
docListText += f" MIME Type: {originalDoc['mimeType']}\n"
|
||||
docListText += f" File Size: {originalDoc.get('fileSize', doc.fileSize)} bytes\n"
|
||||
else:
|
||||
# Regular document
|
||||
docListText += f"\n{i}. Document ID: {doc.id}\n"
|
||||
docListText += f" File Name: {doc.fileName}\n"
|
||||
docListText += f" MIME Type: {doc.mimeType}\n"
|
||||
docListText += f" File Size: {doc.fileSize} bytes\n"
|
||||
|
||||
outputFormat = actionParameters.get("outputFormat", "txt")
|
||||
|
||||
prompt = f"""USER REQUEST:
|
||||
{userPrompt}
|
||||
|
||||
DOCUMENTS TO ANALYZE:
|
||||
{docListText}
|
||||
|
||||
TASK: For each document, determine its intents (can be multiple):
|
||||
- "extract": Content extraction needed (text, structure, OCR, etc.)
|
||||
- "render": Image/binary should be rendered as-is (visual element)
|
||||
- "reference": Document reference/attachment (no extraction, just reference)
|
||||
|
||||
OUTPUT FORMAT: {outputFormat}
|
||||
|
||||
RETURN JSON:
|
||||
{{
|
||||
"intents": [
|
||||
{{
|
||||
"documentId": "doc_1",
|
||||
"intents": ["extract"], # Array - can contain multiple!
|
||||
"extractionPrompt": "Extract all text content, preserving structure",
|
||||
"reasoning": "User needs text content for document generation"
|
||||
}},
|
||||
{{
|
||||
"documentId": "doc_2",
|
||||
"intents": ["extract", "render"], # Both! Image needs text extraction AND visual rendering
|
||||
"extractionPrompt": "Extract text content from image using vision AI",
|
||||
"reasoning": "Image contains text that needs extraction, but also should be rendered visually"
|
||||
}},
|
||||
{{
|
||||
"documentId": "doc_3",
|
||||
"intents": ["reference"],
|
||||
"extractionPrompt": null,
|
||||
"reasoning": "Document is only used as reference, no extraction needed"
|
||||
}}
|
||||
]
|
||||
}}
|
||||
|
||||
CRITICAL RULES:
|
||||
1. For images (mimeType starts with "image/"):
|
||||
- If user wants to "include" or "show" images → add "render"
|
||||
- If user wants to "analyze", "read text", or "extract text" from images → add "extract"
|
||||
- Can have BOTH "extract" and "render" if image needs both text extraction and visual rendering
|
||||
|
||||
2. For text documents:
|
||||
- If user mentions "template" or "structure" → "reference" or "extract" based on context
|
||||
- If user mentions "reference" or "context" → "reference"
|
||||
- Default → "extract"
|
||||
|
||||
3. Consider output format:
|
||||
- For formats like PDF, DOCX, PPTX: images usually need "render"
|
||||
- For formats like CSV, JSON: usually "extract" only
|
||||
- For HTML: can have both "extract" and "render"
|
||||
|
||||
Return ONLY valid JSON following the structure above.
|
||||
"""
|
||||
return prompt
|
||||
|
||||
275
modules/services/serviceAi/subResponseParsing.py
Normal file
|
|
@ -0,0 +1,275 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Response Parsing Module
|
||||
|
||||
Handles parsing of AI responses, including:
|
||||
- Section extraction from responses
|
||||
- JSON completeness detection
|
||||
- Loop detection
|
||||
- Document metadata extraction
|
||||
- Final result building
|
||||
"""
|
||||
import json
|
||||
import logging
|
||||
from typing import Dict, Any, List, Optional, Tuple
|
||||
|
||||
from modules.shared.jsonUtils import extractJsonString, repairBrokenJson, extractSectionsFromDocument
|
||||
from modules.services.serviceAi.subJsonResponseHandling import JsonResponseHandler
|
||||
from modules.datamodels.datamodelAi import JsonAccumulationState
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ResponseParser:
|
||||
"""Handles parsing of AI responses and completion detection."""
|
||||
|
||||
def __init__(self, services):
|
||||
"""Initialize ResponseParser with service center access."""
|
||||
self.services = services
|
||||
|
||||
def extractSectionsFromResponse(
|
||||
self,
|
||||
result: str,
|
||||
iteration: int,
|
||||
debugPrefix: str,
|
||||
allSections: Optional[List[Dict[str, Any]]] = None,
|
||||
accumulationState: Optional[JsonAccumulationState] = None
|
||||
) -> Tuple[List[Dict[str, Any]], bool, Optional[Dict[str, Any]], Optional[JsonAccumulationState]]:
|
||||
"""
|
||||
Extract sections from AI response, handling both valid and broken JSON.
|
||||
|
||||
NEW BEHAVIOR:
|
||||
- First iteration: Check if complete, if not start accumulation
|
||||
- Subsequent iterations: Accumulate strings, parse when complete
|
||||
|
||||
Returns:
|
||||
Tuple of:
|
||||
- sections: Extracted sections
|
||||
- wasJsonComplete: True if JSON is complete
|
||||
- parsedResult: Parsed JSON object
|
||||
- updatedAccumulationState: Updated accumulation state (None if not in accumulation mode)
|
||||
"""
|
||||
if allSections is None:
|
||||
allSections = []
|
||||
|
||||
if iteration == 1:
|
||||
# First iteration - check if complete
|
||||
parsed = None
|
||||
try:
|
||||
extracted = extractJsonString(result)
|
||||
parsed = json.loads(extracted)
|
||||
|
||||
# Check completeness
|
||||
if JsonResponseHandler.isJsonComplete(parsed):
|
||||
# Complete JSON - no accumulation needed
|
||||
sections = extractSectionsFromDocument(parsed)
|
||||
logger.info(f"Iteration 1: Complete JSON detected, no accumulation needed")
|
||||
return sections, True, parsed, None # No accumulation
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Incomplete - try to extract partial sections from broken JSON
|
||||
logger.info(f"Iteration 1: Incomplete JSON detected, attempting to extract partial sections")
|
||||
|
||||
partialSections = []
|
||||
if parsed:
|
||||
# Try to extract sections from parsed (even if incomplete)
|
||||
partialSections = extractSectionsFromDocument(parsed)
|
||||
else:
|
||||
# Try to repair broken JSON and extract sections
|
||||
try:
|
||||
repaired = repairBrokenJson(result)
|
||||
if repaired:
|
||||
partialSections = extractSectionsFromDocument(repaired)
|
||||
parsed = repaired # Use repaired version for accumulation state
|
||||
except Exception:
|
||||
pass # If repair fails, continue with empty sections
|
||||
|
||||
|
||||
# Define KPIs (async call - need to handle this)
|
||||
# For now, create accumulation state without KPIs, will be updated after async call
|
||||
accumulationState = JsonAccumulationState(
|
||||
accumulatedJsonString=result,
|
||||
isAccumulationMode=True,
|
||||
lastParsedResult=parsed,
|
||||
allSections=partialSections,
|
||||
kpis=[]
|
||||
)
|
||||
|
||||
# Note: KPI definition will be done in the caller (async context)
|
||||
return partialSections, False, parsed, accumulationState
|
||||
|
||||
else:
|
||||
# Subsequent iterations - accumulate
|
||||
if accumulationState and accumulationState.isAccumulationMode:
|
||||
accumulated, sections, isComplete, parsedResult = \
|
||||
JsonResponseHandler.accumulateAndParseJsonFragments(
|
||||
accumulationState.accumulatedJsonString,
|
||||
result,
|
||||
allSections,
|
||||
iteration
|
||||
)
|
||||
|
||||
# Update accumulation state
|
||||
accumulationState.accumulatedJsonString = accumulated
|
||||
accumulationState.lastParsedResult = parsedResult
|
||||
accumulationState.allSections = allSections + sections if sections else allSections
|
||||
accumulationState.isAccumulationMode = not isComplete
|
||||
|
||||
# Log accumulated JSON for debugging
|
||||
if parsedResult:
|
||||
accumulated_json_str = json.dumps(parsedResult, indent=2, ensure_ascii=False)
|
||||
self.services.utils.writeDebugFile(accumulated_json_str, f"{debugPrefix}_accumulated_json_iteration_{iteration}.json")
|
||||
|
||||
return sections, isComplete, parsedResult, accumulationState
|
||||
else:
|
||||
# No accumulation mode - process normally (shouldn't happen)
|
||||
logger.warning(f"Iteration {iteration}: No accumulation state but iteration > 1")
|
||||
return [], False, None, None
|
||||
|
||||
def shouldContinueGeneration(
|
||||
self,
|
||||
allSections: List[Dict[str, Any]],
|
||||
iteration: int,
|
||||
wasJsonComplete: bool,
|
||||
rawResponse: Optional[str] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Determine if AI generation loop should continue.
|
||||
|
||||
CRITICAL: This is ONLY about AI Loop Completion, NOT Action DoD!
|
||||
Action DoD is checked AFTER the AI Loop completes in _refineDecide.
|
||||
|
||||
Simple logic:
|
||||
- If JSON parsing failed or incomplete → continue (needs more content)
|
||||
- If JSON parses successfully and is complete → stop (all content delivered)
|
||||
- Loop detection prevents infinite loops
|
||||
|
||||
CRITICAL: JSON completeness is determined by parsing, NOT by last character check!
|
||||
Returns True if we should continue, False if AI Loop is done.
|
||||
"""
|
||||
if len(allSections) == 0:
|
||||
return True # No sections yet, continue
|
||||
|
||||
# CRITERION 1: If JSON was incomplete/broken (parsing failed or incomplete) - continue to repair/complete
|
||||
if not wasJsonComplete:
|
||||
logger.info(f"Iteration {iteration}: JSON incomplete/broken - continuing to complete")
|
||||
return True
|
||||
|
||||
# CRITERION 2: JSON is complete (parsed successfully) - check for loop detection
|
||||
if self._isStuckInLoop(allSections, iteration):
|
||||
logger.warning(f"Iteration {iteration}: Detected potential infinite loop - stopping AI loop")
|
||||
return False
|
||||
|
||||
# JSON is complete and not stuck in loop - done
|
||||
logger.info(f"Iteration {iteration}: JSON complete - AI loop done")
|
||||
return False
|
||||
|
||||
def _isStuckInLoop(
|
||||
self,
|
||||
allSections: List[Dict[str, Any]],
|
||||
iteration: int
|
||||
) -> bool:
|
||||
"""
|
||||
Detect if we're stuck in a loop (same content being repeated).
|
||||
|
||||
Generic approach: Check if recent iterations are adding minimal or duplicate content.
|
||||
"""
|
||||
if iteration < 3:
|
||||
return False # Need at least 3 iterations to detect a loop
|
||||
|
||||
if len(allSections) == 0:
|
||||
return False
|
||||
|
||||
# Check if last section is very small (might be stuck)
|
||||
lastSection = allSections[-1]
|
||||
elements = lastSection.get("elements", [])
|
||||
|
||||
if isinstance(elements, list) and elements:
|
||||
lastElem = elements[-1] if elements else {}
|
||||
else:
|
||||
lastElem = elements if isinstance(elements, dict) else {}
|
||||
|
||||
# Check content size of last section
|
||||
lastSectionSize = 0
|
||||
if isinstance(lastElem, dict):
|
||||
for key, value in lastElem.items():
|
||||
if isinstance(value, str):
|
||||
lastSectionSize += len(value)
|
||||
elif isinstance(value, list):
|
||||
lastSectionSize += len(str(value))
|
||||
|
||||
# If last section is very small and we've done many iterations, might be stuck
|
||||
if lastSectionSize < 100 and iteration > 10:
|
||||
logger.warning(f"Potential loop detected: iteration {iteration}, last section size {lastSectionSize}")
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def extractDocumentMetadata(
|
||||
self,
|
||||
parsedResult: Dict[str, Any]
|
||||
) -> Optional[Dict[str, Any]]:
|
||||
"""
|
||||
Extract document metadata (title, filename) from parsed AI response.
|
||||
Returns dict with 'title' and 'filename' keys if found, None otherwise.
|
||||
"""
|
||||
if not isinstance(parsedResult, dict):
|
||||
return None
|
||||
|
||||
# Try to get from documents array (preferred structure)
|
||||
if "documents" in parsedResult and isinstance(parsedResult["documents"], list) and len(parsedResult["documents"]) > 0:
|
||||
firstDoc = parsedResult["documents"][0]
|
||||
if isinstance(firstDoc, dict):
|
||||
title = firstDoc.get("title")
|
||||
filename = firstDoc.get("filename")
|
||||
if title or filename:
|
||||
return {
|
||||
"title": title,
|
||||
"filename": filename
|
||||
}
|
||||
|
||||
return None
|
||||
|
||||
def buildFinalResultFromSections(
|
||||
self,
|
||||
allSections: List[Dict[str, Any]],
|
||||
documentMetadata: Optional[Dict[str, Any]] = None
|
||||
) -> str:
|
||||
"""
|
||||
Build final JSON result from accumulated sections.
|
||||
Uses AI-provided metadata (title, filename) if available.
|
||||
"""
|
||||
if not allSections:
|
||||
return ""
|
||||
|
||||
# Extract metadata from AI response if available
|
||||
title = "Generated Document"
|
||||
filename = "document.json"
|
||||
if documentMetadata:
|
||||
if documentMetadata.get("title"):
|
||||
title = documentMetadata["title"]
|
||||
if documentMetadata.get("filename"):
|
||||
filename = documentMetadata["filename"]
|
||||
|
||||
# Build documents structure
|
||||
# Assuming single document for now
|
||||
documents = [{
|
||||
"id": "doc_1",
|
||||
"title": title,
|
||||
"filename": filename,
|
||||
"sections": allSections
|
||||
}]
|
||||
|
||||
result = {
|
||||
"metadata": {
|
||||
"split_strategy": "single_document",
|
||||
"source_documents": [],
|
||||
"extraction_method": "ai_generation"
|
||||
},
|
||||
"documents": documents
|
||||
}
|
||||
|
||||
return json.dumps(result, indent=2)
|
||||
|
||||
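The continuation logic in `shouldContinueGeneration` and `_isStuckInLoop` above can be summarized in a small standalone sketch. The names `should_continue` and `last_element_size` are illustrative only and are not part of this PR. One observation, assuming the code above is read correctly: once the JSON is complete, both remaining paths return `False`, so the stuck-loop check appears to affect only which message is logged.

```python
from typing import Any, Dict, List


def last_element_size(sections: List[Dict[str, Any]]) -> int:
    """Approximate the content size of the last element in the last section."""
    if not sections:
        return 0
    elements = sections[-1].get("elements", [])
    if isinstance(elements, list):
        last = elements[-1] if elements else {}
    else:
        last = elements if isinstance(elements, dict) else {}
    if not isinstance(last, dict):
        return 0
    size = 0
    for value in last.values():
        if isinstance(value, str):
            size += len(value)          # strings count by length
        elif isinstance(value, list):
            size += len(str(value))     # lists count by their repr length
    return size


def should_continue(
    sections: List[Dict[str, Any]],
    iteration: int,
    was_json_complete: bool,
) -> bool:
    """Return True while the AI loop still needs more content."""
    if not sections:
        return True                     # nothing extracted yet
    if not was_json_complete:
        return True                     # broken or partial JSON: keep going
    return False                        # complete JSON: the loop is done
```

A caller would typically invoke `should_continue` once per iteration and use `last_element_size` only for diagnostics.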
1443
modules/services/serviceAi/subStructureFilling.py
Normal file
File diff suppressed because it is too large
238
modules/services/serviceAi/subStructureGeneration.py
Normal file
|
|
@ -0,0 +1,238 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Structure Generation Module
|
||||
|
||||
Handles document structure generation, including:
|
||||
- Generating document structure with sections
|
||||
- Building structure prompts
|
||||
"""
|
||||
import json
|
||||
import logging
|
||||
from typing import Dict, Any, List
|
||||
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class StructureGenerator:
|
||||
"""Handles document structure generation."""
|
||||
|
||||
def __init__(self, services, aiService):
|
||||
"""Initialize StructureGenerator with service center and AI service access."""
|
||||
self.services = services
|
||||
self.aiService = aiService
|
||||
|
||||
async def generateStructure(
|
||||
self,
|
||||
userPrompt: str,
|
||||
contentParts: List[ContentPart],
|
||||
outputFormat: str,
|
||||
parentOperationId: str
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Phase 5C: Generates the chapter structure (table of contents).
|
||||
Defines for each chapter:
|
||||
- Level, Title
|
||||
- contentPartIds
|
||||
- contentPartInstructions
|
||||
- generationHint
|
||||
|
||||
Args:
|
||||
userPrompt: User request
|
||||
contentParts: All prepared ContentParts with metadata
|
||||
outputFormat: Target format (html, docx, pdf, etc.)
|
||||
parentOperationId: Parent operation ID for the ChatLog hierarchy
|
||||
|
||||
Returns:
|
||||
Structure dict with documents and chapters (not sections!)
|
||||
"""
|
||||
# Create the operation ID for structure generation
|
||||
structureOperationId = f"{parentOperationId}_structure_generation"
|
||||
|
||||
# Start the ChatLog entry with a parent reference
|
||||
self.services.chat.progressLogStart(
|
||||
structureOperationId,
|
||||
"Chapter Structure Generation",
|
||||
"Structure",
|
||||
f"Generating chapter structure for {outputFormat}",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
try:
|
||||
# Build the chapter structure prompt with the content index
|
||||
structurePrompt = self._buildChapterStructurePrompt(
|
||||
userPrompt=userPrompt,
|
||||
contentParts=contentParts,
|
||||
outputFormat=outputFormat
|
||||
)
|
||||
|
||||
# AI call for chapter structure generation
|
||||
# Note: Debug logging is handled by callAiPlanning
|
||||
aiResponse = await self.aiService.callAiPlanning(
|
||||
prompt=structurePrompt,
|
||||
debugType="chapter_structure_generation"
|
||||
)
|
||||
|
||||
# Parse the structure
|
||||
structure = json.loads(self.services.utils.jsonExtractString(aiResponse))
|
||||
|
||||
# Finish the ChatLog entry
|
||||
self.services.chat.progressLogFinish(structureOperationId, True)
|
||||
|
||||
return structure
|
||||
|
||||
except Exception as e:
|
||||
self.services.chat.progressLogFinish(structureOperationId, False)
|
||||
logger.error(f"Error in generateStructure: {str(e)}")
|
||||
raise
|
||||
|
||||
def _buildChapterStructurePrompt(
|
||||
self,
|
||||
userPrompt: str,
|
||||
contentParts: List[ContentPart],
|
||||
outputFormat: str
|
||||
) -> str:
|
||||
"""Build the prompt for chapter structure generation."""
|
||||
# Build the ContentParts index - filter out empty parts
|
||||
contentPartsIndex = ""
|
||||
validParts = []
|
||||
filteredParts = []
|
||||
|
||||
for part in contentParts:
|
||||
contentFormat = part.metadata.get("contentFormat", "unknown")
|
||||
|
||||
# IMPORTANT: Reference parts intentionally have empty data - always include them
|
||||
if contentFormat == "reference":
|
||||
validParts.append(part)
|
||||
logger.debug(f"Including reference ContentPart {part.id} (intentionally empty data)")
|
||||
continue
|
||||
|
||||
# Skip empty parts (no data, or containers without content)
|
||||
# BUT: Reference parts were already handled above
|
||||
if not part.data or (isinstance(part.data, str) and len(part.data.strip()) == 0):
|
||||
# Skip container parts without data
|
||||
if part.typeGroup == "container" and not part.data:
|
||||
filteredParts.append((part.id, "container without data"))
|
||||
continue
|
||||
# Skip other empty parts (but not reference parts, which were already handled)
|
||||
if not part.data:
|
||||
filteredParts.append((part.id, f"no data (format: {contentFormat})"))
|
||||
continue
|
||||
|
||||
validParts.append(part)
|
||||
logger.debug(f"Including ContentPart {part.id}: format={contentFormat}, type={part.typeGroup}, dataLength={len(str(part.data)) if part.data else 0}")
|
||||
|
||||
if filteredParts:
|
||||
logger.debug(f"Filtered out {len(filteredParts)} empty ContentParts: {filteredParts}")
|
||||
|
||||
logger.info(f"Building structure prompt with {len(validParts)} valid ContentParts (from {len(contentParts)} total)")
|
||||
|
||||
# Build the index only for valid parts
|
||||
for i, part in enumerate(validParts, 1):
|
||||
contentFormat = part.metadata.get("contentFormat", "unknown")
|
||||
originalFileName = part.metadata.get('originalFileName', 'N/A')
|
||||
|
||||
contentPartsIndex += f"\n{i}. ContentPart ID: {part.id}\n"
|
||||
contentPartsIndex += f" Format: {contentFormat}\n"
|
||||
contentPartsIndex += f" Type: {part.typeGroup}\n"
|
||||
contentPartsIndex += f" MIME Type: {part.mimeType or 'N/A'}\n"
|
||||
contentPartsIndex += f" Source: {part.metadata.get('documentId', 'unknown')}\n"
|
||||
contentPartsIndex += f" Original file name: {originalFileName}\n"
|
||||
contentPartsIndex += f" Usage hint: {part.metadata.get('usageHint', 'N/A')}\n"
|
||||
|
||||
if not contentPartsIndex:
|
||||
contentPartsIndex = "\n(No content parts available)"
|
||||
|
||||
prompt = f"""USER REQUEST (for context):
|
||||
```
|
||||
{userPrompt}
|
||||
```
|
||||
|
||||
AVAILABLE CONTENT PARTS:
|
||||
{contentPartsIndex}
|
||||
|
||||
TASK: Generate Chapter Structure for the documents to be generated.
|
||||
|
||||
IMPORTANT - CHAPTER INDEPENDENCE:
|
||||
- Each chapter is independent and self-contained
|
||||
- One chapter does NOT have information about another chapter
|
||||
- Each chapter must provide its own context and be understandable alone
|
||||
|
||||
CRITICAL - CONTENT ASSIGNMENT TO CHAPTERS:
|
||||
- You MUST assign available ContentParts to chapters using contentPartIds
|
||||
- Based on the user request, determine which content should be used in which chapter
|
||||
- If the user request mentions specific content, assign the corresponding ContentPart to the appropriate chapter
|
||||
- Chapters WITHOUT contentPartIds can only generate generic content, NOT document-specific analysis
|
||||
- To include document content analysis, chapters MUST have contentPartIds assigned
|
||||
- Review the user request carefully to match ContentParts to chapters based on context and purpose
|
||||
|
||||
CRITICAL - CHAPTERS WITHOUT CONTENT PARTS:
|
||||
- If contentPartIds is EMPTY, generationHint MUST be VERY DETAILED with all context needed to generate content from scratch
|
||||
- Include: what to generate, what information to include, purpose, specific details
|
||||
- Without content parts, AI relies ENTIRELY on generationHint and CANNOT analyze document content
|
||||
|
||||
IMPORTANT - FORMATTING:
|
||||
- Formatting (fonts, colors, layouts, styles) is handled AUTOMATICALLY by the renderer
|
||||
- Do NOT specify formatting details in generationHint unless it's content-specific (e.g., "pie chart with 3 segments")
|
||||
- Focus on CONTENT and STRUCTURE, not visual formatting
|
||||
- The renderer will apply appropriate styling based on the output format ({outputFormat})
|
||||
|
||||
For each chapter:
|
||||
- chapter id
|
||||
- level (1, 2, 3, etc.)
|
||||
- title
|
||||
- contentPartIds: [List of ContentPart IDs] - ASSIGN content based on user request and chapter purpose
|
||||
- contentPartInstructions: {{
|
||||
"partId": {{
|
||||
"instruction": "How content should be structured"
|
||||
}}
|
||||
}}
|
||||
- generationHint: Description of the content (must be self-contained with all necessary context)
|
||||
* If contentPartIds is EMPTY, generationHint MUST be VERY DETAILED with all context needed to generate content from scratch
|
||||
* Focus on content and structure, NOT formatting details
|
||||
|
||||
OUTPUT FORMAT: {outputFormat}
|
||||
|
||||
RETURN JSON:
|
||||
{{
|
||||
"metadata": {{
|
||||
"title": "Document Title",
|
||||
"language": "de"
|
||||
}},
|
||||
"documents": [{{
|
||||
"id": "doc_1",
|
||||
"title": "Document Title",
|
||||
"filename": "document.{outputFormat}",
|
||||
"chapters": [
|
||||
{{
|
||||
"id": "chapter_1",
|
||||
"level": 1,
|
||||
"title": "Introduction",
|
||||
"contentPartIds": ["part_ext_1"],
|
||||
"contentPartInstructions": {{
|
||||
"part_ext_1": {{
|
||||
"instruction": "Use full extracted text"
|
||||
}}
|
||||
}},
|
||||
"generationHint": "Create introduction section",
|
||||
"sections": []
|
||||
}},
|
||||
{{
|
||||
"id": "chapter_2",
|
||||
"level": 1,
|
||||
"title": "Main Title",
|
||||
"contentPartIds": [],
|
||||
"contentPartInstructions": {{}},
|
||||
"generationHint": "Create [specific content description]. Include [required information]. Purpose: [explanation of what this chapter provides].",
|
||||
"sections": []
|
||||
}}
|
||||
]
|
||||
}}]
|
||||
}}
|
||||
|
||||
Return ONLY valid JSON following the structure above.
|
||||
"""
|
||||
return prompt
|
||||
|
||||
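The part-filtering step at the top of `_buildChapterStructurePrompt` can be illustrated in isolation. This is a minimal sketch under stated assumptions: `Part` is a hypothetical stand-in for `ContentPart` carrying only the fields used here, and `split_parts` is an illustrative name. Unlike the code above, which appears to let whitespace-only strings through to `validParts`, this sketch treats them as empty.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Part:
    """Hypothetical stand-in for ContentPart (only the fields used below)."""
    id: str
    typeGroup: str
    data: Any = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def split_parts(parts: List[Part]) -> Tuple[List[Part], List[Tuple[str, str]]]:
    """Separate usable parts from empty ones, keeping a reason for each drop."""
    valid: List[Part] = []
    filtered: List[Tuple[str, str]] = []
    for part in parts:
        content_format = part.metadata.get("contentFormat", "unknown")
        # Reference parts carry no data on purpose, so they always pass.
        if content_format == "reference":
            valid.append(part)
            continue
        is_blank = isinstance(part.data, str) and not part.data.strip()
        if not part.data or is_blank:
            if part.typeGroup == "container":
                reason = "container without data"
            else:
                reason = f"no data (format: {content_format})"
            filtered.append((part.id, reason))
            continue
        valid.append(part)
    return valid, filtered
```

Returning the filtered list alongside the valid one keeps the debug logging shown above possible without a second pass over the parts.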
|
|
@ -34,12 +34,42 @@ class StructureChunker(Chunker):
|
|||
if bucket:
|
||||
emit(bucket)
|
||||
else:
|
||||
# JSON object (dict) - check if it fits
|
||||
text = json.dumps(obj, ensure_ascii=False)
|
||||
if len(text.encode('utf-8')) <= maxBytes:
|
||||
textSize = len(text.encode('utf-8'))
|
||||
if textSize <= maxBytes:
|
||||
emit(obj)
|
||||
else:
|
||||
# fallback to line chunking
|
||||
raise ValueError("too large")
|
||||
# Object too large - try to split by keys if possible
|
||||
# For large objects, we need to chunk by character boundaries
|
||||
# since we can't split JSON objects arbitrarily
|
||||
if isinstance(obj, dict) and len(obj) > 1:
|
||||
# Try to split object into multiple chunks by keys
|
||||
# This preserves JSON structure better than line-based chunking
|
||||
currentChunk: Dict[str, Any] = {}
|
||||
currentSize = 2 # Start with "{}" overhead
|
||||
for key, value in obj.items():
|
||||
itemText = json.dumps({key: value}, ensure_ascii=False)
|
||||
itemSize = len(itemText.encode('utf-8'))
|
||||
# Account for comma and spacing between items
|
||||
if currentChunk:
|
||||
itemSize += 2 # ", " separator
|
||||
|
||||
if currentSize + itemSize > maxBytes and currentChunk:
|
||||
# Current chunk is full, emit it
|
||||
emit(currentChunk)
|
||||
currentChunk = {key: value}
|
||||
currentSize = len(itemText.encode('utf-8'))
|
||||
else:
|
||||
currentChunk[key] = value
|
||||
currentSize += itemSize
|
||||
|
||||
# Emit remaining chunk
|
||||
if currentChunk:
|
||||
emit(currentChunk)
|
||||
else:
|
||||
# Single large value or can't split - fallback to line chunking
|
||||
raise ValueError("too large")
|
||||
except Exception:
|
||||
current: List[str] = []
|
||||
size = 0
|
||||
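The key-based splitting added to `StructureChunker` in this hunk can be sketched as a standalone function. The name `split_dict_by_keys` and the plain list return value are assumptions for illustration; the PR itself calls `emit` for each chunk. The size is an estimate, since each item is measured as its own single-key JSON object.

```python
import json
from typing import Any, Dict, List


def split_dict_by_keys(obj: Dict[str, Any], max_bytes: int) -> List[Dict[str, Any]]:
    """Split a dict into smaller dicts whose estimated JSON size fits max_bytes."""
    chunks: List[Dict[str, Any]] = []
    current: Dict[str, Any] = {}
    current_size = 2  # overhead of the enclosing "{}"
    for key, value in obj.items():
        item_text = json.dumps({key: value}, ensure_ascii=False)
        item_size = len(item_text.encode("utf-8"))
        separator = 2 if current else 0  # ", " between items
        if current and current_size + separator + item_size > max_bytes:
            # Current chunk is full: close it and start a new one with this item.
            chunks.append(current)
            current = {key: value}
            current_size = item_size
        else:
            current[key] = value
            current_size += separator + item_size
    if current:
        chunks.append(current)
    return chunks
```

As in the hunk above, a single key whose value alone exceeds `max_bytes` still ends up in an oversized chunk of its own, so a caller would likely need a further fallback for that case.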
|
|
|
|||
|
|
@ -6,10 +6,11 @@ import logging
|
|||
import time
|
||||
import asyncio
|
||||
import base64
|
||||
import json
|
||||
|
||||
from .subRegistry import ExtractorRegistry, ChunkerRegistry
|
||||
from .subPipeline import runExtraction
|
||||
from modules.datamodels.datamodelExtraction import ContentExtracted, ContentPart, MergeStrategy, ExtractionOptions, PartResult
|
||||
from modules.datamodels.datamodelExtraction import ContentExtracted, ContentPart, MergeStrategy, ExtractionOptions, PartResult, DocumentIntent
|
||||
from modules.datamodels.datamodelChat import ChatDocument
|
||||
from modules.datamodels.datamodelAi import AiCallResponse, AiCallRequest, AiCallOptions, OperationTypeEnum, AiModelCall
|
||||
from modules.aicore.aicoreModelRegistry import modelRegistry
|
||||
|
|
@ -73,12 +74,14 @@ class ExtractionService:
|
|||
if operationId:
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
docOperationId = f"{operationId}_doc_{i}"
|
||||
# Use parentOperationId if provided, otherwise use operationId as parent
|
||||
parentId = parentOperationId if parentOperationId else operationId
|
||||
self.services.chat.progressLogStart(
|
||||
docOperationId,
|
||||
"Extracting Document",
|
||||
f"Document {i + 1}/{totalDocs}",
|
||||
doc.fileName[:50] + "..." if len(doc.fileName) > 50 else doc.fileName,
|
||||
parentOperationId=operationId # Use operationId as parent (not parentOperationId)
|
||||
parentOperationId=parentId # Correct parent reference for ChatLog hierarchy
|
||||
)
|
||||
|
||||
# Start timing for this document
|
||||
|
|
@ -125,12 +128,41 @@ class ExtractionService:
|
|||
if part.metadata:
|
||||
logger.debug(f" Metadata: {part.metadata}")
|
||||
|
||||
# Attach document id and MIME type to parts if missing
|
||||
# Attach complete metadata to parts according to the ContentPart metadata schema
|
||||
for p in ec.parts:
|
||||
# Ensure metadata dict exists
|
||||
if not p.metadata:
|
||||
p.metadata = {}
|
||||
|
||||
# Required metadata fields (from concept)
|
||||
if "documentId" not in p.metadata:
|
||||
p.metadata["documentId"] = documentData["id"] or str(uuid.uuid4())
|
||||
if "documentMimeType" not in p.metadata:
|
||||
p.metadata["documentMimeType"] = documentData["mimeType"]
|
||||
if "originalFileName" not in p.metadata:
|
||||
p.metadata["originalFileName"] = documentData["fileName"]
|
||||
|
||||
# ContentFormat: Set based on typeGroup and mimeType
|
||||
# Default to "extracted" for text content, but can be overridden by caller
|
||||
if "contentFormat" not in p.metadata:
|
||||
# Default: extracted text content
|
||||
p.metadata["contentFormat"] = "extracted"
|
||||
|
||||
# Intent: Default to "extract" for extracted content
|
||||
if "intent" not in p.metadata:
|
||||
p.metadata["intent"] = "extract"
|
||||
|
||||
# ExtractionPrompt: Use from options if available
|
||||
if "extractionPrompt" not in p.metadata and options and options.prompt:
|
||||
p.metadata["extractionPrompt"] = options.prompt
|
||||
|
||||
# UsageHint: Provide default hint
|
||||
if "usageHint" not in p.metadata:
|
||||
p.metadata["usageHint"] = f"Use extracted content from {documentData['fileName']}"
|
||||
|
||||
# SourceAction: Mark as from extraction service
|
||||
if "sourceAction" not in p.metadata:
|
||||
p.metadata["sourceAction"] = "extraction.extractContent"
|
||||
|
||||
# Log chunking information
|
||||
chunkedParts = [p for p in ec.parts if p.metadata.get("chunk", False)]
|
||||
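The metadata defaults added in this hunk follow a "fill only if missing" pattern, which could also be expressed with `dict.setdefault`. The sketch below is one possible reading, not the PR's code: `apply_default_metadata` is a hypothetical helper, and `document` is assumed to be a plain dict with `id`, `mimeType`, and `fileName` keys as used above.

```python
import uuid
from typing import Any, Dict, Optional


def apply_default_metadata(
    metadata: Optional[Dict[str, Any]],
    document: Dict[str, Any],
    extraction_prompt: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of metadata with missing fields filled from the document."""
    meta = dict(metadata or {})
    # Fall back to a fresh UUID when the document has no id.
    meta.setdefault("documentId", document.get("id") or str(uuid.uuid4()))
    meta.setdefault("documentMimeType", document["mimeType"])
    meta.setdefault("originalFileName", document["fileName"])
    meta.setdefault("contentFormat", "extracted")
    meta.setdefault("intent", "extract")
    if extraction_prompt:
        meta.setdefault("extractionPrompt", extraction_prompt)
    meta.setdefault("usageHint", f"Use extracted content from {document['fileName']}")
    meta.setdefault("sourceAction", "extraction.extractContent")
    return meta
```

Values already set by a caller, such as a `"render"` intent, are left untouched, which matches the override behavior described in the comments above.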
|
|
@ -185,7 +217,7 @@ class ExtractionService:
|
|||
# Write extraction results to debug file
|
||||
try:
|
||||
from modules.shared.debugLogger import writeDebugFile
|
||||
import json
|
||||
# json is already imported at module level
|
||||
# Create summary of extraction results for debug
|
||||
extractionSummary = {
|
||||
"documentName": doc.fileName,
|
||||
|
|
@ -208,7 +240,7 @@ class ExtractionService:
|
|||
partSummary["dataPreview"] = f"[Large data: {len(part.data)} chars - truncated]"
|
||||
extractionSummary["parts"].append(partSummary)
|
||||
|
||||
writeDebugFile(json.dumps(extractionSummary, indent=2, ensure_ascii=False), f"extraction_result_{doc.fileName}")
|
||||
writeDebugFile(json.dumps(extractionSummary, indent=2, ensure_ascii=False), f"extraction_result_{doc.fileName}.txt")
|
||||
except Exception as e:
|
||||
logger.debug(f"Failed to write extraction debug file: {str(e)}")
|
||||
|
||||
|
|
@ -487,7 +519,8 @@ class ExtractionService:
|
|||
prompt: str,
|
||||
aiObjects: Any,
|
||||
options: Optional[AiCallOptions] = None,
|
||||
operationId: Optional[str] = None
|
||||
operationId: Optional[str] = None,
|
||||
parentOperationId: Optional[str] = None
|
||||
) -> str:
|
||||
"""
|
||||
Process documents with model-aware chunking and merge results.
|
||||
|
|
@ -499,6 +532,7 @@ class ExtractionService:
|
|||
aiObjects: AiObjects instance for making AI calls
|
||||
options: AI call options
|
||||
operationId: Optional operation ID for progress tracking
|
||||
parentOperationId: Optional parent operation ID for hierarchical logging
|
||||
|
||||
Returns:
|
||||
Merged AI results as string with preserved document structure
|
||||
|
|
@ -514,7 +548,8 @@ class ExtractionService:
|
|||
operationId,
|
||||
"AI Text Extract",
|
||||
"Document Processing",
|
||||
f"Processing {len(documents)} documents"
|
||||
f"Processing {len(documents)} documents",
|
||||
parentOperationId=parentOperationId # Use parentOperationId if provided
|
||||
)
|
||||
|
||||
try:
|
||||
|
|
@ -539,7 +574,8 @@ class ExtractionService:
|
|||
if operationId:
|
||||
self.services.chat.progressLogUpdate(operationId, 0.1, f"Extracting content from {len(documents)} documents")
|
||||
# Pass operationId as parentOperationId for hierarchical logging
|
||||
extractionResult = self.extractContent(documents, extractionOptions, operationId=operationId, parentOperationId=parentOperationId)
|
||||
# Correct hierarchy: parentOperationId -> operationId -> docOperationId
|
||||
extractionResult = self.extractContent(documents, extractionOptions, operationId=operationId, parentOperationId=operationId)
|
||||
|
||||
if not isinstance(extractionResult, list):
|
||||
if operationId:
|
||||
|
|
@ -549,9 +585,10 @@ class ExtractionService:
|
|||
# Process parts (not chunks) with model-aware AI calls
|
||||
if operationId:
|
||||
self.services.chat.progressLogUpdate(operationId, 0.3, f"Processing {len(extractionResult)} extracted content parts")
|
||||
# Use parent operation ID directly (parentId should be operationId, not log entry ID)
|
||||
parentOperationId = operationId # Use the parent's operationId directly
|
||||
partResults = await self._processPartsWithMapping(extractionResult, prompt, aiObjects, options, operationId, parentOperationId)
|
||||
# Use operationId as parentOperationId for child operations
|
||||
# Correct hierarchy: parentOperationId -> operationId -> partOperationId
|
||||
processParentOperationId = operationId
|
||||
partResults = await self._processPartsWithMapping(extractionResult, prompt, aiObjects, options, operationId, processParentOperationId)
|
||||
|
||||
# Merge results using existing merging system
|
||||
if operationId:
|
||||
|
|
@ -733,7 +770,8 @@ class ExtractionService:
|
|||
# Detect input type and convert accordingly
|
||||
if isinstance(partResults[0], PartResult):
|
||||
# Existing logic for PartResult (from processDocumentsPerChunk)
|
||||
for part_result in partResults:
|
||||
# Phase 7: Add originalIndex for explicit ordering
|
||||
for i, part_result in enumerate(partResults):
|
||||
content_part = ContentPart(
|
||||
id=part_result.originalPart.id,
|
||||
parentId=part_result.originalPart.parentId,
|
||||
|
|
@ -744,7 +782,9 @@ class ExtractionService:
|
|||
metadata={
|
||||
**part_result.originalPart.metadata,
|
||||
"aiResult": True,
|
||||
"originalIndex": i, # Phase 7: Explicit order index
|
||||
"partIndex": part_result.partIndex,
|
||||
"processingOrder": i, # Phase 7: Processing order
|
||||
"documentId": part_result.documentId,
|
||||
"processingTime": part_result.processingTime,
|
||||
"success": part_result.metadata.get("success", False)
|
||||
|
|
@ -753,6 +793,7 @@ class ExtractionService:
|
|||
content_parts.append(content_part)
|
||||
elif isinstance(partResults[0], AiCallResponse):
|
||||
# Logic from interfaceAiObjects (from content parts processing)
|
||||
# Phase 7: Add originalIndex for explicit ordering
|
||||
for i, result in enumerate(partResults):
|
||||
if result.content:
|
||||
content_part = ContentPart(
|
||||
|
|
@ -764,6 +805,8 @@ class ExtractionService:
|
|||
data=result.content,
|
||||
metadata={
|
||||
"aiResult": True,
|
||||
"originalIndex": i, # Phase 7: Explicit order index
|
||||
"processingOrder": i, # Phase 7: Processing order
|
||||
"modelName": result.modelName,
|
||||
"priceUsd": result.priceUsd,
|
||||
"processingTime": result.processingTime,
|
||||
|
|
@ -792,11 +835,12 @@ class ExtractionService:
|
|||
|
||||
# Determine merge strategy based on input type
|
||||
if isinstance(partResults[0], PartResult):
|
||||
# Use strategy for extraction workflow (group by document, order by part index)
|
||||
# Phase 7: Use originalIndex for explicit ordering
|
||||
# Use strategy for extraction workflow (group by document, order by originalIndex)
|
||||
merge_strategy = MergeStrategy(
|
||||
useIntelligentMerging=True,
|
||||
groupBy="documentId", # Group by document
|
||||
orderBy="partIndex", # Order by part index
|
||||
orderBy="originalIndex", # Phase 7: Order by originalIndex instead of partIndex
|
||||
mergeType="concatenate"
|
||||
)
|
||||
else:
|
||||
|
|
@ -811,10 +855,46 @@ class ExtractionService:
|
|||
# Apply merging
|
||||
merged_parts = applyMerging(content_parts, merge_strategy)
|
||||
|
||||
# Convert back to string
|
||||
final_content = "\n\n".join([part.data for part in merged_parts])
|
||||
# Phase 6: Enhanced format with metadata preservation
|
||||
# CRITICAL: Don't add SOURCE markers for internal use - metadata is already preserved in ContentPart objects
|
||||
# SOURCE markers should ONLY be added when content is returned directly to user for display/debugging
|
||||
# For extraction content used in generation pipelines, metadata is in ContentPart.metadata, not in text markers
|
||||
|
||||
logger.info(f"Merged {len(partResults)} parts using unified merging system")
|
||||
# Check if this is a generation response by looking at operationType or content structure
|
||||
isGenerationResponse = False
|
||||
if options and hasattr(options, 'operationType'):
|
||||
# Generation responses use DATA_GENERATE operation type
|
||||
from modules.datamodels.datamodelAi import OperationTypeEnum
|
||||
isGenerationResponse = options.operationType == OperationTypeEnum.DATA_GENERATE
|
||||
|
||||
# Also check if content looks like JSON (starts with { or [)
|
||||
if not isGenerationResponse and merged_parts:
|
||||
firstPartData = merged_parts[0].data if merged_parts[0].data else ""
|
||||
if isinstance(firstPartData, str) and firstPartData.strip().startswith(('{', '[')):
|
||||
# Check if it's a complete JSON structure (not extracted content)
|
||||
# Generation responses are complete JSON, extraction responses are text content
|
||||
try:
|
||||
# json is already imported at module level
|
||||
json.loads(firstPartData.strip())
|
||||
# If it parses as JSON and has "documents" key, it's likely a generation response
|
||||
parsed = json.loads(firstPartData.strip())
|
||||
if isinstance(parsed, dict) and "documents" in parsed:
|
||||
isGenerationResponse = True
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# ROOT CAUSE FIX: Never add SOURCE markers - metadata is preserved in ContentPart.metadata
|
||||
# SOURCE markers pollute content and cause issues when content is used in generation pipelines
|
||||
# If traceability is needed, use ContentPart.metadata fields (documentId, documentMimeType, label, etc.)
|
||||
content_sections = []
|
||||
for part in merged_parts:
|
||||
# Always return clean content without SOURCE markers
|
||||
# Metadata is available in ContentPart.metadata for traceability
|
||||
content_sections.append(part.data if part.data else "")
|
||||
|
||||
final_content = "\n\n".join(content_sections)
|
||||
|
||||
logger.info(f"Merged {len(partResults)} parts using unified merging system with metadata preservation (generationResponse={isGenerationResponse})")
|
||||
return final_content.strip()
|
||||
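Review note: the generation-response check in this hunk can be isolated as a small predicate. The sketch below is an assumption-based restatement (the name `looks_like_generation_response` is hypothetical); it parses the text once and catches only the parse error instead of using a bare `except`.

```python
import json
from typing import Any


def looks_like_generation_response(text: Any) -> bool:
    """Return True if text is a JSON object that carries a "documents" key."""
    if not isinstance(text, str):
        return False
    stripped = text.strip()
    # Cheap pre-check: only attempt to parse content that starts like JSON
    if not stripped.startswith(("{", "[")):
        return False
    try:
        parsed = json.loads(stripped)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(parsed, dict) and "documents" in parsed
```

Extracted text that merely begins with a brace fails the parse and is treated as ordinary extraction content.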
|
||||
async def chunkContentPartForAi(self, contentPart, model, options, prompt: str = "") -> List[Dict[str, Any]]:
|
||||
|
|
@ -827,9 +907,14 @@ class ExtractionService:
|
|||
modelContextTokens = model.contextLength # Total context in tokens
|
||||
modelMaxOutputTokens = model.maxTokens # Maximum output tokens
|
||||
|
||||
# CRITICAL: Use same conservative token factor as in processContentPartWithFallback
|
||||
# Real-world observation: Our calculation says 94k tokens, but API says 217k tokens (2.3x difference!)
|
||||
TOKEN_SAFETY_FACTOR = 2.2 # Conservative: accounts for JSON tokenization and API overhead
|
||||
|
||||
# Reserve tokens for:
|
||||
# 1. Prompt (user message)
|
||||
promptTokens = len(prompt.encode('utf-8')) / 4 if prompt else 0
|
||||
# 1. Prompt (user message) - use conservative factor
|
||||
promptSize = len(prompt.encode('utf-8')) if prompt else 0
|
||||
promptTokens = promptSize / TOKEN_SAFETY_FACTOR
|
||||
|
||||
# 2. System message wrapper ("Context from documents:\n")
|
||||
systemMessageTokens = 10 # ~40 bytes = 10 tokens
|
||||
|
|
@ -844,31 +929,38 @@ class ExtractionService:
|
|||
totalReservedTokens = promptTokens + systemMessageTokens + messageOverheadTokens + outputTokens
|
||||
|
||||
# Available tokens for content = context length - reserved tokens
|
||||
# Use 80% of available for safety margin
|
||||
availableContentTokens = int((modelContextTokens - totalReservedTokens) * 0.8)
|
||||
# Use 60% of available (same conservative margin as in processContentPartWithFallback)
|
||||
availableContentTokens = int((modelContextTokens - totalReservedTokens) * 0.60)
|
||||
|
||||
# Ensure we have at least some space
|
||||
if availableContentTokens < 100:
|
||||
logger.warning(f"Very limited space for content: {availableContentTokens} tokens available. Model: {model.name}, contextLength: {modelContextTokens}, maxTokens: {modelMaxOutputTokens}, prompt: {promptTokens:.0f} tokens")
|
||||
availableContentTokens = max(100, int(modelContextTokens * 0.1)) # Fallback to 10% of context
|
||||
|
||||
# Convert tokens to bytes (1 token ≈ 4 bytes)
|
||||
availableContentBytes = availableContentTokens * 4
|
||||
# Convert tokens to bytes using conservative factor (reverse: bytes = tokens * factor)
|
||||
availableContentBytes = int(availableContentTokens * TOKEN_SAFETY_FACTOR)
|
||||
|
||||
logger.debug(f"Chunking calculation for {model.name}: contextLength={modelContextTokens} tokens, maxTokens={modelMaxOutputTokens} tokens, prompt={promptTokens:.0f} tokens, reserved={totalReservedTokens:.0f} tokens, available={availableContentTokens} tokens ({availableContentBytes} bytes)")
|
||||
logger.info(f"Chunking calculation for {model.name}: contextLength={modelContextTokens} tokens, maxTokens={modelMaxOutputTokens} tokens, prompt={promptTokens:.0f} tokens est., reserved={totalReservedTokens:.0f} tokens est., available={availableContentTokens} tokens est. ({availableContentBytes} bytes), factor={TOKEN_SAFETY_FACTOR}")
|
||||
|
||||
# Use 70% of available content bytes for text chunks (conservative)
|
||||
textChunkSize = int(availableContentBytes * 0.7)
|
||||
imageChunkSize = int(availableContentBytes * 0.8) # 80% for image chunks
|
||||
# Use 50% of available content bytes for text chunks (very conservative to ensure chunks fit)
|
||||
# This ensures that even with token counting inaccuracies, chunks will fit
|
||||
textChunkSize = int(availableContentBytes * 0.5)
|
||||
structureChunkSize = int(availableContentBytes * 0.5) # CRITICAL: Also set for StructureChunker (JSON content)
|
||||
tableChunkSize = int(availableContentBytes * 0.5) # Also set for TableChunker
|
||||
imageChunkSize = int(availableContentBytes * 0.6) # 60% for image chunks
|
||||
|
||||
# Build chunking options
|
||||
# Build chunking options - include ALL chunk size options for different chunkers
|
||||
chunkingOptions = {
|
||||
"textChunkSize": textChunkSize,
|
||||
"structureChunkSize": structureChunkSize, # CRITICAL: Required for StructureChunker (JSON)
|
||||
"tableChunkSize": tableChunkSize, # Required for TableChunker
|
||||
"imageChunkSize": imageChunkSize,
|
||||
"maxSize": availableContentBytes,
|
||||
"chunkAllowed": True
|
||||
}
|
||||
|
||||
logger.info(f"Chunking options: textChunkSize={textChunkSize} bytes, structureChunkSize={structureChunkSize} bytes, tableChunkSize={tableChunkSize} bytes, imageChunkSize={imageChunkSize} bytes, contentPartSize={len(contentPart.data.encode('utf-8')) if contentPart.data else 0} bytes")
|
||||
|
||||
# Get appropriate chunker (uses existing ChunkerRegistry ✅)
|
||||
chunker = self._chunkerRegistry.resolve(contentPart.typeGroup)
|
||||
|
||||
|
|
@ -878,8 +970,14 @@ class ExtractionService:
|
|||
|
||||
# Chunk the content part
|
||||
try:
|
||||
contentSize = len(contentPart.data.encode('utf-8')) if contentPart.data else 0
|
||||
logger.info(f"Chunking {contentPart.typeGroup} part: contentSize={contentSize} bytes, textChunkSize={textChunkSize} bytes, structureChunkSize={structureChunkSize} bytes")
|
||||
chunks = chunker.chunk(contentPart, chunkingOptions)
|
||||
logger.debug(f"Created {len(chunks)} chunks for {contentPart.typeGroup} part")
|
||||
logger.info(f"Created {len(chunks)} chunks for {contentPart.typeGroup} part (contentSize={contentSize} bytes)")
|
||||
if chunks:
|
||||
for i, chunk in enumerate(chunks):
|
||||
chunkSize = len(chunk.get('data', '').encode('utf-8')) if chunk.get('data') else 0
|
||||
logger.info(f" Chunk {i+1}/{len(chunks)}: {chunkSize} bytes")
|
||||
return chunks
|
||||
except Exception as e:
|
||||
logger.error(f"Chunking failed for {contentPart.typeGroup}: {str(e)}")
|
||||
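Review note: the budget arithmetic in `chunkContentPartForAi` can be sketched as one pure function. The factor 2.2, the 60% margin, the 100-token floor and the 50%/60% chunk shares come from the diff. The `overhead_tokens` default and the choice to reserve the full `max_output_tokens` are assumptions, because the lines that compute `messageOverheadTokens` and `outputTokens` are outside the visible hunks.

```python
TOKEN_SAFETY_FACTOR = 2.2  # bytes per token, value taken from the diff
CONTENT_MARGIN = 0.60      # share of the unreserved context given to content


def estimate_chunk_budget(context_tokens, max_output_tokens, prompt="",
                          overhead_tokens=30):
    """Return estimated token and byte budgets for one model call.

    overhead_tokens is a hypothetical lump sum for message wrappers.
    """
    prompt_tokens = len(prompt.encode("utf-8")) / TOKEN_SAFETY_FACTOR
    reserved = prompt_tokens + overhead_tokens + max_output_tokens
    available_tokens = int((context_tokens - reserved) * CONTENT_MARGIN)
    if available_tokens < 100:
        # Same fallback as the diff: at least 100 tokens or 10% of the context
        available_tokens = max(100, int(context_tokens * 0.1))
    available_bytes = int(available_tokens * TOKEN_SAFETY_FACTOR)
    return {
        "available_tokens": available_tokens,
        "available_bytes": available_bytes,
        "text_chunk_bytes": int(available_bytes * 0.5),
        "image_chunk_bytes": int(available_bytes * 0.6),
    }


roomy = estimate_chunk_budget(200_000, 8_000, prompt="Summarize the document.")
tight = estimate_chunk_budget(1_000, 2_000)  # reservation exceeds the context
```

Keeping the calculation in one place would also avoid the two copies of `TOKEN_SAFETY_FACTOR` drifting apart.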
|
|
@ -999,15 +1097,87 @@ class ExtractionService:
|
|||
|
||||
availableContentBytes = availableContentTokens * 4
|
||||
|
||||
logger.debug(f"Size check for {model.name}: partSize={partSize} bytes, availableContentBytes={availableContentBytes} bytes")
|
||||
# Also check prompt size - prompt + content together must fit
|
||||
promptSize = len(prompt.encode('utf-8')) if prompt else 0
|
||||
|
||||
if partSize <= availableContentBytes:
|
||||
# CRITICAL: Token counting approximation is VERY inaccurate for JSON/content
|
||||
# Real-world observation: Our calculation says 94k tokens, but API says 217k tokens (2.3x difference!)
|
||||
# This happens because:
|
||||
# 1. JSON/structured content tokenizes differently (more tokens per byte)
|
||||
# 2. API has message structure overhead (system prompts, message wrappers)
|
||||
# 3. Tokenizer differences between our approximation and actual API tokenizer
|
||||
# Use conservative factor: 1 token ≈ 2.2 bytes (instead of 4) to account for these differences
|
||||
TOKEN_SAFETY_FACTOR = 2.2 # Conservative: accounts for JSON tokenization and API overhead
|
||||
promptTokens = promptSize / TOKEN_SAFETY_FACTOR
|
||||
contentTokens = partSize / TOKEN_SAFETY_FACTOR
|
||||
totalTokens = promptTokens + contentTokens
|
||||
|
||||
# CRITICAL: Use very conservative margin (60%) because:
|
||||
# 1. Token counting approximation is inaccurate - real tokens can be 2-3x more
|
||||
# 2. API has additional overhead (message structure, system prompts, etc.)
|
||||
# 3. Anthropic API is strict about the 200k limit
|
||||
# 4. We've seen cases where our calculation says "fits" but API says "too long"
|
||||
maxTotalTokens = int(modelContextTokens * 0.60)
|
||||
|
||||
logger.info(f"Size check for {model.name}: partSize={partSize} bytes ({contentTokens:.0f} tokens est.), promptSize={promptSize} bytes ({promptTokens:.0f} tokens est.), total={totalTokens:.0f} tokens est., modelContext={modelContextTokens} tokens, maxTotal={maxTotalTokens} tokens (60% margin, conservative factor={TOKEN_SAFETY_FACTOR})")
|
||||
|
||||
# CRITICAL: Always check totalTokens first - if prompt + content exceeds limit, MUST chunk
|
||||
# Token counting approximation may differ significantly from API, so use very conservative margin
|
||||
if totalTokens > maxTotalTokens:
|
||||
logger.warning(f"⚠️ Total tokens ({totalTokens:.0f} est.) exceed model limit ({maxTotalTokens}), chunking required. Prompt: {promptTokens:.0f} tokens est., Content: {contentTokens:.0f} tokens est.")
|
||||
elif partSize > availableContentBytes:
|
||||
logger.warning(f"⚠️ Content part ({contentTokens:.0f} tokens est.) exceeds available space ({availableContentBytes/TOKEN_SAFETY_FACTOR:.0f} tokens est.), chunking required")
|
||||
|
||||
# If either condition fails, chunk the content
|
||||
# CRITICAL: IMAGE_GENERATE operations should NOT use chunking - they generate images from prompts, not process content chunks
|
||||
if (totalTokens > maxTotalTokens or partSize > availableContentBytes) and options.operationType != OperationTypeEnum.IMAGE_GENERATE:
|
||||
# Part too large or total exceeds limit - chunk it (but not for image generation)
|
||||
chunks = await self.chunkContentPartForAi(contentPart, model, options, prompt)
|
||||
if not chunks:
|
||||
raise ValueError(f"Failed to chunk content part for model {model.name}")
|
||||
|
||||
logger.info(f"Starting to process {len(chunks)} chunks with model {model.name}")
|
||||
|
||||
if progressCallback:
|
||||
progressCallback(0.0, f"Starting to process {len(chunks)} chunks")
|
||||
|
||||
chunkResults = []
|
||||
for idx, chunk in enumerate(chunks):
|
||||
chunkNum = idx + 1
|
||||
chunkData = chunk.get('data', '')
|
||||
logger.info(f"Processing chunk {chunkNum}/{len(chunks)} with model {model.name}")
|
||||
|
||||
if progressCallback:
|
||||
progressCallback(chunkNum / len(chunks), f"Processing chunk {chunkNum}/{len(chunks)}")
|
||||
|
||||
try:
|
||||
chunkResponse = await aiObjects._callWithModel(model, prompt, chunkData, options)
|
||||
chunkResults.append(chunkResponse)
|
||||
except Exception as chunkError:
|
||||
logger.error(f"Error processing chunk {chunkNum}/{len(chunks)}: {str(chunkError)}")
|
||||
# Continue with other chunks even if one fails
|
||||
continue
|
||||
|
||||
# Merge chunk results
|
||||
if not chunkResults:
|
||||
raise ValueError("All chunks failed for content part")
|
||||
|
||||
mergedContent = self.mergePartResults(chunkResults, options)
|
||||
return AiCallResponse(
|
||||
content=mergedContent,
|
||||
modelName=model.name,
|
||||
priceUsd=sum(r.priceUsd for r in chunkResults),
|
||||
processingTime=sum(r.processingTime for r in chunkResults),
|
||||
bytesSent=sum(r.bytesSent for r in chunkResults),
|
||||
bytesReceived=sum(r.bytesReceived for r in chunkResults),
|
||||
errorCount=sum(r.errorCount for r in chunkResults)
|
||||
)
|
||||
else:
|
||||
# Part fits - call AI directly via aiObjects interface
|
||||
logger.info(f"✅ Content part fits within model limits, processing directly")
|
||||
response = await aiObjects._callWithModel(model, prompt, contentPart.data, options)
|
||||
logger.info(f"✅ Content part processed successfully with model: {model.name}")
|
||||
return response
|
||||
else:
|
||||
# Part too large - chunk it
|
||||
chunks = await self.chunkContentPartForAi(contentPart, model, options, prompt)
|
||||
if not chunks:
|
||||
raise ValueError(f"Failed to chunk content part for model {model.name}")
|
||||
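Review note: the chunk-or-call-directly decision above reduces to two comparisons plus the image-generation exemption. The sketch below assumes a hypothetical helper `needs_chunking` with plain numeric inputs; the byte counts in the usage lines are illustrative, not measurements from the project.

```python
TOKEN_SAFETY_FACTOR = 2.2  # bytes per token, value taken from the diff
MAX_CONTEXT_SHARE = 0.60   # prompt + content may use at most 60% of the context


def needs_chunking(part_bytes, prompt_bytes, context_tokens,
                   available_content_bytes, is_image_generation=False):
    """Decide whether a content part must be split before the model call."""
    if is_image_generation:
        # Image generation works from the prompt alone and is never chunked
        return False
    total_tokens = (part_bytes + prompt_bytes) / TOKEN_SAFETY_FACTOR
    max_total_tokens = int(context_tokens * MAX_CONTEXT_SHARE)
    return total_tokens > max_total_tokens or part_bytes > available_content_bytes


# 376,000 bytes is 94,000 tokens at 4 bytes/token, but about 170,909 tokens
# at 2.2 bytes/token, which exceeds 60% of a 200,000-token context.
large = needs_chunking(376_000, 0, 200_000, 1_000_000)
small = needs_chunking(1_000, 100, 200_000, 10_000)
```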
|
|
@ -1037,8 +1207,8 @@ class ExtractionService:
|
|||
logger.error(f"❌ Error processing chunk {chunkNum}/{len(chunks)}: {str(e)}")
|
||||
raise
|
||||
|
||||
# Merge chunk results
|
||||
mergedContent = self.mergeChunkResults(chunkResults)
|
||||
# Merge chunk results using unified mergePartResults
|
||||
mergedContent = self.mergePartResults(chunkResults, options)
|
||||
|
||||
logger.info(f"✅ Content part chunked and processed with model: {model.name} ({len(chunks)} chunks)")
|
||||
return AiCallResponse(
|
||||
|
|
|
|||
|
|
@ -2,7 +2,10 @@
|
|||
# All rights reserved.
|
||||
import logging
|
||||
import uuid
|
||||
from typing import Any, Dict, List, Optional
|
||||
import base64
|
||||
import traceback
|
||||
from typing import Any, Dict, List, Optional, Callable
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from modules.datamodels.datamodelChat import ChatDocument
|
||||
from modules.services.serviceGeneration.subDocumentUtility import (
|
||||
getFileExtension,
|
||||
|
|
@ -56,11 +59,35 @@ class GenerationService:
|
|||
# Detect MIME without relying on a service center
|
||||
mime_type = detectMimeTypeFromContent(content, doc.documentName)
|
||||
|
||||
# IMPORTANT: For ActionDocuments with validationMetadata (e.g. context.extractContent)
|
||||
# we must serialize the entire ActionDocument, not just documentData
|
||||
document_data = doc.documentData
|
||||
if hasattr(doc, 'validationMetadata') and doc.validationMetadata:
|
||||
# If validationMetadata is present, serialize the full ActionDocument format
|
||||
if mime_type == "application/json":
|
||||
# Build the ActionDocument format with validationMetadata and documentData
|
||||
if hasattr(document_data, 'model_dump'):
|
||||
# Pydantic v2
|
||||
document_data_dict = document_data.model_dump()
|
||||
elif hasattr(document_data, 'dict'):
|
||||
# Pydantic v1
|
||||
document_data_dict = document_data.dict()
|
||||
elif isinstance(document_data, dict):
|
||||
document_data_dict = document_data
|
||||
else:
|
||||
document_data_dict = {"data": str(document_data)}
|
||||
|
||||
# Build the ActionDocument format
|
||||
document_data = {
|
||||
"validationMetadata": doc.validationMetadata,
|
||||
"documentData": document_data_dict
|
||||
}
|
||||
|
||||
return {
|
||||
'fileName': doc.documentName,
|
||||
'fileSize': len(str(doc.documentData)),
|
||||
'fileSize': len(str(document_data)),
|
||||
'mimeType': mime_type,
|
||||
'content': doc.documentData,
|
||||
'content': document_data,
|
||||
'document': doc
|
||||
}
|
||||
except Exception as e:
|
||||
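Review note: the serialization branch in this hunk tolerates Pydantic v2 models, Pydantic v1 models, plain dicts and anything else. A dependency-free sketch of that duck-typed conversion follows; `to_plain_dict`, `wrap_with_validation` and the `FakeV2Model` stand-in are hypothetical names used only for illustration.

```python
from typing import Any, Dict


def to_plain_dict(data: Any) -> Dict[str, Any]:
    """Convert model-like objects to a plain dict, mirroring the diff's order."""
    if hasattr(data, "model_dump"):   # Pydantic v2 style
        return data.model_dump()
    if hasattr(data, "dict"):         # Pydantic v1 style
        return data.dict()
    if isinstance(data, dict):
        return data
    return {"data": str(data)}


def wrap_with_validation(document_data: Any, validation_metadata: Any,
                         mime_type: str) -> Any:
    """Wrap JSON documents together with their validation metadata."""
    if not validation_metadata or mime_type != "application/json":
        return document_data
    return {
        "validationMetadata": validation_metadata,
        "documentData": to_plain_dict(document_data),
    }


class FakeV2Model:
    """Stand-in for a Pydantic v2 model (assumption, no pydantic import)."""

    def model_dump(self) -> Dict[str, Any]:
        return {"value": 1}


wrapped = wrap_with_validation(FakeV2Model(), {"valid": True}, "application/json")
```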
|
|
@ -82,14 +109,62 @@ class GenerationService:
|
|||
documentData = doc_data['content']
|
||||
mimeType = doc_data['mimeType']
|
||||
|
||||
# Convert document data to string content
|
||||
content = convertDocumentDataToString(documentData, getFileExtension(documentName))
|
||||
# Handle binary data (images, PDFs, Office docs) differently from text
|
||||
# Check if this is a binary MIME type
|
||||
binaryMimeTypes = {
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
||||
"application/vnd.openxmlformats-officedocument.presentationml.presentation",
|
||||
"application/pdf",
|
||||
"image/png", "image/jpeg", "image/jpg", "image/gif", "image/webp", "image/bmp", "image/svg+xml",
|
||||
}
|
||||
|
||||
# Skip empty or minimal content
|
||||
minimalContentPatterns = ['{}', '[]', 'null', '""', "''"]
|
||||
if not content or content.strip() == "" or content.strip() in minimalContentPatterns:
|
||||
logger.warning(f"Empty or minimal content for document {documentName}, skipping")
|
||||
continue
|
||||
isBinaryMimeType = mimeType in binaryMimeTypes
|
||||
base64encoded = False
|
||||
content = None
|
||||
|
||||
if isBinaryMimeType:
|
||||
# For binary data, handle bytes vs base64 string vs regular string
|
||||
if isinstance(documentData, bytes):
|
||||
# Already bytes - encode to base64 string for storage
|
||||
# base64 is already imported at module level
|
||||
content = base64.b64encode(documentData).decode('utf-8')
|
||||
base64encoded = True
|
||||
elif isinstance(documentData, str):
|
||||
# Check if it's already valid base64
|
||||
# base64 is already imported at module level
|
||||
try:
|
||||
# Try to decode to verify it's base64
|
||||
base64.b64decode(documentData, validate=True)
|
||||
# Valid base64 - use as is
|
||||
content = documentData
|
||||
base64encoded = True
|
||||
except Exception:
|
||||
# Not valid base64 - might be raw string, try encoding
|
||||
try:
|
||||
content = base64.b64encode(documentData.encode('utf-8')).decode('utf-8')
|
||||
base64encoded = True
|
||||
except Exception:
|
||||
logger.warning(f"Could not process binary data for {documentName}, skipping")
|
||||
continue
|
||||
else:
|
||||
# Other types - convert to string then base64
|
||||
# base64 is already imported at module level
|
||||
try:
|
||||
content = base64.b64encode(str(documentData).encode('utf-8')).decode('utf-8')
|
||||
base64encoded = True
|
||||
except Exception:
|
||||
logger.warning(f"Could not encode binary data for {documentName}, skipping")
|
||||
continue
|
||||
else:
|
||||
# Text data - convert to string
|
||||
content = convertDocumentDataToString(documentData, getFileExtension(documentName))
|
||||
|
||||
# Skip empty or minimal content
|
||||
minimalContentPatterns = ['{}', '[]', 'null', '""', "''"]
|
||||
if not content or content.strip() == "" or content.strip() in minimalContentPatterns:
|
||||
logger.warning(f"Empty or minimal content for document {documentName}, skipping")
|
||||
continue
|
||||
|
||||
# Normalize file extension based on mime type if missing or incorrect
|
||||
try:
|
||||
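Review note: the three binary branches above (bytes, string, other) can be condensed into one helper. This is a sketch under the assumption that the goal is always a base64 text payload; `to_base64_text` is a hypothetical name. One caveat the sketch makes visible: a short plain string such as `"abcd"` is itself valid base64 and is passed through unchanged.

```python
import base64
import binascii
from typing import Any


def to_base64_text(data: Any) -> str:
    """Return data as base64 text, leaving already-encoded strings untouched."""
    if isinstance(data, bytes):
        return base64.b64encode(data).decode("ascii")
    if isinstance(data, str):
        try:
            # validate=True rejects characters outside the base64 alphabet
            base64.b64decode(data, validate=True)
            return data
        except (binascii.Error, ValueError):
            return base64.b64encode(data.encode("utf-8")).decode("ascii")
    # Any other type: stringify first, then encode
    return base64.b64encode(str(data).encode("utf-8")).decode("ascii")


already_encoded = to_base64_text("aGVsbG8=")
from_raw_text = to_base64_text("hello world!")
from_bytes = to_base64_text(b"\x89PNG")
```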
|
|
@ -102,6 +177,13 @@ class GenerationService:
|
|||
"text/markdown": ".md",
|
||||
"text/plain": ".txt",
|
||||
"application/json": ".json",
|
||||
"image/png": ".png",
|
||||
"image/jpeg": ".jpg",
|
||||
"image/jpg": ".jpg",
|
||||
"image/gif": ".gif",
|
||||
"image/webp": ".webp",
|
||||
"image/bmp": ".bmp",
|
||||
"image/svg+xml": ".svg",
|
||||
}
|
||||
expectedExt = mime_to_ext.get(mimeType)
|
||||
if expectedExt:
|
||||
|
|
@ -114,20 +196,6 @@ class GenerationService:
|
|||
except Exception:
|
||||
pass
|
||||
|
||||
# Decide if content is base64-encoded binary (e.g., docx/pdf) or plain text
|
||||
base64encoded = False
|
||||
try:
|
||||
binaryMimeTypes = {
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
||||
"application/vnd.openxmlformats-officedocument.presentationml.presentation",
|
||||
"application/pdf",
|
||||
}
|
||||
if isinstance(documentData, str) and mimeType in binaryMimeTypes:
|
||||
base64encoded = True
|
||||
except Exception:
|
||||
base64encoded = False
|
||||
|
||||
# Create document with file in one step using interfaces directly
|
||||
document = self._createDocument(
|
||||
fileName=documentName,
|
||||
|
|
@ -190,7 +258,7 @@ class GenerationService:
|
|||
return None
|
||||
# Convert content to bytes
|
||||
if base64encoded:
|
||||
import base64
|
||||
# base64 is already imported at module level
|
||||
content_bytes = base64.b64decode(content)
|
||||
else:
|
||||
content_bytes = content.encode('utf-8')
|
||||
|
|
@ -278,27 +346,31 @@ class GenerationService:
|
|||
'workflowId': 'unknown'
|
||||
}
|
||||
|
||||
async def renderReport(self, extractedContent: Dict[str, Any], outputFormat: str, title: str, userPrompt: str = None, aiService=None) -> tuple[str, str]:
|
||||
async def renderReport(self, extractedContent: Dict[str, Any], outputFormat: str, title: str, userPrompt: str = None, aiService=None, parentOperationId: Optional[str] = None) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render extracted JSON content to the specified output format.
|
||||
Always uses unified "documents" array format.
|
||||
Processes EACH document separately and calls renderer for each.
|
||||
Each renderer can return 1..n documents (e.g., HTML + images).
|
||||
|
||||
Args:
|
||||
extractedContent: Structured JSON document from AI extraction
|
||||
extractedContent: Structured JSON document with documents array
|
||||
outputFormat: Target format (html, pdf, docx, txt, md, json, csv, xlsx)
|
||||
In future, each document can have its own format
|
||||
title: Report title
|
||||
userPrompt: User's original prompt for report generation
|
||||
aiService: AI service instance for generation prompt creation
|
||||
parentOperationId: Optional parent operation ID for hierarchical logging
|
||||
|
||||
Returns:
|
||||
tuple: (rendered_content, mime_type)
|
||||
List of RenderedDocument objects.
|
||||
Each RenderedDocument represents one rendered file (main document or supporting file)
|
||||
"""
|
||||
try:
|
||||
# Validate JSON input
|
||||
if not isinstance(extractedContent, dict):
|
||||
raise ValueError("extractedContent must be a JSON dictionary")
|
||||
|
||||
# Unified approach: Always expect "documents" array (single doc = n=1)
|
||||
# Unified approach: Always expect "documents" array
|
||||
if "documents" not in extractedContent:
|
||||
raise ValueError("extractedContent must contain 'documents' array")
|
||||
|
||||
|
|
@ -306,32 +378,136 @@ class GenerationService:
|
|||
if len(documents) == 0:
|
||||
raise ValueError("No documents found in 'documents' array")
|
||||
|
||||
# Use first document for rendering
|
||||
single_doc = documents[0]
|
||||
if "sections" not in single_doc:
|
||||
raise ValueError("Document must contain 'sections' field")
|
||||
metadata = extractedContent.get("metadata", {})
|
||||
allRenderedDocuments = []
|
||||
|
||||
# Create content for single document renderer
|
||||
contentToRender = {
|
||||
"sections": single_doc["sections"],
|
||||
"metadata": extractedContent.get("metadata", {}),
|
||||
"continuation": extractedContent.get("continuation", None)
|
||||
}
|
||||
|
||||
# Get the appropriate renderer for the format
|
||||
renderer = self._getFormatRenderer(outputFormat)
|
||||
if not renderer:
|
||||
raise ValueError(f"Unsupported output format: {outputFormat}")
|
||||
# Process EACH document separately
|
||||
for docIndex, doc in enumerate(documents):
|
||||
if not isinstance(doc, dict):
|
||||
logger.warning(f"Skipping invalid document at index {docIndex}")
|
||||
continue
|
||||
|
||||
if "sections" not in doc:
|
||||
logger.warning(f"Document {doc.get('id', docIndex)} has no sections, skipping")
|
||||
continue
|
||||
|
||||
# Determine format for this document
|
||||
# TODO: In future, each document can have its own format field
|
||||
# For now, use the global outputFormat
|
||||
docFormat = doc.get("format", outputFormat)
|
||||
|
||||
# Get renderer for this document's format
|
||||
renderer = self._getFormatRenderer(docFormat)
|
||||
if not renderer:
|
||||
logger.warning(f"Unsupported format '{docFormat}' for document {doc.get('id', docIndex)}, skipping")
|
||||
continue
|
||||
|
||||
# Create JSON structure with single document (preserving metadata)
|
||||
singleDocContent = {
|
||||
"metadata": metadata,
|
||||
"documents": [doc] # Only this document
|
||||
}
|
||||
|
||||
# Use document title or fallback to provided title
|
||||
docTitle = doc.get("title", title)
|
||||
|
||||
# Render this document (can return multiple files, e.g., HTML + images)
|
||||
renderedDocs = await renderer.render(singleDocContent, docTitle, userPrompt, aiService)
|
||||
allRenderedDocuments.extend(renderedDocs)
|
||||
|
||||
# Render the JSON content directly (AI generation handled by main service)
|
||||
renderedContent, mimeType = await renderer.render(contentToRender, title, userPrompt, aiService)
|
||||
|
||||
return renderedContent, mimeType
|
||||
logger.info(f"Rendered {len(documents)} document(s) into {len(allRenderedDocuments)} file(s)")
|
||||
return allRenderedDocuments
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error rendering JSON report to {outputFormat}: {str(e)}")
|
||||
raise
|
||||
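Review note: callers of `renderReport` now receive a list instead of a `(content, mimeType)` tuple. The sketch below mimics the per-document loop with stand-ins: `RenderedFile` is a hypothetical substitute for `RenderedDocument` (whose fields are not shown in this diff), and `StubTextRenderer` is an invented renderer.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class RenderedFile:
    """Hypothetical stand-in for RenderedDocument."""
    filename: str
    mimeType: str
    content: str


class StubTextRenderer:
    """Invented renderer that returns exactly one text file per document."""

    async def render(self, content: Dict[str, Any], title: str) -> List[RenderedFile]:
        sections = content["documents"][0]["sections"]
        body = "\n".join(str(section) for section in sections)
        return [RenderedFile(f"{title}.txt", "text/plain", body)]


async def render_all(extracted: Dict[str, Any], title: str) -> List[RenderedFile]:
    metadata = extracted.get("metadata", {})
    rendered: List[RenderedFile] = []
    for doc in extracted["documents"]:
        if not isinstance(doc, dict) or "sections" not in doc:
            continue  # invalid documents are skipped, as in the diff
        single = {"metadata": metadata, "documents": [doc]}
        rendered.extend(await StubTextRenderer().render(single, doc.get("title", title)))
    return rendered


files = asyncio.run(render_all(
    {"documents": [{"title": "A", "sections": ["x"]}, {"title": "B"}, {"sections": ["y"]}]},
    "Report",
))
```

Because invalid documents are skipped rather than rejected, a caller may want to check for an empty result list.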
|
||||
async def generateDocumentWithTwoPhases(
|
||||
self,
|
||||
userPrompt: str,
|
||||
cachedContent: Optional[Dict[str, Any]] = None,
|
||||
contentParts: Optional[List[Any]] = None,
|
||||
maxSectionLength: int = 500,
|
||||
parallelGeneration: bool = True,
|
||||
progressCallback: Optional[Callable] = None
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Generate document using two-phase approach:
|
||||
1. Generate structure skeleton with empty sections
|
||||
2. Generate content for each section iteratively
|
||||
|
||||
This is the core logic for document generation in AI calls.
|
||||
|
||||
Args:
|
||||
userPrompt: User's original prompt
|
||||
cachedContent: Optional extracted content cache (from extraction phase)
|
||||
contentParts: Optional list of ContentParts to use for structure generation
|
||||
maxSectionLength: Maximum words for simple sections
|
||||
parallelGeneration: Enable parallel section generation
|
||||
progressCallback: Optional callback function(progress, total, message) for progress updates
|
||||
|
||||
Returns:
|
||||
Complete document structure with populated elements ready for rendering
|
||||
"""
|
||||
try:
|
||||
from modules.services.serviceGeneration.subStructureGenerator import StructureGenerator
|
||||
from modules.services.serviceGeneration.subContentGenerator import ContentGenerator
|
||||
|
||||
# Phase 1: Generate structure skeleton
|
||||
if progressCallback:
|
||||
progressCallback(0, 100, "Generating document structure...")
|
||||
|
||||
structureGenerator = StructureGenerator(self.services)
|
||||
|
||||
# Extract imageDocuments from cachedContent if available
|
||||
existingImages = None
|
||||
if cachedContent and cachedContent.get("imageDocuments"):
|
||||
existingImages = cachedContent.get("imageDocuments")
|
||||
|
||||
structure = await structureGenerator.generateStructure(
|
||||
userPrompt=userPrompt,
|
||||
documentList=None, # Not used in current implementation
|
||||
cachedContent=cachedContent,
|
||||
contentParts=contentParts, # Pass ContentParts for structure generation
|
||||
maxSectionLength=maxSectionLength,
|
||||
existingImages=existingImages
|
||||
)
|
||||
|
||||
if progressCallback:
|
||||
progressCallback(30, 100, "Structure generated, starting content generation...")
|
||||
|
||||
# Phase 2: Generate content for each section
|
||||
contentGenerator = ContentGenerator(self.services)
|
||||
|
||||
# Create progress callback wrapper for content generation phase (30-90%)
|
||||
def contentProgressCallback(sectionIndex: int, totalSections: int, message: str):
|
||||
if progressCallback:
|
||||
# Map section progress to overall progress (30% to 90%)
|
||||
if totalSections > 0:
|
||||
overallProgress = 30 + int(60 * (sectionIndex / totalSections))
|
||||
else:
|
||||
overallProgress = 30
|
||||
progressCallback(overallProgress, 100, f"Section {sectionIndex}/{totalSections}: {message}")
|
||||
|
||||
completeStructure = await contentGenerator.generateContent(
|
||||
structure=structure,
|
||||
cachedContent=cachedContent,
|
||||
userPrompt=userPrompt,
|
||||
contentParts=contentParts, # Pass ContentParts for content generation
|
||||
progressCallback=contentProgressCallback,
|
||||
parallelGeneration=parallelGeneration
|
||||
)
|
||||
|
||||
if progressCallback:
|
||||
progressCallback(100, 100, "Document generation complete")
|
||||
|
||||
return completeStructure
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in two-phase document generation: {str(e)}")
|
||||
logger.debug(traceback.format_exc())
|
||||
raise
|
||||
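Review note: the progress wrapper maps section progress onto the 30% to 90% band of the overall operation. A stand-alone version of that mapping, with the hypothetical name `map_section_progress`, might look like this.

```python
def map_section_progress(section_index: int, total_sections: int,
                         start: int = 30, span: int = 60) -> int:
    """Map section progress to overall progress within [start, start + span]."""
    if total_sections <= 0:
        return start  # nothing to generate, stay at the start of the band
    return start + int(span * (section_index / total_sections))


first = map_section_progress(0, 4)
halfway = map_section_progress(2, 4)
last = map_section_progress(4, 4)
```

The remaining 10% is reported by the final `progressCallback(100, 100, ...)` call after content generation returns.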
|
||||
async def getAdaptiveExtractionPrompt(
|
||||
self,
|
||||
outputFormat: str,
|
||||
|
|
@ -353,14 +529,21 @@ class GenerationService:
|
|||
def _getFormatRenderer(self, output_format: str):
|
||||
"""Get the appropriate renderer for the specified format using auto-discovery."""
|
||||
try:
|
||||
from .renderers.registry import getRenderer
|
||||
from .renderers.registry import getRenderer, getSupportedFormats
|
||||
renderer = getRenderer(output_format, services=self.services)
|
||||
|
||||
if renderer:
|
||||
return renderer
|
||||
|
||||
# Log available formats for debugging
|
||||
availableFormats = getSupportedFormats()
|
||||
logger.error(
|
||||
f"No renderer found for format '{output_format}'. "
|
||||
f"Available formats: {availableFormats}"
|
||||
)
|
||||
|
||||
# Fallback to text renderer if no specific renderer found
|
||||
logger.warning(f"No renderer found for format {output_format}, falling back to text")
|
||||
logger.warning(f"Falling back to text renderer for format {output_format}")
|
||||
fallbackRenderer = getRenderer('text', services=self.services)
|
||||
if fallbackRenderer:
|
||||
return fallbackRenderer
|
||||
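Review note: the lookup-then-fallback behavior of `_getFormatRenderer` can be sketched with a plain dict. The registry below is hypothetical and does not reflect the real `getRenderer` or `getSupportedFormats` implementations, which are not shown in this diff.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger(__name__)

Renderer = Callable[[str], str]


def resolve_renderer(output_format: str,
                     registry: Dict[str, Renderer]) -> Optional[Renderer]:
    """Return the renderer for output_format, else the text renderer, else None."""
    renderer = registry.get(output_format)
    if renderer:
        return renderer
    logger.error("No renderer found for format '%s'. Available formats: %s",
                 output_format, sorted(registry))
    logger.warning("Falling back to text renderer for format %s", output_format)
    return registry.get("text")


# Hypothetical registry with two trivial renderers
registry: Dict[str, Renderer] = {
    "text": lambda body: body,
    "html": lambda body: f"<p>{body}</p>",
}
```

Note that in `renderReport` a fallback renderer means an unsupported format silently yields a text file, so the "Unsupported format" branch there is reached only when the text renderer is missing as well.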
|
|
@ -370,4 +553,6 @@ class GenerationService:
|
|||
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting renderer for {output_format}: {str(e)}")
|
||||
# traceback is already imported at module level
|
||||
logger.debug(traceback.format_exc())
|
||||
return None
|
||||
|
|
@ -5,8 +5,9 @@ Base renderer class for all format renderers.
|
|||
"""
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Dict, Any, Tuple, List
|
||||
from typing import Dict, Any, List, Tuple
|
||||
from modules.datamodels.datamodelJson import supportedSectionTypes
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
|
|
@ -50,28 +51,86 @@ class BaseRenderer(ABC):
|
|||
return 0
|
||||
|
||||
@abstractmethod
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> Tuple[str, str]:
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render extracted JSON content to the target format.
|
||||
Render extracted JSON content to multiple documents.
|
||||
Each renderer must implement this method.
|
||||
Can return 1..n documents (e.g., HTML + images).
|
||||
|
||||
Args:
|
||||
extractedContent: Structured JSON content with sections and metadata
|
||||
extractedContent: Structured JSON content with sections and metadata (contains single document)
|
||||
title: Report title
|
||||
userPrompt: Original user prompt for context
|
||||
aiService: AI service instance for additional processing
|
||||
|
||||
Returns:
|
||||
tuple: (renderedContent, mimeType)
|
||||
List of RenderedDocument objects.
|
||||
First document is the main document, additional documents are supporting files (e.g., images).
|
||||
Even if only one document is returned, it must be wrapped in a list.
|
||||
"""
|
||||
pass
|
||||
|
||||
def _determineFilename(self, title: str, mimeType: str) -> str:
|
||||
"""Determine filename from title and mimeType."""
|
||||
import re
|
||||
# Get extension from mimeType
|
||||
extensionMap = {
|
||||
"text/html": "html",
|
||||
"application/pdf": "pdf",
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
|
||||
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
|
||||
"text/plain": "txt",
|
||||
"text/markdown": "md",
|
||||
"application/json": "json",
|
||||
"text/csv": "csv"
|
||||
}
|
||||
extension = extensionMap.get(mimeType, "txt")
|
||||
|
||||
# Sanitize title for filename
|
||||
sanitized = re.sub(r"[^a-zA-Z0-9._-]", "_", title)
|
||||
sanitized = re.sub(r"_+", "_", sanitized).strip("_")
|
||||
if not sanitized:
|
||||
sanitized = "document"
|
||||
|
||||
return f"{sanitized}.{extension}"
|
||||
|
||||
def _extractSections(self, reportData: Dict[str, Any]) -> List[Dict[str, Any]]:
|
||||
"""Extract sections from report data."""
|
||||
return reportData.get('sections', [])
|
||||
"""
|
||||
Extract sections from standardized schema: {metadata: {...}, documents: [{sections: [...]}]}
|
||||
Phase 5: Supports multiple documents - extracts all sections from all documents.
|
||||
"""
|
||||
if "documents" not in reportData:
|
||||
raise ValueError("Report data must follow standardized schema with 'documents' array")
|
||||
|
||||
documents = reportData.get("documents", [])
|
||||
if not isinstance(documents, list) or len(documents) == 0:
|
||||
raise ValueError("Standardized schema must contain at least one document in 'documents' array")
|
||||
|
||||
# Phase 5: Extract sections from ALL documents
|
||||
all_sections = []
|
||||
for doc in documents:
|
||||
if isinstance(doc, dict) and "sections" in doc:
|
||||
sections = doc.get("sections", [])
|
||||
if isinstance(sections, list):
|
||||
all_sections.extend(sections)
|
||||
|
||||
if not all_sections:
|
||||
raise ValueError("No sections found in any document")
|
||||
|
||||
return all_sections
|
||||
|
||||
def _extractMetadata(self, reportData: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Extract metadata from report data."""
|
||||
return reportData.get('metadata', {})
|
||||
"""
|
||||
Extract metadata from standardized schema: {metadata: {...}, documents: [{sections: [...]}]}
|
||||
"""
|
||||
if "metadata" not in reportData:
|
||||
raise ValueError("Report data must follow standardized schema with 'metadata' field")
|
||||
|
||||
metadata = reportData.get("metadata", {})
|
||||
if not isinstance(metadata, dict):
|
||||
raise ValueError("Metadata in standardized schema must be a dictionary")
|
||||
|
||||
return metadata
|
||||
|
||||
def _getTitle(self, reportData: Dict[str, Any], fallbackTitle: str) -> str:
|
||||
"""Get title from report data or use fallback."""
|
||||
|
|
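The `_determineFilename` helper added in this hunk can be tried in isolation. The following is a minimal standalone sketch, assuming the same regular expressions and a subset of the extension map shown in the diff; the free function name `determine_filename` and the constant `EXTENSION_MAP` are hypothetical and not part of the commit.

```python
import re

# Subset of the MIME-type-to-extension map shown in the diff.
EXTENSION_MAP = {
    "text/html": "html",
    "application/pdf": "pdf",
    "text/plain": "txt",
    "text/markdown": "md",
    "application/json": "json",
    "text/csv": "csv",
}


def determine_filename(title: str, mime_type: str) -> str:
    """Hypothetical free-function version of BaseRenderer._determineFilename."""
    # Unknown MIME types fall back to "txt", as in the diff.
    extension = EXTENSION_MAP.get(mime_type, "txt")
    # Replace every character outside [a-zA-Z0-9._-] with an underscore.
    sanitized = re.sub(r"[^a-zA-Z0-9._-]", "_", title)
    # Collapse runs of underscores and trim them from both ends.
    sanitized = re.sub(r"_+", "_", sanitized).strip("_")
    # An empty result falls back to the generic name "document".
    return f"{sanitized or 'document'}.{extension}"


print(determine_filename("My Report: Q1/2024", "application/pdf"))
```

Because the character class is ASCII-only, non-ASCII letters are replaced as well, so a title such as `Übersicht` loses its first letter. Whether that is acceptable depends on the titles the system is expected to handle.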
@ -79,14 +138,33 @@ class BaseRenderer(ABC):
|
|||
return metadata.get('title', fallbackTitle)
|
||||
|
||||
def _validateJsonStructure(self, jsonContent: Dict[str, Any]) -> bool:
|
||||
"""Validate that JSON content has the expected structure."""
|
||||
"""
|
||||
Validate that JSON content follows standardized schema: {metadata: {...}, documents: [{sections: [...]}]}
|
||||
"""
|
||||
if not isinstance(jsonContent, dict):
|
||||
return False
|
||||
|
||||
if "sections" not in jsonContent:
|
||||
# Validate metadata field exists
|
||||
if "metadata" not in jsonContent:
|
||||
return False
|
||||
|
||||
sections = jsonContent.get("sections", [])
|
||||
if not isinstance(jsonContent.get("metadata"), dict):
|
||||
return False
|
||||
|
||||
# Validate documents array exists and is not empty
|
||||
if "documents" not in jsonContent:
|
||||
return False
|
||||
|
||||
documents = jsonContent.get("documents", [])
|
||||
if not isinstance(documents, list) or len(documents) == 0:
|
||||
return False
|
||||
|
||||
# Validate first document has sections
|
||||
firstDoc = documents[0]
|
||||
if not isinstance(firstDoc, dict) or "sections" not in firstDoc:
|
||||
return False
|
||||
|
||||
sections = firstDoc.get("sections", [])
|
||||
if not isinstance(sections, list):
|
||||
return False
|
||||
|
||||
|
|
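The hunks above move every renderer to the standardized schema `{metadata: {...}, documents: [{sections: [...]}]}`. The sketch below shows a payload of that shape and a simplified, standalone version of the section extraction. The function name `extract_sections` and the sample ids are illustrative assumptions, and the error messages are condensed compared to the diff.

```python
from typing import Any, Dict, List


def extract_sections(report: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect sections from every document, mirroring _extractSections."""
    documents = report.get("documents")
    if not isinstance(documents, list) or not documents:
        raise ValueError("'documents' must be a non-empty list")
    sections: List[Dict[str, Any]] = []
    for doc in documents:
        # Documents that are not dicts or lack a list of sections are skipped.
        if isinstance(doc, dict) and isinstance(doc.get("sections"), list):
            sections.extend(doc["sections"])
    if not sections:
        raise ValueError("No sections found in any document")
    return sections


# Illustrative payload; the ids and title are made up.
report = {
    "metadata": {"title": "Quarterly report"},
    "documents": [
        {"sections": [{"id": "s1", "content_type": "heading"}]},
        {"sections": [{"id": "s2", "content_type": "table"}]},
    ],
}

print([s["id"] for s in extract_sections(report)])
```

One point worth noting when reading the diff: `_validateJsonStructure` inspects only the first document, while `_extractSections` walks all of them, so a malformed second document passes validation and is silently skipped during extraction.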
@ -120,98 +198,6 @@ class BaseRenderer(ABC):
|
|||
return section.get("id", "unknown")
|
||||
return "unknown"
|
||||
|
||||
def _extractTableData(self, sectionData: Dict[str, Any]) -> Tuple[List[str], List[List[str]]]:
|
||||
"""Extract table headers and rows from section data."""
|
||||
# Normalize when elements array was passed in
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
candidate = sectionData[0]
|
||||
sectionData = candidate if isinstance(candidate, dict) else {}
|
||||
headers = sectionData.get("headers", [])
|
||||
rows = sectionData.get("rows", [])
|
||||
return headers, rows
|
||||
|
||||
def _extractBulletListItems(self, sectionData: Dict[str, Any]) -> List[str]:
|
||||
"""Extract bullet list items from section data."""
|
||||
# Normalize when elements array or raw list was passed in
|
||||
if isinstance(sectionData, list):
|
||||
# Already a list of items (strings or dicts)
|
||||
items = sectionData
|
||||
else:
|
||||
items = sectionData.get("items", [])
|
||||
result = []
|
||||
for item in items:
|
||||
if isinstance(item, str):
|
||||
result.append(item)
|
||||
elif isinstance(item, dict) and "text" in item:
|
||||
result.append(item["text"])
|
||||
return result
|
||||
|
||||
def _extractHeadingData(self, sectionData: Dict[str, Any]) -> Tuple[int, str]:
|
||||
"""Extract heading level and text from section data."""
|
||||
# Normalize when elements array was passed in
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
sectionData = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
level = sectionData.get("level", 1)
|
||||
text = sectionData.get("text", "")
|
||||
return level, text
|
||||
|
||||
def _extractParagraphText(self, sectionData: Dict[str, Any]) -> str:
|
||||
"""Extract paragraph text from section data."""
|
||||
if isinstance(sectionData, list):
|
||||
# Join multiple paragraph elements if provided as a list
|
||||
texts = []
|
||||
for el in sectionData:
|
||||
if isinstance(el, dict) and "text" in el:
|
||||
texts.append(el["text"])
|
||||
elif isinstance(el, str):
|
||||
texts.append(el)
|
||||
return "\n".join(texts)
|
||||
return sectionData.get("text", "")
|
||||
|
||||
def _extractCodeBlockData(self, sectionData: Dict[str, Any]) -> Tuple[str, str]:
|
||||
"""Extract code and language from section data."""
|
||||
# Normalize when elements array was passed in
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
sectionData = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
code = sectionData.get("code", "")
|
||||
language = sectionData.get("language", "")
|
||||
return code, language
|
||||
|
||||
def _extractImageData(self, sectionData: Dict[str, Any]) -> Tuple[str, str]:
|
||||
"""Extract base64 data and alt text from section data."""
|
||||
# Normalize when elements array was passed in
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
sectionData = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
base64Data = sectionData.get("base64Data", "")
|
||||
altText = sectionData.get("altText", "Image")
|
||||
return base64Data, altText
|
||||
|
||||
def _renderImageSection(self, section: Dict[str, Any], styles: Dict[str, Any] = None) -> Any:
|
||||
"""
|
||||
Render an image section. This is a base implementation that should be overridden
|
||||
by format-specific renderers.
|
||||
|
||||
Args:
|
||||
section: Image section data
|
||||
styles: Optional styling information
|
||||
|
||||
Returns:
|
||||
Format-specific image representation
|
||||
"""
|
||||
sectionData = self._getSectionData(section)
|
||||
base64Data, altText = self._extractImageData(sectionData)
|
||||
|
||||
# Base implementation returns a simple dict
|
||||
# Format-specific renderers should override this method
|
||||
return {
|
||||
"content_type": "image",
|
||||
"base64Data": base64Data,
|
||||
"altText": altText,
|
||||
"width": sectionData.get("width", None),
|
||||
"height": sectionData.get("height", None),
|
||||
"caption": sectionData.get("caption", "")
|
||||
}
|
||||
|
||||
def _validateImageData(self, base64Data: str, altText: str) -> bool:
|
||||
"""Validate image data."""
|
||||
if not base64Data:
|
||||
|
|
@ -288,46 +274,6 @@ class BaseRenderer(ABC):
|
|||
"""Check if a section type is valid."""
|
||||
return sectionType in self._getSupportedSectionTypes()
|
||||
|
||||
def _processSectionByType(self, section: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Process a section and return structured data based on its type."""
|
||||
sectionType = self._getSectionType(section)
|
||||
sectionData = self._getSectionData(section)
|
||||
|
||||
if sectionType == "table":
|
||||
headers, rows = self._extractTableData(sectionData)
|
||||
return {"content_type": "table", "headers": headers, "rows": rows}
|
||||
elif sectionType == "bullet_list":
|
||||
items = self._extractBulletListItems(sectionData)
|
||||
return {"content_type": "bullet_list", "items": items}
|
||||
elif sectionType == "heading":
|
||||
level, text = self._extractHeadingData(sectionData)
|
||||
return {"content_type": "heading", "level": level, "text": text}
|
||||
elif sectionType == "paragraph":
|
||||
text = self._extractParagraphText(sectionData)
|
||||
return {"content_type": "paragraph", "text": text}
|
||||
elif sectionType == "code_block":
|
||||
code, language = self._extractCodeBlockData(sectionData)
|
||||
return {"content_type": "code_block", "code": code, "language": language}
|
||||
elif sectionType == "image":
|
||||
base64Data, altText = self._extractImageData(sectionData)
|
||||
# Validate image data
|
||||
if self._validateImageData(base64Data, altText):
|
||||
return {
|
||||
"content_type": "image",
|
||||
"base64Data": base64Data,
|
||||
"altText": altText,
|
||||
"width": sectionData.get("width") if isinstance(sectionData, dict) else None,
|
||||
"height": sectionData.get("height") if isinstance(sectionData, dict) else None,
|
||||
"caption": sectionData.get("caption", "") if isinstance(sectionData, dict) else ""
|
||||
}
|
||||
else:
|
||||
# Return placeholder if image data is invalid
|
||||
return {"content_type": "paragraph", "text": f"[Image: {altText}]"}
|
||||
else:
|
||||
# Fallback to paragraph
|
||||
text = self._extractParagraphText(sectionData)
|
||||
return {"content_type": "paragraph", "text": text}
|
||||
|
||||
def _formatTimestamp(self, timestamp: str = None) -> str:
|
||||
"""Format timestamp for display."""
|
||||
if timestamp:
|
||||
|
|
|
|||
|
|
@ -5,7 +5,8 @@ CSV renderer for report generation.
|
|||
"""
|
||||
|
||||
from .rendererBaseTemplate import BaseRenderer
|
||||
from typing import Dict, Any, Tuple, List
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List
|
||||
|
||||
class RendererCsv(BaseRenderer):
|
||||
"""Renders content to CSV format with format-specific extraction."""
|
||||
|
|
@ -25,13 +26,34 @@ class RendererCsv(BaseRenderer):
|
|||
"""Return priority for CSV renderer."""
|
||||
return 70
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> Tuple[str, str]:
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to CSV format."""
|
||||
try:
|
||||
# Generate CSV directly from JSON (no styling needed for CSV)
|
||||
csvContent = await self._generateCsvFromJson(extractedContent, title)
|
||||
|
||||
return csvContent, "text/csv"
|
||||
# Determine filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
filename = documents[0].get("filename")
|
||||
if not filename:
|
||||
filename = self._determineFilename(title, "text/csv")
|
||||
else:
|
||||
filename = self._determineFilename(title, "text/csv")
|
||||
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=csvContent.encode('utf-8'),
|
||||
mimeType="text/csv",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering CSV: {str(e)}")
|
||||
|
|
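The `render` contract changes here from a `(content, mimeType)` tuple to a list of `RenderedDocument` objects, with the main document first. The sketch below shows how a caller might consume the new return value. The dataclass is a stand-in that models only the keyword arguments passed in this diff; the real class in `modules.datamodels.datamodelDocument` may differ, and `split_rendered` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class RenderedDocument:
    """Stand-in for the real RenderedDocument; only fields seen in the diff."""
    documentData: bytes
    mimeType: str
    filename: str
    documentType: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


def split_rendered(
    docs: List[RenderedDocument],
) -> Tuple[RenderedDocument, List[RenderedDocument]]:
    """Separate the main document (first entry) from supporting files."""
    if not docs:
        raise ValueError("render() must return at least one document")
    return docs[0], docs[1:]


docs = [RenderedDocument(b"a,b\n1,2\n", "text/csv", "report.csv")]
main, extras = split_rendered(docs)
print(main.filename, main.mimeType, len(extras))
```

Callers that previously unpacked `content, mimeType = await renderer.render(...)` would need an adjustment along these lines, since the new return value is always a list, even for a single document.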
@ -41,15 +63,16 @@ class RendererCsv(BaseRenderer):
|
|||
async def _generateCsvFromJson(self, jsonContent: Dict[str, Any], title: str) -> str:
|
||||
"""Generate CSV content from structured JSON document."""
|
||||
try:
|
||||
# Validate JSON structure
|
||||
if not isinstance(jsonContent, dict):
|
||||
raise ValueError("JSON content must be a dictionary")
|
||||
# Validate JSON structure (standardized schema: {metadata: {...}, documents: [{sections: [...]}]})
|
||||
if not self._validateJsonStructure(jsonContent):
|
||||
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
if "sections" not in jsonContent:
|
||||
raise ValueError("JSON content must contain 'sections' field")
|
||||
# Extract sections and metadata from standardized schema
|
||||
sections = self._extractSections(jsonContent)
|
||||
metadata = self._extractMetadata(jsonContent)
|
||||
|
||||
# Use title from JSON metadata if available, otherwise use provided title
|
||||
documentTitle = jsonContent.get("metadata", {}).get("title", title)
|
||||
documentTitle = metadata.get("title", title)
|
||||
|
||||
# Generate CSV content
|
||||
csvRows = []
|
||||
|
|
@ -60,7 +83,6 @@ class RendererCsv(BaseRenderer):
|
|||
csvRows.append([]) # Empty row
|
||||
|
||||
# Process each section in order
|
||||
sections = jsonContent.get("sections", [])
|
||||
for section in sections:
|
||||
sectionCsv = self._renderJsonSectionToCsv(section)
|
||||
if sectionCsv:
|
||||
|
|
@ -114,8 +136,12 @@ class RendererCsv(BaseRenderer):
|
|||
def _renderJsonTableToCsv(self, tableData: Dict[str, Any]) -> List[List[str]]:
|
||||
"""Render a JSON table to CSV rows."""
|
||||
try:
|
||||
headers = tableData.get("headers", [])
|
||||
rows = tableData.get("rows", [])
|
||||
# Extract from nested content structure
|
||||
content = tableData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
headers = content.get("headers", [])
|
||||
rows = content.get("rows", [])
|
||||
|
||||
csvRows = []
|
||||
|
||||
|
|
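The CSV helpers now read their data from a nested `content` object instead of from the element itself. The sketch below illustrates that layout for a table element. Only the `content`, `headers`, and `rows` lookups are taken from the diff; the row-building logic and the name `table_element_to_rows` are assumptions, because the rest of the method is not visible in this hunk.

```python
import csv
import io
from typing import Any, Dict, List


def table_element_to_rows(element: Dict[str, Any]) -> List[List[str]]:
    """Read headers and rows from element["content"], as the new code does."""
    content = element.get("content", {})
    if not isinstance(content, dict):
        return []  # Malformed content: nothing to emit.
    rows: List[List[str]] = []
    headers = content.get("headers", [])
    if headers:
        rows.append([str(h) for h in headers])
    for row in content.get("rows", []):
        rows.append([str(cell) for cell in row])
    return rows


# Illustrative element in the nested layout.
element = {
    "type": "table",
    "content": {"headers": ["name", "qty"], "rows": [["bolt", 4]]},
}
buffer = io.StringIO()
csv.writer(buffer).writerows(table_element_to_rows(element))
print(buffer.getvalue())
```

An element in the old flat layout (`{"headers": [...], "rows": [...]}`) has no `content` key, so it produces no rows under the new lookup rather than raising an error.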
@ -134,7 +160,11 @@ class RendererCsv(BaseRenderer):
|
|||
def _renderJsonListToCsv(self, listData: Dict[str, Any]) -> List[List[str]]:
|
||||
"""Render a JSON list to CSV rows."""
|
||||
try:
|
||||
items = listData.get("items", [])
|
||||
# Extract from nested content structure
|
||||
content = listData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
items = content.get("items", [])
|
||||
csvRows = []
|
||||
|
||||
for item in items:
|
||||
|
|
@ -161,8 +191,12 @@ class RendererCsv(BaseRenderer):
|
|||
def _renderJsonHeadingToCsv(self, headingData: Dict[str, Any]) -> List[List[str]]:
|
||||
"""Render a JSON heading to CSV rows."""
|
||||
try:
|
||||
text = headingData.get("text", "")
|
||||
level = headingData.get("level", 1)
|
||||
# Extract from nested content structure
|
||||
content = headingData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
text = content.get("text", "")
|
||||
level = content.get("level", 1)
|
||||
|
||||
if text:
|
||||
# Use # symbols for heading levels
|
||||
|
|
@ -178,7 +212,14 @@ class RendererCsv(BaseRenderer):
|
|||
def _renderJsonParagraphToCsv(self, paragraphData: Dict[str, Any]) -> List[List[str]]:
|
||||
"""Render a JSON paragraph to CSV rows."""
|
||||
try:
|
||||
text = paragraphData.get("text", "")
|
||||
# Extract from nested content structure
|
||||
content = paragraphData.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
|
||||
if text:
|
||||
# Split long paragraphs into multiple rows if needed
|
||||
|
|
@ -213,8 +254,12 @@ class RendererCsv(BaseRenderer):
|
|||
def _renderJsonCodeToCsv(self, codeData: Dict[str, Any]) -> List[List[str]]:
|
||||
"""Render a JSON code block to CSV rows."""
|
||||
try:
|
||||
code = codeData.get("code", "")
|
||||
language = codeData.get("language", "")
|
||||
# Extract from nested content structure
|
||||
content = codeData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
code = content.get("code", "")
|
||||
language = content.get("language", "")
|
||||
|
||||
csvRows = []
|
||||
|
||||
|
|
|
|||
|
|
@ -5,10 +5,12 @@ DOCX renderer for report generation using python-docx.
|
|||
"""
|
||||
|
||||
from .rendererBaseTemplate import BaseRenderer
|
||||
from typing import Dict, Any, Tuple, List
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List
|
||||
import io
|
||||
import base64
|
||||
import re
|
||||
import csv
|
||||
|
||||
try:
|
||||
from docx import Document
|
||||
|
|
@ -37,7 +39,7 @@ class RendererDocx(BaseRenderer):
|
|||
"""Return priority for DOCX renderer."""
|
||||
return 115
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> Tuple[str, str]:
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to DOCX format using AI-analyzed styling."""
|
||||
self.services.utils.debugLogToFile(f"DOCX RENDER CALLED: title={title}, user_prompt={userPrompt[:50] if userPrompt else 'None'}...", "DOCX_RENDERER")
|
||||
try:
|
||||
|
|
@ -45,18 +47,58 @@ class RendererDocx(BaseRenderer):
|
|||
# Fallback to HTML if python-docx not available
|
||||
from .rendererHtml import RendererHtml
|
||||
htmlRenderer = RendererHtml()
|
||||
htmlContent, _ = await htmlRenderer.render(extractedContent, title)
|
||||
return htmlContent, "text/html"
|
||||
return await htmlRenderer.render(extractedContent, title, userPrompt, aiService)
|
||||
|
||||
# Generate DOCX using AI-analyzed styling
|
||||
docx_content = await self._generateDocxFromJson(extractedContent, title, userPrompt, aiService)
|
||||
|
||||
return docx_content, "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
# Determine filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
filename = documents[0].get("filename")
|
||||
if not filename:
|
||||
filename = self._determineFilename(title, "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
|
||||
else:
|
||||
filename = self._determineFilename(title, "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
|
||||
|
||||
# Convert DOCX content to bytes if it's a string (base64)
|
||||
if isinstance(docx_content, str):
|
||||
try:
|
||||
docx_bytes = base64.b64decode(docx_content)
|
||||
except Exception:
|
||||
docx_bytes = docx_content.encode('utf-8')
|
||||
else:
|
||||
docx_bytes = docx_content
|
||||
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=docx_bytes,
|
||||
mimeType="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering DOCX: {str(e)}")
|
||||
# Return minimal fallback
|
||||
return f"DOCX Generation Error: {str(e)}", "text/plain"
|
||||
fallbackContent = f"DOCX Generation Error: {str(e)}"
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=fallbackContent.encode('utf-8'),
|
||||
mimeType="text/plain",
|
||||
filename=self._determineFilename(title, "text/plain"),
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
async def _generateDocxFromJson(self, json_content: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> str:
|
||||
"""Generate DOCX content from structured JSON document."""
|
||||
|
|
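The DOCX renderer converts string output to bytes by attempting a base64 decode and falling back to UTF-8 encoding. By default, `base64.b64decode` discards characters outside the base64 alphabet rather than rejecting them, so some non-base64 strings can decode without an error. The sketch below is one possible stricter variant using `validate=True`; the function name `to_bytes` is hypothetical and this is not the code in the commit.

```python
import base64
import binascii
from typing import Union


def to_bytes(payload: Union[str, bytes]) -> bytes:
    """Return raw bytes, decoding base64 strings strictly."""
    if isinstance(payload, bytes):
        return payload
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        return base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        # Not valid base64: treat the string as plain text.
        return payload.encode("utf-8")


encoded = base64.b64encode(b"PK\x03\x04").decode("ascii")
print(to_bytes(encoded))
print(to_bytes("not base64!"))
```

This remains a heuristic: a short plain-text string that happens to be valid base64 is still decoded. If `_generateDocxFromJson` always returns base64, the fallback branch would only matter for unexpected output.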
@ -64,29 +106,29 @@ class RendererDocx(BaseRenderer):
|
|||
# Create new document
|
||||
doc = Document()
|
||||
|
||||
# Get style set: default styles, enhanced with AI if style instructions present
|
||||
styleSet = await self._getStyleSet(userPrompt, aiService)
|
||||
# Get style set: use styles from metadata if available, otherwise enhance with AI
|
||||
styleSet = await self._getStyleSet(json_content, userPrompt, aiService)
|
||||
|
||||
# Setup basic document styles and create all styles from style set
|
||||
self._setupBasicDocumentStyles(doc)
|
||||
self._setupDocumentStyles(doc, styleSet)
|
||||
|
||||
# Validate JSON structure
|
||||
if not isinstance(json_content, dict):
|
||||
raise ValueError("JSON content must be a dictionary")
|
||||
# Validate JSON structure (standardized schema: {metadata: {...}, documents: [{sections: [...]}]})
|
||||
if not self._validateJsonStructure(json_content):
|
||||
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
if "sections" not in json_content:
|
||||
raise ValueError("JSON content must contain 'sections' field")
|
||||
# Extract sections and metadata from standardized schema
|
||||
sections = self._extractSections(json_content)
|
||||
metadata = self._extractMetadata(json_content)
|
||||
|
||||
# Use title from JSON metadata if available, otherwise use provided title
|
||||
document_title = json_content.get("metadata", {}).get("title", title)
|
||||
document_title = metadata.get("title", title)
|
||||
|
||||
# Add document title using Title style
|
||||
if document_title:
|
||||
doc.add_paragraph(document_title, style='Title')
|
||||
|
||||
# Process each section in order
|
||||
sections = json_content.get("sections", [])
|
||||
for section in sections:
|
||||
self._renderJsonSection(doc, section, styleSet)
|
||||
|
||||
|
|
@ -105,12 +147,17 @@ class RendererDocx(BaseRenderer):
|
|||
self.logger.error(f"Error generating DOCX from JSON: {str(e)}")
|
||||
raise Exception(f"DOCX generation failed: {str(e)}")
|
||||
|
||||
async def _getStyleSet(self, userPrompt: str = None, aiService=None, templateName: str = None) -> Dict[str, Any]:
|
||||
"""Get style set - default styles, enhanced with AI if userPrompt provided.
|
||||
async def _getStyleSet(self, extractedContent: Dict[str, Any] = None, userPrompt: str = None, aiService=None, templateName: str = None) -> Dict[str, Any]:
|
||||
"""Get style set - use styles from document generation metadata if available,
|
||||
otherwise enhance default styles with AI if userPrompt provided.
|
||||
|
||||
IMPORTANT: In a dynamic scalable AI system, styling should come from document generation,
|
||||
not be generated separately by renderers. Only fall back to AI if styles not provided.
|
||||
|
||||
Args:
|
||||
extractedContent: Document content with metadata (may contain styles)
|
||||
userPrompt: User's prompt (AI will detect style instructions in any language)
|
||||
aiService: AI service (used only if userPrompt provided)
|
||||
aiService: AI service (used only if styles not in metadata and userPrompt provided)
|
||||
templateName: Name of template style set (None = default)
|
||||
|
||||
Returns:
|
||||
|
|
@ -124,10 +171,18 @@ class RendererDocx(BaseRenderer):
|
|||
else:
|
||||
defaultStyleSet = self._getDefaultStyleSet()
|
||||
|
||||
# Enhance with AI if userPrompt provided (AI handles multilingual style detection)
|
||||
# FIRST: Check if styles are provided in document generation metadata (preferred approach)
|
||||
if extractedContent:
|
||||
metadata = extractedContent.get("metadata", {})
|
||||
if isinstance(metadata, dict):
|
||||
styles = metadata.get("styles")
|
||||
if styles and isinstance(styles, dict):
|
||||
self.logger.debug("Using styles from document generation metadata")
|
||||
return self._validateStylesContrast(styles)
|
||||
|
||||
# FALLBACK: Enhance with AI if userPrompt provided (only if styles not in metadata)
|
||||
if userPrompt and aiService:
|
||||
# AI will naturally detect style instructions in any language
|
||||
self.logger.info(f"Enhancing styles with AI based on user prompt...")
|
||||
self.logger.info("Styles not in metadata, enhancing with AI based on user prompt...")
|
||||
enhancedStyleSet = await self._enhanceStylesWithAI(userPrompt, defaultStyleSet, aiService)
|
||||
return self._validateStylesContrast(enhancedStyleSet)
|
||||
else:
|
||||
|
|
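The revised `_getStyleSet` establishes a precedence order: styles from the document metadata first, AI enhancement second, defaults last. The sketch below isolates only that decision so it can be tested without an AI service. The function `choose_style_source` and its return labels are hypothetical; the real method is asynchronous and returns the style set itself.

```python
from typing import Any, Dict, Optional, Tuple


def choose_style_source(
    extracted: Optional[Dict[str, Any]],
    user_prompt: Optional[str],
    has_ai_service: bool,
) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Return (source, styles); styles is None unless taken from metadata."""
    if extracted:
        metadata = extracted.get("metadata", {})
        if isinstance(metadata, dict):
            styles = metadata.get("styles")
            # Only a non-empty dict counts as provided styles.
            if styles and isinstance(styles, dict):
                return "metadata", styles
    # AI enhancement is attempted only when both prompt and service exist.
    if user_prompt and has_ai_service:
        return "ai", None
    return "default", None


print(choose_style_source({"metadata": {"styles": {"title": {}}}}, "make it blue", True))
```

With this ordering, a user prompt containing style instructions has no effect on styling once the metadata carries a `styles` dictionary, which matches the intent stated in the docstring.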
@ -225,28 +280,81 @@ class RendererDocx(BaseRenderer):
|
|||
self.logger.warning(f"Could not clear template content: {str(e)}")
|
||||
|
||||
def _renderJsonSection(self, doc: Document, section: Dict[str, Any], styles: Dict[str, Any]) -> None:
|
||||
"""Render a single JSON section to DOCX using AI-generated styles."""
|
||||
"""Render a single JSON section to DOCX using AI-generated styles.
|
||||
Supports three content formats: reference, object (base64), extracted_text.
|
||||
"""
|
||||
try:
|
||||
section_type = section.get("content_type", "paragraph")
|
||||
elements = section.get("elements", [])
|
||||
|
||||
# If no elements, skip this section (it has no content to render)
|
||||
if not elements:
|
||||
return
|
||||
|
||||
# Process each element in the section
|
||||
for element in elements:
|
||||
if section_type == "table":
|
||||
element_type = element.get("type", "")
|
||||
|
||||
# Support three content formats from Phase 5D
|
||||
if element_type == "reference":
|
||||
# Document reference format
|
||||
doc_ref = element.get("documentReference", "")
|
||||
label = element.get("label", "Reference")
|
||||
para = doc.add_paragraph(f"[Reference: {label}]")
|
||||
para.runs[0].italic = True
|
||||
continue
|
||||
elif element_type == "extracted_text":
|
||||
# Extracted text format - render as paragraph
|
||||
content = element.get("content", "")
|
||||
source = element.get("source", "")
|
||||
if content:
|
||||
para = doc.add_paragraph(content)
|
||||
if source:
|
||||
para.add_run(f" (Source: {source})").italic = True
|
||||
continue
|
||||
|
||||
# Check element type, not section type (elements can have different types than section)
|
||||
if element_type == "table":
|
||||
self._renderJsonTable(doc, element, styles)
|
||||
elif section_type == "bullet_list":
|
||||
elif element_type == "bullet_list":
|
||||
self._renderJsonBulletList(doc, element, styles)
|
||||
elif section_type == "heading":
|
||||
elif element_type == "heading":
|
||||
self._renderJsonHeading(doc, element, styles)
|
||||
elif section_type == "paragraph":
|
||||
elif element_type == "paragraph":
|
||||
self._renderJsonParagraph(doc, element, styles)
|
||||
elif section_type == "code_block":
|
||||
elif element_type == "code_block":
|
||||
self._renderJsonCodeBlock(doc, element, styles)
|
||||
elif section_type == "image":
|
||||
elif element_type == "image":
|
||||
self._renderJsonImage(doc, element, styles)
|
||||
else:
|
||||
# Fallback to paragraph for unknown types
|
||||
self._renderJsonParagraph(doc, element, styles)
|
||||
# Fallback: if element_type not set, use section_type
|
||||
if section_type == "table":
|
||||
self._renderJsonTable(doc, element, styles)
|
||||
elif section_type == "bullet_list":
|
||||
self._renderJsonBulletList(doc, element, styles)
|
||||
elif section_type == "heading":
|
||||
self._renderJsonHeading(doc, element, styles)
|
||||
elif section_type == "paragraph":
|
||||
# CRITICAL: Check if this is actually an image element before rendering as paragraph
|
||||
# Image elements might not have type set, but have base64Data in content
|
||||
content = element.get("content", {})
|
||||
if isinstance(content, dict) and content.get("base64Data"):
|
||||
# This is actually an image, render it as such
|
||||
self._renderJsonImage(doc, element, styles)
|
||||
else:
|
||||
self._renderJsonParagraph(doc, element, styles)
|
||||
elif section_type == "code_block":
|
||||
self._renderJsonCodeBlock(doc, element, styles)
|
||||
elif section_type == "image":
|
||||
self._renderJsonImage(doc, element, styles)
|
||||
else:
|
||||
# Fallback to paragraph for unknown types, but check for image data first
|
||||
content = element.get("content", {})
|
||||
if isinstance(content, dict) and content.get("base64Data"):
|
||||
# This is actually an image, render it as such
|
||||
self._renderJsonImage(doc, element, styles)
|
||||
else:
|
||||
self._renderJsonParagraph(doc, element, styles)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering section {section.get('id', 'unknown')}: {str(e)}")
|
||||
|
|
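The dispatch in `_renderJsonSection` now prefers the element's own `type`, falls back to the section's `content_type`, and treats an untyped element with `base64Data` in its content as an image. The sketch below reduces that logic to a pure function that returns the kind of renderer that would be chosen. It reflects one reading of the interleaved old and new lines in this hunk; `resolve_kind` and `KNOWN_TYPES` are hypothetical names.

```python
from typing import Any, Dict

KNOWN_TYPES = {"table", "bullet_list", "heading", "paragraph", "code_block", "image"}


def resolve_kind(section: Dict[str, Any], element: Dict[str, Any]) -> str:
    """Return which renderer kind the dispatch would select for an element."""
    element_type = element.get("type", "")
    # The two special formats are handled before anything else.
    if element_type in ("reference", "extracted_text"):
        return element_type
    # The element's own type wins over the section type.
    if element_type in KNOWN_TYPES:
        return element_type
    # Fallback: use the section type when the element type is missing or unknown.
    section_type = section.get("content_type", "paragraph")
    if section_type in KNOWN_TYPES and section_type != "paragraph":
        return section_type
    # Paragraph or unknown section type: check for embedded image data first.
    content = element.get("content", {})
    if isinstance(content, dict) and content.get("base64Data"):
        return "image"
    return "paragraph"


print(resolve_kind({"content_type": "paragraph"}, {"type": "table"}))
```

Keeping the decision separate from the python-docx calls makes the fallback rules easy to unit test, which may be useful given how many branches the method now has.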
@ -256,8 +364,12 @@ class RendererDocx(BaseRenderer):
|
|||
def _renderJsonTable(self, doc: Document, table_data: Dict[str, Any], styles: Dict[str, Any]) -> None:
|
||||
"""Render a JSON table to DOCX using AI-generated styles."""
|
||||
try:
|
||||
headers = table_data.get("headers", [])
|
||||
rows = table_data.get("rows", [])
|
||||
# Extract from nested content structure
|
||||
content = table_data.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return
|
||||
headers = content.get("headers", [])
|
||||
rows = content.get("rows", [])
|
||||
|
||||
if not headers or not rows:
|
||||
return
|
||||
|
|
@ -412,14 +524,27 @@ class RendererDocx(BaseRenderer):
|
|||
def _renderJsonBulletList(self, doc: Document, list_data: Dict[str, Any], styles: Dict[str, Any]) -> None:
|
||||
"""Render a JSON bullet list to DOCX using AI-generated styles."""
|
||||
try:
|
||||
items = list_data.get("items", [])
|
||||
bullet_style = styles["bullet_list"]
|
||||
# Extract from nested content structure
|
||||
content = list_data.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return
|
||||
items = content.get("items", [])
|
||||
bullet_style = styles.get("bullet_list", {})
|
||||
|
||||
for item in items:
|
||||
if isinstance(item, str):
|
||||
para = doc.add_paragraph(item, style='List Bullet')
|
||||
elif isinstance(item, dict) and "text" in item:
|
||||
para = doc.add_paragraph(item["text"], style='List Bullet')
|
||||
|
||||
# Apply bullet list styling from style set
|
||||
if bullet_style and para.runs:
|
||||
for run in para.runs:
|
||||
if "font_size" in bullet_style:
|
||||
run.font.size = Pt(bullet_style["font_size"])
|
||||
if "color" in bullet_style:
|
||||
color_hex = bullet_style["color"].lstrip('#')
|
||||
run.font.color.rgb = RGBColor(int(color_hex[0:2], 16), int(color_hex[2:4], 16), int(color_hex[4:6], 16))
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering bullet list: {str(e)}")
|
||||
|
|
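The bullet list styling parses a `#RRGGBB` string by slicing it into three two-digit hex values. The sketch below shows the same conversion as a small helper with an explicit length check. The name `hex_to_rgb` is hypothetical, and it returns a plain tuple instead of a python-docx `RGBColor` so that it runs without that dependency.

```python
from typing import Tuple


def hex_to_rgb(color: str) -> Tuple[int, int, int]:
    """Parse '#RRGGBB' or 'RRGGBB' into an (r, g, b) tuple."""
    digits = color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {color!r}")
    # int(..., 16) raises ValueError on non-hex characters.
    r, g, b = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return r, g, b


print(hex_to_rgb("#1F4E79"))
```

In the diff, a short form such as `#fff` would raise inside the item loop and be caught by the outer `except`, which appears to stop rendering the remaining list items. Validating the color once before the loop is one way to avoid that.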
@ -427,12 +552,22 @@ class RendererDocx(BaseRenderer):
|
|||
def _renderJsonHeading(self, doc: Document, heading_data: Dict[str, Any], styles: Dict[str, Any]) -> None:
|
||||
"""Render a JSON heading to DOCX using AI-generated styles."""
|
||||
try:
|
||||
level = heading_data.get("level", 1)
|
||||
text = heading_data.get("text", "")
|
||||
# Extract from nested content structure
|
||||
content = heading_data.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return
|
||||
text = content.get("text", "")
|
||||
level = content.get("level", 1)
|
||||
|
||||
if text:
|
||||
level = max(1, min(6, level))
|
||||
doc.add_heading(text, level=level)
|
||||
# Use custom heading style if available, otherwise use built-in
|
||||
style_name = f"Heading {level}" if level <= 2 else "Heading 1"
|
||||
try:
|
||||
para = doc.add_paragraph(text, style=style_name)
|
||||
except KeyError:
|
||||
# Fallback to built-in heading if custom style doesn't exist
|
||||
doc.add_heading(text, level=level)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering heading: {str(e)}")
|
||||
|
|
@@ -440,10 +575,47 @@ class RendererDocx(BaseRenderer):
|
|||
def _renderJsonParagraph(self, doc: Document, paragraph_data: Dict[str, Any], styles: Dict[str, Any]) -> None:
|
||||
"""Render a JSON paragraph to DOCX using AI-generated styles."""
|
||||
try:
|
||||
text = paragraph_data.get("text", "")
|
||||
# Extract from nested content structure
|
||||
content = paragraph_data.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
|
||||
# CRITICAL: Prevent rendering base64 image data as text
|
||||
# Base64 image data typically starts with /9j/ (JPEG) or iVBORw0KGgo (PNG)
|
||||
if text and (text.startswith("/9j/") or text.startswith("iVBORw0KGgo") or
|
||||
(len(text) > 100 and all(c in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=" for c in text[:100]))):
|
||||
# This looks like base64 data - don't render as text
|
||||
self.logger.warning(f"Skipping rendering of what appears to be base64 data in paragraph (length: {len(text)})")
|
||||
para = doc.add_paragraph("[Error: Image data found in text content - image embedding may have failed]")
|
||||
if para.runs:
|
||||
para.runs[0].font.color.rgb = RGBColor(255, 0, 0) # Red color for error
|
||||
return
|
||||
|
||||
if text:
|
||||
para = doc.add_paragraph(text)
|
||||
# Apply paragraph styling from style set
|
||||
paragraph_style = styles.get("paragraph", {})
|
||||
if paragraph_style:
|
||||
for run in para.runs:
|
||||
if "font_size" in paragraph_style:
|
||||
run.font.size = Pt(paragraph_style["font_size"])
|
||||
if "bold" in paragraph_style:
|
||||
run.font.bold = paragraph_style["bold"]
|
||||
if "color" in paragraph_style:
|
||||
color_hex = paragraph_style["color"].lstrip('#')
|
||||
run.font.color.rgb = RGBColor(int(color_hex[0:2], 16), int(color_hex[2:4], 16), int(color_hex[4:6], 16))
|
||||
if "align" in paragraph_style:
|
||||
align = paragraph_style["align"]
|
||||
if align == "center":
|
||||
para.alignment = WD_ALIGN_PARAGRAPH.CENTER
|
||||
elif align == "right":
|
||||
para.alignment = WD_ALIGN_PARAGRAPH.RIGHT
|
||||
else:
|
||||
para.alignment = WD_ALIGN_PARAGRAPH.LEFT
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering paragraph: {str(e)}")
|
||||
|
|
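The paragraph renderer above guards against base64 image data leaking into text by checking two known prefixes and the alphabet of the first 100 characters. The same heuristic, isolated as a standalone function, might look like the sketch below; the function name and the `probe_length` parameter are assumptions made for illustration.

```python
import base64

BASE64_ALPHABET = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
)


def looks_like_base64_image(text: str, probe_length: int = 100) -> bool:
    """Heuristic check mirroring the condition in the diff (hypothetical name)."""
    if not text:
        return False
    # "/9j/" is how JPEG data begins in base64, "iVBORw0KGgo" is the PNG signature.
    if text.startswith("/9j/") or text.startswith("iVBORw0KGgo"):
        return True
    # Otherwise: long text whose first characters are all base64 symbols.
    return len(text) > probe_length and all(
        c in BASE64_ALPHABET for c in text[:probe_length]
    )


jpeg_like = base64.b64encode(b"\xff\xd8\xff\xe0").decode("ascii")
```

Note that the alphabet check is only a heuristic: any string longer than 100 characters that begins with 100 letters or digits and no spaces also matches, so unusually long unbroken tokens would be treated as image data.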
@@ -451,18 +623,27 @@ class RendererDocx(BaseRenderer):
|
|||
def _renderJsonCodeBlock(self, doc: Document, code_data: Dict[str, Any], styles: Dict[str, Any]) -> None:
|
||||
"""Render a JSON code block to DOCX using AI-generated styles."""
|
||||
try:
|
||||
code = code_data.get("code", "")
|
||||
language = code_data.get("language", "")
|
||||
# Extract from nested content structure
|
||||
content = code_data.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return
|
||||
code = content.get("code", "")
|
||||
language = content.get("language", "")
|
||||
code_style = styles.get("code_block", {})
|
||||
|
||||
if code:
|
||||
if language:
|
||||
lang_para = doc.add_paragraph(f"Code ({language}):")
|
||||
lang_para.runs[0].bold = True
|
||||
if lang_para.runs:
|
||||
lang_para.runs[0].bold = True
|
||||
|
||||
code_para = doc.add_paragraph(code)
|
||||
for run in code_para.runs:
|
||||
run.font.name = 'Courier New'
|
||||
run.font.size = Pt(10)
|
||||
run.font.name = code_style.get("font", "Courier New")
|
||||
run.font.size = Pt(code_style.get("font_size", 9))
|
||||
if "color" in code_style:
|
||||
color_hex = code_style["color"].lstrip('#')
|
||||
run.font.color.rgb = RGBColor(int(color_hex[0:2], 16), int(color_hex[2:4], 16), int(color_hex[4:6], 16))
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering code block: {str(e)}")
|
||||
|
|
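Several hunks repeat the same expression to turn a `#RRGGBB` string into three integers for `RGBColor`. A small sketch of that conversion as a reusable function is shown here; the name `parse_hex_color` and the explicit length check are assumptions, since the diff inlines the slicing without validation.

```python
from typing import Tuple


def parse_hex_color(value: str) -> Tuple[int, int, int]:
    """Convert '#RRGGBB' or 'RRGGBB' into an (r, g, b) tuple of ints.

    Hypothetical helper; the length check is an addition not present in the diff.
    """
    color_hex = value.lstrip("#")
    if len(color_hex) != 6:
        raise ValueError(f"Expected 6 hex digits, got {value!r}")
    return (
        int(color_hex[0:2], 16),
        int(color_hex[2:4], 16),
        int(color_hex[4:6], 16),
    )


red = parse_hex_color("#FF0000")
```

The resulting tuple can be unpacked into a color constructor, for example `RGBColor(*parse_hex_color(style["color"]))`.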
@@ -470,20 +651,80 @@ class RendererDocx(BaseRenderer):
|
|||
def _renderJsonImage(self, doc: Document, image_data: Dict[str, Any], styles: Dict[str, Any]) -> None:
|
||||
"""Render a JSON image to DOCX."""
|
||||
try:
|
||||
base64_data = image_data.get("base64Data", "")
|
||||
alt_text = image_data.get("altText", "Image")
|
||||
# Extract from nested content structure
|
||||
content = image_data.get("content", {})
|
||||
base64_data = ""
|
||||
alt_text = "Image"
|
||||
|
||||
if base64_data:
|
||||
if isinstance(content, dict):
|
||||
base64_data = content.get("base64Data", "")
|
||||
alt_text = content.get("altText", "Image")
|
||||
elif isinstance(content, str):
|
||||
# Content might be base64 string directly (shouldn't happen, but handle it)
|
||||
self.logger.warning("Image content is a string, not a dict. This should not happen.")
|
||||
return
|
||||
|
||||
# If base64Data not found in content, try direct element fields (fallback)
|
||||
if not base64_data:
|
||||
base64_data = image_data.get("base64Data", "")
|
||||
if not alt_text or alt_text == "Image":
|
||||
alt_text = image_data.get("altText", "Image")
|
||||
|
||||
# CRITICAL: Ensure we don't render base64 data as text
|
||||
# If base64_data looks like it might be rendered elsewhere, skip it
|
||||
if not base64_data:
|
||||
raise Exception("No image data provided (base64Data is empty)")
|
||||
|
||||
try:
|
||||
image_bytes = base64.b64decode(base64_data)
|
||||
doc.add_picture(io.BytesIO(image_bytes), width=Inches(4))
|
||||
image_stream = io.BytesIO(image_bytes)
|
||||
|
||||
if alt_text:
|
||||
# Get image dimensions to calculate proper size
|
||||
try:
|
||||
from PIL import Image as PILImage
|
||||
pil_image = PILImage.open(image_stream)
|
||||
img_width_px, img_height_px = pil_image.size
|
||||
|
||||
# DOCX page width is typically 8.5 inches, usable width ~6.5 inches with margins
|
||||
# Standard margins: 1 inch left/right, so usable width = 6.5 inches
|
||||
max_width_inches = 6.5
|
||||
max_height_inches = 9.0 # Leave room for text above/below
|
||||
|
||||
# Calculate scale factor to fit within page dimensions
|
||||
# Convert pixels to inches (assuming 96 DPI for modern displays, but images may vary)
|
||||
# Use conservative estimate: 1 inch = 96 pixels
|
||||
img_width_inches = img_width_px / 96.0
|
||||
img_height_inches = img_height_px / 96.0
|
||||
|
||||
# Calculate scale to fit
|
||||
width_scale = max_width_inches / img_width_inches if img_width_inches > max_width_inches else 1.0
|
||||
height_scale = max_height_inches / img_height_inches if img_height_inches > max_height_inches else 1.0
|
||||
scale = min(width_scale, height_scale, 1.0) # Don't scale up, only down
|
||||
|
||||
final_width = img_width_inches * scale
|
||||
final_height = img_height_inches * scale
|
||||
|
||||
# Reset stream for docx
|
||||
image_stream.seek(0)
|
||||
doc.add_picture(image_stream, width=Inches(final_width))
|
||||
except Exception:
|
||||
# Fallback: use conservative default size if PIL fails
|
||||
image_stream.seek(0)
|
||||
doc.add_picture(image_stream, width=Inches(6.0))
|
||||
|
||||
if alt_text and alt_text != "Image":
|
||||
caption_para = doc.add_paragraph(f"Figure: {alt_text}")
|
||||
caption_para.runs[0].italic = True
|
||||
except Exception as embedError:
|
||||
# Image decoding or embedding failed
|
||||
raise Exception(f"Failed to decode or embed image: {str(embedError)}")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering image: {str(e)}")
|
||||
doc.add_paragraph(f"[Image: {image_data.get('altText', 'Image')}]")
|
||||
self.logger.error(f"Error embedding image in DOCX: {str(e)}")
|
||||
errorMsg = f"[Error: Could not embed image '{image_data.get('altText', 'Image')}'. {str(e)}]"
|
||||
errorPara = doc.add_paragraph(errorMsg)
|
||||
if errorPara.runs:
|
||||
errorPara.runs[0].font.color.rgb = RGBColor(255, 0, 0) # Red color for error
|
||||
|
||||
def _extractStructureFromPrompt(self, userPrompt: str, title: str) -> Dict[str, Any]:
|
||||
"""Extract document structure from user prompt."""
|
||||
|
|
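The image hunk above scales a picture down so that it fits a 6.5 by 9.0 inch area, assuming 96 pixels per inch, and never scales it up. The arithmetic can be sketched on its own as follows; the function name and parameter names are hypothetical, while the default values are taken from the diff.

```python
from typing import Tuple


def fit_image_inches(
    width_px: int,
    height_px: int,
    max_width_in: float = 6.5,
    max_height_in: float = 9.0,
    dpi: float = 96.0,
) -> Tuple[float, float]:
    """Return (width, height) in inches after shrinking to fit the page area."""
    width_in = width_px / dpi
    height_in = height_px / dpi
    # A scale factor is only computed for a dimension that overflows.
    width_scale = max_width_in / width_in if width_in > max_width_in else 1.0
    height_scale = max_height_in / height_in if height_in > max_height_in else 1.0
    scale = min(width_scale, height_scale, 1.0)  # only shrink, never enlarge
    return width_in * scale, height_in * scale


wide = fit_image_inches(960, 480)    # 10 x 5 inches before scaling
tall = fit_image_inches(960, 1920)   # 10 x 20 inches before scaling
small = fit_image_inches(192, 96)    # already fits
```

Taking the smaller of the two scale factors preserves the aspect ratio, which is why the diff passes only the final width to `doc.add_picture`.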
@@ -649,7 +890,11 @@ class RendererDocx(BaseRenderer):
|
|||
if "heading2" in styleSet:
|
||||
self._createStyle(doc, "Heading 2", styleSet["heading2"], WD_STYLE_TYPE.PARAGRAPH)
|
||||
|
||||
# Note: List Bullet and List Number are built-in Word styles, no need to create
|
||||
# Create Paragraph style
|
||||
if "paragraph" in styleSet:
|
||||
self._createStyle(doc, "Custom Paragraph", styleSet["paragraph"], WD_STYLE_TYPE.PARAGRAPH)
|
||||
|
||||
# Note: List Bullet and List Number are built-in Word styles, but we apply custom styling to runs
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Could not set up document styles: {str(e)}")
|
||||
|
|
@@ -848,7 +1093,7 @@ class RendererDocx(BaseRenderer):
|
|||
Process tables in the content (both CSV and pipe-separated) and convert them to Word tables.
|
||||
Returns the content with tables replaced by placeholders.
|
||||
"""
|
||||
import csv
|
||||
# csv is already imported at module level
|
||||
|
||||
lines = content.split('\n')
|
||||
processed_lines = []
|
||||
|
|
|
|||
|
|
@@ -5,7 +5,8 @@ HTML renderer for report generation.
|
|||
"""
|
||||
|
||||
from .rendererBaseTemplate import BaseRenderer
|
||||
from typing import Dict, Any, Tuple, List
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List
|
||||
|
||||
class RendererHtml(BaseRenderer):
|
||||
"""Renders content to HTML format with format-specific extraction."""
|
||||
|
|
@@ -25,34 +26,89 @@ class RendererHtml(BaseRenderer):
|
|||
"""Return priority for HTML renderer."""
|
||||
return 100
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> Tuple[str, str]:
|
||||
"""Render extracted JSON content to HTML format using AI-analyzed styling."""
|
||||
try:
|
||||
# Generate HTML using AI-analyzed styling
|
||||
htmlContent = await self._generateHtmlFromJson(extractedContent, title, userPrompt, aiService)
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""
|
||||
Render HTML document with images as separate files.
|
||||
Returns list of documents: [HTML document, image1, image2, ...]
|
||||
"""
|
||||
import base64
|
||||
|
||||
# Extract images first
|
||||
images = self._extractImages(extractedContent)
|
||||
|
||||
# Store images in instance for later retrieval
|
||||
self._renderedImages = images
|
||||
|
||||
# Generate HTML using AI-analyzed styling
|
||||
htmlContent = await self._generateHtmlFromJson(extractedContent, title, userPrompt, aiService)
|
||||
|
||||
# Replace base64 data URIs with relative file paths if images exist
|
||||
if images:
|
||||
htmlContent = self._replaceImageDataUris(htmlContent, images)
|
||||
|
||||
# Determine HTML filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
htmlFilename = documents[0].get("filename")
|
||||
if not htmlFilename:
|
||||
htmlFilename = self._determineFilename(title, "text/html")
|
||||
else:
|
||||
htmlFilename = self._determineFilename(title, "text/html")
|
||||
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
# Start with HTML document
|
||||
resultDocuments = [
|
||||
RenderedDocument(
|
||||
documentData=htmlContent.encode('utf-8'),
|
||||
mimeType="text/html",
|
||||
filename=htmlFilename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
# Add images as separate documents
|
||||
for img in images:
|
||||
base64Data = img.get("base64Data", "")
|
||||
filename = img.get("filename", f"image_{len(resultDocuments)}.png")
|
||||
mimeType = img.get("mimeType", "image/png")
|
||||
|
||||
return htmlContent, "text/html"
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering HTML: {str(e)}")
|
||||
# Return minimal HTML fallback
|
||||
return f"<html><head><title>{title}</title></head><body><h1>{title}</h1><p>Error rendering report: {str(e)}</p></body></html>", "text/html"
|
||||
if base64Data:
|
||||
try:
|
||||
# Decode base64 to bytes
|
||||
imageBytes = base64.b64decode(base64Data)
|
||||
resultDocuments.append(
|
||||
RenderedDocument(
|
||||
documentData=imageBytes,
|
||||
mimeType=mimeType,
|
||||
filename=filename
|
||||
)
|
||||
)
|
||||
self.logger.debug(f"Added image file: {filename} ({len(imageBytes)} bytes)")
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error creating image file {filename}: {str(e)}")
|
||||
|
||||
return resultDocuments
|
||||
|
||||
async def _generateHtmlFromJson(self, jsonContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> str:
|
||||
"""Generate HTML content from structured JSON document using AI-generated styling."""
|
||||
try:
|
||||
# Get style set: default styles, enhanced with AI if userPrompt provided
|
||||
styles = await self._getStyleSet(userPrompt, aiService)
|
||||
# Get style set: use styles from metadata if available, otherwise enhance with AI
|
||||
styles = await self._getStyleSet(jsonContent, userPrompt, aiService)
|
||||
|
||||
# Validate JSON structure
|
||||
if not isinstance(jsonContent, dict):
|
||||
raise ValueError("JSON content must be a dictionary")
|
||||
if not self._validateJsonStructure(jsonContent):
|
||||
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
if "sections" not in jsonContent:
|
||||
raise ValueError("JSON content must contain 'sections' field")
|
||||
# Extract sections and metadata from standardized schema
|
||||
sections = self._extractSections(jsonContent)
|
||||
metadata = self._extractMetadata(jsonContent)
|
||||
|
||||
# Use title from JSON metadata if available, otherwise use provided title
|
||||
documentTitle = jsonContent.get("metadata", {}).get("title", title)
|
||||
documentTitle = metadata.get("title", title)
|
||||
|
||||
# Build HTML document
|
||||
htmlParts = []
|
||||
|
|
@@ -77,7 +133,6 @@ class RendererHtml(BaseRenderer):
|
|||
htmlParts.append('<main>')
|
||||
|
||||
# Process each section
|
||||
sections = jsonContent.get("sections", [])
|
||||
for section in sections:
|
||||
sectionHtml = self._renderJsonSection(section, styles)
|
||||
if sectionHtml:
|
||||
|
|
@@ -99,12 +154,17 @@ class RendererHtml(BaseRenderer):
|
|||
self.logger.error(f"Error generating HTML from JSON: {str(e)}")
|
||||
raise Exception(f"HTML generation failed: {str(e)}")
|
||||
|
||||
async def _getStyleSet(self, userPrompt: str = None, aiService=None, templateName: str = None) -> Dict[str, Any]:
|
||||
"""Get style set - default styles, enhanced with AI if userPrompt provided.
|
||||
async def _getStyleSet(self, extractedContent: Dict[str, Any] = None, userPrompt: str = None, aiService=None, templateName: str = None) -> Dict[str, Any]:
|
||||
"""Get style set - use styles from document generation metadata if available,
|
||||
otherwise enhance default styles with AI if userPrompt provided.
|
||||
|
||||
IMPORTANT: In a dynamic scalable AI system, styling should come from document generation,
|
||||
not be generated separately by renderers. Only fall back to AI if styles not provided.
|
||||
|
||||
Args:
|
||||
extractedContent: Document content with metadata (may contain styles)
|
||||
userPrompt: User's prompt (AI will detect style instructions in any language)
|
||||
aiService: AI service (used only if userPrompt provided)
|
||||
aiService: AI service (used only if styles not in metadata and userPrompt provided)
|
||||
templateName: Name of template style set (None = default)
|
||||
|
||||
Returns:
|
||||
|
|
@@ -113,10 +173,18 @@ class RendererHtml(BaseRenderer):
|
|||
# Get default style set
|
||||
defaultStyleSet = self._getDefaultStyleSet()
|
||||
|
||||
# Enhance with AI if userPrompt provided (AI handles multilingual style detection)
|
||||
# FIRST: Check if styles are provided in document generation metadata (preferred approach)
|
||||
if extractedContent:
|
||||
metadata = extractedContent.get("metadata", {})
|
||||
if isinstance(metadata, dict):
|
||||
styles = metadata.get("styles")
|
||||
if styles and isinstance(styles, dict):
|
||||
self.logger.debug("Using styles from document generation metadata")
|
||||
return self._validateStylesContrast(styles)
|
||||
|
||||
# FALLBACK: Enhance with AI if userPrompt provided (only if styles not in metadata)
|
||||
if userPrompt and aiService:
|
||||
# AI will naturally detect style instructions in any language
|
||||
self.logger.info(f"Enhancing styles with AI based on user prompt...")
|
||||
self.logger.info(f"Styles not in metadata, enhancing with AI based on user prompt...")
|
||||
enhancedStyleSet = await self._enhanceStylesWithAI(userPrompt, defaultStyleSet, aiService)
|
||||
return self._validateStylesContrast(enhancedStyleSet)
|
||||
else:
|
||||
|
|
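The `_getStyleSet` change above establishes an order of precedence: styles found in the document metadata win, AI enhancement is only a fallback, and the defaults apply otherwise. A synchronous sketch of that ordering is given below. The function name, the `enhance` callable, and the final return of the defaults are assumptions (the `else` branch is not visible in the hunk), and the contrast validation step from the diff is omitted.

```python
from typing import Any, Callable, Dict, Optional

StyleSet = Dict[str, Any]


def resolve_styles(
    extracted_content: Optional[Dict[str, Any]],
    default_styles: StyleSet,
    enhance: Optional[Callable[[str, StyleSet], StyleSet]] = None,
    user_prompt: Optional[str] = None,
) -> StyleSet:
    """Pick a style set using the precedence described in the diff (sketch)."""
    # 1. Preferred: styles supplied by document generation metadata.
    if extracted_content:
        metadata = extracted_content.get("metadata", {})
        if isinstance(metadata, dict):
            styles = metadata.get("styles")
            if styles and isinstance(styles, dict):
                return styles
    # 2. Fallback: enhance the defaults when a prompt and an enhancer exist.
    if user_prompt and enhance:
        return enhance(user_prompt, default_styles)
    # 3. Assumed last resort: the unmodified defaults.
    return default_styles


defaults = {"paragraph": {"font_size": 11}}
from_metadata = resolve_styles({"metadata": {"styles": {"paragraph": {"font_size": 9}}}}, defaults)
```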
@@ -286,32 +354,102 @@ class RendererHtml(BaseRenderer):
|
|||
return '\n'.join(css_parts)
|
||||
|
||||
def _renderJsonSection(self, section: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a single JSON section to HTML using AI-generated styles."""
|
||||
"""Render a single JSON section to HTML using AI-generated styles.
|
||||
Supports three content formats: reference, object (base64), extracted_text.
|
||||
IMPORTANT: Respects sectionType (content_type) for correct rendering logic.
|
||||
"""
|
||||
try:
|
||||
sectionType = self._getSectionType(section)
|
||||
sectionData = self._getSectionData(section)
|
||||
|
||||
# IMPORTANT: Respect sectionType (content_type) FIRST, then process elements accordingly
|
||||
# Process elements according to section's content_type, not just element types
|
||||
|
||||
if sectionType == "table":
|
||||
# Process the section data to extract table structure
|
||||
processedData = self._processSectionByType(section)
|
||||
return self._renderJsonTable(processedData, styles)
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonTable(element, styles)
|
||||
return ""
|
||||
elif sectionType == "bullet_list":
|
||||
# Process the section data to extract bullet list structure
|
||||
processedData = self._processSectionByType(section)
|
||||
return self._renderJsonBulletList(processedData, styles)
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonBulletList(element, styles)
|
||||
return ""
|
||||
elif sectionType == "heading":
|
||||
return self._renderJsonHeading(sectionData, styles)
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonHeading(element, styles)
|
||||
return ""
|
||||
elif sectionType == "paragraph":
|
||||
# Process paragraph elements, including extracted_text
|
||||
if isinstance(sectionData, list):
|
||||
htmlParts = []
|
||||
for element in sectionData:
|
||||
element_type = element.get("type", "") if isinstance(element, dict) else ""
|
||||
|
||||
if element_type == "reference":
|
||||
doc_ref = element.get("documentReference", "")
|
||||
label = element.get("label", "Reference")
|
||||
htmlParts.append(f'<p class="reference"><em>[Reference: {label}]</em></p>')
|
||||
elif element_type == "extracted_text":
|
||||
content = element.get("content", "")
|
||||
source = element.get("source", "")
|
||||
if content:
|
||||
source_text = f' <small><em>(Source: {source})</em></small>' if source else ''
|
||||
htmlParts.append(f'<p>{content}{source_text}</p>')
|
||||
elif isinstance(element, dict):
|
||||
# Regular paragraph element - extract from nested content structure (standard JSON format)
|
||||
content = element.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
|
||||
if text:
|
||||
htmlParts.append(f'<p>{text}</p>')
|
||||
elif isinstance(element, str):
|
||||
htmlParts.append(f'<p>{element}</p>')
|
||||
|
||||
if htmlParts:
|
||||
return '\n'.join(htmlParts)
|
||||
return self._renderJsonParagraph(sectionData, styles)
|
||||
elif sectionType == "code_block":
|
||||
# Process the section data to extract code block structure
|
||||
processedData = self._processSectionByType(section)
|
||||
return self._renderJsonCodeBlock(processedData, styles)
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonCodeBlock(element, styles)
|
||||
return ""
|
||||
elif sectionType == "image":
|
||||
# Process the section data to extract image structure
|
||||
processedData = self._processSectionByType(section)
|
||||
return self._renderJsonImage(processedData, styles)
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonImage(element, styles)
|
||||
return ""
|
||||
else:
|
||||
# Fallback: Check for special element types first
|
||||
if isinstance(sectionData, list):
|
||||
htmlParts = []
|
||||
for element in sectionData:
|
||||
element_type = element.get("type", "") if isinstance(element, dict) else ""
|
||||
|
||||
if element_type == "reference":
|
||||
doc_ref = element.get("documentReference", "")
|
||||
label = element.get("label", "Reference")
|
||||
htmlParts.append(f'<p class="reference"><em>[Reference: {label}]</em></p>')
|
||||
elif element_type == "extracted_text":
|
||||
content = element.get("content", "")
|
||||
source = element.get("source", "")
|
||||
if content:
|
||||
source_text = f' <small><em>(Source: {source})</em></small>' if source else ''
|
||||
htmlParts.append(f'<p>{content}{source_text}</p>')
|
||||
|
||||
if htmlParts:
|
||||
return '\n'.join(htmlParts)
|
||||
# Fallback to paragraph for unknown types
|
||||
return self._renderJsonParagraph(sectionData, styles)
|
||||
|
||||
|
|
@@ -322,8 +460,12 @@ class RendererHtml(BaseRenderer):
|
|||
def _renderJsonTable(self, tableData: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a JSON table to HTML using AI-generated styles."""
|
||||
try:
|
||||
headers = tableData.get("headers", [])
|
||||
rows = tableData.get("rows", [])
|
||||
# Extract from nested content structure: element.content.{headers, rows}
|
||||
content = tableData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
headers = content.get("headers", [])
|
||||
rows = content.get("rows", [])
|
||||
|
||||
if not headers or not rows:
|
||||
return ""
|
||||
|
|
@@ -355,7 +497,11 @@ class RendererHtml(BaseRenderer):
|
|||
def _renderJsonBulletList(self, listData: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a JSON bullet list to HTML using AI-generated styles."""
|
||||
try:
|
||||
items = listData.get("items", [])
|
||||
# Extract from nested content structure: element.content.{items}
|
||||
content = listData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
items = content.get("items", [])
|
||||
|
||||
if not items:
|
||||
return ""
|
||||
|
|
@@ -377,17 +523,12 @@ class RendererHtml(BaseRenderer):
|
|||
def _renderJsonHeading(self, headingData: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a JSON heading to HTML using AI-generated styles."""
|
||||
try:
|
||||
# Normalize non-dict inputs
|
||||
if isinstance(headingData, str):
|
||||
headingData = {"text": headingData, "level": 2}
|
||||
elif isinstance(headingData, list):
|
||||
# Render a list as bullet list under a default heading label
|
||||
return self._renderJsonBulletList({"items": headingData}, styles)
|
||||
elif not isinstance(headingData, dict):
|
||||
# Extract from nested content structure: element.content.{text, level}
|
||||
content = headingData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
|
||||
level = headingData.get("level", 1)
|
||||
text = headingData.get("text", "")
|
||||
text = content.get("text", "")
|
||||
level = content.get("level", 1)
|
||||
|
||||
if text:
|
||||
level = max(1, min(6, level))
|
||||
|
|
@@ -402,21 +543,44 @@ class RendererHtml(BaseRenderer):
|
|||
def _renderJsonParagraph(self, paragraphData: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a JSON paragraph to HTML using AI-generated styles."""
|
||||
try:
|
||||
# Normalize non-dict inputs
|
||||
if isinstance(paragraphData, str):
|
||||
paragraphData = {"text": paragraphData}
|
||||
elif isinstance(paragraphData, list):
|
||||
# Treat list as bullet list paragraph
|
||||
return self._renderJsonBulletList({"items": paragraphData}, styles)
|
||||
elif not isinstance(paragraphData, dict):
|
||||
# Normalize inputs - paragraphData is typically a list of elements from _getSectionData
|
||||
if isinstance(paragraphData, list):
|
||||
# Extract text from all paragraph elements (expects nested content structure)
|
||||
texts = []
|
||||
for el in paragraphData:
|
||||
if isinstance(el, dict):
|
||||
content = el.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
if text:
|
||||
texts.append(text)
|
||||
elif isinstance(el, str):
|
||||
texts.append(el)
|
||||
if texts:
|
||||
# Join multiple paragraphs with <p> tags
|
||||
return '\n'.join(f'<p>{text}</p>' for text in texts)
|
||||
return ""
|
||||
elif isinstance(paragraphData, str):
|
||||
return f'<p>{paragraphData}</p>'
|
||||
elif isinstance(paragraphData, dict):
|
||||
# Handle nested content structure: element.content vs element.text
|
||||
# Extract from nested content structure
|
||||
content = paragraphData.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
if text:
|
||||
return f'<p>{text}</p>'
|
||||
return ""
|
||||
else:
|
||||
return ""
|
||||
|
||||
text = paragraphData.get("text", "")
|
||||
|
||||
if text:
|
||||
return f'<p>{text}</p>'
|
||||
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering paragraph: {str(e)}")
|
||||
|
|
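The paragraph hunk above interpolates text directly into `<p>` tags, whereas the image renderer later in this diff escapes `altText` and `caption` with `html.escape`. The sketch below shows how the same escaping could be applied to paragraph text. It is an illustration only, not part of the diff, and the function name is hypothetical.

```python
import html


def paragraph_html(text: str) -> str:
    """Wrap text in <p> tags, escaping HTML special characters first."""
    if not text:
        return ""
    return f"<p>{html.escape(text)}</p>"


rendered = paragraph_html("a < b & c")
```

Whether escaping is desirable here depends on whether upstream content is expected to carry intentional inline HTML, which the diff does not state.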
@@ -425,8 +589,12 @@ class RendererHtml(BaseRenderer):
|
|||
def _renderJsonCodeBlock(self, codeData: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a JSON code block to HTML using AI-generated styles."""
|
||||
try:
|
||||
code = codeData.get("code", "")
|
||||
language = codeData.get("language", "")
|
||||
# Extract from nested content structure: element.content.{code, language}
|
||||
content = codeData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
code = content.get("code", "")
|
||||
language = content.get("language", "")
|
||||
|
||||
if code:
|
||||
if language:
|
||||
|
|
@@ -441,16 +609,213 @@ class RendererHtml(BaseRenderer):
|
|||
return ""
|
||||
|
||||
def _renderJsonImage(self, imageData: Dict[str, Any], styles: Dict[str, Any]) -> str:
|
||||
"""Render a JSON image to HTML."""
|
||||
"""Render a JSON image to HTML with placeholder for later replacement. Expects nested content structure."""
|
||||
try:
|
||||
base64Data = imageData.get("base64Data", "")
|
||||
altText = imageData.get("altText", "Image")
|
||||
import html
|
||||
# Extract from nested content structure (standard JSON format)
|
||||
content = imageData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
|
||||
base64Data = content.get("base64Data", "")
|
||||
altText = content.get("altText", "Image")
|
||||
caption = content.get("caption", "")
|
||||
|
||||
# Escape HTML in altText and caption to prevent injection
|
||||
altTextEscaped = html.escape(str(altText))
|
||||
captionEscaped = html.escape(str(caption)) if caption else ""
|
||||
|
||||
if base64Data:
|
||||
return f'<img src="data:image/png;base64,{base64Data}" alt="{altText}">'
|
||||
# Use data URI as placeholder - will be replaced with file path in _replaceImageDataUris
|
||||
# Include a marker so we can find and replace it
|
||||
imageMarker = f"<!--IMAGE_MARKER:{len(base64Data)}:{altTextEscaped[:50]}-->"
|
||||
# Add max-width and max-height to ensure image fits within page dimensions
|
||||
# Typical page width is ~800-1200px, height varies but we limit to 600px for readability
|
||||
imgTag = f'<img src="data:image/png;base64,{base64Data}" alt="{altTextEscaped}" style="max-width: 100%; max-height: 600px; width: auto; height: auto;">'
|
||||
|
||||
if captionEscaped:
|
||||
return f'{imageMarker}<figure>{imgTag}<figcaption>{captionEscaped}</figcaption></figure>'
|
||||
else:
|
||||
return f'{imageMarker}{imgTag}'
|
||||
|
||||
return ""
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering image: {str(e)}")
|
||||
return f'<div class="error">[Image: {imageData.get("altText", "Image")}]</div>'
|
||||
self.logger.error(f"Error embedding image in HTML: {str(e)}")
|
||||
altText = imageData.get("altText", "Image")
|
||||
errorMsg = html.escape(f"[Error: Could not embed image '{altText}'. {str(e)}]")
|
||||
return f'<div class="error" style="color: red; padding: 10px; border: 1px solid red;">{errorMsg}</div>'
|
||||
|
||||
def _extractImages(self, jsonContent: Dict[str, Any]) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Extract all images from JSON structure.
|
||||
|
||||
Returns:
|
||||
List of image data dictionaries with base64Data, altText, caption, sectionId
|
||||
"""
|
||||
images = []
|
||||
|
||||
try:
|
||||
# Extract from standardized schema: {metadata: {...}, documents: [{sections: [...]}]}
|
||||
documents = jsonContent.get("documents", [])
|
||||
if not documents or not isinstance(documents, list):
|
||||
return images
|
||||
|
||||
for doc in documents:
|
||||
if not isinstance(doc, dict):
|
||||
continue
|
||||
sections = doc.get("sections", [])
|
||||
for section in sections:
|
||||
if section.get("content_type") == "image":
|
||||
elements = section.get("elements", [])
|
||||
for element in elements:
|
||||
# Extract from nested content structure
|
||||
content = element.get("content", {})
|
||||
base64Data = ""
|
||||
|
||||
if isinstance(content, dict):
|
||||
base64Data = content.get("base64Data", "")
|
||||
elif isinstance(content, str):
|
||||
# Content might be base64 string directly (shouldn't happen)
|
||||
pass
|
||||
|
||||
# If base64Data not found in content, try direct element fields (fallback)
|
||||
if not base64Data:
|
||||
base64Data = element.get("base64Data", "")
|
||||
|
||||
# If base64Data still not found, try extracting from url data URI
|
||||
if not base64Data:
|
||||
url = element.get("url", "") or (content.get("url", "") if isinstance(content, dict) else "")
|
||||
if url and isinstance(url, str) and url.startswith("data:image/"):
|
||||
# Extract base64 from data URI: data:image/png;base64,<base64>
|
||||
import re
|
||||
match = re.match(r'data:image/[^;]+;base64,(.+)', url)
|
||||
if match:
|
||||
base64Data = match.group(1)
|
||||
|
||||
if base64Data:
|
||||
sectionId = section.get("id", "unknown")
|
||||
|
||||
# Determine MIME type and extension
|
||||
mimeType = element.get("mimeType", "") or (content.get("mimeType", "") if isinstance(content, dict) else "")
|
||||
if not mimeType or mimeType == "unknown":
|
||||
# Try to detect the MIME type from the base64 prefix
|
||||
if base64Data.startswith("/9j/"):
|
||||
mimeType = "image/jpeg"
|
||||
elif base64Data.startswith("iVBORw0KGgo"):
|
||||
mimeType = "image/png"
|
||||
else:
|
||||
mimeType = "image/png" # Default
|
||||
|
||||
# Determine extension based on MIME type
|
||||
extension = "png"
|
||||
if mimeType == "image/jpeg" or mimeType == "image/jpg":
|
||||
extension = "jpg"
|
||||
elif mimeType == "image/png":
|
||||
extension = "png"
|
||||
elif mimeType == "image/gif":
|
||||
extension = "gif"
|
||||
elif mimeType == "image/webp":
|
||||
extension = "webp"
|
||||
|
||||
# Generate filename from section ID
|
||||
filename = f"{sectionId}.{extension}"
|
||||
# Clean filename (remove invalid characters)
|
||||
filename = "".join(c if c.isalnum() or c in "._-" else "_" for c in filename)
|
||||
|
||||
images.append({
|
||||
"base64Data": base64Data,
|
||||
"altText": element.get("altText", "Image"),
|
||||
"caption": element.get("caption"),
|
||||
"sectionId": sectionId,
|
||||
"filename": filename,
|
||||
"mimeType": mimeType
|
||||
})
|
||||
self.logger.debug(f"Extracted image from section {sectionId}: {filename}")
|
||||
|
||||
self.logger.info(f"Extracted {len(images)} image(s) from JSON structure")
|
||||
return images
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error extracting images: {str(e)}")
|
||||
return []
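
The extraction logic in `_extractImages` can be sketched in isolation as below. This is a minimal sketch, not the project's implementation: the helper names `base64FromDataUri`, `sniffMimeType`, and `safeFilename` are hypothetical, and only the regex, the two base64 prefixes, and the filename character filter are taken from the diff.

```python
import re
from typing import Optional

# Hypothetical helper names; the pattern itself is the one used in the diff.
_DATA_URI = re.compile(r"data:image/[^;]+;base64,(.+)")


def base64FromDataUri(url: str) -> Optional[str]:
    """Return the base64 payload of an image data URI, or None if it is not one."""
    match = _DATA_URI.match(url)
    return match.group(1) if match else None


def sniffMimeType(base64Data: str) -> str:
    """Guess the MIME type from the first base64 characters."""
    if base64Data.startswith("/9j/"):  # base64 of the JPEG bytes FF D8 FF
        return "image/jpeg"
    if base64Data.startswith("iVBORw0KGgo"):  # base64 of the PNG signature
        return "image/png"
    return "image/png"  # same default as in the diff


def safeFilename(sectionId: str, extension: str) -> str:
    """Build '<sectionId>.<extension>' and replace unsafe characters with '_'."""
    name = f"{sectionId}.{extension}"
    return "".join(c if c.isalnum() or c in "._-" else "_" for c in name)
```

Because the filename is derived from the section ID alone, two image elements in the same section would presumably get the same filename; whether that matters depends on how callers write the files.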
|
||||
|
||||
def _replaceImageDataUris(self, htmlContent: str, images: List[Dict[str, Any]]) -> str:
|
||||
"""
|
||||
Replace base64 data URIs in HTML with relative file paths.
|
||||
|
||||
Args:
|
||||
htmlContent: HTML content with data URIs
|
||||
images: List of image data dictionaries
|
||||
|
||||
Returns:
|
||||
HTML content with relative file paths
|
||||
"""
|
||||
try:
|
||||
import base64
|
||||
import re
|
||||
|
||||
# Find entire img tags with data URIs and replace them
|
||||
# Pattern: <img src="data:image/[type];base64,<base64>" [other attributes]>
|
||||
imgTagPattern = r'<img\s+src="data:image/[^"]+"[^>]*>'
|
||||
|
||||
def replaceImgTag(match):
|
||||
imgTag = match.group(0)
|
||||
|
||||
# Extract base64 data from the img tag
|
||||
base64Match = re.search(r'data:image/[^;]+;base64,([A-Za-z0-9+/=]+)', imgTag)
|
||||
if not base64Match:
|
||||
return imgTag # Return original if no base64 found
|
||||
|
||||
base64Data = base64Match.group(1)
|
||||
|
||||
# Find matching image in images list
|
||||
matchingImage = None
|
||||
for img in images:
|
||||
imgBase64 = img.get("base64Data", "")
|
||||
# Compare base64 data (lengths may differ because of padding)
|
||||
if imgBase64 == base64Data or imgBase64.startswith(base64Data[:100]) or base64Data.startswith(imgBase64[:100]):
|
||||
matchingImage = img
|
||||
break
|
||||
|
||||
if matchingImage:
|
||||
import html
|
||||
# Use filename from image data (generated from section ID)
|
||||
filename = matchingImage.get("filename", f"image_{images.index(matchingImage) + 1}.png")
|
||||
|
||||
# Extract existing alt text or use from matchingImage
|
||||
altMatch = re.search(r'alt="([^"]*)"', imgTag)
|
||||
existingAlt = altMatch.group(1) if altMatch else ""
|
||||
altText = html.escape(str(matchingImage.get("altText", existingAlt or "Image")))
|
||||
caption = html.escape(str(matchingImage.get("caption", ""))) if matchingImage.get("caption") else ""
|
||||
|
||||
# Create new img tag with filename
|
||||
imgTag = f'<img src="{filename}" alt="{altText}">'
|
||||
|
||||
if caption:
|
||||
return f'<figure>{imgTag}<figcaption>{caption}</figcaption></figure>'
|
||||
else:
|
||||
return imgTag
|
||||
else:
|
||||
# Keep original if no match found
|
||||
return match.group(0)
|
||||
|
||||
# Replace all img tags with data URIs (also remove IMAGE_MARKER comments)
|
||||
updatedHtml = re.sub(imgTagPattern, replaceImgTag, htmlContent)
|
||||
# Remove any leftover IMAGE_MARKER comments
|
||||
updatedHtml = re.sub(r'<!--IMAGE_MARKER:[^>]+-->', '', updatedHtml)
|
||||
|
||||
return updatedHtml
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error replacing image data URIs: {str(e)}")
|
||||
return htmlContent # Return original if replacement fails
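
The tag replacement above could be sketched as follows. Note the swapped-in technique: this sketch uses an exact-match dictionary lookup on the base64 payload, whereas the diff compares the first 100 characters as a prefix. The function name `replaceDataUris` is hypothetical, and the sketch assumes every image dict carries `base64Data` and `filename`.

```python
import html
import re
from typing import Any, Dict, List

# Matches <img src="data:image/...;base64,PAYLOAD" ...> and captures the payload.
IMG_TAG = re.compile(r'<img\s+src="data:image/[^;"]+;base64,([A-Za-z0-9+/=]+)"[^>]*>')


def replaceDataUris(htmlContent: str, images: List[Dict[str, Any]]) -> str:
    """Swap inline base64 <img> tags for tags that point at image files."""
    # Exact-match lookup (assumption); the diff uses a looser prefix comparison.
    byPayload = {img["base64Data"]: img for img in images}

    def swap(match: re.Match) -> str:
        image = byPayload.get(match.group(1))
        if image is None:
            return match.group(0)  # keep the original tag when nothing matches
        alt = html.escape(str(image.get("altText", "Image")))
        tag = f'<img src="{image["filename"]}" alt="{alt}">'
        caption = image.get("caption")
        if caption:
            escaped = html.escape(str(caption))
            return f"<figure>{tag}<figcaption>{escaped}</figcaption></figure>"
        return tag

    return IMG_TAG.sub(swap, htmlContent)
```

An exact lookup avoids false matches between images that share a long common prefix, at the cost of missing payloads that differ only in padding, which is the case the prefix comparison in the diff appears to target.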
|
||||
|
||||
def getRenderedImages(self) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Get images that were extracted during rendering.
|
||||
Returns list of image dicts with base64Data, altText, caption, and filename.
|
||||
"""
|
||||
if not hasattr(self, '_renderedImages'):
|
||||
return []
|
||||
return self._renderedImages
|
||||
|
|
|
|||
|
|
@@ -5,8 +5,10 @@ Image renderer for report generation using AI image generation.
|
|||
"""
|
||||
|
||||
from .rendererBaseTemplate import BaseRenderer
|
||||
from typing import Dict, Any, Tuple, List
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List
|
||||
import logging
|
||||
import base64
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
|
@@ -28,13 +30,43 @@ class RendererImage(BaseRenderer):
|
|||
"""Return priority for image renderer."""
|
||||
return 90
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> Tuple[str, str]:
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to image format using AI image generation."""
|
||||
try:
|
||||
# Generate AI image from content
|
||||
imageContent = await self._generateAiImage(extractedContent, title, userPrompt, aiService)
|
||||
|
||||
return imageContent, "image/png"
|
||||
# Determine filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
filename = documents[0].get("filename")
|
||||
if not filename:
|
||||
filename = self._determineFilename(title, "image/png")
|
||||
else:
|
||||
filename = self._determineFilename(title, "image/png")
|
||||
|
||||
# Convert image content to bytes (base64 string or bytes)
|
||||
if isinstance(imageContent, str):
|
||||
try:
|
||||
imageBytes = base64.b64decode(imageContent)
|
||||
except Exception:
|
||||
imageBytes = imageContent.encode('utf-8')
|
||||
else:
|
||||
imageBytes = imageContent
|
||||
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=imageBytes,
|
||||
mimeType="image/png",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering image: {str(e)}")
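
The string-or-bytes conversion in `render` can be sketched on its own. One detail worth knowing: `base64.b64decode` discards characters outside the base64 alphabet by default, so some plain-text strings decode without raising and the UTF-8 fallback is never reached. The sketch below passes `validate=True` to make the fallback stricter; this is a suggested variation and not what the diff does, and the name `toBytes` is hypothetical.

```python
import base64
import binascii
from typing import Union


def toBytes(content: Union[str, bytes]) -> bytes:
    """Return raw bytes for renderer output that may be base64 text or bytes."""
    if isinstance(content, bytes):
        return content
    try:
        # validate=True rejects characters outside the base64 alphabet instead
        # of silently discarding them (the default behaviour of b64decode).
        return base64.b64decode(content, validate=True)
    except (binascii.Error, ValueError):
        # Not valid base64: fall back to the UTF-8 bytes of the text itself.
        return content.encode("utf-8")
```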
|
||||
|
|
@@ -47,15 +79,15 @@ class RendererImage(BaseRenderer):
|
|||
if not aiService:
|
||||
raise ValueError("AI service is required for image generation")
|
||||
|
||||
# Validate JSON structure
|
||||
if not isinstance(extractedContent, dict):
|
||||
raise ValueError("Extracted content must be a dictionary")
|
||||
# Validate JSON structure (standardized schema: {metadata: {...}, documents: [{sections: [...]}]})
|
||||
if not self._validateJsonStructure(extractedContent):
|
||||
raise ValueError("Extracted content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
if "sections" not in extractedContent:
|
||||
raise ValueError("Extracted content must contain 'sections' field")
|
||||
# Extract metadata from standardized schema
|
||||
metadata = self._extractMetadata(extractedContent)
|
||||
|
||||
# Use title from JSON metadata if available, otherwise use provided title
|
||||
documentTitle = extractedContent.get("metadata", {}).get("title", title)
|
||||
documentTitle = metadata.get("title", title)
|
||||
|
||||
# Create AI prompt for image generation
|
||||
imagePrompt = await self._createImageGeneratePrompt(extractedContent, documentTitle, userPrompt, aiService)
|
||||
|
|
@@ -123,7 +155,7 @@ class RendererImage(BaseRenderer):
|
|||
promptParts.append(f"Document Title: {title}")
|
||||
|
||||
# Analyze content and create visual description
|
||||
sections = extractedContent.get("sections", [])
|
||||
sections = self._extractSections(extractedContent)
|
||||
contentDescription = self._analyzeContentForVisualDescription(sections)
|
||||
|
||||
if contentDescription:
|
||||
|
|
@@ -286,7 +318,7 @@ Return only the compressed prompt, no explanations.
|
|||
styleElements.append("corporate, professional design")
|
||||
|
||||
# Analyze content type for additional style hints
|
||||
sections = extractedContent.get("sections", [])
|
||||
sections = self._extractSections(extractedContent)
|
||||
hasTables = any(self._getSectionType(s) == "table" for s in sections)
|
||||
hasLists = any(self._getSectionType(s) == "bullet_list" for s in sections)
|
||||
hasCode = any(self._getSectionType(s) == "code_block" for s in sections)
|
||||
|
|
|
|||
|
|
@@ -5,7 +5,8 @@ JSON renderer for report generation.
|
|||
"""
|
||||
|
||||
from .rendererBaseTemplate import BaseRenderer
|
||||
from typing import Dict, Any, Tuple, List
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List
|
||||
import json
|
||||
|
||||
class RendererJson(BaseRenderer):
|
||||
|
|
@@ -26,14 +27,35 @@ class RendererJson(BaseRenderer):
|
|||
"""Return priority for JSON renderer."""
|
||||
return 80
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> Tuple[str, str]:
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to JSON format."""
|
||||
try:
|
||||
# The extracted content should already be JSON from the AI
|
||||
# Just validate and format it
|
||||
jsonContent = self._cleanJsonContent(extractedContent, title)
|
||||
|
||||
return jsonContent, "application/json"
|
||||
# Determine filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
filename = documents[0].get("filename")
|
||||
if not filename:
|
||||
filename = self._determineFilename(title, "application/json")
|
||||
else:
|
||||
filename = self._determineFilename(title, "application/json")
|
||||
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=jsonContent.encode('utf-8'),
|
||||
mimeType="application/json",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering JSON: {str(e)}")
|
||||
|
|
@@ -43,7 +65,18 @@ class RendererJson(BaseRenderer):
|
|||
"sections": [{"content_type": "paragraph", "elements": [{"text": f"Error rendering report: {str(e)}"}]}],
|
||||
"metadata": {"error": str(e)}
|
||||
}
|
||||
return json.dumps(fallbackData, indent=2), "application/json"
|
||||
fallbackContent = json.dumps(fallbackData, indent=2)
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=fallbackContent.encode('utf-8'),
|
||||
mimeType="application/json",
|
||||
filename=self._determineFilename(title, "application/json"),
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
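
The same filename and metadata handling is repeated in every renderer's `render` method. A shared helper could look like the sketch below. Everything here is an assumption for illustration: the `RenderedDocument` dataclass is a stand-in whose fields are inferred from the keyword arguments in the diff (the real class lives in `modules.datamodels.datamodelDocument`), and `resolveFilename` and `wrapDocument` are hypothetical names. The `fallbackName` argument stands where the diff calls `self._determineFilename(title, mimeType)`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class RenderedDocument:
    """Stand-in for the project's RenderedDocument; fields are assumed."""
    documentData: bytes
    mimeType: str
    filename: str
    documentType: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


def resolveFilename(extractedContent: Dict[str, Any], fallbackName: str) -> str:
    """Prefer the first document's filename, otherwise use the fallback."""
    documents = extractedContent.get("documents", [])
    if documents and isinstance(documents[0], dict):
        return documents[0].get("filename") or fallbackName
    return fallbackName


def wrapDocument(content: str, mimeType: str,
                 extractedContent: Dict[str, Any],
                 fallbackName: str) -> List[RenderedDocument]:
    """Wrap rendered text in the single-element list the renderers now return."""
    metadata = extractedContent.get("metadata", {}) if extractedContent else {}
    isDict = isinstance(metadata, dict)
    return [
        RenderedDocument(
            documentData=content.encode("utf-8"),
            mimeType=mimeType,
            filename=resolveFilename(extractedContent or {}, fallbackName),
            documentType=metadata.get("documentType") if isDict else None,
            metadata=metadata if isDict else None,
        )
    ]
```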
|
||||
|
||||
def _cleanJsonContent(self, content: Dict[str, Any], title: str) -> str:
|
||||
"""Clean and validate JSON content from AI."""
|
||||
|
|
|
|||
|
|
@@ -5,7 +5,8 @@ Markdown renderer for report generation.
|
|||
"""
|
||||
|
||||
from .rendererBaseTemplate import BaseRenderer
|
||||
from typing import Dict, Any, Tuple, List
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List
|
||||
|
||||
class RendererMarkdown(BaseRenderer):
|
||||
"""Renders content to Markdown format with format-specific extraction."""
|
||||
|
|
@@ -25,31 +26,64 @@ class RendererMarkdown(BaseRenderer):
|
|||
"""Return priority for markdown renderer."""
|
||||
return 95
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> Tuple[str, str]:
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to Markdown format."""
|
||||
try:
|
||||
# Generate markdown from JSON structure
|
||||
markdownContent = self._generateMarkdownFromJson(extractedContent, title)
|
||||
|
||||
return markdownContent, "text/markdown"
|
||||
# Determine filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
filename = documents[0].get("filename")
|
||||
if not filename:
|
||||
filename = self._determineFilename(title, "text/markdown")
|
||||
else:
|
||||
filename = self._determineFilename(title, "text/markdown")
|
||||
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=markdownContent.encode('utf-8'),
|
||||
mimeType="text/markdown",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering markdown: {str(e)}")
|
||||
# Return minimal markdown fallback
|
||||
return f"# {title}\n\nError rendering report: {str(e)}", "text/markdown"
|
||||
fallbackContent = f"# {title}\n\nError rendering report: {str(e)}"
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=fallbackContent.encode('utf-8'),
|
||||
mimeType="text/markdown",
|
||||
filename=self._determineFilename(title, "text/markdown"),
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
def _generateMarkdownFromJson(self, jsonContent: Dict[str, Any], title: str) -> str:
|
||||
"""Generate markdown content from structured JSON document."""
|
||||
try:
|
||||
# Validate JSON structure
|
||||
if not isinstance(jsonContent, dict):
|
||||
raise ValueError("JSON content must be a dictionary")
|
||||
# Validate JSON structure (standardized schema: {metadata: {...}, documents: [{sections: [...]}]})
|
||||
if not self._validateJsonStructure(jsonContent):
|
||||
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
if "sections" not in jsonContent:
|
||||
raise ValueError("JSON content must contain 'sections' field")
|
||||
# Extract sections and metadata from standardized schema
|
||||
sections = self._extractSections(jsonContent)
|
||||
metadata = self._extractMetadata(jsonContent)
|
||||
|
||||
# Use title from JSON metadata if available, otherwise use provided title
|
||||
documentTitle = jsonContent.get("metadata", {}).get("title", title)
|
||||
documentTitle = metadata.get("title", title)
|
||||
|
||||
# Build markdown content
|
||||
markdownParts = []
|
||||
|
|
@@ -59,7 +93,6 @@ class RendererMarkdown(BaseRenderer):
|
|||
markdownParts.append("")
|
||||
|
||||
# Process each section
|
||||
sections = jsonContent.get("sections", [])
|
||||
for section in sections:
|
||||
sectionMarkdown = self._renderJsonSection(section)
|
||||
if sectionMarkdown:
|
||||
|
|
@@ -77,31 +110,71 @@ class RendererMarkdown(BaseRenderer):
|
|||
raise Exception(f"Markdown generation failed: {str(e)}")
|
||||
|
||||
def _renderJsonSection(self, section: Dict[str, Any]) -> str:
|
||||
"""Render a single JSON section to markdown."""
|
||||
"""Render a single JSON section to markdown.
|
||||
Supports three content formats: reference, object (base64), extracted_text.
|
||||
"""
|
||||
try:
|
||||
sectionType = self._getSectionType(section)
|
||||
sectionData = self._getSectionData(section)
|
||||
|
||||
# Check for three content formats from Phase 5D in elements
|
||||
if isinstance(sectionData, list):
|
||||
markdownParts = []
|
||||
for element in sectionData:
|
||||
element_type = element.get("type", "") if isinstance(element, dict) else ""
|
||||
|
||||
# Support three content formats from Phase 5D
|
||||
if element_type == "reference":
|
||||
# Document reference format
|
||||
doc_ref = element.get("documentReference", "")
|
||||
label = element.get("label", "Reference")
|
||||
markdownParts.append(f"*[Reference: {label}]*")
|
||||
continue
|
||||
elif element_type == "extracted_text":
|
||||
# Extracted text format
|
||||
content = element.get("content", "")
|
||||
source = element.get("source", "")
|
||||
if content:
|
||||
source_text = f" *(Source: {source})*" if source else ""
|
||||
markdownParts.append(f"{content}{source_text}")
|
||||
continue
|
||||
|
||||
# If we processed reference/extracted_text elements, return them
|
||||
if markdownParts:
|
||||
return '\n\n'.join(markdownParts)
|
||||
|
||||
if sectionType == "table":
|
||||
# Process the section data to extract table structure
|
||||
processedData = self._processSectionByType(section)
|
||||
return self._renderJsonTable(processedData)
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonTable(element)
|
||||
return ""
|
||||
elif sectionType == "bullet_list":
|
||||
# Process the section data to extract bullet list structure
|
||||
processedData = self._processSectionByType(section)
|
||||
return self._renderJsonBulletList(processedData)
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonBulletList(element)
|
||||
return ""
|
||||
elif sectionType == "heading":
|
||||
return self._renderJsonHeading(sectionData)
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonHeading(element)
|
||||
return ""
|
||||
elif sectionType == "paragraph":
|
||||
return self._renderJsonParagraph(sectionData)
|
||||
elif sectionType == "code_block":
|
||||
# Process the section data to extract code block structure
|
||||
processedData = self._processSectionByType(section)
|
||||
return self._renderJsonCodeBlock(processedData)
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonCodeBlock(element)
|
||||
return ""
|
||||
elif sectionType == "image":
|
||||
# Process the section data to extract image structure
|
||||
processedData = self._processSectionByType(section)
|
||||
return self._renderJsonImage(processedData)
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonImage(element)
|
||||
return ""
|
||||
else:
|
||||
# Fallback to paragraph for unknown types
|
||||
return self._renderJsonParagraph(sectionData)
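
The handling of `reference` and `extracted_text` elements at the top of `_renderJsonSection` can be sketched as a standalone function. The name `renderSpecialElements` is hypothetical; the element keys (`type`, `label`, `content`, `source`) and the output strings follow the diff.

```python
from typing import Any, List


def renderSpecialElements(elements: List[Any]) -> str:
    """Render reference and extracted_text elements; return '' if there are none."""
    parts = []
    for element in elements:
        if not isinstance(element, dict):
            continue
        elementType = element.get("type", "")
        if elementType == "reference":
            # Only the label is rendered; the diff reads documentReference
            # into a variable but does not use it.
            parts.append(f"*[Reference: {element.get('label', 'Reference')}]*")
        elif elementType == "extracted_text":
            content = element.get("content", "")
            source = element.get("source", "")
            if content:
                suffix = f" *(Source: {source})*" if source else ""
                parts.append(f"{content}{suffix}")
    return "\n\n".join(parts)
```

An empty result signals the caller to continue with the type-based dispatch, which mirrors the `if markdownParts:` check in the diff.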
|
||||
|
|
@@ -113,8 +186,12 @@ class RendererMarkdown(BaseRenderer):
|
|||
def _renderJsonTable(self, tableData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON table to markdown."""
|
||||
try:
|
||||
headers = tableData.get("headers", [])
|
||||
rows = tableData.get("rows", [])
|
||||
# Extract from nested content structure: element.content.{headers, rows}
|
||||
content = tableData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
headers = content.get("headers", [])
|
||||
rows = content.get("rows", [])
|
||||
|
||||
if not headers or not rows:
|
||||
return ""
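
With the nested schema, a table element carries its data under `content`. The sketch below shows one way the lookup and the Markdown output could fit together. The remainder of `_renderJsonTable` is outside this hunk, so the pipe-table formatting here is an assumption, and `renderTable` is a hypothetical name.

```python
from typing import Any, Dict


def renderTable(element: Dict[str, Any]) -> str:
    """Render element.content.{headers, rows} as a Markdown pipe table."""
    content = element.get("content", {})
    if not isinstance(content, dict):
        return ""
    headers = content.get("headers", [])
    rows = content.get("rows", [])
    if not headers or not rows:
        return ""
    lines = [
        "| " + " | ".join(str(h) for h in headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)
```

Elements that still use the old flat layout (`headers` and `rows` at the top level) render as an empty string under this scheme, which matches the early `return ""` paths in the diff.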
|
||||
|
|
@@ -143,7 +220,11 @@ class RendererMarkdown(BaseRenderer):
|
|||
def _renderJsonBulletList(self, listData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON bullet list to markdown."""
|
||||
try:
|
||||
items = listData.get("items", [])
|
||||
# Extract from nested content structure: element.content.{items}
|
||||
content = listData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
items = content.get("items", [])
|
||||
|
||||
if not items:
|
||||
return ""
|
||||
|
|
@@ -164,8 +245,12 @@ class RendererMarkdown(BaseRenderer):
|
|||
def _renderJsonHeading(self, headingData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON heading to markdown."""
|
||||
try:
|
||||
level = headingData.get("level", 1)
|
||||
text = headingData.get("text", "")
|
||||
# Extract from nested content structure: element.content.{text, level}
|
||||
content = headingData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
text = content.get("text", "")
|
||||
level = content.get("level", 1)
|
||||
|
||||
if text:
|
||||
level = max(1, min(6, level))
|
||||
|
|
@@ -180,7 +265,14 @@ class RendererMarkdown(BaseRenderer):
|
|||
def _renderJsonParagraph(self, paragraphData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON paragraph to markdown."""
|
||||
try:
|
||||
text = paragraphData.get("text", "")
|
||||
# Extract from nested content structure
|
||||
content = paragraphData.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
return text if text else ""
|
||||
|
||||
except Exception as e:
|
||||
|
|
@@ -190,8 +282,12 @@ class RendererMarkdown(BaseRenderer):
|
|||
def _renderJsonCodeBlock(self, codeData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON code block to markdown."""
|
||||
try:
|
||||
code = codeData.get("code", "")
|
||||
language = codeData.get("language", "")
|
||||
# Extract from nested content structure
|
||||
content = codeData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
code = content.get("code", "")
|
||||
language = content.get("language", "")
|
||||
|
||||
if code:
|
||||
if language:
|
||||
|
|
@@ -208,8 +304,12 @@ class RendererMarkdown(BaseRenderer):
|
|||
def _renderJsonImage(self, imageData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON image to markdown."""
|
||||
try:
|
||||
altText = imageData.get("altText", "Image")
|
||||
base64Data = imageData.get("base64Data", "")
|
||||
# Extract from nested content structure: element.content.{base64Data, altText, caption}
|
||||
content = imageData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
altText = content.get("altText", "Image")
|
||||
base64Data = content.get("base64Data", "")
|
||||
|
||||
if base64Data:
|
||||
# For base64 images, we can't embed them directly in markdown
|
||||
|
|
|
|||
|
|
@@ -5,7 +5,8 @@ PDF renderer for report generation using reportlab.
|
|||
"""
|
||||
|
||||
from .rendererBaseTemplate import BaseRenderer
|
||||
from typing import Dict, Any, Tuple, List
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List
|
||||
import io
|
||||
import base64
|
||||
|
||||
|
|
@@ -38,41 +39,79 @@ class RendererPdf(BaseRenderer):
|
|||
"""Return priority for PDF renderer."""
|
||||
return 120
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> Tuple[str, str]:
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to PDF format using AI-analyzed styling."""
|
||||
try:
|
||||
if not REPORTLAB_AVAILABLE:
|
||||
# Fallback to HTML if reportlab not available
|
||||
from .rendererHtml import RendererHtml
|
||||
html_renderer = RendererHtml()
|
||||
html_content, _ = await html_renderer.render(extractedContent, title, userPrompt, aiService)
|
||||
return html_content, "text/html"
|
||||
return await html_renderer.render(extractedContent, title, userPrompt, aiService)
|
||||
|
||||
# Generate PDF using AI-analyzed styling
|
||||
pdf_content = await self._generatePdfFromJson(extractedContent, title, userPrompt, aiService)
|
||||
|
||||
return pdf_content, "application/pdf"
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
# Determine filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
filename = documents[0].get("filename")
|
||||
if not filename:
|
||||
filename = self._determineFilename(title, "application/pdf")
|
||||
else:
|
||||
filename = self._determineFilename(title, "application/pdf")
|
||||
|
||||
# Convert PDF content to bytes if it's a string (base64)
|
||||
if isinstance(pdf_content, str):
|
||||
# Try to decode as base64, otherwise encode as UTF-8
|
||||
try:
|
||||
pdf_bytes = base64.b64decode(pdf_content)
|
||||
except Exception:
|
||||
pdf_bytes = pdf_content.encode('utf-8')
|
||||
else:
|
||||
pdf_bytes = pdf_content
|
||||
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=pdf_bytes,
|
||||
mimeType="application/pdf",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering PDF: {str(e)}")
|
||||
# Return minimal fallback
|
||||
return f"PDF Generation Error: {str(e)}", "text/plain"
|
||||
fallbackContent = f"PDF Generation Error: {str(e)}"
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=fallbackContent.encode('utf-8'),
|
||||
mimeType="text/plain",
|
||||
filename=self._determineFilename(title, "text/plain")
|
||||
)
|
||||
]
|
||||
|
||||
async def _generatePdfFromJson(self, json_content: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> str:
|
||||
"""Generate PDF content from structured JSON document using AI-generated styling."""
|
||||
try:
|
||||
# Get style set: default styles, enhanced with AI if userPrompt provided
|
||||
styles = await self._getStyleSet(userPrompt, aiService)
|
||||
# Get style set: use styles from metadata if available, otherwise enhance with AI
|
||||
styles = await self._getStyleSet(json_content, userPrompt, aiService)
|
||||
|
||||
# Validate JSON structure
|
||||
if not isinstance(json_content, dict):
|
||||
raise ValueError("JSON content must be a dictionary")
|
||||
if not self._validateJsonStructure(json_content):
|
||||
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
if "sections" not in json_content:
|
||||
raise ValueError("JSON content must contain 'sections' field")
|
||||
# Extract sections and metadata from standardized schema
|
||||
sections = self._extractSections(json_content)
|
||||
metadata = self._extractMetadata(json_content)
|
||||
|
||||
# Use title from JSON metadata if available, otherwise use provided title
|
||||
document_title = json_content.get("metadata", {}).get("title", title)
|
||||
document_title = metadata.get("title", title)
|
||||
|
||||
# Make title shorter to prevent wrapping/overlapping
|
||||
if len(document_title) > 40:
|
||||
|
|
@@ -102,8 +141,7 @@ class RendererPdf(BaseRenderer):
|
|||
story.append(Spacer(1, 30)) # Add spacing before page break
|
||||
story.append(PageBreak())
|
||||
|
||||
# Process each section
|
||||
sections = json_content.get("sections", [])
|
||||
# Process each section (sections already extracted above)
|
||||
self.services.utils.debugLogToFile(f"PDF SECTIONS TO PROCESS: {len(sections)} sections", "PDF_RENDERER")
|
||||
for i, section in enumerate(sections):
|
||||
self.services.utils.debugLogToFile(f"PDF SECTION {i}: content_type={section.get('content_type', 'unknown')}, id={section.get('id', 'unknown')}", "PDF_RENDERER")
|
||||
|
|
@@ -125,12 +163,17 @@ class RendererPdf(BaseRenderer):
|
|||
self.logger.error(f"Error generating PDF from JSON: {str(e)}")
|
||||
raise Exception(f"PDF generation failed: {str(e)}")
|
||||
|
||||
async def _getStyleSet(self, userPrompt: str = None, aiService=None, templateName: str = None) -> Dict[str, Any]:
|
||||
"""Get style set - default styles, enhanced with AI if userPrompt provided.
|
||||
async def _getStyleSet(self, extractedContent: Dict[str, Any] = None, userPrompt: str = None, aiService=None, templateName: str = None) -> Dict[str, Any]:
|
||||
"""Get style set - use styles from document generation metadata if available,
|
||||
otherwise enhance default styles with AI if userPrompt provided.
|
||||
|
||||
IMPORTANT: In a dynamic scalable AI system, styling should come from document generation,
|
||||
not be generated separately by renderers. Only fall back to AI if styles not provided.
|
||||
|
||||
Args:
|
||||
extractedContent: Document content with metadata (may contain styles)
|
||||
userPrompt: User's prompt (AI will detect style instructions in any language)
|
||||
aiService: AI service (used only if userPrompt provided)
|
||||
aiService: AI service (used only if styles not in metadata and userPrompt provided)
|
||||
templateName: Name of template style set (None = default)
|
||||
|
||||
Returns:
|
||||
|
|
@@ -139,10 +182,19 @@ class RendererPdf(BaseRenderer):
|
|||
# Get default style set
|
||||
defaultStyleSet = self._getDefaultStyleSet()
|
||||
|
||||
# Enhance with AI if userPrompt provided (AI handles multilingual style detection)
|
||||
# FIRST: Check if styles are provided in document generation metadata (preferred approach)
|
||||
if extractedContent:
|
||||
metadata = extractedContent.get("metadata", {})
|
||||
if isinstance(metadata, dict):
|
||||
styles = metadata.get("styles")
|
||||
if styles and isinstance(styles, dict):
|
||||
self.logger.debug("Using styles from document generation metadata")
|
||||
enhancedStyleSet = self._convertColorsFormat(styles)
|
||||
return self._validateStylesContrast(enhancedStyleSet)
|
||||
|
||||
# FALLBACK: Enhance with AI if userPrompt provided (only if styles not in metadata)
|
||||
if userPrompt and aiService:
|
||||
# AI will naturally detect style instructions in any language
|
||||
self.logger.info(f"Enhancing styles with AI based on user prompt...")
|
||||
self.logger.info(f"Styles not in metadata, enhancing with AI based on user prompt...")
|
||||
enhancedStyleSet = await self._enhanceStylesWithAI(userPrompt, defaultStyleSet, aiService)
|
||||
# Convert colors to PDF format after getting styles
|
||||
enhancedStyleSet = self._convertColorsFormat(enhancedStyleSet)
|
||||
|
|
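The precedence that the `_getStyleSet` hunk above introduces (styles from document metadata first, AI enhancement only as a fallback, defaults otherwise) can be sketched in isolation. This is a minimal synchronous sketch, not the renderer's actual method: `resolveStyleSet` and the `enhanceWithAi` callback are hypothetical names, and the color conversion and contrast validation steps from the diff are left out.

```python
from typing import Any, Callable, Dict, Optional


def resolveStyleSet(
    extractedContent: Optional[Dict[str, Any]],
    defaultStyles: Dict[str, Any],
    enhanceWithAi: Optional[Callable[[Dict[str, Any]], Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Pick a style set: metadata styles first, then AI enhancement, then defaults."""
    if extractedContent:
        metadata = extractedContent.get("metadata", {})
        if isinstance(metadata, dict):
            styles = metadata.get("styles")
            # Preferred path: styles were produced during document generation.
            if styles and isinstance(styles, dict):
                return styles
    # Fallback path: hypothetical hook standing in for the AI enhancement call.
    if enhanceWithAi is not None:
        return enhanceWithAi(defaultStyles)
    return defaultStyles
```

A malformed `styles` value (for example a string) falls through to the fallback path instead of raising, which mirrors the `isinstance` guards in the hunk.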
@ -477,7 +529,9 @@ class RendererPdf(BaseRenderer):
|
|||
return colors.black
|
||||
|
||||
def _renderJsonSection(self, section: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a single JSON section to PDF elements using AI-generated styles."""
|
||||
"""Render a single JSON section to PDF elements using AI-generated styles.
|
||||
Supports three content formats: reference, object (base64), extracted_text.
|
||||
"""
|
||||
try:
|
||||
section_type = self._getSectionType(section)
|
||||
elements = self._getSectionData(section)
|
||||
|
|
@ -485,33 +539,79 @@ class RendererPdf(BaseRenderer):
|
|||
# Process each element in the section
|
||||
all_elements = []
|
||||
for element in elements:
|
||||
if section_type == "table":
|
||||
element_type = element.get("type", "") if isinstance(element, dict) else ""
|
||||
|
||||
# Support three content formats from Phase 5D
|
||||
if element_type == "reference":
|
||||
# Document reference format
|
||||
doc_ref = element.get("documentReference", "")
|
||||
label = element.get("label", "Reference")
|
||||
ref_style = ParagraphStyle(
|
||||
'Reference',
|
||||
parent=self._createNormalStyle(styles),
|
||||
fontStyle='italic',
|
||||
textColor=colors.grey
|
||||
)
|
||||
all_elements.append(Paragraph(f"[Reference: {label}]", ref_style))
|
||||
all_elements.append(Spacer(1, 6))
|
||||
continue
|
||||
elif element_type == "extracted_text":
|
||||
# Extracted text format
|
||||
content = element.get("content", "")
|
||||
source = element.get("source", "")
|
||||
if content:
|
||||
source_text = f" <i>(Source: {source})</i>" if source else ""
|
||||
all_elements.append(Paragraph(f"{content}{source_text}", self._createNormalStyle(styles)))
|
||||
all_elements.append(Spacer(1, 6))
|
||||
continue
|
||||
|
||||
# Check element type, not section type (elements can have different types than section)
|
||||
if element_type == "table":
|
||||
all_elements.extend(self._renderJsonTable(element, styles))
|
||||
elif section_type == "bullet_list":
|
||||
elif element_type == "bullet_list":
|
||||
all_elements.extend(self._renderJsonBulletList(element, styles))
|
||||
elif section_type == "heading":
|
||||
elif element_type == "heading":
|
||||
all_elements.extend(self._renderJsonHeading(element, styles))
|
||||
elif section_type == "paragraph":
|
||||
elif element_type == "paragraph":
|
||||
all_elements.extend(self._renderJsonParagraph(element, styles))
|
||||
elif section_type == "code_block":
|
||||
elif element_type == "code_block":
|
||||
all_elements.extend(self._renderJsonCodeBlock(element, styles))
|
||||
elif section_type == "image":
|
||||
elif element_type == "image":
|
||||
all_elements.extend(self._renderJsonImage(element, styles))
|
||||
else:
|
||||
# Fallback to paragraph for unknown types
|
||||
all_elements.extend(self._renderJsonParagraph(element, styles))
|
||||
# Fallback: if element_type not set, use section_type as fallback
|
||||
if section_type == "table":
|
||||
all_elements.extend(self._renderJsonTable(element, styles))
|
||||
elif section_type == "bullet_list":
|
||||
all_elements.extend(self._renderJsonBulletList(element, styles))
|
||||
elif section_type == "heading":
|
||||
all_elements.extend(self._renderJsonHeading(element, styles))
|
||||
elif section_type == "paragraph":
|
||||
all_elements.extend(self._renderJsonParagraph(element, styles))
|
||||
elif section_type == "code_block":
|
||||
all_elements.extend(self._renderJsonCodeBlock(element, styles))
|
||||
elif section_type == "image":
|
||||
all_elements.extend(self._renderJsonImage(element, styles))
|
||||
else:
|
||||
# Final fallback to paragraph for unknown types
|
||||
all_elements.extend(self._renderJsonParagraph(element, styles))
|
||||
|
||||
return all_elements
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering section {self._getSectionId(section)}: {str(e)}")
|
||||
return [Paragraph(f"[Error rendering section: {str(e)}]", self._create_normal_style(styles))]
|
||||
return [Paragraph(f"[Error rendering section: {str(e)}]", self._createNormalStyle(styles))]
|
||||
|
||||
def _renderJsonTable(self, table_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a JSON table to PDF elements using AI-generated styles."""
|
||||
try:
|
||||
headers = table_data.get("headers", [])
|
||||
rows = table_data.get("rows", [])
|
||||
# Handle nested content structure: element.content.headers vs element.headers
|
||||
# Extract from nested content structure
|
||||
content = table_data.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
headers = content.get("headers", [])
|
||||
rows = content.get("rows", [])
|
||||
|
||||
if not headers or not rows:
|
||||
return []
|
||||
|
|
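The dispatch change in `_renderJsonSection` above (look at the element's own `type` first, fall back to the section type, then to paragraph) could also be expressed as a table lookup. The sketch below is an illustration under assumptions: `pickRenderer` and the `renderers` mapping are hypothetical, and the `reference` and `extracted_text` formats, which the diff handles before dispatch, are not covered.

```python
from typing import Any, Callable, Dict


def pickRenderer(
    element: Any,
    sectionType: str,
    renderers: Dict[str, Callable[[Any], Any]],
    fallbackKey: str = "paragraph",
) -> Callable[[Any], Any]:
    """Choose a renderer by element type, then section type, then the fallback."""
    elementType = element.get("type", "") if isinstance(element, dict) else ""
    if elementType in renderers:
        return renderers[elementType]
    # Element type missing or unknown: use the enclosing section's type.
    if sectionType in renderers:
        return renderers[sectionType]
    # Final fallback for unknown types, as in the diff.
    return renderers[fallbackKey]
```

Compared with the two `if`/`elif` chains in the hunk, a mapping keeps the element-level and section-level lookups in sync when a new type is added.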
@ -527,13 +627,13 @@ class RendererPdf(BaseRenderer):
|
|||
table_cell_style = styles.get("table_cell", {})
|
||||
|
||||
table_style = [
|
||||
('BACKGROUND', (0, 0), (-1, 0), self._hex_to_color(table_header_style.get("background", "#4F4F4F"))),
|
||||
('TEXTCOLOR', (0, 0), (-1, 0), self._hex_to_color(table_header_style.get("text_color", "#FFFFFF"))),
|
||||
('BACKGROUND', (0, 0), (-1, 0), self._hexToColor(table_header_style.get("background", "#4F4F4F"))),
|
||||
('TEXTCOLOR', (0, 0), (-1, 0), self._hexToColor(table_header_style.get("text_color", "#FFFFFF"))),
|
||||
('ALIGN', (0, 0), (-1, -1), self._getTableAlignment(table_cell_style.get("align", "left"))),
|
||||
('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold' if table_header_style.get("bold", True) else 'Helvetica'),
|
||||
('FONTSIZE', (0, 0), (-1, 0), table_header_style.get("font_size", 12)),
|
||||
('BOTTOMPADDING', (0, 0), (-1, 0), 12),
|
||||
('BACKGROUND', (0, 1), (-1, -1), self._hex_to_color(table_cell_style.get("background", "#FFFFFF"))),
|
||||
('BACKGROUND', (0, 1), (-1, -1), self._hexToColor(table_cell_style.get("background", "#FFFFFF"))),
|
||||
('FONTSIZE', (0, 1), (-1, -1), table_cell_style.get("font_size", 10)),
|
||||
('GRID', (0, 0), (-1, -1), 1, colors.black)
|
||||
]
|
||||
|
|
@ -549,15 +649,19 @@ class RendererPdf(BaseRenderer):
|
|||
def _renderJsonBulletList(self, list_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a JSON bullet list to PDF elements using AI-generated styles."""
|
||||
try:
|
||||
items = list_data.get("items", [])
|
||||
# Extract from nested content structure
|
||||
content = list_data.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
items = content.get("items", [])
|
||||
bullet_style_def = styles.get("bullet_list", {})
|
||||
|
||||
elements = []
|
||||
for item in items:
|
||||
if isinstance(item, str):
|
||||
elements.append(Paragraph(f"• {item}", self._create_normal_style(styles)))
|
||||
elements.append(Paragraph(f"• {item}", self._createNormalStyle(styles)))
|
||||
elif isinstance(item, dict) and "text" in item:
|
||||
elements.append(Paragraph(f"• {item['text']}", self._create_normal_style(styles)))
|
||||
elements.append(Paragraph(f"• {item['text']}", self._createNormalStyle(styles)))
|
||||
|
||||
if elements:
|
||||
elements.append(Spacer(1, bullet_style_def.get("space_after", 3)))
|
||||
|
|
@ -571,8 +675,12 @@ class RendererPdf(BaseRenderer):
|
|||
def _renderJsonHeading(self, heading_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a JSON heading to PDF elements using AI-generated styles."""
|
||||
try:
|
||||
level = heading_data.get("level", 1)
|
||||
text = heading_data.get("text", "")
|
||||
# Extract from nested content structure
|
||||
content = heading_data.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
text = content.get("text", "")
|
||||
level = content.get("level", 1)
|
||||
|
||||
if text:
|
||||
level = max(1, min(6, level))
|
||||
|
|
@ -588,7 +696,14 @@ class RendererPdf(BaseRenderer):
|
|||
def _renderJsonParagraph(self, paragraph_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a JSON paragraph to PDF elements using AI-generated styles."""
|
||||
try:
|
||||
text = paragraph_data.get("text", "")
|
||||
# Extract from nested content structure
|
||||
content = paragraph_data.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
|
||||
if text:
|
||||
return [Paragraph(text, self._createNormalStyle(styles))]
|
||||
|
|
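The nested-content lookup that `_renderJsonParagraph` now performs can be tried on its own. A minimal sketch, assuming the element shapes visible in the hunk above; `extractParagraphText` is a hypothetical helper name, not part of the renderer.

```python
from typing import Any, Dict


def extractParagraphText(element: Dict[str, Any]) -> str:
    """Read paragraph text from element['content'], which may be a dict or a string."""
    content = element.get("content", {})
    if isinstance(content, dict):
        return content.get("text", "")
    if isinstance(content, str):
        return content
    return ""
```

One consequence worth noting: an element that still uses the old flat shape, with `text` at the top level and no `content` key, yields an empty string under this logic.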
@ -602,8 +717,12 @@ class RendererPdf(BaseRenderer):
|
|||
def _renderJsonCodeBlock(self, code_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a JSON code block to PDF elements using AI-generated styles."""
|
||||
try:
|
||||
code = code_data.get("code", "")
|
||||
language = code_data.get("language", "")
|
||||
# Extract from nested content structure
|
||||
content = code_data.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return []
|
||||
code = content.get("code", "")
|
||||
language = content.get("language", "")
|
||||
code_style_def = styles.get("code_block", {})
|
||||
|
||||
if code:
|
||||
|
|
@ -637,17 +756,164 @@ class RendererPdf(BaseRenderer):
|
|||
return []
|
||||
|
||||
def _renderJsonImage(self, image_data: Dict[str, Any], styles: Dict[str, Any]) -> List[Any]:
|
||||
"""Render a JSON image to PDF elements."""
|
||||
"""Render a JSON image to PDF elements using reportlab."""
|
||||
try:
|
||||
base64_data = image_data.get("base64Data", "")
|
||||
alt_text = image_data.get("altText", "Image")
|
||||
# Extract from nested content structure
|
||||
content = image_data.get("content", {})
|
||||
base64_data = ""
|
||||
alt_text = "Image"
|
||||
caption = ""
|
||||
|
||||
if base64_data:
|
||||
# For now, just add a placeholder since reportlab image handling is complex
|
||||
if isinstance(content, dict):
|
||||
# Nested content structure
|
||||
base64_data = content.get("base64Data", "")
|
||||
alt_text = content.get("altText", "Image")
|
||||
caption = content.get("caption", "")
|
||||
elif isinstance(content, str):
|
||||
# Content might be base64 string directly (shouldn't happen, but handle it)
|
||||
self.logger.warning("Image content is a string, not a dict. This should not happen.")
|
||||
return [Paragraph(f"[Image: Invalid format]", self._createNormalStyle(styles))]
|
||||
|
||||
# If base64Data not found in content, try direct element fields (fallback)
|
||||
if not base64_data:
|
||||
base64_data = image_data.get("base64Data", "")
|
||||
if not alt_text or alt_text == "Image":
|
||||
alt_text = image_data.get("altText", "Image")
|
||||
if not caption:
|
||||
caption = image_data.get("caption", "")
|
||||
|
||||
# If base64Data still not found, try extracting from url data URI
|
||||
if not base64_data:
|
||||
url = image_data.get("url", "") or (content.get("url", "") if isinstance(content, dict) else "")
|
||||
if url and isinstance(url, str) and url.startswith("data:image/"):
|
||||
# Extract base64 from data URI: data:image/png;base64,<base64>
|
||||
import re
|
||||
match = re.match(r'data:image/[^;]+;base64,(.+)', url)
|
||||
if match:
|
||||
base64_data = match.group(1)
|
||||
|
||||
if not base64_data:
|
||||
self.logger.warning(f"No base64 data found for image. Alt text: {alt_text}")
|
||||
return [Paragraph(f"[Image: {alt_text}]", self._createNormalStyle(styles))]
|
||||
|
||||
return []
|
||||
# Validate that base64_data is actually base64 (not the entire element rendered as text)
|
||||
if len(base64_data) > 10000: # Very long string might be entire element JSON
|
||||
self.logger.warning(f"Base64 data seems too long ({len(base64_data)} chars), might be incorrectly extracted")
|
||||
|
||||
# Ensure base64_data is a string, not bytes or other type
|
||||
if not isinstance(base64_data, str):
|
||||
self.logger.warning(f"Base64 data is not a string: {type(base64_data)}")
|
||||
return [Paragraph(f"[Image: {alt_text} - Invalid data type]", self._createNormalStyle(styles))]
|
||||
|
||||
try:
|
||||
from reportlab.platypus import Image as ReportLabImage
|
||||
from reportlab.lib.units import inch
|
||||
import base64
|
||||
import io
|
||||
|
||||
# Decode base64 image data
|
||||
imageBytes = base64.b64decode(base64_data)
|
||||
imageStream = io.BytesIO(imageBytes)
|
||||
|
||||
# Create reportlab Image element
|
||||
# Try to get image dimensions from PIL
|
||||
try:
|
||||
from PIL import Image as PILImage
|
||||
from reportlab.lib.pagesizes import A4
|
||||
|
||||
pilImage = PILImage.open(imageStream)
|
||||
originalWidth, originalHeight = pilImage.size
|
||||
|
||||
# Calculate available page dimensions (A4 with margins: 72pt left/right, 72pt top, 18pt bottom)
|
||||
pageWidth = A4[0] # 595.27 points
|
||||
pageHeight = A4[1] # 841.89 points
|
||||
leftMargin = 72
|
||||
rightMargin = 72
|
||||
topMargin = 72
|
||||
bottomMargin = 18
|
||||
|
||||
# Use actual frame dimensions from SimpleDocTemplate
|
||||
# Frame is smaller than page minus margins due to internal spacing
|
||||
# From error message: frame is 439.27559055118115 x 739.8897637795277
|
||||
# Use conservative values with safety margin
|
||||
availableWidth = 430.0 # Slightly smaller than frame width for safety
|
||||
availableHeight = 730.0 # Slightly smaller than frame height for safety
|
||||
|
||||
# Convert original image size from pixels to points
|
||||
# PIL provides size in pixels, need to convert to points
|
||||
# Standard conversion: 1 inch = 72 points, typical screen DPI = 96 pixels/inch
|
||||
# So: pixels * (72/96) = points, or pixels * 0.75 = points
|
||||
# But for images, we should use the image's actual DPI if available
|
||||
dpi = pilImage.info.get('dpi', (96, 96))[0] # Default to 96 DPI if not specified
|
||||
if dpi <= 0:
|
||||
dpi = 96 # Fallback to 96 DPI
|
||||
|
||||
# Convert pixels to points: 1 point = 1/72 inch, so pixels * (72/dpi) = points
|
||||
imgWidthPoints = originalWidth * (72.0 / dpi)
|
||||
imgHeightPoints = originalHeight * (72.0 / dpi)
|
||||
|
||||
# Scale to fit within available page dimensions while maintaining aspect ratio
|
||||
widthScale = availableWidth / imgWidthPoints if imgWidthPoints > 0 else 1.0
|
||||
heightScale = availableHeight / imgHeightPoints if imgHeightPoints > 0 else 1.0
|
||||
|
||||
# Use the smaller scale to ensure image fits both width and height
|
||||
scale = min(widthScale, heightScale, 1.0) # Don't scale up, only down
|
||||
|
||||
imgWidth = imgWidthPoints * scale
|
||||
imgHeight = imgHeightPoints * scale
|
||||
|
||||
# Additional safety check: ensure dimensions don't exceed available space
|
||||
if imgWidth > availableWidth:
|
||||
scale = availableWidth / imgWidth
|
||||
imgWidth = availableWidth
|
||||
imgHeight = imgHeight * scale
|
||||
|
||||
if imgHeight > availableHeight:
|
||||
scale = availableHeight / imgHeight
|
||||
imgHeight = availableHeight
|
||||
imgWidth = imgWidth * scale
|
||||
|
||||
# Reset stream for reportlab
|
||||
imageStream.seek(0)
|
||||
except Exception as e:
|
||||
# Fallback: use default size that fits page
|
||||
self.logger.warning(f"Error calculating image size: {str(e)}, using safe default")
|
||||
# Use a fixed 4 x 3 inch default that fits within the available frame
|
||||
imgWidth = 4 * inch # 288 points, below the 430pt available width
|
||||
imgHeight = 3 * inch # 216 points, below the 730pt available height
|
||||
imageStream.seek(0)
|
||||
|
||||
# Create reportlab Image
|
||||
reportlabImage = ReportLabImage(imageStream, width=imgWidth, height=imgHeight)
|
||||
|
||||
elements = [reportlabImage]
|
||||
|
||||
# Add caption if available
|
||||
if caption:
|
||||
captionStyle = self._createNormalStyle(styles)
|
||||
captionStyle.fontSize = 10
|
||||
captionStyle.textColor = self._hexToColor(styles.get("paragraph", {}).get("color", "#666666"))
|
||||
elements.append(Paragraph(f"<i>{caption}</i>", captionStyle))
|
||||
elif alt_text and alt_text != "Image":
|
||||
# Use alt text as caption if no caption provided
|
||||
captionStyle = self._createNormalStyle(styles)
|
||||
captionStyle.fontSize = 10
|
||||
captionStyle.textColor = self._hexToColor(styles.get("paragraph", {}).get("color", "#666666"))
|
||||
elements.append(Paragraph(f"<i>Figure: {alt_text}</i>", captionStyle))
|
||||
|
||||
return elements
|
||||
|
||||
except Exception as imgError:
|
||||
self.logger.error(f"Error embedding image in PDF: {str(imgError)}")
|
||||
# Return error message instead of placeholder
|
||||
errorStyle = self._createNormalStyle(styles)
|
||||
errorStyle.textColor = self._hexToColor("#FF0000") # Red color for error
|
||||
errorMsg = f"[Error: Could not embed image '{alt_text}'. {str(imgError)}]"
|
||||
return [Paragraph(errorMsg, errorStyle)]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering image: {str(e)}")
|
||||
return [Paragraph(f"[Image: {image_data.get('altText', 'Image')}]", self._createNormalStyle(styles))]
|
||||
self.logger.error(f"Error rendering image: {str(e)}")
|
||||
errorStyle = self._createNormalStyle(styles)
|
||||
errorStyle.textColor = self._hexToColor("#FF0000") # Red color for error
|
||||
errorMsg = f"[Error: Could not render image '{image_data.get('altText', 'Image')}'. {str(e)}]"
|
||||
return [Paragraph(errorMsg, errorStyle)]
|
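The data-URI fallback in `_renderJsonImage` above uses a regular expression to pull the base64 payload out of a `data:image/...;base64,` URL. A standalone sketch of that step, using the same pattern as the diff; the function name `base64FromDataUri` is an assumption made for illustration.

```python
import re
from typing import Optional

# Same pattern as in the diff: media subtype up to ';', then the base64 payload.
_DATA_URI = re.compile(r"data:image/[^;]+;base64,(.+)")


def base64FromDataUri(url: object) -> Optional[str]:
    """Return the base64 payload of an image data URI, or None if it is not one."""
    if not isinstance(url, str) or not url.startswith("data:image/"):
        return None
    match = _DATA_URI.match(url)
    return match.group(1) if match else None
```

Data URIs without the `;base64` marker do not match and return `None`, so they end up on the "no base64 data found" path.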
||||
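The sizing arithmetic in `_renderJsonImage` above (pixels to points via the image DPI, then a single downscale factor that fits both frame dimensions) can be checked independently of reportlab and PIL. The sketch below assumes the 430 x 730 point frame and the 96 DPI default from the diff; `fitImageToFrame` is a hypothetical helper name.

```python
from typing import Tuple


def fitImageToFrame(
    widthPx: float,
    heightPx: float,
    dpi: float = 96.0,
    availableWidth: float = 430.0,
    availableHeight: float = 730.0,
) -> Tuple[float, float]:
    """Convert pixel size to points and shrink (never enlarge) to fit the frame."""
    if dpi <= 0:
        dpi = 96.0  # same fallback as in the diff
    # 1 point = 1/72 inch, so points = pixels * 72 / dpi.
    widthPt = widthPx * (72.0 / dpi)
    heightPt = heightPx * (72.0 / dpi)
    widthScale = availableWidth / widthPt if widthPt > 0 else 1.0
    heightScale = availableHeight / heightPt if heightPt > 0 else 1.0
    # The smaller factor keeps the aspect ratio; capping at 1.0 avoids upscaling.
    scale = min(widthScale, heightScale, 1.0)
    return widthPt * scale, heightPt * scale
```

For a 1920 x 1080 pixel image at 96 DPI this gives 1440 x 810 points before scaling and roughly 430 x 241.875 points after, since the width is the binding constraint. Because the scale already satisfies both limits, the two additional "safety check" branches in the diff should only trigger through floating-point rounding.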
File diff suppressed because it is too large
|
|
@ -5,7 +5,8 @@ Text renderer for report generation.
|
|||
"""
|
||||
|
||||
from .rendererBaseTemplate import BaseRenderer
|
||||
from typing import Dict, Any, Tuple, List
|
||||
from modules.datamodels.datamodelDocument import RenderedDocument
|
||||
from typing import Dict, Any, List
|
||||
|
||||
class RendererText(BaseRenderer):
|
||||
"""Renders content to plain text format with format-specific extraction."""
|
||||
|
|
@ -47,31 +48,64 @@ class RendererText(BaseRenderer):
|
|||
"""Return priority for text renderer."""
|
||||
return 90
|
||||
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> Tuple[str, str]:
|
||||
async def render(self, extractedContent: Dict[str, Any], title: str, userPrompt: str = None, aiService=None) -> List[RenderedDocument]:
|
||||
"""Render extracted JSON content to plain text format."""
|
||||
try:
|
||||
# Generate text from JSON structure
|
||||
textContent = self._generateTextFromJson(extractedContent, title)
|
||||
|
||||
return textContent, "text/plain"
|
||||
# Determine filename from document or title
|
||||
documents = extractedContent.get("documents", [])
|
||||
if documents and isinstance(documents[0], dict):
|
||||
filename = documents[0].get("filename")
|
||||
if not filename:
|
||||
filename = self._determineFilename(title, "text/plain")
|
||||
else:
|
||||
filename = self._determineFilename(title, "text/plain")
|
||||
|
||||
# Extract metadata for document type and other info
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=textContent.encode('utf-8'),
|
||||
mimeType="text/plain",
|
||||
filename=filename,
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error rendering text: {str(e)}")
|
||||
# Return minimal text fallback
|
||||
return f"{title}\n\nError rendering report: {str(e)}", "text/plain"
|
||||
fallbackContent = f"{title}\n\nError rendering report: {str(e)}"
|
||||
metadata = extractedContent.get("metadata", {}) if extractedContent else {}
|
||||
documentType = metadata.get("documentType") if isinstance(metadata, dict) else None
|
||||
return [
|
||||
RenderedDocument(
|
||||
documentData=fallbackContent.encode('utf-8'),
|
||||
mimeType="text/plain",
|
||||
filename=self._determineFilename(title, "text/plain"),
|
||||
documentType=documentType,
|
||||
metadata=metadata if isinstance(metadata, dict) else None
|
||||
)
|
||||
]
|
||||
|
||||
def _generateTextFromJson(self, jsonContent: Dict[str, Any], title: str) -> str:
|
||||
"""Generate text content from structured JSON document."""
|
||||
try:
|
||||
# Validate JSON structure
|
||||
if not isinstance(jsonContent, dict):
|
||||
raise ValueError("JSON content must be a dictionary")
|
||||
if not self._validateJsonStructure(jsonContent):
|
||||
raise ValueError("JSON content must follow standardized schema: {metadata: {...}, documents: [{sections: [...]}]}")
|
||||
|
||||
if "sections" not in jsonContent:
|
||||
raise ValueError("JSON content must contain 'sections' field")
|
||||
# Extract sections and metadata from standardized schema
|
||||
sections = self._extractSections(jsonContent)
|
||||
metadata = self._extractMetadata(jsonContent)
|
||||
|
||||
# Use title from JSON metadata if available, otherwise use provided title
|
||||
documentTitle = jsonContent.get("metadata", {}).get("title", title)
|
||||
documentTitle = metadata.get("title", title)
|
||||
|
||||
# Build text content
|
||||
textParts = []
|
||||
|
|
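The change of `render` from returning a `(content, mimeType)` tuple to returning a list of `RenderedDocument` objects can be sketched with a stand-in type. Everything named here is an assumption for illustration: `RenderedDocumentSketch` only mirrors the five keyword arguments visible in the hunk above and is not the real `modules.datamodels.datamodelDocument.RenderedDocument`, and the `f"{title}.txt"` fallback stands in for `_determineFilename`, whose behavior the diff does not show.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class RenderedDocumentSketch:
    """Hypothetical stand-in carrying the fields passed in the diff."""
    documentData: bytes
    mimeType: str
    filename: str
    documentType: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


def renderTextSketch(
    extractedContent: Dict[str, Any], title: str, textContent: str
) -> List[RenderedDocumentSketch]:
    """Wrap rendered text in a one-element document list, as the new render() does."""
    documents = extractedContent.get("documents", [])
    filename = None
    if documents and isinstance(documents[0], dict):
        filename = documents[0].get("filename")
    if not filename:
        filename = f"{title}.txt"  # assumed fallback, not the real helper
    metadata = extractedContent.get("metadata", {})
    isDict = isinstance(metadata, dict)
    return [
        RenderedDocumentSketch(
            documentData=textContent.encode("utf-8"),
            mimeType="text/plain",
            filename=filename,
            documentType=metadata.get("documentType") if isDict else None,
            metadata=metadata if isDict else None,
        )
    ]
```

Callers that previously unpacked two values from `render` would need to iterate the list and read `documentData` and `mimeType` from each entry instead.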
@ -82,7 +116,6 @@ class RendererText(BaseRenderer):
|
|||
textParts.append("")
|
||||
|
||||
# Process each section
|
||||
sections = jsonContent.get("sections", [])
|
||||
for section in sections:
|
||||
sectionText = self._renderJsonSection(section)
|
||||
if sectionText:
|
||||
|
|
@ -100,41 +133,75 @@ class RendererText(BaseRenderer):
|
|||
raise Exception(f"Text generation failed: {str(e)}")
|
||||
|
||||
def _renderJsonSection(self, section: Dict[str, Any]) -> str:
|
||||
"""Render a single JSON section to text."""
|
||||
"""Render a single JSON section to text.
|
||||
Supports three content formats: reference, object (base64), extracted_text.
|
||||
"""
|
||||
try:
|
||||
sectionType = self._getSectionType(section)
|
||||
sectionData = self._getSectionData(section)
|
||||
|
||||
if sectionType == "table":
|
||||
# Process the section data to extract table structure
|
||||
processedData = self._processSectionByType(section)
|
||||
return self._renderJsonTable(processedData)
|
||||
elif sectionType == "bullet_list":
|
||||
# Process the section data to extract bullet list structure
|
||||
processedData = self._processSectionByType(section)
|
||||
return self._renderJsonBulletList(processedData)
|
||||
elif sectionType == "heading":
|
||||
# Render each heading element in the elements array
|
||||
# sectionData is already the elements array from _getSectionData
|
||||
renderedElements = []
|
||||
# Check for three content formats from Phase 5D in elements
|
||||
if isinstance(sectionData, list):
|
||||
textParts = []
|
||||
for element in sectionData:
|
||||
renderedElements.append(self._renderJsonHeading(element))
|
||||
return "\n".join(renderedElements)
|
||||
element_type = element.get("type", "") if isinstance(element, dict) else ""
|
||||
|
||||
# Support three content formats from Phase 5D
|
||||
if element_type == "reference":
|
||||
# Document reference format
|
||||
doc_ref = element.get("documentReference", "")
|
||||
label = element.get("label", "Reference")
|
||||
textParts.append(f"[Reference: {label}]")
|
||||
continue
|
||||
elif element_type == "extracted_text":
|
||||
# Extracted text format
|
||||
content = element.get("content", "")
|
||||
source = element.get("source", "")
|
||||
if content:
|
||||
source_text = f" (Source: {source})" if source else ""
|
||||
textParts.append(f"{content}{source_text}")
|
||||
continue
|
||||
|
||||
# If we processed reference/extracted_text elements, return them
|
||||
if textParts:
|
||||
return '\n\n'.join(textParts)
|
||||
|
||||
if sectionType == "table":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonTable(element)
|
||||
return ""
|
||||
elif sectionType == "bullet_list":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonBulletList(element)
|
||||
return ""
|
||||
elif sectionType == "heading":
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonHeading(element)
|
||||
return ""
|
||||
elif sectionType == "paragraph":
|
||||
# Render each paragraph element in the elements array
|
||||
# sectionData is already the elements array from _getSectionData
|
||||
renderedElements = []
|
||||
for element in sectionData:
|
||||
renderedElements.append(self._renderJsonParagraph(element))
|
||||
return "\n".join(renderedElements)
|
||||
elif sectionType == "code_block":
|
||||
# Process the section data to extract code block structure
|
||||
processedData = self._processSectionByType(section)
|
||||
return self._renderJsonCodeBlock(processedData)
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonCodeBlock(element)
|
||||
return ""
|
||||
elif sectionType == "image":
|
||||
# Process the section data to extract image structure
|
||||
processedData = self._processSectionByType(section)
|
||||
return self._renderJsonImage(processedData)
|
||||
# Work directly with elements like other renderers
|
||||
if isinstance(sectionData, list) and sectionData:
|
||||
element = sectionData[0] if isinstance(sectionData[0], dict) else {}
|
||||
return self._renderJsonImage(element)
|
||||
return ""
|
||||
else:
|
||||
# Fallback to paragraph for unknown types - render each element
|
||||
# sectionData is already the elements array from _getSectionData
|
||||
|
|
@ -150,8 +217,12 @@ class RendererText(BaseRenderer):
|
|||
def _renderJsonTable(self, tableData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON table to text."""
|
||||
try:
|
||||
headers = tableData.get("headers", [])
|
||||
rows = tableData.get("rows", [])
|
||||
# Extract from nested content structure: element.content.{headers, rows}
|
||||
content = tableData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
headers = content.get("headers", [])
|
||||
rows = content.get("rows", [])
|
||||
|
||||
if not headers or not rows:
|
||||
return ""
|
||||
|
|
@ -180,7 +251,11 @@ class RendererText(BaseRenderer):
|
|||
def _renderJsonBulletList(self, listData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON bullet list to text."""
|
||||
try:
|
||||
items = listData.get("items", [])
|
||||
# Extract from nested content structure: element.content.{items}
|
||||
content = listData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
items = content.get("items", [])
|
||||
|
||||
if not items:
|
||||
return ""
|
||||
|
|
@ -201,8 +276,12 @@ class RendererText(BaseRenderer):
|
|||
def _renderJsonHeading(self, headingData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON heading to text."""
|
||||
try:
|
||||
level = headingData.get("level", 1)
|
||||
text = headingData.get("text", "")
|
||||
# Extract from nested content structure: element.content.{text, level}
|
||||
content = headingData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
text = content.get("text", "")
|
||||
level = content.get("level", 1)
|
||||
|
||||
if text:
|
||||
level = max(1, min(6, level))
|
||||
|
|
@ -222,7 +301,14 @@ class RendererText(BaseRenderer):
|
|||
def _renderJsonParagraph(self, paragraphData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON paragraph to text."""
|
||||
try:
|
||||
text = paragraphData.get("text", "")
|
||||
# Extract from nested content structure
|
||||
content = paragraphData.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
text = content.get("text", "")
|
||||
elif isinstance(content, str):
|
||||
text = content
|
||||
else:
|
||||
text = ""
|
||||
return text if text else ""
|
||||
|
||||
except Exception as e:
|
||||
|
|
@ -232,8 +318,12 @@ class RendererText(BaseRenderer):
|
|||
def _renderJsonCodeBlock(self, codeData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON code block to text."""
|
||||
try:
|
||||
code = codeData.get("code", "")
|
||||
language = codeData.get("language", "")
|
||||
# Extract from nested content structure: element.content.{code, language}
|
||||
content = codeData.get("content", {})
|
||||
if not isinstance(content, dict):
|
||||
return ""
|
||||
code = content.get("code", "")
|
||||
language = content.get("language", "")
|
||||
|
||||
if code:
|
||||
if language:
|
||||
|
|
@ -250,9 +340,14 @@ class RendererText(BaseRenderer):
|
|||
def _renderJsonImage(self, imageData: Dict[str, Any]) -> str:
|
||||
"""Render a JSON image to text."""
|
||||
try:
|
||||
altText = imageData.get("altText", "Image")
|
||||
# Extract from nested content structure: element.content.{base64Data, altText, caption}
|
||||
content = imageData.get("content", {})
|
||||
if isinstance(content, dict):
|
||||
altText = content.get("altText", "Image")
|
||||
else:
|
||||
altText = imageData.get("altText", "Image")
|
||||
return f"[Image: {altText}]"
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error rendering image: {str(e)}")
|
||||
return f"[Image: {imageData.get('altText', 'Image')}]"
|
||||
return f"[Image: Image]"
|
||||
|
|
|
|||
File diff suppressed because it is too large
944
modules/services/serviceGeneration/subContentGenerator.py
Normal file
|
|
@ -0,0 +1,944 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Content Generator for hierarchical document generation.
|
||||
Generates content for each section in the document structure.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import asyncio
|
||||
import json
|
||||
import base64
|
||||
import re
|
||||
import traceback
|
||||
from typing import Dict, Any, Optional, List, Callable
|
||||
from modules.services.serviceGeneration.subContentIntegrator import ContentIntegrator
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ContentGenerator:
|
||||
"""Generates content for document sections"""
|
||||
|
||||
def __init__(self, services: Any):
|
||||
self.services = services
|
||||
self.integrator = ContentIntegrator(services)
|
||||
|
||||
async def generateContent(
|
||||
self,
|
||||
structure: Dict[str, Any],
|
||||
cachedContent: Optional[Dict[str, Any]] = None,
|
||||
userPrompt: str = "",
|
||||
contentParts: Optional[List[Any]] = None,
|
||||
progressCallback: Optional[Callable] = None,
|
||||
parallelGeneration: bool = True,
|
||||
batchSize: int = 10
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Generate content for all sections in structure.
|
||||
|
||||
Args:
|
||||
structure: Document structure from Phase 1 (with contentPartIds per section)
|
||||
cachedContent: Extracted content cache
|
||||
userPrompt: Original user prompt
|
||||
contentParts: List of all available ContentParts (for mapping by contentPartIds)
|
||||
progressCallback: Function to call for progress updates
|
||||
parallelGeneration: Enable parallel section generation
|
||||
batchSize: Number of sections to process in parallel
|
||||
|
||||
Returns:
|
||||
Complete document structure with populated elements
|
||||
"""
|
||||
try:
|
||||
documents = structure.get("documents", [])
|
||||
|
||||
if not documents:
|
||||
logger.warning("No documents found in structure")
|
||||
return structure
|
||||
|
||||
allGeneratedSections = []
|
||||
totalSectionsAcrossDocs = 0
|
||||
|
||||
# Count total sections for progress tracking
|
||||
for doc in documents:
|
||||
totalSectionsAcrossDocs += len(doc.get("sections", []))
|
||||
|
||||
if progressCallback:
|
||||
progressCallback(0, totalSectionsAcrossDocs, "Starting content generation...")
|
||||
|
||||
currentSectionIndex = 0
|
||||
|
||||
for docIdx, doc in enumerate(documents):
|
||||
sections = doc.get("sections", [])
|
||||
totalSections = len(sections)
|
||||
|
||||
if totalSections == 0:
|
||||
continue
|
||||
|
||||
# Determine if parallel generation is beneficial
|
||||
# Use sequential if only 1 section or if parallelGeneration is disabled
|
||||
useParallel = parallelGeneration and totalSections > 1
|
||||
|
||||
# Count images - if many images, parallel is still beneficial but slower
|
||||
imageCount = sum(1 for s in sections if s.get("content_type") == "image")
|
||||
|
||||
if progressCallback and docIdx > 0:
|
||||
progressCallback(
|
||||
currentSectionIndex,
|
||||
totalSectionsAcrossDocs,
|
||||
f"Processing document {docIdx + 1}/{len(documents)}..."
|
||||
)
|
||||
|
||||
if useParallel:
|
||||
# Generate in batches for parallel processing
|
||||
generatedSections = await self._generateSectionsParallel(
|
||||
sections=sections,
|
||||
cachedContent=cachedContent,
|
||||
userPrompt=userPrompt,
|
||||
contentParts=contentParts, # Pass ContentParts for section generation
|
||||
documentMetadata=structure.get("metadata", {}),
|
||||
progressCallback=lambda idx, total, msg: progressCallback(
|
||||
currentSectionIndex + idx,
|
||||
totalSectionsAcrossDocs,
|
||||
msg
|
||||
) if progressCallback else None,
|
||||
batchSize=batchSize
|
||||
)
|
||||
else:
|
||||
# Generate sequentially (better for context-dependent sections)
|
||||
generatedSections = await self._generateSectionsSequential(
|
||||
sections=sections,
|
||||
cachedContent=cachedContent,
|
||||
userPrompt=userPrompt,
|
||||
contentParts=contentParts, # Pass ContentParts for section generation
|
||||
documentMetadata=structure.get("metadata", {}),
|
||||
progressCallback=lambda idx, total, msg: progressCallback(
|
||||
currentSectionIndex + idx,
|
||||
totalSectionsAcrossDocs,
|
||||
msg
|
||||
) if progressCallback else None
|
||||
)
|
||||
|
||||
allGeneratedSections.extend(generatedSections)
|
||||
currentSectionIndex += totalSections
|
||||
|
||||
if progressCallback:
|
||||
progressCallback(
|
||||
totalSectionsAcrossDocs,
|
||||
totalSectionsAcrossDocs,
|
||||
"Content generation complete"
|
||||
)
|
||||
|
||||
# Integrate generated content into structure
|
||||
completeStructure = self.integrator.integrateContent(
|
||||
structure=structure,
|
||||
generatedSections=allGeneratedSections
|
||||
)
|
||||
|
||||
return completeStructure
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating content: {str(e)}")
|
||||
raise
|
||||
|
||||
async def _generateSectionsSequential(
|
||||
self,
|
||||
sections: List[Dict[str, Any]],
|
||||
cachedContent: Optional[Dict[str, Any]],
|
||||
userPrompt: str,
|
||||
contentParts: Optional[List[Any]] = None,
|
||||
documentMetadata: Dict[str, Any] = {},
|
||||
progressCallback: Optional[Callable] = None
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Generate sections sequentially with enhanced progress tracking.
|
||||
Uses previous sections for context continuity.
|
||||
"""
|
||||
generatedSections = []
|
||||
previousSections = []
|
||||
totalSections = len(sections)
|
||||
|
||||
# Create ContentParts lookup map by ID
|
||||
contentPartsMap = {}
|
||||
if contentParts:
|
||||
for part in contentParts:
|
||||
partId = part.id if hasattr(part, 'id') else part.get('id', '')
|
||||
if partId:
|
||||
contentPartsMap[partId] = part
|
||||
|
||||
for idx, section in enumerate(sections):
|
||||
try:
|
||||
contentType = section.get("content_type", "content")
|
||||
sectionId = section.get("id", f"section_{idx}")
|
||||
|
||||
# Enhanced progress message
|
||||
if contentType == "image":
|
||||
message = f"Generating image: {(section.get('generation_hint') or 'Image')[:50]}..."
|
||||
elif contentType == "heading":
|
||||
message = "Generating heading..."
|
||||
elif contentType == "paragraph":
|
||||
message = "Generating paragraph..."
|
||||
else:
|
||||
message = f"Generating {contentType}..."
|
||||
|
||||
if progressCallback:
|
||||
progressCallback(
|
||||
idx + 1,
|
||||
totalSections,
|
||||
message
|
||||
)
|
||||
|
||||
# Get ContentParts for this section
|
||||
sectionContentPartIds = section.get("contentPartIds", [])
|
||||
sectionContentParts = []
|
||||
if sectionContentPartIds and contentPartsMap:
|
||||
for partId in sectionContentPartIds:
|
||||
if partId in contentPartsMap:
|
||||
sectionContentParts.append(contentPartsMap[partId])
|
||||
|
||||
context = {
|
||||
"userPrompt": userPrompt,
|
||||
"cachedContent": cachedContent,
|
||||
"previousSections": previousSections.copy(),
|
||||
"targetSection": section,
|
||||
"sectionContentParts": sectionContentParts, # ContentParts for this section
|
||||
"documentMetadata": documentMetadata,
|
||||
"operationId": None
|
||||
}
|
||||
|
||||
generated = await self._generateSectionContent(section, context)
|
||||
generatedSections.append(generated)
|
||||
previousSections.append(generated)
|
||||
|
||||
# Log success
|
||||
if contentType == "image":
|
||||
logger.info(f"Successfully generated image for section {sectionId}")
|
||||
elif not generated.get("error"):
|
||||
logger.debug(f"Successfully generated {contentType} for section {sectionId}")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating section {section.get('id')}: {str(e)}")
|
||||
errorSection = self.integrator.createErrorSection(section, str(e))
|
||||
generatedSections.append(errorSection)
|
||||
previousSections.append(errorSection)
|
||||
|
||||
return generatedSections
|
||||
|
||||
async def _generateSectionsParallel(
|
||||
self,
|
||||
sections: List[Dict[str, Any]],
|
||||
cachedContent: Optional[Dict[str, Any]],
|
||||
userPrompt: str,
|
||||
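documentMetadata: Dict[str, Any],
|
||||
contentParts: Optional[List[Any]] = None,
|
||||
progressCallback: Optional[Callable] = None,
|
||||
batchSize: int = 10
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Generate sections in parallel batches with enhanced progress tracking.
|
||||
|
||||
Args:
|
||||
sections: List of sections to generate
|
||||
cachedContent: Extracted content cache
|
||||
userPrompt: Original user prompt
|
||||
documentMetadata: Document metadata
|
||||
contentParts: List of all available ContentParts (for mapping by contentPartIds)
|
||||
progressCallback: Progress callback function
|
||||
batchSize: Number of sections to process in parallel per batch
|
||||
|
||||
Returns:
|
||||
List of generated sections
|
||||
"""
|
||||
generatedSections = []
|
||||
|
||||
# Create ContentParts lookup map by ID (same approach as the sequential path)
|
||||
contentPartsMap = {}
|
||||
if contentParts:
|
||||
for part in contentParts:
|
||||
partId = part.id if hasattr(part, 'id') else part.get('id', '')
|
||||
if partId:
|
||||
contentPartsMap[partId] = part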
|
||||
totalSections = len(sections)
|
||||
|
||||
if totalSections == 0:
|
||||
return []
|
||||
|
||||
# Adjust batch size based on section types (images take longer)
|
||||
imageCount = sum(1 for s in sections if s.get("content_type") == "image")
|
||||
if imageCount > 0:
|
||||
# Reduce batch size if many images (images are slower)
|
||||
adjustedBatchSize = min(batchSize, max(3, batchSize - imageCount // 2))
|
||||
else:
|
||||
adjustedBatchSize = batchSize
|
||||
|
||||
# Process in batches
|
||||
totalBatches = (totalSections + adjustedBatchSize - 1) // adjustedBatchSize
|
||||
accumulatedPreviousSections = [] # Track sections from previous batches
|
||||
|
||||
for batchNum, batchStart in enumerate(range(0, totalSections, adjustedBatchSize)):
|
||||
batch = sections[batchStart:batchStart + adjustedBatchSize]
|
||||
batchEnd = min(batchStart + adjustedBatchSize, totalSections)
|
||||
|
||||
if progressCallback:
|
||||
progressCallback(
|
||||
batchStart,
|
||||
totalSections,
|
||||
f"Processing batch {batchNum + 1}/{totalBatches} ({len(batch)} sections)..."
|
||||
)
|
||||
|
||||
async def generateWithProgress(section: Dict[str, Any], globalIndex: int, localIndex: int, batchPreviousSections: List[Dict[str, Any]]):
|
||||
try:
|
||||
contentType = section.get("content_type", "content")
|
||||
sectionId = section.get("id", f"section_{globalIndex}")
|
||||
|
||||
# Enhanced progress message based on content type
|
||||
if contentType == "image":
|
||||
message = f"Generating image: {(section.get('generation_hint') or 'Image')[:50]}..."
|
||||
elif contentType == "heading":
|
||||
message = "Generating heading..."
|
||||
elif contentType == "paragraph":
|
||||
message = "Generating paragraph..."
|
||||
else:
|
||||
message = f"Generating {contentType}..."
|
||||
|
||||
if progressCallback:
|
||||
progressCallback(
|
||||
globalIndex + 1,
|
||||
totalSections,
|
||||
message
|
||||
)
|
||||
|
||||
# Get ContentParts for this section
|
||||
sectionContentPartIds = section.get("contentPartIds", [])
|
||||
sectionContentParts = []
|
||||
if sectionContentPartIds and contentPartsMap:
|
||||
for partId in sectionContentPartIds:
|
||||
if partId in contentPartsMap:
|
||||
sectionContentParts.append(contentPartsMap[partId])
|
||||
|
||||
context = {
|
||||
"userPrompt": userPrompt,
|
||||
"cachedContent": cachedContent,
|
||||
"previousSections": batchPreviousSections.copy(), # Include sections from previous batches
|
||||
"targetSection": section,
|
||||
"sectionContentParts": sectionContentParts, # ContentParts for this section
|
||||
"documentMetadata": documentMetadata,
|
||||
"operationId": None # Can be set if needed for nested progress
|
||||
}
|
||||
|
||||
result = await self._generateSectionContent(section, context)
|
||||
|
||||
# Log success
|
||||
if contentType == "image":
|
||||
logger.info(f"Successfully generated image for section {sectionId}")
|
||||
elif not result.get("error"):
|
||||
logger.debug(f"Successfully generated {contentType} for section {sectionId}")
|
||||
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating section {section.get('id')}: {str(e)}")
|
||||
return self.integrator.createErrorSection(section, str(e))
|
||||
|
||||
# Generate batch in parallel
|
||||
# Pass accumulated previous sections to each task in this batch
|
||||
batchTasks = [
|
||||
generateWithProgress(section, batchStart + idx, idx, accumulatedPreviousSections)
|
||||
for idx, section in enumerate(batch)
|
||||
]
|
||||
|
||||
batchResults = await asyncio.gather(
|
||||
*batchTasks,
|
||||
return_exceptions=True
|
||||
)
|
||||
|
||||
# Handle exceptions and collect results
|
||||
for idx, result in enumerate(batchResults):
|
||||
if isinstance(result, Exception):
|
||||
logger.error(f"Error in parallel generation batch {batchNum + 1}: {str(result)}")
|
||||
errorSection = self.integrator.createErrorSection(batch[idx], str(result))
|
||||
generatedSections.append(errorSection)
|
||||
accumulatedPreviousSections.append(errorSection) # Add to accumulated for next batch
|
||||
else:
|
||||
generatedSections.append(result)
|
||||
accumulatedPreviousSections.append(result) # Add to accumulated for next batch
|
||||
|
||||
# Update progress after batch completion
|
||||
if progressCallback:
|
||||
progressCallback(
|
||||
batchEnd,
|
||||
totalSections,
|
||||
f"Completed batch {batchNum + 1}/{totalBatches}"
|
||||
)
|
||||
|
||||
return generatedSections
|
||||
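Review note: the batch-size rule in `_generateSectionsParallel` can be checked in isolation. The sketch below is a minimal, standalone restatement of that arithmetic, not part of the commit; the helper name `planBatches` is hypothetical and only mirrors the `adjustedBatchSize` and `range(0, totalSections, adjustedBatchSize)` logic shown above.

```python
from typing import List, Tuple


def planBatches(totalSections: int, imageCount: int, batchSize: int = 10) -> List[Tuple[int, int]]:
    """Return (batchStart, batchEnd) index pairs for the given section count.

    Hypothetical helper: it restates the batch-size adjustment from the diff
    so the resulting batch boundaries can be inspected without any AI calls.
    """
    if imageCount > 0:
        # Images are slower, so shrink the batch, but never below 3
        # (unless the requested batchSize itself is smaller).
        adjustedBatchSize = min(batchSize, max(3, batchSize - imageCount // 2))
    else:
        adjustedBatchSize = batchSize

    return [
        (batchStart, min(batchStart + adjustedBatchSize, totalSections))
        for batchStart in range(0, totalSections, adjustedBatchSize)
    ]
```

With 25 sections, 8 of them images, and the default `batchSize` of 10, the adjusted size is 6, so the sections are split into four batches of 6 and a final batch of 1.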
|
||||
async def _generateSectionContent(
|
||||
self,
|
||||
section: Dict[str, Any],
|
||||
context: Dict[str, Any]
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Generate content for a single section.
|
||||
|
||||
Args:
|
||||
section: Section to generate content for
|
||||
context: Generation context
|
||||
|
||||
Returns:
|
||||
Section with populated elements array
|
||||
"""
|
||||
try:
|
||||
contentType = section.get("content_type", "")
|
||||
complexity = section.get("complexity", "simple")
|
||||
|
||||
if contentType == "image":
|
||||
return await self._generateImageSection(section, context)
|
||||
elif complexity == "complex":
|
||||
return await self._generateComplexTextSection(section, context)
|
||||
else:
|
||||
return await self._generateSimpleSection(section, context)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating section {section.get('id')}: {str(e)}")
|
||||
return self.integrator.createErrorSection(section, str(e))
|
||||
|
||||
async def _generateSimpleSection(
|
||||
self,
|
||||
section: Dict[str, Any],
|
||||
context: Dict[str, Any]
|
||||
) -> Dict[str, Any]:
|
||||
"""Generate content for simple section (heading, paragraph)"""
|
||||
try:
|
||||
contentType = section.get("content_type", "")
|
||||
generationHint = section.get("generation_hint", "")
|
||||
|
||||
# Create section-specific prompt
|
||||
sectionPrompt = self._createSectionPrompt(section, context)
|
||||
|
||||
# Debug: Log section generation prompt (harmonized - no checks needed)
|
||||
sectionId = section.get('id', 'unknown')
|
||||
contentType = section.get('content_type', 'unknown')
|
||||
self.services.utils.writeDebugFile(
|
||||
sectionPrompt,
|
||||
f"document_generation_section_{sectionId}_{contentType}_prompt"
|
||||
)
|
||||
|
||||
# Call AI to generate content
|
||||
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum
|
||||
|
||||
options = AiCallOptions(
|
||||
operationType=OperationTypeEnum.DATA_GENERATE,
|
||||
resultFormat="json"
|
||||
)
|
||||
|
||||
aiResponse = await self.services.ai.callAiContent(
|
||||
prompt=sectionPrompt,
|
||||
options=options,
|
||||
outputFormat="json"
|
||||
)
|
||||
|
||||
# Debug: Log section generation response (harmonized - no checks needed)
|
||||
sectionId = section.get('id', 'unknown')
|
||||
contentType = section.get('content_type', 'unknown')
|
||||
|
||||
responseContent = ''
|
||||
if aiResponse:
|
||||
if hasattr(aiResponse, 'content') and aiResponse.content:
|
||||
responseContent = aiResponse.content
|
||||
elif hasattr(aiResponse, 'documents') and aiResponse.documents:
|
||||
responseContent = f"[Response has {len(aiResponse.documents)} documents]"
|
||||
else:
|
||||
responseContent = f"[Response object: {type(aiResponse).__name__}, attributes: {dir(aiResponse)}]"
|
||||
else:
|
||||
responseContent = '[No response object]'
|
||||
|
||||
# Debug: Write section generation response to debug file (harmonized - no checks needed)
|
||||
self.services.utils.writeDebugFile(
|
||||
responseContent,
|
||||
f"document_generation_section_{sectionId}_{contentType}_response"
|
||||
)
|
||||
logger.debug(f"Logged section response for {sectionId} ({len(responseContent)} chars)")
|
||||
|
||||
if not aiResponse or not aiResponse.content:
|
||||
logger.error(f"AI section generation returned empty response for section {sectionId}")
|
||||
logger.error(f"Response object: {aiResponse}, has content: {hasattr(aiResponse, 'content') if aiResponse else False}")
|
||||
raise ValueError("AI section generation returned empty response")
|
||||
|
||||
# Extract JSON elements
|
||||
rawContent = aiResponse.content if aiResponse and aiResponse.content else ""
|
||||
if not rawContent or not rawContent.strip():
|
||||
logger.error(f"AI section generation returned empty response for section {sectionId}")
|
||||
logger.error(f"Response object: {aiResponse}, content length: {len(rawContent) if rawContent else 0}")
|
||||
raise ValueError("AI section generation returned empty response")
|
||||
|
||||
extractedJson = self.services.utils.jsonExtractString(rawContent)
|
||||
if not extractedJson or not extractedJson.strip():
|
||||
logger.error(f"No JSON found in AI response for section {sectionId}")
|
||||
logger.error(f"Raw response (first 1000 chars): {rawContent[:1000]}")
|
||||
logger.error(f"Extracted JSON (first 500 chars): {extractedJson[:500] if extractedJson else 'None'}")
|
||||
raise ValueError("No JSON found in AI section response")
|
||||
|
||||
# json is already imported at module level
|
||||
try:
|
||||
elementsData = json.loads(extractedJson)
|
||||
logger.debug(f"Parsed JSON for section {section.get('id')}: type={type(elementsData)}, keys={list(elementsData.keys()) if isinstance(elementsData, dict) else 'N/A'}")
|
||||
except json.JSONDecodeError as e:
|
||||
logger.error(f"Failed to parse JSON from AI response for section {section.get('id')}")
|
||||
logger.error(f"JSON decode error: {str(e)}")
|
||||
logger.error(f"Extracted JSON length: {len(extractedJson)} chars")
|
||||
logger.error(f"Extracted JSON (first 1000 chars): {extractedJson[:1000]}")
|
||||
if len(extractedJson) > 1000:
|
||||
logger.error(f"Extracted JSON (last 500 chars): {extractedJson[-500:]}")
|
||||
logger.error(f"Raw AI response length: {len(rawContent)} chars")
|
||||
logger.error(f"Raw AI response (first 1000 chars): {rawContent[:1000] if rawContent else 'None'}")
|
||||
|
||||
# Try to recover from truncated JSON if it looks like it was cut off
|
||||
if "Expecting" in str(e) and ("delimiter" in str(e) or "value" in str(e)):
|
||||
# Check if JSON starts correctly but is truncated
|
||||
if extractedJson.strip().startswith('{"elements"'):
|
||||
logger.warning(f"JSON appears truncated, attempting recovery...")
|
||||
# Use closeJsonStructures which handles unterminated strings properly
|
||||
try:
|
||||
from modules.shared.jsonUtils import closeJsonStructures
|
||||
recoveredJson = closeJsonStructures(extractedJson)
|
||||
|
||||
logger.info(f"Attempting to parse recovered JSON (closed structures)")
|
||||
logger.debug(f"Recovered JSON length: {len(recoveredJson)} chars (original: {len(extractedJson)} chars)")
|
||||
|
||||
elementsData = json.loads(recoveredJson)
|
||||
logger.info(f"Successfully recovered JSON for section {section.get('id')}")
|
||||
except (json.JSONDecodeError, ValueError) as recoveryError:
|
||||
logger.error(f"JSON recovery failed: {str(recoveryError)}")
|
||||
logger.error(f"Recovered JSON (first 500 chars): {recoveredJson[:500] if 'recoveredJson' in locals() else 'N/A'}")
|
||||
logger.error(f"Recovered JSON (last 200 chars): {recoveredJson[-200:] if 'recoveredJson' in locals() else 'N/A'}")
|
||||
|
||||
# Last resort: try to extract partial content and create minimal valid JSON
|
||||
try:
|
||||
# Try to extract text content before the truncation point
|
||||
# re is already imported at module level
|
||||
# Look for text field that might be partially complete
|
||||
textMatch = re.search(r'"text"\s*:\s*"([^"]*)', extractedJson)
|
||||
if textMatch:
|
||||
partialText = textMatch.group(1)
|
||||
# Create minimal valid JSON with truncated text marked
|
||||
elementsData = {
|
||||
"elements": [{
|
||||
"text": partialText + "... [Content truncated due to token limit]"
|
||||
}]
|
||||
}
|
||||
logger.warning(f"Created minimal JSON structure with truncated text for section {section.get('id')}")
|
||||
else:
|
||||
# If no text found, create empty structure
|
||||
elementsData = {"elements": []}
|
||||
logger.warning(f"Created empty JSON structure for section {section.get('id')} due to recovery failure")
|
||||
except Exception as fallbackError:
|
||||
logger.error(f"Fallback recovery also failed: {str(fallbackError)}")
|
||||
# Check if raw response might be truncated
|
||||
if len(rawContent) <= len(extractedJson) + 100: # Raw content is similar length to extracted
|
||||
logger.warning(f"Raw AI response may be truncated (length: {len(rawContent)} chars)")
|
||||
logger.warning(f"Consider increasing max_tokens for AI calls or checking token limits")
|
||||
raise ValueError(f"Invalid JSON in AI response (truncated?): {str(e)}")
|
||||
else:
|
||||
raise ValueError(f"Invalid JSON in AI response: {str(e)}")
|
||||
else:
|
||||
raise ValueError(f"Invalid JSON in AI response: {str(e)}")
|
||||
|
||||
# Extract elements array - handle various response formats
|
||||
elements = None
|
||||
|
||||
if isinstance(elementsData, dict):
|
||||
# Try to find elements in various possible locations
|
||||
if "elements" in elementsData:
|
||||
elements = elementsData["elements"]
|
||||
elif "content" in elementsData and isinstance(elementsData["content"], list):
|
||||
# Some models return {"content": [...]}
|
||||
elements = elementsData["content"]
|
||||
elif "data" in elementsData and isinstance(elementsData["data"], list):
|
||||
# Some models return {"data": [...]}
|
||||
elements = elementsData["data"]
|
||||
elif len(elementsData) == 1:
|
||||
# Single key dict - might be the elements directly
|
||||
firstValue = list(elementsData.values())[0]
|
||||
if isinstance(firstValue, list):
|
||||
elements = firstValue
|
||||
else:
|
||||
# Try to convert entire dict to a single element
|
||||
logger.warning(f"AI returned dict without 'elements' key, attempting to convert: {list(elementsData.keys())}")
|
||||
# For heading/paragraph, create element from dict
|
||||
if contentType == "heading":
|
||||
text = elementsData.get("text") or elementsData.get("heading") or str(elementsData)
|
||||
level = elementsData.get("level", 1)
|
||||
elements = [{"level": level, "text": text}]
|
||||
elif contentType == "paragraph":
|
||||
text = elementsData.get("text") or elementsData.get("content") or str(elementsData)
|
||||
elements = [{"text": text}]
|
||||
else:
|
||||
# Try to create element from dict structure
|
||||
elements = [elementsData]
|
||||
elif isinstance(elementsData, list):
|
||||
elements = elementsData
|
||||
else:
|
||||
# Primitive value - wrap it
|
||||
logger.warning(f"AI returned primitive value, wrapping: {type(elementsData)}")
|
||||
if contentType == "heading":
|
||||
elements = [{"level": 1, "text": str(elementsData)}]
|
||||
elif contentType == "paragraph":
|
||||
elements = [{"text": str(elementsData)}]
|
||||
else:
|
||||
elements = [{"text": str(elementsData)}]
|
||||
|
||||
if elements is None:
|
||||
logger.error(f"Could not extract elements from AI response. Response structure: {type(elementsData)}, keys: {list(elementsData.keys()) if isinstance(elementsData, dict) else 'N/A'}")
|
||||
logger.error(f"Full response (first 500 chars): {str(extractedJson)[:500]}")
|
||||
raise ValueError(f"Invalid elements format in AI response. Expected dict with 'elements' key or list, got: {type(elementsData)}")
|
||||
|
||||
# Validate elements is a list
|
||||
if not isinstance(elements, list):
|
||||
logger.warning(f"Elements is not a list, converting: {type(elements)}")
|
||||
elements = [elements]
|
||||
|
||||
# Update section with elements
|
||||
section["elements"] = elements
|
||||
return section
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating simple section: {str(e)}")
|
||||
raise
|
||||
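Review note: the last-resort branch of the truncated-JSON recovery in `_generateSimpleSection` can be exercised on its own. The sketch below is a simplified illustration under stated assumptions, not the commit's implementation: the function name `recoverTruncatedElements` is hypothetical, and the intermediate `closeJsonStructures` step from `modules.shared.jsonUtils` is skipped because that helper is not shown in this diff. Only the regex fallback and the empty-structure fallback are reproduced.

```python
import json
import re
from typing import Any


def recoverTruncatedElements(extractedJson: str) -> Any:
    """Parse extractedJson, falling back to a minimal structure if it is cut off.

    Hypothetical helper mirroring the final fallback in the diff: salvage the
    first (possibly partial) "text" value, otherwise return an empty elements list.
    """
    try:
        return json.loads(extractedJson)
    except json.JSONDecodeError:
        pass  # Fall through to the partial-text recovery below

    # Capture everything after the opening quote of the first "text" field,
    # up to the next double quote or the end of the truncated string.
    textMatch = re.search(r'"text"\s*:\s*"([^"]*)', extractedJson)
    if textMatch:
        partialText = textMatch.group(1)
        return {
            "elements": [{
                "text": partialText + "... [Content truncated due to token limit]"
            }]
        }

    # Nothing salvageable: return an empty but valid structure
    return {"elements": []}
```

Because the pattern stops at the first double quote, a text value containing an escaped quote would be cut at that point; this matches the regex used in the diff.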
|
||||
async def _generateImageSection(
|
||||
self,
|
||||
section: Dict[str, Any],
|
||||
context: Dict[str, Any]
|
||||
) -> Dict[str, Any]:
|
||||
"""Generate image for image section or include existing image"""
|
||||
try:
|
||||
# Check if this is an existing image to include or render
|
||||
imageSource = section.get("image_source", "generate")
|
||||
|
||||
if imageSource == "existing" or imageSource == "render":
|
||||
# Phase 4: Include existing image or render image from cachedContent
|
||||
imageRefId = section.get("image_reference_id")
|
||||
if not imageRefId:
|
||||
raise ValueError(f"Image section {section.get('id')} has image_source='{imageSource}' but no image_reference_id")
|
||||
|
||||
cachedContent = context.get("cachedContent") or {}  # context may carry cachedContent=None
|
||||
imageDocuments = cachedContent.get("imageDocuments", [])
|
||||
|
||||
# Find the image document
|
||||
imageDoc = next((img for img in imageDocuments if img.get("id") == imageRefId), None)
|
||||
if not imageDoc:
|
||||
raise ValueError(f"Image document {imageRefId} not found in cachedContent.imageDocuments")
|
||||
|
||||
# Create image element from existing/render image
|
||||
altText = imageDoc.get("altText", section.get("generation_hint", "Image"))
|
||||
mimeType = imageDoc.get("mimeType", "image/png")
|
||||
|
||||
section["elements"] = [{
|
||||
"base64Data": imageDoc.get("base64Data"),
|
||||
"altText": altText,
|
||||
"mimeType": mimeType,
|
||||
"caption": section.get("metadata", {}).get("caption")
|
||||
}]
|
||||
|
||||
logger.info(f"Successfully integrated image {imageRefId} for section {section.get('id')} (source={imageSource})")
|
||||
return section
|
||||
|
||||
# Generate new image (existing logic)
|
||||
imagePrompt = section.get("image_prompt")
|
||||
if not imagePrompt:
|
||||
# Try to create from generation_hint
|
||||
generationHint = section.get("generation_hint", "")
|
||||
if generationHint:
|
||||
imagePrompt = f"Create a professional illustration: {generationHint}"
|
||||
else:
|
||||
raise ValueError(f"Image section {section.get('id')} missing image_prompt and generation_hint")
|
||||
|
||||
# Call AI service for image generation
|
||||
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum, AiCallPromptImage
|
||||
# json is already imported at module level
|
||||
|
||||
# Create image generation prompt
|
||||
promptModel = AiCallPromptImage(
|
||||
prompt=imagePrompt,
|
||||
size="1024x1024",
|
||||
quality="standard",
|
||||
style="vivid"
|
||||
)
|
||||
promptJson = promptModel.model_dump_json(exclude_none=True, indent=2)
|
||||
|
||||
options = AiCallOptions(
|
||||
operationType=OperationTypeEnum.IMAGE_GENERATE,
|
||||
resultFormat="base64"
|
||||
)
|
||||
|
||||
# Log image generation start
|
||||
logger.info(f"Starting image generation for section {section.get('id')}: {imagePrompt[:100]}...")
|
||||
|
||||
# Call AI for image generation
|
||||
aiResponse = await self.services.ai.callAiContent(
|
||||
prompt=promptJson,
|
||||
options=options,
|
||||
outputFormat="base64"
|
||||
)
|
||||
|
||||
# Extract base64 image data
|
||||
base64Data = None
|
||||
|
||||
if aiResponse and aiResponse.documents and len(aiResponse.documents) > 0:
|
||||
imageDoc = aiResponse.documents[0]
|
||||
base64Data = imageDoc.documentData
|
||||
logger.debug(f"Image data extracted from documents: {len(base64Data) if base64Data else 0} chars")
|
||||
|
||||
# Fallback: check content field (might be base64 string)
|
||||
if not base64Data and aiResponse and aiResponse.content:
|
||||
base64Data = aiResponse.content
|
||||
logger.debug(f"Image data extracted from content: {len(base64Data) if base64Data else 0} chars")
|
||||
|
||||
if not base64Data:
|
||||
raise ValueError("Image generation returned no data")
|
||||
|
||||
# Validate base64 data
|
||||
try:
|
||||
# base64 is already imported at module level
|
||||
base64.b64decode(base64Data[:100], validate=True) # Validate first 100 chars
|
||||
except Exception as e:
|
||||
logger.warning(f"Image data may not be valid base64: {str(e)}")
|
||||
# Continue anyway - renderer will handle it
|
||||
|
||||
# Create image element
|
||||
altText = section.get("generation_hint", "Image")
|
||||
if not altText or altText == "Image":
|
||||
# Use image_prompt as alt text if generation_hint is generic
|
||||
altText = (section.get("image_prompt") or "Image")[:100] # Limit length; guards against image_prompt=None
|
||||
|
||||
caption = section.get("metadata", {}).get("caption")
|
||||
|
||||
section["elements"] = [{
|
||||
"url": f"data:image/png;base64,{base64Data}",
|
||||
"base64Data": base64Data,
|
||||
"altText": altText,
|
||||
"caption": caption
|
||||
}]
|
||||
|
||||
logger.info(f"Successfully generated image for section {section.get('id')}")
|
||||
return section
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating image section: {str(e)}")
|
||||
raise
|
||||
|
||||
async def _generateComplexTextSection(
|
||||
self,
|
||||
section: Dict[str, Any],
|
||||
context: Dict[str, Any]
|
||||
) -> Dict[str, Any]:
|
||||
"""Generate content for complex text section (long chapter)"""
|
||||
# For now, use same approach as simple section
|
||||
# Can be enhanced later with chunking for very long content
|
||||
return await self._generateSimpleSection(section, context)
|
||||
|
||||
def _createSectionPrompt(
|
||||
self,
|
||||
section: Dict[str, Any],
|
||||
context: Dict[str, Any]
|
||||
) -> str:
|
||||
"""Create sub-prompt for section content generation"""
|
||||
contentType = section.get("content_type", "")
|
||||
generationHint = section.get("generation_hint", "")
|
||||
extractionPrompt = section.get("extractionPrompt") # Optional extraction prompt for ContentParts
|
||||
userPrompt = context.get("userPrompt", "")
|
||||
cachedContent = context.get("cachedContent")
|
||||
previousSections = context.get("previousSections", [])
|
||||
sectionContentParts = context.get("sectionContentParts", []) # ContentParts for this section
|
||||
documentMetadata = context.get("documentMetadata", {})
|
||||
|
||||
# Get user language
|
||||
userLanguage = self._getUserLanguage()
|
||||
|
||||
# Format cached content
|
||||
cachedContentText = ""
|
||||
if cachedContent and cachedContent.get("extractedContent"):
|
||||
cachedContentText = self._formatCachedContent(cachedContent)
|
||||
|
||||
# Format ContentParts for this section
|
||||
contentPartsText = ""
|
||||
imagePartReferences = [] # Track image parts for text reference
|
||||
|
||||
if sectionContentParts:
|
||||
try:
|
||||
partsList = []
|
||||
imageIndex = 1
|
||||
for part in sectionContentParts:
|
||||
partTypeGroup = part.typeGroup if hasattr(part, 'typeGroup') else part.get('typeGroup', '')
|
||||
partMimeType = part.mimeType if hasattr(part, 'mimeType') else part.get('mimeType', '')
|
||||
partId = part.id if hasattr(part, 'id') else part.get('id', '')
|
||||
partData = part.data if hasattr(part, 'data') else part.get('data', '')
|
||||
|
||||
# Check if this is an image part
|
||||
isImage = partTypeGroup == "image" or (partMimeType and partMimeType.startswith("image/"))
|
||||
|
||||
if contentType == "image" and isImage:
|
||||
# For image sections: include image data for integration
|
||||
partsList.append(f"- ContentPart {partId} (image): [Image data available for integration]")
|
||||
elif isImage:
|
||||
# For non-image sections: track for text reference
|
||||
imagePartReferences.append({
|
||||
"id": partId,
|
||||
"index": imageIndex
|
||||
})
|
||||
imageIndex += 1
|
||||
# Don't include image data in prompt for non-image sections
|
||||
else:
|
||||
# For text/table/etc parts: include data preview
|
||||
dataPreview = str(partData)[:200] if partData else "[No data]"
|
||||
partsList.append(f"- ContentPart {partId} ({partTypeGroup}): {dataPreview}{'...' if partData and len(str(partData)) > 200 else ''}")
|
||||
|
||||
if partsList:
|
||||
contentPartsText = "\n".join(partsList)
|
||||
|
||||
# Add image reference instructions for non-image sections
|
||||
if imagePartReferences and contentType != "image":
|
||||
refText = ", ".join([f"Bild {ref['index']}" if userLanguage == "de" else f"Image {ref['index']}" for ref in imagePartReferences])
|
||||
contentPartsText += f"\n\nNOTE: Reference images as text in the document language: {refText}"
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not format ContentParts for section prompt: {str(e)}")
|
||||
contentPartsText = ""
|
||||
|
||||
# Format previous sections for context
|
||||
previousSectionsText = ""
|
||||
if previousSections:
|
||||
formattedSections = []
|
||||
for s in previousSections[-10:]: # Last 10 sections for context (increased from 5)
|
||||
prevContentType = s.get('content_type', 'unknown') # Use different variable name to avoid shadowing
|
||||
order = s.get('order', 0)
|
||||
hint = s.get('generation_hint', '')
|
||||
elements = s.get('elements', [])
|
||||
|
||||
# Extract actual content from elements
|
||||
contentPreview = ""
|
||||
if elements:
|
||||
if prevContentType == "heading":
|
||||
# Extract heading text
|
||||
for elem in elements:
|
||||
if isinstance(elem, dict) and "text" in elem:
|
||||
contentPreview = f": \"{elem['text']}\""
|
||||
break
|
||||
elif prevContentType == "paragraph":
|
||||
# Extract paragraph text (first 100 chars)
|
||||
for elem in elements:
|
||||
if isinstance(elem, dict) and "text" in elem:
|
||||
text = elem['text']
|
||||
contentPreview = f": \"{text[:100]}{'...' if len(text) > 100 else ''}\""
|
||||
break
|
||||
elif prevContentType == "bullet_list":
|
||||
# Extract bullet items
|
||||
for elem in elements:
|
||||
if isinstance(elem, dict) and "items" in elem:
|
||||
items = elem['items']
|
||||
if items:
|
||||
contentPreview = f": {items[:3]}{'...' if len(items) > 3 else ''}"
|
||||
break
|
||||
|
||||
formattedSections.append(
|
||||
f"- Section {order} ({prevContentType}){contentPreview}"
|
||||
)
|
||||
previousSectionsText = "\n".join(formattedSections)
|
||||
|
||||
prompt = f"""{'='*80}
|
||||
SECTION TO GENERATE:
|
||||
{'='*80}
|
||||
Type: {contentType}
|
||||
Hint: {generationHint}
|
||||
{'='*80}
|
||||
|
||||
CONTEXT:
|
||||
- User Request: {userPrompt}
|
||||
- Previous Sections: {len(previousSections or [])} sections already generated
|
||||
- Document Title: {documentMetadata.get('title', 'Unknown')}
|
||||
|
||||
{'='*80}
|
||||
PREVIOUS SECTIONS (for continuity):
|
||||
{'='*80}
|
||||
{previousSectionsText if previousSectionsText else "This is the first section."}
|
||||
{'='*80}
|
||||
|
||||
{'='*80}
|
||||
EXTRACTED CONTENT (if available):
|
||||
{'='*80}
|
||||
{cachedContentText if cachedContentText else "None"}
|
||||
{'='*80}
|
||||
|
||||
{'='*80}
|
||||
CONTENT PARTS FOR THIS SECTION:
|
||||
{'='*80}
|
||||
{contentPartsText if contentPartsText else "No ContentParts assigned to this section."}
|
||||
{'='*80}
|
||||
|
||||
TASK: Generate content for this section ONLY.
|
||||
|
||||
INSTRUCTIONS:
|
||||
1. Generate content appropriate for section type: {contentType}
|
||||
2. Use the generation hint: {generationHint}
|
||||
{f"3. Use extractionPrompt for ContentParts: {extractionPrompt}" if extractionPrompt else "3. Use ContentParts data if provided"}
|
||||
4. Consider previous sections for continuity
|
||||
5. Use extracted content if relevant
|
||||
6. All content must be in the language '{userLanguage}'
|
||||
7. {'For image sections: Integrate image ContentParts as visual elements' if contentType == "image" else 'For non-image sections: Reference image ContentParts as text (e.g., "siehe Bild 1" in German, "see Image 1" in English)'}
|
||||
|
||||
8. CRITICAL: Return ONLY a JSON object with an "elements" array. DO NOT return a full document structure.
|
||||
|
||||
REQUIRED FORMAT - Return ONLY this structure:
|
||||
|
||||
For heading:
|
||||
{{"elements": [{{"level": 1, "text": "Heading Text"}}]}}
|
||||
|
||||
For paragraph:
|
||||
{{"elements": [{{"text": "Paragraph text content"}}]}}
|
||||
|
||||
For table:
|
||||
{{"elements": [{{"headers": ["Col1", "Col2"], "rows": [["Row1", "Row2"]]}}]}}
|
||||
|
||||
For bullet_list:
|
||||
{{"elements": [{{"items": ["Item 1", "Item 2"]}}]}}
|
||||
|
||||
For code_block:
|
||||
{{"elements": [{{"code": "code content here", "language": "python"}}]}}
|
||||
|
||||
CRITICAL RULES:
|
||||
- Return ONLY {{"elements": [...]}} - nothing else
|
||||
- DO NOT include "metadata", "documents", "sections", or any other fields
|
||||
- DO NOT return a full document structure
|
||||
- DO NOT add explanatory text before or after the JSON
|
||||
- The response must start with {{"elements": and end with }}
|
||||
- This is a SINGLE SECTION, not a full document
|
||||
"""
|
||||
return prompt
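The format examples in the prompt above depend on doubled braces, which an f-string emits as literal JSON braces. A minimal sketch of that behaviour follows; `buildFormatExample` is a hypothetical helper used only for illustration and is not part of this module.

```python
import json


def buildFormatExample(contentType: str) -> str:
    """Hypothetical helper: render one format line the way the prompt f-string does."""
    # Inside an f-string, '{{' and '}}' produce literal '{' and '}',
    # while single braces interpolate the expression between them.
    return f'For {contentType}: {{"elements": [{{"text": "Paragraph text content"}}]}}'


line = buildFormatExample("paragraph")
# Everything after the first ': ' is the JSON payload the model is asked to mimic.
payload = json.loads(line.split(": ", 1)[1])
print(payload["elements"][0]["text"])
```

Parsing the rendered line back with `json.loads` is a cheap way to confirm that an escaped template is still valid JSON after interpolation.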
|
||||
|
||||
def _formatCachedContent(self, cachedContent: Dict[str, Any]) -> str:
|
||||
"""Format cached content for prompt inclusion"""
|
||||
try:
|
||||
extractedContent = cachedContent.get("extractedContent", [])
|
||||
if not extractedContent:
|
||||
return "No content extracted."
|
||||
|
||||
formattedParts = []
|
||||
for extracted in extractedContent:
|
||||
if hasattr(extracted, 'parts'):
|
||||
for part in extracted.parts:
|
||||
if hasattr(part, 'content'):
|
||||
formattedParts.append(str(part.content))
|
||||
elif isinstance(extracted, dict):
|
||||
formattedParts.append(str(extracted))
|
||||
else:
|
||||
formattedParts.append(str(extracted))
|
||||
|
||||
return "\n\n".join(formattedParts) if formattedParts else "No content extracted."
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Error formatting cached content: {str(e)}")
|
||||
return "Error formatting cached content."
|
||||
|
||||
def _getUserLanguage(self) -> str:
|
||||
"""Get user language for document generation"""
|
||||
try:
|
||||
if self.services:
|
||||
if hasattr(self.services, 'currentUserLanguage') and self.services.currentUserLanguage:
|
||||
return self.services.currentUserLanguage
|
||||
elif hasattr(self.services, 'user') and self.services.user and hasattr(self.services.user, 'language'):
|
||||
return self.services.user.language
|
||||
except Exception:
|
||||
pass
|
||||
return 'en' # Default fallback
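The lookup order above (per-request language, then the user's language, then `'en'`) can also be written with `getattr` defaults. This is only a sketch: `resolveUserLanguage` is a hypothetical name, and unlike the method above it also falls back to the default when the user's language is empty.

```python
from types import SimpleNamespace
from typing import Any


def resolveUserLanguage(services: Any, default: str = "en") -> str:
    """Hypothetical stand-alone version of the lookup order used above."""
    # 1) explicit per-request language, 2) language on the user object, 3) default
    current = getattr(services, "currentUserLanguage", None)
    if current:
        return current
    user = getattr(services, "user", None)
    language = getattr(user, "language", None)
    return language or default


# SimpleNamespace stands in for the real services object in this sketch.
services = SimpleNamespace(currentUserLanguage="", user=SimpleNamespace(language="de"))
print(resolveUserLanguage(services))
```

Because `getattr` with a default never raises for a missing attribute, this variant needs no broad `except Exception` block.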
|
||||
|
||||
163
modules/services/serviceGeneration/subContentIntegrator.py
Normal file
|
|
@@ -0,0 +1,163 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Content Integrator for hierarchical document generation.
|
||||
Merges generated content into document structure and validates completeness.
|
||||
"""
|
||||
|
||||
import logging
|
||||
from typing import Dict, Any, List, Tuple
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class ContentIntegrator:
|
||||
"""Integrates generated content into document structure"""
|
||||
|
||||
def __init__(self, services: Any = None):
|
||||
self.services = services
|
||||
|
||||
def integrateContent(
|
||||
self,
|
||||
structure: Dict[str, Any],
|
||||
generatedSections: List[Dict[str, Any]]
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Merge generated sections into document structure.
|
||||
|
||||
Args:
|
||||
structure: Original document structure
|
||||
generatedSections: List of sections with populated elements
|
||||
|
||||
Returns:
|
||||
Complete document structure ready for rendering
|
||||
"""
|
||||
try:
|
||||
# Create mapping of section IDs to generated sections
|
||||
sectionMap = {section.get("id"): section for section in generatedSections}
|
||||
|
||||
# Process each document
|
||||
for doc in structure.get("documents", []):
|
||||
sections = doc.get("sections", [])
|
||||
|
||||
for idx, section in enumerate(sections):
|
||||
sectionId = section.get("id")
|
||||
|
||||
# Find corresponding generated section
|
||||
if sectionId in sectionMap:
|
||||
generatedSection = sectionMap[sectionId]
|
||||
|
||||
# Merge elements into structure section
|
||||
if "elements" in generatedSection:
|
||||
section["elements"] = generatedSection["elements"]
|
||||
|
||||
# Preserve error information if present
|
||||
if generatedSection.get("error"):
|
||||
section["error"] = True
|
||||
section["errorMessage"] = generatedSection.get("errorMessage")
|
||||
section["originalContentType"] = generatedSection.get("originalContentType")
|
||||
else:
|
||||
# Section not generated - create error section
|
||||
logger.warning(f"Section {sectionId} not found in generated sections")
|
||||
section = self.createErrorSection(
|
||||
section,
|
||||
f"Section {sectionId} was not generated"
|
||||
)
|
||||
sections[idx] = section
|
||||
|
||||
# Debug: Write final merged structure to debug file (harmonized - no checks needed)
|
||||
import json
|
||||
structureJson = json.dumps(structure, indent=2, ensure_ascii=False)
|
||||
self.services.utils.writeDebugFile(
|
||||
structureJson,
|
||||
"document_generation_final_merged_json"
|
||||
)
|
||||
logger.debug(f"Logged final merged JSON structure ({len(structureJson)} chars)")
|
||||
|
||||
return structure
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error integrating content: {str(e)}")
|
||||
raise
|
||||
|
||||
def validateCompleteness(
|
||||
self,
|
||||
document: Dict[str, Any]
|
||||
) -> Tuple[bool, List[str]]:
|
||||
"""
|
||||
Validate that all sections have content.
|
||||
|
||||
Args:
|
||||
document: Document structure to validate
|
||||
|
||||
Returns:
|
||||
(is_complete, list_of_missing_sections)
|
||||
"""
|
||||
missingSections = []
|
||||
|
||||
try:
|
||||
for doc in document.get("documents", []):
|
||||
sections = doc.get("sections", [])
|
||||
|
||||
for section in sections:
|
||||
sectionId = section.get("id", "unknown")
|
||||
elements = section.get("elements", [])
|
||||
|
||||
# Check if section has content
|
||||
if not elements:
|
||||
# Skip error sections (they have error text)
|
||||
if not section.get("error"):
|
||||
missingSections.append(sectionId)
|
||||
else:
|
||||
# Validate elements have actual content
|
||||
hasContent = False
|
||||
for element in elements:
|
||||
# Check different content types
|
||||
if element.get("text") or element.get("base64Data") or \
|
||||
element.get("headers") or element.get("items") or \
|
||||
element.get("code"):
|
||||
hasContent = True
|
||||
break
|
||||
|
||||
if not hasContent and not section.get("error"):
|
||||
missingSections.append(sectionId)
|
||||
|
||||
return len(missingSections) == 0, missingSections
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error validating completeness: {str(e)}")
|
||||
return False, [f"Validation error: {str(e)}"]
|
||||
|
||||
def createErrorSection(
|
||||
self,
|
||||
originalSection: Dict[str, Any],
|
||||
errorMessage: str
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Create error placeholder section.
|
||||
|
||||
Args:
|
||||
originalSection: Original section that failed
|
||||
errorMessage: Error message to display
|
||||
|
||||
Returns:
|
||||
Error section with placeholder content
|
||||
"""
|
||||
contentType = originalSection.get("content_type", "content")
|
||||
sectionId = originalSection.get("id", "unknown")
|
||||
|
||||
return {
|
||||
"id": sectionId,
|
||||
"content_type": "paragraph", # Change to paragraph for error display
|
||||
"elements": [{
|
||||
"text": f"[ERROR: Failed to generate {contentType} for section '{sectionId}'. Error: {errorMessage}]"
|
||||
}],
|
||||
"order": originalSection.get("order", 0),
|
||||
"error": True,
|
||||
"errorMessage": errorMessage,
|
||||
"originalContentType": contentType,
|
||||
"title": originalSection.get("title"),
|
||||
"generation_hint": originalSection.get("generation_hint"),
|
||||
"complexity": originalSection.get("complexity")
|
||||
}
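The merge-then-validate flow of `ContentIntegrator` can be sketched without the class or the `services` dependency. The two functions below are simplified, hypothetical stand-ins for `integrateContent` and `validateCompleteness`, not the implementation in this file; they keep only the ID-based merge, the error placeholder, and the content check.

```python
from typing import Any, Dict, List, Tuple


def mergeSections(structure: Dict[str, Any], generated: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Simplified stand-in for integrateContent: copy elements by section id."""
    byId = {s.get("id"): s for s in generated}
    for doc in structure.get("documents", []):
        for section in doc.get("sections", []):
            match = byId.get(section.get("id"))
            if match is not None:
                section["elements"] = match.get("elements", [])
            else:
                # Missing sections get a visible placeholder instead of staying empty
                section["error"] = True
                section["elements"] = [{"text": f"[ERROR: section '{section.get('id')}' was not generated]"}]
    return structure


def findEmptySections(structure: Dict[str, Any]) -> Tuple[bool, List[str]]:
    """Simplified stand-in for validateCompleteness."""
    contentKeys = ("text", "base64Data", "headers", "items", "code")
    missing = [
        section.get("id", "unknown")
        for doc in structure.get("documents", [])
        for section in doc.get("sections", [])
        if not section.get("error")
        and not any(el.get(k) for el in section.get("elements", []) for k in contentKeys)
    ]
    return not missing, missing


structure = {"documents": [{"sections": [
    {"id": "section_1", "elements": []},
    {"id": "section_2", "elements": []},
    {"id": "section_3", "elements": []},
]}]}
generated = [
    {"id": "section_1", "elements": [{"text": "Hello"}]},
    {"id": "section_2", "elements": [{"text": ""}]},
]
merged = mergeSections(structure, generated)
isComplete, missing = findEmptySections(merged)
print(isComplete, missing)
```

In this run `section_2` is reported as missing because its only element has empty text, while `section_3` is skipped because it was already turned into an error placeholder.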
|
||||
|
||||
|
|
@@ -180,6 +180,16 @@ def convertDocumentDataToString(document_data: Any, file_extension: str) -> str:
|
|||
try:
|
||||
if document_data is None:
|
||||
return ""
|
||||
if isinstance(document_data, bytes):
|
||||
# IMPORTANT: Decode bytes to string for text files (HTML, text, etc.)
|
||||
try:
|
||||
return document_data.decode('utf-8')
|
||||
except UnicodeDecodeError:
|
||||
# Fallback: latin1 maps every byte to a character, so this decode always succeeds
|
||||
try:
|
||||
return document_data.decode('latin1')
|
||||
except Exception:
|
||||
return document_data.decode('utf-8', errors='replace')
|
||||
if isinstance(document_data, str):
|
||||
return document_data
|
||||
if isinstance(document_data, dict):
|
||||
|
|
|
|||
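The byte-decoding fallback added in the hunk above can be sketched in isolation. `decodeBytes` is a hypothetical name for illustration. Because latin1 assigns a character to every byte value, the second decode cannot raise, so no third fallback is needed in this sketch.

```python
def decodeBytes(data: bytes) -> str:
    """Decode bytes as UTF-8, falling back to latin1 for non-UTF-8 input."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # latin1 assigns a character to every byte value, so this cannot raise
        return data.decode("latin1")


# Valid UTF-8 input takes the first branch
print(decodeBytes("Grüße".encode("utf-8")))
# latin1-encoded input is not valid UTF-8 and takes the fallback branch
print(decodeBytes(b"Gr\xfc\xdfe"))
```

The trade-off is that input in some other single-byte encoding decodes without error but may show the wrong characters, since latin1 never rejects anything.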
|
|
@@ -19,7 +19,8 @@ async def buildGenerationPrompt(
|
|||
title: str,
|
||||
extracted_content: str = None,
|
||||
continuationContext: Dict[str, Any] = None,
|
||||
services: Any = None
|
||||
services: Any = None,
|
||||
useContentParts: bool = False # ARCHITECTURE: If True, don't include full content in prompt (ContentParts will be used directly)
|
||||
) -> str:
|
||||
"""
|
||||
Build the unified generation prompt using a single JSON template.
|
||||
|
|
@@ -120,7 +121,9 @@ Continue generating the remaining content now.
|
|||
# PROMPT FOR FIRST CALL
|
||||
# Structure: User request + Extracted content FIRST (if available), then JSON template, then instructions
|
||||
|
||||
if extracted_content:
|
||||
# ARCHITECTURE: If useContentParts=True, don't include full content in prompt
|
||||
# ContentParts will be passed directly to callAi for model-aware chunking
|
||||
if extracted_content and not useContentParts:
|
||||
# If we have extracted content, put it FIRST and make it very clear it's the source data
|
||||
generationPrompt = f"""{'='*80}
|
||||
USER REQUEST / USER PROMPT:
|
||||
|
|
|
|||
540
modules/services/serviceGeneration/subStructureGenerator.py
Normal file
|
|
@@ -0,0 +1,540 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
"""
|
||||
Structure Generator for hierarchical document generation.
|
||||
Generates document skeleton with section placeholders.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import json
|
||||
from typing import Dict, Any, Optional, List
|
||||
from modules.datamodels.datamodelJson import jsonTemplateDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class StructureGenerator:
|
||||
"""Generates document structure with section placeholders"""
|
||||
|
||||
def __init__(self, services: Any):
|
||||
self.services = services
|
||||
|
||||
async def generateStructure(
|
||||
self,
|
||||
userPrompt: str,
|
||||
documentList: Optional[Any] = None,
|
||||
cachedContent: Optional[Dict[str, Any]] = None,
|
||||
contentParts: Optional[List[Any]] = None,
|
||||
maxSectionLength: int = 500,
|
||||
existingImages: Optional[List[Dict[str, Any]]] = None
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Generate document structure with sections.
|
||||
|
||||
Args:
|
||||
userPrompt: User's original prompt
|
||||
documentList: Optional document references
|
||||
cachedContent: Optional extracted content cache
|
||||
contentParts: Optional list of ContentParts to analyze for structure generation
|
||||
maxSectionLength: Maximum words for simple sections
|
||||
existingImages: Optional list of existing images to include
|
||||
|
||||
Returns:
|
||||
Document structure with empty elements arrays and contentPartIds per section
|
||||
"""
|
||||
try:
|
||||
# Create structure generation prompt
|
||||
structurePrompt = self._createStructurePrompt(
|
||||
userPrompt=userPrompt,
|
||||
cachedContent=cachedContent,
|
||||
contentParts=contentParts,
|
||||
maxSectionLength=maxSectionLength,
|
||||
existingImages=existingImages or []
|
||||
)
|
||||
|
||||
# Debug: Log structure generation prompt (harmonized - no checks needed)
|
||||
self.services.utils.writeDebugFile(
|
||||
structurePrompt,
|
||||
"document_generation_structure_prompt"
|
||||
)
|
||||
|
||||
# Call AI to generate structure
|
||||
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum
|
||||
|
||||
options = AiCallOptions(
|
||||
operationType=OperationTypeEnum.DATA_GENERATE,
|
||||
resultFormat="json"
|
||||
)
|
||||
|
||||
aiResponse = await self.services.ai.callAiContent(
|
||||
prompt=structurePrompt,
|
||||
options=options,
|
||||
outputFormat="json"
|
||||
)
|
||||
|
||||
# Debug: Log structure generation response (harmonized - no checks needed)
|
||||
self.services.utils.writeDebugFile(
|
||||
aiResponse.content if aiResponse and aiResponse.content else '',
|
||||
"document_generation_structure_response"
|
||||
)
|
||||
|
||||
if not aiResponse or not aiResponse.content:
|
||||
raise ValueError("AI structure generation returned empty response")
|
||||
|
||||
# Extract and parse JSON
|
||||
extractedJson = self.services.utils.jsonExtractString(aiResponse.content)
|
||||
if not extractedJson:
|
||||
raise ValueError("No JSON found in AI structure response")
|
||||
|
||||
structure = json.loads(extractedJson)
|
||||
|
||||
# Validate and enhance structure
|
||||
structure = self._validateAndEnhanceStructure(structure, maxSectionLength)
|
||||
|
||||
return structure
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating structure: {str(e)}")
|
||||
raise
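The extract-then-parse step in `generateStructure` relies on `self.services.utils.jsonExtractString`, whose implementation is not shown in this diff. As an assumption-laden sketch of the general idea, the hypothetical function below takes the outermost `{...}` span of a model response and loads it as JSON; the real utility may work differently.

```python
import json
from typing import Any, Dict


def parseStructureResponse(raw: str) -> Dict[str, Any]:
    """Hypothetical parser: take the outermost {...} span and load it as JSON."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON found in AI structure response")
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on malformed JSON
    return json.loads(raw[start:end + 1])


structure = parseStructureResponse('Here is the structure: {"documents": []} Done.')
print(structure)
```

Slicing from the first `{` to the last `}` tolerates explanatory text around the JSON, but it fails if the surrounding text itself contains braces.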
|
||||
|
||||
def _createStructurePrompt(
|
||||
self,
|
||||
userPrompt: str,
|
||||
cachedContent: Optional[Dict[str, Any]] = None,
|
||||
contentParts: Optional[List[Any]] = None,
|
||||
maxSectionLength: int = 500,
|
||||
existingImages: Optional[List[Dict[str, Any]]] = None
|
||||
) -> str:
|
||||
"""
|
||||
Create prompt for structure generation.
|
||||
"""
|
||||
# Get user language
|
||||
userLanguage = self._getUserLanguage()
|
||||
|
||||
# Format cached content if available
|
||||
cachedContentText = ""
|
||||
if cachedContent and cachedContent.get("extractedContent"):
|
||||
cachedContentText = self._formatCachedContent(cachedContent)
|
||||
|
||||
# Use provided existingImages or extract from cachedContent
|
||||
if not existingImages:
|
||||
existingImages = []
|
||||
if cachedContent and cachedContent.get("imageDocuments"):
|
||||
existingImages = cachedContent.get("imageDocuments", [])
|
||||
|
||||
# Format ContentParts as JSON for structure generation
|
||||
contentPartsJson = ""
|
||||
if contentParts:
|
||||
try:
|
||||
import json
|
||||
# Convert ContentParts to dict format for JSON serialization
|
||||
contentPartsList = []
|
||||
for part in contentParts:
|
||||
if hasattr(part, 'dict'):
|
||||
partDict = part.dict()
|
||||
elif isinstance(part, dict):
|
||||
partDict = part
|
||||
else:
|
||||
# Try to convert to dict
|
||||
partDict = {
|
||||
"id": getattr(part, 'id', ''),
|
||||
"typeGroup": getattr(part, 'typeGroup', ''),
|
||||
"mimeType": getattr(part, 'mimeType', ''),
|
||||
"label": getattr(part, 'label', ''),
|
||||
"metadata": getattr(part, 'metadata', {})
|
||||
}
|
||||
# Only include essential fields for structure generation (not full data)
|
||||
contentPartsList.append({
|
||||
"id": partDict.get("id", ""),
|
||||
"typeGroup": partDict.get("typeGroup", ""),
|
||||
"mimeType": partDict.get("mimeType", ""),
|
||||
"label": partDict.get("label", ""),
|
||||
"metadata": partDict.get("metadata", {})
|
||||
})
|
||||
|
||||
contentPartsJson = json.dumps(contentPartsList, indent=2, ensure_ascii=False)
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not format ContentParts as JSON: {str(e)}")
|
||||
contentPartsJson = ""
|
||||
|
||||
# Create structure template
|
||||
structureTemplate = jsonTemplateDocument.replace("{{DOCUMENT_TITLE}}", "Document Title")
|
||||
|
||||
prompt = f"""{'='*80}
|
||||
USER REQUEST:
|
||||
{'='*80}
|
||||
{userPrompt}
|
||||
{'='*80}
|
||||
|
||||
TASK: Generate a document STRUCTURE (skeleton) with sections.
|
||||
Do NOT generate actual content yet - only the structure.
|
||||
|
||||
{'='*80}
|
||||
EXTRACTED CONTENT (if available):
|
||||
{'='*80}
|
||||
{cachedContentText if cachedContentText else "No source documents provided."}
|
||||
{'='*80}
|
||||
|
||||
INSTRUCTIONS:
|
||||
1. Analyze the user request, extracted content, and available ContentParts
|
||||
2. Create a document structure with CONTENT sections only
|
||||
3. For each section, specify:
|
||||
- id: Unique identifier (e.g., "section_title_1", "section_image_1")
|
||||
- content_type: "heading" | "paragraph" | "image" | "table" | "bullet_list" | "code_block"
|
||||
- complexity: "simple" (can generate directly) or "complex" (needs sub-prompt)
|
||||
- generation_hint: Brief description of what content should be generated
|
||||
- contentPartIds: Array of ContentPart IDs that should be used for this section (e.g., ["part_1", "part_2"]) - can be empty []
|
||||
- extractionPrompt: (optional) Specific prompt for extracting/processing ContentParts for this section
|
||||
- image_prompt: (only for image sections) Detailed prompt for image generation
|
||||
- order: Section order number (starting from 1)
|
||||
- elements: [] (empty array - will be populated later)
|
||||
|
||||
4. Identify image sections:
|
||||
- If user requests illustrations/images, create image sections
|
||||
- If existing images are provided in documentList (check EXISTING IMAGES section below), create image sections that reference them
|
||||
- Add image_prompt field with detailed description for image generation (only for new images)
|
||||
- Set complexity to "complex" for new images, "simple" for existing/render images
|
||||
- For existing images: Set image_source to "existing" and image_reference_id to the image document ID
|
||||
- For images to render (from input documents): Set image_source to "render" and image_reference_id to the image document ID
|
||||
- Example for new image: {{"id": "section_image_1", "content_type": "image", "complexity": "complex", "generation_hint": "Illustration for chapter 1", "image_prompt": "A detailed description for image generation", "order": 2, "elements": []}}
|
||||
- Example for existing image: {{"id": "section_image_1", "content_type": "image", "complexity": "simple", "generation_hint": "Include provided image", "image_source": "existing", "image_reference_id": "doc_id_here", "order": 2, "elements": []}}
|
||||
- Example for render image: {{"id": "section_image_1", "content_type": "image", "complexity": "simple", "generation_hint": "Render input image", "image_source": "render", "image_reference_id": "doc_id_here", "order": 2, "elements": []}}
|
||||
|
||||
{'='*80}
|
||||
EXISTING IMAGES (to include in document):
|
||||
{'='*80}
|
||||
{self._formatExistingImages(existingImages) if existingImages else "No existing images provided."}
|
||||
{'='*80}
|
||||
|
||||
5. Identify complex text sections:
- Long chapters (>{maxSectionLength} words expected) should be marked as "complex"
- Short paragraphs/headings should be "simple"

6. Return ONLY valid JSON following this structure:
{structureTemplate}

7. CRITICAL RULES FOR CONTENT PARTS:
- Analyze available ContentParts and determine which ones are needed for each section
- For image sections (content_type == "image"): Include image ContentParts in contentPartIds - images will be integrated as visual elements
- For other sections (heading, paragraph, etc.): If image ContentParts are referenced, they will be referenced as text in the document language (not integrated as images)
- Each section can reference multiple ContentParts via contentPartIds array
- If specific extraction/processing is needed for ContentParts, provide extractionPrompt
- Image references in non-image sections should be automatically derived in the document language (e.g., "siehe Bild 1" in German, "see Image 1" in English)

8. CRITICAL RULES:
|
||||
- Return ONLY valid JSON (no comments, no trailing commas, double quotes only)
|
||||
- Follow the exact JSON schema structure provided
|
||||
- IMPORTANT: All sections MUST have empty elements arrays: "elements": [] (the template shows examples with content, but you must use empty arrays)
|
||||
- ALL sections MUST include "generation_hint" field with a brief description of what content should be generated
|
||||
- ALL sections MUST include "complexity" field: "simple" for short content, "complex" for long chapters/images
|
||||
- ALL sections MUST include "contentPartIds" field (can be empty array [] if no ContentParts needed)
|
||||
- Image sections MUST include "image_prompt" field with detailed description for image generation
|
||||
- Order numbers MUST start from 1 (not 0)
|
||||
- All content must be in the language '{userLanguage}'
|
||||
- Do NOT generate actual content - only structure (skeleton)
|
||||
- Use only supported content_type values: "heading", "paragraph", "image", "table", "bullet_list", "code_block"
|
||||
|
||||
Return ONLY the JSON structure. No explanations.
|
||||
"""
|
||||
return prompt
|
||||
|
||||
def _validateAndEnhanceStructure(
|
||||
self,
|
||||
structure: Dict[str, Any],
|
||||
maxSectionLength: int
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Validate structure and enhance with complexity identification.
|
||||
"""
|
||||
try:
|
||||
# Ensure structure has required fields
|
||||
if "documents" not in structure:
|
||||
if "sections" in structure:
|
||||
# Convert single-document format to multi-document format
|
||||
structure = {
|
||||
"metadata": structure.get("metadata", {}),
|
||||
"documents": [{
|
||||
"id": "doc_1",
|
||||
"title": structure.get("metadata", {}).get("title", "Document"),
|
||||
"filename": "document.json",
|
||||
"sections": structure.get("sections", [])
|
||||
}]
|
||||
}
|
||||
else:
|
||||
raise ValueError("Structure missing 'documents' or 'sections' field")
|
||||
|
||||
# Process each document
|
||||
for doc in structure.get("documents", []):
|
||||
sections = doc.get("sections", [])
|
||||
|
||||
# Process and validate sections according to standardized schema
|
||||
for idx, section in enumerate(sections):
|
||||
# Ensure required fields
|
||||
if "id" not in section:
|
||||
section["id"] = f"section_{idx + 1}"
|
||||
|
||||
sectionId = section.get("id", "")
|
||||
section["order"] = idx + 1
|
||||
|
||||
if "elements" not in section:
|
||||
section["elements"] = []
|
||||
|
||||
# Ensure contentPartIds field exists (can be empty array)
|
||||
if "contentPartIds" not in section:
|
||||
section["contentPartIds"] = []
|
||||
|
||||
# Ensure extractionPrompt field exists (optional)
|
||||
if "extractionPrompt" not in section:
|
||||
section["extractionPrompt"] = None
|
||||
|
||||
# Identify complexity if not set
|
||||
if "complexity" not in section:
|
||||
section["complexity"] = self._identifySectionComplexity(
|
||||
section,
|
||||
maxSectionLength
|
||||
)
|
||||
|
||||
# Ensure generation_hint exists (required for content generation)
|
||||
if "generation_hint" not in section or not section.get("generation_hint"):
|
||||
# Create meaningful generation hint from section id or content type
|
||||
contentType = section.get("content_type", "")
|
||||
|
||||
# Extract meaningful hint from section ID
|
||||
meaningfulHint = self._extractMeaningfulHint(sectionId, contentType, section.get("elements", []))
|
||||
section["generation_hint"] = meaningfulHint
|
||||
|
||||
# Ensure image sections have proper configuration
|
||||
if section.get("content_type") == "image":
|
||||
imageSource = section.get("image_source", "generate")
|
||||
|
||||
if imageSource in ("existing", "render"):
|
||||
# Existing or render image - ensure image_reference_id is set
|
||||
if "image_reference_id" not in section:
|
||||
logger.warning(f"Image section {sectionId} has image_source='{imageSource}' but no image_reference_id")
|
||||
# Existing/render images are simple (no generation needed, code integration)
|
||||
section["complexity"] = "simple"
|
||||
else:
|
||||
# New image generation - ensure image_prompt
|
||||
if "image_prompt" not in section or not section.get("image_prompt"):
|
||||
# Try to extract from generation_hint
|
||||
generationHint = section.get("generation_hint", "")
|
||||
if generationHint:
|
||||
# Enhance generation_hint to be a proper image prompt
|
||||
section["image_prompt"] = self._enhanceImagePrompt(generationHint)
|
||||
else:
|
||||
# Create default based on document context
|
||||
docTitle = doc.get("title", "Document")
|
||||
section["image_prompt"] = f"Generate an illustration for: {docTitle}"
|
||||
|
||||
# Ensure complexity is set to complex for new image generation
|
||||
section["complexity"] = "complex"
|
||||
|
||||
return structure
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error validating structure: {str(e)}")
|
||||
raise
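The normalization performed by `_validateAndEnhanceStructure` (wrap a bare `sections` list in one document, then fill per-section defaults) can be sketched as a small pure function. `normalizeStructure` is a hypothetical, reduced version that omits the complexity, hint, and image handling.

```python
from typing import Any, Dict


def normalizeStructure(structure: Dict[str, Any]) -> Dict[str, Any]:
    """Simplified stand-in: wrap bare sections in one document and fill defaults."""
    if "documents" not in structure:
        if "sections" not in structure:
            raise ValueError("Structure missing 'documents' or 'sections' field")
        structure = {
            "metadata": structure.get("metadata", {}),
            "documents": [{
                "id": "doc_1",
                "title": structure.get("metadata", {}).get("title", "Document"),
                "sections": structure["sections"],
            }],
        }
    for doc in structure["documents"]:
        for idx, section in enumerate(doc.get("sections", [])):
            section.setdefault("id", f"section_{idx + 1}")
            section["order"] = idx + 1  # order is always renumbered from 1
            section.setdefault("elements", [])
            section.setdefault("contentPartIds", [])
    return structure


result = normalizeStructure({"metadata": {"title": "Report"}, "sections": [{"content_type": "heading"}]})
print(result["documents"][0]["sections"][0])
```

`dict.setdefault` writes a value only when the key is absent, so fields the model already supplied are kept, while `order` is overwritten on purpose.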
|
||||
|
||||
def _identifySectionComplexity(
|
||||
self,
|
||||
section: Dict[str, Any],
|
||||
maxSectionLength: int
|
||||
) -> str:
|
||||
"""
|
||||
Identify if section is simple or complex.
|
||||
|
||||
Rules:
|
||||
- Images: complex by default (the caller resets existing/render images to simple)
|
||||
- Long chapters (>maxSectionLength words): complex
|
||||
- Others: simple
|
||||
"""
|
||||
contentType = section.get("content_type", "")
|
||||
|
||||
# Images are always complex
|
||||
if contentType == "image":
|
||||
return "complex"
|
||||
|
||||
# Check generation_hint for length indicators
|
||||
generationHint = section.get("generation_hint", "").lower()
|
||||
|
||||
# Keywords indicating long content
|
||||
longContentKeywords = [
|
||||
"chapter", "long", "detailed", "comprehensive",
|
||||
"extensive", "full", "complete story"
|
||||
]
|
||||
|
||||
if any(keyword in generationHint for keyword in longContentKeywords):
|
||||
return "complex"
|
||||
|
||||
# Default to simple
|
||||
return "simple"
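The keyword check above uses substring matching, so a hint containing "carefully" matches the keyword "full" and is classified as complex. One possible alternative, shown here only as a hypothetical sketch and not as the method in this file, is to match whole words with a regular expression.

```python
import re

LONG_CONTENT_KEYWORDS = ["chapter", "long", "detailed", "comprehensive",
                         "extensive", "full", "complete story"]
# \b anchors each keyword to word boundaries so "full" does not match "carefully"
_LONG_CONTENT_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, LONG_CONTENT_KEYWORDS)) + r")\b"
)


def classifyByHint(generationHint: str) -> str:
    """Hypothetical variant of the keyword check that matches whole words only."""
    return "complex" if _LONG_CONTENT_PATTERN.search(generationHint.lower()) else "simple"


print(classifyByHint("A carefully worded summary"))
print(classifyByHint("Full chapter on market trends"))
```

Whole-word matching is stricter: it no longer catches inflected forms such as "chapters", so the keyword list would need to cover those explicitly if they matter.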
|
||||
|
||||
def _extractMeaningfulHint(
|
||||
self,
|
||||
sectionId: str,
|
||||
contentType: str,
|
||||
elements: List[Any]
|
||||
) -> str:
|
||||
"""
|
||||
Extract meaningful generation hint from section ID, content type, or elements.
|
||||
|
||||
Args:
|
||||
sectionId: Section identifier (e.g., "section_heading_current_state")
|
||||
contentType: Content type (e.g., "heading", "paragraph")
|
||||
elements: Existing elements if any
|
||||
|
||||
Returns:
|
||||
Meaningful generation hint string
|
||||
"""
|
||||
sectionIdLower = sectionId.lower()
|
||||
|
||||
# Try to extract text from existing elements first (most accurate)
|
||||
if isinstance(elements, list) and elements:
|
||||
firstElement = elements[0]
|
||||
if isinstance(firstElement, dict):
|
||||
if "text" in firstElement and firstElement["text"]:
|
||||
if contentType == "heading":
|
||||
return firstElement["text"]
|
||||
elif contentType == "paragraph":
|
||||
return f"Content paragraph: {firstElement['text'][:50]}..."
|
||||
|
||||
# Extract meaningful text from section ID
|
||||
# Remove common prefixes: "section_", "section_heading_", "section_paragraph_", etc.
|
||||
meaningfulPart = sectionId
|
||||
for prefix in ["section_heading_", "section_paragraph_", "section_bullet_list_",
|
||||
"section_code_block_", "section_image_", "section_"]:
|
||||
if meaningfulPart.lower().startswith(prefix):
|
||||
meaningfulPart = meaningfulPart[len(prefix):]
|
||||
break
|
||||
|
||||
# Convert snake_case to Title Case
|
||||
# e.g., "current_state" -> "Current State"
|
||||
words = meaningfulPart.replace("_", " ").split()
|
||||
titleCase = " ".join(word.capitalize() for word in words if word)
|
||||
|
||||
# Handle special cases
|
||||
if "introduction" in sectionIdLower or "intro" in sectionIdLower:
|
||||
return "Introduction paragraph"
|
||||
elif "conclusion" in sectionIdLower:
|
||||
return "Conclusion paragraph"
|
||||
elif "footer" in sectionIdLower or "copyright" in sectionIdLower:
|
||||
return "Footer content"
|
||||
elif "title" in sectionIdLower and "main" in sectionIdLower:
|
||||
# Main title - try to get from document title or use generic
|
||||
return "Main document title"
|
||||
|
||||
# Create hint based on content type and extracted text
|
||||
if contentType == "heading":
|
||||
if titleCase:
|
||||
return titleCase
|
||||
else:
|
||||
return "Section heading"
|
||||
elif contentType == "paragraph":
|
||||
if titleCase:
|
||||
return f"Content paragraph about {titleCase.lower()}"
|
||||
else:
|
||||
return "Content paragraph"
|
||||
elif contentType == "bullet_list":
|
||||
if titleCase:
|
||||
return f"Bullet list: {titleCase.lower()}"
|
||||
else:
|
||||
return "Bullet list items"
|
||||
elif contentType == "code_block":
|
||||
return "Code content"
|
||||
else:
|
||||
if titleCase:
|
||||
return f"Content for {titleCase.lower()}"
|
||||
else:
|
||||
return f"Content for {contentType} section"
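The ID-to-hint conversion in `_extractMeaningfulHint` comes down to stripping a known prefix and title-casing the rest. The sketch below isolates that step; `titleFromSectionId` is a hypothetical name, and the prefix list is copied from the method above.

```python
PREFIXES = ["section_heading_", "section_paragraph_", "section_bullet_list_",
            "section_code_block_", "section_image_", "section_"]


def titleFromSectionId(sectionId: str) -> str:
    """Strip the first matching prefix and convert snake_case to Title Case."""
    for prefix in PREFIXES:
        if sectionId.lower().startswith(prefix):
            sectionId = sectionId[len(prefix):]
            break
    return " ".join(word.capitalize() for word in sectionId.replace("_", " ").split())


print(titleFromSectionId("section_heading_current_state"))
print(titleFromSectionId("section_table_results"))
```

The order of the prefix list matters: the generic `"section_"` entry comes last so that more specific prefixes win. There is no `"section_table_"` entry, so table IDs keep the word "Table" in the derived title.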
|
||||
|
||||
def _extractImagePrompts(
|
||||
self,
|
||||
structure: Dict[str, Any]
|
||||
) -> Dict[str, str]:
|
||||
"""
|
||||
Extract image generation prompts from structure.
|
||||
Maps section_id -> image_prompt
|
||||
"""
|
||||
imagePrompts = {}
|
||||
|
||||
for doc in structure.get("documents", []):
|
||||
for section in doc.get("sections", []):
|
||||
if section.get("content_type") == "image":
|
||||
sectionId = section.get("id")
|
||||
imagePrompt = section.get("image_prompt")
|
||||
if sectionId and imagePrompt:
|
||||
imagePrompts[sectionId] = imagePrompt
|
||||
|
||||
return imagePrompts
|
||||
|
||||
def _formatCachedContent(
|
||||
self,
|
||||
cachedContent: Dict[str, Any]
|
||||
) -> str:
|
||||
"""
|
||||
Format cached content for prompt inclusion.
|
||||
"""
|
||||
try:
|
||||
extractedContent = cachedContent.get("extractedContent", [])
|
||||
if not extractedContent:
|
||||
return "No content extracted."
|
||||
|
||||
# Format ContentPart objects
|
||||
formattedParts = []
|
||||
for extracted in extractedContent:
|
||||
if hasattr(extracted, 'parts'):
|
||||
for part in extracted.parts:
|
||||
if hasattr(part, 'content'):
|
||||
formattedParts.append(part.content)
|
||||
elif isinstance(extracted, dict):
|
||||
formattedParts.append(str(extracted))
|
||||
else:
|
||||
formattedParts.append(str(extracted))
|
||||
|
||||
return "\n\n".join(formattedParts) if formattedParts else "No content extracted."
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Error formatting cached content: {str(e)}")
|
||||
return "Error formatting cached content."
|
||||
|
||||
def _enhanceImagePrompt(self, generationHint: str) -> str:
|
||||
"""
|
||||
Enhance generation hint to be a proper image generation prompt.
|
||||
Adds visual details and style guidance if missing.
|
||||
"""
|
||||
# If hint already contains visual details, use as-is
|
||||
visualKeywords = ["illustration", "image", "picture", "visual", "depict", "show", "drawing"]
|
||||
if any(keyword.lower() in generationHint.lower() for keyword in visualKeywords):
|
||||
return generationHint
|
||||
|
||||
# Enhance with visual description
|
||||
enhanced = f"Create a professional illustration: {generationHint}"
|
||||
return enhanced
|
||||
|
||||
def _formatExistingImages(self, imageDocuments: List[Dict[str, Any]]) -> str:
|
||||
"""Format existing images list for prompt inclusion"""
|
||||
if not imageDocuments:
|
||||
return "No existing images provided."
|
||||
|
||||
formatted = []
|
||||
for i, imgDoc in enumerate(imageDocuments, 1):
|
||||
formatted.append(f"{i}. Image ID: {imgDoc.get('id')}")
|
||||
formatted.append(f" File Name: {imgDoc.get('fileName', 'Unknown')}")
|
||||
formatted.append(f" MIME Type: {imgDoc.get('mimeType', 'Unknown')}")
|
||||
formatted.append(f" Alt Text: {imgDoc.get('altText', 'Image')}")
|
||||
formatted.append("")
|
||||
|
||||
return "\n".join(formatted)
|
||||
|
||||
def _getUserLanguage(self) -> str:
|
||||
"""Get user language for document generation"""
|
||||
try:
|
||||
if self.services:
|
||||
if hasattr(self.services, 'currentUserLanguage') and self.services.currentUserLanguage:
|
||||
return self.services.currentUserLanguage
|
||||
elif hasattr(self.services, 'user') and self.services.user and hasattr(self.services.user, 'language'):
|
||||
return self.services.user.language
|
||||
except Exception:
|
||||
pass
|
||||
return 'en' # Default fallback
|
||||
|
||||
|
|
@ -2,6 +2,7 @@
|
|||
# All rights reserved.
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
from typing import Any, Dict, List, Optional, Tuple, Union, Type, TypeVar
|
||||
from pydantic import BaseModel, ValidationError
|
||||
|
||||
|
|
@ -11,10 +12,32 @@ T = TypeVar('T', bound=BaseModel)
|
|||
|
||||
|
||||
def stripCodeFences(text: str) -> str:
|
||||
"""Remove ```json / ``` fences and surrounding whitespace if present."""
|
||||
"""Remove ```json / ``` fences and surrounding whitespace if present.
|
||||
Also removes [SOURCE: ...] and [END SOURCE] tags that may wrap the JSON."""
|
||||
if not text:
|
||||
return text
|
||||
s = text.strip()
|
||||
|
||||
# Remove [SOURCE: ...] tags at the beginning
|
||||
if s.startswith("[SOURCE:"):
|
||||
# Find the end of the SOURCE tag (newline or end of string)
|
||||
end_pos = s.find("\n")
|
||||
if end_pos != -1:
|
||||
s = s[end_pos+1:]
|
||||
else:
|
||||
# No newline, entire string is SOURCE tag
|
||||
return ""
|
||||
|
||||
# Remove [END SOURCE] tags at the end
|
||||
if s.endswith("[END SOURCE]"):
|
||||
# Find the start of END SOURCE tag (newline before it)
|
||||
start_pos = s.rfind("\n[END SOURCE]")
|
||||
if start_pos != -1:
|
||||
s = s[:start_pos]
|
||||
else:
|
||||
# No newline, entire string is END SOURCE tag
|
||||
return ""
|
||||
|
||||
# Handle opening fence (may or may not have closing fence)
|
||||
if s.startswith("```"):
|
||||
# Remove first triple backticks
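The `[SOURCE: ...]` / `[END SOURCE]` handling added to `stripCodeFences` can be sketched in isolation. The helper name `stripSourceTags` is hypothetical and the fence removal is left out; the tag logic follows the hunk above, assuming the opening tag occupies the first line and the closing tag sits on its own last line.

```python
def stripSourceTags(text: str) -> str:
    """Hypothetical helper: drop a leading [SOURCE: ...] line and a trailing [END SOURCE] line."""
    if not text:
        return text
    s = text.strip()

    # Leading tag: everything up to the first newline is the tag line
    if s.startswith("[SOURCE:"):
        newline = s.find("\n")
        if newline == -1:
            return ""  # the whole string is the tag
        s = s[newline + 1:]

    # Trailing tag: cut at the newline that precedes it
    if s.endswith("[END SOURCE]"):
        start = s.rfind("\n[END SOURCE]")
        if start == -1:
            return ""  # nothing but the closing tag is left
        s = s[:start]

    return s.strip()


wrapped = '[SOURCE: report.pdf]\n{"a": 1}\n[END SOURCE]'
print(stripSourceTags(wrapped))  # {"a": 1}
```

A payload that shares a line with the closing tag (no newline before `[END SOURCE]`) is discarded entirely under this logic, which matches the `return ""` branch in the hunk.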
|
||||
|
|
@ -199,9 +222,9 @@ def closeJsonStructures(text: str) -> str:
|
|||
|
||||
# Handle unterminated strings: find the last unclosed string
|
||||
# Look for patterns like: "value" or "value\n (unterminated)
|
||||
# Simple heuristic: if we end with an unterminated string (odd number of quotes at end)
|
||||
# Try to close it by finding the last opening quote and closing it
|
||||
# Check if we're in the middle of a string value when text ends
|
||||
if result.strip():
|
||||
# re is already imported at module level
|
||||
# Count quotes - if odd number, we have an unterminated string
|
||||
quoteCount = result.count('"')
|
||||
if quoteCount % 2 == 1:
|
||||
|
|
@ -219,6 +242,64 @@ def closeJsonStructures(text: str) -> str:
|
|||
# Find where the string should end (before next comma, bracket, or brace)
|
||||
# For now, just close it at the end
|
||||
result += '"'
|
||||
else:
|
||||
# Even number of quotes, but might still be in middle of string if cut off
|
||||
# More robust detection: check if text ends with alphanumeric/text chars after a quote
|
||||
# This handles cases like: "text": "value cut off mid-word
|
||||
|
||||
# Pattern 1: ends with colon + quote + text (no closing quote)
|
||||
if re.search(r':\s*"[^"]*$', result):
|
||||
# We're in the middle of a string value, close it
|
||||
result += '"'
|
||||
else:
|
||||
# Pattern 2: find last quote and check what comes after
|
||||
lastQuotePos = result.rfind('"')
|
||||
if lastQuotePos >= 0:
|
||||
afterQuote = result[lastQuotePos + 1:]
|
||||
# If after quote we have text (alphanumeric/whitespace) but no closing quote/comma/brace
|
||||
# and the text doesn't end with structural characters, we're likely in a string
|
||||
if afterQuote:
|
||||
# Check if it looks like we're in a string value (has text, no closing quote)
|
||||
# Pattern: ends with letters/numbers/spaces, not ending with quote, comma, }, or ]
|
||||
if re.search(r'[a-zA-Z0-9\s]$', result) and not re.match(r'^\s*[,}\]]', afterQuote):
|
||||
# Check if it's escaped
|
||||
escapeCount = 0
|
||||
i = lastQuotePos - 1
|
||||
while i >= 0 and result[i] == '\\':
|
||||
escapeCount += 1
|
||||
i -= 1
|
||||
if escapeCount % 2 == 0:
|
||||
# Verify we're actually in a string context (not in a key name)
|
||||
# Look backwards to see if we have ": " before the quote (value context)
|
||||
beforeQuote = result[:lastQuotePos]
|
||||
# Check if we're in a value context (has ": " before quote) or in an array (has "[ before quote)
|
||||
if re.search(r':\s*"', beforeQuote[-50:]) or re.search(r'\[\s*"', beforeQuote[-50:]):
|
||||
result += '"'
|
||||
# Also check if text ends with alphanumeric (likely cut off mid-word)
|
||||
elif re.search(r'[a-zA-Z]$', result):
|
||||
# If we end with a letter and have a quote before it, likely in a string
|
||||
result += '"'
|
||||
|
||||
# Final fallback: if text ends with alphanumeric and we have quotes, try to close the last string
|
||||
# This handles edge cases where patterns above didn't match
|
||||
if result.strip() and re.search(r'[a-zA-Z0-9]$', result):
|
||||
# Count quotes - if we have quotes and end with text, might be in a string
|
||||
if quoteCount > 0:
|
||||
lastQuotePos = result.rfind('"')
|
||||
if lastQuotePos >= 0:
|
||||
afterQuote = result[lastQuotePos + 1:]
|
||||
# If after quote is text (not empty, not structural), close it
|
||||
if afterQuote and re.search(r'^[a-zA-Z0-9\s]+$', afterQuote[:50]): # Check first 50 chars after quote
|
||||
# Make sure we're not already closed (check if next char would be quote/comma/brace)
|
||||
if not result.endswith('"') and not result.endswith(',') and not result.endswith('}') and not result.endswith(']'):
|
||||
# Check if escaped
|
||||
escapeCount = 0
|
||||
i = lastQuotePos - 1
|
||||
while i >= 0 and result[i] == '\\':
|
||||
escapeCount += 1
|
||||
i -= 1
|
||||
if escapeCount % 2 == 0:
|
||||
result += '"'
|
||||
|
||||
# Count open/close brackets and braces
|
||||
openBraces = result.count('{')
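Both fallback branches above decide whether the last quote is a real delimiter by counting the backslashes directly in front of it. A minimal sketch of that parity check, pulled into a hypothetical helper named `isEscapedQuote` (the name is an assumption; the loop mirrors the `escapeCount` logic in the hunk):

```python
def isEscapedQuote(text: str, quotePos: int) -> bool:
    """Return True when the quote at quotePos is preceded by an odd number of backslashes."""
    escapeCount = 0
    i = quotePos - 1
    # Walk left over the run of backslashes that immediately precedes the quote
    while i >= 0 and text[i] == "\\":
        escapeCount += 1
        i -= 1
    # Odd count: the last backslash escapes the quote. Even count: the backslashes pair up.
    return escapeCount % 2 == 1


print(isEscapedQuote('a\\"b', 2))    # True  (one backslash before the quote)
print(isEscapedQuote('a\\\\"b', 3))  # False (two backslashes form an escaped backslash)
print(isEscapedQuote('"b', 0))       # False (nothing precedes the quote)
```

The hunk closes the string only when the count is even, that is, when the last quote really opens a string.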
|
||||
|
|
@ -309,7 +390,7 @@ def _removeLastIncompleteItem(items: List[str], original_text: str) -> List[str]
|
|||
Remove the last item if it appears to be incomplete/corrupted.
|
||||
This prevents corrupted data from being included in the final result.
|
||||
"""
|
||||
import re
|
||||
# re is already imported at module level
|
||||
|
||||
if not items:
|
||||
return items
|
||||
|
|
@ -360,7 +441,7 @@ def _extractGenericContent(text: str) -> List[Dict[str, Any]]:
|
|||
|
||||
CRITICAL: Must preserve original content_type and id from the JSON structure!
|
||||
"""
|
||||
import re
|
||||
# re is already imported at module level
|
||||
|
||||
sections = []
|
||||
|
||||
|
|
@ -967,7 +1048,7 @@ def _extractCutOffElements(incomplete_section: Dict[str, Any], raw_json: str) ->
|
|||
if not cut_off_element:
|
||||
# Extract the last incomplete part from raw JSON
|
||||
# Find the last incomplete string/number/array
|
||||
import re
|
||||
# re is already imported at module level
|
||||
# Look for incomplete string at the end
|
||||
incomplete_match = re.search(r'"([^"]*?)(?:"|$)', raw_json[-500:], re.DOTALL)
|
||||
if incomplete_match:
|
||||
|
|
@ -987,7 +1068,7 @@ def _extractCutOffFromElement(element: Dict[str, Any], raw_json: str) -> Optiona
|
|||
|
||||
This helps identify where exactly to continue within nested structures.
|
||||
"""
|
||||
import re
|
||||
# re is already imported at module level
|
||||
|
||||
# Check for code_block with nested JSON
|
||||
if "code" in element:
|
||||
|
|
|
|||
7
modules/workflows/methods/methodAi/__init__.py
Normal file
|
|
@ -0,0 +1,7 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
from .methodAi import MethodAi
|
||||
|
||||
__all__ = ['MethodAi']
|
||||
|
||||
22
modules/workflows/methods/methodAi/actions/__init__.py
Normal file
|
|
@ -0,0 +1,22 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""Action modules for AI operations."""
|
||||
|
||||
# Export all actions
|
||||
from .process import process
|
||||
from .webResearch import webResearch
|
||||
from .summarizeDocument import summarizeDocument
|
||||
from .translateDocument import translateDocument
|
||||
from .convertDocument import convertDocument
|
||||
from .generateDocument import generateDocument
|
||||
|
||||
__all__ = [
|
||||
'process',
|
||||
'webResearch',
|
||||
'summarizeDocument',
|
||||
'translateDocument',
|
||||
'convertDocument',
|
||||
'generateDocument',
|
||||
]
|
||||
|
||||
|
|
@ -0,0 +1,52 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Convert Document action for AI operations.
|
||||
Converts documents between different formats (PDF→Word, Excel→CSV, etc.).
|
||||
"""
|
||||
|
||||
import logging
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def convertDocument(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Convert documents between different formats (PDF→Word, Excel→CSV, etc.).
|
||||
- Input requirements: documentList (required); targetFormat (required).
|
||||
- Output format: Document in target format.
|
||||
|
||||
Parameters:
|
||||
- documentList (list, required): Document reference(s) to convert.
|
||||
- targetFormat (str, required): Target format extension (docx, pdf, xlsx, csv, txt, html, json, md, etc.).
|
||||
- preserveStructure (bool, optional): Whether to preserve document structure (headings, tables, etc.). Default: True.
|
||||
"""
|
||||
documentList = parameters.get("documentList", [])
|
||||
if not documentList:
|
||||
return ActionResult.isFailure(error="documentList is required")
|
||||
|
||||
targetFormat = parameters.get("targetFormat")
|
||||
if not targetFormat:
|
||||
return ActionResult.isFailure(error="targetFormat is required")
|
||||
|
||||
preserveStructure = parameters.get("preserveStructure", True)
|
||||
|
||||
# Normalize format (remove leading dot if present)
|
||||
normalizedFormat = targetFormat.strip().lstrip('.').lower()
|
||||
|
||||
aiPrompt = f"Convert the provided document(s) to {normalizedFormat.upper()} format."
|
||||
if preserveStructure:
|
||||
aiPrompt += " Preserve all document structure including headings, tables, formatting, lists, and layout."
|
||||
aiPrompt += " Ensure the converted document maintains the same content and information as the original."
|
||||
|
||||
return await self.process({
|
||||
"aiPrompt": aiPrompt,
|
||||
"documentList": documentList,
|
||||
"resultType": normalizedFormat
|
||||
})
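The prompt construction in `convertDocument` is plain string handling and can be checked without the AI service. The sketch below uses a hypothetical function `buildConvertPrompt` that returns the normalized format together with the prompt; the prompt wording is taken from the action above.

```python
from typing import Tuple


def buildConvertPrompt(targetFormat: str, preserveStructure: bool = True) -> Tuple[str, str]:
    """Hypothetical helper: normalize the format and assemble the conversion prompt."""
    # Accept inputs such as ".DOCX" or " pdf " and reduce them to "docx" / "pdf"
    normalized = targetFormat.strip().lstrip(".").lower()

    prompt = f"Convert the provided document(s) to {normalized.upper()} format."
    if preserveStructure:
        prompt += " Preserve all document structure including headings, tables, formatting, lists, and layout."
    prompt += " Ensure the converted document maintains the same content and information as the original."
    return normalized, prompt


fmt, prompt = buildConvertPrompt(" .DOCX ")
print(fmt)     # docx
print(prompt)
```

The normalized value is what the action passes on as `resultType`, so a leading dot or stray whitespace in `targetFormat` does not reach `process`.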
|
||||
|
||||
154
modules/workflows/methods/methodAi/actions/generateDocument.py
Normal file
|
|
@ -0,0 +1,154 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Generate Document action for AI operations.
|
||||
Wrapper around AI service callAiContent method.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import time
|
||||
from typing import Dict, Any, Optional, List
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
from modules.datamodels.datamodelAi import AiCallOptions, OperationTypeEnum, PriorityEnum, ProcessingModeEnum
|
||||
from modules.datamodels.datamodelWorkflow import AiResponse, DocumentData
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def generateDocument(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Generate documents from scratch or based on templates/inputs using hierarchical approach.
|
||||
- Input requirements: prompt or description (required); optional documentList (for templates/references).
|
||||
- Output format: Document in specified format. Any format supported by dynamically registered renderers is acceptable (default: txt).
|
||||
|
||||
Parameters:
|
||||
- prompt (str, required): Description of the document to generate.
|
||||
- documentList (list, optional): Template documents or reference documents to use as a guide.
|
||||
- documentType (str, optional): Type of document - letter, memo, proposal, contract, etc.
|
||||
- resultType (str, optional): Output format. Any format supported by dynamically registered renderers is acceptable (formats are discovered automatically from renderer registry). Common formats: txt, html, pdf, docx, md, json, csv, xlsx, pptx, png, jpg. Default: txt.
|
||||
- maxSectionLength (int, optional): Maximum words for simple sections. Default: 500.
|
||||
- parallelGeneration (bool, optional): Enable parallel section generation. Default: True.
|
||||
- progressLogging (bool, optional): Send ChatLog progress updates. Default: True.
|
||||
"""
|
||||
prompt = parameters.get("prompt")
|
||||
if not prompt:
|
||||
return ActionResult.isFailure(error="prompt is required")
|
||||
|
||||
documentList = parameters.get("documentList", [])
|
||||
documentType = parameters.get("documentType")
|
||||
resultType = parameters.get("resultType", "txt")
|
||||
|
||||
# Auto-detect format from prompt if not explicitly provided
|
||||
if resultType == "txt" and prompt:
|
||||
promptLower = prompt.lower()
|
||||
if "html" in promptLower or "html5" in promptLower:
|
||||
resultType = "html"
|
||||
logger.info("Auto-detected HTML format from prompt")
|
||||
elif "pdf" in promptLower:
|
||||
resultType = "pdf"
|
||||
logger.info("Auto-detected PDF format from prompt")
|
||||
elif "markdown" in promptLower or " md " in promptLower or promptLower.endswith(" md"):
|
||||
resultType = "md"
|
||||
logger.info("Auto-detected Markdown format from prompt")
|
||||
elif ("text" in promptLower or "txt" in promptLower) and "html" not in promptLower:
|
||||
resultType = "txt"
|
||||
logger.info("Auto-detected Text format from prompt")
|
||||
|
||||
# Create operation ID for progress tracking
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
operationId = f"doc_gen_{workflowId}_{int(time.time())}"
|
||||
parentOperationId = parameters.get('parentOperationId')
|
||||
|
||||
try:
|
||||
# Convert documentList to DocumentReferenceList if needed
|
||||
docRefList = None
|
||||
if documentList:
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
|
||||
if isinstance(documentList, DocumentReferenceList):
|
||||
docRefList = documentList
|
||||
elif isinstance(documentList, str):
|
||||
docRefList = DocumentReferenceList.from_string_list([documentList])
|
||||
elif isinstance(documentList, list):
|
||||
docRefList = DocumentReferenceList.from_string_list(documentList)
|
||||
else:
|
||||
docRefList = DocumentReferenceList(references=[])
|
||||
|
||||
# Prepare title
|
||||
title = parameters.get("documentType") or "Generated Document"
|
||||
|
||||
# Call AI service for document generation
|
||||
# callAiContent handles documentList internally via Phases 5A-5E
|
||||
options = AiCallOptions(
|
||||
operationType=OperationTypeEnum.DATA_GENERATE,
|
||||
priority=PriorityEnum.BALANCED,
|
||||
processingMode=ProcessingModeEnum.DETAILED,
|
||||
compressPrompt=False,
|
||||
compressContext=False
|
||||
)
|
||||
|
||||
aiResponse: AiResponse = await self.services.ai.callAiContent(
|
||||
prompt=prompt,
|
||||
options=options,
|
||||
documentList=docRefList,  # Pass documentList directly; callAiContent runs Phases 5A-5E
|
||||
outputFormat=resultType,
|
||||
title=title,
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
# Convert AiResponse to ActionResult
|
||||
documents = []
|
||||
|
||||
# Convert DocumentData to ActionDocument
|
||||
if aiResponse.documents:
|
||||
for docData in aiResponse.documents:
|
||||
documents.append(ActionDocument(
|
||||
documentName=docData.documentName,
|
||||
documentData=docData.documentData,
|
||||
mimeType=docData.mimeType,
|
||||
sourceJson=docData.sourceJson if hasattr(docData, 'sourceJson') else None
|
||||
))
|
||||
|
||||
# If no documents but content exists, create a document from content
|
||||
if not documents and aiResponse.content:
|
||||
# Determine document name from metadata
|
||||
docName = f"document.{resultType}"
|
||||
if aiResponse.metadata and aiResponse.metadata.filename:
|
||||
docName = aiResponse.metadata.filename
|
||||
elif aiResponse.metadata and aiResponse.metadata.title:
|
||||
import re
|
||||
sanitized = re.sub(r"[^a-zA-Z0-9._-]", "_", aiResponse.metadata.title)
|
||||
sanitized = re.sub(r"_+", "_", sanitized).strip("_")
|
||||
if sanitized:
|
||||
if not sanitized.lower().endswith(f".{resultType}"):
|
||||
docName = f"{sanitized}.{resultType}"
|
||||
else:
|
||||
docName = sanitized
|
||||
|
||||
# Determine mime type
|
||||
mimeType = "text/plain"
|
||||
if resultType == "html":
|
||||
mimeType = "text/html"
|
||||
elif resultType == "json":
|
||||
mimeType = "application/json"
|
||||
elif resultType == "pdf":
|
||||
mimeType = "application/pdf"
|
||||
elif resultType == "md":
|
||||
mimeType = "text/markdown"
|
||||
|
||||
documents.append(ActionDocument(
|
||||
documentName=docName,
|
||||
documentData=aiResponse.content.encode('utf-8') if isinstance(aiResponse.content, str) else aiResponse.content,
|
||||
mimeType=mimeType
|
||||
))
|
||||
|
||||
return ActionResult.isSuccess(documents=documents)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in document generation: {str(e)}")
|
||||
return ActionResult.isFailure(error=str(e))
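The fallback branch of `generateDocument` derives a file name from the response title and picks a MIME type from the result type. A minimal sketch of both steps, assuming hypothetical helper names `buildDocumentName` and `guessMimeType`; the regular expressions and the MIME mapping are the ones used in the action above.

```python
import re

# Mapping copied from the if/elif chain in the action; anything else falls back to text/plain
MIME_TYPES = {
    "html": "text/html",
    "json": "application/json",
    "pdf": "application/pdf",
    "md": "text/markdown",
}


def buildDocumentName(title: str, resultType: str) -> str:
    """Hypothetical helper: turn a free-form title into a safe file name."""
    # Replace every character outside [a-zA-Z0-9._-] with an underscore
    sanitized = re.sub(r"[^a-zA-Z0-9._-]", "_", title)
    # Collapse runs of underscores and trim them from both ends
    sanitized = re.sub(r"_+", "_", sanitized).strip("_")
    if not sanitized:
        return f"document.{resultType}"
    if sanitized.lower().endswith(f".{resultType}"):
        return sanitized  # extension already present
    return f"{sanitized}.{resultType}"


def guessMimeType(resultType: str) -> str:
    """Hypothetical helper: look up the MIME type with a text/plain default."""
    return MIME_TYPES.get(resultType, "text/plain")


print(buildDocumentName("Q3 Report: Final!", "pdf"))  # Q3_Report_Final.pdf
print(buildDocumentName("notes.md", "md"))            # notes.md
print(guessMimeType("docx"))                          # text/plain
```

Binary formats such as `docx` also map to `text/plain` in this fallback, so the fallback is only a good fit for text-like results.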
|
||||
|
||||
198
modules/workflows/methods/methodAi/actions/process.py
Normal file
|
|
@ -0,0 +1,198 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Process action for AI operations.
|
||||
Universal AI document processing action.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import time
|
||||
import json
|
||||
from typing import Dict, Any, List, Optional
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
from modules.datamodels.datamodelAi import AiCallOptions
|
||||
from modules.datamodels.datamodelExtraction import ContentPart
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def process(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Universal AI document processing action - accepts MULTIPLE input documents in ANY format (docx, pdf, json, txt, xlsx, html, images, etc.) and processes them together with a prompt to produce MULTIPLE output documents in ANY specified format (via resultType). Use for document generation, format conversion, content transformation, analysis, summarization, translation, extraction, comparison, and any AI-powered document manipulation.
|
||||
- Input requirements: aiPrompt (required); optional documentList (can contain multiple documents in any format).
|
||||
- Output format: Multiple documents in the same format per call (via resultType: txt, json, pdf, docx, xlsx, pptx, png, jpg, etc.). The AI can generate multiple files based on the prompt (e.g., "create separate documents for each section"). Default: txt.
|
||||
- Key capabilities: Can process any number of input documents together, extract data from mixed formats, combine information, generate multiple output files, transform between formats, perform analysis/comparison/summarization on document sets.
|
||||
|
||||
Parameters:
|
||||
- aiPrompt (str, required): Instruction for the AI describing what processing to perform.
|
||||
- documentList (list, optional): Document reference(s) in any format to use as input/context.
|
||||
- resultType (str, optional): Output file extension (txt, json, md, csv, xml, html, pdf, docx, xlsx, png, etc.). All output documents will use this format. Default: txt.
|
||||
"""
|
||||
try:
|
||||
# Init progress logger
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
operationId = f"ai_process_{workflowId}_{int(time.time())}"
|
||||
|
||||
# Start progress tracking
|
||||
parentOperationId = parameters.get('parentOperationId')
|
||||
self.services.chat.progressLogStart(
|
||||
operationId,
|
||||
"Generate",
|
||||
"AI Processing",
|
||||
f"Format: {parameters.get('resultType', 'txt')}",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
aiPrompt = parameters.get("aiPrompt")
|
||||
logger.info(f"aiPrompt extracted: '{aiPrompt}' (type: {type(aiPrompt)})")
|
||||
|
||||
# Update progress - preparing parameters
|
||||
self.services.chat.progressLogUpdate(operationId, 0.2, "Preparing parameters")
|
||||
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
|
||||
documentListParam = parameters.get("documentList")
|
||||
# Convert to DocumentReferenceList if needed
|
||||
if documentListParam is None:
|
||||
documentList = DocumentReferenceList(references=[])
|
||||
elif isinstance(documentListParam, DocumentReferenceList):
|
||||
documentList = documentListParam
|
||||
elif isinstance(documentListParam, str):
|
||||
documentList = DocumentReferenceList.from_string_list([documentListParam])
|
||||
elif isinstance(documentListParam, list):
|
||||
documentList = DocumentReferenceList.from_string_list(documentListParam)
|
||||
else:
|
||||
logger.error(f"Invalid documentList type: {type(documentListParam)}")
|
||||
documentList = DocumentReferenceList(references=[])
|
||||
|
||||
resultType = parameters.get("resultType", "txt")
|
||||
|
||||
|
||||
if not aiPrompt:
|
||||
logger.error(f"aiPrompt is missing or empty. Parameters: {parameters}")
|
||||
return ActionResult.isFailure(
|
||||
error="AI prompt is required"
|
||||
)
|
||||
|
||||
# Determine output extension and default MIME type without duplicating service logic
|
||||
normalized_result_type = (str(resultType).strip().lstrip('.').lower() or "txt")
|
||||
output_extension = f".{normalized_result_type}"
|
||||
output_mime_type = "application/octet-stream" # Prefer service-provided mimeType when available
|
||||
logger.info(f"Using result type: {resultType} -> {output_extension}")
|
||||
|
||||
# Check if contentParts are already provided (from context.extractContent or other sources)
|
||||
contentParts: Optional[List[ContentPart]] = None
|
||||
if "contentParts" in parameters:
|
||||
contentParts = parameters.get("contentParts")
|
||||
if contentParts and not isinstance(contentParts, list):
|
||||
# Try to extract from ContentExtracted if it's an ActionDocument
|
||||
if hasattr(contentParts, 'parts'):
|
||||
contentParts = contentParts.parts
|
||||
else:
|
||||
logger.warning(f"Invalid contentParts type: {type(contentParts)}, treating as empty")
|
||||
contentParts = None
|
||||
|
||||
# Update progress - preparing AI call
|
||||
self.services.chat.progressLogUpdate(operationId, 0.4, "Preparing AI call")
|
||||
|
||||
# Build options
|
||||
output_format = output_extension.replace('.', '') or 'txt'
|
||||
options = AiCallOptions(
|
||||
resultFormat=output_format
|
||||
)
|
||||
|
||||
# Update progress - calling AI
|
||||
self.services.chat.progressLogUpdate(operationId, 0.6, "Calling AI")
|
||||
|
||||
# Use unified callAiContent method
|
||||
# If contentParts provided (pre-extracted), use them directly
|
||||
# Otherwise, pass documentList and let callAiContent handle Phases 5A-5E internally
|
||||
# Note: ContentExtracted documents (from context.extractContent) are now handled
|
||||
# automatically in _extractAndPrepareContent() (Phase 5B)
|
||||
if contentParts:
|
||||
# Pre-extracted ContentParts - use them directly
|
||||
aiResponse = await self.services.ai.callAiContent(
|
||||
prompt=aiPrompt,
|
||||
options=options,
|
||||
contentParts=contentParts, # Pre-extracted ContentParts
|
||||
outputFormat=output_format,
|
||||
parentOperationId=operationId
|
||||
)
|
||||
else:
|
||||
# Pass documentList - callAiContent handles Phases 5A-5E internally
|
||||
# This includes automatic detection of ContentExtracted documents
|
||||
aiResponse = await self.services.ai.callAiContent(
|
||||
prompt=aiPrompt,
|
||||
options=options,
|
||||
documentList=documentList,  # callAiContent runs Phases 5A-5E
|
||||
outputFormat=output_format,
|
||||
parentOperationId=operationId
|
||||
)
|
||||
|
||||
# Update progress - processing result
|
||||
self.services.chat.progressLogUpdate(operationId, 0.8, "Processing result")
|
||||
|
||||
# Extract documents from AiResponse
|
||||
if aiResponse.documents and len(aiResponse.documents) > 0:
|
||||
action_documents = []
|
||||
for doc in aiResponse.documents:
|
||||
validationMetadata = {
|
||||
"actionType": "ai.process",
|
||||
"resultType": normalized_result_type,
|
||||
"outputFormat": output_format,
|
||||
"hasDocuments": True,
|
||||
"documentCount": len(aiResponse.documents)
|
||||
}
|
||||
action_documents.append(ActionDocument(
|
||||
documentName=doc.documentName,
|
||||
documentData=doc.documentData,
|
||||
mimeType=doc.mimeType or output_mime_type,
|
||||
sourceJson=getattr(doc, 'sourceJson', None), # Preserve source JSON for structure validation
|
||||
validationMetadata=validationMetadata
|
||||
))
|
||||
|
||||
final_documents = action_documents
|
||||
else:
|
||||
# Text response - create document from content
|
||||
extension = output_extension.lstrip('.')
|
||||
meaningful_name = self._generateMeaningfulFileName(
|
||||
base_name="ai",
|
||||
extension=extension,
|
||||
action_name="result"
|
||||
)
|
||||
validationMetadata = {
|
||||
"actionType": "ai.process",
|
||||
"resultType": normalized_result_type,
|
||||
"outputFormat": output_format,
|
||||
"hasDocuments": False,
|
||||
"contentType": "text"
|
||||
}
|
||||
action_document = ActionDocument(
|
||||
documentName=meaningful_name,
|
||||
documentData=aiResponse.content,
|
||||
mimeType=output_mime_type,
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
final_documents = [action_document]
|
||||
|
||||
# Complete progress tracking
|
||||
self.services.chat.progressLogFinish(operationId, True)
|
||||
|
||||
return ActionResult.isSuccess(documents=final_documents)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in AI processing: {str(e)}")
|
||||
|
||||
# Complete progress tracking with failure
|
||||
try:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
except Exception:
|
||||
pass # Don't fail on progress logging errors
|
||||
|
||||
return ActionResult.isFailure(
|
||||
error=str(e)
|
||||
)
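The result-type normalization in `process` is a single expression whose edge cases are easy to miss. The sketch below wraps it in a hypothetical function `normalizeResultType` so it can be exercised directly; the expression itself is unchanged from the action.

```python
from typing import Any


def normalizeResultType(resultType: Any) -> str:
    """Hypothetical wrapper around the normalization expression used in process()."""
    # str() first, then trim whitespace, drop leading dots, lowercase; empty results become "txt"
    return str(resultType).strip().lstrip(".").lower() or "txt"


print(normalizeResultType(".PDF"))  # pdf
print(normalizeResultType("  "))    # txt
print(normalizeResultType(None))    # none
```

The last case is worth noting: an explicit `None` is stringified to `"None"` before the `or "txt"` fallback can apply, so a caller that passes `resultType=None` gets the extension `none`, not the default. Filtering out `None` before building the parameter dict avoids this.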
|
||||
|
||||
|
|
@ -0,0 +1,55 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Summarize Document action for AI operations.
|
||||
Summarizes one or more documents, extracting key points and main ideas.
|
||||
"""
|
||||
|
||||
import logging
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def summarizeDocument(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Summarize one or more documents, extracting key points and main ideas.
|
||||
- Input requirements: documentList (required); optional summaryLength, focus.
|
||||
- Output format: Text document with summary (default: txt, can be overridden with resultType).
|
||||
|
||||
Parameters:
|
||||
- documentList (list, required): Document reference(s) to summarize.
|
||||
- summaryLength (str, optional): Desired summary length - brief, medium, or detailed. Default: medium.
|
||||
- focus (str, optional): Specific aspect to focus on in the summary (e.g., "financial data", "key decisions").
|
||||
- resultType (str, optional): Output file extension (txt, md, docx, etc.). Default: txt.
|
||||
"""
|
||||
documentList = parameters.get("documentList", [])
|
||||
if not documentList:
|
||||
return ActionResult.isFailure(error="documentList is required")
|
||||
|
||||
summaryLength = parameters.get("summaryLength", "medium")
|
||||
focus = parameters.get("focus")
|
||||
resultType = parameters.get("resultType", "txt")
|
||||
|
||||
lengthInstructions = {
|
||||
"brief": "Create a brief summary (2-3 paragraphs)",
|
||||
"medium": "Create a medium-length summary (comprehensive but concise)",
|
||||
"detailed": "Create a detailed summary covering all major points"
|
||||
}
|
||||
lengthInstruction = lengthInstructions.get(summaryLength.lower(), lengthInstructions["medium"])
|
||||
|
||||
aiPrompt = f"Summarize the provided document(s). {lengthInstruction}."
|
||||
if focus:
|
||||
aiPrompt += f" Focus specifically on: {focus}."
|
||||
aiPrompt += " Extract and present the key points, main ideas, and important information in a clear, well-structured format."
|
||||
|
||||
return await self.process({
|
||||
"aiPrompt": aiPrompt,
|
||||
"documentList": documentList,
|
||||
"resultType": resultType
|
||||
})
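The length handling in `summarizeDocument` is a dictionary lookup with a fallback to the medium instruction. A minimal sketch, assuming a hypothetical function `buildSummaryPrompt`; the instruction texts and prompt wording are copied from the action above.

```python
from typing import Optional

LENGTH_INSTRUCTIONS = {
    "brief": "Create a brief summary (2-3 paragraphs)",
    "medium": "Create a medium-length summary (comprehensive but concise)",
    "detailed": "Create a detailed summary covering all major points",
}


def buildSummaryPrompt(summaryLength: str = "medium", focus: Optional[str] = None) -> str:
    """Hypothetical helper: assemble the summarization prompt."""
    # Unknown lengths fall back to the medium instruction; matching is case-insensitive
    instruction = LENGTH_INSTRUCTIONS.get(summaryLength.lower(), LENGTH_INSTRUCTIONS["medium"])

    prompt = f"Summarize the provided document(s). {instruction}."
    if focus:
        prompt += f" Focus specifically on: {focus}."
    prompt += " Extract and present the key points, main ideas, and important information in a clear, well-structured format."
    return prompt


print(buildSummaryPrompt("BRIEF", focus="financial data"))
```

As in the action, `summaryLength` must be a string: an explicit `None` would raise `AttributeError` on `.lower()`, because `dict.get` only applies its default when the key is absent.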
|
||||
|
||||
|
|
@ -0,0 +1,60 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Translate Document action for AI operations.
|
||||
Translates documents to a target language while preserving formatting and structure.
|
||||
"""
|
||||
|
||||
import logging
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def translateDocument(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Translate documents to a target language while preserving formatting and structure.
|
||||
- Input requirements: documentList (required); targetLanguage (required).
|
||||
- Output format: Translated document in same format as input (default) or specified resultType.
|
||||
|
||||
Parameters:
|
||||
- documentList (list, required): Document reference(s) to translate.
|
||||
- targetLanguage (str, required): Target language code or name (e.g., "de", "German", "French", "es").
|
||||
- sourceLanguage (str, optional): Source language if known (e.g., "en", "English"). If not provided, AI will detect.
|
||||
- preserveFormatting (bool, optional): Whether to preserve original formatting. Default: True.
|
||||
- resultType (str, optional): Output file extension. If not specified, uses same format as input.
|
||||
"""
|
||||
documentList = parameters.get("documentList", [])
|
||||
if not documentList:
|
||||
return ActionResult.isFailure(error="documentList is required")
|
||||
|
||||
targetLanguage = parameters.get("targetLanguage")
|
||||
if not targetLanguage:
|
||||
return ActionResult.isFailure(error="targetLanguage is required")
|
||||
|
||||
sourceLanguage = parameters.get("sourceLanguage")
|
||||
preserveFormatting = parameters.get("preserveFormatting", True)
|
||||
resultType = parameters.get("resultType")
|
||||
|
||||
aiPrompt = f"Translate the provided document(s) to {targetLanguage}."
|
||||
if sourceLanguage:
|
||||
aiPrompt += f" The source language is {sourceLanguage}."
|
||||
if preserveFormatting:
|
||||
aiPrompt += " Preserve all formatting, structure, tables, and layout exactly as they appear in the original document."
|
||||
else:
|
||||
aiPrompt += " Focus on accurate translation of content."
|
||||
aiPrompt += " Maintain the same document structure, headings, and organization."
|
||||
|
||||
processParams = {
|
||||
"aiPrompt": aiPrompt,
|
||||
"documentList": documentList
|
||||
}
|
||||
if resultType:
|
||||
processParams["resultType"] = resultType
|
||||
|
||||
return await self.process(processParams)
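The prompt assembly in `translateDocument` can be checked in isolation. The following is a minimal sketch, assuming a standalone helper named `buildTranslatePrompt` (a hypothetical name, not part of this commit) that mirrors the branching above without calling `self.process`.

```python
from typing import Optional


def buildTranslatePrompt(
    targetLanguage: str,
    sourceLanguage: Optional[str] = None,
    preserveFormatting: bool = True,
) -> str:
    """Hypothetical helper: assemble the prompt the same way the action does."""
    prompt = f"Translate the provided document(s) to {targetLanguage}."
    if sourceLanguage:
        # Only mention the source language when the caller supplied one
        prompt += f" The source language is {sourceLanguage}."
    if preserveFormatting:
        prompt += (
            " Preserve all formatting, structure, tables, and layout"
            " exactly as they appear in the original document."
        )
    else:
        prompt += " Focus on accurate translation of content."
    # This sentence is appended in both branches
    prompt += " Maintain the same document structure, headings, and organization."
    return prompt
```

Keeping the string building in a pure function like this would make the wording testable without a running AI service.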
|
||||
|
||||
117
modules/workflows/methods/methodAi/actions/webResearch.py
Normal file
|
|
@ -0,0 +1,117 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Web Research action for AI operations.
|
||||
Web research with two-step process: search for URLs, then crawl content.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import time
|
||||
import re
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def webResearch(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Web research with two-step process: search for URLs, then crawl content.
|
||||
- Input requirements: prompt (required); optional urlList, country, language, researchDepth.
|
||||
- Output format: JSON with research results including URLs and content.
|
||||
|
||||
Parameters:
|
||||
- prompt (str, required): Natural language research instruction.
|
||||
- urlList (list, optional): Specific URLs to crawl, if needed.
|
||||
- country (str, optional): Two-letter country code (lowercase, e.g., ch, us, de).
|
||||
- language (str, optional): Language code (lowercase, e.g., de, en, fr).
|
||||
- researchDepth (str, optional): Research depth - fast, general, or deep. Default: general.
|
||||
"""
|
||||
try:
|
||||
prompt = parameters.get("prompt")
|
||||
if not prompt:
|
||||
return ActionResult.isFailure(error="Research prompt is required")
|
||||
|
||||
# Init progress logger
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
operationId = f"web_research_{workflowId}_{int(time.time())}"
|
||||
|
||||
# Start progress tracking
|
||||
parentOperationId = parameters.get('parentOperationId')
|
||||
self.services.chat.progressLogStart(
|
||||
operationId,
|
||||
"Web Research",
|
||||
"Searching and Crawling",
|
||||
"Extracting URLs and Content",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
# Call webcrawl service - service handles all AI intention analysis and processing
|
||||
result = await self.services.web.performWebResearch(
|
||||
prompt=prompt,
|
||||
urls=parameters.get("urlList", []),
|
||||
country=parameters.get("country"),
|
||||
language=parameters.get("language"),
|
||||
researchDepth=parameters.get("researchDepth", "general"),
|
||||
operationId=operationId
|
||||
)
|
||||
|
||||
# Complete progress tracking
|
||||
self.services.chat.progressLogFinish(operationId, True)
|
||||
|
||||
# Get meaningful filename from research result (generated by intent analyzer)
|
||||
suggestedFilename = result.get("suggested_filename")
|
||||
if suggestedFilename:
|
||||
# Clean and validate filename
|
||||
cleaned = suggestedFilename.strip().strip('"\'')
|
||||
cleaned = cleaned.replace('\n', ' ').replace('\r', ' ').strip()
|
||||
# Ensure it doesn't already have extension
|
||||
if cleaned.lower().endswith('.json'):
|
||||
cleaned = cleaned[:-5]
|
||||
# Validate: should be reasonable length and contain only safe characters
|
||||
if cleaned and len(cleaned) <= 60 and re.match(r'^[a-zA-Z0-9_\-]+$', cleaned):
|
||||
meaningfulName = f"{cleaned}.json"
|
||||
else:
|
||||
# Fallback to generic meaningful filename
|
||||
meaningfulName = self._generateMeaningfulFileName(
|
||||
base_name="web_research",
|
||||
extension="json",
|
||||
action_name="research"
|
||||
)
|
||||
else:
|
||||
# Fallback to generic meaningful filename
|
||||
meaningfulName = self._generateMeaningfulFileName(
|
||||
base_name="web_research",
|
||||
extension="json",
|
||||
action_name="research"
|
||||
)
|
||||
|
||||
validationMetadata = {
|
||||
"actionType": "ai.webResearch",
|
||||
"prompt": prompt,
|
||||
"urlList": parameters.get("urlList", []),
|
||||
"country": parameters.get("country"),
|
||||
"language": parameters.get("language"),
|
||||
"researchDepth": parameters.get("researchDepth", "general"),
|
||||
"resultFormat": "json"
|
||||
}
|
||||
actionDocument = ActionDocument(
|
||||
documentName=meaningfulName,
|
||||
documentData=result,
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
|
||||
return ActionResult.isSuccess(documents=[actionDocument])
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in web research: {str(e)}")
|
||||
try:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
except Exception:
|
||||
pass
|
||||
return ActionResult.isFailure(error=str(e))
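The filename handling in `webResearch` repeats its fallback call in two branches. The following is a minimal sketch, assuming a hypothetical helper `cleanSuggestedFilename` (not part of this commit) that returns `None` whenever the caller should fall back to `_generateMeaningfulFileName`. The cleaning steps and the regular expression are copied from the action above.

```python
import re
from typing import Optional

# Same whitelist as in the action: letters, digits, underscore, hyphen
SAFE_NAME = re.compile(r"^[a-zA-Z0-9_\-]+$")


def cleanSuggestedFilename(suggested: Optional[str], maxLength: int = 60) -> Optional[str]:
    """Hypothetical helper: return a safe '<name>.json' or None if unusable."""
    if not suggested:
        return None
    # Strip surrounding whitespace and quotes, then flatten line breaks
    cleaned = suggested.strip().strip("\"'")
    cleaned = cleaned.replace("\n", " ").replace("\r", " ").strip()
    # Drop an existing extension so it is not appended twice
    if cleaned.lower().endswith(".json"):
        cleaned = cleaned[:-5]
    if cleaned and len(cleaned) <= maxLength and SAFE_NAME.match(cleaned):
        return f"{cleaned}.json"
    return None
```

With such a helper the action could use a single fallback call, for example `meaningfulName = cleanSuggestedFilename(...) or self._generateMeaningfulFileName(...)`.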
|
||||
|
||||
5
modules/workflows/methods/methodAi/helpers/__init__.py
Normal file
|
|
@ -0,0 +1,5 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""Helper modules for AI method operations."""
|
||||
|
||||
59
modules/workflows/methods/methodAi/helpers/csvProcessing.py
Normal file
|
|
@ -0,0 +1,59 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
CSV Processing helper for AI operations.
|
||||
Handles CSV content processing with options.
|
||||
"""
|
||||
|
||||
import logging
|
||||
from typing import Dict, Any
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class CsvProcessingHelper:
|
||||
"""Helper for CSV processing operations"""
|
||||
|
||||
def __init__(self, methodInstance):
|
||||
"""
|
||||
Initialize CSV processing helper.
|
||||
|
||||
Args:
|
||||
methodInstance: Instance of MethodAi (for access to services)
|
||||
"""
|
||||
self.method = methodInstance
|
||||
self.services = methodInstance.services
|
||||
|
||||
def applyCsvOptions(self, csvContent: str, options: Dict[str, Any]) -> str:
|
||||
"""
|
||||
Apply CSV processing options to CSV content.
|
||||
|
||||
Args:
|
||||
csvContent: CSV content as string
|
||||
options: Dictionary with CSV processing options
|
||||
|
||||
Returns:
|
||||
Processed CSV content as string
|
||||
"""
|
||||
if not csvContent:
|
||||
return csvContent
|
||||
|
||||
# Apply options if provided
|
||||
if options:
|
||||
# Handle delimiter option
|
||||
if "delimiter" in options:
|
||||
delimiter = options["delimiter"]
|
||||
# Replace delimiter in content (simple approach)
|
||||
# Note: naive replacement; it also rewrites commas inside quoted fields
|
||||
if delimiter != ",":
|
||||
csvContent = csvContent.replace(",", delimiter)
|
||||
|
||||
# Handle quote character option
|
||||
if "quotechar" in options:
|
||||
quotechar = options["quotechar"]
|
||||
# Replace quote character (simple approach)
|
||||
if quotechar != '"':
|
||||
csvContent = csvContent.replace('"', quotechar)
|
||||
|
||||
return csvContent
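The comments in `applyCsvOptions` describe the string replacement as a simple approach. One possible enhancement, sketched here under the assumption that the input is standard comma-separated CSV with double-quote quoting, is to parse and re-emit the content with Python's `csv` module so that delimiters inside quoted fields are left alone. The function name `convertCsvDialect` is hypothetical and not part of this commit.

```python
import csv
import io


def convertCsvDialect(csvContent: str, delimiter: str = ",", quotechar: str = '"') -> str:
    """Hypothetical helper: re-write comma-separated CSV with another dialect."""
    if not csvContent:
        return csvContent
    # Parse with the default dialect (comma delimiter, double-quote quoting)
    rows = csv.reader(io.StringIO(csvContent, newline=""))
    out = io.StringIO(newline="")
    # The writer quotes a field only when it contains a special character
    writer = csv.writer(out, delimiter=delimiter, quotechar=quotechar, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()
```

Unlike `str.replace`, this keeps a value such as `likes a, b` intact when switching to a semicolon delimiter, and it quotes any field that happens to contain the new delimiter.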
|
||||
|
||||
276
modules/workflows/methods/methodAi/methodAi.py
Normal file
|
|
@ -0,0 +1,276 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
import logging
|
||||
from datetime import datetime, UTC
|
||||
from modules.workflows.methods.methodBase import MethodBase
|
||||
from modules.datamodels.datamodelWorkflowActions import WorkflowActionDefinition, WorkflowActionParameter
|
||||
from modules.shared.frontendTypes import FrontendType
|
||||
|
||||
# Import helpers
|
||||
from .helpers.csvProcessing import CsvProcessingHelper
|
||||
|
||||
# Import actions
|
||||
from .actions.process import process
|
||||
from .actions.webResearch import webResearch
|
||||
from .actions.summarizeDocument import summarizeDocument
|
||||
from .actions.translateDocument import translateDocument
|
||||
from .actions.convertDocument import convertDocument
|
||||
from .actions.generateDocument import generateDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class MethodAi(MethodBase):
|
||||
"""AI processing methods."""
|
||||
|
||||
def __init__(self, services):
|
||||
super().__init__(services)
|
||||
self.name = "ai"
|
||||
self.description = "AI processing methods"
|
||||
|
||||
# Initialize helper modules
|
||||
self.csvProcessing = CsvProcessingHelper(self)
|
||||
|
||||
# RBAC integration: action definitions with actionId
|
||||
self._actions = {
|
||||
"process": WorkflowActionDefinition(
|
||||
actionId="ai.process",
|
||||
description="Universal AI document processing action - accepts multiple input documents in any format and processes them together with a prompt",
|
||||
parameters={
|
||||
"aiPrompt": WorkflowActionParameter(
|
||||
name="aiPrompt",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXTAREA,
|
||||
required=True,
|
||||
description="Instruction for the AI describing what processing to perform"
|
||||
),
|
||||
"documentList": WorkflowActionParameter(
|
||||
name="documentList",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=False,
|
||||
description="Document reference(s) in any format to use as input/context"
|
||||
),
|
||||
"resultType": WorkflowActionParameter(
|
||||
name="resultType",
|
||||
type="str",
|
||||
frontendType=FrontendType.SELECT,
|
||||
frontendOptions=["txt", "json", "md", "csv", "xml", "html", "pdf", "docx", "xlsx", "pptx", "png", "jpg"],
|
||||
required=False,
|
||||
default="txt",
|
||||
description="Output file extension. All output documents will use this format"
|
||||
)
|
||||
},
|
||||
execute=process.__get__(self, self.__class__)
|
||||
),
|
||||
"webResearch": WorkflowActionDefinition(
|
||||
actionId="ai.webResearch",
|
||||
description="Web research with two-step process: search for URLs, then crawl content",
|
||||
parameters={
|
||||
"prompt": WorkflowActionParameter(
|
||||
name="prompt",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXTAREA,
|
||||
required=True,
|
||||
description="Natural language research instruction"
|
||||
),
|
||||
"urlList": WorkflowActionParameter(
|
||||
name="urlList",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.MULTISELECT,
|
||||
required=False,
|
||||
description="Specific URLs to crawl, if needed"
|
||||
),
|
||||
"country": WorkflowActionParameter(
|
||||
name="country",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=False,
|
||||
description="Two-letter country code (lowercase, e.g., ch, us, de)"
|
||||
),
|
||||
"language": WorkflowActionParameter(
|
||||
name="language",
|
||||
type="str",
|
||||
frontendType=FrontendType.SELECT,
|
||||
frontendOptions=["de", "en", "fr", "it", "es"],
|
||||
required=False,
|
||||
description="Language code (lowercase, e.g., de, en, fr)"
|
||||
),
|
||||
"researchDepth": WorkflowActionParameter(
|
||||
name="researchDepth",
|
||||
type="str",
|
||||
frontendType=FrontendType.SELECT,
|
||||
frontendOptions=["fast", "general", "deep"],
|
||||
required=False,
|
||||
default="general",
|
||||
description="Research depth"
|
||||
)
|
||||
},
|
||||
execute=webResearch.__get__(self, self.__class__)
|
||||
),
|
||||
"summarizeDocument": WorkflowActionDefinition(
|
||||
actionId="ai.summarizeDocument",
|
||||
description="Summarize one or more documents, extracting key points and main ideas",
|
||||
parameters={
|
||||
"documentList": WorkflowActionParameter(
|
||||
name="documentList",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=True,
|
||||
description="Document reference(s) to summarize"
|
||||
),
|
||||
"summaryLength": WorkflowActionParameter(
|
||||
name="summaryLength",
|
||||
type="str",
|
||||
frontendType=FrontendType.SELECT,
|
||||
frontendOptions=["brief", "medium", "detailed"],
|
||||
required=False,
|
||||
default="medium",
|
||||
description="Desired summary length"
|
||||
),
|
||||
"focus": WorkflowActionParameter(
|
||||
name="focus",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=False,
|
||||
description="Specific aspect to focus on in the summary (e.g., financial data, key decisions)"
|
||||
),
|
||||
"resultType": WorkflowActionParameter(
|
||||
name="resultType",
|
||||
type="str",
|
||||
frontendType=FrontendType.SELECT,
|
||||
frontendOptions=["txt", "md", "docx"],
|
||||
required=False,
|
||||
default="txt",
|
||||
description="Output file extension"
|
||||
)
|
||||
},
|
||||
execute=summarizeDocument.__get__(self, self.__class__)
|
||||
),
|
||||
"translateDocument": WorkflowActionDefinition(
|
||||
actionId="ai.translateDocument",
|
||||
description="Translate documents to a target language while preserving formatting and structure",
|
||||
parameters={
|
||||
"documentList": WorkflowActionParameter(
|
||||
name="documentList",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=True,
|
||||
description="Document reference(s) to translate"
|
||||
),
|
||||
"targetLanguage": WorkflowActionParameter(
|
||||
name="targetLanguage",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=True,
|
||||
description="Target language code or name (e.g., de, German, French, es)"
|
||||
),
|
||||
"sourceLanguage": WorkflowActionParameter(
|
||||
name="sourceLanguage",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=False,
|
||||
description="Source language if known (e.g., en, English). If not provided, AI will detect"
|
||||
),
|
||||
"preserveFormatting": WorkflowActionParameter(
|
||||
name="preserveFormatting",
|
||||
type="bool",
|
||||
frontendType=FrontendType.CHECKBOX,
|
||||
required=False,
|
||||
default=True,
|
||||
description="Whether to preserve original formatting"
|
||||
),
|
||||
"resultType": WorkflowActionParameter(
|
||||
name="resultType",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=False,
|
||||
description="Output file extension. If not specified, uses same format as input"
|
||||
)
|
||||
},
|
||||
execute=translateDocument.__get__(self, self.__class__)
|
||||
),
|
||||
"convertDocument": WorkflowActionDefinition(
|
||||
actionId="ai.convertDocument",
|
||||
description="Convert documents between different formats (PDF→Word, Excel→CSV, etc.)",
|
||||
parameters={
|
||||
"documentList": WorkflowActionParameter(
|
||||
name="documentList",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=True,
|
||||
description="Document reference(s) to convert"
|
||||
),
|
||||
"targetFormat": WorkflowActionParameter(
|
||||
name="targetFormat",
|
||||
type="str",
|
||||
frontendType=FrontendType.SELECT,
|
||||
frontendOptions=["docx", "pdf", "xlsx", "csv", "txt", "html", "json", "md"],
|
||||
required=True,
|
||||
description="Target format extension"
|
||||
),
|
||||
"preserveStructure": WorkflowActionParameter(
|
||||
name="preserveStructure",
|
||||
type="bool",
|
||||
frontendType=FrontendType.CHECKBOX,
|
||||
required=False,
|
||||
default=True,
|
||||
description="Whether to preserve document structure (headings, tables, etc.)"
|
||||
)
|
||||
},
|
||||
execute=convertDocument.__get__(self, self.__class__)
|
||||
),
|
||||
"generateDocument": WorkflowActionDefinition(
|
||||
actionId="ai.generateDocument",
|
||||
description="Generate documents from scratch or based on templates/inputs",
|
||||
parameters={
|
||||
"prompt": WorkflowActionParameter(
|
||||
name="prompt",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXTAREA,
|
||||
required=True,
|
||||
description="Description of the document to generate"
|
||||
),
|
||||
"documentList": WorkflowActionParameter(
|
||||
name="documentList",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=False,
|
||||
description="Template documents or reference documents to use as a guide"
|
||||
),
|
||||
"documentType": WorkflowActionParameter(
|
||||
name="documentType",
|
||||
type="str",
|
||||
frontendType=FrontendType.SELECT,
|
||||
frontendOptions=["letter", "memo", "proposal", "contract", "report", "email"],
|
||||
required=False,
|
||||
description="Type of document"
|
||||
),
|
||||
"resultType": WorkflowActionParameter(
|
||||
name="resultType",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=False,
|
||||
default="txt",
|
||||
description="Output format (e.g., txt, html, pdf, docx, md, json, csv, xlsx, pptx, png, jpg). Any format supported by renderers is acceptable. Default: txt"
|
||||
)
|
||||
},
|
||||
execute=generateDocument.__get__(self, self.__class__)
|
||||
)
|
||||
}
|
||||
|
||||
# Validate actions after definition
|
||||
self._validateActions()
|
||||
|
||||
# Register actions as methods (optional, for direct access)
|
||||
self.process = process.__get__(self, self.__class__)
|
||||
self.webResearch = webResearch.__get__(self, self.__class__)
|
||||
self.summarizeDocument = summarizeDocument.__get__(self, self.__class__)
|
||||
self.translateDocument = translateDocument.__get__(self, self.__class__)
|
||||
self.convertDocument = convertDocument.__get__(self, self.__class__)
|
||||
self.generateDocument = generateDocument.__get__(self, self.__class__)
|
||||
|
||||
def _format_timestamp_for_filename(self) -> str:
|
||||
"""Format current timestamp as YYYYMMDD-hhmmss for filenames."""
|
||||
return datetime.now(UTC).strftime("%Y%m%d-%H%M%S")
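`MethodAi` attaches module-level functions to the instance with `func.__get__(self, self.__class__)`. The following is a minimal sketch of that binding mechanism, using a hypothetical `greet` function and `DemoMethod` class that are not part of this commit: calling `__get__` on a plain function returns a bound method, so `self` is supplied automatically on later calls.

```python
import asyncio
from typing import Any, Dict


async def greet(self, parameters: Dict[str, Any]) -> str:
    """Module-level function written as if it were a method."""
    return f"{self.name}: hello {parameters.get('who', 'world')}"


class DemoMethod:
    def __init__(self) -> None:
        self.name = "demo"
        # function.__get__(instance, owner) yields a bound method object
        self.greet = greet.__get__(self, self.__class__)


demo = DemoMethod()
result = asyncio.run(demo.greet({"who": "RBAC"}))
```

This is why each action file can define `async def action(self, parameters)` outside the class and still reach `self.services` at call time.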
|
||||
|
||||
|
|
@ -7,6 +7,9 @@ import logging
|
|||
from functools import wraps
|
||||
import inspect
|
||||
|
||||
from modules.datamodels.datamodelWorkflowActions import WorkflowActionDefinition, WorkflowActionParameter
|
||||
from modules.datamodels.datamodelRbac import AccessRuleContext
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def action(func):
|
||||
|
|
@ -57,37 +60,217 @@ class MethodBase:
|
|||
self.description: str
|
||||
self.logger = logging.getLogger(f"{__name__}.{self.__class__.__name__}")
|
||||
|
||||
# Actions MUST be defined as a dictionary
|
||||
# Every method class must define an _actions dictionary in __init__
|
||||
self._actions: Dict[str, WorkflowActionDefinition] = {}
|
||||
|
||||
# After initialization: validate actions (overridden once _actions is set)
|
||||
# Validation only happens after the subclass is fully initialized
|
||||
|
||||
def _validateActions(self):
|
||||
"""Validate that _actions dictionary is properly defined"""
|
||||
if not hasattr(self, '_actions') or not isinstance(self._actions, dict):
|
||||
raise ValueError(f"Method {self.name} must define _actions dictionary in __init__")
|
||||
|
||||
for actionName, actionDef in self._actions.items():
|
||||
if not isinstance(actionDef, WorkflowActionDefinition):
|
||||
raise ValueError(f"Action '{actionName}' in {self.name} must be WorkflowActionDefinition instance")
|
||||
|
||||
if not actionDef.actionId:
|
||||
raise ValueError(f"Action '{actionName}' in {self.name} must have actionId")
|
||||
|
||||
if not actionDef.execute:
|
||||
raise ValueError(f"Action '{actionName}' in {self.name} must have execute function")
|
||||
|
||||
@property
|
||||
def actions(self) -> Dict[str, Dict[str, Any]]:
|
||||
"""Dynamically collect all actions decorated with @action in the class."""
|
||||
actions = {}
|
||||
for attr_name in dir(self):
|
||||
# Skip the actions property itself to avoid recursion
|
||||
if attr_name == 'actions':
|
||||
continue
|
||||
try:
|
||||
attr = getattr(self, attr_name)
|
||||
if callable(attr) and getattr(attr, 'is_action', False):
|
||||
sig = inspect.signature(attr)
|
||||
params = {}
|
||||
for param_name, param in sig.parameters.items():
|
||||
if param_name not in ['self', 'parameters']:
|
||||
param_type = param.annotation if param.annotation != param.empty else Any
|
||||
params[param_name] = {
|
||||
'type': param_type,
|
||||
'required': param.default == param.empty,
|
||||
'description': None,
|
||||
'default': param.default if param.default != param.empty else None
|
||||
}
|
||||
actions[attr_name] = {
|
||||
'description': attr.__doc__ or '',
|
||||
'parameters': params,
|
||||
'method': attr
|
||||
}
|
||||
except (AttributeError, RecursionError):
|
||||
# Skip attributes that cause issues
|
||||
continue
|
||||
return actions
|
||||
"""
|
||||
Dynamically collect all actions from _actions dictionary.
|
||||
Returns format for API/UI consumption.
|
||||
|
||||
REQUIREMENT: All actions must be defined in the _actions dictionary.
|
||||
Actions without an _actions definition are not available.
|
||||
"""
|
||||
result = {}
|
||||
|
||||
# Actions must be defined in the _actions dictionary
|
||||
if not hasattr(self, '_actions') or not self._actions:
|
||||
self.logger.error(f"Method {self.name} has no _actions dictionary defined. Actions will not be available.")
|
||||
return result
|
||||
|
||||
totalActions = len(self._actions)
|
||||
deniedActions = []
|
||||
|
||||
for actionName, actionDef in self._actions.items():
|
||||
# RBAC check: verify that the action is available to the current user
|
||||
if not self._checkActionPermission(actionDef.actionId):
|
||||
deniedActions.append(f"{actionName} ({actionDef.actionId})")
|
||||
continue # Skip if user doesn't have permission
|
||||
|
||||
# Convert WorkflowActionDefinition to system format
|
||||
result[actionName] = {
|
||||
'description': actionDef.description,
|
||||
'parameters': self._convertParametersToSystemFormat(actionDef.parameters),
|
||||
'method': self._createActionWrapper(actionDef)
|
||||
}
|
||||
|
||||
if deniedActions:
|
||||
self.logger.warning(f"Method {self.name}: {len(deniedActions)}/{totalActions} actions denied by RBAC: {deniedActions[:5]}{'...' if len(deniedActions) > 5 else ''}")
|
||||
if not result and totalActions > 0:
|
||||
self.logger.error(f"Method {self.name}: ALL {totalActions} actions denied by RBAC! This will result in empty action list.")
|
||||
|
||||
return result
|
||||
|
||||
def _checkActionPermission(self, actionId: str) -> bool:
|
||||
"""
|
||||
Check if current user has permission to execute this action.
|
||||
Uses RBAC RESOURCE context.
|
||||
|
||||
REQUIREMENT: The RBAC service must be available.
|
||||
"""
|
||||
if not hasattr(self.services, 'rbac') or not self.services.rbac:
|
||||
self.logger.error(f"RBAC service not available (services.rbac is None). Action {actionId} will be denied.")
|
||||
return False
|
||||
|
||||
# Get current user from services.user (not from chat service)
|
||||
currentUser = getattr(self.services, 'user', None)
|
||||
if not currentUser:
|
||||
self.logger.warning(f"No current user found (services.user is None). Action {actionId} will be denied.")
|
||||
return False
|
||||
|
||||
# RBAC-Check: RESOURCE context, item = actionId
|
||||
try:
|
||||
permissions = self.services.rbac.getUserPermissions(
|
||||
user=currentUser,
|
||||
context=AccessRuleContext.RESOURCE,
|
||||
item=actionId
|
||||
)
|
||||
hasPermission = permissions.view
|
||||
if not hasPermission:
|
||||
# Log detailed RBAC denial info
|
||||
userRoles = getattr(currentUser, 'roleLabels', []) or []
|
||||
self.logger.warning(
|
||||
f"RBAC denied action {actionId} for user {currentUser.id}. "
|
||||
f"User roles: {userRoles}, "
|
||||
f"Permissions: view={permissions.view}, edit={permissions.edit}, delete={permissions.delete}. "
|
||||
f"No matching RBAC rule found for context=RESOURCE, item={actionId}"
|
||||
)
|
||||
return hasPermission
|
||||
except Exception as e:
|
||||
self.logger.error(f"RBAC check failed for action {actionId}: {str(e)}. Action will be denied.")
|
||||
return False
|
||||
|
||||
def _convertParametersToSystemFormat(self, parameters: Dict[str, WorkflowActionParameter]) -> Dict[str, Dict[str, Any]]:
|
||||
"""Convert WorkflowActionParameter dict to system format for API/UI consumption"""
|
||||
result = {}
|
||||
for paramName, param in parameters.items():
|
||||
result[paramName] = {
|
||||
'type': param.type,
|
||||
'required': param.required,
|
||||
'description': param.description,
|
||||
'default': param.default,
|
||||
'frontendType': param.frontendType.value,
|
||||
'frontendOptions': param.frontendOptions,
|
||||
'validation': param.validation
|
||||
}
|
||||
return result
|
||||
|
||||
def _createActionWrapper(self, actionDef: WorkflowActionDefinition):
|
||||
"""Create wrapper function for action execution with parameter validation"""
|
||||
async def wrapper(parameters: Dict[str, Any], *args, **kwargs):
|
||||
# Parameter validation based on WorkflowActionParameter definitions
|
||||
validatedParams = self._validateParameters(parameters, actionDef.parameters)
|
||||
|
||||
# Execute action
|
||||
return await actionDef.execute(validatedParams, *args, **kwargs)
|
||||
|
||||
wrapper.is_action = True
|
||||
return wrapper
|
||||
|
||||
def _validateParameters(self, parameters: Dict[str, Any], paramDefs: Dict[str, WorkflowActionParameter]) -> Dict[str, Any]:
|
||||
"""Validate parameters against definitions"""
|
||||
validated = {}
|
||||
|
||||
for paramName, paramDef in paramDefs.items():
|
||||
value = parameters.get(paramName)
|
||||
|
||||
# Check required
|
||||
if paramDef.required and value is None:
|
||||
raise ValueError(f"Required parameter '{paramName}' is missing")
|
||||
|
||||
# Use default if not provided
|
||||
if value is None and paramDef.default is not None:
|
||||
value = paramDef.default
|
||||
|
||||
# Type validation
|
||||
if value is not None:
|
||||
value = self._validateType(value, paramDef.type)
|
||||
|
||||
# Custom validation rules
|
||||
if paramDef.validation and value is not None:
|
||||
self._applyValidationRules(value, paramDef.validation)
|
||||
|
||||
validated[paramName] = value
|
||||
|
||||
return validated
|
||||
|
||||
def _validateType(self, value: Any, expectedType: str) -> Any:
|
||||
"""Validate and convert value to expected type"""
|
||||
# Type validation logic
|
||||
typeMap = {
|
||||
'str': str,
|
||||
'int': int,
|
||||
'float': float,
|
||||
'bool': bool,
|
||||
'list': list,
|
||||
'dict': dict,
|
||||
}
|
||||
|
||||
# Handle List[str], List[int], etc.
|
||||
if expectedType.startswith('List['):
|
||||
if not isinstance(value, list):
|
||||
raise ValueError(f"Expected list for type '{expectedType}', got {type(value).__name__}")
|
||||
# Extract inner type
|
||||
innerType = expectedType[5:-1].strip() # Remove "List[" and "]"
|
||||
if innerType in typeMap:
|
||||
return [typeMap[innerType](v) for v in value]
|
||||
return value
|
||||
|
||||
# Handle Dict[str, Any], etc.
|
||||
if expectedType.startswith('Dict['):
|
||||
if not isinstance(value, dict):
|
||||
raise ValueError(f"Expected dict for type '{expectedType}', got {type(value).__name__}")
|
||||
return value
|
||||
|
||||
# Handle simple types
|
||||
if expectedType in typeMap:
|
||||
expectedTypeClass = typeMap[expectedType]
|
||||
if not isinstance(value, expectedTypeClass):
|
||||
try:
|
||||
return expectedTypeClass(value)
|
||||
except (ValueError, TypeError) as e:
|
||||
raise ValueError(f"Cannot convert {value} to {expectedType}: {str(e)}")
|
||||
|
||||
return value
|
||||
|
||||
def _applyValidationRules(self, value: Any, rules: Dict[str, Any]):
|
||||
"""Apply custom validation rules"""
|
||||
if 'min' in rules:
|
||||
if isinstance(value, (int, float)) and value < rules['min']:
|
||||
raise ValueError(f"Value must be >= {rules['min']}")
|
||||
elif isinstance(value, str) and len(value) < rules['min']:
|
||||
raise ValueError(f"String length must be >= {rules['min']}")
|
||||
|
||||
if 'max' in rules:
|
||||
if isinstance(value, (int, float)) and value > rules['max']:
|
||||
raise ValueError(f"Value must be <= {rules['max']}")
|
||||
elif isinstance(value, str) and len(value) > rules['max']:
|
||||
raise ValueError(f"String length must be <= {rules['max']}")
|
||||
|
||||
if 'pattern' in rules:
|
||||
import re
|
||||
if not re.match(rules['pattern'], str(value)):
|
||||
raise ValueError(f"Value does not match required pattern: {rules['pattern']}")
|
||||
|
||||
def getActionSignature(self, actionName: str) -> str:
|
||||
"""Get formatted action signature for AI prompt generation (detailed version)"""
|
||||
|
|
|
|||
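`_validateType` in `MethodBase` converts a mismatched value by calling the target type. For `bool` parameters such as `preserveFormatting`, this means a string like `"false"` becomes `True`, because any non-empty string is truthy in Python. The following is a minimal sketch of a stricter conversion, assuming a hypothetical helper `coerceBool` and an illustrative set of accepted spellings, neither of which is part of this commit.

```python
from typing import Any

# Accepted spellings are an assumption for illustration only
TRUE_STRINGS = {"true", "1", "yes", "on"}
FALSE_STRINGS = {"false", "0", "no", "off"}


def coerceBool(value: Any) -> bool:
    """Hypothetical helper: convert common boolean spellings, reject the rest."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in TRUE_STRINGS:
            return True
        if lowered in FALSE_STRINGS:
            return False
        raise ValueError(f"Cannot convert {value!r} to bool")
    if isinstance(value, (int, float)):
        return bool(value)
    raise ValueError(f"Cannot convert {type(value).__name__} to bool")
```

Whether string input for boolean parameters occurs in practice depends on how the frontend serializes checkbox values, which this diff does not show.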
7
modules/workflows/methods/methodContext/__init__.py
Normal file
|
|
@ -0,0 +1,7 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
from .methodContext import MethodContext
|
||||
|
||||
__all__ = ['MethodContext']
|
||||
|
||||
18
modules/workflows/methods/methodContext/actions/__init__.py
Normal file
|
|
@ -0,0 +1,18 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""Action modules for Context operations."""
|
||||
|
||||
# Export all actions
|
||||
from .getDocumentIndex import getDocumentIndex
|
||||
from .extractContent import extractContent
|
||||
from .neutralizeData import neutralizeData
|
||||
from .triggerPreprocessingServer import triggerPreprocessingServer
|
||||
|
||||
__all__ = [
|
||||
'getDocumentIndex',
|
||||
'extractContent',
|
||||
'neutralizeData',
|
||||
'triggerPreprocessingServer',
|
||||
]
|
||||
|
||||
|
|
@ -0,0 +1,251 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Extract Content action for Context operations.
|
||||
Extracts content from documents (separate from AI calls).
|
||||
"""
|
||||
|
||||
import logging
|
||||
import time
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
from modules.datamodels.datamodelExtraction import ExtractionOptions, MergeStrategy, ContentExtracted, ContentPart
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def extractContent(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
Extract raw content parts from documents without AI processing.
|
||||
|
||||
This action performs pure content extraction WITHOUT AI/OCR processing.
|
||||
It returns ContentParts with different typeGroups:
|
||||
- "text": Extracted text from text-based formats (PDF text layers, Word docs, etc.)
|
||||
- "image": Images as base64-encoded data (NOT converted to text, no OCR)
|
||||
- "table": Tables as structured data
|
||||
- "structure": Structured content (JSON, etc.)
|
||||
- "container": Container elements (PDF pages, etc.)
|
||||
|
||||
IMPORTANT:
|
||||
- Images are returned as base64 data, NOT as extracted text
|
||||
- No OCR is performed - images are preserved as visual elements
|
||||
- Text extraction only works for text-based formats (not images)
|
||||
- The extracted ContentParts can then be used by subsequent AI processing actions
|
||||
|
||||
Parameters:
|
||||
- documentList (list, required): Document reference(s) to extract content from.
|
||||
- extractionOptions (dict, optional): Extraction options (if not provided, defaults are used).
|
||||
|
||||
Returns:
|
||||
- ActionResult with ActionDocument containing ContentExtracted objects
|
||||
- ContentExtracted.parts contains List[ContentPart] with various typeGroups
|
||||
- Each ContentPart has a typeGroup indicating its type (text, image, table, etc.)
|
||||
"""
|
||||
try:
|
||||
# Init progress logger
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
operationId = f"context_extract_{workflowId}_{int(time.time())}"
|
||||
|
||||
# Extract documentList from parameters dict
|
||||
documentListParam = parameters.get("documentList")
|
||||
if not documentListParam:
|
||||
return ActionResult.isFailure(error="documentList is required")
|
||||
|
||||
# Convert to DocumentReferenceList if needed
|
||||
if isinstance(documentListParam, DocumentReferenceList):
|
||||
documentList = documentListParam
|
||||
elif isinstance(documentListParam, str):
|
||||
documentList = DocumentReferenceList.from_string_list([documentListParam])
|
||||
elif isinstance(documentListParam, list):
|
||||
documentList = DocumentReferenceList.from_string_list(documentListParam)
|
||||
else:
|
||||
return ActionResult.isFailure(error=f"Invalid documentList type: {type(documentListParam)}")
|
||||
|
||||
# Start progress tracking
|
||||
parentOperationId = parameters.get('parentOperationId')
|
||||
self.services.chat.progressLogStart(
|
||||
operationId,
|
||||
"Extracting content from documents",
|
||||
"Content Extraction",
|
||||
f"Documents: {len(documentList.references)}",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
# Get ChatDocuments from documentList
|
||||
self.services.chat.progressLogUpdate(operationId, 0.2, "Loading documents")
|
||||
chatDocuments = self.services.chat.getChatDocumentsFromDocumentList(documentList)
|
||||
|
||||
if not chatDocuments:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="No documents found in documentList")
|
||||
|
||||
logger.info(f"Extracting content from {len(chatDocuments)} documents")
|
||||
|
||||
# Prepare extraction options
|
||||
self.services.chat.progressLogUpdate(operationId, 0.3, "Preparing extraction options")
|
||||
extractionOptionsParam = parameters.get("extractionOptions")
|
||||
|
||||
# Convert dict to ExtractionOptions object if needed, or create defaults
|
||||
if extractionOptionsParam:
|
||||
if isinstance(extractionOptionsParam, dict):
|
||||
# Ensure required fields are present
|
||||
if "prompt" not in extractionOptionsParam:
|
||||
extractionOptionsParam["prompt"] = "Extract all content from the document"
|
||||
if "mergeStrategy" not in extractionOptionsParam:
|
||||
extractionOptionsParam["mergeStrategy"] = MergeStrategy(
|
||||
mergeType="concatenate",
|
||||
groupBy="typeGroup",
|
||||
orderBy="id"
|
||||
)
|
||||
# Convert dict to ExtractionOptions object
|
||||
try:
|
||||
extractionOptions = ExtractionOptions(**extractionOptionsParam)
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to create ExtractionOptions from dict: {str(e)}, using defaults")
|
||||
extractionOptions = None
|
||||
elif isinstance(extractionOptionsParam, ExtractionOptions):
|
||||
extractionOptions = extractionOptionsParam
|
||||
else:
|
||||
# Invalid type, use defaults
|
||||
logger.warning(f"Invalid extractionOptions type: {type(extractionOptionsParam)}, using defaults")
|
||||
extractionOptions = None
|
||||
else:
|
||||
extractionOptions = None
|
||||
|
||||
# If extractionOptions not provided, create defaults
|
||||
if not extractionOptions:
|
||||
# Default extraction options for pure content extraction (no AI processing)
|
||||
extractionOptions = ExtractionOptions(
|
||||
prompt="Extract all content from the document",
|
||||
mergeStrategy=MergeStrategy(
|
||||
mergeType="concatenate",
|
||||
groupBy="typeGroup",
|
||||
orderBy="id"
|
||||
),
|
||||
processDocumentsIndividually=True
|
||||
)
|
||||
|
||||
# Call extraction service with hierarchical progress logging
|
||||
self.services.chat.progressLogUpdate(operationId, 0.4, "Initiating")
|
||||
self.services.chat.progressLogUpdate(operationId, 0.5, f"Extracting content from {len(chatDocuments)} documents")
|
||||
# Pass operationId for hierarchical per-document progress logging
|
||||
extractedResults = self.services.extraction.extractContent(chatDocuments, extractionOptions, operationId=operationId)
|
||||
|
||||
# Check if neutralization is enabled and should be applied automatically
|
||||
neutralizationEnabled = False
|
||||
try:
|
||||
config = self.services.neutralization.getConfig()
|
||||
neutralizationEnabled = bool(config and config.enabled)  # keep a real bool; the value is later stored in validationMetadata
|
||||
except Exception as e:
|
||||
logger.debug(f"Could not check neutralization config: {str(e)}")
|
||||
|
||||
# Neutralize extracted data if enabled (for dynamic mode: after extraction, before AI processing)
|
||||
if neutralizationEnabled:
|
||||
self.services.chat.progressLogUpdate(operationId, 0.7, "Neutralizing extracted data")
|
||||
logger.info("Neutralization enabled - neutralizing extracted content data")
|
||||
|
||||
# Neutralize each ContentExtracted result
|
||||
for extracted in extractedResults:
|
||||
if extracted.parts:
|
||||
neutralizedParts = []
|
||||
for part in extracted.parts:
|
||||
if not isinstance(part, ContentPart):
|
||||
# Try to parse as ContentPart if it's a dict
|
||||
if isinstance(part, dict):
|
||||
try:
|
||||
part = ContentPart(**part)
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not parse ContentPart: {str(e)}")
|
||||
neutralizedParts.append(part)
|
||||
continue
|
||||
else:
|
||||
neutralizedParts.append(part)
|
||||
continue
|
||||
|
||||
# Neutralize the data field if it contains text
|
||||
if part.data:
|
||||
try:
|
||||
# Call neutralization service
|
||||
neutralizationResult = self.services.neutralization.processText(part.data)
|
||||
|
||||
if neutralizationResult and 'neutralized_text' in neutralizationResult:
|
||||
# Replace data with neutralized text
|
||||
neutralizedData = neutralizationResult['neutralized_text']
|
||||
|
||||
# Create new ContentPart with neutralized data
|
||||
neutralizedPart = ContentPart(
|
||||
id=part.id,
|
||||
parentId=part.parentId,
|
||||
label=part.label,
|
||||
typeGroup=part.typeGroup,
|
||||
mimeType=part.mimeType,
|
||||
data=neutralizedData,
|
||||
metadata=part.metadata.copy() if part.metadata else {}
|
||||
)
|
||||
neutralizedParts.append(neutralizedPart)
|
||||
else:
|
||||
# Neutralization failed, use original part
|
||||
logger.warning(f"Neutralization did not return neutralized_text for part {part.id}")
|
||||
neutralizedParts.append(part)
|
||||
except Exception as e:
|
||||
logger.error(f"Error neutralizing part {part.id}: {str(e)}")
|
||||
# On error, use original part
|
||||
neutralizedParts.append(part)
|
||||
else:
|
||||
# No data to neutralize, keep original part
|
||||
neutralizedParts.append(part)
|
||||
|
||||
# Update extracted result with neutralized parts
|
||||
extracted.parts = neutralizedParts
|
||||
logger.info(f"Neutralized {len(neutralizedParts)} content parts")
|
||||
|
||||
# Build ActionDocuments from ContentExtracted results
|
||||
self.services.chat.progressLogUpdate(operationId, 0.8, "Building result documents")
|
||||
actionDocuments = []
|
||||
# Map extracted results back to original documents by index (results are in same order)
|
||||
for i, extracted in enumerate(extractedResults):
|
||||
# Get original document name if available
|
||||
originalDoc = chatDocuments[i] if i < len(chatDocuments) else None
|
||||
if originalDoc and hasattr(originalDoc, 'fileName') and originalDoc.fileName:
|
||||
# Use original filename with "extracted_" prefix
|
||||
baseName = originalDoc.fileName.rsplit('.', 1)[0] if '.' in originalDoc.fileName else originalDoc.fileName
|
||||
documentName = f"{baseName}_extracted_{extracted.id}.json"
|
||||
else:
|
||||
# Fallback to generic name with index
|
||||
documentName = f"document_{i+1:03d}_extracted_{extracted.id}.json"
|
||||
|
||||
# Store ContentExtracted object in ActionDocument.documentData
|
||||
validationMetadata = {
|
||||
"actionType": "context.extractContent",
|
||||
"documentIndex": i,
|
||||
"extractedId": extracted.id,
|
||||
"partCount": len(extracted.parts) if extracted.parts else 0,
|
||||
"neutralized": neutralizationEnabled,
|
||||
"originalFileName": originalDoc.fileName if originalDoc and hasattr(originalDoc, 'fileName') else None
|
||||
}
|
||||
actionDoc = ActionDocument(
|
||||
documentName=documentName,
|
||||
documentData=extracted, # ContentExtracted object
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
actionDocuments.append(actionDoc)
|
||||
|
||||
self.services.chat.progressLogFinish(operationId, True)
|
||||
|
||||
return ActionResult.isSuccess(documents=actionDocuments)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in content extraction: {str(e)}")
|
||||
|
||||
# Complete progress tracking with failure
|
||||
try:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
except Exception:
|
||||
pass # Don't fail on progress logging errors
|
||||
|
||||
return ActionResult.isFailure(error=str(e))
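Review note: the `documentList` coercion above (DocumentReferenceList, single string, or list of strings) is repeated verbatim in `neutralizeData`. A minimal sketch of a shared helper follows. The name `coerceDocumentList` is hypothetical, and the `DocumentReferenceList` class here is only a stand-in for the real model in `modules.datamodels.datamodelDocref`, so that the snippet runs on its own.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class DocumentReferenceList:
    """Stand-in (assumption) for the real DocumentReferenceList model."""
    references: List[str] = field(default_factory=list)

    @classmethod
    def from_string_list(cls, items: List[str]) -> "DocumentReferenceList":
        return cls(references=list(items))


def coerceDocumentList(value: Any) -> DocumentReferenceList:
    """Normalize the documentList parameter; raise ValueError on bad input.

    Hypothetical helper: mirrors the three accepted input shapes used by the
    actions above, but raises instead of returning ActionResult.isFailure.
    """
    if not value:
        raise ValueError("documentList is required")
    if isinstance(value, DocumentReferenceList):
        return value
    if isinstance(value, str):
        # A single reference string is wrapped into a one-element list
        return DocumentReferenceList.from_string_list([value])
    if isinstance(value, list):
        return DocumentReferenceList.from_string_list(value)
    raise ValueError(f"Invalid documentList type: {type(value)}")
```

An action could call the helper inside its existing `try` block and translate the `ValueError` into `ActionResult.isFailure(error=str(e))`, which the surrounding `except Exception` handlers already do.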
|
||||
|
||||
|
|
@ -0,0 +1,94 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Get Document Index action for Context operations.
|
||||
Generates a comprehensive index of all documents available in the current workflow.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import json
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def getDocumentIndex(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Generate a comprehensive index of all documents available in the current workflow, including documents from all rounds and tasks.
|
||||
- Input requirements: No input documents required. Optional resultType parameter.
|
||||
- Output format: Structured document index in JSON (default), plain-text, or Markdown format, listing all documents with their references, metadata, and organization by rounds/tasks.
|
||||
|
||||
Parameters:
|
||||
- resultType (str, optional): Output format (json, txt, md). Default: json.
|
||||
"""
|
||||
try:
|
||||
workflow = self.services.workflow
|
||||
if not workflow:
|
||||
return ActionResult.isFailure(
|
||||
error="No workflow available"
|
||||
)
|
||||
|
||||
resultType = parameters.get("resultType", "json").lower().strip().lstrip('.')
|
||||
|
||||
# Get available documents index from chat service
|
||||
documentsIndex = self.services.chat.getAvailableDocuments(workflow)
|
||||
|
||||
if not documentsIndex or documentsIndex == "No documents available" or documentsIndex == "NO DOCUMENTS AVAILABLE - This workflow has no documents to process.":
|
||||
# Return empty index structure
|
||||
if resultType == "json":
|
||||
indexData = {
|
||||
"workflowId": getattr(workflow, 'id', 'unknown'),
|
||||
"totalDocuments": 0,
|
||||
"rounds": [],
|
||||
"documentReferences": []
|
||||
}
|
||||
indexContent = json.dumps(indexData, indent=2, ensure_ascii=False)
|
||||
else:
|
||||
indexContent = "Document Index\n==============\n\nNo documents available in this workflow.\n"
|
||||
else:
|
||||
# Parse the document index string to extract structured information
|
||||
indexData = self.documentIndex.parseDocumentIndex(documentsIndex, workflow)
|
||||
|
||||
if resultType == "json":
|
||||
indexContent = json.dumps(indexData, indent=2, ensure_ascii=False)
|
||||
elif resultType == "md":
|
||||
indexContent = self.formatting.formatAsMarkdown(indexData)
|
||||
else: # txt
|
||||
indexContent = self.formatting.formatAsText(indexData, documentsIndex)
|
||||
|
||||
# Generate meaningful filename
|
||||
workflowContext = self.services.chat.getWorkflowContext()
|
||||
filename = self._generateMeaningfulFileName(
|
||||
"document_index",
|
||||
resultType if resultType in ["json", "txt", "md"] else "txt",  # unknown types are rendered as text above, so the extension falls back to txt
|
||||
workflowContext,
|
||||
"getDocumentIndex"
|
||||
)
|
||||
|
||||
validationMetadata = {
|
||||
"actionType": "context.getDocumentIndex",
|
||||
"resultType": resultType,
|
||||
"workflowId": getattr(workflow, 'id', 'unknown'),
|
||||
"totalDocuments": indexData.get("totalDocuments", 0) if isinstance(indexData, dict) else 0
|
||||
}
|
||||
|
||||
# Create ActionDocument
|
||||
document = ActionDocument(
|
||||
documentName=filename,
|
||||
documentData=indexContent,
|
||||
mimeType="application/json" if resultType == "json" else "text/plain",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
|
||||
return ActionResult.isSuccess(documents=[document])
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating document index: {str(e)}")
|
||||
return ActionResult.isFailure(
|
||||
error=f"Failed to generate document index: {str(e)}"
|
||||
)
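Review note: `resultType` is normalized with `.lower().strip().lstrip('.')`, which raises if the caller passes `resultType=None` explicitly, and the content format, file extension, and MIME type are each decided in a different place. The sketch below shows one way to resolve all three at once. The function name `resolveResultType` is hypothetical, and mapping `md` to `text/markdown` is an assumption (the action above returns `text/plain` for Markdown).

```python
from typing import Any, Dict, Tuple

# Supported output types and their MIME types.
# "text/markdown" for md is an assumption, not what the action currently returns.
_MIME_BY_TYPE = {
    "json": "application/json",
    "txt": "text/plain",
    "md": "text/markdown",
}


def resolveResultType(parameters: Dict[str, Any]) -> Tuple[str, str]:
    """Return (resultType, mimeType) for the requested output format."""
    # `or "json"` also covers an explicit None or empty string
    raw = parameters.get("resultType") or "json"
    resultType = str(raw).lower().strip().lstrip(".")
    if resultType not in _MIME_BY_TYPE:
        # Unknown formats fall back to plain text, matching the `else: # txt` branch
        resultType = "txt"
    return resultType, _MIME_BY_TYPE[resultType]
```

With this shape, the same `resultType` value can drive the formatter choice, the filename extension, and the `mimeType` argument of `ActionDocument`, so the three cannot drift apart.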
|
||||
|
||||
|
|
@ -0,0 +1,256 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Neutralize Data action for Context operations.
|
||||
Neutralizes extracted content data from ContentExtracted documents.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import time
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
from modules.datamodels.datamodelExtraction import ContentExtracted, ContentPart
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def neutralizeData(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
Neutralize data from ContentExtracted documents.
|
||||
|
||||
This action takes documents containing ContentExtracted objects (from extractContent)
|
||||
and neutralizes the text data in ContentPart.data fields.
|
||||
|
||||
Parameters:
|
||||
- documentList (list, required): Document reference(s) containing ContentExtracted objects.
|
||||
|
||||
Returns:
|
||||
- ActionResult with ActionDocument containing neutralized ContentExtracted objects
|
||||
"""
|
||||
try:
|
||||
# Init progress logger
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
operationId = f"context_neutralize_{workflowId}_{int(time.time())}"
|
||||
|
||||
# Check if neutralization is enabled
|
||||
neutralizationEnabled = False
|
||||
try:
|
||||
config = self.services.neutralization.getConfig()
|
||||
neutralizationEnabled = bool(config and config.enabled)  # coerce to a real bool (config may be None)
|
||||
except Exception as e:
|
||||
logger.debug(f"Could not check neutralization config: {str(e)}")
|
||||
|
||||
if not neutralizationEnabled:
|
||||
logger.info("Neutralization is not enabled, returning documents unchanged")
|
||||
# Return original documents if neutralization is disabled
|
||||
# Get documents from documentList
|
||||
documentListParam = parameters.get("documentList")
|
||||
if not documentListParam:
|
||||
return ActionResult.isFailure(error="documentList is required")
|
||||
|
||||
# Convert to DocumentReferenceList if needed
|
||||
if isinstance(documentListParam, DocumentReferenceList):
|
||||
documentList = documentListParam
|
||||
elif isinstance(documentListParam, str):
|
||||
documentList = DocumentReferenceList.from_string_list([documentListParam])
|
||||
elif isinstance(documentListParam, list):
|
||||
documentList = DocumentReferenceList.from_string_list(documentListParam)
|
||||
else:
|
||||
return ActionResult.isFailure(error=f"Invalid documentList type: {type(documentListParam)}")
|
||||
|
||||
# Get ChatDocuments from documentList
|
||||
chatDocuments = self.services.chat.getChatDocumentsFromDocumentList(documentList)
|
||||
if not chatDocuments:
|
||||
return ActionResult.isFailure(error="No documents found in documentList")
|
||||
|
||||
# Return original documents as ActionDocuments
|
||||
actionDocuments = []
|
||||
for chatDoc in chatDocuments:
|
||||
# Extract ContentExtracted from documentData if available
|
||||
if hasattr(chatDoc, 'documentData') and chatDoc.documentData:
|
||||
actionDoc = ActionDocument(
|
||||
documentName=getattr(chatDoc, 'fileName', 'unknown'),
|
||||
documentData=chatDoc.documentData,
|
||||
mimeType=getattr(chatDoc, 'mimeType', 'application/json'),
|
||||
validationMetadata={
|
||||
"actionType": "context.neutralizeData",
|
||||
"neutralized": False,
|
||||
"reason": "Neutralization disabled"
|
||||
}
|
||||
)
|
||||
actionDocuments.append(actionDoc)
|
||||
|
||||
return ActionResult.isSuccess(documents=actionDocuments)
|
||||
|
||||
# Extract documentList from parameters dict
|
||||
documentListParam = parameters.get("documentList")
|
||||
if not documentListParam:
|
||||
return ActionResult.isFailure(error="documentList is required")
|
||||
|
||||
# Convert to DocumentReferenceList if needed
|
||||
if isinstance(documentListParam, DocumentReferenceList):
|
||||
documentList = documentListParam
|
||||
elif isinstance(documentListParam, str):
|
||||
documentList = DocumentReferenceList.from_string_list([documentListParam])
|
||||
elif isinstance(documentListParam, list):
|
||||
documentList = DocumentReferenceList.from_string_list(documentListParam)
|
||||
else:
|
||||
return ActionResult.isFailure(error=f"Invalid documentList type: {type(documentListParam)}")
|
||||
|
||||
# Start progress tracking
|
||||
parentOperationId = parameters.get('parentOperationId')
|
||||
self.services.chat.progressLogStart(
|
||||
operationId,
|
||||
"Neutralizing data from documents",
|
||||
"Data Neutralization",
|
||||
f"Documents: {len(documentList.references)}",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
# Get ChatDocuments from documentList
|
||||
self.services.chat.progressLogUpdate(operationId, 0.2, "Loading documents")
|
||||
chatDocuments = self.services.chat.getChatDocumentsFromDocumentList(documentList)
|
||||
|
||||
if not chatDocuments:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="No documents found in documentList")
|
||||
|
||||
logger.info(f"Neutralizing data from {len(chatDocuments)} documents")
|
||||
|
||||
# Process each document
|
||||
self.services.chat.progressLogUpdate(operationId, 0.3, "Processing documents")
|
||||
actionDocuments = []
|
||||
|
||||
for i, chatDoc in enumerate(chatDocuments):
|
||||
try:
|
||||
# Extract ContentExtracted from documentData
|
||||
if not hasattr(chatDoc, 'documentData') or not chatDoc.documentData:
|
||||
logger.warning(f"Document {i+1} has no documentData, skipping")
|
||||
continue
|
||||
|
||||
documentData = chatDoc.documentData
|
||||
|
||||
# Check if it's a ContentExtracted object
|
||||
if isinstance(documentData, ContentExtracted):
|
||||
contentExtracted = documentData
|
||||
elif isinstance(documentData, dict):
|
||||
# Try to parse as ContentExtracted
|
||||
try:
|
||||
contentExtracted = ContentExtracted(**documentData)
|
||||
except Exception as e:
|
||||
logger.warning(f"Document {i+1} documentData is not ContentExtracted: {str(e)}")
|
||||
continue
|
||||
else:
|
||||
logger.warning(f"Document {i+1} documentData is not ContentExtracted or dict")
|
||||
continue
|
||||
|
||||
# Neutralize each ContentPart's data field
|
||||
neutralizedParts = []
|
||||
for part in contentExtracted.parts:
|
||||
if not isinstance(part, ContentPart):
|
||||
# Try to parse as ContentPart
|
||||
if isinstance(part, dict):
|
||||
try:
|
||||
part = ContentPart(**part)
|
||||
except Exception as e:
|
||||
logger.warning(f"Could not parse ContentPart: {str(e)}")
|
||||
neutralizedParts.append(part)
|
||||
continue
|
||||
else:
|
||||
neutralizedParts.append(part)
|
||||
continue
|
||||
|
||||
# Neutralize the data field if it contains text
|
||||
if part.data:
|
||||
try:
|
||||
self.services.chat.progressLogUpdate(
|
||||
operationId,
|
||||
0.3 + (i / len(chatDocuments)) * 0.6,
|
||||
f"Neutralizing part {len(neutralizedParts) + 1} of document {i+1}"
|
||||
)
|
||||
|
||||
# Call neutralization service
|
||||
neutralizationResult = self.services.neutralization.processText(part.data)
|
||||
|
||||
if neutralizationResult and 'neutralized_text' in neutralizationResult:
|
||||
# Replace data with neutralized text
|
||||
neutralizedData = neutralizationResult['neutralized_text']
|
||||
|
||||
# Create new ContentPart with neutralized data
|
||||
neutralizedPart = ContentPart(
|
||||
id=part.id,
|
||||
parentId=part.parentId,
|
||||
label=part.label,
|
||||
typeGroup=part.typeGroup,
|
||||
mimeType=part.mimeType,
|
||||
data=neutralizedData,
|
||||
metadata=part.metadata.copy() if part.metadata else {}
|
||||
)
|
||||
neutralizedParts.append(neutralizedPart)
|
||||
else:
|
||||
# Neutralization failed, use original part
|
||||
logger.warning(f"Neutralization did not return neutralized_text for part {part.id}")
|
||||
neutralizedParts.append(part)
|
||||
except Exception as e:
|
||||
logger.error(f"Error neutralizing part {part.id}: {str(e)}")
|
||||
# On error, use original part
|
||||
neutralizedParts.append(part)
|
||||
else:
|
||||
# No data to neutralize, keep original part
|
||||
neutralizedParts.append(part)
|
||||
|
||||
# Create neutralized ContentExtracted object
|
||||
neutralizedContentExtracted = ContentExtracted(
|
||||
id=contentExtracted.id,
|
||||
parts=neutralizedParts,
|
||||
summary=contentExtracted.summary
|
||||
)
|
||||
|
||||
# Create ActionDocument
|
||||
originalFileName = getattr(chatDoc, 'fileName', f"document_{i+1}.json")
|
||||
baseName = originalFileName.rsplit('.', 1)[0] if '.' in originalFileName else originalFileName
|
||||
documentName = f"{baseName}_neutralized_{contentExtracted.id}.json"
|
||||
|
||||
validationMetadata = {
|
||||
"actionType": "context.neutralizeData",
|
||||
"documentIndex": i,
|
||||
"extractedId": contentExtracted.id,
|
||||
"partCount": len(neutralizedParts),
|
||||
"neutralized": True,
|
||||
"originalFileName": originalFileName
|
||||
}
|
||||
|
||||
actionDoc = ActionDocument(
|
||||
documentName=documentName,
|
||||
documentData=neutralizedContentExtracted,
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
actionDocuments.append(actionDoc)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing document {i+1}: {str(e)}")
|
||||
# Continue with other documents
|
||||
continue
|
||||
|
||||
if not actionDocuments:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="No valid ContentExtracted documents found to neutralize")
|
||||
|
||||
self.services.chat.progressLogFinish(operationId, True)
|
||||
|
||||
return ActionResult.isSuccess(documents=actionDocuments)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in data neutralization: {str(e)}")
|
||||
|
||||
# Complete progress tracking with failure
|
||||
try:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
except Exception:
|
||||
pass # Don't fail on progress logging errors
|
||||
|
||||
return ActionResult.isFailure(error=str(e))
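Review note: the per-part neutralization loop in this file is nearly identical to the one in `extractContent`. A minimal sketch of a shared helper is given below. `Part` is a hypothetical stand-in for `ContentPart` that carries only the fields needed here, `neutralizeParts` is a hypothetical name, and `fakeProcessText` stands in for `self.services.neutralization.processText`, whose real return shape is assumed from the `'neutralized_text'` key checked above.

```python
import logging
from dataclasses import dataclass, field, replace
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger(__name__)


@dataclass
class Part:
    """Stand-in (assumption) for ContentPart with only the fields used here."""
    id: str
    data: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def neutralizeParts(
    parts: List[Part],
    processText: Callable[[str], Optional[Dict[str, Any]]],
) -> List[Part]:
    """Return a new list where parts with text data are neutralized.

    Parts without data, parts whose neutralization raises, and parts for which
    no 'neutralized_text' is returned are passed through unchanged.
    """
    result: List[Part] = []
    for part in parts:
        if not part.data:
            result.append(part)
            continue
        try:
            outcome = processText(part.data)
        except Exception as e:
            logger.error("Error neutralizing part %s: %s", part.id, e)
            result.append(part)
            continue
        if outcome and "neutralized_text" in outcome:
            # Copy the part instead of mutating it, as the original code does
            result.append(
                replace(part, data=outcome["neutralized_text"], metadata=dict(part.metadata))
            )
        else:
            logger.warning("No neutralized_text returned for part %s", part.id)
            result.append(part)
    return result


def fakeProcessText(text: str) -> Optional[Dict[str, Any]]:
    """Hypothetical neutralizer used only to exercise the helper."""
    if text == "boom":
        raise RuntimeError("service unavailable")
    if text == "skip":
        return {}
    return {"neutralized_text": text.replace("Alice", "[NAME]")}
```

Both actions could then call `neutralizeParts(extracted.parts, self.services.neutralization.processText)` and keep only their own progress logging and document building.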
|
||||
|
|
@ -0,0 +1,121 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Trigger Preprocessing Server action for Context operations.
|
||||
Triggers the preprocessing server at the customer tenant to update the database with the supplied configuration.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import json
|
||||
import aiohttp
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
from modules.shared.configuration import APP_CONFIG
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def triggerPreprocessingServer(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
Trigger preprocessing server at customer tenant to update database with configuration.
|
||||
|
||||
This action makes a POST request to the preprocessing server endpoint with the provided
|
||||
configuration JSON. The authorization secret is retrieved from APP_CONFIG using the provided config key.
|
||||
|
||||
Parameters:
|
||||
- endpoint (str, required): The full URL endpoint for the preprocessing server API.
|
||||
- configJson (dict or str, required): Configuration JSON object to send to the preprocessing server. Can be provided as a dict or as a JSON string that will be parsed.
|
||||
- authSecretConfigKey (str, required): The APP_CONFIG key name to retrieve the authorization secret from.
|
||||
|
||||
Returns:
|
||||
- ActionResult with ActionDocument containing "ok" on success, or error message on failure.
|
||||
"""
|
||||
try:
|
||||
endpoint = parameters.get("endpoint")
|
||||
if not endpoint:
|
||||
return ActionResult.isFailure(error="endpoint parameter is required")
|
||||
|
||||
configJsonParam = parameters.get("configJson")
|
||||
if not configJsonParam:
|
||||
return ActionResult.isFailure(error="configJson parameter is required")
|
||||
|
||||
authSecretConfigKey = parameters.get("authSecretConfigKey")
|
||||
if not authSecretConfigKey:
|
||||
return ActionResult.isFailure(error="authSecretConfigKey parameter is required")
|
||||
|
||||
# Handle configJson as either dict or JSON string
|
||||
if isinstance(configJsonParam, str):
|
||||
try:
|
||||
configJson = json.loads(configJsonParam)
|
||||
except json.JSONDecodeError as e:
|
||||
return ActionResult.isFailure(error=f"configJson is not valid JSON: {str(e)}")
|
||||
elif isinstance(configJsonParam, dict):
|
||||
configJson = configJsonParam
|
||||
else:
|
||||
return ActionResult.isFailure(error=f"configJson must be a dict or JSON string, got {type(configJsonParam)}")
|
||||
|
||||
# Get authorization secret from APP_CONFIG using the provided config key
|
||||
authSecret = APP_CONFIG.get(authSecretConfigKey)
|
||||
if not authSecret:
|
||||
errorMsg = f"{authSecretConfigKey} not found in APP_CONFIG"
|
||||
logger.error(errorMsg)
|
||||
return ActionResult.isFailure(error=errorMsg)
|
||||
|
||||
# Prepare headers with authorization (default headers as in original function)
|
||||
headers = {
|
||||
"X-PP-API-Key": authSecret,
|
||||
"Content-Type": "application/json"
|
||||
}
|
||||
|
||||
# Make POST request
|
||||
timeout = aiohttp.ClientTimeout(total=60)
|
||||
async with aiohttp.ClientSession(timeout=timeout) as session:
|
||||
async with session.post(
|
||||
endpoint,
|
||||
headers=headers,
|
||||
json=configJson
|
||||
) as response:
|
||||
if response.status in [200, 201]:
|
||||
responseText = await response.text()
|
||||
logger.info(f"Preprocessing server trigger successful: {response.status}")
|
||||
logger.debug(f"Response: {responseText}")
|
||||
|
||||
# Generate meaningful filename
|
||||
workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
|
||||
filename = self._generateMeaningfulFileName(
|
||||
"preprocessing_result",
|
||||
"txt",
|
||||
workflowContext,
|
||||
"triggerPreprocessingServer"
|
||||
)
|
||||
|
||||
# Create validation metadata
|
||||
validationMetadata = self._createValidationMetadata(
|
||||
"triggerPreprocessingServer",
|
||||
endpoint=endpoint,
|
||||
statusCode=response.status,
|
||||
responseText=responseText
|
||||
)
|
||||
|
||||
# Return success with "ok" document
|
||||
document = ActionDocument(
|
||||
documentName=filename,
|
||||
documentData="ok",
|
||||
mimeType="text/plain",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
|
||||
return ActionResult.isSuccess(documents=[document])
|
||||
else:
|
||||
errorText = await response.text()
|
||||
errorMsg = f"Preprocessing server trigger failed: {response.status} - {errorText}"
|
||||
logger.error(errorMsg)
|
||||
return ActionResult.isFailure(error=errorMsg)
|
||||
|
||||
except Exception as e:
|
||||
errorMsg = f"Error triggering preprocessing server: {str(e)}"
|
||||
logger.error(errorMsg)
|
||||
return ActionResult.isFailure(error=errorMsg)
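Review note: the `configJson` handling accepts either a dict or a JSON string, but a JSON string that decodes to a list or a scalar is forwarded to the server unchanged. The sketch below isolates the parsing step so it can be tested without a network call. The name `parseConfigJson` is hypothetical, and the extra check that the decoded value is an object is an assumption about what the preprocessing server expects, not something the code above enforces.

```python
import json
from typing import Any, Dict, Union


def parseConfigJson(value: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return configJson as a dict, accepting a dict or a JSON object string."""
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError as e:
            raise ValueError(f"configJson is not valid JSON: {e}") from e
        # Assumption: only a JSON object is a meaningful configuration payload
        if not isinstance(parsed, dict):
            raise ValueError(
                f"configJson must decode to an object, got {type(parsed).__name__}"
            )
        return parsed
    raise ValueError(f"configJson must be a dict or JSON string, got {type(value)}")
```

Raising `ValueError` keeps the helper independent of `ActionResult`; the action's outer `except Exception` block would turn the message into a failure result.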
|
||||
|
||||
|
|
@ -0,0 +1,5 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""Helper modules for Context method operations."""
|
||||
|
||||
|
|
@ -0,0 +1,89 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

"""
Document Index helper for Context operations.
Handles parsing and formatting of document indexes.
"""

import logging
from typing import Dict, Any
from datetime import datetime, UTC

logger = logging.getLogger(__name__)

class DocumentIndexHelper:
    """Helper for document index operations"""

    def __init__(self, methodInstance):
        """
        Initialize document index helper.

        Args:
            methodInstance: Instance of MethodContext (for access to services)
        """
        self.method = methodInstance
        self.services = methodInstance.services

    def parseDocumentIndex(self, documentsIndex: str, workflow: Any) -> Dict[str, Any]:
        """Parse the document index string into structured data."""
        try:
            indexData = {
                "workflowId": getattr(workflow, 'id', 'unknown'),
                "generatedAt": datetime.now(UTC).isoformat(),
                "totalDocuments": 0,
                "rounds": [],
                "documentReferences": []
            }

            # Extract document references from the index string
            lines = documentsIndex.split('\n')
            currentRound = None
            currentDocList = None

            for line in lines:
                line = line.strip()
                if not line:
                    continue

                # Check for round headers
                if "Current round documents:" in line:
                    currentRound = "current"
                    continue
                elif "Past rounds documents:" in line:
                    currentRound = "past"
                    continue

                # Check for document list references (docList:...)
                if line.startswith("- docList:"):
                    docListRef = line.replace("- docList:", "").strip()
                    currentDocList = {
                        "reference": docListRef,
                        "round": currentRound,
                        "documents": []
                    }
                    indexData["rounds"].append(currentDocList)
                    continue

                # Check for individual document references (docItem:...).
                # The line is already stripped, so only the unindented prefix can match.
                if line.startswith("- docItem:"):
                    docItemRef = line.replace("- docItem:", "").strip()
                    indexData["documentReferences"].append({
                        "reference": docItemRef,
                        "round": currentRound,
                        "docList": currentDocList["reference"] if currentDocList else None
                    })
                    indexData["totalDocuments"] += 1
                    if currentDocList:
                        currentDocList["documents"].append(docItemRef)

            return indexData

        except Exception as e:
            logger.error(f"Error parsing document index: {str(e)}")
            return {
                "workflowId": getattr(workflow, 'id', 'unknown'),
                "error": f"Failed to parse document index: {str(e)}",
                "rawIndex": documentsIndex
            }
|
||||
|
||||
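The index format that `parseDocumentIndex` expects is only implied by its string checks. The following is a minimal standalone sketch of the same line-by-line parse, assuming an index made of round headers followed by `- docList:` and `- docItem:` entries. The function name `parse_index` and the sample references are illustrative and not part of this commit.

```python
from typing import Any, Dict, List, Optional


def parse_index(text: str) -> Dict[str, Any]:
    """Group docItem references under the most recent docList and round."""
    rounds: List[Dict[str, Any]] = []
    references: List[Dict[str, Optional[str]]] = []
    current_round: Optional[str] = None
    current_list: Optional[Dict[str, Any]] = None

    for raw in text.split("\n"):
        line = raw.strip()
        if not line:
            continue
        if "Current round documents:" in line:
            current_round = "current"
        elif "Past rounds documents:" in line:
            current_round = "past"
        elif line.startswith("- docList:"):
            current_list = {
                "reference": line[len("- docList:"):].strip(),
                "round": current_round,
                "documents": [],
            }
            rounds.append(current_list)
        elif line.startswith("- docItem:"):
            ref = line[len("- docItem:"):].strip()
            references.append({
                "reference": ref,
                "round": current_round,
                "docList": current_list["reference"] if current_list else None,
            })
            if current_list:
                current_list["documents"].append(ref)

    return {
        "totalDocuments": len(references),
        "rounds": rounds,
        "documentReferences": references,
    }


# Hypothetical index text; the real reference syntax may differ.
sample = """Current round documents:
- docList:round1_results
  - docItem:report.pdf
  - docItem:data.csv
Past rounds documents:
- docList:round0_results
  - docItem:notes.txt
"""
parsed = parse_index(sample)
```

Each `docItem` is attached to whichever `docList` was seen last, so the order of lines in the index determines the grouping.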
|
|
@ -0,0 +1,75 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

"""
Formatting helper for Context operations.
Handles formatting of document indexes in different formats.
"""

import logging
from typing import Dict, Any

logger = logging.getLogger(__name__)

class FormattingHelper:
    """Helper for formatting operations"""

    def __init__(self, methodInstance):
        """
        Initialize formatting helper.

        Args:
            methodInstance: Instance of MethodContext (for access to services)
        """
        self.method = methodInstance
        self.services = methodInstance.services

    def formatAsMarkdown(self, indexData: Dict[str, Any]) -> str:
        """Format document index as Markdown."""
        try:
            md = "# Document Index\n\n"
            md += f"**Workflow ID:** {indexData.get('workflowId', 'unknown')}\n\n"
            md += f"**Generated At:** {indexData.get('generatedAt', 'unknown')}\n\n"
            md += f"**Total Documents:** {indexData.get('totalDocuments', 0)}\n\n"

            if indexData.get('rounds'):
                md += "## Documents by Round\n\n"
                for roundInfo in indexData['rounds']:
                    # 'round' can be present but None (docList seen before any round header)
                    roundLabel = (roundInfo.get('round') or 'unknown').title()
                    md += f"### {roundLabel} Round\n\n"
                    md += f"**Document List:** `{roundInfo.get('reference', 'unknown')}`\n\n"
                    if roundInfo.get('documents'):
                        md += "**Documents:**\n\n"
                        for docRef in roundInfo['documents']:
                            md += f"- `{docRef}`\n"
                        md += "\n"

            if indexData.get('documentReferences'):
                md += "## All Document References\n\n"
                for docRef in indexData['documentReferences']:
                    md += f"- `{docRef.get('reference', 'unknown')}`\n"

            return md

        except Exception as e:
            logger.error(f"Error formatting as Markdown: {str(e)}")
            return f"# Document Index\n\nError formatting index: {str(e)}\n"

    def formatAsText(self, indexData: Dict[str, Any], rawIndex: str) -> str:
        """Format document index as plain text."""
        try:
            text = "Document Index\n"
            text += "=" * 50 + "\n\n"
            text += f"Workflow ID: {indexData.get('workflowId', 'unknown')}\n"
            text += f"Generated At: {indexData.get('generatedAt', 'unknown')}\n"
            text += f"Total Documents: {indexData.get('totalDocuments', 0)}\n\n"

            # Include the raw formatted index for readability
            text += rawIndex

            return text

        except Exception as e:
            logger.error(f"Error formatting as text: {str(e)}")
            return f"Document Index\n\nError formatting index: {str(e)}\n\nRaw index:\n{rawIndex}\n"
|
||||
|
||||
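A round entry can carry `round: None` when a `docList` line appears before any round header, because `parseDocumentIndex` initializes `currentRound` to `None`. The sketch below shows one way a Markdown renderer could tolerate that case. It is a simplified illustration, not the helper from this commit, and `index_to_markdown` is a hypothetical name.

```python
from typing import Any, Dict


def index_to_markdown(index: Dict[str, Any]) -> str:
    """Render a parsed index as Markdown, tolerating missing or None fields."""
    parts = [
        "# Document Index\n",
        f"**Total Documents:** {index.get('totalDocuments', 0)}\n",
    ]
    for entry in index.get("rounds") or []:
        # `or "unknown"` also covers a key that exists but holds None,
        # which dict.get's default argument does not.
        label = (entry.get("round") or "unknown").title()
        parts.append(f"### {label} Round\n")
        for ref in entry.get("documents") or []:
            parts.append(f"- `{ref}`")
        parts.append("")
    return "\n".join(parts)


example = {
    "totalDocuments": 2,
    "rounds": [
        {"round": "current", "documents": ["a.pdf"]},
        {"round": None, "documents": ["b.pdf"]},
    ],
}
markdown = index_to_markdown(example)
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which is a stylistic choice rather than a requirement at this size.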
124
modules/workflows/methods/methodContext/methodContext.py
Normal file
|
|
@ -0,0 +1,124 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

import logging
from modules.workflows.methods.methodBase import MethodBase
from modules.datamodels.datamodelWorkflowActions import WorkflowActionDefinition, WorkflowActionParameter
from modules.shared.frontendTypes import FrontendType

# Import helpers
from .helpers.documentIndex import DocumentIndexHelper
from .helpers.formatting import FormattingHelper

# Import actions
from .actions.getDocumentIndex import getDocumentIndex
from .actions.extractContent import extractContent
from .actions.neutralizeData import neutralizeData
from .actions.triggerPreprocessingServer import triggerPreprocessingServer

logger = logging.getLogger(__name__)

class MethodContext(MethodBase):
    """Context and workflow information methods."""

    def __init__(self, services):
        super().__init__(services)
        self.name = "context"
        self.description = "Context and workflow information methods"

        # Initialize helper modules
        self.documentIndex = DocumentIndexHelper(self)
        self.formatting = FormattingHelper(self)

        # RBAC integration: action definitions with actionId
        self._actions = {
            "getDocumentIndex": WorkflowActionDefinition(
                actionId="context.getDocumentIndex",
                description="Generate a comprehensive index of all documents available in the current workflow",
                parameters={
                    "resultType": WorkflowActionParameter(
                        name="resultType",
                        type="str",
                        frontendType=FrontendType.SELECT,
                        frontendOptions=["json", "txt", "md"],
                        required=False,
                        default="json",
                        description="Output format"
                    )
                },
                execute=getDocumentIndex.__get__(self, self.__class__)
            ),
            "extractContent": WorkflowActionDefinition(
                actionId="context.extractContent",
                description="Extract raw content parts from documents without AI processing. Returns ContentParts with different typeGroups (text, image, table, structure, container). Images are returned as base64 data, not as extracted text. Text content is extracted from text-based formats (PDF text layers, Word docs, etc.) but NOT from images (no OCR). Use this action to prepare documents for subsequent AI processing actions.",
                parameters={
                    "documentList": WorkflowActionParameter(
                        name="documentList",
                        type="List[str]",
                        frontendType=FrontendType.DOCUMENT_REFERENCE,
                        required=True,
                        description="Document reference(s) to extract content from"
                    ),
                    "extractionOptions": WorkflowActionParameter(
                        name="extractionOptions",
                        type="dict",
                        frontendType=FrontendType.JSON,
                        required=False,
                        description="Extraction options (if not provided, defaults are used). Note: This action does NOT use AI - it performs pure content extraction. Images are preserved as base64 data, not converted to text."
                    )
                },
                execute=extractContent.__get__(self, self.__class__)
            ),
            "neutralizeData": WorkflowActionDefinition(
                actionId="context.neutralizeData",
                description="Neutralize extracted data from ContentExtracted documents (for use after extractContent)",
                parameters={
                    "documentList": WorkflowActionParameter(
                        name="documentList",
                        type="List[str]",
                        frontendType=FrontendType.DOCUMENT_REFERENCE,
                        required=True,
                        description="Document reference(s) containing ContentExtracted objects to neutralize"
                    )
                },
                execute=neutralizeData.__get__(self, self.__class__)
            ),
            "triggerPreprocessingServer": WorkflowActionDefinition(
                actionId="context.triggerPreprocessingServer",
                description="Trigger preprocessing server at customer tenant to update database with configuration",
                parameters={
                    "endpoint": WorkflowActionParameter(
                        name="endpoint",
                        type="str",
                        frontendType=FrontendType.TEXT,
                        required=True,
                        description="The full URL endpoint for the preprocessing server API"
                    ),
                    "configJson": WorkflowActionParameter(
                        name="configJson",
                        type="str",
                        frontendType=FrontendType.JSON,
                        required=True,
                        description="Configuration JSON object to send to the preprocessing server. Can be provided as a dict or as a JSON string"
                    ),
                    "authSecretConfigKey": WorkflowActionParameter(
                        name="authSecretConfigKey",
                        type="str",
                        frontendType=FrontendType.TEXT,
                        required=True,
                        description="The APP_CONFIG key name to retrieve the authorization secret from"
                    )
                },
                execute=triggerPreprocessingServer.__get__(self, self.__class__)
            )
        }

        # Validate actions after definition
        self._validateActions()

        # Register actions as methods (optional, for direct access)
        self.getDocumentIndex = getDocumentIndex.__get__(self, self.__class__)
        self.extractContent = extractContent.__get__(self, self.__class__)
        self.neutralizeData = neutralizeData.__get__(self, self.__class__)
        self.triggerPreprocessingServer = triggerPreprocessingServer.__get__(self, self.__class__)
|
||||
|
||||
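`MethodContext.__init__` attaches module-level action functions to the instance with `function.__get__(self, self.__class__)`. The sketch below isolates that binding step with plain Python objects. `Method` and `describe` are stand-ins invented for the illustration, and it assumes the project's `@action` decorator returns an ordinary function, which this diff does not show.

```python
class Method:
    """Stand-in for a method class; not the project's MethodBase."""

    def __init__(self, name: str) -> None:
        self.name = name


def describe(self, suffix: str) -> str:
    """Module-level function written as if it were a method."""
    return f"{self.name}:{suffix}"


instance = Method("context")

# Functions are descriptors: function.__get__(obj, type) returns a bound
# method whose __self__ is obj, so `self` is filled in on every call.
bound = describe.__get__(instance, Method)
instance.describe = bound  # attached per instance, as in the constructor

result = instance.describe("getDocumentIndex")
```

`types.MethodType(describe, instance)` produces an equivalent bound method and may read more explicitly; which form to prefer is a matter of taste.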
7
modules/workflows/methods/methodJira/__init__.py
Normal file
|
|
@ -0,0 +1,7 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

from .methodJira import MethodJira

__all__ = ['MethodJira']
|
||||
|
||||
26
modules/workflows/methods/methodJira/actions/__init__.py
Normal file
|
|
@ -0,0 +1,26 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

"""Action modules for JIRA operations."""

# Export all actions
from .connectJira import connectJira
from .exportTicketsAsJson import exportTicketsAsJson
from .importTicketsFromJson import importTicketsFromJson
from .mergeTicketData import mergeTicketData
from .parseCsvContent import parseCsvContent
from .parseExcelContent import parseExcelContent
from .createCsvContent import createCsvContent
from .createExcelContent import createExcelContent

__all__ = [
    'connectJira',
    'exportTicketsAsJson',
    'importTicketsFromJson',
    'mergeTicketData',
    'parseCsvContent',
    'parseExcelContent',
    'createCsvContent',
    'createExcelContent',
]
|
||||
|
||||
139
modules/workflows/methods/methodJira/actions/connectJira.py
Normal file
|
|
@ -0,0 +1,139 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

"""
Connect JIRA action for JIRA operations.
Connects to JIRA instance and creates ticket interface.
"""

import logging
import json
import uuid
from typing import Dict, Any
from modules.workflows.methods.methodBase import action
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
from modules.shared.configuration import APP_CONFIG

logger = logging.getLogger(__name__)

@action
async def connectJira(self, parameters: Dict[str, Any]) -> ActionResult:
    """
    Connect to JIRA instance and create ticket interface.

    Parameters:
    - apiUsername (str, required): JIRA API username/email
    - apiTokenConfigKey (str, required): APP_CONFIG key name for JIRA API token
    - apiUrl (str, required): JIRA instance URL (e.g., https://example.atlassian.net)
    - projectCode (str, required): JIRA project code (e.g., "DCS")
    - issueType (str, required): JIRA issue type (e.g., "Task")
    - taskSyncDefinition (str or dict, required): Field mapping definition as JSON string or dict

    Returns:
    - ActionResult with ActionDocument containing connection ID
    """
    try:
        apiUsername = parameters.get("apiUsername")
        if not apiUsername:
            return ActionResult.isFailure(error="apiUsername parameter is required")

        apiTokenConfigKey = parameters.get("apiTokenConfigKey")
        if not apiTokenConfigKey:
            return ActionResult.isFailure(error="apiTokenConfigKey parameter is required")

        apiUrl = parameters.get("apiUrl")
        if not apiUrl:
            return ActionResult.isFailure(error="apiUrl parameter is required")

        projectCode = parameters.get("projectCode")
        if not projectCode:
            return ActionResult.isFailure(error="projectCode parameter is required")

        issueType = parameters.get("issueType")
        if not issueType:
            return ActionResult.isFailure(error="issueType parameter is required")

        taskSyncDefinitionParam = parameters.get("taskSyncDefinition")
        if not taskSyncDefinitionParam:
            return ActionResult.isFailure(error="taskSyncDefinition parameter is required")

        # Parse taskSyncDefinition
        if isinstance(taskSyncDefinitionParam, str):
            try:
                taskSyncDefinition = json.loads(taskSyncDefinitionParam)
            except json.JSONDecodeError as e:
                return ActionResult.isFailure(error=f"taskSyncDefinition is not valid JSON: {str(e)}")
        elif isinstance(taskSyncDefinitionParam, dict):
            taskSyncDefinition = taskSyncDefinitionParam
        else:
            return ActionResult.isFailure(error=f"taskSyncDefinition must be a dict or JSON string, got {type(taskSyncDefinitionParam)}")

        # Get API token from APP_CONFIG
        apiToken = APP_CONFIG.get(apiTokenConfigKey)
        if not apiToken:
            errorMsg = f"{apiTokenConfigKey} not found in APP_CONFIG"
            logger.error(errorMsg)
            return ActionResult.isFailure(error=errorMsg)

        # Create ticket interface
        syncInterface = await self.services.ticket.connectTicket(
            taskSyncDefinition=taskSyncDefinition,
            connectorType="Jira",
            connectorParams={
                "apiUsername": apiUsername,
                "apiToken": apiToken,
                "apiUrl": apiUrl,
                "projectCode": projectCode,
                "ticketType": issueType,
            },
        )

        # Store connection with unique ID
        connectionId = str(uuid.uuid4())
        self._connections[connectionId] = {
            "interface": syncInterface,
            "taskSyncDefinition": taskSyncDefinition,
            "apiUrl": apiUrl,
            "projectCode": projectCode,
        }

        logger.info(f"JIRA connection established: {connectionId} (Project: {projectCode})")

        # Generate filename
        workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
        filename = self._generateMeaningfulFileName(
            "jira_connection",
            "json",
            workflowContext,
            "connectJira"
        )

        # Create connection info document
        connectionInfo = {
            "connectionId": connectionId,
            "apiUrl": apiUrl,
            "projectCode": projectCode,
            "issueType": issueType,
        }

        validationMetadata = self._createValidationMetadata(
            "connectJira",
            connectionId=connectionId,
            apiUrl=apiUrl,
            projectCode=projectCode
        )

        document = ActionDocument(
            documentName=filename,
            documentData=json.dumps(connectionInfo, indent=2),
            mimeType="application/json",
            validationMetadata=validationMetadata
        )

        return ActionResult.isSuccess(documents=[document])

    except Exception as e:
        errorMsg = f"Error connecting to JIRA: {str(e)}"
        logger.error(errorMsg)
        return ActionResult.isFailure(error=errorMsg)
|
||||
|
||||
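`connectJira` accepts `taskSyncDefinition` either as a dict or as a JSON string. A minimal sketch of that normalization as a reusable function follows. `normalize_definition` is a hypothetical helper, the sample mapping is invented, and the extra check that the decoded JSON is an object is an assumption that goes beyond what the action itself does.

```python
import json
from typing import Any, Dict, Optional, Tuple


def normalize_definition(value: Any) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (definition, None) on success or (None, error message) on failure."""
    if isinstance(value, dict):
        return value, None
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError as exc:
            return None, f"not valid JSON: {exc}"
        # Assumption: a field mapping must be a JSON object, not a list or scalar.
        if not isinstance(parsed, dict):
            return None, f"JSON must decode to an object, got {type(parsed).__name__}"
        return parsed, None
    return None, f"must be a dict or JSON string, got {type(value).__name__}"


# Hypothetical mapping; the real definition schema is not shown in this diff.
ok, ok_err = normalize_definition('{"ID": ["get", ["key"]]}')
bad, bad_err = normalize_definition("[1, 2]")
```

Returning an error string instead of raising keeps the caller free to wrap the message in whatever result type it uses.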
157
modules/workflows/methods/methodJira/actions/createCsvContent.py
Normal file
|
|
@ -0,0 +1,157 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

"""
Create CSV Content action for JIRA operations.
Creates CSV content with custom headers.
"""

import logging
import json
import base64
import pandas as pd
import csv as csv_module
from io import StringIO
from datetime import datetime, UTC
from typing import Dict, Any
from modules.workflows.methods.methodBase import action
from modules.datamodels.datamodelChat import ActionResult, ActionDocument

logger = logging.getLogger(__name__)

@action
async def createCsvContent(self, parameters: Dict[str, Any]) -> ActionResult:
    """
    Create CSV content with custom headers.

    Parameters:
    - data (str, required): Document reference containing data as JSON (with "data" field from mergeTicketData)
    - headers (str, optional): Document reference containing headers JSON (from parseCsvContent/parseExcelContent)
    - columns (str or list, optional): List of column names (if not provided, extracted from taskSyncDefinition or data)
    - taskSyncDefinition (str or dict, optional): Field mapping definition (used to extract column names if columns not provided)

    Returns:
    - ActionResult with ActionDocument containing the CSV content, base64-encoded
    """
    try:
        dataParam = parameters.get("data")
        if not dataParam:
            return ActionResult.isFailure(error="data parameter is required")

        headersParam = parameters.get("headers")
        columnsParam = parameters.get("columns")
        taskSyncDefinitionParam = parameters.get("taskSyncDefinition")

        # Get data from document
        dataJson = self.documentParsing.parseJsonFromDocument(dataParam)
        if dataJson is None:
            return ActionResult.isFailure(error="Could not parse data from document reference")

        # Extract data array if wrapped in object
        if isinstance(dataJson, dict) and "data" in dataJson:
            dataList = dataJson["data"]
        elif isinstance(dataJson, list):
            dataList = dataJson
        else:
            return ActionResult.isFailure(error="Data must be a JSON array or object with 'data' field")

        # Get headers
        headers = {"header1": "Header 1", "header2": "Header 2"}
        if headersParam:
            headersJson = self.documentParsing.parseJsonFromDocument(headersParam)
            if headersJson and isinstance(headersJson, dict) and "headers" in headersJson:
                headers = headersJson["headers"]
            elif headersJson and isinstance(headersJson, dict):
                headers = headersJson

        # Get columns
        if columnsParam:
            if isinstance(columnsParam, str):
                try:
                    columns = json.loads(columnsParam) if columnsParam.startswith('[') or columnsParam.startswith('{') else columnsParam.split(',')
                except json.JSONDecodeError:
                    # Not valid JSON after all: treat it as a comma-separated list
                    columns = columnsParam.split(',')
            elif isinstance(columnsParam, list):
                columns = columnsParam
            else:
                # Fail with a clear message instead of a TypeError further down
                return ActionResult.isFailure(error=f"columns must be a list or string, got {type(columnsParam)}")
        elif taskSyncDefinitionParam:
            # Extract columns from taskSyncDefinition
            if isinstance(taskSyncDefinitionParam, str):
                taskSyncDefinition = json.loads(taskSyncDefinitionParam)
            else:
                taskSyncDefinition = taskSyncDefinitionParam
            columns = list(taskSyncDefinition.keys())
        elif dataList and len(dataList) > 0:
            columns = list(dataList[0].keys())
        else:
            columns = []

        # Create DataFrame
        if not dataList:
            df = pd.DataFrame(columns=columns)
        else:
            df = pd.DataFrame(dataList)
            # Ensure all columns exist
            for col in columns:
                if col not in df.columns:
                    df[col] = ""
            # Reorder columns
            df = df[columns]

        # Clean data
        for column in df.columns:
            df[column] = df[column].astype("object").fillna("")
            df[column] = df[column].astype(str).str.replace('\n', '\\n', regex=False).str.replace('"', '""', regex=False)

        # Create headers with timestamp
        timestamp = datetime.fromtimestamp(self.services.utils.timestampGetUtc(), UTC).strftime("%Y-%m-%d %H:%M:%S UTC")
        header1Row = next(csv_module.reader([headers.get("header1", "Header 1")]), [])
        header2Row = next(csv_module.reader([headers.get("header2", "Header 2")]), [])
        if len(header2Row) > 1:
            header2Row[1] = timestamp

        headerRow1 = pd.DataFrame([header1Row + [""] * (len(df.columns) - len(header1Row))], columns=df.columns)
        headerRow2 = pd.DataFrame([header2Row + [""] * (len(df.columns) - len(header2Row))], columns=df.columns)
        tableHeaders = pd.DataFrame([df.columns.tolist()], columns=df.columns)
        finalDf = pd.concat([headerRow1, headerRow2, tableHeaders, df], ignore_index=True)

        # Convert to CSV bytes
        out = StringIO()
        finalDf.to_csv(out, index=False, header=False, quoting=csv_module.QUOTE_ALL, escapechar='\\')
        csvBytes = out.getvalue().encode('utf-8')

        logger.info(f"Created CSV content: {len(dataList)} rows, {len(columns)} columns")

        # Generate filename
        workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
        filename = self._generateMeaningfulFileName(
            "ticket_sync",
            "csv",
            workflowContext,
            "createCsvContent"
        )

        validationMetadata = self._createValidationMetadata(
            "createCsvContent",
            rowCount=len(dataList),
            columnCount=len(columns)
        )

        # Store as base64 for document
        csvBase64 = base64.b64encode(csvBytes).decode('utf-8')

        document = ActionDocument(
            documentName=filename,
            documentData=csvBase64,
            mimeType="application/octet-stream",
            validationMetadata=validationMetadata
        )

        return ActionResult.isSuccess(documents=[document])

    except Exception as e:
        errorMsg = f"Error creating CSV content: {str(e)}"
        logger.error(errorMsg)
        return ActionResult.isFailure(error=errorMsg)
|
||||
|
||||
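The CSV action places two banner rows and a column-name row above the data, and it parses each banner string with `csv.reader` so that a comma-separated header becomes several cells. The sketch below reproduces that layout with the standard library only, as a way to see the resulting file shape. `build_rows` and the sample values are illustrative, and it deliberately uses `csv.writer` rather than pandas.

```python
import csv
from io import StringIO
from typing import List


def build_rows(header1: str, header2: str, columns: List[str],
               records: List[dict], timestamp: str) -> List[List[str]]:
    """Two padded banner rows, the column names, then one row per record."""
    def pad(cells: List[str]) -> List[str]:
        return cells + [""] * (len(columns) - len(cells))

    row1 = next(csv.reader([header1]), [])
    row2 = next(csv.reader([header2]), [])
    if len(row2) > 1:
        row2[1] = timestamp  # second cell of the second banner row holds the export time

    body = [[str(rec.get(col, "")) for col in columns] for rec in records]
    return [pad(row1), pad(row2), list(columns)] + body


rows = build_rows(
    "Ticket export", "Generated,", ["ID", "Summary", "Status"],
    [{"ID": "DCS-1", "Summary": 'Say "hi"', "Status": "Open"}],
    "2025-01-01 00:00:00 UTC",
)
out = StringIO()
csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)
csv_text = out.getvalue()
```

A `QUOTE_ALL` writer doubles embedded quotes on its own, so this sketch does not pre-escape them; reading the text back with `csv.reader` recovers the original cell value.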
|
|
@ -0,0 +1,157 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

"""
Create Excel Content action for JIRA operations.
Creates Excel content with custom headers.
"""

import logging
import json
import base64
import pandas as pd
import csv as csv_module
from io import BytesIO
from datetime import datetime, UTC
from typing import Dict, Any
from modules.workflows.methods.methodBase import action
from modules.datamodels.datamodelChat import ActionResult, ActionDocument

logger = logging.getLogger(__name__)

@action
async def createExcelContent(self, parameters: Dict[str, Any]) -> ActionResult:
    """
    Create Excel content with custom headers.

    Parameters:
    - data (str, required): Document reference containing data as JSON (with "data" field from mergeTicketData)
    - headers (str, optional): Document reference containing headers JSON (from parseExcelContent)
    - columns (str or list, optional): List of column names (if not provided, extracted from taskSyncDefinition or data)
    - taskSyncDefinition (str or dict, optional): Field mapping definition (used to extract column names if columns not provided)

    Returns:
    - ActionResult with ActionDocument containing the Excel content, base64-encoded
    """
    try:
        dataParam = parameters.get("data")
        if not dataParam:
            return ActionResult.isFailure(error="data parameter is required")

        headersParam = parameters.get("headers")
        columnsParam = parameters.get("columns")
        taskSyncDefinitionParam = parameters.get("taskSyncDefinition")

        # Get data from document
        dataJson = self.documentParsing.parseJsonFromDocument(dataParam)
        if dataJson is None:
            return ActionResult.isFailure(error="Could not parse data from document reference")

        # Extract data array if wrapped in object
        if isinstance(dataJson, dict) and "data" in dataJson:
            dataList = dataJson["data"]
        elif isinstance(dataJson, list):
            dataList = dataJson
        else:
            return ActionResult.isFailure(error="Data must be a JSON array or object with 'data' field")

        # Get headers
        headers = {"header1": "Header 1", "header2": "Header 2"}
        if headersParam:
            headersJson = self.documentParsing.parseJsonFromDocument(headersParam)
            if headersJson and isinstance(headersJson, dict) and "headers" in headersJson:
                headers = headersJson["headers"]
            elif headersJson and isinstance(headersJson, dict):
                headers = headersJson

        # Get columns
        if columnsParam:
            if isinstance(columnsParam, str):
                try:
                    columns = json.loads(columnsParam) if columnsParam.startswith('[') or columnsParam.startswith('{') else columnsParam.split(',')
                except json.JSONDecodeError:
                    # Not valid JSON after all: treat it as a comma-separated list
                    columns = columnsParam.split(',')
            elif isinstance(columnsParam, list):
                columns = columnsParam
            else:
                # Fail with a clear message instead of a TypeError further down
                return ActionResult.isFailure(error=f"columns must be a list or string, got {type(columnsParam)}")
        elif taskSyncDefinitionParam:
            # Extract columns from taskSyncDefinition
            if isinstance(taskSyncDefinitionParam, str):
                taskSyncDefinition = json.loads(taskSyncDefinitionParam)
            else:
                taskSyncDefinition = taskSyncDefinitionParam
            columns = list(taskSyncDefinition.keys())
        elif dataList and len(dataList) > 0:
            columns = list(dataList[0].keys())
        else:
            columns = []

        # Create DataFrame
        if not dataList:
            df = pd.DataFrame(columns=columns)
        else:
            df = pd.DataFrame(dataList)
            # Ensure all columns exist
            for col in columns:
                if col not in df.columns:
                    df[col] = ""
            # Reorder columns
            df = df[columns]

        # Clean data
        for column in df.columns:
            df[column] = df[column].astype("object").fillna("")
            df[column] = df[column].astype(str).str.replace('\n', '\\n', regex=False).str.replace('"', '""', regex=False)

        # Create headers with timestamp
        timestamp = datetime.fromtimestamp(self.services.utils.timestampGetUtc(), UTC).strftime("%Y-%m-%d %H:%M:%S UTC")
        header1Row = next(csv_module.reader([headers.get("header1", "Header 1")]), [])
        header2Row = next(csv_module.reader([headers.get("header2", "Header 2")]), [])
        if len(header2Row) > 1:
            header2Row[1] = timestamp

        headerRow1 = pd.DataFrame([header1Row + [""] * (len(df.columns) - len(header1Row))], columns=df.columns)
        headerRow2 = pd.DataFrame([header2Row + [""] * (len(df.columns) - len(header2Row))], columns=df.columns)
        tableHeaders = pd.DataFrame([df.columns.tolist()], columns=df.columns)
        finalDf = pd.concat([headerRow1, headerRow2, tableHeaders, df], ignore_index=True)

        # Convert to Excel bytes
        buf = BytesIO()
        finalDf.to_excel(buf, index=False, header=False, engine='openpyxl')
        excelBytes = buf.getvalue()

        logger.info(f"Created Excel content: {len(dataList)} rows, {len(columns)} columns")

        # Generate filename
        workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
        filename = self._generateMeaningfulFileName(
            "ticket_sync",
            "xlsx",
            workflowContext,
            "createExcelContent"
        )

        validationMetadata = self._createValidationMetadata(
            "createExcelContent",
            rowCount=len(dataList),
            columnCount=len(columns)
        )

        # Store as base64 for document
        excelBase64 = base64.b64encode(excelBytes).decode('utf-8')

        document = ActionDocument(
            documentName=filename,
            documentData=excelBase64,
            mimeType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            validationMetadata=validationMetadata
        )

        return ActionResult.isSuccess(documents=[document])

    except Exception as e:
        errorMsg = f"Error creating Excel content: {str(e)}"
        logger.error(errorMsg)
        return ActionResult.isFailure(error=errorMsg)
|
||||
|
||||
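Both content actions choose their column names from the first available source: the explicit `columns` parameter, then the keys of `taskSyncDefinition`, then the keys of the first data record. The sketch below expresses that precedence as one function. `resolve_columns` is a hypothetical name, and unlike the actions it trims whitespace around comma-separated names and raises on unsupported types, both of which are assumptions made for the example.

```python
import json
from typing import Any, Dict, List, Optional


def resolve_columns(columns: Any = None,
                    definition: Optional[Dict[str, Any]] = None,
                    records: Optional[List[Dict[str, Any]]] = None) -> List[str]:
    """Pick column names from the first source that provides them."""
    if columns:
        if isinstance(columns, list):
            return [str(c) for c in columns]
        if isinstance(columns, str):
            text = columns.strip()
            if text.startswith("["):
                try:
                    return [str(c) for c in json.loads(text)]
                except json.JSONDecodeError:
                    pass  # not valid JSON: fall through to comma splitting
            return [c.strip() for c in text.split(",")]
        raise TypeError(f"columns must be a list or string, got {type(columns).__name__}")
    if definition:
        return list(definition.keys())
    if records:
        return list(records[0].keys())
    return []


from_json = resolve_columns('["ID", "Summary"]')
from_csv = resolve_columns("ID, Summary")
from_definition = resolve_columns(None, {"ID": 1, "Status": 2})
```

Keeping the precedence in one place would let the CSV and Excel actions share it instead of duplicating the branch logic.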
|
|
@ -0,0 +1,84 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

"""
Export Tickets As JSON action for JIRA operations.
Exports tickets from JIRA as JSON list.
"""

import logging
import json
from typing import Dict, Any
from modules.workflows.methods.methodBase import action
from modules.datamodels.datamodelChat import ActionResult, ActionDocument

logger = logging.getLogger(__name__)

@action
async def exportTicketsAsJson(self, parameters: Dict[str, Any]) -> ActionResult:
    """
    Export tickets from JIRA as JSON list.

    Parameters:
    - connectionId (str, required): Connection ID from connectJira action result
    - taskSyncDefinition (str or dict, optional): Field mapping definition (currently not read by this action; the definition stored by connectJira is used)

    Returns:
    - ActionResult with ActionDocument containing list of tickets as JSON
    """
    try:
        connectionIdParam = parameters.get("connectionId")
        if not connectionIdParam:
            return ActionResult.isFailure(error="connectionId parameter is required")

        # Get connection ID from document if it's a reference
        connectionId = None
        if isinstance(connectionIdParam, str):
            # Try to parse from document reference
            connectionInfo = self.documentParsing.parseJsonFromDocument(connectionIdParam)
            if connectionInfo and "connectionId" in connectionInfo:
                connectionId = connectionInfo["connectionId"]
            else:
                # Assume it's the connection ID directly
                connectionId = connectionIdParam

        if not connectionId or connectionId not in self._connections:
            return ActionResult.isFailure(error=f"Connection ID {connectionIdParam} not found. Ensure connectJira was called first.")

        connection = self._connections[connectionId]
        syncInterface = connection["interface"]

        # Export tickets
        dataList = await syncInterface.exportTicketsAsList()

        logger.info(f"Exported {len(dataList)} tickets from JIRA")

        # Generate filename
        workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
        filename = self._generateMeaningfulFileName(
            "jira_tickets_export",
            "json",
            workflowContext,
            "exportTicketsAsJson"
        )

        validationMetadata = self._createValidationMetadata(
            "exportTicketsAsJson",
            connectionId=connectionId,
            ticketCount=len(dataList)
        )

        document = ActionDocument(
            documentName=filename,
            documentData=json.dumps(dataList, indent=2, ensure_ascii=False),
            mimeType="application/json",
            validationMetadata=validationMetadata
        )

        return ActionResult.isSuccess(documents=[document])

    except Exception as e:
        errorMsg = f"Error exporting tickets from JIRA: {str(e)}"
        logger.error(errorMsg)
        return ActionResult.isFailure(error=errorMsg)
|
||||
|
||||
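The export action accepts `connectionId` either as a document reference whose JSON carries a `connectionId` field or as the raw ID. The sketch below isolates that lookup against an in-memory registry. `resolve_connection_id`, the `documents` store, and the `parse_json` callable are hypothetical stand-ins for `self.documentParsing.parseJsonFromDocument` and `self._connections`, whose real behavior is not shown in this diff.

```python
import json
from typing import Any, Callable, Dict, Optional


def resolve_connection_id(param: str,
                          connections: Dict[str, Any],
                          parse_json: Callable[[str], Optional[Any]]) -> Optional[str]:
    """Return a known connection ID, or None if the parameter resolves to nothing."""
    info = parse_json(param)
    if isinstance(info, dict) and "connectionId" in info:
        candidate = info["connectionId"]
    else:
        candidate = param  # treat the parameter as the ID itself
    return candidate if candidate in connections else None


# Hypothetical stand-ins for the document store and connection registry.
documents = {"docItem:conn.json": json.dumps({"connectionId": "abc-123"})}
connections = {"abc-123": object()}


def parse_json(reference: str) -> Optional[Any]:
    raw = documents.get(reference)
    return json.loads(raw) if raw is not None else None


via_document = resolve_connection_id("docItem:conn.json", connections, parse_json)
direct = resolve_connection_id("abc-123", connections, parse_json)
missing = resolve_connection_id("nope", connections, parse_json)
```

The same lookup appears in `importTicketsFromJson`, so a shared helper of this shape could remove the duplication.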
|
|
@ -0,0 +1,101 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

"""
Import Tickets From JSON action for JIRA operations.
Imports ticket data from JSON back to JIRA.
"""

import logging
import json
from typing import Dict, Any
from modules.workflows.methods.methodBase import action
from modules.datamodels.datamodelChat import ActionResult, ActionDocument

logger = logging.getLogger(__name__)

@action
async def importTicketsFromJson(self, parameters: Dict[str, Any]) -> ActionResult:
    """
    Import ticket data from JSON back to JIRA.

    Parameters:
    - connectionId (str, required): Connection ID from connectJira action result
    - ticketData (str, required): Document reference containing ticket data as JSON
    - taskSyncDefinition (str or dict, optional): Field mapping definition (if not provided, uses stored definition)

    Returns:
    - ActionResult with ActionDocument containing import result with counts
    """
    try:
        connectionIdParam = parameters.get("connectionId")
        if not connectionIdParam:
            return ActionResult.isFailure(error="connectionId parameter is required")

        ticketDataParam = parameters.get("ticketData")
        if not ticketDataParam:
            return ActionResult.isFailure(error="ticketData parameter is required")

        # Get connection ID from document if it's a reference
        connectionId = None
        if isinstance(connectionIdParam, str):
            connectionInfo = self.documentParsing.parseJsonFromDocument(connectionIdParam)
            if connectionInfo and "connectionId" in connectionInfo:
                connectionId = connectionInfo["connectionId"]
            else:
                connectionId = connectionIdParam

        if not connectionId or connectionId not in self._connections:
            return ActionResult.isFailure(error=f"Connection ID {connectionIdParam} not found. Ensure connectJira was called first.")

        connection = self._connections[connectionId]
        syncInterface = connection["interface"]

        # Get ticket data from document
        ticketDataJson = self.documentParsing.parseJsonFromDocument(ticketDataParam)
|
||||
if ticketDataJson is None:
|
||||
return ActionResult.isFailure(error="Could not parse ticket data from document reference")
|
||||
|
||||
# Ensure it's a list
|
||||
if not isinstance(ticketDataJson, list):
|
||||
return ActionResult.isFailure(error="ticketData must be a JSON array")
|
||||
|
||||
# Import tickets
|
||||
await syncInterface.importListToTickets(ticketDataJson)
|
||||
|
||||
logger.info(f"Imported {len(ticketDataJson)} tickets to JIRA")
|
||||
|
||||
# Generate filename
|
||||
workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
|
||||
filename = self._generateMeaningfulFileName(
|
||||
"jira_import_result",
|
||||
"json",
|
||||
workflowContext,
|
||||
"importTicketsFromJson"
|
||||
)
|
||||
|
||||
importResult = {
|
||||
"imported": len(ticketDataJson),
|
||||
"connectionId": connectionId,
|
||||
}
|
||||
|
||||
validationMetadata = self._createValidationMetadata(
|
||||
"importTicketsFromJson",
|
||||
connectionId=connectionId,
|
||||
importedCount=len(ticketDataJson)
|
||||
)
|
||||
|
||||
document = ActionDocument(
|
||||
documentName=filename,
|
||||
documentData=json.dumps(importResult, indent=2),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
|
||||
return ActionResult.isSuccess(documents=[document])
|
||||
|
||||
except Exception as e:
|
||||
errorMsg = f"Error importing tickets to JIRA: {str(e)}"
|
||||
logger.error(errorMsg)
|
||||
return ActionResult.isFailure(error=errorMsg)
|
||||
|
||||
157
modules/workflows/methods/methodJira/actions/mergeTicketData.py
Normal file
|
|
@ -0,0 +1,157 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Merge Ticket Data action for JIRA operations.
|
||||
Merges JIRA export data with existing SharePoint data.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import json
|
||||
from typing import Dict, Any, List
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def mergeTicketData(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
Merge JIRA export data with existing SharePoint data.
|
||||
|
||||
Parameters:
|
||||
- jiraData (str, required): Document reference containing JIRA ticket data as JSON array
|
||||
- existingData (str, required): Document reference containing existing SharePoint data as JSON array
|
||||
- taskSyncDefinition (str or dict, required): Field mapping definition
|
||||
- idField (str, optional): Field name to use as ID for merging (default: "ID")
|
||||
|
||||
Returns:
|
||||
- ActionResult with ActionDocument containing merged data and merge details
|
||||
"""
|
||||
try:
|
||||
jiraDataParam = parameters.get("jiraData")
|
||||
if not jiraDataParam:
|
||||
return ActionResult.isFailure(error="jiraData parameter is required")
|
||||
|
||||
existingDataParam = parameters.get("existingData")
|
||||
if not existingDataParam:
|
||||
return ActionResult.isFailure(error="existingData parameter is required")
|
||||
|
||||
taskSyncDefinitionParam = parameters.get("taskSyncDefinition")
|
||||
if not taskSyncDefinitionParam:
|
||||
return ActionResult.isFailure(error="taskSyncDefinition parameter is required")
|
||||
|
||||
idField = parameters.get("idField", "ID")
|
||||
|
||||
# Parse taskSyncDefinition
|
||||
if isinstance(taskSyncDefinitionParam, str):
|
||||
try:
|
||||
taskSyncDefinition = json.loads(taskSyncDefinitionParam)
|
||||
except json.JSONDecodeError as e:
|
||||
return ActionResult.isFailure(error=f"taskSyncDefinition is not valid JSON: {str(e)}")
|
||||
elif isinstance(taskSyncDefinitionParam, dict):
|
||||
taskSyncDefinition = taskSyncDefinitionParam
|
||||
else:
|
||||
return ActionResult.isFailure(error=f"taskSyncDefinition must be a dict or JSON string, got {type(taskSyncDefinitionParam).__name__}")
|
||||
|
||||
# Get data from documents
|
||||
jiraDataJson = self.documentParsing.parseJsonFromDocument(jiraDataParam)
|
||||
if jiraDataJson is None or not isinstance(jiraDataJson, list):
|
||||
return ActionResult.isFailure(error="Could not parse jiraData as JSON array")
|
||||
|
||||
existingDataJson = self.documentParsing.parseJsonFromDocument(existingDataParam)
|
||||
if existingDataJson is None or not isinstance(existingDataJson, list):
|
||||
# Empty existing data is OK
|
||||
existingDataJson = []
|
||||
|
||||
# Perform merge
|
||||
existingLookup = {row.get(idField): row for row in existingDataJson if row.get(idField)}
|
||||
mergedData: List[dict] = []
|
||||
changes: List[str] = []
|
||||
updatedCount = addedCount = unchangedCount = 0
|
||||
|
||||
for jiraRow in jiraDataJson:
|
||||
jiraId = jiraRow.get(idField)
|
||||
if jiraId and jiraId in existingLookup:
|
||||
existingRow = existingLookup[jiraId].copy()
|
||||
rowChanges: List[str] = []
|
||||
|
||||
for fieldName, fieldConfig in taskSyncDefinition.items():
|
||||
if fieldConfig[0] == 'get':
|
||||
oldValue = "" if existingRow.get(fieldName) is None else str(existingRow.get(fieldName))
|
||||
newValue = "" if jiraRow.get(fieldName) is None else str(jiraRow.get(fieldName))
|
||||
|
||||
# Convert ADF data to readable text for logging
|
||||
if isinstance(jiraRow.get(fieldName), dict) and jiraRow.get(fieldName).get("type") == "doc":
|
||||
newValueReadable = self.adfConverter.convertAdfToText(jiraRow.get(fieldName))
|
||||
if oldValue != newValueReadable:
|
||||
rowChanges.append(f"{fieldName}: '{oldValue[:100]}...' -> '{newValueReadable[:100]}...'")
|
||||
elif oldValue != newValue:
|
||||
# Truncate long values for logging
|
||||
oldTruncated = oldValue[:100] + "..." if len(oldValue) > 100 else oldValue
|
||||
newTruncated = newValue[:100] + "..." if len(newValue) > 100 else newValue
|
||||
rowChanges.append(f"{fieldName}: '{oldTruncated}' -> '{newTruncated}'")
|
||||
|
||||
existingRow[fieldName] = jiraRow.get(fieldName)
|
||||
|
||||
mergedData.append(existingRow)
|
||||
if rowChanges:
|
||||
updatedCount += 1
|
||||
changes.append(f"Row ID {jiraId} updated: {', '.join(rowChanges)}")
|
||||
else:
|
||||
unchangedCount += 1
|
||||
del existingLookup[jiraId]
|
||||
else:
|
||||
mergedData.append(jiraRow)
|
||||
addedCount += 1
|
||||
changes.append(f"Row ID {jiraId} added as new record")
|
||||
|
||||
# Add remaining existing rows
|
||||
for remaining in existingLookup.values():
|
||||
mergedData.append(remaining)
|
||||
unchangedCount += 1
|
||||
|
||||
mergeDetails = {
|
||||
"updated": updatedCount,
|
||||
"added": addedCount,
|
||||
"unchanged": unchangedCount,
|
||||
"changes": changes
|
||||
}
|
||||
|
||||
logger.info(f"Merged ticket data: {updatedCount} updated, {addedCount} added, {unchangedCount} unchanged")
|
||||
|
||||
# Generate filename
|
||||
workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
|
||||
filename = self._generateMeaningfulFileName(
|
||||
"merged_ticket_data",
|
||||
"json",
|
||||
workflowContext,
|
||||
"mergeTicketData"
|
||||
)
|
||||
|
||||
result = {
|
||||
"data": mergedData,
|
||||
"mergeDetails": mergeDetails
|
||||
}
|
||||
|
||||
validationMetadata = self._createValidationMetadata(
|
||||
"mergeTicketData",
|
||||
updated=updatedCount,
|
||||
added=addedCount,
|
||||
unchanged=unchangedCount
|
||||
)
|
||||
|
||||
document = ActionDocument(
|
||||
documentName=filename,
|
||||
documentData=json.dumps(result, indent=2, ensure_ascii=False),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
|
||||
return ActionResult.isSuccess(documents=[document])
|
||||
|
||||
except Exception as e:
|
||||
errorMsg = f"Error merging ticket data: {str(e)}"
|
||||
logger.error(errorMsg)
|
||||
return ActionResult.isFailure(error=errorMsg)
|
||||
|
||||
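The merge step in mergeTicketData above can be sketched as a standalone function. This is a simplified illustration, not the action itself: `merge_rows_by_id` and `sync_fields` are hypothetical names, the field list stands in for the `taskSyncDefinition` entries marked `'get'`, and values are compared directly rather than as strings.

```python
from typing import Any, Dict, List, Tuple


def merge_rows_by_id(
    jira_rows: List[Dict[str, Any]],
    existing_rows: List[Dict[str, Any]],
    sync_fields: List[str],
    id_field: str = "ID",
) -> Tuple[List[Dict[str, Any]], Dict[str, int]]:
    """Merge jira_rows into existing_rows, matching rows on id_field."""
    # Index existing rows by ID; rows without an ID are not indexed.
    lookup = {row[id_field]: row for row in existing_rows if row.get(id_field)}
    merged: List[Dict[str, Any]] = []
    counts = {"updated": 0, "added": 0, "unchanged": 0}

    for jira_row in jira_rows:
        row_id = jira_row.get(id_field)
        if row_id and row_id in lookup:
            # Pop the match so it is not appended again as a leftover.
            current = dict(lookup.pop(row_id))
            changed = False
            for field in sync_fields:
                if current.get(field) != jira_row.get(field):
                    changed = True
                current[field] = jira_row.get(field)
            merged.append(current)
            counts["updated" if changed else "unchanged"] += 1
        else:
            merged.append(jira_row)
            counts["added"] += 1

    # Rows that exist only in the existing data are carried over untouched.
    for leftover in lookup.values():
        merged.append(leftover)
        counts["unchanged"] += 1
    return merged, counts


existing = [{"ID": "A-1", "Status": "Open"}, {"ID": "A-2", "Status": "Done"}]
jira = [{"ID": "A-1", "Status": "Closed"}, {"ID": "A-3", "Status": "Open"}]
merged, counts = merge_rows_by_id(jira, existing, ["Status"])
print(counts)
```

Matched rows are copied before being updated, so the input lists stay unmodified, which mirrors the `.copy()` call in the action.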
112
modules/workflows/methods/methodJira/actions/parseCsvContent.py
Normal file
|
|
@ -0,0 +1,112 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Parse CSV Content action for JIRA operations.
|
||||
Parses CSV content with custom headers.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import json
|
||||
import io
|
||||
import pandas as pd
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def parseCsvContent(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
Parse CSV content with custom headers.
|
||||
|
||||
Parameters:
|
||||
- csvContent (str, required): Document reference containing CSV file content as bytes
|
||||
- skipRows (int, optional): Number of header rows to skip (default: 2)
|
||||
- hasCustomHeaders (bool, optional): Whether CSV has custom header rows (default: true)
|
||||
|
||||
Returns:
|
||||
- ActionResult with ActionDocument containing parsed data and headers as JSON
|
||||
"""
|
||||
try:
|
||||
csvContentParam = parameters.get("csvContent")
|
||||
if not csvContentParam:
|
||||
return ActionResult.isFailure(error="csvContent parameter is required")
|
||||
|
||||
skipRows = parameters.get("skipRows", 2)
|
||||
hasCustomHeaders = parameters.get("hasCustomHeaders", True)
|
||||
|
||||
# Get CSV content from document
|
||||
csvBytes = self.documentParsing.getDocumentData(csvContentParam)
|
||||
if csvBytes is None:
|
||||
return ActionResult.isFailure(error="Could not get CSV content from document reference")
|
||||
|
||||
# Convert to bytes if needed
|
||||
if isinstance(csvBytes, str):
|
||||
csvBytes = csvBytes.encode('utf-8')
|
||||
elif not isinstance(csvBytes, bytes):
|
||||
return ActionResult.isFailure(error="CSV content must be bytes or string")
|
||||
|
||||
# Parse headers if hasCustomHeaders
|
||||
headers = {"header1": "Header 1", "header2": "Header 2"}
|
||||
if hasCustomHeaders:
|
||||
csvLines = csvBytes.decode('utf-8').split('\n')
|
||||
if len(csvLines) >= 2:
|
||||
headers["header1"] = csvLines[0].rstrip('\r\n')
|
||||
headers["header2"] = csvLines[1].rstrip('\r\n')
|
||||
|
||||
# Parse CSV data
|
||||
df = pd.read_csv(
|
||||
io.BytesIO(csvBytes),
|
||||
skiprows=skipRows,
|
||||
quoting=1,  # 1 == csv.QUOTE_ALL
|
||||
escapechar='\\',
|
||||
on_bad_lines='skip',
|
||||
engine='python'
|
||||
)
|
||||
|
||||
# Convert to dict records
|
||||
for column in df.columns:
|
||||
df[column] = df[column].astype('object').fillna('')
|
||||
data = df.to_dict(orient='records')
|
||||
|
||||
logger.info(f"Parsed CSV: {len(data)} rows, {len(df.columns)} columns")
|
||||
|
||||
# Generate filename
|
||||
workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
|
||||
filename = self._generateMeaningfulFileName(
|
||||
"parsed_csv_data",
|
||||
"json",
|
||||
workflowContext,
|
||||
"parseCsvContent"
|
||||
)
|
||||
|
||||
result = {
|
||||
"data": data,
|
||||
"headers": headers,
|
||||
"rowCount": len(data),
|
||||
"columnCount": len(df.columns)
|
||||
}
|
||||
|
||||
validationMetadata = self._createValidationMetadata(
|
||||
"parseCsvContent",
|
||||
rowCount=len(data),
|
||||
columnCount=len(df.columns),
|
||||
skipRows=skipRows
|
||||
)
|
||||
|
||||
document = ActionDocument(
|
||||
documentName=filename,
|
||||
documentData=json.dumps(result, indent=2, ensure_ascii=False),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
|
||||
return ActionResult.isSuccess(documents=[document])
|
||||
|
||||
except Exception as e:
|
||||
errorMsg = f"Error parsing CSV content: {str(e)}"
|
||||
logger.error(errorMsg)
|
||||
return ActionResult.isFailure(error=errorMsg)
|
||||
|
||||
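The preamble handling in parseCsvContent above (two free-text header lines followed by the real CSV table) can be illustrated with the standard library alone. This is a rough approximation under stated assumptions, not the pandas call used in the diff: `parse_csv_with_preamble` is a hypothetical name, and splitting on line boundaries assumes no quoted field contains an embedded newline.

```python
import csv
import io
from typing import Dict, List, Tuple


def parse_csv_with_preamble(
    raw: bytes, skip_rows: int = 2
) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Split off skip_rows preamble lines, then parse the rest as CSV."""
    # Assumption: no quoted field spans multiple lines.
    lines = raw.decode("utf-8").splitlines()
    preamble = lines[:skip_rows]
    headers = {f"header{i + 1}": line for i, line in enumerate(preamble)}
    # The first remaining line supplies the column names.
    reader = csv.DictReader(io.StringIO("\n".join(lines[skip_rows:])))
    rows = [dict(row) for row in reader]
    return headers, rows


raw = b'Export from tool\nGenerated today\nID,Summary\nA-1,Fix login\nA-2,"Update, docs"\n'
headers, rows = parse_csv_with_preamble(raw)
print(headers)
print(rows)
```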
|
|
@ -0,0 +1,121 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Parse Excel Content action for JIRA operations.
|
||||
Parses Excel content with custom headers.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import json
|
||||
import pandas as pd
|
||||
from io import BytesIO
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def parseExcelContent(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
Parse Excel content with custom headers.
|
||||
|
||||
Parameters:
|
||||
- excelContent (str, required): Document reference containing Excel file content as bytes
|
||||
- skipRows (int, optional): Number of header rows to skip (default: 3)
|
||||
- hasCustomHeaders (bool, optional): Whether Excel has custom header rows (default: true)
|
||||
|
||||
Returns:
|
||||
- ActionResult with ActionDocument containing parsed data and headers as JSON
|
||||
"""
|
||||
try:
|
||||
excelContentParam = parameters.get("excelContent")
|
||||
if not excelContentParam:
|
||||
return ActionResult.isFailure(error="excelContent parameter is required")
|
||||
|
||||
skipRows = parameters.get("skipRows", 3)
|
||||
hasCustomHeaders = parameters.get("hasCustomHeaders", True)
|
||||
|
||||
# Get Excel content from document
|
||||
excelBytes = self.documentParsing.getDocumentData(excelContentParam)
|
||||
if excelBytes is None:
|
||||
return ActionResult.isFailure(error="Could not get Excel content from document reference")
|
||||
|
||||
# Convert to bytes if needed
|
||||
if isinstance(excelBytes, str):
|
||||
excelBytes = excelBytes.encode('latin-1') # Excel might have binary data
|
||||
elif not isinstance(excelBytes, bytes):
|
||||
return ActionResult.isFailure(error="Excel content must be bytes or string")
|
||||
|
||||
# Parse Excel
|
||||
df = pd.read_excel(BytesIO(excelBytes), engine='openpyxl', header=None)
|
||||
|
||||
# Extract headers if hasCustomHeaders
|
||||
headers = {"header1": "Header 1", "header2": "Header 2"}
|
||||
if hasCustomHeaders and len(df) >= 3:
|
||||
headerRow1 = df.iloc[0:1].copy()
|
||||
headerRow2 = df.iloc[1:2].copy()
|
||||
tableHeaders = df.iloc[2:3].copy()
|
||||
dfData = df.iloc[skipRows:].copy()
|
||||
dfData.columns = tableHeaders.iloc[0]
|
||||
|
||||
headers = {
|
||||
"header1": ",".join([str(x) if pd.notna(x) else "" for x in headerRow1.iloc[0].tolist()]),
|
||||
"header2": ",".join([str(x) if pd.notna(x) else "" for x in headerRow2.iloc[0].tolist()]),
|
||||
}
|
||||
else:
|
||||
# No custom headers, use standard parsing
|
||||
if skipRows > 0:
|
||||
dfData = df.iloc[skipRows:].copy()
|
||||
if len(df) > skipRows:
|
||||
dfData.columns = df.iloc[skipRows-1]
|
||||
else:
|
||||
dfData = df.copy()
|
||||
|
||||
# Reset index and clean data
|
||||
dfData = dfData.reset_index(drop=True)
|
||||
for column in dfData.columns:
|
||||
dfData[column] = dfData[column].astype('object').fillna('')
|
||||
|
||||
data = dfData.to_dict(orient='records')
|
||||
|
||||
logger.info(f"Parsed Excel: {len(data)} rows, {len(dfData.columns)} columns")
|
||||
|
||||
# Generate filename
|
||||
workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
|
||||
filename = self._generateMeaningfulFileName(
|
||||
"parsed_excel_data",
|
||||
"json",
|
||||
workflowContext,
|
||||
"parseExcelContent"
|
||||
)
|
||||
|
||||
result = {
|
||||
"data": data,
|
||||
"headers": headers,
|
||||
"rowCount": len(data),
|
||||
"columnCount": len(dfData.columns)
|
||||
}
|
||||
|
||||
validationMetadata = self._createValidationMetadata(
|
||||
"parseExcelContent",
|
||||
rowCount=len(data),
|
||||
columnCount=len(dfData.columns),
|
||||
skipRows=skipRows
|
||||
)
|
||||
|
||||
document = ActionDocument(
|
||||
documentName=filename,
|
||||
documentData=json.dumps(result, indent=2, ensure_ascii=False),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
|
||||
return ActionResult.isSuccess(documents=[document])
|
||||
|
||||
except Exception as e:
|
||||
errorMsg = f"Error parsing Excel content: {str(e)}"
|
||||
logger.error(errorMsg)
|
||||
return ActionResult.isFailure(error=errorMsg)
|
||||
|
||||
5
modules/workflows/methods/methodJira/helpers/__init__.py
Normal file
|
|
@ -0,0 +1,5 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""Helper modules for JIRA method operations."""
|
||||
|
||||
180
modules/workflows/methods/methodJira/helpers/adfConverter.py
Normal file
|
|
@ -0,0 +1,180 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
ADF Converter helper for JIRA operations.
|
||||
Handles conversion of Atlassian Document Format (ADF) to plain text.
|
||||
"""
|
||||
|
||||
import logging
|
||||
from typing import Any
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class AdfConverterHelper:
|
||||
"""Helper for ADF conversion operations"""
|
||||
|
||||
def __init__(self, methodInstance):
|
||||
"""
|
||||
Initialize ADF converter helper.
|
||||
|
||||
Args:
|
||||
methodInstance: Instance of MethodJira (for access to services)
|
||||
"""
|
||||
self.method = methodInstance
|
||||
self.services = methodInstance.services
|
||||
|
||||
def convertAdfToText(self, adfData):
|
||||
"""Convert Atlassian Document Format (ADF) to plain text.
|
||||
|
||||
Based on Atlassian Document Format specification for JIRA fields.
|
||||
Handles paragraphs, lists, text formatting, and other ADF node types.
|
||||
|
||||
Args:
|
||||
adfData: ADF object or None
|
||||
|
||||
Returns:
|
||||
str: Plain text content, or empty string if None/invalid
|
||||
"""
|
||||
if not adfData or not isinstance(adfData, dict):
|
||||
return ""
|
||||
|
||||
if adfData.get("type") != "doc":
|
||||
return str(adfData) if adfData else ""
|
||||
|
||||
content = adfData.get("content", [])
|
||||
if not isinstance(content, list):
|
||||
return ""
|
||||
|
||||
def extractTextFromContent(contentList, listLevel=0):
|
||||
"""Recursively extract text from ADF content with proper formatting."""
|
||||
textParts = []
|
||||
listCounter = 1
|
||||
|
||||
for item in contentList:
|
||||
if not isinstance(item, dict):
|
||||
continue
|
||||
|
||||
itemType = item.get("type", "")
|
||||
|
||||
if itemType == "text":
|
||||
# Extract text content, preserving formatting
|
||||
text = item.get("text", "")
|
||||
marks = item.get("marks", [])
|
||||
|
||||
# Handle text formatting (bold, italic, etc.)
|
||||
if marks:
|
||||
for mark in marks:
|
||||
if mark.get("type") == "strong":
|
||||
text = f"**{text}**"
|
||||
elif mark.get("type") == "em":
|
||||
text = f"*{text}*"
|
||||
elif mark.get("type") == "code":
|
||||
text = f"`{text}`"
|
||||
elif mark.get("type") == "link":
|
||||
attrs = mark.get("attrs", {})
|
||||
href = attrs.get("href", "")
|
||||
if href:
|
||||
text = f"[{text}]({href})"
|
||||
|
||||
textParts.append(text)
|
||||
|
||||
elif itemType == "hardBreak":
|
||||
textParts.append("\n")
|
||||
|
||||
elif itemType == "paragraph":
|
||||
paragraphContent = item.get("content", [])
|
||||
if paragraphContent:
|
||||
paragraphText = extractTextFromContent(paragraphContent, listLevel)
|
||||
if paragraphText.strip():
|
||||
textParts.append(paragraphText)
|
||||
|
||||
elif itemType == "bulletList":
|
||||
listContent = item.get("content", [])
|
||||
if listContent:
|
||||
listText = extractTextFromContent(listContent, listLevel + 1)
|
||||
if listText.strip():
|
||||
textParts.append(listText)
|
||||
|
||||
elif itemType == "orderedList":
|
||||
listContent = item.get("content", [])
|
||||
if listContent:
|
||||
listText = extractTextFromContent(listContent, listLevel + 1)
|
||||
if listText.strip():
|
||||
textParts.append(listText)
|
||||
|
||||
elif itemType == "listItem":
|
||||
itemContent = item.get("content", [])
|
||||
if itemContent:
|
||||
indent = " " * listLevel
|
||||
itemText = extractTextFromContent(itemContent, listLevel)
|
||||
if itemText.strip():
|
||||
prefix = f"{indent}- " if listLevel > 0 else "- "
|
||||
textParts.append(f"{prefix}{itemText}")
|
||||
|
||||
elif itemType == "heading":
|
||||
level = item.get("attrs", {}).get("level", 1)
|
||||
headingContent = item.get("content", [])
|
||||
if headingContent:
|
||||
headingText = extractTextFromContent(headingContent, listLevel)
|
||||
if headingText.strip():
|
||||
prefix = "#" * level + " "
|
||||
textParts.append(f"{prefix}{headingText}")
|
||||
|
||||
elif itemType == "codeBlock":
|
||||
codeContent = item.get("content", [])
|
||||
if codeContent:
|
||||
codeText = extractTextFromContent(codeContent, listLevel)
|
||||
if codeText.strip():
|
||||
textParts.append(f"```\n{codeText}\n```")
|
||||
|
||||
elif itemType == "blockquote":
|
||||
quoteContent = item.get("content", [])
|
||||
if quoteContent:
|
||||
quoteText = extractTextFromContent(quoteContent, listLevel)
|
||||
if quoteText.strip():
|
||||
textParts.append(f"> {quoteText}")
|
||||
|
||||
elif itemType == "table":
|
||||
tableContent = item.get("content", [])
|
||||
if tableContent:
|
||||
tableText = extractTextFromContent(tableContent, listLevel)
|
||||
if tableText.strip():
|
||||
textParts.append(tableText)
|
||||
|
||||
elif itemType == "tableRow":
|
||||
rowContent = item.get("content", [])
|
||||
if rowContent:
|
||||
rowText = extractTextFromContent(rowContent, listLevel)
|
||||
if rowText.strip():
|
||||
textParts.append(rowText)
|
||||
|
||||
elif itemType == "tableCell":
|
||||
cellContent = item.get("content", [])
|
||||
if cellContent:
|
||||
cellText = extractTextFromContent(cellContent, listLevel)
|
||||
if cellText.strip():
|
||||
textParts.append(cellText)
|
||||
|
||||
elif itemType == "mediaGroup":
|
||||
# Skip media groups for now
|
||||
pass
|
||||
|
||||
elif itemType == "media":
|
||||
# Skip media for now
|
||||
pass
|
||||
|
||||
else:
|
||||
# Unknown type - try to extract content if available
|
||||
if "content" in item:
|
||||
unknownContent = item.get("content", [])
|
||||
if unknownContent:
|
||||
unknownText = extractTextFromContent(unknownContent, listLevel)
|
||||
if unknownText.strip():
|
||||
textParts.append(unknownText)
|
||||
|
||||
return "".join(textParts)
|
||||
|
||||
result = extractTextFromContent(content)
|
||||
return result.strip() if result else ""
|
||||
|
||||
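The recursive traversal used by convertAdfToText above can be shown on a much smaller scale. The sketch below is an illustration only: `adf_to_text` is a hypothetical name, it covers just the `doc`, `paragraph`, `text`, and `hardBreak` node types, it ignores marks, and it separates top-level blocks with a newline, which is a choice of this sketch rather than the helper's behavior.

```python
from typing import Any, Dict, List


def adf_to_text(node: Dict[str, Any]) -> str:
    """Flatten a small subset of ADF (doc, paragraph, text, hardBreak) to text."""
    node_type = node.get("type")
    if node_type == "text":
        return node.get("text", "")
    if node_type == "hardBreak":
        return "\n"
    children: List[Dict[str, Any]] = node.get("content", [])
    parts = [adf_to_text(child) for child in children if isinstance(child, dict)]
    if node_type == "doc":
        # Separate non-empty top-level blocks with a newline.
        return "\n".join(part for part in parts if part.strip())
    return "".join(parts)


sample = {
    "type": "doc",
    "version": 1,
    "content": [
        {"type": "paragraph", "content": [{"type": "text", "text": "First line"}]},
        {"type": "paragraph", "content": [
            {"type": "text", "text": "Second "},
            {"type": "text", "text": "line", "marks": [{"type": "strong"}]},
        ]},
    ],
}
print(adf_to_text(sample))
```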
|
|
@ -0,0 +1,81 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Document Parsing helper for JIRA operations.
|
||||
Handles parsing of document references and JSON content.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import json
|
||||
from typing import Any, Optional, Dict
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class DocumentParsingHelper:
|
||||
"""Helper for document parsing operations"""
|
||||
|
||||
def __init__(self, methodInstance):
|
||||
"""
|
||||
Initialize document parsing helper.
|
||||
|
||||
Args:
|
||||
methodInstance: Instance of MethodJira (for access to services)
|
||||
"""
|
||||
self.method = methodInstance
|
||||
self.services = methodInstance.services
|
||||
|
||||
def getDocumentData(self, documentReference: str) -> Any:
|
||||
"""
|
||||
Get document data from a document reference.
|
||||
|
||||
Args:
|
||||
documentReference: Document reference string
|
||||
|
||||
Returns:
|
||||
Document data (bytes, str, or None)
|
||||
"""
|
||||
try:
|
||||
docList = DocumentReferenceList.from_string_list([documentReference])
|
||||
chatDocuments = self.services.chat.getChatDocumentsFromDocumentList(docList)
|
||||
if not chatDocuments:
|
||||
return None
|
||||
|
||||
doc = chatDocuments[0]
|
||||
fileId = getattr(doc, 'fileId', None)
|
||||
if not fileId:
|
||||
return None
|
||||
|
||||
return self.services.chat.getFileData(fileId)
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting document data: {str(e)}")
|
||||
return None
|
||||
|
||||
def parseJsonFromDocument(self, documentReference: str) -> Optional[Any]:
|
||||
"""
|
||||
Parse JSON content from a document reference.
|
||||
|
||||
Args:
|
||||
documentReference: Document reference string
|
||||
|
||||
Returns:
|
||||
Parsed JSON value (dict or list), or None if the document is missing or invalid
|
||||
"""
|
||||
try:
|
||||
fileData = self.getDocumentData(documentReference)
|
||||
if not fileData:
|
||||
return None
|
||||
|
||||
# Handle bytes
|
||||
if isinstance(fileData, bytes):
|
||||
jsonStr = fileData.decode('utf-8')
|
||||
else:
|
||||
jsonStr = str(fileData)
|
||||
|
||||
# Parse JSON
|
||||
return json.loads(jsonStr)
|
||||
except Exception as e:
|
||||
logger.error(f"Error parsing JSON from document: {str(e)}")
|
||||
return None
|
||||
|
||||
322
modules/workflows/methods/methodJira/methodJira.py
Normal file
|
|
@ -0,0 +1,322 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
import logging
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import MethodBase
|
||||
from modules.datamodels.datamodelWorkflowActions import WorkflowActionDefinition, WorkflowActionParameter
|
||||
from modules.shared.frontendTypes import FrontendType
|
||||
|
||||
# Import helpers
|
||||
from .helpers.adfConverter import AdfConverterHelper
|
||||
from .helpers.documentParsing import DocumentParsingHelper
|
||||
|
||||
# Import actions
|
||||
from .actions.connectJira import connectJira
|
||||
from .actions.exportTicketsAsJson import exportTicketsAsJson
|
||||
from .actions.importTicketsFromJson import importTicketsFromJson
|
||||
from .actions.mergeTicketData import mergeTicketData
|
||||
from .actions.parseCsvContent import parseCsvContent
|
||||
from .actions.parseExcelContent import parseExcelContent
|
||||
from .actions.createCsvContent import createCsvContent
|
||||
from .actions.createExcelContent import createExcelContent
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class MethodJira(MethodBase):
|
||||
"""JIRA operations methods."""
|
||||
|
||||
def __init__(self, services):
|
||||
super().__init__(services)
|
||||
self.name = "jira"
|
||||
self.description = "JIRA operations methods"
|
||||
# Store connections in memory (keyed by connectionId)
|
||||
self._connections: Dict[str, Any] = {}
|
||||
|
||||
# Initialize helper modules
|
||||
self.adfConverter = AdfConverterHelper(self)
|
||||
self.documentParsing = DocumentParsingHelper(self)
|
||||
|
||||
# RBAC integration: action definitions with actionId
|
||||
self._actions = {
|
||||
"connectJira": WorkflowActionDefinition(
|
||||
actionId="jira.connectJira",
|
||||
description="Connect to JIRA instance and create ticket interface",
|
||||
parameters={
|
||||
"apiUsername": WorkflowActionParameter(
|
||||
name="apiUsername",
|
||||
type="str",
|
||||
frontendType=FrontendType.EMAIL,
|
||||
required=True,
|
||||
description="JIRA API username/email"
|
||||
),
|
||||
"apiTokenConfigKey": WorkflowActionParameter(
|
||||
name="apiTokenConfigKey",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=True,
|
||||
description="APP_CONFIG key name for JIRA API token"
|
||||
),
|
||||
"apiUrl": WorkflowActionParameter(
|
||||
name="apiUrl",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=True,
|
||||
description="JIRA instance URL (e.g., https://example.atlassian.net)"
|
||||
),
|
||||
"projectCode": WorkflowActionParameter(
|
||||
name="projectCode",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=True,
|
||||
description="JIRA project code (e.g., DCS)"
|
||||
),
|
||||
"issueType": WorkflowActionParameter(
|
||||
name="issueType",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=True,
|
||||
description="JIRA issue type (e.g., Task)"
|
||||
),
|
||||
"taskSyncDefinition": WorkflowActionParameter(
|
||||
name="taskSyncDefinition",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXTAREA,
|
||||
required=True,
|
||||
description="Field mapping definition as JSON string or dict"
|
||||
)
|
||||
},
|
||||
execute=connectJira.__get__(self, self.__class__)
|
||||
),
|
||||
"exportTicketsAsJson": WorkflowActionDefinition(
|
||||
actionId="jira.exportTicketsAsJson",
|
||||
description="Export tickets from JIRA as JSON list",
|
||||
parameters={
|
||||
"connectionId": WorkflowActionParameter(
|
||||
name="connectionId",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=True,
|
||||
description="Connection ID from connectJira action result"
|
||||
),
|
||||
"taskSyncDefinition": WorkflowActionParameter(
|
||||
name="taskSyncDefinition",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXTAREA,
|
||||
required=False,
|
||||
description="Field mapping definition (if not provided, uses stored definition)"
|
||||
)
|
||||
},
|
||||
execute=exportTicketsAsJson.__get__(self, self.__class__)
|
||||
),
|
||||
"importTicketsFromJson": WorkflowActionDefinition(
|
||||
actionId="jira.importTicketsFromJson",
|
||||
description="Import ticket data from JSON back to JIRA",
|
||||
parameters={
|
||||
"connectionId": WorkflowActionParameter(
|
||||
name="connectionId",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=True,
|
||||
description="Connection ID from connectJira action result"
|
||||
),
|
||||
"ticketData": WorkflowActionParameter(
|
||||
name="ticketData",
|
||||
type="str",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=True,
|
||||
description="Document reference containing ticket data as JSON"
|
||||
),
|
||||
"taskSyncDefinition": WorkflowActionParameter(
|
||||
name="taskSyncDefinition",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXTAREA,
|
||||
required=False,
|
||||
description="Field mapping definition (if not provided, uses stored definition)"
|
||||
)
|
||||
},
|
||||
execute=importTicketsFromJson.__get__(self, self.__class__)
|
||||
),
|
||||
"mergeTicketData": WorkflowActionDefinition(
|
||||
actionId="jira.mergeTicketData",
|
||||
description="Merge JIRA export data with existing SharePoint data",
|
||||
parameters={
|
||||
"jiraData": WorkflowActionParameter(
|
||||
name="jiraData",
|
||||
type="str",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=True,
|
||||
description="Document reference containing JIRA ticket data as JSON array"
|
||||
),
|
||||
"existingData": WorkflowActionParameter(
|
||||
name="existingData",
|
||||
type="str",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=True,
|
||||
description="Document reference containing existing SharePoint data as JSON array"
|
||||
),
|
||||
"taskSyncDefinition": WorkflowActionParameter(
|
||||
name="taskSyncDefinition",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXTAREA,
|
||||
required=True,
|
||||
description="Field mapping definition"
|
||||
),
|
||||
"idField": WorkflowActionParameter(
|
||||
name="idField",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=False,
|
||||
default="ID",
|
||||
description="Field name to use as ID for merging"
|
||||
)
|
||||
},
|
||||
execute=mergeTicketData.__get__(self, self.__class__)
|
||||
),
|
||||
"parseCsvContent": WorkflowActionDefinition(
|
||||
actionId="jira.parseCsvContent",
|
||||
description="Parse CSV content with custom headers",
|
||||
parameters={
|
||||
"csvContent": WorkflowActionParameter(
|
||||
name="csvContent",
|
||||
type="str",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=True,
|
||||
description="Document reference containing CSV file content as bytes"
|
||||
),
|
||||
"skipRows": WorkflowActionParameter(
|
||||
name="skipRows",
|
||||
type="int",
|
||||
frontendType=FrontendType.NUMBER,
|
||||
required=False,
|
||||
default=2,
|
||||
description="Number of header rows to skip",
|
||||
validation={"min": 0, "max": 100}
|
||||
),
|
||||
"hasCustomHeaders": WorkflowActionParameter(
|
||||
name="hasCustomHeaders",
|
||||
type="bool",
|
||||
frontendType=FrontendType.CHECKBOX,
|
||||
required=False,
|
||||
default=True,
|
||||
description="Whether CSV has custom header rows"
|
||||
)
|
||||
},
|
||||
execute=parseCsvContent.__get__(self, self.__class__)
|
||||
),
|
||||
"parseExcelContent": WorkflowActionDefinition(
|
||||
actionId="jira.parseExcelContent",
|
||||
description="Parse Excel content with custom headers",
|
||||
parameters={
|
||||
"excelContent": WorkflowActionParameter(
|
||||
name="excelContent",
|
||||
type="str",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=True,
|
||||
description="Document reference containing Excel file content as bytes"
|
||||
),
|
||||
"skipRows": WorkflowActionParameter(
|
||||
name="skipRows",
|
||||
type="int",
|
||||
frontendType=FrontendType.NUMBER,
|
||||
required=False,
|
||||
default=3,
|
||||
description="Number of header rows to skip",
|
||||
validation={"min": 0, "max": 100}
|
||||
),
|
||||
"hasCustomHeaders": WorkflowActionParameter(
|
||||
name="hasCustomHeaders",
|
||||
type="bool",
|
||||
frontendType=FrontendType.CHECKBOX,
|
||||
required=False,
|
||||
default=True,
|
||||
description="Whether Excel has custom header rows"
|
||||
)
|
||||
},
|
||||
execute=parseExcelContent.__get__(self, self.__class__)
|
||||
),
|
||||
"createCsvContent": WorkflowActionDefinition(
|
||||
actionId="jira.createCsvContent",
|
||||
description="Create CSV content with custom headers",
|
||||
parameters={
|
||||
"data": WorkflowActionParameter(
|
||||
name="data",
|
||||
type="str",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=True,
|
||||
description="Document reference containing data as JSON (with data field from mergeTicketData)"
|
||||
),
|
||||
"headers": WorkflowActionParameter(
|
||||
name="headers",
|
||||
type="str",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=False,
|
||||
description="Document reference containing headers JSON (from parseCsvContent/parseExcelContent)"
|
||||
),
|
||||
"columns": WorkflowActionParameter(
|
||||
name="columns",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.MULTISELECT,
|
||||
required=False,
|
||||
description="List of column names (if not provided, extracted from taskSyncDefinition or data)"
|
||||
),
|
||||
"taskSyncDefinition": WorkflowActionParameter(
|
||||
name="taskSyncDefinition",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXTAREA,
|
||||
required=False,
|
||||
description="Field mapping definition (used to extract column names if columns not provided)"
|
||||
)
|
||||
},
|
||||
execute=createCsvContent.__get__(self, self.__class__)
|
||||
),
|
||||
"createExcelContent": WorkflowActionDefinition(
|
||||
actionId="jira.createExcelContent",
|
||||
description="Create Excel content with custom headers",
|
||||
parameters={
|
||||
"data": WorkflowActionParameter(
|
||||
name="data",
|
||||
type="str",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=True,
|
||||
description="Document reference containing data as JSON (with data field from mergeTicketData)"
|
||||
),
|
||||
"headers": WorkflowActionParameter(
|
||||
name="headers",
|
||||
type="str",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=False,
|
||||
description="Document reference containing headers JSON (from parseExcelContent)"
|
||||
),
|
||||
"columns": WorkflowActionParameter(
|
||||
name="columns",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.MULTISELECT,
|
||||
required=False,
|
||||
description="List of column names (if not provided, extracted from taskSyncDefinition or data)"
|
||||
),
|
||||
"taskSyncDefinition": WorkflowActionParameter(
|
||||
name="taskSyncDefinition",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXTAREA,
|
||||
required=False,
|
||||
description="Field mapping definition (used to extract column names if columns not provided)"
|
||||
)
|
||||
},
|
||||
execute=createExcelContent.__get__(self, self.__class__)
|
||||
)
|
||||
}
|
||||
|
||||
# Validate actions after definition
|
||||
self._validateActions()
|
||||
|
||||
# Register actions as methods (optional, for direct access)
|
||||
self.connectJira = connectJira.__get__(self, self.__class__)
|
||||
self.exportTicketsAsJson = exportTicketsAsJson.__get__(self, self.__class__)
|
||||
self.importTicketsFromJson = importTicketsFromJson.__get__(self, self.__class__)
|
||||
self.mergeTicketData = mergeTicketData.__get__(self, self.__class__)
|
||||
self.parseCsvContent = parseCsvContent.__get__(self, self.__class__)
|
||||
self.parseExcelContent = parseExcelContent.__get__(self, self.__class__)
|
||||
self.createCsvContent = createCsvContent.__get__(self, self.__class__)
|
||||
self.createExcelContent = createExcelContent.__get__(self, self.__class__)
|
||||
|
||||
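The registration above relies on Python's descriptor protocol: calling `__get__(instance, owner)` on a plain function returns a method bound to that instance. A minimal sketch of the same pattern follows; `MethodDemo` and `sayHello` are hypothetical names used only for illustration and are not part of this commit.

```python
def sayHello(self, name: str) -> str:
    # Free function written as if it were a method: it expects `self` first.
    return f"{self.prefix}, {name}"


class MethodDemo:
    """Hypothetical stand-in for a workflow method class."""

    def __init__(self, prefix: str):
        self.prefix = prefix
        # function.__get__(instance, owner) yields a bound method,
        # so callers no longer need to pass `self` explicitly.
        bound = sayHello.__get__(self, self.__class__)
        self.actions = {"sayHello": bound}  # registry entry, like `execute=`
        self.sayHello = bound               # direct attribute access


demo = MethodDemo("Hi")
print(demo.sayHello("Ann"))
print(demo.actions["sayHello"]("Bob"))
```

Binding per instance keeps each action in its own module while still giving it access to the instance state (`self.services`, `self.connection`, and so on in the real class).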
7
modules/workflows/methods/methodOutlook/__init__.py
Normal file
|
|
@ -0,0 +1,7 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
from .methodOutlook import MethodOutlook
|
||||
|
||||
__all__ = ['MethodOutlook']
|
||||
|
||||
18
modules/workflows/methods/methodOutlook/actions/__init__.py
Normal file
|
|
@ -0,0 +1,18 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""Action modules for Outlook operations."""
|
||||
|
||||
# Export all actions
|
||||
from .readEmails import readEmails
|
||||
from .searchEmails import searchEmails
|
||||
from .composeAndDraftEmailWithContext import composeAndDraftEmailWithContext
|
||||
from .sendDraftEmail import sendDraftEmail
|
||||
|
||||
__all__ = [
|
||||
'readEmails',
|
||||
'searchEmails',
|
||||
'composeAndDraftEmailWithContext',
|
||||
'sendDraftEmail',
|
||||
]
|
||||
|
||||
|
|
@ -0,0 +1,362 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Compose And Draft Email With Context action for Outlook operations.
|
||||
Composes email content using AI from context and optional documents, then creates a draft.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import json
|
||||
import base64
|
||||
import requests
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def composeAndDraftEmailWithContext(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Compose email content using AI from context and optional documents, then create a draft.
|
||||
- Input requirements: connectionReference (required); to (required); context (required); optional documentList, cc, bcc, emailStyle, maxLength.
|
||||
- Output format: JSON confirmation with AI-generated draft metadata.
|
||||
|
||||
Parameters:
|
||||
- connectionReference (str, required): Microsoft connection label.
|
||||
- to (list, required): Recipient email addresses.
|
||||
- context (str, required): Detailed context for composing the email.
|
||||
- documentList (list, optional): Document references for context/attachments.
|
||||
- cc (list, optional): CC recipients.
|
||||
- bcc (list, optional): BCC recipients.
|
||||
- emailStyle (str, optional): formal | casual | business. Default: business.
|
||||
- maxLength (int, optional): Maximum length for generated content. Default: 1000.
|
||||
"""
|
||||
try:
|
||||
connectionReference = parameters.get("connectionReference")
|
||||
to = parameters.get("to")
|
||||
context = parameters.get("context")
|
||||
documentList = parameters.get("documentList", [])
|
||||
cc = parameters.get("cc", [])
|
||||
bcc = parameters.get("bcc", [])
|
||||
emailStyle = parameters.get("emailStyle", "business")
|
||||
maxLength = parameters.get("maxLength", 1000)
|
||||
|
||||
if not connectionReference or not to or not context:
|
||||
return ActionResult.isFailure(error="connectionReference, to, and context are required")
|
||||
|
||||
# Convert single values to lists for all recipient parameters
|
||||
if isinstance(to, str):
|
||||
to = [to]
|
||||
if isinstance(cc, str):
|
||||
cc = [cc]
|
||||
if isinstance(bcc, str):
|
||||
bcc = [bcc]
|
||||
if isinstance(documentList, str):
|
||||
documentList = [documentList]
|
||||
|
||||
# Get Microsoft connection
|
||||
connection = self.connection.getMicrosoftConnection(connectionReference)
|
||||
if not connection:
|
||||
return ActionResult.isFailure(error="No valid Microsoft connection found")
|
||||
|
||||
# Check permissions
|
||||
permissions_ok = await self.connection.checkPermissions(connection)
|
||||
if not permissions_ok:
|
||||
return ActionResult.isFailure(error="Connection lacks necessary permissions for Outlook operations")
|
||||
|
||||
# Prepare documents for AI processing
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
chatDocuments = []
|
||||
if documentList:
|
||||
# Convert to DocumentReferenceList if needed
|
||||
if isinstance(documentList, DocumentReferenceList):
|
||||
docRefList = documentList
|
||||
elif isinstance(documentList, list):
|
||||
docRefList = DocumentReferenceList.from_string_list(documentList)
|
||||
elif isinstance(documentList, str):
|
||||
docRefList = DocumentReferenceList.from_string_list([documentList])
|
||||
else:
|
||||
docRefList = DocumentReferenceList(references=[])
|
||||
chatDocuments = self.services.chat.getChatDocumentsFromDocumentList(docRefList)
|
||||
|
||||
# Create AI prompt for email composition
|
||||
# Build document reference list for AI with expanded list contents when possible
|
||||
doc_references = documentList
|
||||
doc_list_text = ""
|
||||
if doc_references:
|
||||
lines = ["Available_Document_References:"]
|
||||
for ref in doc_references:
|
||||
# Each item is a label: resolve to its document list and render contained items
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
list_docs = self.services.chat.getChatDocumentsFromDocumentList(DocumentReferenceList.from_string_list([ref])) or []
|
||||
if list_docs:
|
||||
for d in list_docs:
|
||||
doc_ref_label = self.services.chat.getDocumentReferenceFromChatDocument(d)
|
||||
lines.append(f"- {doc_ref_label}")
|
||||
else:
|
||||
lines.append(" - (no documents)")
|
||||
doc_list_text = "\n" + "\n".join(lines)
|
||||
else:
|
||||
doc_list_text = "Available_Document_References: (No documents available for attachment)"
|
||||
|
||||
# Escape quotes and line breaks in the user-supplied context so it stays inside the delimited block (reduces, but does not rule out, prompt injection)
|
||||
escaped_context = context.replace('"', '\\"').replace('\n', '\\n').replace('\r', '\\r')
|
||||
|
||||
ai_prompt = f"""Compose an email based on this context:
|
||||
-------
|
||||
{escaped_context}
|
||||
-------
|
||||
|
||||
Recipients: {to}
|
||||
Style: {emailStyle}
|
||||
Max length: {maxLength} characters
|
||||
{doc_list_text}
|
||||
|
||||
Based on the context, decide which documents to attach.
|
||||
|
||||
CRITICAL: Use EXACT document references from Available_Document_References above. For individual documents: ALWAYS use docItem:<documentId>:<filename> format (include filename)
|
||||
|
||||
Return JSON:
|
||||
{{
|
||||
"subject": "subject line",
|
||||
"body": "email body (HTML allowed)",
|
||||
"attachments": ["docItem:<documentId>:<filename>"]
|
||||
}}
|
||||
"""
|
||||
|
||||
# Call AI service to generate email content
|
||||
try:
|
||||
ai_response = await self.services.ai.callAiPlanning(
|
||||
prompt=ai_prompt,
|
||||
placeholders=None,
|
||||
debugType="email_composition"
|
||||
)
|
||||
|
||||
# Parse AI response
|
||||
try:
|
||||
ai_content = ai_response
|
||||
# Extract JSON from AI response
|
||||
if "```json" in ai_content:
|
||||
json_start = ai_content.find("```json") + 7
|
||||
json_end = ai_content.find("```", json_start)
|
||||
json_content = ai_content[json_start:json_end].strip()
|
||||
elif "{" in ai_content and "}" in ai_content:
|
||||
json_start = ai_content.find("{")
|
||||
json_end = ai_content.rfind("}") + 1
|
||||
json_content = ai_content[json_start:json_end]
|
||||
else:
|
||||
json_content = ai_content
|
||||
|
||||
email_data = json.loads(json_content)
|
||||
subject = email_data.get("subject", "")
|
||||
body = email_data.get("body", "")
|
||||
ai_attachments = email_data.get("attachments", [])
|
||||
|
||||
if not subject or not body:
|
||||
return ActionResult.isFailure(error="AI did not generate valid subject and body")
|
||||
|
||||
# Use AI-selected attachments if provided, otherwise use all documents
|
||||
normalized_ai_attachments = []
|
||||
if documentList:
|
||||
try:
|
||||
available_refs = [documentList] if isinstance(documentList, str) else documentList
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
available_docs = self.services.chat.getChatDocumentsFromDocumentList(DocumentReferenceList.from_string_list(available_refs)) or []
|
||||
except Exception:
|
||||
available_docs = []
|
||||
|
||||
# Normalize AI attachments to a list of strings
|
||||
if isinstance(ai_attachments, str):
|
||||
ai_attachments = [ai_attachments]
|
||||
elif isinstance(ai_attachments, list):
|
||||
ai_attachments = [a for a in ai_attachments if isinstance(a, str)]
|
||||
|
||||
if ai_attachments:
|
||||
try:
|
||||
ai_refs = [ai_attachments] if isinstance(ai_attachments, str) else ai_attachments
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
ai_docs = self.services.chat.getChatDocumentsFromDocumentList(DocumentReferenceList.from_string_list(ai_refs)) or []
|
||||
except Exception:
|
||||
ai_docs = []
|
||||
|
||||
# Intersect by document id
|
||||
available_ids = {getattr(d, 'id', None) for d in available_docs}
|
||||
selected_docs = [d for d in ai_docs if getattr(d, 'id', None) in available_ids]
|
||||
|
||||
if selected_docs:
|
||||
# Map selected ChatDocuments back to docItem references (with full filename)
|
||||
documentList = [self.services.chat.getDocumentReferenceFromChatDocument(d) for d in selected_docs]
|
||||
# Normalize ai_attachments to full format for storage
|
||||
normalized_ai_attachments = documentList.copy()
|
||||
logger.info(f"AI selected {len(documentList)} documents for attachment (resolved via ChatDocuments)")
|
||||
else:
|
||||
# No intersection; use all available documents
|
||||
documentList = [self.services.chat.getDocumentReferenceFromChatDocument(d) for d in available_docs]
|
||||
normalized_ai_attachments = documentList.copy()
|
||||
logger.warning("AI selected attachments not found in available documents, using all documents")
|
||||
else:
|
||||
# No AI selection; use all available documents
|
||||
documentList = [self.services.chat.getDocumentReferenceFromChatDocument(d) for d in available_docs]
|
||||
normalized_ai_attachments = documentList.copy()
|
||||
logger.warning("AI did not specify attachments, using all available documents")
|
||||
else:
|
||||
logger.info("No documents provided in documentList; skipping attachment processing")
|
||||
|
||||
except json.JSONDecodeError as e:
|
||||
logger.error(f"Failed to parse AI response as JSON: {str(e)}")
|
||||
logger.error(f"AI response content: {ai_response}")
|
||||
return ActionResult.isFailure(error="AI response was not valid JSON format")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error calling AI service: {str(e)}")
|
||||
return ActionResult.isFailure(error=f"Failed to generate email content: {str(e)}")
|
||||
|
||||
# Now create the email with AI-generated content
|
||||
try:
|
||||
graph_url = "https://graph.microsoft.com/v1.0"
|
||||
headers = {
|
||||
"Authorization": f"Bearer {connection['accessToken']}",
|
||||
"Content-Type": "application/json"
|
||||
}
|
||||
|
||||
# Clean and format body content
|
||||
cleaned_body = body.strip()
|
||||
|
||||
# Check if body is already HTML
|
||||
if cleaned_body.startswith('<html>') or cleaned_body.startswith('<body>') or '<br>' in cleaned_body:
|
||||
html_body = cleaned_body
|
||||
else:
|
||||
# Convert plain text to proper HTML formatting
|
||||
html_body = cleaned_body.replace('\n', '<br>')
|
||||
html_body = f"<html><body>{html_body}</body></html>"
|
||||
|
||||
# Build the email message
|
||||
message = {
|
||||
"subject": subject,
|
||||
"body": {
|
||||
"contentType": "HTML",
|
||||
"content": html_body
|
||||
},
|
||||
"toRecipients": [{"emailAddress": {"address": email}} for email in to],
|
||||
"ccRecipients": [{"emailAddress": {"address": email}} for email in cc] if cc else [],
|
||||
"bccRecipients": [{"emailAddress": {"address": email}} for email in bcc] if bcc else []
|
||||
}
|
||||
|
||||
# Add documents as attachments if provided
|
||||
if documentList:
|
||||
message["attachments"] = []
|
||||
for attachment_ref in documentList:
|
||||
# Get attachment document from service center
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
attachment_docs = self.services.chat.getChatDocumentsFromDocumentList(DocumentReferenceList.from_string_list([attachment_ref]))
|
||||
if attachment_docs:
|
||||
for doc in attachment_docs:
|
||||
file_id = getattr(doc, 'fileId', None)
|
||||
if file_id:
|
||||
try:
|
||||
file_content = self.services.chat.getFileData(file_id)
|
||||
if file_content:
|
||||
if isinstance(file_content, bytes):
|
||||
content_bytes = file_content
|
||||
else:
|
||||
content_bytes = str(file_content).encode('utf-8')
|
||||
|
||||
base64_content = base64.b64encode(content_bytes).decode('utf-8')
|
||||
|
||||
attachment = {
|
||||
"@odata.type": "#microsoft.graph.fileAttachment",
|
||||
"name": doc.fileName,
|
||||
"contentType": doc.mimeType or "application/octet-stream",
|
||||
"contentBytes": base64_content
|
||||
}
|
||||
message["attachments"].append(attachment)
|
||||
except Exception as e:
|
||||
logger.error(f"Error reading attachment file {doc.fileName}: {str(e)}")
|
||||
|
||||
# Create the draft message
|
||||
drafts_folder_id = self.folderManagement.getFolderId("Drafts", connection)
|
||||
|
||||
if drafts_folder_id:
|
||||
api_url = f"{graph_url}/me/mailFolders/{drafts_folder_id}/messages"
|
||||
else:
|
||||
api_url = f"{graph_url}/me/messages"
|
||||
logger.warning("Could not find Drafts folder, creating draft in default location")
|
||||
|
||||
response = requests.post(api_url, headers=headers, json=message)
|
||||
|
||||
if response.status_code in [200, 201]:
|
||||
draft_data = response.json()
|
||||
draft_id = draft_data.get("id", "Unknown")
|
||||
|
||||
# Create draft result data with full draft information
|
||||
draftResultData = {
|
||||
"status": "draft",
|
||||
"message": "Email draft created successfully with AI-generated content",
|
||||
"draftId": draft_id,
|
||||
"folder": "Drafts (Entwürfe)",
|
||||
"mailbox": connection.get('userEmail', 'Unknown'),
|
||||
"subject": subject,
|
||||
"body": body,
|
||||
"recipients": to,
|
||||
"cc": cc,
|
||||
"bcc": bcc,
|
||||
"attachments": len(documentList) if documentList else 0,
|
||||
"aiSelectedAttachments": normalized_ai_attachments if normalized_ai_attachments else "all documents",
|
||||
"aiGenerated": True,
|
||||
"context": context,
|
||||
"emailStyle": emailStyle,
|
||||
"timestamp": self.services.utils.timestampGetUtc(),
|
||||
"draftData": draft_data
|
||||
}
|
||||
|
||||
# Extract attachment filenames for validation metadata
|
||||
attachmentFilenames = []
|
||||
attachmentReferences = []
|
||||
if documentList:
|
||||
try:
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
attached_docs = self.services.chat.getChatDocumentsFromDocumentList(DocumentReferenceList.from_string_list(documentList)) or []
|
||||
attachmentFilenames = [getattr(doc, 'fileName', '') for doc in attached_docs if getattr(doc, 'fileName', None)]
|
||||
# Store normalized document references (with filenames) - use normalized_ai_attachments if available
|
||||
attachmentReferences = normalized_ai_attachments if normalized_ai_attachments else [self.services.chat.getDocumentReferenceFromChatDocument(d) for d in attached_docs]
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Create validation metadata for content validator
|
||||
validationMetadata = {
|
||||
"actionType": "outlook.composeAndDraftEmailWithContext",
|
||||
"emailRecipients": to,
|
||||
"emailCc": cc,
|
||||
"emailBcc": bcc,
|
||||
"emailSubject": subject,
|
||||
"emailAttachments": attachmentFilenames,
|
||||
"emailAttachmentReferences": attachmentReferences,
|
||||
"emailAttachmentCount": len(attachmentFilenames),
|
||||
"emailStyle": emailStyle,
|
||||
"hasAttachments": len(attachmentFilenames) > 0
|
||||
}
|
||||
|
||||
return ActionResult(
|
||||
success=True,
|
||||
documents=[ActionDocument(
|
||||
documentName=f"ai_generated_email_draft_{self._format_timestamp_for_filename()}.json",
|
||||
documentData=json.dumps(draftResultData, indent=2),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)]
|
||||
)
|
||||
else:
|
||||
logger.error(f"Failed to create draft. Status: {response.status_code}, Response: {response.text}")
|
||||
return ActionResult.isFailure(error=f"Failed to create email draft: {response.status_code} - {response.text}")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error creating email via Microsoft Graph API: {str(e)}")
|
||||
return ActionResult.isFailure(error=f"Failed to create email: {str(e)}")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in composeAndDraftEmailWithContext: {str(e)}")
|
||||
return ActionResult.isFailure(error=str(e))
|
||||
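The JSON extraction step in `composeAndDraftEmailWithContext` can be sketched as a standalone helper. This is an illustrative version, not the code in this commit: `extractJsonObject` and `FENCE` are hypothetical names, and unlike the inline logic it also handles a missing closing fence (where `str.find` returns -1).

```python
import json
from typing import Any

FENCE = "`" * 3  # three backticks, built at runtime to keep this example readable


def extractJsonObject(text: str) -> Any:
    """Pull a JSON payload out of an AI reply that may wrap it in prose or a code fence."""
    marker = FENCE + "json"
    if marker in text:
        start = text.find(marker) + len(marker)
        end = text.find(FENCE, start)
        if end == -1:
            # No closing fence: take everything after the opening marker.
            end = len(text)
        candidate = text[start:end].strip()
    elif "{" in text and "}" in text:
        # Fall back to the outermost pair of braces.
        candidate = text[text.find("{"): text.rfind("}") + 1]
    else:
        candidate = text
    # Raises json.JSONDecodeError if the candidate is not valid JSON.
    return json.loads(candidate)


reply = 'Here is the draft: {"subject": "Status", "body": "All good"} Thanks.'
print(extractJsonObject(reply)["subject"])
```

Keeping the parsing in one function makes the failure mode explicit: callers only need to catch `json.JSONDecodeError`.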
|
||||
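The draft payload assembled above follows the Microsoft Graph message shape (`toRecipients`, `body.contentType`, `#microsoft.graph.fileAttachment`). The sketch below isolates that construction so it can be checked without a network call. `buildDraftMessage` and `buildFileAttachment` are hypothetical helper names, and the `html.escape` call on plain-text bodies is an addition of this sketch that the committed code does not perform.

```python
import base64
import html
from typing import Any, Dict, List, Optional


def buildFileAttachment(name: str, content: bytes, mimeType: Optional[str] = None) -> Dict[str, Any]:
    # Graph expects the file content as a base64 string in `contentBytes`.
    return {
        "@odata.type": "#microsoft.graph.fileAttachment",
        "name": name,
        "contentType": mimeType or "application/octet-stream",
        "contentBytes": base64.b64encode(content).decode("ascii"),
    }


def buildDraftMessage(subject: str, body: str, to: List[str],
                      cc: Optional[List[str]] = None,
                      bcc: Optional[List[str]] = None) -> Dict[str, Any]:
    cleaned = body.strip()
    # Same heuristic as the action: treat the body as HTML if it looks like HTML.
    looksLikeHtml = cleaned.startswith(("<html>", "<body>")) or "<br>" in cleaned
    if looksLikeHtml:
        htmlBody = cleaned
    else:
        # Assumption of this sketch: escape plain text before wrapping it in HTML.
        htmlBody = "<html><body>" + html.escape(cleaned).replace("\n", "<br>") + "</body></html>"

    def recipients(addresses: Optional[List[str]]) -> List[Dict[str, Any]]:
        return [{"emailAddress": {"address": a}} for a in (addresses or [])]

    return {
        "subject": subject,
        "body": {"contentType": "HTML", "content": htmlBody},
        "toRecipients": recipients(to),
        "ccRecipients": recipients(cc),
        "bccRecipients": recipients(bcc),
    }


message = buildDraftMessage("Status", "Line one\nLine two", ["ann@example.com"])
message["attachments"] = [buildFileAttachment("note.txt", b"hi", "text/plain")]
print(message["body"]["content"])
```

The resulting dictionary is what would be passed as `json=message` to the POST request against the drafts folder.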
245
modules/workflows/methods/methodOutlook/actions/readEmails.py
Normal file
|
|
@ -0,0 +1,245 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Read Emails action for Outlook operations.
|
||||
Reads emails and metadata from a mailbox folder.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import time
|
||||
import json
|
||||
import requests
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def readEmails(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Read emails and metadata from a mailbox folder.
|
||||
- Input requirements: connectionReference (required); optional folder, limit, filter, outputMimeType.
|
||||
- Output format: JSON with emails and metadata.
|
||||
|
||||
Parameters:
|
||||
- connectionReference (str, required): Microsoft connection label.
|
||||
- folder (str, optional): Folder to read from. Default: Inbox.
|
||||
- limit (int, optional): Maximum items to return. Default: 10. Non-positive values fall back to 1000.
|
||||
- filter (str, optional): Sender, query operators, or subject text.
|
||||
- outputMimeType (str, optional): MIME type for output file. Options: "application/json", "text/plain", "text/csv". Default: "application/json".
|
||||
"""
|
||||
operationId = None
|
||||
try:
|
||||
# Init progress logger
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
operationId = f"outlook_read_{workflowId}_{int(time.time())}"
|
||||
|
||||
# Start progress tracking
|
||||
parentOperationId = parameters.get('parentOperationId')
|
||||
self.services.chat.progressLogStart(
|
||||
operationId,
|
||||
"Read Emails",
|
||||
"Outlook Email Reading",
|
||||
f"Folder: {parameters.get('folder', 'Inbox')}",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
connectionReference = parameters.get("connectionReference")
|
||||
folder = parameters.get("folder", "Inbox")
|
||||
limit = parameters.get("limit", 10)
|
||||
filter = parameters.get("filter")
|
||||
outputMimeType = parameters.get("outputMimeType", "application/json")
|
||||
|
||||
if not connectionReference:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Connection reference is required")
|
||||
|
||||
self.services.chat.progressLogUpdate(operationId, 0.2, "Validating parameters")
|
||||
|
||||
# Validate limit parameter
|
||||
if limit <= 0:
|
||||
logger.warning(f"Invalid limit value ({limit}), using default value 1000")
|
||||
limit = 1000
|
||||
|
||||
# Validate filter parameter if provided
|
||||
if filter:
|
||||
# Trim surrounding whitespace and cap the filter at 100 characters
|
||||
filter = filter.strip()
|
||||
if len(filter) > 100:
|
||||
logger.warning(f"Filter too long ({len(filter)} chars), truncating to 100 characters")
|
||||
filter = filter[:100]
|
||||
|
||||
|
||||
# Get Microsoft connection
|
||||
self.services.chat.progressLogUpdate(operationId, 0.3, "Getting Microsoft connection")
|
||||
connection = self.connection.getMicrosoftConnection(connectionReference)
|
||||
if not connection:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="No valid Microsoft connection found for the provided connection reference")
|
||||
|
||||
# Read emails using Microsoft Graph API
|
||||
self.services.chat.progressLogUpdate(operationId, 0.4, "Reading emails from Microsoft Graph API")
|
||||
try:
|
||||
# Microsoft Graph API endpoint for messages
|
||||
graph_url = "https://graph.microsoft.com/v1.0"
|
||||
headers = {
|
||||
"Authorization": f"Bearer {connection['accessToken']}",
|
||||
"Content-Type": "application/json"
|
||||
}
|
||||
|
||||
# Get the folder ID for the specified folder
|
||||
folder_id = self.folderManagement.getFolderId(folder, connection)
|
||||
|
||||
if folder_id:
|
||||
# Build the API request with folder ID
|
||||
api_url = f"{graph_url}/me/mailFolders/{folder_id}/messages"
|
||||
else:
|
||||
# Fallback: use folder name directly (for well-known folders like "Inbox")
|
||||
api_url = f"{graph_url}/me/mailFolders/{folder}/messages"
|
||||
logger.warning(f"Could not find folder ID for '{folder}', using folder name directly")
|
||||
params = {
|
||||
"$top": limit,
|
||||
"$orderby": "receivedDateTime desc"
|
||||
}
|
||||
|
||||
if filter:
|
||||
# Build proper Graph API filter parameters
|
||||
filter_params = self.emailProcessing.buildGraphFilter(filter)
|
||||
params.update(filter_params)
|
||||
|
||||
# If using $search, remove $orderby as they can't be combined
|
||||
if "$search" in params:
|
||||
params.pop("$orderby", None)
|
||||
|
||||
# If using $filter with contains(), remove $orderby as they can't be combined
|
||||
# Microsoft Graph API doesn't support contains() with orderby
|
||||
if "$filter" in params and "contains(" in params["$filter"].lower():
|
||||
params.pop("$orderby", None)
|
||||
|
||||
# Filter applied
|
||||
|
||||
# Make the API call
|
||||
|
||||
|
||||
response = requests.get(api_url, headers=headers, params=params)
|
||||
|
||||
if response.status_code != 200:
|
||||
logger.error(f"Graph API error: {response.status_code} - {response.text}")
|
||||
logger.error(f"Request URL: {response.url}")
|
||||
logger.error(f"Request headers: {dict(headers, Authorization='<redacted>')}")  # never log the bearer token
|
||||
logger.error(f"Request params: {params}")
|
||||
|
||||
response.raise_for_status()
|
||||
|
||||
self.services.chat.progressLogUpdate(operationId, 0.7, "Processing email data")
|
||||
emails_data = response.json()
|
||||
email_data = {
|
||||
"emails": emails_data.get("value", []),
|
||||
"count": len(emails_data.get("value", [])),
|
||||
"folder": folder,
|
||||
"filter": filter,
|
||||
"apiMetadata": {
|
||||
"@odata.context": emails_data.get("@odata.context"),
|
||||
"@odata.count": emails_data.get("@odata.count"),
|
||||
"@odata.nextLink": emails_data.get("@odata.nextLink")
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
|
||||
except ImportError:
|
||||
logger.error("requests module not available")
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="requests module not available")
|
||||
except requests.exceptions.HTTPError as e:
|
||||
if e.response.status_code == 400:
|
||||
logger.error(f"Bad Request (400) - Invalid filter or parameter: {e.response.text}")
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error=f"Invalid filter syntax. Please check your filter parameter. Error: {e.response.text}")
|
||||
elif e.response.status_code == 401:
|
||||
logger.error("Unauthorized (401) - Access token may be expired or invalid")
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Authentication failed. Please check your connection and try again.")
|
||||
elif e.response.status_code == 403:
|
||||
logger.error("Forbidden (403) - Insufficient permissions to access emails")
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Insufficient permissions to read emails from this folder.")
|
||||
else:
|
||||
logger.error(f"HTTP Error {e.response.status_code}: {e.response.text}")
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error=f"HTTP Error {e.response.status_code}: {e.response.text}")
|
||||
except Exception as e:
|
||||
logger.error(f"Error reading emails from Microsoft Graph API: {str(e)}")
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error=f"Failed to read emails: {str(e)}")
|
||||
|
||||
# Determine output format based on MIME type
|
||||
mime_type_mapping = {
|
||||
"application/json": ".json",
|
||||
"text/plain": ".txt",
|
||||
"text/csv": ".csv"
|
||||
}
|
||||
output_extension = mime_type_mapping.get(outputMimeType, ".json")
|
||||
output_mime_type = outputMimeType
|
||||
logger.info(f"Using output format: {output_extension} ({output_mime_type})")
|
||||
|
||||
|
||||
|
||||
# Create result data as JSON string
|
||||
result_data = {
|
||||
"connectionReference": connectionReference,
|
||||
"folder": folder,
|
||||
"limit": limit,
|
||||
"filter": filter,
|
||||
"emails": email_data,
|
||||
"connection": {
|
||||
"id": connection["id"],
|
||||
"authority": "microsoft",
|
||||
"reference": connectionReference
|
||||
},
|
||||
"timestamp": self.services.utils.timestampGetUtc()
|
||||
}
|
||||
|
||||
validationMetadata = {
|
||||
"actionType": "outlook.readEmails",
|
||||
"connectionReference": connectionReference,
|
||||
"folder": folder,
|
||||
"limit": limit,
|
||||
"filter": filter,
|
||||
"emailCount": email_data.get("count", 0),
|
||||
"outputMimeType": outputMimeType
|
||||
}
|
||||
|
||||
self.services.chat.progressLogUpdate(operationId, 0.9, f"Found {email_data.get('count', 0)} emails")
|
||||
self.services.chat.progressLogFinish(operationId, True)
|
||||
|
||||
return ActionResult.isSuccess(
|
||||
documents=[ActionDocument(
|
||||
documentName=f"outlook_emails_{self._format_timestamp_for_filename()}.json",
|
||||
documentData=json.dumps(result_data, indent=2),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)]
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error reading emails: {str(e)}")
|
||||
if operationId:
|
||||
try:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
except Exception:
|
||||
pass # Don't fail on progress logging errors
|
||||
return ActionResult.isFailure(
|
||||
error=str(e)
|
||||
)
|
||||
|
||||
257
modules/workflows/methods/methodOutlook/actions/searchEmails.py
Normal file
|
|
@ -0,0 +1,257 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Search Emails action for Outlook operations.
|
||||
Searches emails by query and returns matching items with metadata.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import json
|
||||
import requests
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def searchEmails(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Search emails by query and return matching items with metadata.
|
||||
- Input requirements: connectionReference (required); query (required); optional folder, limit, outputMimeType.
|
||||
- Output format: JSON with search results and metadata.
|
||||
|
||||
Parameters:
|
||||
- connectionReference (str, required): Microsoft connection label.
|
||||
- query (str, required): Search expression.
|
||||
- folder (str, optional): Folder scope or All. Default: All.
|
||||
- limit (int, optional): Maximum items to return. Must be > 0. Default: 1000.
|
||||
- outputMimeType (str, optional): MIME type for output file. Options: "application/json", "text/plain", "text/csv". Default: "application/json".
|
||||
"""
|
||||
try:
|
||||
connectionReference = parameters.get("connectionReference")
|
||||
query = parameters.get("query")
|
||||
folder = parameters.get("folder", "All")
|
||||
limit = parameters.get("limit", 1000)
|
||||
outputMimeType = parameters.get("outputMimeType", "application/json")
|
||||
|
||||
# Validate parameters
|
||||
if not connectionReference:
|
||||
return ActionResult.isFailure(error="Connection reference is required")
|
||||
|
||||
# Validate limit parameter
|
||||
if isinstance(limit, int) and limit <= 0:
|
||||
logger.warning(f"Invalid limit value ({limit}), using default value 1000")
|
||||
limit = 1000
|
||||
|
||||
if not query or not query.strip():
|
||||
return ActionResult.isFailure(error="Search query is required and cannot be empty")
|
||||
|
||||
# Check if this is a folder specification query
|
||||
if query.strip().lower().startswith('folder:'):
|
||||
folder_name = query.strip()[7:].strip() # Remove "folder:" prefix
|
||||
if not folder_name:
|
||||
return ActionResult.isFailure(error="Invalid folder specification. Use format 'folder:FolderName'")
|
||||
logger.info(f"Search query is a folder specification: {folder_name}")
|
||||
|
||||
# Validate limit
|
||||
try:
|
||||
limit = int(limit)
|
||||
if limit <= 0:
|
||||
limit = 1000
|
||||
logger.warning("Invalid limit value (<=0), using default value 1000")
|
||||
elif limit > 1000: # Microsoft Graph API has limits
|
||||
logger.warning(f"Limit {limit} exceeds maximum (1000), using 1000")
|
||||
limit = 1000
|
||||
except (ValueError, TypeError):
|
||||
limit = 1000
|
||||
logger.warning("Invalid limit value, using default value 1000")
|
||||
|
||||
# Get Microsoft connection
|
||||
connection = self.connection.getMicrosoftConnection(connectionReference)
|
||||
if not connection:
|
||||
return ActionResult.isFailure(error="No valid Microsoft connection found for the provided connection reference")
|
||||
|
||||
# Search emails using Microsoft Graph API
|
||||
try:
|
||||
# Microsoft Graph API endpoint for searching messages
|
||||
graph_url = "https://graph.microsoft.com/v1.0"
|
||||
headers = {
|
||||
"Authorization": f"Bearer {connection['accessToken']}",
|
||||
"Content-Type": "application/json"
|
||||
}
|
||||
|
||||
# Get the folder ID for the specified folder if needed
|
||||
folder_id = None
|
||||
if folder and folder.lower() != "all":
|
||||
folder_id = self.folderManagement.getFolderId(folder, connection)
|
||||
if folder_id:
|
||||
logger.debug(f"Found folder ID for '{folder}': {folder_id}")
|
||||
else:
|
||||
logger.warning(f"Could not find folder ID for '{folder}', using folder name directly")
|
||||
|
||||
# Build the search API request
|
||||
api_url = f"{graph_url}/me/messages"
|
||||
params = self.emailProcessing.buildSearchParameters(query, folder_id or folder, limit)
|
||||
|
||||
# Log search parameters for debugging
|
||||
logger.debug(f"Search query: '{query}'")
|
||||
logger.debug(f"Search folder: '{folder}'")
|
||||
logger.debug(f"Search parameters: {params}")
|
||||
logger.debug(f"API URL: {api_url}")
|
||||
|
||||
# Make the API call
|
||||
response = requests.get(api_url, headers=headers, params=params)
|
||||
|
||||
# Log response details for debugging
|
||||
|
||||
|
||||
if response.status_code != 200:
|
||||
# Log detailed error information
|
||||
try:
|
||||
error_data = response.json()
|
||||
logger.error(f"Microsoft Graph API error: {response.status_code} - {error_data}")
|
||||
except ValueError:
|
||||
logger.error(f"Microsoft Graph API error: {response.status_code} - {response.text}")
|
||||
|
||||
# Check for specific error types and provide helpful messages
|
||||
if response.status_code == 400:
|
||||
logger.error("Bad Request (400) - Check search query format and parameters")
|
||||
logger.error(f"Search query: '{query}'")
|
||||
logger.error(f"Search parameters: {params}")
|
||||
logger.error(f"API URL: {api_url}")
|
||||
elif response.status_code == 401:
|
||||
logger.error("Unauthorized (401) - Check access token and permissions")
|
||||
elif response.status_code == 403:
|
||||
logger.error("Forbidden (403) - Check API permissions and scopes")
|
||||
elif response.status_code == 429:
|
||||
logger.error("Too Many Requests (429) - Rate limit exceeded")
|
||||
|
||||
raise Exception(f"Microsoft Graph API returned {response.status_code}: {response.text}")
|
||||
|
||||
response.raise_for_status()
|
||||
|
||||
search_data = response.json()
|
||||
emails = search_data.get("value", [])
|
||||
|
||||
|
||||
|
||||
# Apply folder filtering if needed and we used $search
|
||||
if folder and folder.lower() != "all" and "$search" in params:
|
||||
# Get the actual folder ID for proper filtering
|
||||
folder_id = self.folderManagement.getFolderId(folder, connection)
|
||||
|
||||
if folder_id:
|
||||
# Filter results by folder ID
|
||||
filtered_emails = []
|
||||
for email in emails:
|
||||
if email.get("parentFolderId") == folder_id:
|
||||
filtered_emails.append(email)
|
||||
emails = filtered_emails
|
||||
logger.debug(f"Applied folder filtering: {len(filtered_emails)} emails found in folder {folder}")
|
||||
else:
|
||||
# Fallback: try to filter by folder name (less reliable)
|
||||
filtered_emails = []
|
||||
for email in emails:
|
||||
# Check if email has folder information (messages are plain dicts, so hasattr() would always be False)
|
||||
if email.get('parentFolderId'):
|
||||
if email.get('parentFolderId') == folder:
|
||||
filtered_emails.append(email)
|
||||
else:
|
||||
# If no folder info, include the email (less strict filtering)
|
||||
filtered_emails.append(email)
|
||||
|
||||
emails = filtered_emails
|
||||
logger.debug(f"Applied fallback folder filtering: {len(filtered_emails)} emails found in folder {folder}")
|
||||
|
||||
# Special handling for folder specification queries
|
||||
if query.strip().lower().startswith('folder:'):
|
||||
folder_name = query.strip()[7:].strip()
|
||||
folder_id = self.folderManagement.getFolderId(folder_name, connection)
|
||||
if folder_id:
|
||||
# Filter results to only include emails from the specified folder
|
||||
filtered_emails = []
|
||||
for email in emails:
|
||||
if email.get("parentFolderId") == folder_id:
|
||||
filtered_emails.append(email)
|
||||
emails = filtered_emails
|
||||
logger.debug(f"Applied folder specification filtering: {len(filtered_emails)} emails found in folder {folder_name}")
|
||||
else:
|
||||
logger.warning(f"Could not find folder ID for folder specification: {folder_name}")
|
||||
|
||||
|
||||
search_result = {
|
||||
"query": query,
|
||||
"results": emails,
|
||||
"count": len(emails),
|
||||
"folder": folder,
|
||||
"limit": limit,
|
||||
"apiMetadata": {
|
||||
"@odata.context": search_data.get("@odata.context"),
|
||||
"@odata.count": search_data.get("@odata.count"),
|
||||
"@odata.nextLink": search_data.get("@odata.nextLink")
|
||||
},
|
||||
"searchParams": params
|
||||
}
|
||||
|
||||
|
||||
|
||||
except ImportError:
|
||||
logger.error("requests module not available")
|
||||
return ActionResult.isFailure(error="requests module not available")
|
||||
except Exception as e:
|
||||
logger.error(f"Error searching emails via Microsoft Graph API: {str(e)}")
|
||||
return ActionResult.isFailure(error=f"Failed to search emails: {str(e)}")
|
||||
|
||||
# Determine output format based on MIME type
|
||||
mime_type_mapping = {
|
||||
"application/json": ".json",
|
||||
"text/plain": ".txt",
|
||||
"text/csv": ".csv"
|
||||
}
|
||||
output_extension = mime_type_mapping.get(outputMimeType, ".json")
|
||||
output_mime_type = outputMimeType
|
||||
logger.info(f"Using output format: {output_extension} ({output_mime_type})")
|
||||
|
||||
|
||||
|
||||
result_data = {
|
||||
"connectionReference": connectionReference,
|
||||
"query": query,
|
||||
"folder": folder,
|
||||
"limit": limit,
|
||||
"searchResults": search_result,
|
||||
"connection": {
|
||||
"id": connection["id"],
|
||||
"authority": "microsoft",
|
||||
"reference": connectionReference
|
||||
},
|
||||
"timestamp": self.services.utils.timestampGetUtc()
|
||||
}
|
||||
|
||||
validationMetadata = {
|
||||
"actionType": "outlook.searchEmails",
|
||||
"connectionReference": connectionReference,
|
||||
"query": query,
|
||||
"folder": folder,
|
||||
"limit": limit,
|
||||
"resultCount": search_result.get("count", 0),
|
||||
"outputMimeType": outputMimeType
|
||||
}
|
||||
|
||||
return ActionResult(
|
||||
success=True,
|
||||
documents=[ActionDocument(
|
||||
documentName=f"outlook_email_search_{self._format_timestamp_for_filename()}.json",
|
||||
documentData=json.dumps(result_data, indent=2),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)]
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error searching emails: {str(e)}")
|
||||
return ActionResult.isFailure(error=str(e))
|
||||
|
||||
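The limit handling in `searchEmails` above could be collected into a single helper that coerces, validates, and clamps in one place, and logs the original value before replacing it. The following is a minimal sketch under stated assumptions, not the author's implementation: `normalize_limit`, `DEFAULT_LIMIT`, and `MAX_LIMIT` are hypothetical names, and both values are taken from the docstring and the clamp in the action.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

DEFAULT_LIMIT = 1000  # assumption: default stated in the searchEmails docstring
MAX_LIMIT = 1000      # assumption: cap applied by the clamp in searchEmails


def normalize_limit(raw: Any, default: int = DEFAULT_LIMIT, maximum: int = MAX_LIMIT) -> int:
    """Coerce a user-supplied limit into the range 1..maximum (hypothetical helper)."""
    try:
        value = int(raw)
    except (ValueError, TypeError):
        # Non-numeric input such as None or "abc" falls back to the default.
        logger.warning("Invalid limit value %r, using default %d", raw, default)
        return default
    if value <= 0:
        # Log the rejected value before replacing it.
        logger.warning("Invalid limit value %d, using default %d", value, default)
        return default
    if value > maximum:
        logger.warning("Limit %d exceeds maximum (%d), using %d", value, maximum, maximum)
        return maximum
    return value
```

With a helper like this, the action would need only one call such as `limit = normalize_limit(parameters.get("limit", 1000))`, and string inputs like `"25"` would be accepted rather than raising a `TypeError` on comparison.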
|
|
@ -0,0 +1,312 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Send Draft Email action for Outlook operations.
|
||||
Sends draft email(s) using draft email JSON document(s) from action outlook.composeAndDraftEmailWithContext.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import time
|
||||
import json
|
||||
import requests
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def sendDraftEmail(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Send draft email(s) using draft email JSON document(s) from action outlook.composeAndDraftEmailWithContext.
|
||||
- Input requirements: connectionReference (required); documentList with draft email JSON documents (required).
|
||||
- Output format: JSON confirmation with sent mail metadata for all emails.
|
||||
|
||||
Parameters:
|
||||
- connectionReference (str, required): Microsoft connection label.
|
||||
- documentList (list, required): Document reference(s) to draft emails in JSON format (outputs from outlook.composeAndDraftEmailWithContext function).
|
||||
"""
|
||||
operationId = None
|
||||
try:
|
||||
# Init progress logger
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
operationId = f"outlook_send_{workflowId}_{int(time.time())}"
|
||||
|
||||
# Start progress tracking
|
||||
parentOperationId = parameters.get('parentOperationId')
|
||||
self.services.chat.progressLogStart(
|
||||
operationId,
|
||||
"Send Draft Email",
|
||||
"Outlook Email Sending",
|
||||
f"Processing {len(parameters.get('documentList', []))} draft(s)",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
connectionReference = parameters.get("connectionReference")
|
||||
documentList = parameters.get("documentList", [])
|
||||
|
||||
if not connectionReference:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Connection reference is required")
|
||||
|
||||
if not documentList:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="documentList is required and cannot be empty")
|
||||
|
||||
# Convert single value to list if needed
|
||||
if isinstance(documentList, str):
|
||||
documentList = [documentList]
|
||||
|
||||
# Get Microsoft connection
|
||||
self.services.chat.progressLogUpdate(operationId, 0.2, "Getting Microsoft connection")
|
||||
connection = self.connection.getMicrosoftConnection(connectionReference)
|
||||
if not connection:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="No valid Microsoft connection found for the provided connection reference")
|
||||
|
||||
# Check permissions
|
||||
self.services.chat.progressLogUpdate(operationId, 0.3, "Checking permissions")
|
||||
permissions_ok = await self.connection.checkPermissions(connection)
|
||||
if not permissions_ok:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Connection lacks necessary permissions for Outlook operations")
|
||||
|
||||
# Read draft email JSON documents from documentList
|
||||
self.services.chat.progressLogUpdate(operationId, 0.4, "Reading draft email documents")
|
||||
draftEmails = []
|
||||
for docRef in documentList:
|
||||
try:
|
||||
# Get documents from document reference
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
chatDocuments = self.services.chat.getChatDocumentsFromDocumentList(DocumentReferenceList.from_string_list([docRef]))
|
||||
if not chatDocuments:
|
||||
logger.warning(f"No documents found for reference: {docRef}")
|
||||
continue
|
||||
|
||||
# Process each document in the reference
|
||||
for doc in chatDocuments:
|
||||
try:
|
||||
# Read file data
|
||||
fileId = getattr(doc, 'fileId', None)
|
||||
if not fileId:
|
||||
logger.warning(f"Document {doc.fileName} has no fileId")
|
||||
continue
|
||||
|
||||
fileData = self.services.chat.getFileData(fileId)
|
||||
if not fileData:
|
||||
logger.warning(f"No file data found for document: {doc.fileName}")
|
||||
continue
|
||||
|
||||
# Parse JSON content
|
||||
if isinstance(fileData, bytes):
|
||||
jsonContent = fileData.decode('utf-8')
|
||||
else:
|
||||
jsonContent = str(fileData)
|
||||
|
||||
# Parse JSON - handle both direct JSON and JSON wrapped in documentData
|
||||
try:
|
||||
draftEmailData = json.loads(jsonContent)
|
||||
|
||||
# If the JSON contains a 'documentData' field, extract it
|
||||
if isinstance(draftEmailData, dict) and 'documentData' in draftEmailData:
|
||||
documentDataStr = draftEmailData['documentData']
|
||||
if isinstance(documentDataStr, str):
|
||||
draftEmailData = json.loads(documentDataStr)
|
||||
|
||||
# Validate draft email structure
|
||||
if not isinstance(draftEmailData, dict):
|
||||
logger.warning(f"Document {doc.fileName} does not contain a valid draft email JSON object")
|
||||
continue
|
||||
|
||||
draftId = draftEmailData.get("draftId")
|
||||
if not draftId:
|
||||
logger.warning(f"Document {doc.fileName} does not contain 'draftId' field")
|
||||
continue
|
||||
|
||||
draftEmails.append({
|
||||
"draftEmailJson": draftEmailData,
|
||||
"draftId": draftId,
|
||||
"sourceDocument": doc.fileName,
|
||||
"sourceReference": docRef
|
||||
})
|
||||
|
||||
except json.JSONDecodeError as e:
|
||||
logger.error(f"Failed to parse JSON from document {doc.fileName}: {str(e)}")
|
||||
continue
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing document {doc.fileName}: {str(e)}")
|
||||
continue
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error reading documents from reference {docRef}: {str(e)}")
|
||||
continue
|
||||
|
||||
if not draftEmails:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="No valid draft email JSON documents found in documentList")
|
||||
|
||||
self.services.chat.progressLogUpdate(operationId, 0.6, f"Found {len(draftEmails)} draft email(s) to send")
|
||||
|
||||
# Send all draft emails
|
||||
graph_url = "https://graph.microsoft.com/v1.0"
|
||||
headers = {
|
||||
"Authorization": f"Bearer {connection['accessToken']}",
|
||||
"Content-Type": "application/json"
|
||||
}
|
||||
|
||||
sentResults = []
|
||||
failedResults = []
|
||||
|
||||
self.services.chat.progressLogUpdate(operationId, 0.7, "Sending emails")
|
||||
for idx, draftEmail in enumerate(draftEmails):
|
||||
draftEmailJson = draftEmail["draftEmailJson"]
|
||||
draftId = draftEmail["draftId"]
|
||||
sourceDocument = draftEmail["sourceDocument"]
|
||||
|
||||
try:
|
||||
send_url = f"{graph_url}/me/messages/{draftId}/send"
|
||||
sendResponse = requests.post(send_url, headers=headers)
|
||||
|
||||
# Extract email details from draft JSON for confirmation
|
||||
subject = draftEmailJson.get("subject", "Unknown")
|
||||
recipients = draftEmailJson.get("recipients", [])
|
||||
cc = draftEmailJson.get("cc", [])
|
||||
bcc = draftEmailJson.get("bcc", [])
|
||||
attachmentsCount = draftEmailJson.get("attachments", 0)
|
||||
|
||||
if sendResponse.status_code in [200, 202, 204]:
|
||||
sentResults.append({
|
||||
"status": "sent",
|
||||
"message": "Email sent successfully",
|
||||
"draftId": draftId,
|
||||
"subject": subject,
|
||||
"recipients": recipients,
|
||||
"cc": cc,
|
||||
"bcc": bcc,
|
||||
"attachments": attachmentsCount,
|
||||
"sentTimestamp": self.services.utils.timestampGetUtc(),
|
||||
"sourceDocument": sourceDocument
|
||||
})
|
||||
logger.info(f"Email sent successfully. Draft ID: {draftId}, Subject: {subject}")
|
||||
self.services.chat.progressLogUpdate(operationId, 0.7 + (idx + 1) * 0.2 / len(draftEmails), f"Sent {idx + 1}/{len(draftEmails)}: {subject}")
|
||||
else:
|
||||
errorResult = {
|
||||
"status": "error",
|
||||
"message": "Failed to send draft email",
|
||||
"draftId": draftId,
|
||||
"subject": subject,
|
||||
"recipients": recipients,
|
||||
"sendError": {
|
||||
"statusCode": sendResponse.status_code,
|
||||
"response": sendResponse.text
|
||||
},
|
||||
"sentTimestamp": self.services.utils.timestampGetUtc(),
|
||||
"sourceDocument": sourceDocument
|
||||
}
|
||||
failedResults.append(errorResult)
|
||||
logger.error(f"Failed to send email. Draft ID: {draftId}, Status: {sendResponse.status_code}, Response: {sendResponse.text}")
|
||||
|
||||
except Exception as e:
|
||||
errorResult = {
|
||||
"status": "error",
|
||||
"message": f"Exception while sending draft email: {str(e)}",
|
||||
"draftId": draftId,
|
||||
"subject": draftEmailJson.get("subject", "Unknown"),
|
||||
"recipients": draftEmailJson.get("recipients", []),
|
||||
"exception": str(e),
|
||||
"sentTimestamp": self.services.utils.timestampGetUtc(),
|
||||
"sourceDocument": sourceDocument
|
||||
}
|
||||
failedResults.append(errorResult)
|
||||
logger.error(f"Error sending draft email {draftId}: {str(e)}")
|
||||
|
||||
# Build result summary
|
||||
totalEmails = len(draftEmails)
|
||||
successfulEmails = len(sentResults)
|
||||
failedEmails = len(failedResults)
|
||||
|
||||
resultData = {
|
||||
"totalEmails": totalEmails,
|
||||
"successfulEmails": successfulEmails,
|
||||
"failedEmails": failedEmails,
|
||||
"sentResults": sentResults,
|
||||
"failedResults": failedResults,
|
||||
"timestamp": self.services.utils.timestampGetUtc()
|
||||
}
|
||||
|
||||
# Determine overall success status
|
||||
self.services.chat.progressLogUpdate(operationId, 0.9, f"Sent {successfulEmails}/{totalEmails} email(s)")
|
||||
if successfulEmails == 0:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
validationMetadata = {
|
||||
"actionType": "outlook.sendDraftEmail",
|
||||
"connectionReference": connectionReference,
|
||||
"totalEmails": totalEmails,
|
||||
"successfulEmails": successfulEmails,
|
||||
"failedEmails": failedEmails,
|
||||
"status": "all_failed"
|
||||
}
|
||||
return ActionResult.isFailure(
|
||||
error=f"Failed to send all {totalEmails} email(s)",
|
||||
documents=[ActionDocument(
|
||||
documentName=f"sent_mail_confirmation_{self._format_timestamp_for_filename()}.json",
|
||||
documentData=json.dumps(resultData, indent=2),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)]
|
||||
)
|
||||
elif failedEmails > 0:
|
||||
# Partial success
|
||||
logger.warning(f"Sent {successfulEmails} out of {totalEmails} emails. {failedEmails} failed.")
|
||||
validationMetadata = {
|
||||
"actionType": "outlook.sendDraftEmail",
|
||||
"connectionReference": connectionReference,
|
||||
"totalEmails": totalEmails,
|
||||
"successfulEmails": successfulEmails,
|
||||
"failedEmails": failedEmails,
|
||||
"status": "partial_success"
|
||||
}
|
||||
self.services.chat.progressLogFinish(operationId, True)
|
||||
return ActionResult(
|
||||
success=True,
|
||||
documents=[ActionDocument(
|
||||
documentName=f"sent_mail_confirmation_{self._format_timestamp_for_filename()}.json",
|
||||
documentData=json.dumps(resultData, indent=2),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)]
|
||||
)
|
||||
else:
|
||||
# All successful
|
||||
logger.info(f"Successfully sent all {totalEmails} email(s)")
|
||||
validationMetadata = {
|
||||
"actionType": "outlook.sendDraftEmail",
|
||||
"connectionReference": connectionReference,
|
||||
"totalEmails": totalEmails,
|
||||
"successfulEmails": successfulEmails,
|
||||
"failedEmails": failedEmails,
|
||||
"status": "all_successful"
|
||||
}
|
||||
self.services.chat.progressLogFinish(operationId, True)
|
||||
return ActionResult(
|
||||
success=True,
|
||||
documents=[ActionDocument(
|
||||
documentName=f"sent_mail_confirmation_{self._format_timestamp_for_filename()}.json",
|
||||
documentData=json.dumps(resultData, indent=2),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)]
|
||||
)
|
||||
|
||||
except ImportError:
|
||||
logger.error("requests module not available")
|
||||
return ActionResult.isFailure(error="requests module not available")
|
||||
except Exception as e:
|
||||
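logger.error(f"Error in sendDraftEmail: {str(e)}")
|
||||
if operationId:
|
||||
try:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
except Exception:
|
||||
pass # Don't fail on progress logging errors
|
||||
return ActionResult.isFailure(error=str(e))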
|
||||
|
||||
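The draft-parsing step in `sendDraftEmail` above (decode the file data, parse the JSON, unwrap an optional `documentData` string, then require a `draftId`) can be isolated into a small pure function, which makes it easier to test without a chat service. This is a sketch only: `parse_draft_email` is a hypothetical name, and it returns `None` where the action logs a warning and continues.

```python
import json
from typing import Any, Dict, Optional, Union


def parse_draft_email(file_data: Union[bytes, str]) -> Optional[Dict[str, Any]]:
    """Return the draft email dict, or None if the payload is not a usable draft.

    Hypothetical helper mirroring the parsing steps in sendDraftEmail.
    """
    # File data may arrive as raw bytes or as text.
    text = file_data.decode("utf-8") if isinstance(file_data, bytes) else str(file_data)
    try:
        data = json.loads(text)
        # Some documents wrap the draft as a JSON string under 'documentData'.
        if isinstance(data, dict) and isinstance(data.get("documentData"), str):
            data = json.loads(data["documentData"])
    except json.JSONDecodeError:
        return None
    # A usable draft is a JSON object carrying a non-empty draftId.
    if not isinstance(data, dict) or not data.get("draftId"):
        return None
    return data
```

The sketch does not handle `UnicodeDecodeError` for non-UTF-8 bytes; in the action that case is covered by the surrounding `except Exception` block.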
|
|
@ -0,0 +1,5 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""Helper modules for Outlook method operations."""
|
||||
|
||||
|
|
@ -0,0 +1,95 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Connection helper for Outlook operations.
|
||||
Handles Microsoft connection management and permission checking.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import requests
|
||||
from typing import Dict, Any, Optional
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class ConnectionHelper:
|
||||
"""Helper for Microsoft connection management in Outlook operations"""
|
||||
|
||||
def __init__(self, methodInstance):
|
||||
"""
|
||||
Initialize connection helper.
|
||||
|
||||
Args:
|
||||
methodInstance: Instance of MethodOutlook (for access to services)
|
||||
"""
|
||||
self.method = methodInstance
|
||||
self.services = methodInstance.services
|
||||
|
||||
def getMicrosoftConnection(self, connectionReference: str) -> Optional[Dict[str, Any]]:
|
||||
"""
|
||||
Helper function to get Microsoft connection details.
|
||||
"""
|
||||
try:
|
||||
logger.debug(f"Getting Microsoft connection for reference: {connectionReference}")
|
||||
|
||||
# Get the connection from the service
|
||||
userConnection = self.services.chat.getUserConnectionFromConnectionReference(connectionReference)
|
||||
if not userConnection:
|
||||
logger.error(f"Connection not found: {connectionReference}")
|
||||
return None
|
||||
|
||||
logger.debug(f"Found connection: {userConnection.id}, status: {userConnection.status.value}, authority: {userConnection.authority.value}")
|
||||
|
||||
# Get a fresh token for this connection
|
||||
token = self.services.chat.getFreshConnectionToken(userConnection.id)
|
||||
if not token:
|
||||
logger.error(f"Fresh token not found for connection: {userConnection.id}")
|
||||
logger.debug(f"Connection details: {userConnection}")
|
||||
return None
|
||||
|
||||
logger.debug(f"Fresh token retrieved for connection {userConnection.id}")
|
||||
|
||||
# Check if connection is active
|
||||
if userConnection.status.value != "active":
|
||||
logger.error(f"Connection is not active: {userConnection.id}, status: {userConnection.status.value}")
|
||||
return None
|
||||
|
||||
return {
|
||||
"id": userConnection.id,
|
||||
"accessToken": token.tokenAccess,
|
||||
"refreshToken": token.tokenRefresh,
|
||||
"scopes": ["Mail.ReadWrite", "Mail.Send", "Mail.ReadWrite.Shared", "User.Read"] # Valid Microsoft Graph API scopes
|
||||
}
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting Microsoft connection: {str(e)}")
|
||||
return None
|
||||
|
||||
async def checkPermissions(self, connection: Dict[str, Any]) -> bool:
|
||||
"""
|
||||
Check if the current connection has the necessary permissions for Outlook operations.
|
||||
"""
|
||||
try:
|
||||
graph_url = "https://graph.microsoft.com/v1.0"
|
||||
headers = {
|
||||
"Authorization": f"Bearer {connection['accessToken']}",
|
||||
"Content-Type": "application/json"
|
||||
}
|
||||
|
||||
# Test permissions by trying to access the user's mail folder
|
||||
test_url = f"{graph_url}/me/mailFolders"
|
||||
response = requests.get(test_url, headers=headers)
|
||||
|
||||
if response.status_code == 200:
|
||||
return True
|
||||
elif response.status_code == 403:
|
||||
logger.error("Permission denied - connection lacks necessary mail permissions")
|
||||
logger.error("Required scopes: Mail.ReadWrite, Mail.Send, Mail.ReadWrite.Shared")
|
||||
return False
|
||||
else:
|
||||
logger.warning(f"Permission check returned status {response.status_code}")
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error checking permissions: {str(e)}")
|
||||
return False
|
||||
|
||||
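`checkPermissions` above issues a blocking `requests.get` with no timeout, so a stalled connection could hang the action. One possible variant, sketched here under assumptions, passes a timeout and takes the HTTP function as a parameter so the check can be exercised without network access. `has_mail_access` and the `http_get` parameter are hypothetical names; the URL is the one used in the helper.

```python
from typing import Any, Callable

GRAPH_URL = "https://graph.microsoft.com/v1.0"


def has_mail_access(access_token: str, http_get: Callable[..., Any], timeout: float = 10.0) -> bool:
    """Return True if listing mail folders succeeds (hypothetical helper).

    http_get is any callable accepting (url, headers=..., timeout=...) and
    returning an object with a status_code attribute, for example requests.get.
    """
    headers = {"Authorization": f"Bearer {access_token}"}
    try:
        response = http_get(f"{GRAPH_URL}/me/mailFolders", headers=headers, timeout=timeout)
    except Exception:
        # Network errors and timeouts are treated as "no access".
        return False
    return response.status_code == 200
```

In production code the call would look like `has_mail_access(connection["accessToken"], requests.get)`; in a test, a stub callable returning a fake response can be passed instead.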
|
|
@ -0,0 +1,184 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Email Processing helper for Outlook operations.
|
||||
Handles email search query sanitization, search parameter building, and filter construction.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import re
|
||||
from typing import Dict, Any
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class EmailProcessingHelper:
|
||||
"""Helper for email search and processing operations"""
|
||||
|
||||
def __init__(self, methodInstance):
|
||||
"""
|
||||
Initialize email processing helper.
|
||||
|
||||
Args:
|
||||
methodInstance: Instance of MethodOutlook (for access to services)
|
||||
"""
|
||||
self.method = methodInstance
|
||||
self.services = methodInstance.services
|
||||
|
||||
def sanitizeSearchQuery(self, query: str) -> str:
|
||||
"""
|
||||
Sanitize and validate search query for Microsoft Graph API
|
||||
|
||||
Microsoft Graph API has specific requirements for search queries:
|
||||
- Escape special characters properly
|
||||
- Handle search operators correctly
|
||||
- Ensure query format is valid
|
||||
"""
|
||||
if not query:
|
||||
return ""
|
||||
|
||||
# Clean the query
|
||||
clean_query = query.strip()
|
||||
|
||||
# Handle folder specifications first
|
||||
if clean_query.lower().startswith('folder:'):
|
||||
folder_name = clean_query[7:].strip()
|
||||
if folder_name:
|
||||
# Return the folder specification as-is
|
||||
return clean_query
|
||||
|
||||
# Remove any double quotes that might cause issues
|
||||
clean_query = clean_query.replace('"', '')
|
||||
|
||||
# Handle common search operators
|
||||
# Recognize Graph operators including both singular and plural forms for hasAttachments
|
||||
lowered = clean_query.lower()
|
||||
if any(op in lowered for op in ['from:', 'to:', 'subject:', 'received:', 'hasattachment:', 'hasattachments:']):
|
||||
# This is an advanced search query, return as-is
|
||||
return clean_query
|
||||
|
||||
# For basic text search, ensure it's safe for contains() filter
|
||||
# Strip backslashes and quotes, which could break the OData filter syntax
|
||||
safe_query = re.sub(r'[\\\'"]', '', clean_query)
|
||||
|
||||
return safe_query
|
||||
|
||||
def buildSearchParameters(self, query: str, folder: str, limit: int) -> Dict[str, Any]:
|
||||
"""
|
||||
Build search parameters for Microsoft Graph API
|
||||
|
||||
This method handles the complexity of building search parameters
|
||||
while avoiding conflicts between $search and $filter parameters.
|
||||
"""
|
||||
params = {
|
||||
"$top": limit
|
||||
}
|
||||
|
||||
if not query or not query.strip():
|
||||
# No query specified, just get emails from folder
|
||||
if folder and folder.lower() != "all":
|
||||
# Use folder name directly for well-known folders, or get folder ID
|
||||
if folder.lower() in ["inbox", "drafts", "sentitems", "deleteditems"]:
|
||||
params["$filter"] = f"parentFolderId eq '{folder}'"
|
||||
else:
|
||||
# For custom folders, we need to get the folder ID first
|
||||
# This will be handled by the calling method
|
||||
params["$filter"] = f"parentFolderId eq '{folder}'"
|
||||
# Add orderby for basic queries
|
||||
params["$orderby"] = "receivedDateTime desc"
|
||||
return params
|
||||
|
||||
clean_query = self.sanitizeSearchQuery(query)
|
||||
|
||||
# Check if this is a folder specification (e.g., "folder:Drafts", "folder:Inbox")
|
||||
if clean_query.lower().startswith('folder:'):
|
||||
folder_name = clean_query[7:].strip() # Remove "folder:" prefix
|
||||
if folder_name:
|
||||
# This is a folder specification, not a text search
|
||||
# Just filter by folder and return
|
||||
params["$filter"] = f"parentFolderId eq '{folder_name}'"
|
||||
params["$orderby"] = "receivedDateTime desc"
|
||||
return params
|
||||
|
||||
# Check if this is a complex search query with multiple operators
|
||||
# Recognize Graph operators including both singular and plural forms for hasAttachments
|
||||
lowered = clean_query.lower()
|
||||
if any(op in lowered for op in ['from:', 'to:', 'subject:', 'received:', 'hasattachment:', 'hasattachments:']):
|
||||
# This is an advanced search query, use $search
|
||||
# Microsoft Graph API supports complex search syntax
|
||||
params["$search"] = f'"{clean_query}"'
|
||||
|
||||
# Note: When using $search, we cannot combine it with $orderby or $filter for folder
|
||||
# Folder filtering is therefore applied to the results after the API call
|
||||
else:
|
||||
# Use $filter for basic text search, but keep it simple to avoid "InefficientFilter" error
|
||||
# Microsoft Graph API has limitations on complex filters
|
||||
if len(clean_query) > 50:
|
||||
# If query is too long, truncate it to avoid complex filter issues
|
||||
clean_query = clean_query[:50]
|
||||
|
||||
|
||||
# Use only subject search to keep filter simple
|
||||
# Handle wildcard queries specially
|
||||
if clean_query == "*" or clean_query == "":
|
||||
# For wildcard or empty query, don't use contains filter
|
||||
# Just use folder filter if specified
|
||||
if folder and folder.lower() != "all":
|
||||
params["$filter"] = f"parentFolderId eq '{folder}'"
|
||||
else:
|
||||
# No filter needed for wildcard search across all folders
|
||||
pass
|
||||
else:
|
||||
params["$filter"] = f"contains(subject,'{clean_query}')"
|
||||
|
||||
# Add folder filter if specified
|
||||
if folder and folder.lower() != "all":
|
||||
params["$filter"] = f"{params['$filter']} and parentFolderId eq '{folder}'"
|
||||
|
||||
# Add orderby for basic queries
|
||||
params["$orderby"] = "receivedDateTime desc"
|
||||
|
||||
|
||||
return params
|
||||
|
||||
def buildGraphFilter(self, filter_text: str) -> Dict[str, str]:
|
||||
"""
|
||||
Build proper Microsoft Graph API filter parameters based on filter text
|
||||
|
||||
Args:
|
||||
filter_text (str): The filter text to process
|
||||
|
||||
Returns:
|
||||
Dict[str, str]: Dictionary with either $filter or $search parameter
|
||||
"""
|
||||
if not filter_text:
|
||||
return {}
|
||||
|
||||
filter_text = filter_text.strip()
|
||||
|
||||
# Handle folder specifications (e.g., "folder:Drafts", "folder:Inbox")
|
||||
if filter_text.lower().startswith('folder:'):
|
||||
folder_name = filter_text[7:].strip() # Remove "folder:" prefix
|
||||
if folder_name:
|
||||
# This is a folder specification, return empty to let the main method handle it
|
||||
return {}
|
||||
|
||||
# Handle search queries (from:, to:, subject:, etc.) - check this FIRST
|
||||
# Support both singular and plural forms for hasAttachments
|
||||
lt = filter_text.lower()
|
||||
if any(lt.startswith(prefix) for prefix in ['from:', 'to:', 'subject:', 'received:', 'hasattachment:', 'hasattachments:']):
|
||||
return {"$search": f'"{filter_text}"'}
|
||||
|
||||
# Handle email address filters (only if it's NOT a search query)
|
||||
if '@' in filter_text and '.' in filter_text and ' ' not in filter_text and not filter_text.startswith('from:'):
|
||||
return {"$filter": f"from/emailAddress/address eq '{filter_text}'"}
|
||||
|
||||
# Handle OData filter conditions (contains 'eq', 'ne', 'gt', 'lt', etc.)
|
||||
if any(op in filter_text.lower() for op in [' eq ', ' ne ', ' gt ', ' lt ', ' ge ', ' le ', ' and ', ' or ']):
|
||||
return {"$filter": filter_text}
|
||||
|
||||
# Handle text content - search in subject
|
||||
return {"$filter": f"contains(subject,'{filter_text}')"}
|
||||
|
||||
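Review note: `buildGraphFilter` interpolates `filter_text` directly into an OData string literal. The following is a minimal sketch of one way to guard that, assuming the OData convention that a single quote inside a string literal is written as two single quotes. The helper names `escapeODataString` and `buildSubjectContainsFilter` are hypothetical and are not part of this commit.

```python
def escapeODataString(value: str) -> str:
    """Escape a value for use inside an OData single-quoted string literal."""
    # OData represents a literal single quote by doubling it.
    return value.replace("'", "''")


def buildSubjectContainsFilter(text: str) -> str:
    """Build a contains() filter on the subject (hypothetical helper)."""
    return f"contains(subject,'{escapeODataString(text)}')"


example = buildSubjectContainsFilter("O'Brien invoice")
```

Escaping keeps the apostrophe searchable, whereas `sanitizeSearchQuery` removes it, so the two approaches return different matches for names such as O'Brien.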
|
|
@ -0,0 +1,110 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Folder Management helper for Outlook operations.
|
||||
Handles folder ID resolution and folder name lookups.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import requests
|
||||
from typing import Dict, Any, Optional
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class FolderManagementHelper:
|
||||
"""Helper for folder management operations"""
|
||||
|
||||
def __init__(self, methodInstance):
|
||||
"""
|
||||
Initialize folder management helper.
|
||||
|
||||
Args:
|
||||
methodInstance: Instance of MethodOutlook (for access to services)
|
||||
"""
|
||||
self.method = methodInstance
|
||||
self.services = methodInstance.services
|
||||
|
||||
def getFolderId(self, folder_name: str, connection: Dict[str, Any]) -> Optional[str]:
|
||||
"""
|
||||
Get the folder ID for a given folder name
|
||||
|
||||
This is needed for proper filtering when using advanced search queries
|
||||
"""
|
||||
try:
|
||||
graph_url = "https://graph.microsoft.com/v1.0"
|
||||
headers = {
|
||||
"Authorization": f"Bearer {connection['accessToken']}",
|
||||
"Content-Type": "application/json"
|
||||
}
|
||||
|
||||
# Get mail folders
|
||||
api_url = f"{graph_url}/me/mailFolders"
|
||||
response = requests.get(api_url, headers=headers, timeout=30)
|
||||
|
||||
if response.status_code == 200:
|
||||
folders_data = response.json()
|
||||
all_folders = folders_data.get("value", [])
|
||||
|
||||
|
||||
|
||||
# Try exact match first
|
||||
for folder in all_folders:
|
||||
if folder.get("displayName", "").lower() == folder_name.lower():
|
||||
|
||||
return folder.get("id")
|
||||
|
||||
# Try common variations for Drafts folder
|
||||
if folder_name.lower() == "drafts":
|
||||
draft_variations = ["drafts", "draft", "entwürfe", "entwurf", "brouillons", "brouillon"]
|
||||
for folder in all_folders:
|
||||
folder_display_name = folder.get("displayName", "").lower()
|
||||
if any(variation in folder_display_name for variation in draft_variations):
|
||||
|
||||
return folder.get("id")
|
||||
|
||||
# Try common variations for other folders
|
||||
if folder_name.lower() == "sent items":
|
||||
sent_variations = ["sent items", "sent", "gesendete elemente", "éléments envoyés"]
|
||||
for folder in all_folders:
|
||||
folder_display_name = folder.get("displayName", "").lower()
|
||||
if any(variation in folder_display_name for variation in sent_variations):
|
||||
|
||||
return folder.get("id")
|
||||
|
||||
logger.warning(f"Folder '{folder_name}' not found. Available folders: {[f.get('displayName', 'Unknown') for f in all_folders]}")
|
||||
return None
|
||||
else:
|
||||
logger.warning(f"Could not retrieve folders: {response.status_code}")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Error getting folder ID for '{folder_name}': {str(e)}")
|
||||
return None
|
||||
|
||||
def getFolderNameById(self, folder_id: str, connection: Dict[str, Any]) -> str:
|
||||
"""
|
||||
Get the folder display name for a given folder ID
|
||||
"""
|
||||
try:
|
||||
graph_url = "https://graph.microsoft.com/v1.0"
|
||||
headers = {
|
||||
"Authorization": f"Bearer {connection['accessToken']}",
|
||||
"Content-Type": "application/json"
|
||||
}
|
||||
|
||||
# Get folder by ID
|
||||
api_url = f"{graph_url}/me/mailFolders/{folder_id}"
|
||||
response = requests.get(api_url, headers=headers, timeout=30)
|
||||
|
||||
if response.status_code == 200:
|
||||
folder_data = response.json()
|
||||
return folder_data.get("displayName", folder_id)
|
||||
else:
|
||||
logger.warning(f"Could not retrieve folder name for ID {folder_id}: {response.status_code}")
|
||||
return folder_id
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Error getting folder name for ID '{folder_id}': {str(e)}")
|
||||
return folder_id
|
||||
|
||||
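Review note: `getFolderId` inspects only the first response from `/me/mailFolders`. Microsoft Graph list responses can be paged, with the next page URL given in `@odata.nextLink`, so a folder on a later page would not be found. Below is a minimal sketch of following those links. `iterMailFolders`, `findFolderId` and the `fetchPage` callable are hypothetical names; `fetchPage` stands in for an authenticated `requests.get(...).json()` call and is faked here so the example runs without network access.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


def iterMailFolders(firstUrl: str, fetchPage: Callable[[str], Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield folders from every page, following @odata.nextLink until it is absent."""
    url: Optional[str] = firstUrl
    while url:
        page = fetchPage(url)
        yield from page.get("value", [])
        url = page.get("@odata.nextLink")


def findFolderId(folderName: str, folders: Iterable[Dict[str, Any]]) -> Optional[str]:
    """Return the ID of the first folder whose displayName matches, ignoring case."""
    wanted = folderName.lower()
    for folder in folders:
        if folder.get("displayName", "").lower() == wanted:
            return folder.get("id")
    return None


# Fake two-page response used in place of real Graph calls (assumption for the demo).
fakePages = {
    "page1": {"value": [{"id": "a", "displayName": "Inbox"}], "@odata.nextLink": "page2"},
    "page2": {"value": [{"id": "b", "displayName": "Archive"}]},
}
found = findFolderId("archive", iterMailFolders("page1", fakePages.__getitem__))
```

Because the generator is lazy, the lookup stops requesting pages as soon as a match is found.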
237
modules/workflows/methods/methodOutlook/methodOutlook.py
Normal file
|
|
@ -0,0 +1,237 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
import logging
|
||||
from datetime import datetime, UTC
|
||||
from modules.workflows.methods.methodBase import MethodBase
|
||||
from modules.datamodels.datamodelWorkflowActions import WorkflowActionDefinition, WorkflowActionParameter
|
||||
from modules.shared.frontendTypes import FrontendType
|
||||
|
||||
# Import helpers
|
||||
from .helpers.connection import ConnectionHelper
|
||||
from .helpers.emailProcessing import EmailProcessingHelper
|
||||
from .helpers.folderManagement import FolderManagementHelper
|
||||
|
||||
# Import actions
|
||||
from .actions.readEmails import readEmails
|
||||
from .actions.searchEmails import searchEmails
|
||||
from .actions.composeAndDraftEmailWithContext import composeAndDraftEmailWithContext
|
||||
from .actions.sendDraftEmail import sendDraftEmail
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class MethodOutlook(MethodBase):
|
||||
"""Outlook method implementation for email operations"""
|
||||
|
||||
def __init__(self, services):
|
||||
"""Initialize the Outlook method"""
|
||||
super().__init__(services)
|
||||
self.name = "outlook"
|
||||
self.description = "Handle Microsoft Outlook email operations"
|
||||
|
||||
# Initialize helper modules
|
||||
self.connection = ConnectionHelper(self)
|
||||
self.emailProcessing = EmailProcessingHelper(self)
|
||||
self.folderManagement = FolderManagementHelper(self)
|
||||
|
||||
# RBAC integration: action definitions with actionId
|
||||
self._actions = {
|
||||
"readEmails": WorkflowActionDefinition(
|
||||
actionId="outlook.readEmails",
|
||||
description="Read emails and metadata from a mailbox folder",
|
||||
parameters={
|
||||
"connectionReference": WorkflowActionParameter(
|
||||
name="connectionReference",
|
||||
type="str",
|
||||
frontendType=FrontendType.USER_CONNECTION,
|
||||
required=True,
|
||||
description="Microsoft connection label"
|
||||
),
|
||||
"folder": WorkflowActionParameter(
|
||||
name="folder",
|
||||
type="str",
|
||||
frontendType=FrontendType.SELECT,
|
||||
frontendOptions="outlook.folder",
|
||||
required=False,
|
||||
default="Inbox",
|
||||
description="Folder to read from"
|
||||
),
|
||||
"limit": WorkflowActionParameter(
|
||||
name="limit",
|
||||
type="int",
|
||||
frontendType=FrontendType.NUMBER,
|
||||
required=False,
|
||||
default=1000,
|
||||
description="Maximum items to return",
|
||||
validation={"min": 1, "max": 10000}
|
||||
),
|
||||
"filter": WorkflowActionParameter(
|
||||
name="filter",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=False,
|
||||
description="Sender, query operators, or subject text"
|
||||
),
|
||||
"outputMimeType": WorkflowActionParameter(
|
||||
name="outputMimeType",
|
||||
type="str",
|
||||
frontendType=FrontendType.SELECT,
|
||||
frontendOptions=["application/json", "text/plain", "text/csv"],
|
||||
required=False,
|
||||
default="application/json",
|
||||
description="MIME type for output file"
|
||||
)
|
||||
},
|
||||
execute=readEmails.__get__(self, self.__class__)
|
||||
),
|
||||
"searchEmails": WorkflowActionDefinition(
|
||||
actionId="outlook.searchEmails",
|
||||
description="Search emails by query and return matching items with metadata",
|
||||
parameters={
|
||||
"connectionReference": WorkflowActionParameter(
|
||||
name="connectionReference",
|
||||
type="str",
|
||||
frontendType=FrontendType.USER_CONNECTION,
|
||||
required=True,
|
||||
description="Microsoft connection label"
|
||||
),
|
||||
"query": WorkflowActionParameter(
|
||||
name="query",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXT,
|
||||
required=True,
|
||||
description="Search expression"
|
||||
),
|
||||
"folder": WorkflowActionParameter(
|
||||
name="folder",
|
||||
type="str",
|
||||
frontendType=FrontendType.SELECT,
|
||||
frontendOptions="outlook.folder",
|
||||
required=False,
|
||||
default="All",
|
||||
description="Folder scope or All"
|
||||
),
|
||||
"limit": WorkflowActionParameter(
|
||||
name="limit",
|
||||
type="int",
|
||||
frontendType=FrontendType.NUMBER,
|
||||
required=False,
|
||||
default=1000,
|
||||
description="Maximum items to return",
|
||||
validation={"min": 1, "max": 10000}
|
||||
),
|
||||
"outputMimeType": WorkflowActionParameter(
|
||||
name="outputMimeType",
|
||||
type="str",
|
||||
frontendType=FrontendType.SELECT,
|
||||
frontendOptions=["application/json", "text/plain", "text/csv"],
|
||||
required=False,
|
||||
default="application/json",
|
||||
description="MIME type for output file"
|
||||
)
|
||||
},
|
||||
execute=searchEmails.__get__(self, self.__class__)
|
||||
),
|
||||
"composeAndDraftEmailWithContext": WorkflowActionDefinition(
|
||||
actionId="outlook.composeAndDraftEmailWithContext",
|
||||
description="Compose email content using AI from context and optional documents, then create a draft",
|
||||
parameters={
|
||||
"connectionReference": WorkflowActionParameter(
|
||||
name="connectionReference",
|
||||
type="str",
|
||||
frontendType=FrontendType.USER_CONNECTION,
|
||||
required=True,
|
||||
description="Microsoft connection label"
|
||||
),
|
||||
"to": WorkflowActionParameter(
|
||||
name="to",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.MULTISELECT,
|
||||
required=True,
|
||||
description="Recipient email addresses"
|
||||
),
|
||||
"context": WorkflowActionParameter(
|
||||
name="context",
|
||||
type="str",
|
||||
frontendType=FrontendType.TEXTAREA,
|
||||
required=True,
|
||||
description="Detailed context for composing the email"
|
||||
),
|
||||
"documentList": WorkflowActionParameter(
|
||||
name="documentList",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=False,
|
||||
description="Document references for context/attachments"
|
||||
),
|
||||
"cc": WorkflowActionParameter(
|
||||
name="cc",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.MULTISELECT,
|
||||
required=False,
|
||||
description="CC recipients"
|
||||
),
|
||||
"bcc": WorkflowActionParameter(
|
||||
name="bcc",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.MULTISELECT,
|
||||
required=False,
|
||||
description="BCC recipients"
|
||||
),
|
||||
"emailStyle": WorkflowActionParameter(
|
||||
name="emailStyle",
|
||||
type="str",
|
||||
frontendType=FrontendType.SELECT,
|
||||
frontendOptions=["formal", "casual", "business"],
|
||||
required=False,
|
||||
default="business",
|
||||
description="Email style: formal, casual, or business"
|
||||
),
|
||||
"maxLength": WorkflowActionParameter(
|
||||
name="maxLength",
|
||||
type="int",
|
||||
frontendType=FrontendType.NUMBER,
|
||||
required=False,
|
||||
default=1000,
|
||||
description="Maximum length for generated content",
|
||||
validation={"min": 100, "max": 10000}
|
||||
)
|
||||
},
|
||||
execute=composeAndDraftEmailWithContext.__get__(self, self.__class__)
|
||||
),
|
||||
"sendDraftEmail": WorkflowActionDefinition(
|
||||
actionId="outlook.sendDraftEmail",
|
||||
description="Send draft email(s) using draft email JSON document(s) from action outlook.composeAndDraftEmailWithContext",
|
||||
parameters={
|
||||
"connectionReference": WorkflowActionParameter(
|
||||
name="connectionReference",
|
||||
type="str",
|
||||
frontendType=FrontendType.USER_CONNECTION,
|
||||
required=True,
|
||||
description="Microsoft connection label"
|
||||
),
|
||||
"documentList": WorkflowActionParameter(
|
||||
name="documentList",
|
||||
type="List[str]",
|
||||
frontendType=FrontendType.DOCUMENT_REFERENCE,
|
||||
required=True,
|
||||
description="Document reference(s) to draft emails in JSON format (outputs from outlook.composeAndDraftEmailWithContext function)"
|
||||
)
|
||||
},
|
||||
execute=sendDraftEmail.__get__(self, self.__class__)
|
||||
)
|
||||
}
|
||||
|
||||
# Validate actions after definition
|
||||
self._validateActions()
|
||||
|
||||
# Register actions as methods (optional, for direct access)
|
||||
self.readEmails = readEmails.__get__(self, self.__class__)
|
||||
self.searchEmails = searchEmails.__get__(self, self.__class__)
|
||||
self.composeAndDraftEmailWithContext = composeAndDraftEmailWithContext.__get__(self, self.__class__)
|
||||
self.sendDraftEmail = sendDraftEmail.__get__(self, self.__class__)
|
||||
|
||||
def _format_timestamp_for_filename(self) -> str:
|
||||
"""Format current timestamp as YYYYMMDD-hhmmss for filenames."""
|
||||
return datetime.now(UTC).strftime("%Y%m%d-%H%M%S")
|
||||
|
||||
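Review note: `MethodOutlook.__init__` binds module-level action functions to the instance with `function.__get__(self, self.__class__)`. The sketch below shows that descriptor-based binding in isolation, assuming the object being bound is a plain Python function (the real actions are imported from the `actions` package and may be wrapped). The names `describe` and `Method` are made up for the illustration.

```python
import types


def describe(self, suffix: str) -> str:
    """Module-level function written as if it were a method."""
    return f"{self.name}:{suffix}"


class Method:
    def __init__(self, name: str):
        self.name = name
        # Functions are descriptors: __get__ returns a method bound to this instance.
        self.describe = describe.__get__(self, self.__class__)


m = Method("outlook")
result = m.describe("readEmails")

# types.MethodType produces an equivalent bound method.
equivalent = types.MethodType(describe, m)
```

Binding per instance lets each action live in its own module while callers still invoke it as `instance.actionName(...)`.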
7
modules/workflows/methods/methodSharepoint/__init__.py
Normal file
|
|
@ -0,0 +1,7 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
from .methodSharepoint import MethodSharepoint
|
||||
|
||||
__all__ = ['MethodSharepoint']
|
||||
|
||||
|
|
@ -0,0 +1,28 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""Action modules for SharePoint operations."""
|
||||
|
||||
# Export all actions
|
||||
from .findDocumentPath import findDocumentPath
|
||||
from .readDocuments import readDocuments
|
||||
from .uploadDocument import uploadDocument
|
||||
from .listDocuments import listDocuments
|
||||
from .analyzeFolderUsage import analyzeFolderUsage
|
||||
from .findSiteByUrl import findSiteByUrl
|
||||
from .downloadFileByPath import downloadFileByPath
|
||||
from .copyFile import copyFile
|
||||
from .uploadFile import uploadFile
|
||||
|
||||
__all__ = [
|
||||
'findDocumentPath',
|
||||
'readDocuments',
|
||||
'uploadDocument',
|
||||
'listDocuments',
|
||||
'analyzeFolderUsage',
|
||||
'findSiteByUrl',
|
||||
'downloadFileByPath',
|
||||
'copyFile',
|
||||
'uploadFile',
|
||||
]
|
||||
|
||||
|
|
@ -0,0 +1,337 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Analyze Folder Usage action for SharePoint operations.
|
||||
Analyzes usage intensity of folders and files in SharePoint.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import time
|
||||
import json
|
||||
from datetime import datetime, timezone, timedelta
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def analyzeFolderUsage(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Analyze usage intensity of folders and files in SharePoint.
|
||||
- Input requirements: connectionReference (required); documentList or pathQuery (at least one required); optional startDateTime, endDateTime, interval.
|
||||
- Output format: JSON with usage analytics grouped by time intervals.
|
||||
|
||||
Parameters:
|
||||
- connectionReference (str, required): Microsoft connection label.
|
||||
- documentList (list, required unless pathQuery is given): Document list reference(s) containing findDocumentPath result.
|
||||
- startDateTime (str, optional): Start date/time in ISO format (e.g., "2025-11-01T00:00:00Z"). Default: 30 days ago.
|
||||
- endDateTime (str, optional): End date/time in ISO format (e.g., "2025-11-30T23:59:59Z"). Default: current time.
|
||||
- interval (str, optional): Time interval for grouping activities. Options: "day", "week", "month". Default: "day".
|
||||
"""
|
||||
operationId = None
|
||||
try:
|
||||
# Init progress logger
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
operationId = f"sharepoint_usage_{workflowId}_{int(time.time())}"
|
||||
|
||||
# Start progress tracking
|
||||
parentOperationId = parameters.get('parentOperationId')
|
||||
self.services.chat.progressLogStart(
|
||||
operationId,
|
||||
"Analyze Folder Usage",
|
||||
"SharePoint Analytics",
|
||||
"Processing document list",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
connectionReference = parameters.get("connectionReference")
|
||||
documentList = parameters.get("documentList")
|
||||
pathQuery = parameters.get("pathQuery")
|
||||
if isinstance(documentList, str):
|
||||
documentList = [documentList]
|
||||
startDateTime = parameters.get("startDateTime")
|
||||
endDateTime = parameters.get("endDateTime")
|
||||
interval = parameters.get("interval", "day")
|
||||
|
||||
if not connectionReference:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Connection reference is required")
|
||||
|
||||
# Require either documentList or pathQuery
|
||||
if not documentList and (not pathQuery or pathQuery.strip() == "" or pathQuery.strip() == "*"):
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Either documentList or pathQuery is required")
|
||||
|
||||
# Resolve folder/item information from documentList or pathQuery
|
||||
siteId = None
|
||||
driveId = None
|
||||
itemId = None
|
||||
folderPath = None
|
||||
folderName = None
|
||||
foundDocuments = None
|
||||
|
||||
if documentList:
|
||||
foundDocuments, sites, errorMsg = await self.documentParsing.parseDocumentListForFoundDocuments(documentList)
|
||||
if errorMsg:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error=errorMsg)
|
||||
|
||||
if not foundDocuments:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="No documents found in documentList")
|
||||
|
||||
# Get siteId from first document (all should be from same site)
|
||||
firstItem = foundDocuments[0]
|
||||
siteId = firstItem.get("siteId")
|
||||
if not siteId:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Site ID missing from documentList")
|
||||
|
||||
# Get drive ID (needed for analytics)
|
||||
driveId = await self.services.sharepoint.getDriveId(siteId)
|
||||
if not driveId:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Could not determine drive ID for the site")
|
||||
|
||||
# If no items from documentList, try pathQuery fallback
|
||||
if not foundDocuments and pathQuery and pathQuery.strip() != "" and pathQuery.strip() != "*":
|
||||
sites, errorMsg = await self.siteDiscovery.resolveSitesFromPathQuery(pathQuery)
|
||||
if errorMsg:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error=errorMsg)
|
||||
|
||||
if sites:
|
||||
siteId = sites[0].get("id")
|
||||
# Parse pathQuery to find the folder/item
|
||||
pathQueryParsed, fileQuery, searchType, searchOptions = self.pathProcessing.parseSearchQuery(pathQuery)
|
||||
|
||||
# Extract folder path from pathQuery
|
||||
folderPath = '/'
|
||||
if pathQueryParsed and pathQueryParsed.startswith('/sites/'):
|
||||
parsedPath = self.siteDiscovery.extractSiteFromStandardPath(pathQueryParsed)
|
||||
if parsedPath:
|
||||
innerPath = parsedPath.get("innerPath", "")
|
||||
folderPath = '/' + innerPath if innerPath else '/'
|
||||
elif pathQueryParsed:
|
||||
folderPath = pathQueryParsed
|
||||
|
||||
# Get drive ID
|
||||
driveId = await self.services.sharepoint.getDriveId(siteId)
|
||||
if not driveId:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Could not determine drive ID for the site")
|
||||
|
||||
# Get folder/item by path
|
||||
folderInfo = await self.services.sharepoint.getFolderByPath(siteId, folderPath.lstrip('/'))
|
||||
if not folderInfo:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error=f"Folder or file not found at path: {folderPath}")
|
||||
|
||||
# Add pathQuery item to foundDocuments for processing
|
||||
foundDocuments = [{
|
||||
"id": folderInfo.get("id"),
|
||||
"name": folderInfo.get("name", ""),
|
||||
"type": "folder" if folderInfo.get("folder") else "file",
|
||||
"siteId": siteId,
|
||||
"fullPath": folderPath,
|
||||
"webUrl": folderInfo.get("webUrl", "")
|
||||
}]
|
||||
|
||||
if not siteId or not driveId:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Either documentList must contain findDocumentPath result with folder information, or pathQuery must be provided. Use findDocumentPath first to get folder path, or provide pathQuery directly.")
|
||||
|
||||
self.services.chat.progressLogUpdate(operationId, 0.2, "Getting Microsoft connection")
|
||||
# Get Microsoft connection
|
||||
connection = self.connection.getMicrosoftConnection(connectionReference)
|
||||
if not connection:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="No valid Microsoft connection found for the provided connection reference")
|
||||
|
||||
# Set access token
|
||||
if not self.services.sharepoint.setAccessTokenFromConnection(connection):
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Failed to set SharePoint access token")
|
||||
|
||||
# Process all items from documentList or pathQuery
|
||||
# IMPORTANT: Only analyze FOLDERS, not files (action is "analyzeFolderUsage")
|
||||
itemsToAnalyze = []
|
||||
if foundDocuments:
|
||||
for item in foundDocuments:
|
||||
itemId = item.get("id")
|
||||
itemType = item.get("type", "").lower()
|
||||
|
||||
# Only process folders, skip files and site-level items
|
||||
if itemId and itemType == "folder":
|
||||
itemsToAnalyze.append({
|
||||
"id": itemId,
|
||||
"name": item.get("name", ""),
|
||||
"type": itemType,
|
||||
"path": item.get("fullPath", ""),
|
||||
"webUrl": item.get("webUrl", "")
|
||||
})
|
||||
|
||||
if not itemsToAnalyze:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="No valid folders found in documentList to analyze. Note: This action only analyzes folders, not files.")
|
||||
|
||||
self.services.chat.progressLogUpdate(operationId, 0.4, f"Analyzing {len(itemsToAnalyze)} folder(s)")
|
||||
|
||||
# Analyze each item
|
||||
allAnalytics = []
|
||||
totalActivities = 0
|
||||
uniqueUsers = set()
|
||||
activityTypes = {}
|
||||
|
||||
# Compute actual date range values (getFolderUsageAnalytics will set defaults if None)
|
||||
# We need to compute them here to store in output, since getFolderUsageAnalytics modifies them
|
||||
actualStartDateTime = startDateTime
|
||||
actualEndDateTime = endDateTime
|
||||
if not actualEndDateTime:
|
||||
actualEndDateTime = datetime.now(timezone.utc).isoformat().replace('+00:00', 'Z')
|
||||
if not actualStartDateTime:
|
||||
startDate = datetime.now(timezone.utc) - timedelta(days=30)
|
||||
actualStartDateTime = startDate.isoformat().replace('+00:00', 'Z')
|
||||
|
||||
for idx, item in enumerate(itemsToAnalyze):
|
||||
progress = 0.4 + (idx / len(itemsToAnalyze)) * 0.5
|
||||
self.services.chat.progressLogUpdate(operationId, progress, f"Analyzing folder {item['name']} ({idx+1}/{len(itemsToAnalyze)})")
|
||||
|
||||
# Get usage analytics for this folder
|
||||
analyticsResult = await self.services.sharepoint.getFolderUsageAnalytics(
|
||||
siteId=siteId,
|
||||
driveId=driveId,
|
||||
itemId=item["id"],
|
||||
startDateTime=startDateTime,
|
||||
endDateTime=endDateTime,
|
||||
interval=interval
|
||||
)
|
||||
|
||||
if "error" in analyticsResult:
|
||||
logger.warning(f"Failed to get analytics for item {item['name']} ({item['id']}): {analyticsResult['error']}")
|
||||
# Continue with other items even if one fails
|
||||
itemAnalytics = {
|
||||
"itemId": item["id"],
|
||||
"itemName": item["name"],
|
||||
"itemType": item["type"],
|
||||
"itemPath": item["path"],
|
||||
"error": analyticsResult.get("error", "Unknown error")
|
||||
}
|
||||
else:
|
||||
# Process analytics for this item
|
||||
itemActivities = 0
|
||||
itemUsers = set()
|
||||
itemActivityTypes = {}
|
||||
|
||||
if "value" in analyticsResult:
|
||||
for intervalData in analyticsResult["value"]:
|
||||
activities = intervalData.get("activities", [])
|
||||
for activity in activities:
|
||||
itemActivities += 1
|
||||
totalActivities += 1
|
||||
|
||||
activityAction = activity.get("action", {})
|
||||
actionType = activityAction.get("verb", "unknown")
|
||||
itemActivityTypes[actionType] = itemActivityTypes.get(actionType, 0) + 1
|
||||
activityTypes[actionType] = activityTypes.get(actionType, 0) + 1
|
||||
|
||||
actor = activity.get("actor", {})
|
||||
userPrincipalName = actor.get("userPrincipalName", "")
|
||||
if userPrincipalName:
|
||||
itemUsers.add(userPrincipalName)
|
||||
uniqueUsers.add(userPrincipalName)
|
||||
|
||||
itemAnalytics = {
|
||||
"itemId": item["id"],
|
||||
"itemName": item["name"],
|
||||
"itemType": item["type"],
|
||||
"itemPath": item["path"],
|
||||
"webUrl": item["webUrl"],
|
||||
"analytics": analyticsResult,
|
||||
"summary": {
|
||||
"totalActivities": itemActivities,
|
||||
"uniqueUsers": len(itemUsers),
|
||||
"activityTypes": itemActivityTypes
|
||||
}
|
||||
}
|
||||
|
||||
# Include note if analytics are not available
|
||||
if "note" in analyticsResult:
|
||||
itemAnalytics["note"] = analyticsResult["note"]
|
||||
|
||||
allAnalytics.append(itemAnalytics)
|
||||
|
||||
self.services.chat.progressLogUpdate(operationId, 0.9, "Processing analytics data")
|
||||
|
||||
# Process and format analytics data
|
||||
resultData = {
|
||||
"siteId": siteId,
|
||||
"driveId": driveId,
|
||||
"startDateTime": actualStartDateTime, # Store computed date range (not None)
|
||||
"endDateTime": actualEndDateTime, # Store computed date range (not None)
|
||||
"interval": interval,
|
||||
"itemsAnalyzed": len(itemsToAnalyze),
|
||||
"foldersAnalyzed": len([item for item in allAnalytics if item.get("itemType") == "folder"]),
|
||||
"items": allAnalytics,
|
||||
"summary": {
|
||||
"totalActivities": totalActivities,
|
||||
"uniqueUsers": len(uniqueUsers),
|
||||
"activityTypes": activityTypes
|
||||
},
|
||||
"note": f"Analyzed {len(itemsToAnalyze)} folder(s) from {actualStartDateTime} to {actualEndDateTime}. " +
|
||||
f"Found {totalActivities} total activities across {len(uniqueUsers)} unique user(s)." +
|
||||
(f" Note: {len([item for item in allAnalytics if 'error' in item])} folder(s) had errors or no analytics data available." if any('error' in item for item in allAnalytics) else ""),
|
||||
"timestamp": self.services.utils.timestampGetUtc()
|
||||
}
|
||||
|
||||
self.services.chat.progressLogUpdate(operationId, 0.95, f"Found {totalActivities} total activities across {len(itemsToAnalyze)} folder(s)")
|
||||
|
||||
validationMetadata = {
|
||||
"actionType": "sharepoint.analyzeFolderUsage",
|
||||
"itemsAnalyzed": len(itemsToAnalyze),
|
||||
"interval": interval,
|
||||
"totalActivities": totalActivities,
|
||||
"uniqueUsers": len(uniqueUsers)
|
||||
}
|
||||
|
||||
self.services.chat.progressLogFinish(operationId, True)
|
||||
return ActionResult(
|
||||
success=True,
|
||||
documents=[
|
||||
ActionDocument(
|
||||
documentName=self._generateMeaningfulFileName("sharepoint_usage_analysis", "json", None, "analyzeFolderUsage"),
|
||||
documentData=json.dumps(resultData, indent=2),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
]
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error analyzing folder usage: {str(e)}")
|
||||
if operationId:
|
||||
try:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
except Exception:
|
||||
pass
|
||||
return ActionResult(
|
||||
success=False,
|
||||
error=str(e)
|
||||
)
|
||||
|
||||
163
modules/workflows/methods/methodSharepoint/actions/copyFile.py
Normal file
|
|
@ -0,0 +1,163 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Copy File action for SharePoint operations.
|
||||
Copies a file within SharePoint.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import json
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def copyFile(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
Copy a file within SharePoint.
|
||||
|
||||
Parameters:
|
||||
- connectionReference (str, required): Microsoft connection label.
|
||||
- siteId (str, required): SharePoint site ID (from findSiteByUrl result) or document reference containing site info
|
||||
- sourceFolder (str, required): Source folder path relative to site root
|
||||
- sourceFile (str, required): Source file name
|
||||
- destFolder (str, required): Destination folder path relative to site root
|
||||
- destFile (str, required): Destination file name
|
||||
|
||||
Returns:
|
||||
- ActionResult with ActionDocument containing copy result
|
||||
"""
|
||||
try:
|
||||
connectionReference = parameters.get("connectionReference")
|
||||
if not connectionReference:
|
||||
return ActionResult.isFailure(error="connectionReference parameter is required")
|
||||
|
||||
siteIdParam = parameters.get("siteId")
|
||||
if not siteIdParam:
|
||||
return ActionResult.isFailure(error="siteId parameter is required")
|
||||
|
||||
sourceFolder = parameters.get("sourceFolder")
|
||||
if not sourceFolder:
|
||||
return ActionResult.isFailure(error="sourceFolder parameter is required")
|
||||
|
||||
sourceFile = parameters.get("sourceFile")
|
||||
if not sourceFile:
|
||||
return ActionResult.isFailure(error="sourceFile parameter is required")
|
||||
|
||||
destFolder = parameters.get("destFolder")
|
||||
if not destFolder:
|
||||
return ActionResult.isFailure(error="destFolder parameter is required")
|
||||
|
||||
destFile = parameters.get("destFile")
|
||||
if not destFile:
|
||||
return ActionResult.isFailure(error="destFile parameter is required")
|
||||
|
||||
# Extract siteId from document if it's a reference
|
||||
siteId = None
|
||||
if isinstance(siteIdParam, str):
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
try:
|
||||
docList = DocumentReferenceList.from_string_list([siteIdParam])
|
||||
chatDocuments = self.services.chat.getChatDocumentsFromDocumentList(docList)
|
||||
if chatDocuments and len(chatDocuments) > 0:
|
||||
siteInfoJson = json.loads(chatDocuments[0].documentData)
|
||||
siteId = siteInfoJson.get("id")
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
if not siteId:
|
||||
siteId = siteIdParam
|
||||
else:
|
||||
siteId = siteIdParam
|
||||
|
||||
if not siteId:
|
||||
return ActionResult.isFailure(error="Could not extract siteId from parameter")
|
||||
|
||||
# Get Microsoft connection
|
||||
connection = self.connection.getMicrosoftConnection(connectionReference)
|
||||
if not connection:
|
||||
return ActionResult.isFailure(error="No valid Microsoft connection found for the provided connection reference")
|
||||
|
||||
# Copy file
|
||||
await self.services.sharepoint.copyFileAsync(
|
||||
siteId=siteId,
|
||||
sourceFolder=sourceFolder,
|
||||
sourceFile=sourceFile,
|
||||
destFolder=destFolder,
|
||||
destFile=destFile
|
||||
)
|
||||
|
||||
logger.info(f"Copied file in SharePoint: {sourceFolder}/{sourceFile} -> {destFolder}/{destFile}")
|
||||
|
||||
# Generate filename
|
||||
workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
|
||||
filename = self._generateMeaningfulFileName(
|
||||
"file_copy_result",
|
||||
"json",
|
||||
workflowContext,
|
||||
"copyFile"
|
||||
)
|
||||
|
||||
result = {
|
||||
"success": True,
|
||||
"siteId": siteId,
|
||||
"sourcePath": f"{sourceFolder}/{sourceFile}",
|
||||
"destPath": f"{destFolder}/{destFile}"
|
||||
}
|
||||
|
||||
validationMetadata = self._createValidationMetadata(
|
||||
"copyFile",
|
||||
siteId=siteId,
|
||||
sourcePath=f"{sourceFolder}/{sourceFile}",
|
||||
destPath=f"{destFolder}/{destFile}"
|
||||
)
|
||||
|
||||
document = ActionDocument(
|
||||
documentName=filename,
|
||||
documentData=json.dumps(result, indent=2),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
|
||||
return ActionResult.isSuccess(documents=[document])
|
||||
|
||||
except Exception as e:
|
||||
# Handle file not found gracefully
|
||||
if "itemNotFound" in str(e) or "404" in str(e):
|
||||
logger.warning(f"File not found for copy: {parameters.get('sourceFolder')}/{parameters.get('sourceFile')}")
|
||||
# Return success with skipped status
|
||||
workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
|
||||
filename = self._generateMeaningfulFileName(
|
||||
"file_copy_result",
|
||||
"json",
|
||||
workflowContext,
|
||||
"copyFile"
|
||||
)
|
||||
|
||||
result = {
|
||||
"success": True,
|
||||
"skipped": True,
|
||||
"reason": "File not found (may not exist yet)"
|
||||
}
|
||||
|
||||
validationMetadata = self._createValidationMetadata(
|
||||
"copyFile",
|
||||
skipped=True
|
||||
)
|
||||
|
||||
document = ActionDocument(
|
||||
documentName=filename,
|
||||
documentData=json.dumps(result, indent=2),
|
||||
mimeType="application/json",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
|
||||
return ActionResult.isSuccess(documents=[document])
|
||||
|
||||
errorMsg = f"Error copying file in SharePoint: {str(e)}"
|
||||
logger.error(errorMsg)
|
||||
return ActionResult.isFailure(error=errorMsg)
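
The site ID resolution in `copyFile` (try the parameter as a document reference, otherwise use it as a literal site ID) appears again in `downloadFileByPath`. A minimal sketch of how it could be factored into one helper follows. The names `resolveSiteId` and `loadDocumentData` are hypothetical and not part of this commit; `loadDocumentData` stands in for the lookup through `DocumentReferenceList.from_string_list` and `getChatDocumentsFromDocumentList`.

```python
import json
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def resolveSiteId(siteIdParam: Any, loadDocumentData: Callable[[str], Optional[str]]) -> Any:
    """Return the site ID stored in a referenced document, or the parameter itself.

    loadDocumentData is a hypothetical callable that returns the JSON text of
    the referenced document, or None when the reference cannot be resolved.
    """
    if not isinstance(siteIdParam, str):
        # Non-string values are passed through unchanged, as in the action code.
        return siteIdParam
    try:
        documentData = loadDocumentData(siteIdParam)
        if documentData:
            siteInfo = json.loads(documentData)
            if isinstance(siteInfo, dict) and siteInfo.get("id"):
                return siteInfo["id"]
    except Exception as e:
        # Best effort: an unresolvable reference means the value is a plain site ID.
        logger.debug(f"siteId is not a document reference: {e}")
    return siteIdParam
```

Logging the swallowed exception at debug level keeps the fallback behavior while leaving a trace when a reference fails to resolve for an unexpected reason.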
|
||||
|
||||
|
|
@ -0,0 +1,117 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Download File By Path action for SharePoint operations.
|
||||
Downloads a file from SharePoint by its exact file path.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import json
|
||||
import base64
|
||||
import os
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def downloadFileByPath(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
Download a file from SharePoint by its exact file path.
|
||||
|
||||
Parameters:
|
||||
- connectionReference (str, required): Microsoft connection label.
|
||||
- siteId (str, required): SharePoint site ID (from findSiteByUrl result) or document reference containing site info
|
||||
- filePath (str, required): Full file path relative to site root (e.g., "/General/50 Docs hosted by SELISE/file.xlsx")
|
||||
|
||||
Returns:
|
||||
- ActionResult with ActionDocument containing file content as base64-encoded bytes
|
||||
"""
|
||||
try:
|
||||
connectionReference = parameters.get("connectionReference")
|
||||
if not connectionReference:
|
||||
return ActionResult.isFailure(error="connectionReference parameter is required")
|
||||
|
||||
siteIdParam = parameters.get("siteId")
|
||||
if not siteIdParam:
|
||||
return ActionResult.isFailure(error="siteId parameter is required")
|
||||
|
||||
filePath = parameters.get("filePath")
|
||||
if not filePath:
|
||||
return ActionResult.isFailure(error="filePath parameter is required")
|
||||
|
||||
# Extract siteId from document if it's a reference
|
||||
siteId = None
|
||||
if isinstance(siteIdParam, str):
|
||||
# Try to parse from document reference
|
||||
from modules.datamodels.datamodelDocref import DocumentReferenceList
|
||||
try:
|
||||
docList = DocumentReferenceList.from_string_list([siteIdParam])
|
||||
chatDocuments = self.services.chat.getChatDocumentsFromDocumentList(docList)
|
||||
if chatDocuments and len(chatDocuments) > 0:
|
||||
siteInfoJson = json.loads(chatDocuments[0].documentData)
|
||||
siteId = siteInfoJson.get("id")
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
if not siteId:
|
||||
# Assume it's the site ID directly
|
||||
siteId = siteIdParam
|
||||
else:
|
||||
siteId = siteIdParam
|
||||
|
||||
if not siteId:
|
||||
return ActionResult.isFailure(error="Could not extract siteId from parameter")
|
||||
|
||||
# Get Microsoft connection
|
||||
connection = self.connection.getMicrosoftConnection(connectionReference)
|
||||
if not connection:
|
||||
return ActionResult.isFailure(error="No valid Microsoft connection found for the provided connection reference")
|
||||
|
||||
# Download file
|
||||
fileContent = await self.services.sharepoint.downloadFileByPath(
|
||||
siteId=siteId,
|
||||
filePath=filePath
|
||||
)
|
||||
|
||||
if fileContent is None:
|
||||
return ActionResult.isFailure(error=f"File not found or could not be downloaded: {filePath}")
|
||||
|
||||
logger.info(f"Downloaded file from SharePoint: {filePath} ({len(fileContent)} bytes)")
|
||||
|
||||
# Generate filename from filePath
|
||||
fileName = os.path.basename(filePath) or "downloaded_file"
|
||||
workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
|
||||
filename = self._generateMeaningfulFileName(
|
||||
fileName.rsplit('.', 1)[0] if '.' in fileName else fileName,  # keep inner dots: "report.v2.xlsx" -> "report.v2"
|
||||
fileName.split('.')[-1] if '.' in fileName else "bin",
|
||||
workflowContext,
|
||||
"downloadFileByPath"
|
||||
)
|
||||
|
||||
# Encode as base64
|
||||
fileBase64 = base64.b64encode(fileContent).decode('utf-8')
|
||||
|
||||
validationMetadata = self._createValidationMetadata(
|
||||
"downloadFileByPath",
|
||||
siteId=siteId,
|
||||
filePath=filePath,
|
||||
fileSize=len(fileContent)
|
||||
)
|
||||
|
||||
document = ActionDocument(
|
||||
documentName=filename,
|
||||
documentData=fileBase64,
|
||||
mimeType="application/octet-stream",
|
||||
validationMetadata=validationMetadata
|
||||
)
|
||||
|
||||
return ActionResult.isSuccess(documents=[document])
|
||||
|
||||
except Exception as e:
|
||||
errorMsg = f"Error downloading file from SharePoint: {str(e)}"
|
||||
logger.error(errorMsg)
|
||||
return ActionResult.isFailure(error=errorMsg)
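
`downloadFileByPath` derives the output document name by splitting the file name on dots. A sketch of an alternative that splits only at the final extension is shown below. `splitNameAndExtension` is a hypothetical helper, not part of this commit, and it uses `posixpath` on the assumption that SharePoint paths always use forward slashes.

```python
import posixpath
from typing import Tuple


def splitNameAndExtension(
    filePath: str,
    defaultName: str = "downloaded_file",
    defaultExtension: str = "bin",
) -> Tuple[str, str]:
    """Split a SharePoint file path into (stem, extension without the dot)."""
    fileName = posixpath.basename(filePath) or defaultName
    # splitext splits at the last dot only and treats a leading dot as part of the name.
    stem, extension = posixpath.splitext(fileName)
    return stem or fileName, extension.lstrip(".") or defaultExtension
```

For a path such as `/General/report.v2.final.xlsx` this keeps `report.v2.final` as the stem, where taking the first dot-separated segment would keep only `report`.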
|
||||
|
||||
|
|
@ -0,0 +1,497 @@
|
|||
# Copyright (c) 2025 Patrick Motsch
|
||||
# All rights reserved.
|
||||
|
||||
"""
|
||||
Find Document Path action for SharePoint operations.
|
||||
Finds documents and folders by name/path across SharePoint sites.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import time
|
||||
import json
|
||||
import urllib.parse
|
||||
from typing import Dict, Any
|
||||
from modules.workflows.methods.methodBase import action
|
||||
from modules.datamodels.datamodelChat import ActionResult, ActionDocument
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@action
|
||||
async def findDocumentPath(self, parameters: Dict[str, Any]) -> ActionResult:
|
||||
"""
|
||||
GENERAL:
|
||||
- Purpose: Find documents and folders by name/path across sites.
|
||||
- Input requirements: connectionReference (required); searchQuery (required); optional site, maxResults.
|
||||
- Output format: JSON with found items and paths.
|
||||
|
||||
Parameters:
|
||||
- connectionReference (str, required): Microsoft connection label.
|
||||
- site (str, optional): Site hint.
|
||||
- searchQuery (str, required): Search terms or path.
|
||||
- maxResults (int, optional): Maximum items to return. Default: 1000.
|
||||
"""
|
||||
operationId = None
|
||||
try:
|
||||
# Init progress logger
|
||||
workflowId = self.services.workflow.id if self.services.workflow else f"no-workflow-{int(time.time())}"
|
||||
operationId = f"sharepoint_find_{workflowId}_{int(time.time())}"
|
||||
|
||||
# Start progress tracking
|
||||
parentOperationId = parameters.get('parentOperationId')
|
||||
self.services.chat.progressLogStart(
|
||||
operationId,
|
||||
"Find Document Path",
|
||||
"SharePoint Search",
|
||||
f"Query: {parameters.get('searchQuery', '*')}",
|
||||
parentOperationId=parentOperationId
|
||||
)
|
||||
|
||||
connectionReference = parameters.get("connectionReference")
|
||||
site = parameters.get("site")
|
||||
searchQuery = parameters.get("searchQuery", "*")
|
||||
maxResults = parameters.get("maxResults", 1000)
|
||||
|
||||
if not connectionReference:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="Connection reference is required")
|
||||
|
||||
# Parse searchQuery to extract path, search terms, search type, and options
|
||||
pathQuery, fileQuery, searchType, searchOptions = self.pathProcessing.parseSearchQuery(searchQuery)
|
||||
logger.debug(f"Parsed searchQuery '{searchQuery}' -> pathQuery='{pathQuery}', fileQuery='{fileQuery}', searchType='{searchType}'")
|
||||
|
||||
self.services.chat.progressLogUpdate(operationId, 0.2, "Getting Microsoft connection")
|
||||
connection = self.connection.getMicrosoftConnection(connectionReference)
|
||||
if not connection:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="No valid Microsoft connection found for the provided connection reference")
|
||||
|
||||
# Extract site name from pathQuery if it contains Microsoft-standard path (/sites/SiteName/...)
|
||||
siteFromPath = None
|
||||
directSite = None
|
||||
if pathQuery and pathQuery.startswith('/sites/'):
|
||||
parsedPath = self.siteDiscovery.extractSiteFromStandardPath(pathQuery)
|
||||
if parsedPath:
|
||||
siteFromPath = parsedPath.get("siteName")
|
||||
logger.info(f"Extracted site from Microsoft-standard pathQuery '{pathQuery}': '{siteFromPath}'")
|
||||
|
||||
# Try to get the site directly by path (optimization: avoids discovering every site)
|
||||
directSite = await self.siteDiscovery.getSiteByStandardPath(siteFromPath)
|
||||
if directSite:
|
||||
logger.info("Got site directly by standard path - no need to discover all sites")
|
||||
sites = [directSite]
|
||||
else:
|
||||
logger.warning("Could not get site directly, falling back to site discovery")
|
||||
directSite = None
|
||||
else:
|
||||
logger.warning(f"Failed to parse site from standard pathQuery '{pathQuery}'")
|
||||
|
||||
# If we didn't get the site directly, use discovery and filtering
|
||||
if not directSite:
|
||||
# Determine which site hint to use (priority: site parameter > site from pathQuery > site_hint from searchOptions)
|
||||
siteHintToUse = site or siteFromPath or searchOptions.get("site_hint")
|
||||
|
||||
# Discover SharePoint sites - use targeted approach when site hint is available
|
||||
self.services.chat.progressLogUpdate(operationId, 0.3, "Discovering SharePoint sites")
|
||||
if siteHintToUse:
|
||||
# When site hint is available, discover all sites first, then filter
|
||||
allSites = await self.siteDiscovery.discoverSharePointSites()
|
||||
if not allSites:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="No SharePoint sites found or accessible")
|
||||
|
||||
sites = self.siteDiscovery.filterSitesByHint(allSites, siteHintToUse)
|
||||
logger.info(f"Filtered sites by site hint '{siteHintToUse}' -> {len(sites)} sites")
|
||||
if not sites:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error=f"No SharePoint sites found matching '{siteHintToUse}'")
|
||||
else:
|
||||
# No site hint - discover all sites
|
||||
sites = await self.siteDiscovery.discoverSharePointSites()
|
||||
if not sites:
|
||||
if operationId:
|
||||
self.services.chat.progressLogFinish(operationId, False)
|
||||
return ActionResult.isFailure(error="No SharePoint sites found or accessible")
|
||||
|
||||
# Resolve path query into search paths
|
||||
searchPaths = self.pathProcessing.resolvePathQuery(pathQuery)
|
||||
|
||||
self.services.chat.progressLogUpdate(operationId, 0.5, f"Searching across {len(sites)} site(s)")
|
||||
|
||||
try:
|
||||
# Search across all discovered sites
|
||||
foundDocuments = []
|
||||
allSitesSearched = []
|
||||
|
||||
# Handle different search approaches based on search type
|
||||
if searchType == "folders" and fileQuery and fileQuery.strip() != "" and fileQuery.strip() != "*":
|
||||
# Use unified search for folders - this is global and searches all sites
|
||||
try:
|
||||
|
||||
# Use Microsoft Graph Search API syntax (simple term search only)
|
||||
terms = [t for t in fileQuery.split() if t.strip()]
|
||||
|
||||
if len(terms) > 1:
|
||||
# Multiple terms: search for ALL terms (AND) - more specific results
|
||||
queryString = " AND ".join(terms)
|
||||
else:
|
||||
# Single term: search for the term
|
||||
queryString = terms[0] if terms else fileQuery
|
||||
logger.info(f"Using unified search for folders: {queryString}")
|
||||
|
||||
payload = {
|
||||
"requests": [
|
||||
{
|
||||
"entityTypes": ["driveItem"],
|
||||
"query": {"queryString": queryString},
|
||||
"from": 0,
|
||||
"size": 50
|
||||
}
|
||||
]
|
||||
}
|
||||
logger.info(f"Using unified search API for folders with queryString: {queryString}")
|
||||
|
||||
# Use global search endpoint (site-specific search not available)
|
||||
unifiedResult = await self.apiClient.makeGraphApiCall(
|
||||
"search/query",
|
||||
method="POST",
|
||||
data=json.dumps(payload).encode("utf-8")
|
||||
)
|
||||
|
||||
if "error" in unifiedResult:
|
||||
logger.warning(f"Unified search failed: {unifiedResult['error']}")
|
||||
items = []
|
||||
else:
|
||||
# Flatten hits -> driveItem resources
|
||||
items = []
|
||||
for container in (unifiedResult.get("value", []) or []):
|
||||
for hitsContainer in (container.get("hitsContainers", []) or []):
|
||||
for hit in (hitsContainer.get("hits", []) or []):
|
||||
resource = hit.get("resource")
|
||||
if resource:
|
||||
items.append(resource)
|
||||
|
||||
logger.info(f"Unified search returned {len(items)} items (pre-filter)")
|
||||
|
||||
# Apply our improved folder detection logic
|
||||
folderItems = []
|
||||
for item in items:
|
||||
resource = item
|
||||
|
||||
# Use the same detection logic as our test
|
||||
isFolder = self.services.sharepoint.detectFolderType(resource)
|
||||
|
||||
if isFolder:
|
||||
folderItems.append(item)
|
||||
|
||||
items = folderItems
|
||||
logger.info(f"Filtered to {len(items)} folders using improved detection logic")
|
||||
|
||||
# Process unified search results - extract site information from webUrl
|
||||
for item in items:
|
||||
itemName = item.get("name", "")
|
||||
webUrl = item.get("webUrl", "")
|
||||
|
||||
# Extract site information from webUrl
|
||||
siteName = "Unknown Site"
|
||||
siteId = "unknown"
|
||||
|
||||
if webUrl and '/sites/' in webUrl:
|
||||
try:
|
||||
# Extract site name from URL: https://pcuster.sharepoint.com/sites/SiteName/...
|
||||
urlParts = webUrl.split('/sites/')
|
||||
if len(urlParts) > 1:
|
||||
sitePath = urlParts[1].split('/')[0]
|
||||
# Find matching site from discovered sites
|
||||
# First try to match by site name (URL path)
|
||||
for site in sites:
|
||||
if site.get("name") == sitePath:
|
||||
siteName = site.get("displayName", sitePath)
|
||||
siteId = site.get("id", "unknown")
|
||||
break
|
||||
else:
|
||||
# If no match by name, try to match by displayName
|
||||
for site in sites:
|
||||
if site.get("displayName") == sitePath:
|
||||
siteName = site.get("displayName", sitePath)
|
||||
siteId = site.get("id", "unknown")
|
||||
break
|
||||
else:
|
||||
# If no exact match, use the site path as site name
|
||||
siteName = sitePath
|
||||
# Try to find a site with similar name
|
||||
for site in sites:
|
||||
if sitePath.lower() in site.get("name", "").lower() or sitePath.lower() in site.get("displayName", "").lower():
|
||||
siteName = site.get("displayName", sitePath)
|
||||
siteId = site.get("id", "unknown")
|
||||
break
|
||||
except Exception as e:
|
||||
logger.warning(f"Error extracting site info from URL {webUrl}: {e}")
|
||||
|
||||
# Use improved folder detection logic
|
||||
isFolder = self.services.sharepoint.detectFolderType(item)
|
||||
itemType = "folder" if isFolder else "file"
|
||||
itemPath = item.get("parentReference", {}).get("path", "")
|
||||
logger.debug(f"Processing {itemType}: '{itemName}' at path: '{itemPath}'")
|
||||
|
||||
# Simple filtering like test file - just check search type
|
||||
if searchType == "files" and isFolder:
|
||||
continue # Skip folders when searching for files
|
||||
elif searchType == "folders" and not isFolder:
|
||||
continue # Skip files when searching for folders
|
||||
|
||||
# Simple approach like test file - no complex filtering
|
||||
logger.debug(f"Item '{itemName}' found - adding to results")
|
||||
|
||||
# Create result with full path information for proper action chaining
|
||||
parentPath = item.get("parentReference", {}).get("path", "")
|
||||
|
||||
# Extract the full SharePoint path from webUrl or parentReference
|
||||
fullPath = ""
|
||||
if webUrl:
|
||||
# Extract path from webUrl: https://pcuster.sharepoint.com/sites/SSSRESYNachfolge/Freigegebene%20Dokumente/General/Eskalation%20LogObject/Druckersteuerung
|
||||
if '/sites/' in webUrl:
|
||||
pathPart = webUrl.split('/sites/')[1]
|
||||
# Decode URL encoding and convert to backslash format
|
||||
decodedPath = urllib.parse.unquote(pathPart)
|
||||
fullPath = "\\" + decodedPath.replace('/', '\\')
|
||||
elif parentPath:
|
||||
# Use parentReference path if available
|
||||
fullPath = parentPath.replace('/', '\\')
|
||||
|
||||
docInfo = {
|
||||
"id": item.get("id"),
|
||||
"name": item.get("name"),
|
||||
"type": "folder" if isFolder else "file",
|
||||
"siteName": siteName,
|
||||
"siteId": siteId,
|
||||
"webUrl": webUrl,
|
||||
"fullPath": fullPath,
|
||||
"parentPath": parentPath
|
||||
}
|
||||
|
||||
foundDocuments.append(docInfo)
|
||||
|
||||
logger.info(f"Found {len(foundDocuments)} documents from unified search")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error performing unified folder search: {str(e)}")
|
||||
# Fallback to site-by-site search
|
||||
pass
|
||||
|
||||
# If no unified search was performed or it failed, fall back to site-by-site search
|
||||
if not foundDocuments:
|
||||
# Use simple approach like test file - no complex filtering
|
||||
siteScopedSites = sites
|
||||
|
||||
for site in siteScopedSites:
|
||||
siteId = site["id"]
|
||||
siteName = site["displayName"]
|
||||
siteUrl = site["webUrl"]
|
||||
|
||||
logger.info(f"Searching in site: {siteName} ({siteUrl})")
|
||||
|
||||
# Check if pathQuery contains a specific folder path (not just /sites/SiteName)
|
||||
folderPath = None
|
||||
if pathQuery and pathQuery.startswith('/sites/'):
|
||||
parsedPath = self.siteDiscovery.extractSiteFromStandardPath(pathQuery)
|
||||
if parsedPath:
|
||||
innerPath = parsedPath.get("innerPath", "")
|
||||
if innerPath and innerPath.strip():
|
||||
# Remove leading slash if present
|
||||
folderPath = innerPath.lstrip('/')
|
||||
|
||||
# Generic approach: Try to find the folder, if it fails, remove first segment
|
||||
# This works for all languages because we test the actual API response
|
||||
# In SharePoint Graph API, /drive/root already points to the default document library,
|
||||
# so library names in paths should be removed
|
||||
pathSegments = [s for s in folderPath.split('/') if s.strip()]
|
||||
if len(pathSegments) > 1:
|
||||
# Try with first segment removed (first segment is likely the document library)
|
||||
testPath = '/'.join(pathSegments[1:])
|
||||
# Quick test: try to get folder info (this is fast and doesn't require full search)
|
||||
testEndpoint = f"sites/{siteId}/drive/root:/{urllib.parse.quote(testPath, safe='')}:"
|
||||
testResult = await self.apiClient.makeGraphApiCall(testEndpoint)
|
||||
if testResult and "error" not in testResult:
|
||||
# Path without first segment works - first segment was likely the document library
|
||||
folderPath = testPath
|
||||
logger.info(f"Removed document library name '{pathSegments[0]}' from folder path (tested via API)")
|
||||
else:
|
||||
# Keep original path - first segment is not a document library
|
||||
logger.info(f"Keeping original folder path '{folderPath}' (first segment is not a document library)")
|
||||
elif len(pathSegments) == 1:
|
||||
# Only one segment - likely the document library itself, use root
|
||||
folderPath = None
|
||||
logger.info(f"Only one segment '{pathSegments[0]}' found, likely document library - using root")
|
||||
|
||||
if folderPath:
|
||||
logger.info(f"Extracted folder path from pathQuery: '{folderPath}'")
|
||||
else:
|
||||
logger.info("Folder path resolved to root (only document library in path)")
|
||||
|
||||
# Use Microsoft Graph API for this specific site
|
||||
# Handle empty or wildcard queries
|
||||
if not fileQuery or fileQuery.strip() == "" or fileQuery.strip() == "*":
|
||||
# For wildcard/empty queries, list all items
|
||||
if folderPath:
|
||||
# List items in specific folder
|
||||
encodedPath = urllib.parse.quote(folderPath, safe='')
|
||||
endpoint = f"sites/{siteId}/drive/root:/{encodedPath}:/children"
|
||||
logger.info(f"Listing items in folder: '{folderPath}'")
|
||||
else:
|
||||
# List all items in the drive root
|
||||
endpoint = f"sites/{siteId}/drive/root/children"
|
||||
|
||||
# Make the API call to list items
|
||||
listResult = await self.apiClient.makeGraphApiCall(endpoint)
|
||||
if "error" in listResult:
|
||||
logger.warning(f"List failed for site {siteName}: {listResult['error']}")
|
||||
continue
|
||||
# Process list results for this site
|
||||
items = listResult.get("value", [])
|
||||
logger.info(f"Retrieved {len(items)} items from site {siteName}")
|
||||
else:
|
||||
# For files, use regular search API
|
||||
# Clean the query: remove path-like syntax and invalid KQL syntax
|
||||
searchQueryCleaned = self.pathProcessing.cleanSearchQuery(fileQuery)
|
||||
# URL-encode the query parameter
|
||||
encodedQuery = urllib.parse.quote(searchQueryCleaned, safe='')
|
||||
|
||||
if folderPath:
|
||||
# Search in specific folder
|
||||
encodedPath = urllib.parse.quote(folderPath, safe='')
|
||||
endpoint = f"sites/{siteId}/drive/root:/{encodedPath}:/search(q='{encodedQuery}')"
|
||||
logger.info(f"Searching in folder '{folderPath}' with query: '{searchQueryCleaned}' (encoded: '{encodedQuery}')")
|
||||
else:
|
||||
# Search in drive root
|
||||
endpoint = f"sites/{siteId}/drive/root/search(q='{encodedQuery}')"
|
||||
logger.info(f"Using search API for files with query: '{searchQueryCleaned}' (encoded: '{encodedQuery}')")
|
||||
|
||||
# Make the search API call (files)
|
||||
searchResult = await self.apiClient.makeGraphApiCall(endpoint)
|
||||
if "error" in searchResult:
|
||||
logger.warning(f"Search failed for site {siteName}: {searchResult['error']}")
|
||||
continue
|
||||
# Process search results for this site (files)
|
||||
items = searchResult.get("value", [])
|
||||
logger.info(f"Retrieved {len(items)} items from site {siteName}")
|
||||
|
||||
siteDocuments = []
|
||||
|
||||
for item in items:
|
||||
itemName = item.get("name", "")
|
||||
|
||||
# Use improved folder detection logic
|
||||
isFolder = self.services.sharepoint.detectFolderType(item)
|
||||
|
||||
itemType = "folder" if isFolder else "file"
|
||||
itemPath = item.get("parentReference", {}).get("path", "")
|
||||
logger.debug(f"Processing {itemType}: '{itemName}' at path: '{itemPath}'")
|
||||
|
||||
# Simple filtering like test file - just check search type
|
||||
if searchType == "files" and isFolder:
|
||||
continue # Skip folders when searching for files
|
||||
elif searchType == "folders" and not isFolder:
|
||||
continue # Skip files when searching for folders
|
||||
|
||||
# Simple approach like test file - no complex filtering
|
||||
logger.debug(f"Item '{itemName}' found - adding to results")
|
||||
|
||||
# Create result with full path information for proper action chaining
|
||||
webUrl = item.get("webUrl", "")
|
||||
parentPath = item.get("parentReference", {}).get("path", "")
|
||||
|
||||
# Extract the full SharePoint path from webUrl or parentReference
|
||||
fullPath = ""
|
||||
if webUrl:
|
||||
# Extract path from webUrl: https://pcuster.sharepoint.com/sites/SSSRESYNachfolge/Freigegebene%20Dokumente/General/Eskalation%20LogObject/Druckersteuerung
|
||||
if '/sites/' in webUrl:
|
||||
pathPart = webUrl.split('/sites/')[1]
|
||||
# Decode URL encoding and convert to backslash format
|
||||
decodedPath = urllib.parse.unquote(pathPart)
|
||||
fullPath = "\\" + decodedPath.replace('/', '\\')
|
||||
elif parentPath:
|
||||
# Use parentReference path if available
|
||||
fullPath = parentPath.replace('/', '\\')
|
||||
|
||||
docInfo = {
|
||||
"id": item.get("id"),
|
||||
"name": item.get("name"),
|
||||
"type": "folder" if isFolder else "file",
|
||||
"siteName": siteName,
|
||||
"siteId": siteId,
|
||||
"webUrl": webUrl,
|
||||
"fullPath": fullPath,
|
||||
"parentPath": parentPath
|
||||
}
|
||||
|
||||
siteDocuments.append(docInfo)
|
||||
|
||||
foundDocuments.extend(siteDocuments)
|
||||
                allSitesSearched.append({
                    "siteName": siteName,
                    "siteUrl": siteUrl,
                    "siteId": siteId,
                    "documentsFound": len(siteDocuments)
                })

                logger.info(f"Found {len(siteDocuments)} documents in site {siteName}")

            # Limit total results to maxResults
            if len(foundDocuments) > maxResults:
                foundDocuments = foundDocuments[:maxResults]
                logger.info(f"Limited results to {maxResults} items")

            self.services.chat.progressLogUpdate(operationId, 0.9, f"Found {len(foundDocuments)} document(s)")

            resultData = {
                "searchQuery": searchQuery,
                "totalResults": len(foundDocuments),
                "maxResults": maxResults,
                "foundDocuments": foundDocuments,
                "timestamp": self.services.utils.timestampGetUtc()
            }

        except Exception as e:
            logger.error(f"Error searching SharePoint: {str(e)}")
            if operationId:
                self.services.chat.progressLogFinish(operationId, False)
            return ActionResult.isFailure(error=str(e))

        # Use default JSON format for output
        outputExtension = ".json"  # Default
        outputMimeType = "application/json"  # Default

        validationMetadata = {
            "actionType": "sharepoint.findDocumentPath",
            "searchQuery": searchQuery,
            "maxResults": maxResults,
            "totalResults": len(foundDocuments),
            "hasResults": len(foundDocuments) > 0
        }

        self.services.chat.progressLogFinish(operationId, True)
        return ActionResult(
            success=True,
            documents=[
                ActionDocument(
                    documentName=self._generateMeaningfulFileName("sharepoint_find_path", "json", None, "findDocumentPath"),
                    documentData=json.dumps(resultData, indent=2),
                    mimeType=outputMimeType,
                    validationMetadata=validationMetadata
                )
            ]
        )

    except Exception as e:
        logger.error(f"Error finding document path: {str(e)}")
        if operationId:
            try:
                self.services.chat.progressLogFinish(operationId, False)
            except Exception:
                pass
        return ActionResult.isFailure(error=str(e))
||||
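Review note: the tail of the search action above caps the hit list at `maxResults` and then packs it into a JSON payload. A minimal standalone sketch of that step follows, assuming `maxResults` is a non-negative integer. The helper name `buildSearchResult` is hypothetical and not part of this commit, and the ISO timestamp is only a stand-in for `self.services.utils.timestampGetUtc()`, whose output format is not visible in this diff.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def buildSearchResult(
    searchQuery: str,
    foundDocuments: List[Dict[str, Any]],
    maxResults: int,
) -> Dict[str, Any]:
    """Cap the hits at maxResults and assemble a JSON-serializable payload.

    Hypothetical helper: mirrors the keys of resultData in the diff.
    """
    # Slicing is a no-op when there are fewer hits than maxResults,
    # so no explicit length check is needed here.
    limited = foundDocuments[:maxResults]
    return {
        "searchQuery": searchQuery,
        "totalResults": len(limited),
        "maxResults": maxResults,
        "foundDocuments": limited,
        # Assumption: stand-in for the project's own UTC timestamp helper.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    docs = [{"name": f"doc{i}"} for i in range(5)]
    print(json.dumps(buildSearchResult("budget", docs, 3), indent=2))
```

Keeping the cap and the payload in one pure function would make the limit behaviour testable without a SharePoint connection. Whether that split fits this codebase is a judgment call for the author.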
|
|
@ -0,0 +1,88 @@
# Copyright (c) 2025 Patrick Motsch
# All rights reserved.

"""
Find Site By URL action for SharePoint operations.
Finds SharePoint site by hostname and site path.
"""

import logging
import json
from typing import Dict, Any
from modules.workflows.methods.methodBase import action
from modules.datamodels.datamodelChat import ActionResult, ActionDocument

logger = logging.getLogger(__name__)

@action
async def findSiteByUrl(self, parameters: Dict[str, Any]) -> ActionResult:
    """
    Find SharePoint site by hostname and site path.

    Parameters:
    - connectionReference (str, required): Microsoft connection label.
    - hostname (str, required): SharePoint hostname (e.g., "example.sharepoint.com")
    - sitePath (str, required): Site path (e.g., "SteeringBPM" or "/sites/SteeringBPM")

    Returns:
    - ActionResult with ActionDocument containing site information (id, displayName, name, webUrl)
    """
    try:
        connectionReference = parameters.get("connectionReference")
        if not connectionReference:
            return ActionResult.isFailure(error="connectionReference parameter is required")

        hostname = parameters.get("hostname")
        if not hostname:
            return ActionResult.isFailure(error="hostname parameter is required")

        sitePath = parameters.get("sitePath")
        if not sitePath:
            return ActionResult.isFailure(error="sitePath parameter is required")

        # Get Microsoft connection
        connection = self.connection.getMicrosoftConnection(connectionReference)
        if not connection:
            return ActionResult.isFailure(error="No valid Microsoft connection found for the provided connection reference")

        # Find site by URL
        siteInfo = await self.services.sharepoint.findSiteByUrl(
            hostname=hostname,
            sitePath=sitePath
        )

        if not siteInfo:
            return ActionResult.isFailure(error=f"Site not found: {hostname}:/sites/{sitePath}")

        logger.info(f"Found SharePoint site: {siteInfo.get('displayName')} (ID: {siteInfo.get('id')})")

        # Generate filename
        workflowContext = self.services.chat.getWorkflowContext() if hasattr(self.services, 'chat') else None
        filename = self._generateMeaningfulFileName(
            "sharepoint_site",
            "json",
            workflowContext,
            "findSiteByUrl"
        )

        validationMetadata = self._createValidationMetadata(
            "findSiteByUrl",
            hostname=hostname,
            sitePath=sitePath,
            siteId=siteInfo.get("id")
        )

        document = ActionDocument(
            documentName=filename,
            documentData=json.dumps(siteInfo, indent=2),
            mimeType="application/json",
            validationMetadata=validationMetadata
        )

        return ActionResult.isSuccess(documents=[document])

    except Exception as e:
        errorMsg = f"Error finding SharePoint site: {str(e)}"
        logger.error(errorMsg)
        return ActionResult.isFailure(error=errorMsg)
||||
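Review note: the docstring of `findSiteByUrl` accepts `sitePath` either as a bare name ("SteeringBPM") or with the prefix ("/sites/SteeringBPM"), while the "Site not found" message always prepends `/sites/`. With the second form the message would read `host:/sites//sites/SteeringBPM`. A possible normalization step is sketched below. Both helper names are hypothetical and not part of this commit, and it is not visible in this diff whether `self.services.sharepoint.findSiteByUrl` already normalizes the path internally.

```python
def normalizeSitePath(sitePath: str) -> str:
    """Return the site name without surrounding slashes or a leading 'sites/' segment.

    Hypothetical helper, shown only to illustrate the two accepted input forms.
    """
    trimmed = sitePath.strip().strip("/")
    if trimmed.lower().startswith("sites/"):
        trimmed = trimmed[len("sites/"):]
    return trimmed


def describeSite(hostname: str, sitePath: str) -> str:
    """Build the 'hostname:/sites/name' label used in log and error messages."""
    return f"{hostname}:/sites/{normalizeSitePath(sitePath)}"


if __name__ == "__main__":
    # Both accepted input forms produce the same label.
    print(describeSite("example.sharepoint.com", "SteeringBPM"))
    print(describeSite("example.sharepoint.com", "/sites/SteeringBPM"))
```

Normalizing once, right after the `sitePath` presence check, would keep the service call, the validation metadata, and the error message consistent with each other.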
Some files were not shown because too many files have changed in this diff