system test
parent aa854f27b7
commit 86fe43e987
21 changed files with 2247 additions and 590 deletions
114
README_document_test.md
Normal file
@@ -0,0 +1,114 @@
# Document Extraction Test

This test procedure validates the DocumentManager's ability to extract content from files using AI-powered analysis.

## Files Created

- `test_document_extraction.py` - Main test script
- `test_sample_document.txt` - Sample document for testing
- `run_document_test.ps1` - PowerShell wrapper script
- `test_document_extraction.log` - Generated log file (cleared on each run)

## Usage

### Method 1: Using PowerShell Script (Recommended)

```powershell
# Test with default sample file
.\run_document_test.ps1

# Test with custom file
.\run_document_test.ps1 "path\to\your\document.pdf"
```

### Method 2: Direct Python Execution

```bash
# Test with default sample file
python test_document_extraction.py test_sample_document.txt

# Test with custom file
python test_document_extraction.py "path/to/your/document.docx"
```

## Test Features

1. **File Validation**: Checks that the specified file exists
2. **MIME Type Detection**: Automatically detects the file type based on its extension
3. **Content Extraction**: Uses the DocumentManager to extract content
4. **AI Processing**: Applies the prompt "summarize the content and give list of the major topics"
5. **Comprehensive Logging**: Logs all steps and results to `test_document_extraction.log`
6. **Log Cleanup**: Clears the log file on each test run
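Feature 2 above (MIME type detection by extension) can be sketched with the standard library. `detect_mime_type` is an illustrative helper name, not necessarily the project's actual implementation:

```python
import mimetypes

def detect_mime_type(path: str) -> str:
    # Guess the MIME type from the file extension; fall back to a
    # generic binary type when the extension is unknown.
    mime_type, _ = mimetypes.guess_type(path)
    return mime_type or "application/octet-stream"
```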
## Supported File Types

- Text files (.txt, .md)
- CSV files (.csv)
- JSON files (.json)
- XML files (.xml)
- HTML files (.html, .htm)
- Images (.jpg, .jpeg, .png, .gif, .svg)
- PDF files (.pdf)
- Office documents (.docx, .xlsx, .pptx)
- And more (falls back to binary processing)

## Test Output

The test generates detailed logs including:

- File information (path, size, MIME type)
- Extraction process details
- Extracted content summary
- AI-processed results
- Error details if any issues occur

## Example Output

```
=== STARTING DOCUMENT EXTRACTION TEST ===
File information: {
  "file_path": "test_sample_document.txt",
  "filename": "test_sample_document.txt",
  "mime_type": "text/plain",
  "file_size_bytes": 2048,
  "file_size_mb": 0.0
}
Document extraction completed successfully: {
  "extracted_content_id": "test-doc-1234567890",
  "content_items_count": 1,
  "object_type": "ExtractedContent"
}
COMPLETE EXTRACTED CONTENT: {
  "total_length": 1500,
  "content": "PowerOn System Architecture Overview... [AI processed summary]"
}
```

## Error Handling

The test includes comprehensive error handling for:

- File not found errors
- File reading errors
- Document processing errors
- AI processing errors
- Import errors

All errors are logged with detailed information for debugging.

## Configuration

The test uses the same configuration as other tests:

- Environment variable: `POWERON_CONFIG_FILE = 'test_config.ini'`
- Log file: `test_document_extraction.log`
- Log level: DEBUG

## Dependencies

The test requires the same dependencies as the main PowerOn system:

- Python 3.8+
- Required Python packages (see requirements.txt)
- Access to AI services (if AI processing is enabled)
- Proper configuration in test_config.ini
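The Configuration section above can be wired up with nothing more than `os.environ` and the standard `logging` module. This is a hypothetical sketch; the actual test script may do this differently:

```python
import os
import logging

# Point the application at the test configuration, as described above.
os.environ["POWERON_CONFIG_FILE"] = "test_config.ini"

# DEBUG-level logger matching the README's stated log level.
logger = logging.getLogger("document_test")
logger.setLevel(logging.DEBUG)
```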
@@ -98,7 +98,27 @@ class AiOpenai:
             The response from the OpenAI Vision API as text
         """
         try:
-            logger.debug(f"Starting image analysis for {mimeType} with query '{prompt}' for {mimeType} size {len(imageData)}B...")
+            logger.debug(f"Starting image analysis with query '{prompt}' for size {len(imageData)}B...")
+
+            # Ensure imageData is a string (base64 encoded)
+            if not isinstance(imageData, str):
+                raise ValueError("imageData must be a string (base64 encoded)")
+
+            # Fix base64 padding if needed
+            padding_needed = len(imageData) % 4
+            if padding_needed:
+                imageData += '=' * (4 - padding_needed)
+
+            # Use default MIME type if not provided
+            if not mimeType:
+                mimeType = "image/jpeg"
+
+            logger.debug(f"Using MIME type: {mimeType}")
+            logger.debug(f"Base64 data length: {len(imageData)} characters")
+
+            # Create the data URL format as required by OpenAI Vision API
+            data_url = f"data:{mimeType};base64,{imageData}"
+
             messages = [
                 {
                     "role": "user",
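The padding repair added in the hunk above can be sketched in isolation. `fix_base64_padding` is an illustrative helper name, not part of the codebase; base64 strings must have a length that is a multiple of 4, so stripped `=` characters are restored before decoding:

```python
import base64

def fix_base64_padding(data: str) -> str:
    # Restore '=' padding that was stripped in transit so that
    # base64.b64decode accepts the string.
    padding_needed = len(data) % 4
    if padding_needed:
        data += "=" * (4 - padding_needed)
    return data
```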
@@ -107,15 +127,40 @@ class AiOpenai:
                         {
                             "type": "image_url",
                             "image_url": {
-                                "url": f"data:{mimeType};base64,{imageData}"
+                                "url": data_url
                             }
                         }
                     ]
                 }
             ]

-            # Use the existing callApi function with the Vision model
-            response = await self.callApi(messages)
+            # Use a vision-capable model for image analysis
+            # Override the model for vision tasks
+            visionModel = "gpt-4o"  # or "gpt-4-vision-preview" depending on availability
+
+            # Use parameters from configuration
+            temperature = self.config.get("temperature", 0.2)
+            maxTokens = self.config.get("maxTokens", 2000)
+
+            payload = {
+                "model": visionModel,
+                "messages": messages,
+                "temperature": temperature,
+                "max_tokens": maxTokens
+            }
+
+            response = await self.httpClient.post(
+                self.apiUrl,
+                json=payload
+            )
+
+            if response.status_code != 200:
+                logger.error(f"OpenAI API error: {response.status_code} - {response.text}")
+                raise HTTPException(status_code=500, detail="Error communicating with OpenAI API")
+
+            responseJson = response.json()
+            content = responseJson["choices"][0]["message"]["content"]
+            return content

             # Return content
             return response
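The request body assembled in the hunk above can be factored into a small builder. This is a sketch with assumed defaults (`gpt-4o`, temperature 0.2, 2000 tokens, matching the diff); `build_vision_payload` is a hypothetical helper, not a function in the codebase:

```python
def build_vision_payload(prompt: str, image_b64: str,
                         mime_type: str = "image/jpeg",
                         model: str = "gpt-4o",
                         temperature: float = 0.2,
                         max_tokens: int = 2000) -> dict:
    """Build a chat-completions payload with an inline data-URL image."""
    data_url = f"data:{mime_type};base64,{image_b64}"
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
```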
@@ -173,13 +173,31 @@ class DatabaseConnector:
             record["_modifiedAt"] = currentTime.isoformat()
             record["_modifiedBy"] = self.userId

-            # Save the record file
+            # Save the record file using atomic write
             recordPath = self._getRecordPath(table, recordId)
+            tempPath = recordPath + '.tmp'
+
+            # Ensure directory exists
             os.makedirs(os.path.dirname(recordPath), exist_ok=True)

-            with open(recordPath, 'w', encoding='utf-8') as f:
+            # Write to temporary file first
+            with open(tempPath, 'w', encoding='utf-8') as f:
                 json.dump(record, f, indent=2, ensure_ascii=False)
+
+            # Verify the temporary file can be read back (validation)
+            try:
+                with open(tempPath, 'r', encoding='utf-8') as f:
+                    json.load(f)  # This will fail if file is corrupted
+            except Exception as e:
+                logger.error(f"Validation failed for record {recordId}: {e}")
+                # Clean up temp file
+                if os.path.exists(tempPath):
+                    os.remove(tempPath)
+                raise ValueError(f"Record validation failed: {e}")
+
+            # Atomic move from temp to final location
+            os.replace(tempPath, recordPath)

             # Update metadata
             metadata = self._loadTableMetadata(table)
             if recordId not in metadata["recordIds"]:
@@ -203,6 +221,13 @@ class DatabaseConnector:
         except Exception as e:
             logger.error(f"Error saving record {recordId} to table {table}: {e}")
+            # Clean up temp file if it exists
+            tempPath = self._getRecordPath(table, recordId) + '.tmp'
+            if os.path.exists(tempPath):
+                try:
+                    os.remove(tempPath)
+                except:
+                    pass
             return False

     def _loadTable(self, table: str) -> List[Dict[str, Any]]:
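The write-validate-replace pattern introduced in the two hunks above can be distilled into a standalone sketch. `atomic_write_json` is an illustrative name; `os.replace` gives the atomic swap on both POSIX and Windows, so readers never observe a half-written record:

```python
import json
import os

def atomic_write_json(path: str, record: dict) -> None:
    # Write to a sibling temp file, validate it, then atomically
    # replace the live record.
    temp_path = path + ".tmp"
    dir_name = os.path.dirname(path)
    if dir_name:
        os.makedirs(dir_name, exist_ok=True)
    with open(temp_path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2, ensure_ascii=False)
    try:
        # Read-back validation: fails if the temp file is corrupted.
        with open(temp_path, "r", encoding="utf-8") as f:
            json.load(f)
    except Exception:
        os.remove(temp_path)
        raise
    os.replace(temp_path, path)  # atomic on POSIX and Windows
```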
@@ -116,7 +116,7 @@ class AiCalls:
             The AI response as text
         """
         try:
-            return await self.openaiService.callAiImage(imageData, mimeType, prompt)
+            return await self.openaiService.callAiImage(prompt, imageData, mimeType)
         except Exception as e:
             logger.error(f"Error in OpenAI image call: {str(e)}")
             return f"Error: {str(e)}"
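The bug fixed above is the classic positional-argument misbinding: the values were passed in the wrong order and silently landed in the wrong parameters. A toy stand-in (not the real `callAiImage`) shows why keyword arguments make the binding explicit and immune to signature reordering:

```python
def call_ai_image(prompt: str, imageData: str, mimeType: str) -> str:
    # Toy stand-in: echoes which value landed in which parameter.
    return f"prompt={prompt}, mime={mimeType}"

# Keyword arguments bind by name, so argument order cannot drift
# out of sync with the signature.
result = call_ai_image(prompt="describe", imageData="aGk=", mimeType="image/png")
```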
@@ -237,7 +237,6 @@ class AppObjects:
         # Find user by username
         for user_dict in users:
             if user_dict.get("username") == username:
-                logger.info(f"Found user with username {username}")
                 return User.from_dict(user_dict)

         logger.info(f"No user found with username {username}")
@@ -760,7 +760,7 @@ class ChatObjects:
         else:
             # Create new workflow
             workflowData = {
-                "name": userInput.name or "New Workflow",
+                "name": "New Workflow",  # Default name since UserInputRequest doesn't have a name field
                 "status": "running",
                 "startedAt": currentTime,
                 "lastActivity": currentTime,
@@ -690,34 +690,39 @@ class ComponentObjects:
                 return None

             # Process content based on file type
-            contentType = "binary"
+            isText = False
             content = ""
+            encoding = None

-            if file.get("mimeType", "").startswith("text/"):
+            # Use proper attribute access for FileItem object
+            if file.mimeType.startswith("text/"):
                 # For text files, return full content
                 try:
                     content = fileContent.decode('utf-8')
-                    contentType = "text"
+                    isText = True
+                    encoding = 'utf-8'
                 except UnicodeDecodeError:
                     content = fileContent.decode('latin-1')
-                    contentType = "text"
-            elif file.get("mimeType", "").startswith("image/"):
+                    isText = True
+                    encoding = 'latin-1'
+            elif file.mimeType.startswith("image/"):
                 # For images, return base64
-                contentType = "base64"
-                content = f"data:{file['mimeType']};base64,{fileContent.hex()}"
+                import base64
+                content = base64.b64encode(fileContent).decode('utf-8')
+                isText = False
             else:
                 # For other files, return as base64
-                contentType = "base64"
-                content = f"data:{file['mimeType']};base64,{fileContent.hex()}"
+                import base64
+                content = base64.b64encode(fileContent).decode('utf-8')
+                isText = False

             return FilePreview(
-                id=fileId,
-                name=file.get("name", "Unknown"),
-                mimeType=file.get("mimeType", "application/octet-stream"),
-                size=file.get("size", 0),
                 content=content,
-                contentType=contentType,
-                metadata=file.get("metadata", {})
+                mimeType=file.mimeType,
+                filename=file.filename,
+                isText=isText,
+                encoding=encoding,
+                size=file.fileSize
             )

         except Exception as e:
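The decode logic in the hunk above can be sketched on its own. `preview_content` is an illustrative helper, not the project's function; note that latin-1 maps every byte value, so the fallback decode can never fail:

```python
import base64

def preview_content(file_content: bytes, mime_type: str):
    """Return (content, isText, encoding) for a file preview payload."""
    if mime_type.startswith("text/"):
        try:
            return file_content.decode("utf-8"), True, "utf-8"
        except UnicodeDecodeError:
            # latin-1 accepts any byte sequence, so this cannot raise.
            return file_content.decode("latin-1"), True, "latin-1"
    # Images and other binary types are returned base64-encoded.
    return base64.b64encode(file_content).decode("utf-8"), False, None
```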
@ -1,4 +1,4 @@
|
||||||
from typing import Dict, Any, Optional
|
from typing import Dict, Any, Optional, List
|
||||||
import logging
|
import logging
|
||||||
import uuid
|
import uuid
|
||||||
from datetime import datetime, UTC
|
from datetime import datetime, UTC
|
||||||
|
|
@@ -11,9 +11,10 @@ class MethodCoder(MethodBase):
     """Coder method implementation for code operations"""

     def __init__(self, serviceContainer: Any):
+        """Initialize the coder method"""
         super().__init__(serviceContainer)
         self.name = "coder"
-        self.description = "Handle code operations like analysis and generation"
+        self.description = "Handle code operations like analysis, generation, and refactoring"

     @action
     async def analyze(self, parameters: Dict[str, Any]) -> ActionResult:
@@ -55,7 +56,7 @@ class MethodCoder(MethodBase):
                 error="No documents found for the provided reference"
             )

-        # Extract content from all documents
+        # Process each document individually
         all_code_content = []

         for chatDocument in chatDocuments:
@@ -85,15 +86,18 @@ class MethodCoder(MethodBase):
                 error="No code content could be extracted from any documents"
             )

-        # Combine all code content for analysis
-        combined_code = "\n\n--- CODE SEPARATOR ---\n\n".join(all_code_content)
+        # Extract text content from ExtractedContent objects
+        text_contents = self.service.extractTextFromContentObjects(all_code_content)
+
+        # Combine all extracted text content for analysis
+        combined_content = "\n\n--- CODE SEPARATOR ---\n\n".join(text_contents)

         # Create analysis prompt
         analysis_prompt = f"""
         Analyze this {language} code for quality, structure, and potential issues.

         Code to analyze:
-        {combined_code}
+        {combined_content}

         Please check for:
         {', '.join(checks)}
@@ -26,18 +26,16 @@ class MethodDocument(MethodBase):
     @action
     async def extract(self, parameters: Dict[str, Any]) -> ActionResult:
         """
-        Extract content from document
+        Extract specific content from document with ai prompt and return it as a json file

         Parameters:
             documentList (str): Reference to the document list to extract content from
             aiPrompt (str): AI prompt for content extraction
-            format (str, optional): Output format (default: "text")
             includeMetadata (bool, optional): Whether to include metadata (default: True)
         """
         try:
             documentList = parameters.get("documentList")
             aiPrompt = parameters.get("aiPrompt")
-            format = parameters.get("format", "text")
             includeMetadata = parameters.get("includeMetadata", True)

             if not documentList:
@@ -95,12 +93,14 @@ class MethodDocument(MethodBase):
                 error="No content could be extracted from any documents"
             )

-        # Combine all extracted content
-        combined_content = "\n\n--- DOCUMENT SEPARATOR ---\n\n".join(all_extracted_content)
+        # Extract text content from ExtractedContent objects
+        text_contents = self.service.extractTextFromContentObjects(all_extracted_content)
+
+        # Combine all extracted text content
+        combined_content = "\n\n--- DOCUMENT SEPARATOR ---\n\n".join(text_contents)

         result_data = {
             "documentCount": len(chatDocuments),
-            "format": format,
             "content": combined_content,
             "fileInfos": file_infos if includeMetadata else None,
             "timestamp": datetime.now(UTC).isoformat()
@@ -124,236 +124,3 @@ class MethodDocument(MethodBase):
                 data={},
                 error=str(e)
             )
-
-    @action
-    async def analyze(self, parameters: Dict[str, Any]) -> ActionResult:
-        """
-        Analyze document content
-
-        Parameters:
-            documentList (str): Reference to the document list to analyze
-            aiPrompt (str): AI prompt for content analysis
-            analysis (List[str], optional): Types of analysis to perform (default: ["entities", "topics", "sentiment"])
-        """
-        try:
-            documentList = parameters.get("documentList")
-            aiPrompt = parameters.get("aiPrompt")
-            analysis = parameters.get("analysis", ["entities", "topics", "sentiment"])
-
-            if not documentList:
-                return self._createResult(
-                    success=False,
-                    data={},
-                    error="Document list reference is required"
-                )
-
-            if not aiPrompt:
-                return self._createResult(
-                    success=False,
-                    data={},
-                    error="AI prompt is required"
-                )
-
-            chatDocuments = self.service.getChatDocumentsFromDocumentList(documentList)
-            if not chatDocuments:
-                return self._createResult(
-                    success=False,
-                    data={},
-                    error="No documents found for the provided reference"
-                )
-
-            # Extract content from all documents
-            all_extracted_content = []
-
-            for chatDocument in chatDocuments:
-                fileId = chatDocument.fileId
-                file_data = self.service.getFileData(fileId)
-                file_info = self.service.getFileInfo(fileId)
-
-                if not file_data:
-                    logger.warning(f"File not found or empty for fileId: {fileId}")
-                    continue
-
-                extracted_content = await self.service.extractContentFromFileData(
-                    prompt=aiPrompt,
-                    fileData=file_data,
-                    filename=file_info.get('name', 'document'),
-                    mimeType=file_info.get('mimeType', 'application/octet-stream'),
-                    base64Encoded=False,
-                    documentId=chatDocument.id
-                )
-
-                all_extracted_content.append(extracted_content)
-
-            if not all_extracted_content:
-                return self._createResult(
-                    success=False,
-                    data={},
-                    error="No content could be extracted from any documents"
-                )
-
-            # Combine all extracted content for analysis
-            combined_content = "\n\n--- DOCUMENT SEPARATOR ---\n\n".join(all_extracted_content)
-
-            analysis_prompt = f"""
-            Analyze this document content for the following aspects:
-            {', '.join(analysis)}
-
-            Document content:
-            {combined_content[:8000]}  # Limit content length
-
-            Please provide a detailed analysis including:
-            1. Key entities (people, organizations, locations, dates)
-            2. Main topics and themes
-            3. Sentiment analysis (positive, negative, neutral)
-            4. Key insights and patterns
-            5. Important relationships between entities
-            6. Document structure and organization
-            """
-
-            analysis_result = await self.service.interfaceAiCalls.callAiTextAdvanced(analysis_prompt)
-
-            result_data = {
-                "documentCount": len(chatDocuments),
-                "analysis": analysis,
-                "results": analysis_result,
-                "content": combined_content,
-                "timestamp": datetime.now(UTC).isoformat()
-            }
-
-            return self._createResult(
-                success=True,
-                data={
-                    "documents": [
-                        {
-                            "documentName": f"document_analysis_{datetime.now(UTC).strftime('%Y%m%d_%H%M%S')}.json",
-                            "documentData": result_data
-                        }
-                    ]
-                }
-            )
-        except Exception as e:
-            logger.error(f"Error analyzing content: {str(e)}")
-            return self._createResult(
-                success=False,
-                data={},
-                error=str(e)
-            )
-
-    @action
-    async def summarize(self, parameters: Dict[str, Any]) -> ActionResult:
-        """
-        Summarize document content
-
-        Parameters:
-            documentList (str): Reference to the document list to summarize
-            aiPrompt (str): AI prompt for content extraction
-            maxLength (int, optional): Maximum length of summary in words (default: 200)
-            format (str, optional): Output format (default: "text")
-        """
-        try:
-            documentList = parameters.get("documentList")
-            aiPrompt = parameters.get("aiPrompt")
-            maxLength = parameters.get("maxLength", 200)
-            format = parameters.get("format", "text")
-
-            if not documentList:
-                return self._createResult(
-                    success=False,
-                    data={},
-                    error="Document list reference is required"
-                )
-
-            if not aiPrompt:
-                return self._createResult(
-                    success=False,
-                    data={},
-                    error="AI prompt is required"
-                )
-
-            chatDocuments = self.service.getChatDocumentsFromDocumentList(documentList)
-            if not chatDocuments:
-                return self._createResult(
-                    success=False,
-                    data={},
-                    error="No documents found for the provided reference"
-                )
-
-            # Extract content from all documents
-            all_extracted_content = []
-
-            for chatDocument in chatDocuments:
-                fileId = chatDocument.fileId
-                file_data = self.service.getFileData(fileId)
-                file_info = self.service.getFileInfo(fileId)
-
-                if not file_data:
-                    logger.warning(f"File not found or empty for fileId: {fileId}")
-                    continue
-
-                extracted_content = await self.service.extractContentFromFileData(
-                    prompt=aiPrompt,
-                    fileData=file_data,
-                    filename=file_info.get('name', 'document'),
-                    mimeType=file_info.get('mimeType', 'application/octet-stream'),
-                    base64Encoded=False,
-                    documentId=chatDocument.id
-                )
-
-                all_extracted_content.append(extracted_content)
-
-            if not all_extracted_content:
-                return self._createResult(
-                    success=False,
-                    data={},
-                    error="No content could be extracted from any documents"
-                )
-
-            # Combine all extracted content for summarization
-            combined_content = "\n\n--- DOCUMENT SEPARATOR ---\n\n".join(all_extracted_content)
-
-            summary_prompt = f"""
-            Create a comprehensive summary of this document content.
-
-            Document content:
-            {combined_content[:8000]}  # Limit content length
-
-            Requirements:
-            - Maximum length: {maxLength} words
-            - Format: {format}
-            - Include key points and main ideas
-            - Maintain accuracy and completeness
-            - Use clear, professional language
-            - Highlight important insights and conclusions
-            """
-
-            summary = await self.service.interfaceAiCalls.callAiTextAdvanced(summary_prompt)
-
-            result_data = {
-                "documentCount": len(chatDocuments),
-                "maxLength": maxLength,
-                "format": format,
-                "summary": summary,
-                "wordCount": len(summary.split()),
-                "originalContent": combined_content,
-                "timestamp": datetime.now(UTC).isoformat()
-            }
-
-            return self._createResult(
-                success=True,
-                data={
-                    "documents": [
-                        {
-                            "documentName": f"document_summary_{datetime.now(UTC).strftime('%Y%m%d_%H%M%S')}.txt",
-                            "documentData": result_data
-                        }
-                    ]
-                }
-            )
-        except Exception as e:
-            logger.error(f"Error summarizing content: {str(e)}")
-            return self._createResult(
-                success=False,
-                data={},
-                error=str(e)
-            )
@@ -133,7 +133,7 @@ async def get_file(
                 detail=f"File with ID {fileId} not found"
             )

-        return FileItem(**fileData)
+        return fileData

     except interfaceComponentObjects.FileNotFoundError as e:
         logger.warning(f"File not found: {str(e)}")
@@ -180,8 +180,8 @@ async def update_file(
                 detail=f"File with ID {fileId} not found"
             )

-        # Check if user has access to the file
-        if file.get("userId", 0) != currentUser.get("id", 0):
+        # Check if user has access to the file using the interface's permission system
+        if not managementInterface._canModify("files", fileId):
             raise HTTPException(
                 status_code=status.HTTP_403_FORBIDDEN,
                 detail="Not authorized to update this file"
@@ -195,9 +195,9 @@ async def update_file(
                 detail="Failed to update file"
             )

-        # Get updated file and convert to FileItem
+        # Get updated file
         updatedFile = managementInterface.getFile(fileId)
-        return FileItem(**updatedFile)
+        return updatedFile

     except HTTPException as he:
         raise he
@@ -328,15 +328,15 @@ async def preview_file(
     try:
         managementInterface = interfaceComponentObjects.getInterface(currentUser)

-        # Get file preview
-        preview = managementInterface.getFilePreview(fileId)
+        # Get file preview using the correct method
+        preview = managementInterface.getFileContent(fileId)
         if not preview:
             raise HTTPException(
                 status_code=status.HTTP_404_NOT_FOUND,
                 detail=f"File with ID {fileId} not found or no content available"
             )

-        return FilePreview(**preview)
+        return preview
     except HTTPException:
         raise
     except Exception as e:
@@ -54,7 +54,7 @@ async def create_prompt(
         # Create prompt
         newPrompt = managementInterface.createPrompt(prompt_data)

-        return Prompt.from_dict(newPrompt)
+        return Prompt(**newPrompt)

 @router.get("/{promptId}", response_model=Prompt)
 @limiter.limit("30/minute")
@@ -74,7 +74,7 @@ async def get_prompt(
             detail=f"Prompt with ID {promptId} not found"
         )

-        return Prompt.from_dict(prompt)
+        return prompt

 @router.put("/{promptId}", response_model=Prompt)
 @limiter.limit("10/minute")
@@ -107,7 +107,7 @@ async def update_prompt(
             detail="Error updating the prompt"
         )

-        return Prompt.from_dict(updatedPrompt)
+        return Prompt(**updatedPrompt)

 @router.delete("/{promptId}", response_model=Dict[str, Any])
 @limiter.limit("10/minute")
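The `Prompt(**newPrompt)` change above swaps a custom `from_dict` classmethod for keyword-unpacking a dict straight into the model constructor, which works for pydantic models and dataclasses alike. A dataclass stand-in (not the project's real `Prompt` class) illustrates the pattern:

```python
from dataclasses import dataclass

# Illustrative stand-in; the real Prompt model lives in the app's models module.
@dataclass
class Prompt:
    id: str
    text: str

record = {"id": "p1", "text": "Summarize the document"}
prompt = Prompt(**record)  # dict keys bind to constructor keyword arguments
```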
```diff
@@ -48,7 +48,7 @@ def getServiceChat(currentUser: User):
 
 # Consolidated endpoint for getting all workflows
 @router.get("/", response_model=List[ChatWorkflow])
-@limiter.limit("30/minute")
+@limiter.limit("120/minute")
 async def get_workflows(
     request: Request,
     currentUser: User = Depends(getCurrentUser)
@@ -56,7 +56,31 @@ async def get_workflows(
     """Get all workflows for the current user."""
     try:
         appInterface = getInterface(currentUser)
-        return appInterface.getAllWorkflows()
+        workflows_data = appInterface.getAllWorkflows()
+
+        # Convert raw dictionaries to ChatWorkflow objects
+        workflows = []
+        for workflow_data in workflows_data:
+            try:
+                workflow = ChatWorkflow(
+                    id=workflow_data["id"],
+                    status=workflow_data.get("status", "running"),
+                    name=workflow_data.get("name"),
+                    currentRound=workflow_data.get("currentRound", 1),
+                    lastActivity=workflow_data.get("lastActivity", appInterface._getCurrentTimestamp()),
+                    startedAt=workflow_data.get("startedAt", appInterface._getCurrentTimestamp()),
+                    logs=[ChatLog(**log) for log in workflow_data.get("logs", [])],
+                    messages=[ChatMessage(**msg) for msg in workflow_data.get("messages", [])],
+                    stats=ChatStat(**workflow_data.get("dataStats", {})) if workflow_data.get("dataStats") else None,
+                    mandateId=workflow_data.get("mandateId", currentUser.mandateId or "")
+                )
+                workflows.append(workflow)
+            except Exception as e:
+                logger.warning(f"Error converting workflow data to ChatWorkflow object: {str(e)}")
+                # Skip invalid workflows instead of failing the entire request
+                continue
+
+        return workflows
     except Exception as e:
         logger.error(f"Error getting workflows: {str(e)}")
         raise HTTPException(
@@ -65,7 +89,7 @@ async def get_workflows(
         )
 
 @router.get("/{workflowId}", response_model=ChatWorkflow)
-@limiter.limit("30/minute")
+@limiter.limit("120/minute")
 async def get_workflow(
     request: Request,
     workflowId: str = Path(..., description="ID of the workflow"),
@@ -93,9 +117,58 @@ async def get_workflow(
             detail=f"Failed to get workflow: {str(e)}"
         )
 
+@router.put("/{workflowId}", response_model=ChatWorkflow)
+@limiter.limit("120/minute")
+async def update_workflow(
+    request: Request,
+    workflowId: str = Path(..., description="ID of the workflow to update"),
+    workflowData: Dict[str, Any] = Body(...),
+    currentUser: User = Depends(getCurrentUser)
+) -> ChatWorkflow:
+    """Update workflow by ID"""
+    try:
+        # Get workflow interface with current user context
+        workflowInterface = getInterface(currentUser)
+
+        # Get raw workflow data from database to check permissions
+        workflows = workflowInterface.db.getRecordset("workflows", recordFilter={"id": workflowId})
+        if not workflows:
+            raise HTTPException(
+                status_code=status.HTTP_404_NOT_FOUND,
+                detail="Workflow not found"
+            )
+
+        workflow_data = workflows[0]
+
+        # Check if user has permission to update using the interface's permission system
+        if not workflowInterface._canModify("workflows", workflowId):
+            raise HTTPException(
+                status_code=status.HTTP_403_FORBIDDEN,
+                detail="You don't have permission to update this workflow"
+            )
+
+        # Update workflow
+        updatedWorkflow = workflowInterface.updateWorkflow(workflowId, workflowData)
+        if not updatedWorkflow:
+            raise HTTPException(
+                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                detail="Failed to update workflow"
+            )
+
+        return updatedWorkflow
+
+    except HTTPException:
+        raise
+    except Exception as e:
+        logger.error(f"Error updating workflow: {str(e)}")
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail=f"Failed to update workflow: {str(e)}"
+        )
+
 # API Endpoint for workflow status
 @router.get("/{workflowId}/status", response_model=ChatWorkflow)
-@limiter.limit("30/minute")
+@limiter.limit("120/minute")
 async def get_workflow_status(
     request: Request,
     workflowId: str = Path(..., description="ID of the workflow"),
@@ -114,7 +187,7 @@ async def get_workflow_status(
             detail=f"Workflow with ID {workflowId} not found"
         )
 
-    return ChatWorkflow(**workflow)
+    return workflow
     except HTTPException:
         raise
     except Exception as e:
@@ -126,7 +199,7 @@ async def get_workflow_status(
 
 # API Endpoint for workflow logs with selective data transfer
 @router.get("/{workflowId}/logs", response_model=List[ChatLog])
-@limiter.limit("30/minute")
+@limiter.limit("120/minute")
 async def get_workflow_logs(
     request: Request,
     workflowId: str = Path(..., description="ID of the workflow"),
@@ -152,12 +225,12 @@ async def get_workflow_logs(
         # Apply selective data transfer if logId is provided
         if logId:
             # Find the index of the log with the given ID
-            logIndex = next((i for i, log in enumerate(allLogs) if log.get("id") == logId), -1)
+            logIndex = next((i for i, log in enumerate(allLogs) if log.id == logId), -1)
             if logIndex >= 0:
                 # Return only logs after the specified log
-                return [ChatLog(**log) for log in allLogs[logIndex + 1:]]
+                return allLogs[logIndex + 1:]
 
-        return [ChatLog(**log) for log in allLogs]
+        return allLogs
     except HTTPException:
         raise
     except Exception as e:
@@ -169,7 +242,7 @@ async def get_workflow_logs(
 
 # API Endpoint for workflow messages with selective data transfer
 @router.get("/{workflowId}/messages", response_model=List[ChatMessage])
-@limiter.limit("30/minute")
+@limiter.limit("120/minute")
 async def get_workflow_messages(
     request: Request,
     workflowId: str = Path(..., description="ID of the workflow"),
@@ -195,12 +268,12 @@ async def get_workflow_messages(
         # Apply selective data transfer if messageId is provided
         if messageId:
             # Find the index of the message with the given ID
-            messageIndex = next((i for i, msg in enumerate(allMessages) if msg.get("id") == messageId), -1)
+            messageIndex = next((i for i, msg in enumerate(allMessages) if msg.id == messageId), -1)
             if messageIndex >= 0:
                 # Return only messages after the specified message
-                return [ChatMessage(**msg) for msg in allMessages[messageIndex + 1:]]
+                return allMessages[messageIndex + 1:]
 
-        return [ChatMessage(**msg) for msg in allMessages]
+        return allMessages
     except HTTPException:
         raise
     except Exception as e:
@@ -212,7 +285,7 @@ async def get_workflow_messages(
 
 # State 1: Workflow Initialization endpoint
 @router.post("/start", response_model=ChatWorkflow)
-@limiter.limit("10/minute")
+@limiter.limit("120/minute")
 async def start_workflow(
     request: Request,
     workflowId: Optional[str] = Query(None, description="Optional ID of the workflow to continue"),
@@ -230,7 +303,7 @@ async def start_workflow(
         # Start or continue workflow using ChatObjects
         workflow = await interfaceChat.workflowStart(currentUser, userInput, workflowId)
 
-        return ChatWorkflow(**workflow)
+        return workflow
 
     except Exception as e:
         logger.error(f"Error in start_workflow: {str(e)}")
@@ -241,7 +314,7 @@ async def start_workflow(
 
 # State 8: Workflow Stopped endpoint
 @router.post("/{workflowId}/stop", response_model=ChatWorkflow)
-@limiter.limit("10/minute")
+@limiter.limit("120/minute")
 async def stop_workflow(
     request: Request,
     workflowId: str = Path(..., description="ID of the workflow to stop"),
@@ -255,7 +328,7 @@ async def stop_workflow(
         # Stop workflow using ChatObjects
         workflow = await interfaceChat.workflowStop(workflowId)
 
-        return ChatWorkflow(**workflow)
+        return workflow
 
     except Exception as e:
         logger.error(f"Error in stop_workflow: {str(e)}")
@@ -266,7 +339,7 @@ async def stop_workflow(
 
 # State 11: Workflow Reset/Deletion endpoint
 @router.delete("/{workflowId}", response_model=Dict[str, Any])
-@limiter.limit("10/minute")
+@limiter.limit("120/minute")
 async def delete_workflow(
     request: Request,
     workflowId: str = Path(..., description="ID of the workflow to delete"),
@@ -277,16 +350,18 @@ async def delete_workflow(
         # Get service container
         interfaceChat = getServiceChat(currentUser)
 
-        # Verify workflow exists
-        workflow = interfaceChat.getWorkflow(workflowId)
-        if not workflow:
+        # Get raw workflow data from database to check permissions
+        workflows = interfaceChat.db.getRecordset("workflows", recordFilter={"id": workflowId})
+        if not workflows:
             raise HTTPException(
                 status_code=status.HTTP_404_NOT_FOUND,
                 detail=f"Workflow with ID {workflowId} not found"
             )
 
-        # Check if user has permission to delete
-        if workflow.get("_userId") != currentUser["id"]:
+        workflow_data = workflows[0]
+
+        # Check if user has permission to delete using the interface's permission system
+        if not interfaceChat._canModify("workflows", workflowId):
             raise HTTPException(
                 status_code=status.HTTP_403_FORBIDDEN,
                 detail="You don't have permission to delete this workflow"
@@ -318,7 +393,7 @@ async def delete_workflow(
 # Document Management Endpoints
 
 @router.delete("/{workflowId}/messages/{messageId}", response_model=Dict[str, Any])
-@limiter.limit("10/minute")
+@limiter.limit("120/minute")
 async def delete_workflow_message(
     request: Request,
     workflowId: str = Path(..., description="ID of the workflow"),
@@ -368,7 +443,7 @@ async def delete_workflow_message(
         )
 
 @router.delete("/{workflowId}/messages/{messageId}/files/{fileId}", response_model=Dict[str, Any])
-@limiter.limit("10/minute")
+@limiter.limit("120/minute")
 async def delete_file_from_message(
     request: Request,
     workflowId: str = Path(..., description="ID of the workflow"),
```
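The new `get_workflows` body converts each raw database dict into a model inside a per-item `try`/`except`, so one malformed record is logged and skipped instead of failing the whole request. A minimal stdlib-only sketch of that pattern (the `ChatWorkflow` stand-in and field names are illustrative, not the project's real model):

```python
from dataclasses import dataclass
from typing import List

# Stand-in for the ChatWorkflow model; the real one has more fields.
@dataclass
class ChatWorkflow:
    id: str
    status: str = "running"
    currentRound: int = 1

def convert_workflows(raw_records: List[dict]) -> List[ChatWorkflow]:
    """Convert raw DB dicts, skipping invalid records instead of
    failing the entire request (mirrors the loop in the diff)."""
    workflows = []
    for record in raw_records:
        try:
            workflows.append(ChatWorkflow(
                id=record["id"],  # required field: KeyError if missing
                status=record.get("status", "running"),
                currentRound=record.get("currentRound", 1),
            ))
        except (KeyError, TypeError):
            # Skip records missing required fields
            continue
    return workflows

result = convert_workflows([
    {"id": "w1"},                       # valid, gets defaults
    {"status": "done"},                 # invalid: no id, skipped
    {"id": "w2", "status": "stopped"},  # valid
])
```

The trade-off is silent data loss for bad records, which is why the diff logs a warning for each skipped workflow rather than dropping them without a trace.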
*(File diff suppressed because it is too large.)*
```diff
@@ -17,8 +17,8 @@ class DocumentManager:
 
     def __init__(self, serviceContainer):
         self.service = serviceContainer
-        # Create processor without any dependencies
-        self._processor = DocumentProcessor()
+        # Create processor with service container for AI calls
+        self._processor = DocumentProcessor(serviceContainer)
 
     async def extractContentFromDocument(self, prompt: str, document: ChatDocument) -> ExtractedContent:
         """Extract content from ChatDocument using prompt"""
```
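The constructor change above is plain dependency injection: the manager passes its service container down so the processor can delegate AI calls instead of being dependency-free. A tiny sketch of the wiring (class bodies reduced to the relevant lines; names follow the diff but the bodies are illustrative):

```python
class DocumentProcessor:
    # The container is optional so the processor can still be
    # constructed standalone (e.g. in tests without AI services).
    def __init__(self, serviceContainer=None):
        self._serviceContainer = serviceContainer

class DocumentManager:
    def __init__(self, serviceContainer):
        self.service = serviceContainer
        # Pass the container down instead of constructing
        # a dependency-free processor
        self._processor = DocumentProcessor(serviceContainer)

manager = DocumentManager(serviceContainer="container")
```

Keeping the parameter optional (`serviceContainer=None`) preserves the old call signature while enabling the new AI-backed code paths when a container is supplied.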
```diff
@@ -52,8 +52,56 @@ class WorkflowManager:
 
         except WorkflowStoppedException:
             logger.info("Workflow stopped by user")
+            # Update workflow status to stopped
+            workflow.status = "stopped"
+            workflow.lastActivity = datetime.now(UTC).isoformat()
+            self.chatInterface.updateWorkflow(workflow.id, {
+                "status": "stopped",
+                "lastActivity": workflow.lastActivity
+            })
+
+            # Add log entry
+            self.chatInterface.createWorkflowLog({
+                "workflowId": workflow.id,
+                "message": "Workflow stopped by user",
+                "type": "warning",
+                "status": "stopped",
+                "progress": 100
+            })
+
         except Exception as e:
             logger.error(f"Workflow processing error: {str(e)}")
+
+            # Update workflow status to failed
+            workflow.status = "failed"
+            workflow.lastActivity = datetime.now(UTC).isoformat()
+            self.chatInterface.updateWorkflow(workflow.id, {
+                "status": "failed",
+                "lastActivity": workflow.lastActivity
+            })
+
+            # Create error message
+            error_message = {
+                "workflowId": workflow.id,
+                "role": "assistant",
+                "message": f"Workflow processing failed: {str(e)}",
+                "status": "last",
+                "sequenceNr": len(workflow.messages) + 1,
+                "publishedAt": datetime.now(UTC).isoformat()
+            }
+            message = self.chatInterface.createWorkflowMessage(error_message)
+            if message:
+                workflow.messages.append(message)
+
+            # Add error log entry
+            self.chatInterface.createWorkflowLog({
+                "workflowId": workflow.id,
+                "message": f"Workflow failed: {str(e)}",
+                "type": "error",
+                "status": "failed",
+                "progress": 100
+            })
+
             raise
 
     async def _sendFirstMessage(self, userInput: UserInputRequest, workflow: ChatWorkflow) -> ChatMessage:
@@ -108,6 +156,25 @@ class WorkflowManager:
             if message:
                 workflow.messages.append(message)
 
+            # Update workflow status to completed
+            workflow.status = "completed"
+            workflow.lastActivity = datetime.now(UTC).isoformat()
+
+            # Update workflow in database
+            self.chatInterface.updateWorkflow(workflow.id, {
+                "status": "completed",
+                "lastActivity": workflow.lastActivity
+            })
+
+            # Add completion log entry
+            self.chatInterface.createWorkflowLog({
+                "workflowId": workflow.id,
+                "message": "Workflow completed successfully",
+                "type": "success",
+                "status": "completed",
+                "progress": 100
+            })
+
         except Exception as e:
             logger.error(f"Error sending last message: {str(e)}")
             raise
@@ -128,6 +195,14 @@ class WorkflowManager:
             message = self.chatInterface.createWorkflowMessage(error_message)
             if message:
                 workflow.messages.append(message)
+
+            # Update workflow status to failed
+            workflow.status = "failed"
+            workflow.lastActivity = datetime.now(UTC).isoformat()
+            self.chatInterface.updateWorkflow(workflow.id, {
+                "status": "failed",
+                "lastActivity": workflow.lastActivity
+            })
             return
 
         # Process successful workflow results
@@ -174,6 +249,14 @@ class WorkflowManager:
             if message:
                 workflow.messages.append(message)
 
+            # Update workflow status to completed for successful workflows
+            workflow.status = "completed"
+            workflow.lastActivity = datetime.now(UTC).isoformat()
+            self.chatInterface.updateWorkflow(workflow.id, {
+                "status": "completed",
+                "lastActivity": workflow.lastActivity
+            })
+
         except Exception as e:
             logger.error(f"Error processing workflow results: {str(e)}")
             # Create error message
@@ -189,3 +272,11 @@ class WorkflowManager:
             if message:
                 workflow.messages.append(message)
+
+            # Update workflow status to failed
+            workflow.status = "failed"
+            workflow.lastActivity = datetime.now(UTC).isoformat()
+            self.chatInterface.updateWorkflow(workflow.id, {
+                "status": "failed",
+                "lastActivity": workflow.lastActivity
+            })
```
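Every terminal path in the diff above repeats the same pair of calls: persist the new status plus `lastActivity` via `updateWorkflow`, then append a log entry at 100% progress via `createWorkflowLog`. A stdlib-only sketch of that pattern with an in-memory stand-in for `chatInterface` (the helper and the fake interface are illustrative, not project code; `timezone.utc` is used instead of the diff's `datetime.now(UTC)` so the sketch also runs on Python < 3.11):

```python
from datetime import datetime, timezone

# In-memory stand-in for chatInterface; the real one persists to a database.
class FakeChatInterface:
    def __init__(self):
        self.workflows = {}
        self.logs = []
    def updateWorkflow(self, workflowId, changes):
        self.workflows.setdefault(workflowId, {}).update(changes)
    def createWorkflowLog(self, entry):
        self.logs.append(entry)

def mark_terminal(chat, workflowId, status, message, logType):
    """Record a terminal state ('completed', 'failed', 'stopped')
    plus a matching log entry, mirroring the repeated blocks above."""
    lastActivity = datetime.now(timezone.utc).isoformat()
    chat.updateWorkflow(workflowId, {"status": status, "lastActivity": lastActivity})
    chat.createWorkflowLog({
        "workflowId": workflowId,
        "message": message,
        "type": logType,
        "status": status,
        "progress": 100,
    })

chat = FakeChatInterface()
mark_terminal(chat, "w1", "failed", "Workflow failed: boom", "error")
```

Factoring the repeated blocks into a helper like this would also keep the status string and log `type` from drifting apart across the five call sites.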
@ -32,9 +32,10 @@ class FileProcessingError(Exception):
|
||||||
class DocumentProcessor:
|
class DocumentProcessor:
|
||||||
"""Processor for handling document operations and content extraction."""
|
"""Processor for handling document operations and content extraction."""
|
||||||
|
|
||||||
def __init__(self):
|
def __init__(self, serviceContainer=None):
|
||||||
"""Initialize the document processor."""
|
"""Initialize the document processor."""
|
||||||
self._neutralizer = DataAnonymizer() if APP_CONFIG.get("ENABLE_CONTENT_NEUTRALIZATION", False) else None
|
self._neutralizer = DataAnonymizer() if APP_CONFIG.get("ENABLE_CONTENT_NEUTRALIZATION", False) else None
|
||||||
|
self._serviceContainer = serviceContainer
|
||||||
|
|
||||||
self.supportedTypes: Dict[str, Callable[[bytes, str, str], Awaitable[List[ContentItem]]]] = {
|
self.supportedTypes: Dict[str, Callable[[bytes, str, str], Awaitable[List[ContentItem]]]] = {
|
||||||
'text/plain': self._processText,
|
'text/plain': self._processText,
|
||||||
|
|
@ -109,6 +110,8 @@ class DocumentProcessor:
|
||||||
except ImportError as e:
|
except ImportError as e:
|
||||||
logger.warning(f"Image processing libraries could not be loaded: {e}")
|
logger.warning(f"Image processing libraries could not be loaded: {e}")
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
async def processFileData(self, fileData: bytes, filename: str, mimeType: str, base64Encoded: bool = False, prompt: str = None, documentId: str = None) -> ExtractedContent:
|
async def processFileData(self, fileData: bytes, filename: str, mimeType: str, base64Encoded: bool = False, prompt: str = None, documentId: str = None) -> ExtractedContent:
|
||||||
"""
|
"""
|
||||||
Process file data directly and extract its contents with AI processing.
|
Process file data directly and extract its contents with AI processing.
|
||||||
|
|
@ -133,7 +136,7 @@ class DocumentProcessor:
|
||||||
|
|
||||||
# Detect content type if needed
|
# Detect content type if needed
|
||||||
if mimeType == "application/octet-stream":
|
if mimeType == "application/octet-stream":
|
||||||
mimeType = self._detectContentTypeFromData(fileData, filename)
|
mimeType = self._serviceContainer.detectContentTypeFromData(fileData, filename)
|
||||||
|
|
||||||
# Process document based on type
|
# Process document based on type
|
||||||
if mimeType not in self.supportedTypes:
|
if mimeType not in self.supportedTypes:
|
||||||
|
|
@ -162,60 +165,7 @@ class DocumentProcessor:
|
||||||
logger.error(f"Error processing file data: {str(e)}")
|
logger.error(f"Error processing file data: {str(e)}")
|
||||||
raise FileProcessingError(f"Failed to process file data: {str(e)}")
|
raise FileProcessingError(f"Failed to process file data: {str(e)}")
|
||||||
|
|
||||||
def _detectContentTypeFromData(self, fileData: bytes, filename: str) -> str:
|
|
||||||
"""Detect content type from file data and filename"""
|
|
||||||
try:
|
|
||||||
# Check file extension first
|
|
||||||
ext = os.path.splitext(filename)[1].lower()
|
|
||||||
if ext:
|
|
||||||
# Map common extensions to MIME types
|
|
||||||
extToMime = {
|
|
||||||
'.txt': 'text/plain',
|
|
||||||
'.md': 'text/markdown',
|
|
||||||
'.csv': 'text/csv',
|
|
||||||
'.json': 'application/json',
|
|
||||||
'.xml': 'application/xml',
|
|
||||||
'.js': 'application/javascript',
|
|
||||||
'.py': 'application/x-python',
|
|
||||||
'.svg': 'image/svg+xml',
|
|
||||||
'.jpg': 'image/jpeg',
|
|
||||||
'.png': 'image/png',
|
|
||||||
'.gif': 'image/gif',
|
|
||||||
'.pdf': 'application/pdf',
|
|
||||||
'.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
|
|
||||||
'.doc': 'application/msword',
|
|
||||||
'.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
|
|
||||||
'.xls': 'application/vnd.ms-excel',
|
|
||||||
'.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
|
|
||||||
'.ppt': 'application/vnd.ms-powerpoint'
|
|
||||||
}
|
|
||||||
if ext in extToMime:
|
|
||||||
return extToMime[ext]
|
|
||||||
|
|
||||||
# Try to detect from content
|
|
||||||
if fileData.startswith(b'%PDF'):
|
|
||||||
return 'application/pdf'
|
|
||||||
elif fileData.startswith(b'PK\x03\x04'):
|
|
||||||
# ZIP-based formats (docx, xlsx, pptx)
|
|
||||||
return 'application/zip'
|
|
||||||
elif fileData.startswith(b'<'):
|
|
||||||
# XML-based formats
|
|
||||||
try:
|
|
||||||
text = fileData.decode('utf-8', errors='ignore')
|
|
||||||
if '<svg' in text.lower():
|
|
||||||
return 'image/svg+xml'
|
|
||||||
elif '<html' in text.lower():
|
|
||||||
return 'text/html'
|
|
||||||
else:
|
|
||||||
return 'application/xml'
|
|
||||||
except:
|
|
||||||
pass
|
|
||||||
|
|
||||||
return 'application/octet-stream'
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error detecting content type from data: {str(e)}")
|
|
||||||
return 'application/octet-stream'
|
|
||||||
|
|
||||||
async def _processText(self, fileData: bytes, filename: str, mimeType: str) -> List[ContentItem]:
|
async def _processText(self, fileData: bytes, filename: str, mimeType: str) -> List[ContentItem]:
|
||||||
"""Process text document"""
|
"""Process text document"""
|
||||||
|
|
@ -546,14 +496,22 @@ class DocumentProcessor:
|
||||||
try:
|
try:
|
||||||
# Get content type from metadata
|
# Get content type from metadata
|
||||||
mimeType = item.metadata.mimeType if hasattr(item.metadata, 'mimeType') else "text/plain"
|
mimeType = item.metadata.mimeType if hasattr(item.metadata, 'mimeType') else "text/plain"
|
||||||
|
logger.debug(f"Processing content item with MIME type: {mimeType}, label: {item.label}")
|
||||||
|
|
||||||
# Chunk content based on type
|
# Chunk content based on type
|
||||||
if mimeType.startswith('text/'):
|
if mimeType.startswith('text/'):
|
||||||
chunks = self._chunkText(item.data, mimeType)
|
chunks = self._chunkText(item.data, mimeType)
|
||||||
elif mimeType.startswith('image/'):
|
elif mimeType.startswith('image/'):
|
||||||
chunks = self._chunkImage(item.data)
|
# Images should not be chunked - process as single unit
|
||||||
elif mimeType.startswith('video/'):
|
chunks = [item.data]
|
||||||
chunks = self._chunkVideo(item.data)
|
elif mimeType == "application/pdf":
|
||||||
|
chunks = self._chunkPdf(item.data)
|
||||||
|
elif mimeType == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
|
||||||
|
chunks = self._chunkDocx(item.data)
|
||||||
|
elif mimeType == "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet":
|
||||||
|
chunks = self._chunkXlsx(item.data)
|
||||||
|
elif mimeType.startswith('application/vnd.openxmlformats-officedocument.presentationml.presentation'):
|
||||||
|
chunks = self._chunkPptx(item.data)
|
||||||
else:
|
else:
|
||||||
# Binary data - no chunking
|
# Binary data - no chunking
|
||||||
chunks = [item.data]
|
chunks = [item.data]
|
||||||
|
|
@ -561,26 +519,42 @@ class DocumentProcessor:
|
||||||
# Process each chunk
|
# Process each chunk
|
||||||
chunkResults = []
|
chunkResults = []
|
||||||
for chunk in chunks:
|
for chunk in chunks:
|
||||||
# Neutralize content if neutralizer is enabled
|
# Process with AI based on content type
|
||||||
contentToProcess = chunk
|
try:
|
||||||
if self._neutralizer and contentToProcess:
|
logger.debug(f"AI processing chunk with MIME type: {mimeType}")
|
||||||
contentToProcess = self._neutralizer.neutralize(contentToProcess)
|
if mimeType.startswith('image/'):
|
||||||
|
# For images, use image AI service with base64 data
|
||||||
|
# chunk is already base64 encoded string from _processImage
|
||||||
|
# Use the original prompt directly for images (no content embedding)
|
||||||
|
logger.debug(f"Calling image AI service for MIME type: {mimeType}")
|
||||||
|
processedContent = await self._serviceContainer.callAiImageBasic(prompt, chunk, mimeType)
|
||||||
|
else:
|
||||||
|
# For text content, use text AI service
|
||||||
|
# Neutralize content if neutralizer is enabled (only for text)
|
||||||
|
contentToProcess = chunk
|
||||||
|
if self._neutralizer and contentToProcess:
|
||||||
|
contentToProcess = self._neutralizer.neutralize(contentToProcess)
|
||||||
|
|
||||||
# Create AI prompt for this chunk
|
# Create AI prompt for text content
|
||||||
aiPrompt = f"""
|
aiPrompt = f"""
|
||||||
Extract relevant information from this content based on the following prompt:
|
Extract relevant information from this content based on the following prompt:
|
||||||
|
|
||||||
PROMPT: {prompt}
|
PROMPT: {prompt}
|
||||||
|
|
||||||
CONTENT:
|
CONTENT:
|
||||||
{contentToProcess}
|
{contentToProcess}
|
||||||
|
|
||||||
Return ONLY the extracted information in a clear, concise format.
|
Return ONLY the extracted information in a clear, concise format.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
# Note: This would need to be implemented with actual AI service
|
logger.debug(f"Calling text AI service for MIME type: {mimeType}")
|
||||||
# For now, just return the original content
|
processedContent = await self._serviceContainer.callAiTextBasic(aiPrompt, contentToProcess)
|
||||||
chunkResults.append(contentToProcess)
|
|
||||||
|
chunkResults.append(processedContent)
|
||||||
|
except Exception as aiError:
|
||||||
|
logger.error(f"AI processing failed for chunk: {str(aiError)}")
|
||||||
|
# Fallback to original content
|
||||||
|
chunkResults.append(chunk)
|
||||||
|
|
||||||
# Combine chunk results
|
# Combine chunk results
|
||||||
combinedResult = "\n".join(chunkResults)
|
combinedResult = "\n".join(chunkResults)
|
||||||
|
|
@ -604,6 +578,8 @@ class DocumentProcessor:
|
||||||
|
|
||||||
return processedItems
|
return processedItems
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def _chunkText(self, content: str, mimeType: str) -> List[str]:
|
def _chunkText(self, content: str, mimeType: str) -> List[str]:
|
||||||
"""Chunk text content based on mime type"""
|
"""Chunk text content based on mime type"""
|
||||||
if mimeType == "text/plain":
|
if mimeType == "text/plain":
|
||||||
|
|
```diff
@@ -765,36 +741,6 @@ class DocumentProcessor:
         except Exception:
             return [content]

-    def _chunkImage(self, content: str) -> List[str]:
-        """Chunk image content"""
-        try:
-            imageData = base64.b64decode(content)
-            chunks = []
-            chunkSize = self.chunkSizes["image"]
-
-            for i in range(0, len(imageData), chunkSize):
-                chunk = imageData[i:i + chunkSize]
-                chunks.append(base64.b64encode(chunk).decode('utf-8'))
-
-            return chunks
-        except Exception:
-            return [content]
-
-    def _chunkVideo(self, content: str) -> List[str]:
-        """Chunk video content"""
-        try:
-            videoData = base64.b64decode(content)
-            chunks = []
-            chunkSize = self.chunkSizes["video"]
-
-            for i in range(0, len(videoData), chunkSize):
-                chunk = videoData[i:i + chunkSize]
-                chunks.append(base64.b64encode(chunk).decode('utf-8'))
-
-            return chunks
-        except Exception:
-            return [content]
-
     def _chunkBinary(self, content: str) -> List[str]:
         """Chunk binary content"""
         try:
```
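The removed image and video chunkers both follow the same pattern: base64-decode the payload, slice the raw bytes at a fixed size, and re-encode each slice. A useful property of slicing before re-encoding is that the chunks reassemble losslessly. A minimal stdlib sketch of that round trip (the `chunk_size` value here is illustrative, not the manager's configured size):

```python
import base64

def chunk_base64(content: str, chunk_size: int) -> list[str]:
    """Decode, slice the raw bytes, and re-encode each slice."""
    raw = base64.b64decode(content)
    return [
        base64.b64encode(raw[i:i + chunk_size]).decode("utf-8")
        for i in range(0, len(raw), chunk_size)
    ]

def reassemble(chunks: list[str]) -> bytes:
    """Concatenating the decoded chunks restores the original payload."""
    return b"".join(base64.b64decode(c) for c in chunks)

payload = bytes(range(256)) * 4          # 1024 bytes of sample data
encoded = base64.b64encode(payload).decode("utf-8")
chunks = chunk_base64(encoded, 100)      # 100-byte slices -> 11 chunks
assert reassemble(chunks) == payload
print(len(chunks))                       # -> 11
```

Note that each chunk is independently valid base64, so a downstream consumer can decode chunks one at a time without buffering the whole file.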
```diff
@@ -810,4 +756,87 @@ class DocumentProcessor:
         except Exception:
             return [content]

+    async def _chunkPdf(self, content: str) -> List[str]:
+        """Chunk PDF content"""
+        try:
+            pdfData = base64.b64decode(content)
+            chunks = []
+            chunkSize = self.chunkSizes["pdf"]
+
+            with io.BytesIO(pdfData) as pdfStream:
+                pdfReader = PyPDF2.PdfReader(pdfStream)
+                for pageNum in range(len(pdfReader.pages)):
+                    page = pdfReader.pages[pageNum]
+                    pageText = page.extract_text()
+                    if pageText:
+                        chunks.append(pageText)
+
+            return chunks
+        except Exception:
+            return [content]
+
+    async def _chunkDocx(self, content: str) -> List[str]:
+        """Chunk Word document content"""
+        try:
+            docxData = base64.b64decode(content)
+            chunks = []
+            chunkSize = self.chunkSizes["docx"]
+
+            with io.BytesIO(docxData) as docxStream:
+                doc = docx.Document(docxStream)
+                for para in doc.paragraphs:
+                    chunks.append(para.text)
+                for table in doc.tables:
+                    for row in table.rows:
+                        rowText = []
+                        for cell in row.cells:
+                            rowText.append(cell.text)
+                        chunks.append(" | ".join(rowText))
+
+            return chunks
+        except Exception:
+            return [content]
+
+    async def _chunkXlsx(self, content: str) -> List[str]:
+        """Chunk Excel document content"""
+        try:
+            xlsxData = base64.b64decode(content)
+            chunks = []
+            chunkSize = self.chunkSizes["xlsx"]
+
+            with io.BytesIO(xlsxData) as xlsxStream:
+                workbook = openpyxl.load_workbook(xlsxStream, data_only=True)
+                for sheetName in workbook.sheetnames:
+                    sheet = workbook[sheetName]
+                    for row in sheet.iter_rows():
+                        rowText = []
+                        for cell in row:
+                            value = cell.value
+                            if value is None:
+                                rowText.append("")
+                            else:
+                                rowText.append(str(value).replace('"', '""'))
+                        chunks.append(','.join(f'"{cell}"' for cell in rowText))
+
+            return chunks
+        except Exception:
+            return [content]
+
+    async def _chunkPptx(self, content: str) -> List[str]:
+        """Chunk PowerPoint document content"""
+        try:
+            pptxData = base64.b64decode(content)
+            chunks = []
+            chunkSize = self.chunkSizes["pptx"]
+
+            with io.BytesIO(pptxData) as pptxStream:
+                # openpyxl is not suitable for PowerPoint, so we'll just read text
+                # This is a placeholder and would require a different library for full pptx processing
+                # For now, we'll just return the base64 encoded content as a single chunk
+                chunks.append(content)
+
+            return chunks
+        except Exception:
+            return [content]
```
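The `_chunkPptx` placeholder above notes that openpyxl cannot parse PowerPoint files and falls back to a single chunk. If adding a dependency such as python-pptx is not an option, a `.pptx` file is itself a ZIP of DrawingML XML, so slide text can be approximated with the stdlib alone. A rough sketch under that assumption (the regex only catches plain `<a:t>` text runs, not every PowerPoint construct):

```python
import io
import re
import zipfile

def pptx_slide_texts(pptx_bytes: bytes) -> list[str]:
    """Return one text chunk per slide by scanning <a:t> runs in the slide XML."""
    chunks = []
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as zf:
        slide_names = sorted(
            n for n in zf.namelist()
            if n.startswith("ppt/slides/slide") and n.endswith(".xml")
        )
        for name in slide_names:
            xml = zf.read(name).decode("utf-8", errors="ignore")
            # <a:t> elements hold the visible text runs in DrawingML
            runs = re.findall(r"<a:t>(.*?)</a:t>", xml, flags=re.S)
            chunks.append(" ".join(runs))
    return chunks

# Build a minimal stand-in archive to exercise the function
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("ppt/slides/slide1.xml", "<p:sld><a:t>Hello</a:t><a:t>world</a:t></p:sld>")
print(pptx_slide_texts(buf.getvalue()))  # -> ['Hello world']
```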
```diff
@@ -2,6 +2,7 @@ import logging
 import importlib
 import pkgutil
 import inspect
+import os
 from typing import Dict, Any, List, Optional
 from modules.interfaces.interfaceAppModel import User, UserConnection
 from modules.interfaces.interfaceChatModel import (
```
```diff
@@ -111,6 +112,155 @@ class ServiceContainer:
         except Exception as e:
             logger.error(f"Error discovering methods: {str(e)}")

+    def detectContentTypeFromData(self, fileData: bytes, filename: str) -> str:
+        """
+        Detect content type from file data and filename.
+        This method makes the MIME type detection function accessible through the service container.
+
+        Args:
+            fileData: Raw file data as bytes
+            filename: Name of the file
+
+        Returns:
+            str: Detected MIME type
+        """
+        try:
+            # Check file extension first
+            ext = os.path.splitext(filename)[1].lower()
+            if ext:
+                # Map common extensions to MIME types
+                extToMime = {
+                    '.txt': 'text/plain',
+                    '.md': 'text/markdown',
+                    '.csv': 'text/csv',
+                    '.json': 'application/json',
+                    '.xml': 'application/xml',
+                    '.js': 'application/javascript',
+                    '.py': 'application/x-python',
+                    '.svg': 'image/svg+xml',
+                    '.jpg': 'image/jpeg',
+                    '.jpeg': 'image/jpeg',
+                    '.png': 'image/png',
+                    '.gif': 'image/gif',
+                    '.bmp': 'image/bmp',
+                    '.webp': 'image/webp',
+                    '.pdf': 'application/pdf',
+                    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
+                    '.doc': 'application/msword',
+                    '.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
+                    '.xls': 'application/vnd.ms-excel',
+                    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
+                    '.ppt': 'application/vnd.ms-powerpoint',
+                    '.html': 'text/html',
+                    '.htm': 'text/html',
+                    '.css': 'text/css',
+                    '.zip': 'application/zip',
+                    '.rar': 'application/x-rar-compressed',
+                    '.7z': 'application/x-7z-compressed',
+                    '.tar': 'application/x-tar',
+                    '.gz': 'application/gzip'
+                }
+                if ext in extToMime:
+                    return extToMime[ext]
+
+            # Try to detect from content
+            if fileData.startswith(b'%PDF'):
+                return 'application/pdf'
+            elif fileData.startswith(b'PK\x03\x04'):
+                # ZIP-based formats (docx, xlsx, pptx)
+                return 'application/zip'
+            elif fileData.startswith(b'<'):
+                # XML-based formats
+                try:
+                    text = fileData.decode('utf-8', errors='ignore')
+                    if '<svg' in text.lower():
+                        return 'image/svg+xml'
+                    elif '<html' in text.lower():
+                        return 'text/html'
+                    else:
+                        return 'application/xml'
+                except:
+                    pass
+            elif fileData.startswith(b'\x89PNG\r\n\x1a\n'):
+                return 'image/png'
+            elif fileData.startswith(b'\xff\xd8\xff'):
+                return 'image/jpeg'
+            elif fileData.startswith(b'GIF87a') or fileData.startswith(b'GIF89a'):
+                return 'image/gif'
+            elif fileData.startswith(b'BM'):
+                return 'image/bmp'
+            elif fileData.startswith(b'RIFF') and fileData[8:12] == b'WEBP':
+                return 'image/webp'
+
+            return 'application/octet-stream'
+
+        except Exception as e:
+            logger.error(f"Error detecting content type from data: {str(e)}")
+            return 'application/octet-stream'
+
+    def getMimeTypeFromExtension(self, extension: str) -> str:
+        """
+        Get MIME type based on file extension.
+        This method consolidates MIME type detection from extension.
+
+        Args:
+            extension: File extension (with or without dot)
+
+        Returns:
+            str: MIME type for the extension
+        """
+        # Normalize extension (remove dot if present)
+        if extension.startswith('.'):
+            extension = extension[1:]
+
+        # Map extensions to MIME types
+        mime_types = {
+            'txt': 'text/plain',
+            'json': 'application/json',
+            'xml': 'application/xml',
+            'csv': 'text/csv',
+            'html': 'text/html',
+            'htm': 'text/html',
+            'md': 'text/markdown',
+            'py': 'text/x-python',
+            'js': 'application/javascript',
+            'css': 'text/css',
+            'pdf': 'application/pdf',
+            'doc': 'application/msword',
+            'docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
+            'xls': 'application/vnd.ms-excel',
+            'xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
+            'ppt': 'application/vnd.ms-powerpoint',
+            'pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
+            'svg': 'image/svg+xml',
+            'jpg': 'image/jpeg',
+            'jpeg': 'image/jpeg',
+            'png': 'image/png',
+            'gif': 'image/gif',
+            'bmp': 'image/bmp',
+            'webp': 'image/webp',
+            'zip': 'application/zip',
+            'rar': 'application/x-rar-compressed',
+            '7z': 'application/x-7z-compressed',
+            'tar': 'application/x-tar',
+            'gz': 'application/gzip'
+        }
+        return mime_types.get(extension.lower(), 'application/octet-stream')
+
+    def getFileExtension(self, filename: str) -> str:
+        """
+        Extract file extension from filename.
+
+        Args:
+            filename: Name of the file
+
+        Returns:
+            str: File extension (without dot)
+        """
+        if '.' in filename:
+            return filename.split('.')[-1].lower()
+        return "txt"  # Default to text
+
     # ===== Functions =====

     def extractContent(self, prompt: str, document: ChatDocument) -> str:
```
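`detectContentTypeFromData` above returns a generic `application/zip` for any `PK\x03\x04` signature, which covers docx, xlsx, and pptx alike. When the OOXML subtype matters, it can be recovered by peeking at the archive's top-level folders. A hedged stdlib sketch (the marker prefixes follow the standard OOXML package layout; `refine_zip_mime` is an illustrative name, not part of this codebase):

```python
import io
import zipfile

# Top-level package folders that identify each OOXML flavour
OOXML_MARKERS = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}

def refine_zip_mime(data: bytes) -> str:
    """Narrow a generic ZIP signature down to a concrete OOXML MIME type."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = zf.namelist()
    except zipfile.BadZipFile:
        return "application/zip"
    for prefix, mime in OOXML_MARKERS.items():
        if any(n.startswith(prefix) for n in names):
            return mime
    return "application/zip"

# Exercise with a minimal in-memory archive shaped like a .docx
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
print(refine_zip_mime(buf.getvalue()))  # -> the wordprocessingml MIME type
```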
```diff
@@ -399,11 +549,11 @@ Please provide a clear summary of this message."""
         """Advanced text processing using Anthropic"""
         return self.interfaceAiCalls.callAiTextAdvanced(prompt, context)

-    def callAiImageBasic(self, prompt: str, imageData: bytes, mimeType: str) -> str:
+    def callAiImageBasic(self, prompt: str, imageData: str, mimeType: str) -> str:
         """Basic image processing using OpenAI"""
         return self.interfaceAiCalls.callAiImageBasic(prompt, imageData, mimeType)

-    def callAiImageAdvanced(self, prompt: str, imageData: bytes, mimeType: str) -> str:
+    def callAiImageAdvanced(self, prompt: str, imageData: str, mimeType: str) -> str:
         """Advanced image processing using Anthropic"""
         return self.interfaceAiCalls.callAiImageAdvanced(prompt, imageData, mimeType)
```
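The hunk above changes `imageData` in both image helpers from `bytes` to `str`. Assuming the string now carries base64 text (consistent with how the chunkers elsewhere in this commit handle binary payloads), callers that still hold raw bytes would convert at the boundary. A minimal sketch with illustrative helper names:

```python
import base64

def to_image_payload(raw: bytes) -> str:
    """Encode raw image bytes as the base64 str the new signatures expect."""
    return base64.b64encode(raw).decode("ascii")

def from_image_payload(payload: str) -> bytes:
    """Recover raw bytes when the underlying API needs them again."""
    return base64.b64decode(payload)

png_magic = b"\x89PNG\r\n\x1a\n"
payload = to_image_payload(png_magic)
assert from_image_payload(payload) == png_magic
print(payload)  # -> 'iVBORw0KGgo='
```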
```diff
@@ -463,6 +613,30 @@ Please provide a clear summary of this message."""
             mimeType=mimeType
         )

+    def extractTextFromContentObjects(self, content_objects: List[Any]) -> List[str]:
+        """
+        Extract text content from ExtractedContent objects or other content objects.
+
+        Args:
+            content_objects: List of ExtractedContent objects or other content objects
+
+        Returns:
+            List of extracted text strings
+        """
+        text_contents = []
+        for content_obj in content_objects:
+            if hasattr(content_obj, 'contents') and content_obj.contents:
+                # Extract text from ContentItem objects
+                for content_item in content_obj.contents:
+                    if hasattr(content_item, 'data') and content_item.data:
+                        text_contents.append(content_item.data)
+            elif isinstance(content_obj, str):
+                text_contents.append(content_obj)
+            else:
+                # Fallback: convert to string representation
+                text_contents.append(str(content_obj))
+        return text_contents
+
     async def executeAction(self, methodName: str, actionName: str, parameters: Dict[str, Any]) -> ActionResult:
         """Execute a method action"""
         try:
```
**run_document_test.ps1** (new file, 31 lines)

```powershell
# PowerShell script to run document extraction test
# Usage: .\run_document_test.ps1 [file_path]

param(
    [string]$FilePath = "test_sample_document.txt"
)

Write-Host "=== PowerOn Document Extraction Test ===" -ForegroundColor Green
Write-Host ""

# Check if file exists
if (-not (Test-Path $FilePath)) {
    Write-Host "Error: File not found: $FilePath" -ForegroundColor Red
    Write-Host "Please provide a valid file path as parameter or ensure test_sample_document.txt exists." -ForegroundColor Yellow
    exit 1
}

Write-Host "Testing document extraction for file: $FilePath" -ForegroundColor Cyan
Write-Host "Log file will be: test_document_extraction.log" -ForegroundColor Cyan
Write-Host ""

# Run the Python test
try {
    python test_document_extraction.py $FilePath
    Write-Host ""
    Write-Host "Test completed successfully!" -ForegroundColor Green
    Write-Host "Check test_document_extraction.log for detailed results." -ForegroundColor Cyan
} catch {
    Write-Host "Test failed with error: $($_.Exception.Message)" -ForegroundColor Red
    exit 1
}
```
**test_document_extraction.py** (new file, 288 lines)

```python
#!/usr/bin/env python3
"""
Test procedure for DocumentManager document extraction functionality.
"""

import asyncio
import sys
import os
import json
import argparse
from datetime import datetime, UTC
from pathlib import Path
import logging

print("Starting test_document_extraction.py...")

# Configure logging FIRST, before any other imports

# Clear any existing handlers to avoid duplicate logs
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout),
        logging.FileHandler('test_document_extraction.log', mode='w', encoding='utf-8')  # 'w' mode clears the file
    ],
    force=True  # Force reconfiguration even if already configured
)

# Filter out httpcore messages
logging.getLogger('httpcore').setLevel(logging.WARNING)
logging.getLogger('httpx').setLevel(logging.WARNING)

logger = logging.getLogger(__name__)

# Set up test configuration
os.environ['POWERON_CONFIG_FILE'] = 'test_config.ini'
print("Set POWERON_CONFIG_FILE environment variable")

try:
    # Import required modules
    from modules.interfaces.interfaceAppObjects import User, UserConnection
    from modules.interfaces.interfaceChatModel import ChatWorkflow
    from modules.workflow.managerDocument import DocumentManager
    from modules.workflow.serviceContainer import ServiceContainer
    print("All imports successful")
except Exception as e:
    print(f"Import error: {e}")
    import traceback
    traceback.print_exc()
    sys.exit(1)


def log_extraction_debug(message: str, data: dict = None):
    """Log extraction debug data with JSON dumps"""
    timestamp = datetime.now(UTC).isoformat()
    if data:
        logger.debug(f"[{timestamp}] {message}\n{json.dumps(data, indent=2, ensure_ascii=False)}")
    else:
        logger.debug(f"[{timestamp}] {message}")


def create_test_user() -> User:
    """Create a test user for the document extraction"""
    return User(
        id="test-user-doc-001",
        mandateId="test-mandate-doc-001",
        username="testuser_doc",
        email="test_doc@example.com",
        fullName="Test Document User",
        enabled=True,
        language="en",
        privilege="user",
        authenticationAuthority="local"
    )


def create_test_workflow() -> ChatWorkflow:
    """Create a test workflow for document extraction"""
    return ChatWorkflow(
        id="test-workflow-doc-001",
        mandateId="test-mandate-doc-001",
        status="running",
        name="Document Extraction Test Workflow",
        currentRound=1,
        lastActivity=datetime.now(UTC).isoformat(),
        startedAt=datetime.now(UTC).isoformat(),
        logs=[],
        messages=[],
        stats=None,
        tasks=[]
    )


def detect_mime_type(file_path: str) -> str:
    """Detect MIME type based on file extension"""
    ext = Path(file_path).suffix.lower()
    mime_types = {
        '.txt': 'text/plain',
        '.md': 'text/markdown',
        '.csv': 'text/csv',
        '.json': 'application/json',
        '.xml': 'application/xml',
        '.js': 'application/javascript',
        '.py': 'application/x-python',
        '.svg': 'image/svg+xml',
        '.jpg': 'image/jpeg',
        '.jpeg': 'image/jpeg',
        '.png': 'image/png',
        '.gif': 'image/gif',
        '.pdf': 'application/pdf',
        '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
        '.doc': 'application/msword',
        '.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        '.xls': 'application/vnd.ms-excel',
        '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
        '.ppt': 'application/vnd.ms-powerpoint',
        '.html': 'text/html',
        '.htm': 'text/html'
    }
    return mime_types.get(ext, 'application/octet-stream')


async def test_document_extraction(file_path: str):
    """Test document extraction from a file path"""
    try:
        # Clear the log file before each run
        log_file_path = "test_document_extraction.log"
        if os.path.exists(log_file_path):
            with open(log_file_path, 'w') as f:
                f.write("")  # Clear the file
            logger.info(f"Cleared log file: {log_file_path}")

        logger.info("=== STARTING DOCUMENT EXTRACTION TEST ===")

        # Validate file path
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        # Get file info
        file_path_obj = Path(file_path)
        filename = file_path_obj.name
        mime_type = detect_mime_type(file_path)
        file_size = file_path_obj.stat().st_size

        log_extraction_debug("File information", {
            "file_path": file_path,
            "filename": filename,
            "mime_type": mime_type,
            "file_size_bytes": file_size,
            "file_size_mb": round(file_size / (1024 * 1024), 2)
        })

        # Read file data
        try:
            with open(file_path, 'rb') as f:
                file_data = f.read()
            log_extraction_debug("File read successfully", {
                "bytes_read": len(file_data),
                "file_encoding": "binary"
            })
        except Exception as e:
            logger.error(f"Error reading file: {str(e)}")
            raise

        # Create test user and workflow
        test_user = create_test_user()
        test_workflow = create_test_workflow()

        # Create service container
        service_container = ServiceContainer(test_user, test_workflow)
        log_extraction_debug("Service container created", {
            "user_id": test_user.id,
            "workflow_id": test_workflow.id
        })

        # Create document manager
        document_manager = DocumentManager(service_container)
        log_extraction_debug("Document manager created")

        # Define extraction prompt
        extraction_prompt = "extract the table and convert it to a csv table"

        log_extraction_debug("Starting document extraction", {
            "prompt": extraction_prompt,
            "filename": filename,
            "mime_type": mime_type
        })

        # Extract content from file data
        try:
            extracted_content = await document_manager.extractContentFromFileData(
                prompt=extraction_prompt,
                fileData=file_data,
                filename=filename,
                mimeType=mime_type,
                base64Encoded=False,
                documentId=f"test-doc-{datetime.now(UTC).timestamp()}"
            )

            # Log extraction results
            extraction_result = {
                "extracted_content_id": extracted_content.id,
                "content_items_count": len(extracted_content.contents)
            }

            # Add objectId and objectType if they exist (set by DocumentManager)
            if hasattr(extracted_content, 'objectId'):
                extraction_result["object_id"] = extracted_content.objectId
            if hasattr(extracted_content, 'objectType'):
                extraction_result["object_type"] = extracted_content.objectType

            log_extraction_debug("Document extraction completed successfully", extraction_result)

            # Log detailed content information
            for i, content_item in enumerate(extracted_content.contents):
                content_info = {
                    "label": content_item.label,
                    "data_length": len(content_item.data) if content_item.data else 0,
                    "data_preview": content_item.data[:500] + "..." if content_item.data and len(content_item.data) > 500 else content_item.data
                }

                # Add metadata if available
                if content_item.metadata:
                    content_info["metadata"] = {
                        "size": content_item.metadata.size,
                        "mime_type": content_item.metadata.mimeType,
                        "base64_encoded": content_item.metadata.base64Encoded,
                        "pages": content_item.metadata.pages
                    }

                log_extraction_debug(f"CONTENT ITEM {i+1}:", content_info)

            # Log summary of all extracted content
            all_content = "\n\n".join([item.data for item in extracted_content.contents if item.data])
            log_extraction_debug("COMPLETE EXTRACTED CONTENT:", {
                "total_length": len(all_content),
                "content": all_content
            })

            return extracted_content

        except Exception as e:
            log_extraction_debug("DOCUMENT EXTRACTION EXCEPTION:", {
                "error_type": type(e).__name__,
                "error_message": str(e),
                "error_args": e.args if hasattr(e, 'args') else None
            })
            raise

        logger.info("=== DOCUMENT EXTRACTION TEST COMPLETED ===")
        return extracted_content

    except Exception as e:
        logger.error(f"❌ Document extraction test failed with error: {str(e)}")
        log_extraction_debug("Full error details", {
            "error_type": type(e).__name__,
            "error_message": str(e)
        })
        raise


async def main():
    """Main function to run the document extraction test"""
    print("Inside main()")
    logger.info("=" * 50)
    logger.info("DOCUMENT EXTRACTION TEST")
    logger.info("=" * 50)

    # Parse command line arguments
    parser = argparse.ArgumentParser(description='Test document extraction functionality')
    parser.add_argument('file_path', help='Path to the file to extract content from')
    args = parser.parse_args()

    try:
        extracted_content = await test_document_extraction(args.file_path)
        logger.info("=" * 50)
        logger.info("TEST COMPLETED SUCCESSFULLY")
        logger.info("=" * 50)
        return extracted_content
    except Exception as e:
        logger.error("=" * 50)
        logger.error("TEST FAILED")
        logger.error("=" * 50)
        raise


if __name__ == "__main__":
    print("About to run main()")
    asyncio.run(main())
    print("main() finished")
```
289
test_retry_enhancement.py
Normal file
289
test_retry_enhancement.py
Normal file
|
|
@ -0,0 +1,289 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Test script for retry enhancement in managerChat.py
|
||||||
|
Tests that previous action results and review feedback are properly passed to retry prompts.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import logging
|
||||||
|
import sys
|
||||||
|
import os
|
||||||
|
|
||||||
|
# Add the gateway directory to the Python path
|
||||||
|
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'gateway'))
|
||||||
|
|
||||||
|
from modules.workflow.managerChat import ChatManager
|
||||||
|
from modules.interfaces.interfaceAppModel import User
|
||||||
|
from modules.interfaces.interfaceChatModel import ChatWorkflow, ChatMessage
|
||||||
|
from modules.interfaces.interfaceChatObjects import ChatObjects
|
||||||
|
|
||||||
|
# Configure logging
|
||||||
|
logging.basicConfig(level=logging.DEBUG)
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class MockChatObjects(ChatObjects):
|
||||||
|
"""Mock implementation of ChatObjects for testing"""
|
||||||
|
|
||||||
|
def createTaskAction(self, action_data):
|
||||||
|
"""Mock task action creation"""
|
||||||
|
class MockTaskAction:
|
||||||
|
def __init__(self, data):
|
||||||
|
self.id = "test_action_id"
|
||||||
|
self.execMethod = data.get("execMethod", "unknown")
|
||||||
|
self.execAction = data.get("execAction", "unknown")
|
||||||
|
self.execParameters = data.get("execParameters", {})
|
||||||
|
self.execResultLabel = data.get("execResultLabel", "")
|
||||||
|
self.status = data.get("status", "PENDING")
|
||||||
|
self.result = ""
|
||||||
|
self.error = ""
|
||||||
|
|
||||||
|
def setSuccess(self):
|
||||||
|
self.status = "COMPLETED"
|
||||||
|
|
||||||
|
def setError(self, error):
|
||||||
|
self.status = "FAILED"
|
||||||
|
self.error = error
|
||||||
|
|
||||||
|
def isSuccessful(self):
|
||||||
|
return self.status == "COMPLETED"
|
||||||
|
|
||||||
|
return MockTaskAction(action_data)
|
||||||
|
|
||||||
|
def createChatDocument(self, document_data):
|
||||||
|
"""Mock document creation"""
|
||||||
|
class MockChatDocument:
|
||||||
|
def __init__(self, data):
|
||||||
|
self.fileId = data.get("fileId", "")
|
||||||
|
self.filename = data.get("filename", "unknown")
|
||||||
|
self.fileSize = data.get("fileSize", 0)
|
||||||
|
self.mimeType = data.get("mimeType", "application/octet-stream")
|
||||||
|
self.content = ""
|
||||||
|
|
||||||
|
return MockChatDocument(document_data)
|
||||||
|
|
||||||
|
def createWorkflowMessage(self, message_data):
|
||||||
|
"""Mock message creation"""
|
||||||
|
class MockWorkflowMessage:
|
||||||
|
def __init__(self, data):
|
||||||
|
self.workflowId = data.get("workflowId", "")
|
||||||
|
self.role = data.get("role", "assistant")
|
||||||
|
self.message = data.get("message", "")
|
||||||
|
self.status = data.get("status", "step")
|
||||||
|
self.sequenceNr = data.get("sequenceNr", 1)
|
||||||
|
self.publishedAt = data.get("publishedAt", "")
|
||||||
|
self.actionId = data.get("actionId", "")
|
||||||
|
self.actionMethod = data.get("actionMethod", "")
|
||||||
|
self.actionName = data.get("actionName", "")
|
||||||
|
self.documentsLabel = data.get("documentsLabel", "")
|
||||||
|
self.documents = data.get("documents", [])
|
||||||
|
|
||||||
|
return MockWorkflowMessage(message_data)
|
||||||
|
|
||||||
|
class MockServiceContainer:
|
||||||
|
"""Mock service container for testing"""
|
||||||
|
|
||||||
|
def __init__(self, user, workflow):
|
||||||
|
self.user = user
|
||||||
|
self.workflow = workflow
|
||||||
|
|
||||||
|
def getMethodsList(self):
|
||||||
|
"""Mock methods list"""
|
||||||
|
return ["document.extract(documentList, aiPrompt)", "document.analyze(documentList, aiPrompt)"]
|
||||||
|
|
||||||
|
async def summarizeChat(self, messages):
|
||||||
|
"""Mock chat summarization"""
|
||||||
|
return "Mock chat history summary"
|
||||||
|
|
||||||
|
def getDocumentReferenceList(self):
|
||||||
|
"""Mock document references"""
|
||||||
|
return {
|
||||||
|
'chat': [],
|
||||||
|
'history': []
|
||||||
|
}
|
||||||
|
|
||||||
|
def getConnectionReferenceList(self):
|
||||||
|
"""Mock connection references"""
|
||||||
|
return ["connection1", "connection2"]
|
||||||
|
|
||||||
|
def getFileInfo(self, fileId):
|
||||||
|
"""Mock file info"""
|
||||||
|
return {
|
||||||
|
"filename": f"test_file_{fileId}.txt",
|
||||||
|
"size": 1024,
|
||||||
|
"mimeType": "text/plain"
|
||||||
|
}
|
||||||
|
|
||||||
|
def createFile(self, fileName, mimeType, content, base64encoded=False):
|
||||||
|
"""Mock file creation"""
|
||||||
|
return f"file_id_{fileName}"
|
||||||
|
|
||||||
|
def createDocument(self, fileName, mimeType, content, base64encoded=False):
|
||||||
|
"""Mock document creation"""
|
||||||
|
class MockDocument:
|
||||||
|
def __init__(self, name, mime, cont):
|
||||||
|
self.filename = name
|
||||||
|
self.mimeType = mime
|
||||||
|
self.content = cont
|
||||||
|
self.fileSize = len(cont)
|
||||||
|
|
||||||
|
return MockDocument(fileName, mimeType, content)
|
||||||
|
|
||||||
|
def getFileExtension(self, filename):
|
||||||
|
"""Mock file extension extraction"""
|
||||||
|
return filename.split('.')[-1] if '.' in filename else 'txt'
|
||||||
|
|
||||||
|
    def getMimeTypeFromExtension(self, extension):
        """Mock MIME type detection"""
        mime_types = {
            'txt': 'text/plain',
            'pdf': 'application/pdf',
            'doc': 'application/msword',
            'json': 'application/json'
        }
        return mime_types.get(extension, 'application/octet-stream')

    def detectContentTypeFromData(self, file_bytes, filename):
        """Mock content type detection"""
        if filename.endswith('.txt'):
            return 'text/plain'
        elif filename.endswith('.pdf'):
            return 'application/pdf'
        elif filename.endswith('.json'):
            return 'application/json'
        return 'application/octet-stream'

    async def callAiTextBasic(self, prompt):
        """Mock AI call"""
        return '{"actions": [{"method": "document", "action": "extract", "parameters": {"documentList": ["test"], "aiPrompt": "Test prompt"}, "resultLabel": "task1_action1_test", "description": "Test action"}]}'

    async def callAiTextAdvanced(self, prompt):
        """Mock advanced AI call"""
        return '{"overview": "Test plan", "tasks": [{"id": "task_1", "description": "Test task", "dependencies": [], "expected_outputs": ["output1"], "success_criteria": ["criteria1"], "required_documents": [], "estimated_complexity": "low", "ai_prompt": "Test prompt"}]}'

    async def executeAction(self, methodName, actionName, parameters):
        """Mock action execution"""
        class MockResult:
            def __init__(self):
                self.success = True
                self.data = {
                    "result": "Mock execution result",
                    "documents": []
                }
                self.error = None

        return MockResult()

async def test_retry_enhancement():
    """Test the retry enhancement functionality"""
    logger.info("Testing retry enhancement in managerChat.py")

    # Create mock objects
    mock_user = User(id="test_user", username="testuser", email="test@example.com", mandateId="test_mandate")
    mock_chat_objects = MockChatObjects()
    mock_workflow = ChatWorkflow(
        id="test_workflow",
        userId="test_user",
        status="active",
        messages=[],
        createdAt="2024-01-01T00:00:00Z",
        updatedAt="2024-01-01T00:00:00Z",
        mandateId="test_mandate",
        currentRound=1,
        lastActivity="2024-01-01T00:00:00Z",
        startedAt="2024-01-01T00:00:00Z"
    )

    # Create chat manager
    chat_manager = ChatManager(mock_user, mock_chat_objects)

    # Mock the service container directly instead of initializing
    chat_manager.service = MockServiceContainer(mock_user, mock_workflow)
    chat_manager.workflow = mock_workflow

    # Test 1: Basic action definition without retry
    logger.info("Test 1: Basic action definition")
    task_step = {
        "id": "task_1",
        "description": "Test task",
        "expected_outputs": ["output1"],
        "success_criteria": ["criteria1"],
        "ai_prompt": "Test AI prompt"
    }

    actions = await chat_manager.defineTaskActions(task_step, mock_workflow, [])
    logger.info(f"Generated {len(actions)} actions without retry context")

    # Test 2: Action definition with retry context
    logger.info("Test 2: Action definition with retry context")
    enhanced_context = {
        'task_step': task_step,
        'workflow': mock_workflow,
        'workflow_id': mock_workflow.id,
        'available_documents': ["test_doc.txt"],
        'previous_results': ["task0_action1_results"],
        'improvements': "Previous attempt failed - ensure comprehensive extraction",
        'retry_count': 1,
        'previous_action_results': [
            {
                'actionMethod': 'document',
                'actionName': 'extract',
                'status': 'failed',
                'error': 'Empty result returned',
                'result': 'No content extracted',
                'resultLabel': 'task1_action1_failed'
            }
        ],
        'previous_review_result': {
            'status': 'retry',
            'reason': 'Incomplete extraction',
            'quality_score': 3,
            'missing_outputs': ['detailed_analysis'],
            'unmet_criteria': ['comprehensive_coverage']
        }
    }

    retry_actions = await chat_manager.defineTaskActions(task_step, mock_workflow, [], enhanced_context)
    logger.info(f"Generated {len(retry_actions)} actions with retry context")

    # Test 3: Verify retry context is properly handled
    logger.info("Test 3: Verifying retry context handling")

    # Create a test prompt to see if retry context is included
    test_prompt = await chat_manager._createActionDefinitionPrompt(enhanced_context)

    # Check if retry context is in the prompt
    if "RETRY CONTEXT" in test_prompt:
        logger.info("✓ Retry context properly included in prompt")
    else:
        logger.error("✗ Retry context not found in prompt")

    if "Previous action results that failed" in test_prompt:
        logger.info("✓ Previous action results included in prompt")
    else:
        logger.error("✗ Previous action results not found in prompt")

    if "Previous review feedback" in test_prompt:
        logger.info("✓ Previous review feedback included in prompt")
    else:
        logger.error("✗ Previous review feedback not found in prompt")

    if "Previous attempt failed" in test_prompt:
        logger.info("✓ Improvements needed included in prompt")
    else:
        logger.error("✗ Improvements needed not found in prompt")

    # Test 4: Verify fallback actions with retry context
    logger.info("Test 4: Testing fallback actions with retry context")
    fallback_actions = chat_manager._createFallbackActions(task_step, enhanced_context)
    logger.info(f"Generated {len(fallback_actions)} fallback actions with retry context")

    # Check if fallback actions include retry information
    if any("retry" in action.get("resultLabel", "") for action in fallback_actions):
        logger.info("✓ Fallback actions include retry information")
    else:
        logger.error("✗ Fallback actions missing retry information")

    logger.info("Retry enhancement test completed successfully!")


if __name__ == "__main__":
    asyncio.run(test_retry_enhancement())
47
test_sample_document.txt
Normal file
@@ -0,0 +1,47 @@
PowerOn System Architecture Overview

This document provides a comprehensive overview of the PowerOn system architecture, including its key components, data flow, and technical specifications.

MAJOR TOPICS:

1. System Architecture
   - Frontend Agents: Web-based user interface components
   - Gateway: Central API and workflow management system
   - Database: JSON-based data storage with component interfaces
   - AI Integration: Anthropic and OpenAI connectors for intelligent processing

2. Core Components
   - Document Manager: Handles file processing and content extraction
   - Workflow Manager: Orchestrates complex business processes
   - Service Container: Provides unified access to all system services
   - Neutralizer: Data anonymization and privacy protection

3. Data Flow Architecture
   - User authentication and authorization
   - Document upload and processing pipeline
   - AI-powered content analysis and extraction
   - Workflow execution and task management
   - Result generation and storage

4. Technical Specifications
   - Python-based backend with async/await support
   - RESTful API design with JSON data exchange
   - Modular component architecture
   - Extensible method system for business logic
   - Comprehensive logging and monitoring

5. Security Features
   - Multi-authentication authority support (Local, Microsoft, Google)
   - Token-based session management
   - Data encryption and anonymization
   - Role-based access control
   - Audit trail and compliance features

6. Integration Capabilities
   - SharePoint document management
   - Email system integration (Outlook)
   - Web crawling and data collection
   - AI service integration (Anthropic, OpenAI)
   - Custom method development framework

The PowerOn system is designed to provide a comprehensive platform for intelligent document processing, workflow automation, and AI-powered business process management. It combines modern web technologies with advanced AI capabilities to deliver a robust and scalable solution for enterprise document management and workflow automation.