system test

This commit is contained in:
parent aa854f27b7
commit 86fe43e987

21 changed files with 2247 additions and 590 deletions

README_document_test.md (new file, +114)
@@ -0,0 +1,114 @@
# Document Extraction Test

This test procedure validates the DocumentManager's ability to extract content from files using AI-powered analysis.

## Files Created

- `test_document_extraction.py` - Main test script
- `test_sample_document.txt` - Sample document for testing
- `run_document_test.ps1` - PowerShell wrapper script
- `test_document_extraction.log` - Generated log file (cleared on each run)

## Usage

### Method 1: Using PowerShell Script (Recommended)

```powershell
# Test with default sample file
.\run_document_test.ps1

# Test with custom file
.\run_document_test.ps1 "path\to\your\document.pdf"
```

### Method 2: Direct Python Execution

```bash
# Test with default sample file
python test_document_extraction.py test_sample_document.txt

# Test with custom file
python test_document_extraction.py "path/to/your/document.docx"
```

## Test Features

1. **File Validation**: Checks that the specified file exists
2. **MIME Type Detection**: Automatically detects the file type based on its extension
3. **Content Extraction**: Uses the DocumentManager to extract content
4. **AI Processing**: Applies the prompt "summarize the content and give list of the major topics"
5. **Comprehensive Logging**: Logs all steps and results to `test_document_extraction.log`
6. **Log Cleanup**: Clears the log file on each test run

## Supported File Types

- Text files (.txt, .md)
- CSV files (.csv)
- JSON files (.json)
- XML files (.xml)
- HTML files (.html, .htm)
- Images (.jpg, .jpeg, .png, .gif, .svg)
- PDF files (.pdf)
- Office documents (.docx, .xlsx, .pptx)
- And more (fallback to binary processing)

## Test Output

The test generates detailed logs including:

- File information (path, size, MIME type)
- Extraction process details
- Extracted content summary
- AI-processed results
- Error details if any issues occur

## Example Output

```
=== STARTING DOCUMENT EXTRACTION TEST ===
File information: {
  "file_path": "test_sample_document.txt",
  "filename": "test_sample_document.txt",
  "mime_type": "text/plain",
  "file_size_bytes": 2048,
  "file_size_mb": 0.0
}
Document extraction completed successfully: {
  "extracted_content_id": "test-doc-1234567890",
  "content_items_count": 1,
  "object_type": "ExtractedContent"
}
COMPLETE EXTRACTED CONTENT: {
  "total_length": 1500,
  "content": "PowerOn System Architecture Overview... [AI processed summary]"
}
```

## Error Handling

The test includes comprehensive error handling for:

- File not found errors
- File reading errors
- Document processing errors
- AI processing errors
- Import errors

All errors are logged with detailed information for debugging.

## Configuration

The test uses the same configuration as other tests:

- Environment variable: `POWERON_CONFIG_FILE = 'test_config.ini'`
- Log file: `test_document_extraction.log`
- Log level: DEBUG

## Dependencies

The test requires the same dependencies as the main PowerOn system:

- Python 3.8+
- Required Python packages (see requirements.txt)
- Access to AI services (if AI processing is enabled)
- Proper configuration in test_config.ini
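The extension-based MIME detection described under Test Features can be sketched with Python's standard `mimetypes` module. This is an illustrative sketch only, not the DocumentManager's actual implementation; the function name `detect_mime_type` is hypothetical:

```python
import mimetypes

def detect_mime_type(filename: str) -> str:
    """Guess a MIME type from the file extension, falling back to binary."""
    mime_type, _ = mimetypes.guess_type(filename)
    # Unknown extensions fall back to generic binary processing,
    # mirroring the README's "fallback to binary processing" behaviour
    return mime_type or "application/octet-stream"

print(detect_mime_type("report.pdf"))       # application/pdf
print(detect_mime_type("notes.txt"))        # text/plain
print(detect_mime_type("archive.unknown"))  # application/octet-stream
```

`mimetypes.guess_type` only inspects the filename, so a mislabelled extension will be misdetected; content sniffing would require a separate library.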
@@ -98,7 +98,27 @@ class AiOpenai:
            The response from the OpenAI Vision API as text
        """
        try:
-           logger.debug(f"Starting image analysis for {mimeType} with query '{prompt}' for {mimeType} size {len(imageData)}B...")
+           logger.debug(f"Starting image analysis with query '{prompt}' for size {len(imageData)}B...")
+
+           # Ensure imageData is a string (base64 encoded)
+           if not isinstance(imageData, str):
+               raise ValueError("imageData must be a string (base64 encoded)")
+
+           # Fix base64 padding if needed
+           padding_needed = len(imageData) % 4
+           if padding_needed:
+               imageData += '=' * (4 - padding_needed)
+
+           # Use default MIME type if not provided
+           if not mimeType:
+               mimeType = "image/jpeg"
+
+           logger.debug(f"Using MIME type: {mimeType}")
+           logger.debug(f"Base64 data length: {len(imageData)} characters")
+
+           # Create the data URL format as required by OpenAI Vision API
+           data_url = f"data:{mimeType};base64,{imageData}"
+
            messages = [
                {
                    "role": "user",
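The padding repair in the hunk above relies on valid base64 always having a length that is a multiple of four. The same repair can be shown in isolation (a minimal sketch; the function name `fix_base64_padding` is illustrative, not from the codebase):

```python
import base64

def fix_base64_padding(data: str) -> str:
    """Restore stripped '=' padding so the length is a multiple of four."""
    padding_needed = len(data) % 4
    if padding_needed:
        data += "=" * (4 - padding_needed)
    return data

# "hello" encodes to "aGVsbG8="; some transports strip the trailing '='
stripped = "aGVsbG8"
print(base64.b64decode(fix_base64_padding(stripped)))  # b'hello'
```

Already-padded input passes through unchanged, so the repair is safe to apply unconditionally.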
@@ -107,15 +127,40 @@ class AiOpenai:
                        {
                            "type": "image_url",
                            "image_url": {
-                               "url": f"data:{mimeType};base64,{imageData}"
+                               "url": data_url
                            }
                        }
                    ]
                }
            ]

-           # Use the existing callApi function with the Vision model
-           response = await self.callApi(messages)
+           # Use a vision-capable model for image analysis
+           # Override the model for vision tasks
+           visionModel = "gpt-4o"  # or "gpt-4-vision-preview" depending on availability
+
+           # Use parameters from configuration
+           temperature = self.config.get("temperature", 0.2)
+           maxTokens = self.config.get("maxTokens", 2000)
+
+           payload = {
+               "model": visionModel,
+               "messages": messages,
+               "temperature": temperature,
+               "max_tokens": maxTokens
+           }
+
+           response = await self.httpClient.post(
+               self.apiUrl,
+               json=payload
+           )
+
+           if response.status_code != 200:
+               logger.error(f"OpenAI API error: {response.status_code} - {response.text}")
+               raise HTTPException(status_code=500, detail="Error communicating with OpenAI API")
+
+           responseJson = response.json()
+           content = responseJson["choices"][0]["message"]["content"]
+           return content
-
-           # Return content
-           return response
@@ -173,13 +173,31 @@ class DatabaseConnector:
            record["_modifiedAt"] = currentTime.isoformat()
            record["_modifiedBy"] = self.userId

-           # Save the record file
+           # Save the record file using atomic write
            recordPath = self._getRecordPath(table, recordId)
+           tempPath = recordPath + '.tmp'

            # Ensure directory exists
            os.makedirs(os.path.dirname(recordPath), exist_ok=True)

-           with open(recordPath, 'w', encoding='utf-8') as f:
+           # Write to temporary file first
+           with open(tempPath, 'w', encoding='utf-8') as f:
                json.dump(record, f, indent=2, ensure_ascii=False)

+           # Verify the temporary file can be read back (validation)
+           try:
+               with open(tempPath, 'r', encoding='utf-8') as f:
+                   json.load(f)  # This will fail if file is corrupted
+           except Exception as e:
+               logger.error(f"Validation failed for record {recordId}: {e}")
+               # Clean up temp file
+               if os.path.exists(tempPath):
+                   os.remove(tempPath)
+               raise ValueError(f"Record validation failed: {e}")
+
+           # Atomic move from temp to final location
+           os.replace(tempPath, recordPath)

            # Update metadata
            metadata = self._loadTableMetadata(table)
            if recordId not in metadata["recordIds"]:
@@ -203,6 +221,13 @@ class DatabaseConnector:

        except Exception as e:
            logger.error(f"Error saving record {recordId} to table {table}: {e}")
+           # Clean up temp file if it exists
+           tempPath = self._getRecordPath(table, recordId) + '.tmp'
+           if os.path.exists(tempPath):
+               try:
+                   os.remove(tempPath)
+               except:
+                   pass
            return False

    def _loadTable(self, table: str) -> List[Dict[str, Any]]:
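The write-validate-replace sequence introduced in the DatabaseConnector hunks is a standard atomic-write pattern: readers see either the old file or the complete new file, never a partial write. A standalone sketch under assumed paths (the helper name `atomic_write_json` is hypothetical; the mechanics match the diff's use of a `.tmp` file plus `os.replace`):

```python
import json
import os
import tempfile

def atomic_write_json(path: str, record: dict) -> None:
    """Write JSON to a temp file, validate it, then atomically replace the target."""
    temp_path = path + ".tmp"
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(temp_path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2, ensure_ascii=False)
    try:
        # Re-read to catch a truncated or corrupted write before it replaces the original
        with open(temp_path, "r", encoding="utf-8") as f:
            json.load(f)
    except Exception:
        os.remove(temp_path)
        raise
    # os.replace is atomic on POSIX and on Windows when both paths are on the same volume
    os.replace(temp_path, path)

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "record.json")
    atomic_write_json(target, {"id": 1})
    with open(target, encoding="utf-8") as f:
        print(json.load(f))  # {'id': 1}
```

Keeping the temp file in the same directory as the target (as the diff does) matters: `os.replace` across filesystems raises `OSError`, which would defeat the atomicity.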
@@ -116,7 +116,7 @@ class AiCalls:
            The AI response as text
        """
        try:
-           return await self.openaiService.callAiImage(imageData, mimeType, prompt)
+           return await self.openaiService.callAiImage(prompt, imageData, mimeType)
        except Exception as e:
            logger.error(f"Error in OpenAI image call: {str(e)}")
            return f"Error: {str(e)}"
@@ -237,7 +237,6 @@ class AppObjects:
        # Find user by username
        for user_dict in users:
            if user_dict.get("username") == username:
                logger.info(f"Found user with username {username}")
                return User.from_dict(user_dict)

        logger.info(f"No user found with username {username}")
@@ -760,7 +760,7 @@ class ChatObjects:
            else:
                # Create new workflow
                workflowData = {
-                   "name": userInput.name or "New Workflow",
+                   "name": "New Workflow",  # Default name since UserInputRequest doesn't have a name field
                    "status": "running",
                    "startedAt": currentTime,
                    "lastActivity": currentTime,
@@ -690,34 +690,39 @@ class ComponentObjects:
                return None

            # Process content based on file type
            contentType = "binary"
            isText = False
            content = ""
            encoding = None

-           if file.get("mimeType", "").startswith("text/"):
+           # Use proper attribute access for FileItem object
+           if file.mimeType.startswith("text/"):
                # For text files, return full content
                try:
                    content = fileContent.decode('utf-8')
                    contentType = "text"
+                   isText = True
+                   encoding = 'utf-8'
                except UnicodeDecodeError:
                    content = fileContent.decode('latin-1')
                    contentType = "text"
+                   isText = True
+                   encoding = 'latin-1'
-           elif file.get("mimeType", "").startswith("image/"):
+           elif file.mimeType.startswith("image/"):
                # For images, return base64
                contentType = "base64"
-               content = f"data:{file['mimeType']};base64,{fileContent.hex()}"
+               import base64
+               content = base64.b64encode(fileContent).decode('utf-8')
+               isText = False
            else:
                # For other files, return as base64
                contentType = "base64"
-               content = f"data:{file['mimeType']};base64,{fileContent.hex()}"
+               import base64
+               content = base64.b64encode(fileContent).decode('utf-8')
+               isText = False

            return FilePreview(
                id=fileId,
-               name=file.get("name", "Unknown"),
-               mimeType=file.get("mimeType", "application/octet-stream"),
-               size=file.get("size", 0),
                content=content,
-               contentType=contentType,
-               metadata=file.get("metadata", {})
+               mimeType=file.mimeType,
+               filename=file.filename,
+               isText=isText,
+               encoding=encoding,
+               size=file.fileSize
            )

        except Exception as e:
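The hunk above replaces `fileContent.hex()` with `base64.b64encode(...)` inside a data URL labelled `base64`. The distinction matters: hex output is not valid base64, so a client decoding the old payload as base64 would get garbage. A quick sketch of the difference (the sample bytes are illustrative):

```python
import base64

file_content = b"\x89PNG\r\n"  # first bytes of a PNG file header

hex_text = file_content.hex()                       # hex digits, ~2x the input size
b64_text = base64.b64encode(file_content).decode()  # actual base64, ~1.33x the input size

# Only the base64 form round-trips through a base64 decoder
assert base64.b64decode(b64_text) == file_content
print(hex_text, b64_text)  # 89504e470d0a iVBORw0K
```

Besides correctness, base64 is also more compact than hex, which matters for large file previews.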
@@ -1,4 +1,4 @@
-from typing import Dict, Any, Optional
+from typing import Dict, Any, Optional, List
 import logging
 import uuid
 from datetime import datetime, UTC
@@ -11,10 +11,11 @@ class MethodCoder(MethodBase):
    """Coder method implementation for code operations"""

    def __init__(self, serviceContainer: Any):
        """Initialize the coder method"""
        super().__init__(serviceContainer)
        self.name = "coder"
-       self.description = "Handle code operations like analysis and generation"
+       self.description = "Handle code operations like analysis, generation, and refactoring"

    @action
    async def analyze(self, parameters: Dict[str, Any]) -> ActionResult:
        """
@@ -55,7 +56,7 @@ class MethodCoder(MethodBase):
                    error="No documents found for the provided reference"
                )

-           # Extract content from all documents
+           # Process each document individually
            all_code_content = []

            for chatDocument in chatDocuments:
@@ -85,15 +86,18 @@ class MethodCoder(MethodBase):
                    error="No code content could be extracted from any documents"
                )

-           # Combine all code content for analysis
-           combined_code = "\n\n--- CODE SEPARATOR ---\n\n".join(all_code_content)
+           # Extract text content from ExtractedContent objects
+           text_contents = self.service.extractTextFromContentObjects(all_code_content)
+
+           # Combine all extracted text content for analysis
+           combined_content = "\n\n--- CODE SEPARATOR ---\n\n".join(text_contents)

            # Create analysis prompt
            analysis_prompt = f"""
            Analyze this {language} code for quality, structure, and potential issues.

            Code to analyze:
-           {combined_code}
+           {combined_content}

            Please check for:
            {', '.join(checks)}
@@ -26,18 +26,16 @@ class MethodDocument(MethodBase):
    @action
    async def extract(self, parameters: Dict[str, Any]) -> ActionResult:
        """
-       Extract content from document
+       Extract specific content from document with ai prompt and return it as a json file

        Parameters:
            documentList (str): Reference to the document list to extract content from
            aiPrompt (str): AI prompt for content extraction
            format (str, optional): Output format (default: "text")
            includeMetadata (bool, optional): Whether to include metadata (default: True)
        """
        try:
            documentList = parameters.get("documentList")
            aiPrompt = parameters.get("aiPrompt")
            format = parameters.get("format", "text")
            includeMetadata = parameters.get("includeMetadata", True)

            if not documentList:
@@ -95,12 +93,14 @@ class MethodDocument(MethodBase):
                    error="No content could be extracted from any documents"
                )

-           # Combine all extracted content
-           combined_content = "\n\n--- DOCUMENT SEPARATOR ---\n\n".join(all_extracted_content)
+           # Extract text content from ExtractedContent objects
+           text_contents = self.service.extractTextFromContentObjects(all_extracted_content)
+
+           # Combine all extracted text content
+           combined_content = "\n\n--- DOCUMENT SEPARATOR ---\n\n".join(text_contents)

            result_data = {
                "documentCount": len(chatDocuments),
                "format": format,
                "content": combined_content,
                "fileInfos": file_infos if includeMetadata else None,
                "timestamp": datetime.now(UTC).isoformat()
@@ -124,236 +124,3 @@ class MethodDocument(MethodBase):
                data={},
                error=str(e)
            )
-
-   @action
-   async def analyze(self, parameters: Dict[str, Any]) -> ActionResult:
-       """
-       Analyze document content
-
-       Parameters:
-           documentList (str): Reference to the document list to analyze
-           aiPrompt (str): AI prompt for content analysis
-           analysis (List[str], optional): Types of analysis to perform (default: ["entities", "topics", "sentiment"])
-       """
-       try:
-           documentList = parameters.get("documentList")
-           aiPrompt = parameters.get("aiPrompt")
-           analysis = parameters.get("analysis", ["entities", "topics", "sentiment"])
-
-           if not documentList:
-               return self._createResult(
-                   success=False,
-                   data={},
-                   error="Document list reference is required"
-               )
-
-           if not aiPrompt:
-               return self._createResult(
-                   success=False,
-                   data={},
-                   error="AI prompt is required"
-               )
-
-           chatDocuments = self.service.getChatDocumentsFromDocumentList(documentList)
-           if not chatDocuments:
-               return self._createResult(
-                   success=False,
-                   data={},
-                   error="No documents found for the provided reference"
-               )
-
-           # Extract content from all documents
-           all_extracted_content = []
-
-           for chatDocument in chatDocuments:
-               fileId = chatDocument.fileId
-               file_data = self.service.getFileData(fileId)
-               file_info = self.service.getFileInfo(fileId)
-
-               if not file_data:
-                   logger.warning(f"File not found or empty for fileId: {fileId}")
-                   continue
-
-               extracted_content = await self.service.extractContentFromFileData(
-                   prompt=aiPrompt,
-                   fileData=file_data,
-                   filename=file_info.get('name', 'document'),
-                   mimeType=file_info.get('mimeType', 'application/octet-stream'),
-                   base64Encoded=False,
-                   documentId=chatDocument.id
-               )
-
-               all_extracted_content.append(extracted_content)
-
-           if not all_extracted_content:
-               return self._createResult(
-                   success=False,
-                   data={},
-                   error="No content could be extracted from any documents"
-               )
-
-           # Combine all extracted content for analysis
-           combined_content = "\n\n--- DOCUMENT SEPARATOR ---\n\n".join(all_extracted_content)
-
-           analysis_prompt = f"""
-           Analyze this document content for the following aspects:
-           {', '.join(analysis)}
-
-           Document content:
-           {combined_content[:8000]}  # Limit content length
-
-           Please provide a detailed analysis including:
-           1. Key entities (people, organizations, locations, dates)
-           2. Main topics and themes
-           3. Sentiment analysis (positive, negative, neutral)
-           4. Key insights and patterns
-           5. Important relationships between entities
-           6. Document structure and organization
-           """
-
-           analysis_result = await self.service.interfaceAiCalls.callAiTextAdvanced(analysis_prompt)
-
-           result_data = {
-               "documentCount": len(chatDocuments),
-               "analysis": analysis,
-               "results": analysis_result,
-               "content": combined_content,
-               "timestamp": datetime.now(UTC).isoformat()
-           }
-
-           return self._createResult(
-               success=True,
-               data={
-                   "documents": [
-                       {
-                           "documentName": f"document_analysis_{datetime.now(UTC).strftime('%Y%m%d_%H%M%S')}.json",
-                           "documentData": result_data
-                       }
-                   ]
-               }
-           )
-       except Exception as e:
-           logger.error(f"Error analyzing content: {str(e)}")
-           return self._createResult(
-               success=False,
-               data={},
-               error=str(e)
-           )
-
-   @action
-   async def summarize(self, parameters: Dict[str, Any]) -> ActionResult:
-       """
-       Summarize document content
-
-       Parameters:
-           documentList (str): Reference to the document list to summarize
-           aiPrompt (str): AI prompt for content extraction
-           maxLength (int, optional): Maximum length of summary in words (default: 200)
-           format (str, optional): Output format (default: "text")
-       """
-       try:
-           documentList = parameters.get("documentList")
-           aiPrompt = parameters.get("aiPrompt")
-           maxLength = parameters.get("maxLength", 200)
-           format = parameters.get("format", "text")
-
-           if not documentList:
-               return self._createResult(
-                   success=False,
-                   data={},
-                   error="Document list reference is required"
-               )
-
-           if not aiPrompt:
-               return self._createResult(
-                   success=False,
-                   data={},
-                   error="AI prompt is required"
-               )
-
-           chatDocuments = self.service.getChatDocumentsFromDocumentList(documentList)
-           if not chatDocuments:
-               return self._createResult(
-                   success=False,
-                   data={},
-                   error="No documents found for the provided reference"
-               )
-
-           # Extract content from all documents
-           all_extracted_content = []
-
-           for chatDocument in chatDocuments:
-               fileId = chatDocument.fileId
-               file_data = self.service.getFileData(fileId)
-               file_info = self.service.getFileInfo(fileId)
-
-               if not file_data:
-                   logger.warning(f"File not found or empty for fileId: {fileId}")
-                   continue
-
-               extracted_content = await self.service.extractContentFromFileData(
-                   prompt=aiPrompt,
-                   fileData=file_data,
-                   filename=file_info.get('name', 'document'),
-                   mimeType=file_info.get('mimeType', 'application/octet-stream'),
-                   base64Encoded=False,
-                   documentId=chatDocument.id
-               )
-
-               all_extracted_content.append(extracted_content)
-
-           if not all_extracted_content:
-               return self._createResult(
-                   success=False,
-                   data={},
-                   error="No content could be extracted from any documents"
-               )
-
-           # Combine all extracted content for summarization
-           combined_content = "\n\n--- DOCUMENT SEPARATOR ---\n\n".join(all_extracted_content)
-
-           summary_prompt = f"""
-           Create a comprehensive summary of this document content.
-
-           Document content:
-           {combined_content[:8000]}  # Limit content length
-
-           Requirements:
-           - Maximum length: {maxLength} words
-           - Format: {format}
-           - Include key points and main ideas
-           - Maintain accuracy and completeness
-           - Use clear, professional language
-           - Highlight important insights and conclusions
-           """
-
-           summary = await self.service.interfaceAiCalls.callAiTextAdvanced(summary_prompt)
-
-           result_data = {
-               "documentCount": len(chatDocuments),
-               "maxLength": maxLength,
-               "format": format,
-               "summary": summary,
-               "wordCount": len(summary.split()),
-               "originalContent": combined_content,
-               "timestamp": datetime.now(UTC).isoformat()
-           }
-
-           return self._createResult(
-               success=True,
-               data={
-                   "documents": [
-                       {
-                           "documentName": f"document_summary_{datetime.now(UTC).strftime('%Y%m%d_%H%M%S')}.txt",
-                           "documentData": result_data
-                       }
-                   ]
-               }
-           )
-       except Exception as e:
-           logger.error(f"Error summarizing content: {str(e)}")
-           return self._createResult(
-               success=False,
-               data={},
-               error=str(e)
-           )
@@ -133,7 +133,7 @@ async def get_file(
                detail=f"File with ID {fileId} not found"
            )

-       return FileItem(**fileData)
+       return fileData

    except interfaceComponentObjects.FileNotFoundError as e:
        logger.warning(f"File not found: {str(e)}")
@@ -180,8 +180,8 @@ async def update_file(
                detail=f"File with ID {fileId} not found"
            )

-       # Check if user has access to the file
-       if file.get("userId", 0) != currentUser.get("id", 0):
+       # Check if user has access to the file using the interface's permission system
+       if not managementInterface._canModify("files", fileId):
            raise HTTPException(
                status_code=status.HTTP_403_FORBIDDEN,
                detail="Not authorized to update this file"
@@ -195,9 +195,9 @@ async def update_file(
                detail="Failed to update file"
            )

-       # Get updated file and convert to FileItem
+       # Get updated file
        updatedFile = managementInterface.getFile(fileId)
-       return FileItem(**updatedFile)
+       return updatedFile

    except HTTPException as he:
        raise he
@@ -328,15 +328,15 @@ async def preview_file(
    try:
        managementInterface = interfaceComponentObjects.getInterface(currentUser)

-       # Get file preview
-       preview = managementInterface.getFilePreview(fileId)
+       # Get file preview using the correct method
+       preview = managementInterface.getFileContent(fileId)
        if not preview:
            raise HTTPException(
                status_code=status.HTTP_404_NOT_FOUND,
                detail=f"File with ID {fileId} not found or no content available"
            )

-       return FilePreview(**preview)
+       return preview
    except HTTPException:
        raise
    except Exception as e:
@@ -54,7 +54,7 @@ async def create_prompt(
    # Create prompt
    newPrompt = managementInterface.createPrompt(prompt_data)

-   return Prompt.from_dict(newPrompt)
+   return Prompt(**newPrompt)

@router.get("/{promptId}", response_model=Prompt)
@limiter.limit("30/minute")
@@ -74,7 +74,7 @@ async def get_prompt(
            detail=f"Prompt with ID {promptId} not found"
        )

-   return Prompt.from_dict(prompt)
+   return prompt

@router.put("/{promptId}", response_model=Prompt)
@limiter.limit("10/minute")
@@ -107,7 +107,7 @@ async def update_prompt(
            detail="Error updating the prompt"
        )

-   return Prompt.from_dict(updatedPrompt)
+   return Prompt(**updatedPrompt)

@router.delete("/{promptId}", response_model=Dict[str, Any])
@limiter.limit("10/minute")
@@ -48,7 +48,7 @@ def getServiceChat(currentUser: User):

# Consolidated endpoint for getting all workflows
@router.get("/", response_model=List[ChatWorkflow])
-@limiter.limit("30/minute")
+@limiter.limit("120/minute")
async def get_workflows(
    request: Request,
    currentUser: User = Depends(getCurrentUser)
@@ -56,7 +56,31 @@ async def get_workflows(
    """Get all workflows for the current user."""
    try:
        appInterface = getInterface(currentUser)
-       return appInterface.getAllWorkflows()
+       workflows_data = appInterface.getAllWorkflows()
+
+       # Convert raw dictionaries to ChatWorkflow objects
+       workflows = []
+       for workflow_data in workflows_data:
+           try:
+               workflow = ChatWorkflow(
+                   id=workflow_data["id"],
+                   status=workflow_data.get("status", "running"),
+                   name=workflow_data.get("name"),
+                   currentRound=workflow_data.get("currentRound", 1),
+                   lastActivity=workflow_data.get("lastActivity", appInterface._getCurrentTimestamp()),
+                   startedAt=workflow_data.get("startedAt", appInterface._getCurrentTimestamp()),
+                   logs=[ChatLog(**log) for log in workflow_data.get("logs", [])],
+                   messages=[ChatMessage(**msg) for msg in workflow_data.get("messages", [])],
+                   stats=ChatStat(**workflow_data.get("dataStats", {})) if workflow_data.get("dataStats") else None,
+                   mandateId=workflow_data.get("mandateId", currentUser.mandateId or "")
+               )
+               workflows.append(workflow)
+           except Exception as e:
+               logger.warning(f"Error converting workflow data to ChatWorkflow object: {str(e)}")
+               # Skip invalid workflows instead of failing the entire request
+               continue
+
+       return workflows
    except Exception as e:
        logger.error(f"Error getting workflows: {str(e)}")
        raise HTTPException(
@@ -65,7 +89,7 @@ async def get_workflows(
    )

@router.get("/{workflowId}", response_model=ChatWorkflow)
-@limiter.limit("30/minute")
+@limiter.limit("120/minute")
async def get_workflow(
    request: Request,
    workflowId: str = Path(..., description="ID of the workflow"),
@@ -93,9 +117,58 @@ async def get_workflow(
            detail=f"Failed to get workflow: {str(e)}"
        )

+@router.put("/{workflowId}", response_model=ChatWorkflow)
+@limiter.limit("120/minute")
+async def update_workflow(
+    request: Request,
+    workflowId: str = Path(..., description="ID of the workflow to update"),
+    workflowData: Dict[str, Any] = Body(...),
+    currentUser: User = Depends(getCurrentUser)
+) -> ChatWorkflow:
+    """Update workflow by ID"""
+    try:
+        # Get workflow interface with current user context
+        workflowInterface = getInterface(currentUser)
+
+        # Get raw workflow data from database to check permissions
+        workflows = workflowInterface.db.getRecordset("workflows", recordFilter={"id": workflowId})
+        if not workflows:
+            raise HTTPException(
+                status_code=status.HTTP_404_NOT_FOUND,
+                detail="Workflow not found"
+            )
+
+        workflow_data = workflows[0]
+
+        # Check if user has permission to update using the interface's permission system
+        if not workflowInterface._canModify("workflows", workflowId):
+            raise HTTPException(
+                status_code=status.HTTP_403_FORBIDDEN,
+                detail="You don't have permission to update this workflow"
+            )
+
+        # Update workflow
+        updatedWorkflow = workflowInterface.updateWorkflow(workflowId, workflowData)
+        if not updatedWorkflow:
+            raise HTTPException(
+                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                detail="Failed to update workflow"
+            )
+
+        return updatedWorkflow
+
+    except HTTPException:
+        raise
+    except Exception as e:
+        logger.error(f"Error updating workflow: {str(e)}")
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail=f"Failed to update workflow: {str(e)}"
+        )
+
# API Endpoint for workflow status
@router.get("/{workflowId}/status", response_model=ChatWorkflow)
-@limiter.limit("30/minute")
+@limiter.limit("120/minute")
async def get_workflow_status(
    request: Request,
    workflowId: str = Path(..., description="ID of the workflow"),
@@ -114,7 +187,7 @@ async def get_workflow_status(
            detail=f"Workflow with ID {workflowId} not found"
        )

-       return ChatWorkflow(**workflow)
+       return workflow
    except HTTPException:
        raise
    except Exception as e:
@@ -126,7 +199,7 @@ async def get_workflow_status(

# API Endpoint for workflow logs with selective data transfer
@router.get("/{workflowId}/logs", response_model=List[ChatLog])
-@limiter.limit("30/minute")
+@limiter.limit("120/minute")
async def get_workflow_logs(
    request: Request,
    workflowId: str = Path(..., description="ID of the workflow"),
@@ -152,12 +225,12 @@ async def get_workflow_logs(
        # Apply selective data transfer if logId is provided
        if logId:
            # Find the index of the log with the given ID
-           logIndex = next((i for i, log in enumerate(allLogs) if log.get("id") == logId), -1)
+           logIndex = next((i for i, log in enumerate(allLogs) if log.id == logId), -1)
            if logIndex >= 0:
                # Return only logs after the specified log
-               return [ChatLog(**log) for log in allLogs[logIndex + 1:]]
+               return allLogs[logIndex + 1:]

-       return [ChatLog(**log) for log in allLogs]
+       return allLogs
    except HTTPException:
        raise
    except Exception as e:
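The `logId`/`messageId` cursor logic used by these endpoints — return only the items after a known ID, or everything when no cursor is given — can be sketched generically. This is an illustrative sketch with a hypothetical helper name and dict-shaped items, not the endpoint's exact types:

```python
from typing import Any, Dict, List, Optional

def items_after(items: List[Dict[str, Any]], last_id: Optional[str]) -> List[Dict[str, Any]]:
    """Return the items after the one with last_id; all items if no cursor or unknown ID."""
    if last_id is None:
        return items
    index = next((i for i, item in enumerate(items) if item["id"] == last_id), -1)
    # An unknown ID falls back to the full list, matching the endpoint's fall-through
    return items[index + 1:] if index >= 0 else items

logs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
print(items_after(logs, "a"))   # [{'id': 'b'}, {'id': 'c'}]
print(items_after(logs, None))  # full list
```

Note the trade-off: ID-based cursors stay correct when earlier items are deleted, which plain offset pagination does not.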
@@ -169,7 +242,7 @@ async def get_workflow_logs(

# API Endpoint for workflow messages with selective data transfer
@router.get("/{workflowId}/messages", response_model=List[ChatMessage])
-@limiter.limit("30/minute")
+@limiter.limit("120/minute")
async def get_workflow_messages(
    request: Request,
    workflowId: str = Path(..., description="ID of the workflow"),
@@ -195,12 +268,12 @@ async def get_workflow_messages(
        # Apply selective data transfer if messageId is provided
        if messageId:
            # Find the index of the message with the given ID
-           messageIndex = next((i for i, msg in enumerate(allMessages) if msg.get("id") == messageId), -1)
+           messageIndex = next((i for i, msg in enumerate(allMessages) if msg.id == messageId), -1)
            if messageIndex >= 0:
                # Return only messages after the specified message
-               return [ChatMessage(**msg) for msg in allMessages[messageIndex + 1:]]
+               return allMessages[messageIndex + 1:]

-       return [ChatMessage(**msg) for msg in allMessages]
+       return allMessages
    except HTTPException:
        raise
    except Exception as e:
@@ -212,7 +285,7 @@ async def get_workflow_messages(

 # State 1: Workflow Initialization endpoint
 @router.post("/start", response_model=ChatWorkflow)
-@limiter.limit("10/minute")
+@limiter.limit("120/minute")
 async def start_workflow(
     request: Request,
     workflowId: Optional[str] = Query(None, description="Optional ID of the workflow to continue"),
@@ -230,7 +303,7 @@ async def start_workflow(
         # Start or continue workflow using ChatObjects
         workflow = await interfaceChat.workflowStart(currentUser, userInput, workflowId)

-        return ChatWorkflow(**workflow)
+        return workflow

     except Exception as e:
         logger.error(f"Error in start_workflow: {str(e)}")
@@ -241,7 +314,7 @@ async def start_workflow(

 # State 8: Workflow Stopped endpoint
 @router.post("/{workflowId}/stop", response_model=ChatWorkflow)
-@limiter.limit("10/minute")
+@limiter.limit("120/minute")
 async def stop_workflow(
     request: Request,
     workflowId: str = Path(..., description="ID of the workflow to stop"),
@@ -255,7 +328,7 @@ async def stop_workflow(
         # Stop workflow using ChatObjects
         workflow = await interfaceChat.workflowStop(workflowId)

-        return ChatWorkflow(**workflow)
+        return workflow

     except Exception as e:
         logger.error(f"Error in stop_workflow: {str(e)}")
@@ -266,7 +339,7 @@ async def stop_workflow(

 # State 11: Workflow Reset/Deletion endpoint
 @router.delete("/{workflowId}", response_model=Dict[str, Any])
-@limiter.limit("10/minute")
+@limiter.limit("120/minute")
 async def delete_workflow(
     request: Request,
     workflowId: str = Path(..., description="ID of the workflow to delete"),
@@ -277,16 +350,18 @@ async def delete_workflow(
         # Get service container
         interfaceChat = getServiceChat(currentUser)

-        # Verify workflow exists
-        workflow = interfaceChat.getWorkflow(workflowId)
-        if not workflow:
+        # Get raw workflow data from database to check permissions
+        workflows = interfaceChat.db.getRecordset("workflows", recordFilter={"id": workflowId})
+        if not workflows:
             raise HTTPException(
                 status_code=status.HTTP_404_NOT_FOUND,
                 detail=f"Workflow with ID {workflowId} not found"
             )

-        # Check if user has permission to delete
-        if workflow.get("_userId") != currentUser["id"]:
+        workflow_data = workflows[0]
+
+        # Check if user has permission to delete using the interface's permission system
+        if not interfaceChat._canModify("workflows", workflowId):
             raise HTTPException(
                 status_code=status.HTTP_403_FORBIDDEN,
                 detail="You don't have permission to delete this workflow"
@@ -318,7 +393,7 @@ async def delete_workflow(
 # Document Management Endpoints

 @router.delete("/{workflowId}/messages/{messageId}", response_model=Dict[str, Any])
-@limiter.limit("10/minute")
+@limiter.limit("120/minute")
 async def delete_workflow_message(
     request: Request,
     workflowId: str = Path(..., description="ID of the workflow"),
@@ -368,7 +443,7 @@ async def delete_workflow_message(
     )

 @router.delete("/{workflowId}/messages/{messageId}/files/{fileId}", response_model=Dict[str, Any])
-@limiter.limit("10/minute")
+@limiter.limit("120/minute")
 async def delete_file_from_message(
     request: Request,
     workflowId: str = Path(..., description="ID of the workflow"),
File diff suppressed because it is too large
@@ -17,8 +17,8 @@ class DocumentManager:

     def __init__(self, serviceContainer):
         self.service = serviceContainer
-        # Create processor without any dependencies
-        self._processor = DocumentProcessor()
+        # Create processor with service container for AI calls
+        self._processor = DocumentProcessor(serviceContainer)

     async def extractContentFromDocument(self, prompt: str, document: ChatDocument) -> ExtractedContent:
         """Extract content from ChatDocument using prompt"""
@@ -52,8 +52,56 @@ class WorkflowManager:

         except WorkflowStoppedException:
             logger.info("Workflow stopped by user")
+            # Update workflow status to stopped
+            workflow.status = "stopped"
+            workflow.lastActivity = datetime.now(UTC).isoformat()
+            self.chatInterface.updateWorkflow(workflow.id, {
+                "status": "stopped",
+                "lastActivity": workflow.lastActivity
+            })
+
+            # Add log entry
+            self.chatInterface.createWorkflowLog({
+                "workflowId": workflow.id,
+                "message": "Workflow stopped by user",
+                "type": "warning",
+                "status": "stopped",
+                "progress": 100
+            })
+
         except Exception as e:
             logger.error(f"Workflow processing error: {str(e)}")
+
+            # Update workflow status to failed
+            workflow.status = "failed"
+            workflow.lastActivity = datetime.now(UTC).isoformat()
+            self.chatInterface.updateWorkflow(workflow.id, {
+                "status": "failed",
+                "lastActivity": workflow.lastActivity
+            })
+
+            # Create error message
+            error_message = {
+                "workflowId": workflow.id,
+                "role": "assistant",
+                "message": f"Workflow processing failed: {str(e)}",
+                "status": "last",
+                "sequenceNr": len(workflow.messages) + 1,
+                "publishedAt": datetime.now(UTC).isoformat()
+            }
+            message = self.chatInterface.createWorkflowMessage(error_message)
+            if message:
+                workflow.messages.append(message)
+
+            # Add error log entry
+            self.chatInterface.createWorkflowLog({
+                "workflowId": workflow.id,
+                "message": f"Workflow failed: {str(e)}",
+                "type": "error",
+                "status": "failed",
+                "progress": 100
+            })
+
             raise

     async def _sendFirstMessage(self, userInput: UserInputRequest, workflow: ChatWorkflow) -> ChatMessage:
@@ -108,6 +156,25 @@ class WorkflowManager:
             if message:
                 workflow.messages.append(message)

+            # Update workflow status to completed
+            workflow.status = "completed"
+            workflow.lastActivity = datetime.now(UTC).isoformat()
+
+            # Update workflow in database
+            self.chatInterface.updateWorkflow(workflow.id, {
+                "status": "completed",
+                "lastActivity": workflow.lastActivity
+            })
+
+            # Add completion log entry
+            self.chatInterface.createWorkflowLog({
+                "workflowId": workflow.id,
+                "message": "Workflow completed successfully",
+                "type": "success",
+                "status": "completed",
+                "progress": 100
+            })
+
         except Exception as e:
             logger.error(f"Error sending last message: {str(e)}")
             raise
@@ -128,6 +195,14 @@ class WorkflowManager:
             message = self.chatInterface.createWorkflowMessage(error_message)
             if message:
                 workflow.messages.append(message)
+
+            # Update workflow status to failed
+            workflow.status = "failed"
+            workflow.lastActivity = datetime.now(UTC).isoformat()
+            self.chatInterface.updateWorkflow(workflow.id, {
+                "status": "failed",
+                "lastActivity": workflow.lastActivity
+            })
             return

         # Process successful workflow results
@@ -174,6 +249,14 @@ class WorkflowManager:
             if message:
                 workflow.messages.append(message)

+            # Update workflow status to completed for successful workflows
+            workflow.status = "completed"
+            workflow.lastActivity = datetime.now(UTC).isoformat()
+            self.chatInterface.updateWorkflow(workflow.id, {
+                "status": "completed",
+                "lastActivity": workflow.lastActivity
+            })
+
         except Exception as e:
             logger.error(f"Error processing workflow results: {str(e)}")
             # Create error message
@@ -188,4 +271,12 @@ class WorkflowManager:
             message = self.chatInterface.createWorkflowMessage(error_message)
             if message:
                 workflow.messages.append(message)
+
+            # Update workflow status to failed
+            workflow.status = "failed"
+            workflow.lastActivity = datetime.now(UTC).isoformat()
+            self.chatInterface.updateWorkflow(workflow.id, {
+                "status": "failed",
+                "lastActivity": workflow.lastActivity
+            })
@@ -32,9 +32,10 @@ class FileProcessingError(Exception):
 class DocumentProcessor:
     """Processor for handling document operations and content extraction."""

-    def __init__(self):
+    def __init__(self, serviceContainer=None):
         """Initialize the document processor."""
         self._neutralizer = DataAnonymizer() if APP_CONFIG.get("ENABLE_CONTENT_NEUTRALIZATION", False) else None
+        self._serviceContainer = serviceContainer

         self.supportedTypes: Dict[str, Callable[[bytes, str, str], Awaitable[List[ContentItem]]]] = {
             'text/plain': self._processText,
@@ -108,7 +109,9 @@ class DocumentProcessor:
             logger.info("Image processing libraries successfully loaded")
         except ImportError as e:
             logger.warning(f"Image processing libraries could not be loaded: {e}")
+
+

     async def processFileData(self, fileData: bytes, filename: str, mimeType: str, base64Encoded: bool = False, prompt: str = None, documentId: str = None) -> ExtractedContent:
         """
         Process file data directly and extract its contents with AI processing.
@@ -133,7 +136,7 @@ class DocumentProcessor:

         # Detect content type if needed
         if mimeType == "application/octet-stream":
-            mimeType = self._detectContentTypeFromData(fileData, filename)
+            mimeType = self._serviceContainer.detectContentTypeFromData(fileData, filename)

         # Process document based on type
         if mimeType not in self.supportedTypes:
@@ -161,61 +164,8 @@ class DocumentProcessor:
         except Exception as e:
             logger.error(f"Error processing file data: {str(e)}")
             raise FileProcessingError(f"Failed to process file data: {str(e)}")

-    def _detectContentTypeFromData(self, fileData: bytes, filename: str) -> str:
-        """Detect content type from file data and filename"""
-        try:
-            # Check file extension first
-            ext = os.path.splitext(filename)[1].lower()
-            if ext:
-                # Map common extensions to MIME types
-                extToMime = {
-                    '.txt': 'text/plain',
-                    '.md': 'text/markdown',
-                    '.csv': 'text/csv',
-                    '.json': 'application/json',
-                    '.xml': 'application/xml',
-                    '.js': 'application/javascript',
-                    '.py': 'application/x-python',
-                    '.svg': 'image/svg+xml',
-                    '.jpg': 'image/jpeg',
-                    '.png': 'image/png',
-                    '.gif': 'image/gif',
-                    '.pdf': 'application/pdf',
-                    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
-                    '.doc': 'application/msword',
-                    '.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
-                    '.xls': 'application/vnd.ms-excel',
-                    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
-                    '.ppt': 'application/vnd.ms-powerpoint'
-                }
-                if ext in extToMime:
-                    return extToMime[ext]
-
-            # Try to detect from content
-            if fileData.startswith(b'%PDF'):
-                return 'application/pdf'
-            elif fileData.startswith(b'PK\x03\x04'):
-                # ZIP-based formats (docx, xlsx, pptx)
-                return 'application/zip'
-            elif fileData.startswith(b'<'):
-                # XML-based formats
-                try:
-                    text = fileData.decode('utf-8', errors='ignore')
-                    if '<svg' in text.lower():
-                        return 'image/svg+xml'
-                    elif '<html' in text.lower():
-                        return 'text/html'
-                    else:
-                        return 'application/xml'
-                except:
-                    pass
-
-            return 'application/octet-stream'
-
-        except Exception as e:
-            logger.error(f"Error detecting content type from data: {str(e)}")
-            return 'application/octet-stream'
-
-
     async def _processText(self, fileData: bytes, filename: str, mimeType: str) -> List[ContentItem]:
         """Process text document"""
@@ -546,14 +496,22 @@ class DocumentProcessor:
         try:
             # Get content type from metadata
             mimeType = item.metadata.mimeType if hasattr(item.metadata, 'mimeType') else "text/plain"
+            logger.debug(f"Processing content item with MIME type: {mimeType}, label: {item.label}")

             # Chunk content based on type
             if mimeType.startswith('text/'):
                 chunks = self._chunkText(item.data, mimeType)
             elif mimeType.startswith('image/'):
-                chunks = self._chunkImage(item.data)
-            elif mimeType.startswith('video/'):
-                chunks = self._chunkVideo(item.data)
+                # Images should not be chunked - process as single unit
+                chunks = [item.data]
+            elif mimeType == "application/pdf":
+                chunks = self._chunkPdf(item.data)
+            elif mimeType == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
+                chunks = self._chunkDocx(item.data)
+            elif mimeType == "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet":
+                chunks = self._chunkXlsx(item.data)
+            elif mimeType.startswith('application/vnd.openxmlformats-officedocument.presentationml.presentation'):
+                chunks = self._chunkPptx(item.data)
             else:
                 # Binary data - no chunking
                 chunks = [item.data]
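The branch ladder in this hunk is effectively a MIME-type-to-chunker dispatch. A minimal table-driven equivalent, with illustrative handler names rather than the processor's real methods:

```python
from typing import Callable, Dict, List

def choose_chunks(mime_type: str, data: str,
                  chunkers: Dict[str, Callable[[str], List[str]]]) -> List[str]:
    """Pick a chunker by MIME type: text/* uses the 'text' handler,
    images and unknown binaries pass through as a single chunk,
    everything else is looked up by exact MIME type."""
    if mime_type.startswith('text/'):
        return chunkers['text'](data)
    if mime_type.startswith('image/'):
        # Images are processed as one unit, as in the hunk above
        return [data]
    handler = chunkers.get(mime_type)
    return handler(data) if handler else [data]
```

A dict lookup keeps the dispatch in one place when more Office formats are added.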
@@ -561,26 +519,42 @@ class DocumentProcessor:
             # Process each chunk
             chunkResults = []
             for chunk in chunks:
-                # Neutralize content if neutralizer is enabled
-                contentToProcess = chunk
-                if self._neutralizer and contentToProcess:
-                    contentToProcess = self._neutralizer.neutralize(contentToProcess)
-
-                # Create AI prompt for this chunk
-                aiPrompt = f"""
-                Extract relevant information from this content based on the following prompt:
-
-                PROMPT: {prompt}
-
-                CONTENT:
-                {contentToProcess}
-
-                Return ONLY the extracted information in a clear, concise format.
-                """
-
-                # Note: This would need to be implemented with actual AI service
-                # For now, just return the original content
-                chunkResults.append(contentToProcess)
+                # Process with AI based on content type
+                try:
+                    logger.debug(f"AI processing chunk with MIME type: {mimeType}")
+                    if mimeType.startswith('image/'):
+                        # For images, use image AI service with base64 data
+                        # chunk is already base64 encoded string from _processImage
+                        # Use the original prompt directly for images (no content embedding)
+                        logger.debug(f"Calling image AI service for MIME type: {mimeType}")
+                        processedContent = await self._serviceContainer.callAiImageBasic(prompt, chunk, mimeType)
+                    else:
+                        # For text content, use text AI service
+                        # Neutralize content if neutralizer is enabled (only for text)
+                        contentToProcess = chunk
+                        if self._neutralizer and contentToProcess:
+                            contentToProcess = self._neutralizer.neutralize(contentToProcess)
+
+                        # Create AI prompt for text content
+                        aiPrompt = f"""
+                        Extract relevant information from this content based on the following prompt:
+
+                        PROMPT: {prompt}
+
+                        CONTENT:
+                        {contentToProcess}
+
+                        Return ONLY the extracted information in a clear, concise format.
+                        """
+
+                        logger.debug(f"Calling text AI service for MIME type: {mimeType}")
+                        processedContent = await self._serviceContainer.callAiTextBasic(aiPrompt, contentToProcess)
+
+                    chunkResults.append(processedContent)
+                except Exception as aiError:
+                    logger.error(f"AI processing failed for chunk: {str(aiError)}")
+                    # Fallback to original content
+                    chunkResults.append(chunk)

             # Combine chunk results
             combinedResult = "\n".join(chunkResults)
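The per-chunk fallback this hunk introduces (keep the raw chunk when the AI call fails, so one bad chunk never sinks the whole document) is a simple, reusable pattern. `process` below is a hypothetical callable standing in for the service container's AI methods:

```python
from typing import Callable, List

def process_chunks(chunks: List[str], process: Callable[[str], str]) -> str:
    """Run `process` over each chunk, falling back to the original
    chunk when processing raises, then join the results with newlines."""
    results = []
    for chunk in chunks:
        try:
            results.append(process(chunk))
        except Exception:
            # Fallback to original content, as in the hunk above
            results.append(chunk)
    return "\n".join(results)
```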
@@ -604,6 +578,8 @@ class DocumentProcessor:

         return processedItems

+
+
     def _chunkText(self, content: str, mimeType: str) -> List[str]:
         """Chunk text content based on mime type"""
         if mimeType == "text/plain":
@@ -765,36 +741,6 @@ class DocumentProcessor:
         except Exception:
             return [content]

-    def _chunkImage(self, content: str) -> List[str]:
-        """Chunk image content"""
-        try:
-            imageData = base64.b64decode(content)
-            chunks = []
-            chunkSize = self.chunkSizes["image"]
-
-            for i in range(0, len(imageData), chunkSize):
-                chunk = imageData[i:i + chunkSize]
-                chunks.append(base64.b64encode(chunk).decode('utf-8'))
-
-            return chunks
-        except Exception:
-            return [content]
-
-    def _chunkVideo(self, content: str) -> List[str]:
-        """Chunk video content"""
-        try:
-            videoData = base64.b64decode(content)
-            chunks = []
-            chunkSize = self.chunkSizes["video"]
-
-            for i in range(0, len(videoData), chunkSize):
-                chunk = videoData[i:i + chunkSize]
-                chunks.append(base64.b64encode(chunk).decode('utf-8'))
-
-            return chunks
-        except Exception:
-            return [content]
-
     def _chunkBinary(self, content: str) -> List[str]:
         """Chunk binary content"""
         try:
@@ -810,4 +756,87 @@ class DocumentProcessor:
         except Exception:
             return [content]

+    async def _chunkPdf(self, content: str) -> List[str]:
+        """Chunk PDF content"""
+        try:
+            pdfData = base64.b64decode(content)
+            chunks = []
+            chunkSize = self.chunkSizes["pdf"]
+
+            with io.BytesIO(pdfData) as pdfStream:
+                pdfReader = PyPDF2.PdfReader(pdfStream)
+                for pageNum in range(len(pdfReader.pages)):
+                    page = pdfReader.pages[pageNum]
+                    pageText = page.extract_text()
+                    if pageText:
+                        chunks.append(pageText)
+
+            return chunks
+        except Exception:
+            return [content]
+
+    async def _chunkDocx(self, content: str) -> List[str]:
+        """Chunk Word document content"""
+        try:
+            docxData = base64.b64decode(content)
+            chunks = []
+            chunkSize = self.chunkSizes["docx"]
+
+            with io.BytesIO(docxData) as docxStream:
+                doc = docx.Document(docxStream)
+                for para in doc.paragraphs:
+                    chunks.append(para.text)
+                for table in doc.tables:
+                    for row in table.rows:
+                        rowText = []
+                        for cell in row.cells:
+                            rowText.append(cell.text)
+                        chunks.append(" | ".join(rowText))
+
+            return chunks
+        except Exception:
+            return [content]
+
+    async def _chunkXlsx(self, content: str) -> List[str]:
+        """Chunk Excel document content"""
+        try:
+            xlsxData = base64.b64decode(content)
+            chunks = []
+            chunkSize = self.chunkSizes["xlsx"]
+
+            with io.BytesIO(xlsxData) as xlsxStream:
+                workbook = openpyxl.load_workbook(xlsxStream, data_only=True)
+                for sheetName in workbook.sheetnames:
+                    sheet = workbook[sheetName]
+                    for row in sheet.iter_rows():
+                        rowText = []
+                        for cell in row:
+                            value = cell.value
+                            if value is None:
+                                rowText.append("")
+                            else:
+                                rowText.append(str(value).replace('"', '""'))
+                        chunks.append(','.join(f'"{cell}"' for cell in rowText))
+
+            return chunks
+        except Exception:
+            return [content]
+
+    async def _chunkPptx(self, content: str) -> List[str]:
+        """Chunk PowerPoint document content"""
+        try:
+            pptxData = base64.b64decode(content)
+            chunks = []
+            chunkSize = self.chunkSizes["pptx"]
+
+            with io.BytesIO(pptxData) as pptxStream:
+                # openpyxl is not suitable for PowerPoint, so we'll just read text
+                # This is a placeholder and would require a different library for full pptx processing
+                # For now, we'll just return the base64 encoded content as a single chunk
+                chunks.append(content)
+
+            return chunks
+        except Exception:
+            return [content]
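`_chunkXlsx` above serialises each spreadsheet row as a quoted CSV line, doubling embedded quotes. The escaping rule on its own looks like this (a standalone sketch, independent of openpyxl):

```python
from typing import List, Optional

def row_to_csv(values: List[Optional[object]]) -> str:
    """Render one spreadsheet row the way _chunkXlsx does:
    empty string for None, double any embedded quotes, wrap each cell
    in quotes, join with commas."""
    cells = []
    for value in values:
        if value is None:
            cells.append("")
        else:
            # CSV-style escaping: " becomes ""
            cells.append(str(value).replace('"', '""'))
    return ",".join(f'"{cell}"' for cell in cells)
```

Quoting every cell unconditionally is slightly verbose but keeps commas and newlines inside cell values unambiguous.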
@@ -2,6 +2,7 @@ import logging
 import importlib
 import pkgutil
 import inspect
+import os
 from typing import Dict, Any, List, Optional
 from modules.interfaces.interfaceAppModel import User, UserConnection
 from modules.interfaces.interfaceChatModel import (
@@ -111,6 +112,155 @@ class ServiceContainer:
         except Exception as e:
             logger.error(f"Error discovering methods: {str(e)}")

+    def detectContentTypeFromData(self, fileData: bytes, filename: str) -> str:
+        """
+        Detect content type from file data and filename.
+        This method makes the MIME type detection function accessible through the service container.
+
+        Args:
+            fileData: Raw file data as bytes
+            filename: Name of the file
+
+        Returns:
+            str: Detected MIME type
+        """
+        try:
+            # Check file extension first
+            ext = os.path.splitext(filename)[1].lower()
+            if ext:
+                # Map common extensions to MIME types
+                extToMime = {
+                    '.txt': 'text/plain',
+                    '.md': 'text/markdown',
+                    '.csv': 'text/csv',
+                    '.json': 'application/json',
+                    '.xml': 'application/xml',
+                    '.js': 'application/javascript',
+                    '.py': 'application/x-python',
+                    '.svg': 'image/svg+xml',
+                    '.jpg': 'image/jpeg',
+                    '.jpeg': 'image/jpeg',
+                    '.png': 'image/png',
+                    '.gif': 'image/gif',
+                    '.bmp': 'image/bmp',
+                    '.webp': 'image/webp',
+                    '.pdf': 'application/pdf',
+                    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
+                    '.doc': 'application/msword',
+                    '.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
+                    '.xls': 'application/vnd.ms-excel',
+                    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
+                    '.ppt': 'application/vnd.ms-powerpoint',
+                    '.html': 'text/html',
+                    '.htm': 'text/html',
+                    '.css': 'text/css',
+                    '.zip': 'application/zip',
+                    '.rar': 'application/x-rar-compressed',
+                    '.7z': 'application/x-7z-compressed',
+                    '.tar': 'application/x-tar',
+                    '.gz': 'application/gzip'
+                }
+                if ext in extToMime:
+                    return extToMime[ext]
+
+            # Try to detect from content
+            if fileData.startswith(b'%PDF'):
+                return 'application/pdf'
+            elif fileData.startswith(b'PK\x03\x04'):
+                # ZIP-based formats (docx, xlsx, pptx)
+                return 'application/zip'
+            elif fileData.startswith(b'<'):
+                # XML-based formats
+                try:
+                    text = fileData.decode('utf-8', errors='ignore')
+                    if '<svg' in text.lower():
+                        return 'image/svg+xml'
+                    elif '<html' in text.lower():
+                        return 'text/html'
+                    else:
+                        return 'application/xml'
+                except:
+                    pass
+            elif fileData.startswith(b'\x89PNG\r\n\x1a\n'):
+                return 'image/png'
+            elif fileData.startswith(b'\xff\xd8\xff'):
+                return 'image/jpeg'
+            elif fileData.startswith(b'GIF87a') or fileData.startswith(b'GIF89a'):
+                return 'image/gif'
+            elif fileData.startswith(b'BM'):
+                return 'image/bmp'
+            elif fileData.startswith(b'RIFF') and fileData[8:12] == b'WEBP':
+                return 'image/webp'
+
+            return 'application/octet-stream'
+
+        except Exception as e:
+            logger.error(f"Error detecting content type from data: {str(e)}")
+            return 'application/octet-stream'
+
+    def getMimeTypeFromExtension(self, extension: str) -> str:
+        """
+        Get MIME type based on file extension.
+        This method consolidates MIME type detection from extension.
+
+        Args:
+            extension: File extension (with or without dot)
+
+        Returns:
+            str: MIME type for the extension
+        """
+        # Normalize extension (remove dot if present)
+        if extension.startswith('.'):
+            extension = extension[1:]
+
+        # Map extensions to MIME types
+        mime_types = {
+            'txt': 'text/plain',
+            'json': 'application/json',
+            'xml': 'application/xml',
+            'csv': 'text/csv',
+            'html': 'text/html',
+            'htm': 'text/html',
+            'md': 'text/markdown',
+            'py': 'text/x-python',
+            'js': 'application/javascript',
+            'css': 'text/css',
+            'pdf': 'application/pdf',
+            'doc': 'application/msword',
+            'docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
+            'xls': 'application/vnd.ms-excel',
+            'xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
+            'ppt': 'application/vnd.ms-powerpoint',
+            'pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
+            'svg': 'image/svg+xml',
+            'jpg': 'image/jpeg',
+            'jpeg': 'image/jpeg',
+            'png': 'image/png',
+            'gif': 'image/gif',
+            'bmp': 'image/bmp',
+            'webp': 'image/webp',
+            'zip': 'application/zip',
+            'rar': 'application/x-rar-compressed',
+            '7z': 'application/x-7z-compressed',
+            'tar': 'application/x-tar',
+            'gz': 'application/gzip'
+        }
+        return mime_types.get(extension.lower(), 'application/octet-stream')
+
+    def getFileExtension(self, filename: str) -> str:
+        """
+        Extract file extension from filename.
+
+        Args:
+            filename: Name of the file
+
+        Returns:
+            str: File extension (without dot)
+        """
+        if '.' in filename:
+            return filename.split('.')[-1].lower()
+        return "txt"  # Default to text
+
     # ===== Functions =====

     def extractContent(self, prompt: str, document: ChatDocument) -> str:
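The content-sniffing half of `detectContentTypeFromData` checks well-known magic bytes before giving up. A trimmed standalone sketch of just that fallback branch (extension map omitted):

```python
def sniff_mime(data: bytes) -> str:
    """Detect a MIME type from leading magic bytes, mirroring the
    content-based fallback of detectContentTypeFromData."""
    if data.startswith(b'%PDF'):
        return 'application/pdf'
    if data.startswith(b'PK\x03\x04'):
        return 'application/zip'  # docx/xlsx/pptx are ZIP containers
    if data.startswith(b'\x89PNG\r\n\x1a\n'):
        return 'image/png'
    if data.startswith(b'\xff\xd8\xff'):
        return 'image/jpeg'
    if data.startswith(b'GIF87a') or data.startswith(b'GIF89a'):
        return 'image/gif'
    if data.startswith(b'BM'):
        return 'image/bmp'
    if data.startswith(b'RIFF') and data[8:12] == b'WEBP':
        return 'image/webp'
    return 'application/octet-stream'
```

Magic bytes are more trustworthy than extensions for uploaded files, which is why the method only falls back to them after the extension map misses.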
@@ -399,11 +549,11 @@ Please provide a clear summary of this message."""
         """Advanced text processing using Anthropic"""
         return self.interfaceAiCalls.callAiTextAdvanced(prompt, context)

-    def callAiImageBasic(self, prompt: str, imageData: bytes, mimeType: str) -> str:
+    def callAiImageBasic(self, prompt: str, imageData: str, mimeType: str) -> str:
         """Basic image processing using OpenAI"""
         return self.interfaceAiCalls.callAiImageBasic(prompt, imageData, mimeType)

-    def callAiImageAdvanced(self, prompt: str, imageData: bytes, mimeType: str) -> str:
+    def callAiImageAdvanced(self, prompt: str, imageData: str, mimeType: str) -> str:
         """Advanced image processing using Anthropic"""
         return self.interfaceAiCalls.callAiImageAdvanced(prompt, imageData, mimeType)
@@ -463,6 +613,30 @@ Please provide a clear summary of this message."""
             mimeType=mimeType
         )

+    def extractTextFromContentObjects(self, content_objects: List[Any]) -> List[str]:
+        """
+        Extract text content from ExtractedContent objects or other content objects.
+
+        Args:
+            content_objects: List of ExtractedContent objects or other content objects
+
+        Returns:
+            List of extracted text strings
+        """
+        text_contents = []
+        for content_obj in content_objects:
+            if hasattr(content_obj, 'contents') and content_obj.contents:
+                # Extract text from ContentItem objects
+                for content_item in content_obj.contents:
+                    if hasattr(content_item, 'data') and content_item.data:
+                        text_contents.append(content_item.data)
+            elif isinstance(content_obj, str):
+                text_contents.append(content_obj)
+            else:
+                # Fallback: convert to string representation
+                text_contents.append(str(content_obj))
+        return text_contents
+
     async def executeAction(self, methodName: str, actionName: str, parameters: Dict[str, Any]) -> ActionResult:
         """Execute a method action"""
         try:
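The duck-typed extraction added in this hunk can be exercised with plain stand-in objects. The shapes below (`SimpleNamespace` with `contents`/`data` attributes) are hypothetical substitutes for the real `ExtractedContent` and `ContentItem` models:

```python
from types import SimpleNamespace
from typing import Any, List

def extract_text(content_objects: List[Any]) -> List[str]:
    """Same duck typing as extractTextFromContentObjects: prefer
    .contents items with a truthy .data, then plain strings, then str()."""
    out: List[str] = []
    for obj in content_objects:
        if hasattr(obj, 'contents') and obj.contents:
            for item in obj.contents:
                if hasattr(item, 'data') and item.data:
                    out.append(item.data)
        elif isinstance(obj, str):
            out.append(obj)
        else:
            # Fallback: string representation
            out.append(str(obj))
    return out
```

Note that empty `data` fields are silently skipped, so callers get only non-empty text.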
31 run_document_test.ps1 Normal file

@@ -0,0 +1,31 @@
+# PowerShell script to run document extraction test
+# Usage: .\run_document_test.ps1 [file_path]
+
+param(
+    [string]$FilePath = "test_sample_document.txt"
+)
+
+Write-Host "=== PowerOn Document Extraction Test ===" -ForegroundColor Green
+Write-Host ""
+
+# Check if file exists
+if (-not (Test-Path $FilePath)) {
+    Write-Host "Error: File not found: $FilePath" -ForegroundColor Red
+    Write-Host "Please provide a valid file path as parameter or ensure test_sample_document.txt exists." -ForegroundColor Yellow
+    exit 1
+}
+
+Write-Host "Testing document extraction for file: $FilePath" -ForegroundColor Cyan
+Write-Host "Log file will be: test_document_extraction.log" -ForegroundColor Cyan
+Write-Host ""
+
+# Run the Python test
+try {
+    python test_document_extraction.py $FilePath
+    Write-Host ""
+    Write-Host "Test completed successfully!" -ForegroundColor Green
+    Write-Host "Check test_document_extraction.log for detailed results." -ForegroundColor Cyan
+} catch {
+    Write-Host "Test failed with error: $($_.Exception.Message)" -ForegroundColor Red
+    exit 1
+}
288 test_document_extraction.py Normal file

@@ -0,0 +1,288 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Test procedure for DocumentManager document extraction functionality.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import sys
|
||||
import os
|
||||
import json
|
||||
import argparse
|
||||
from datetime import datetime, UTC
|
||||
from pathlib import Path
|
||||
import logging
|
||||
|
||||
print("Starting test_document_extraction.py...")
|
||||
|
||||
# Configure logging FIRST, before any other imports
|
||||
import logging
|
||||
|
||||
# Clear any existing handlers to avoid duplicate logs
|
||||
for handler in logging.root.handlers[:]:
|
||||
logging.root.removeHandler(handler)
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.DEBUG,
|
||||
format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
|
||||
handlers=[
|
||||
logging.StreamHandler(sys.stdout),
|
||||
logging.FileHandler('test_document_extraction.log', mode='w', encoding='utf-8') # 'w' mode clears the file
|
||||
],
|
||||
force=True # Force reconfiguration even if already configured
|
||||
)
|
||||
|
||||
# Filter out httpcore messages
|
||||
logging.getLogger('httpcore').setLevel(logging.WARNING)
|
||||
logging.getLogger('httpx').setLevel(logging.WARNING)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Set up test configuration
|
||||
os.environ['POWERON_CONFIG_FILE'] = 'test_config.ini'
|
||||
print("Set POWERON_CONFIG_FILE environment variable")
|
||||
|
||||
try:
|
||||
# Import required modules
|
||||
from modules.interfaces.interfaceAppObjects import User, UserConnection
|
||||
from modules.interfaces.interfaceChatModel import ChatWorkflow
|
||||
from modules.workflow.managerDocument import DocumentManager
|
||||
from modules.workflow.serviceContainer import ServiceContainer
|
||||
print("All imports successful")
|
||||
except Exception as e:
|
||||
print(f"Import error: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
sys.exit(1)
|
||||
|
||||
def log_extraction_debug(message: str, data: dict = None):
    """Log extraction debug data with JSON dumps"""
    timestamp = datetime.now(UTC).isoformat()
    if data:
        logger.debug(f"[{timestamp}] {message}\n{json.dumps(data, indent=2, ensure_ascii=False)}")
    else:
        logger.debug(f"[{timestamp}] {message}")


def create_test_user() -> User:
    """Create a test user for the document extraction"""
    return User(
        id="test-user-doc-001",
        mandateId="test-mandate-doc-001",
        username="testuser_doc",
        email="test_doc@example.com",
        fullName="Test Document User",
        enabled=True,
        language="en",
        privilege="user",
        authenticationAuthority="local"
    )


def create_test_workflow() -> ChatWorkflow:
    """Create a test workflow for document extraction"""
    return ChatWorkflow(
        id="test-workflow-doc-001",
        mandateId="test-mandate-doc-001",
        status="running",
        name="Document Extraction Test Workflow",
        currentRound=1,
        lastActivity=datetime.now(UTC).isoformat(),
        startedAt=datetime.now(UTC).isoformat(),
        logs=[],
        messages=[],
        stats=None,
        tasks=[]
    )


def detect_mime_type(file_path: str) -> str:
    """Detect MIME type based on file extension"""
    ext = Path(file_path).suffix.lower()
    mime_types = {
        '.txt': 'text/plain',
        '.md': 'text/markdown',
        '.csv': 'text/csv',
        '.json': 'application/json',
        '.xml': 'application/xml',
        '.js': 'application/javascript',
        '.py': 'application/x-python',
        '.svg': 'image/svg+xml',
        '.jpg': 'image/jpeg',
        '.jpeg': 'image/jpeg',
        '.png': 'image/png',
        '.gif': 'image/gif',
        '.pdf': 'application/pdf',
        '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
        '.doc': 'application/msword',
        '.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        '.xls': 'application/vnd.ms-excel',
        '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
        '.ppt': 'application/vnd.ms-powerpoint',
        '.html': 'text/html',
        '.htm': 'text/html'
    }
    return mime_types.get(ext, 'application/octet-stream')

async def test_document_extraction(file_path: str):
    """Test document extraction from a file path"""
    try:
        # Clear the log file before each run
        log_file_path = "test_document_extraction.log"
        if os.path.exists(log_file_path):
            with open(log_file_path, 'w') as f:
                f.write("")  # Clear the file
            logger.info(f"Cleared log file: {log_file_path}")

        logger.info("=== STARTING DOCUMENT EXTRACTION TEST ===")

        # Validate file path
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        # Get file info
        file_path_obj = Path(file_path)
        filename = file_path_obj.name
        mime_type = detect_mime_type(file_path)
        file_size = file_path_obj.stat().st_size

        log_extraction_debug("File information", {
            "file_path": file_path,
            "filename": filename,
            "mime_type": mime_type,
            "file_size_bytes": file_size,
            "file_size_mb": round(file_size / (1024 * 1024), 2)
        })

        # Read file data
        try:
            with open(file_path, 'rb') as f:
                file_data = f.read()
            log_extraction_debug("File read successfully", {
                "bytes_read": len(file_data),
                "file_encoding": "binary"
            })
        except Exception as e:
            logger.error(f"Error reading file: {str(e)}")
            raise

        # Create test user and workflow
        test_user = create_test_user()
        test_workflow = create_test_workflow()

        # Create service container
        service_container = ServiceContainer(test_user, test_workflow)
        log_extraction_debug("Service container created", {
            "user_id": test_user.id,
            "workflow_id": test_workflow.id
        })

        # Create document manager
        document_manager = DocumentManager(service_container)
        log_extraction_debug("Document manager created")

        # Define extraction prompt
        extraction_prompt = "extract the table and convert it to a csv table"

        log_extraction_debug("Starting document extraction", {
            "prompt": extraction_prompt,
            "filename": filename,
            "mime_type": mime_type
        })

        # Extract content from file data
        try:
            extracted_content = await document_manager.extractContentFromFileData(
                prompt=extraction_prompt,
                fileData=file_data,
                filename=filename,
                mimeType=mime_type,
                base64Encoded=False,
                documentId=f"test-doc-{datetime.now(UTC).timestamp()}"
            )

            # Log extraction results
            extraction_result = {
                "extracted_content_id": extracted_content.id,
                "content_items_count": len(extracted_content.contents)
            }

            # Add objectId and objectType if they exist (set by DocumentManager)
            if hasattr(extracted_content, 'objectId'):
                extraction_result["object_id"] = extracted_content.objectId
            if hasattr(extracted_content, 'objectType'):
                extraction_result["object_type"] = extracted_content.objectType

            log_extraction_debug("Document extraction completed successfully", extraction_result)

            # Log detailed content information
            for i, content_item in enumerate(extracted_content.contents):
                content_info = {
                    "label": content_item.label,
                    "data_length": len(content_item.data) if content_item.data else 0,
                    "data_preview": content_item.data[:500] + "..." if content_item.data and len(content_item.data) > 500 else content_item.data
                }

                # Add metadata if available
                if content_item.metadata:
                    content_info["metadata"] = {
                        "size": content_item.metadata.size,
                        "mime_type": content_item.metadata.mimeType,
                        "base64_encoded": content_item.metadata.base64Encoded,
                        "pages": content_item.metadata.pages
                    }

                log_extraction_debug(f"CONTENT ITEM {i+1}:", content_info)

            # Log summary of all extracted content
            all_content = "\n\n".join([item.data for item in extracted_content.contents if item.data])
            log_extraction_debug("COMPLETE EXTRACTED CONTENT:", {
                "total_length": len(all_content),
                "content": all_content
            })

        except Exception as e:
            log_extraction_debug("DOCUMENT EXTRACTION EXCEPTION:", {
                "error_type": type(e).__name__,
                "error_message": str(e),
                "error_args": e.args if hasattr(e, 'args') else None
            })
            raise

        logger.info("=== DOCUMENT EXTRACTION TEST COMPLETED ===")
        return extracted_content

    except Exception as e:
        logger.error(f"❌ Document extraction test failed with error: {str(e)}")
        log_extraction_debug("Full error details", {
            "error_type": type(e).__name__,
            "error_message": str(e)
        })
        raise

async def main():
    """Main function to run the document extraction test"""
    print("Inside main()")
    logger.info("=" * 50)
    logger.info("DOCUMENT EXTRACTION TEST")
    logger.info("=" * 50)

    # Parse command line arguments
    parser = argparse.ArgumentParser(description='Test document extraction functionality')
    parser.add_argument('file_path', help='Path to the file to extract content from')
    args = parser.parse_args()

    try:
        extracted_content = await test_document_extraction(args.file_path)
        logger.info("=" * 50)
        logger.info("TEST COMPLETED SUCCESSFULLY")
        logger.info("=" * 50)
        return extracted_content
    except Exception as e:
        logger.error("=" * 50)
        logger.error("TEST FAILED")
        logger.error("=" * 50)
        raise


if __name__ == "__main__":
    print("About to run main()")
    asyncio.run(main())
    print("main() finished")
289 test_retry_enhancement.py Normal file

@@ -0,0 +1,289 @@
#!/usr/bin/env python3
"""
Test script for retry enhancement in managerChat.py
Tests that previous action results and review feedback are properly passed to retry prompts.
"""

import asyncio
import logging
import sys
import os

# Add the gateway directory to the Python path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'gateway'))

from modules.workflow.managerChat import ChatManager
from modules.interfaces.interfaceAppModel import User
from modules.interfaces.interfaceChatModel import ChatWorkflow, ChatMessage
from modules.interfaces.interfaceChatObjects import ChatObjects

# Configure logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)


class MockChatObjects(ChatObjects):
    """Mock implementation of ChatObjects for testing"""

    def createTaskAction(self, action_data):
        """Mock task action creation"""
        class MockTaskAction:
            def __init__(self, data):
                self.id = "test_action_id"
                self.execMethod = data.get("execMethod", "unknown")
                self.execAction = data.get("execAction", "unknown")
                self.execParameters = data.get("execParameters", {})
                self.execResultLabel = data.get("execResultLabel", "")
                self.status = data.get("status", "PENDING")
                self.result = ""
                self.error = ""

            def setSuccess(self):
                self.status = "COMPLETED"

            def setError(self, error):
                self.status = "FAILED"
                self.error = error

            def isSuccessful(self):
                return self.status == "COMPLETED"

        return MockTaskAction(action_data)

    def createChatDocument(self, document_data):
        """Mock document creation"""
        class MockChatDocument:
            def __init__(self, data):
                self.fileId = data.get("fileId", "")
                self.filename = data.get("filename", "unknown")
                self.fileSize = data.get("fileSize", 0)
                self.mimeType = data.get("mimeType", "application/octet-stream")
                self.content = ""

        return MockChatDocument(document_data)

    def createWorkflowMessage(self, message_data):
        """Mock message creation"""
        class MockWorkflowMessage:
            def __init__(self, data):
                self.workflowId = data.get("workflowId", "")
                self.role = data.get("role", "assistant")
                self.message = data.get("message", "")
                self.status = data.get("status", "step")
                self.sequenceNr = data.get("sequenceNr", 1)
                self.publishedAt = data.get("publishedAt", "")
                self.actionId = data.get("actionId", "")
                self.actionMethod = data.get("actionMethod", "")
                self.actionName = data.get("actionName", "")
                self.documentsLabel = data.get("documentsLabel", "")
                self.documents = data.get("documents", [])

        return MockWorkflowMessage(message_data)

class MockServiceContainer:
    """Mock service container for testing"""

    def __init__(self, user, workflow):
        self.user = user
        self.workflow = workflow

    def getMethodsList(self):
        """Mock methods list"""
        return ["document.extract(documentList, aiPrompt)", "document.analyze(documentList, aiPrompt)"]

    async def summarizeChat(self, messages):
        """Mock chat summarization"""
        return "Mock chat history summary"

    def getDocumentReferenceList(self):
        """Mock document references"""
        return {
            'chat': [],
            'history': []
        }

    def getConnectionReferenceList(self):
        """Mock connection references"""
        return ["connection1", "connection2"]

    def getFileInfo(self, fileId):
        """Mock file info"""
        return {
            "filename": f"test_file_{fileId}.txt",
            "size": 1024,
            "mimeType": "text/plain"
        }

    def createFile(self, fileName, mimeType, content, base64encoded=False):
        """Mock file creation"""
        return f"file_id_{fileName}"

    def createDocument(self, fileName, mimeType, content, base64encoded=False):
        """Mock document creation"""
        class MockDocument:
            def __init__(self, name, mime, cont):
                self.filename = name
                self.mimeType = mime
                self.content = cont
                self.fileSize = len(cont)

        return MockDocument(fileName, mimeType, content)

    def getFileExtension(self, filename):
        """Mock file extension extraction"""
        return filename.split('.')[-1] if '.' in filename else 'txt'

    def getMimeTypeFromExtension(self, extension):
        """Mock MIME type detection"""
        mime_types = {
            'txt': 'text/plain',
            'pdf': 'application/pdf',
            'doc': 'application/msword',
            'json': 'application/json'
        }
        return mime_types.get(extension, 'application/octet-stream')

    def detectContentTypeFromData(self, file_bytes, filename):
        """Mock content type detection"""
        if filename.endswith('.txt'):
            return 'text/plain'
        elif filename.endswith('.pdf'):
            return 'application/pdf'
        elif filename.endswith('.json'):
            return 'application/json'
        return 'application/octet-stream'

    async def callAiTextBasic(self, prompt):
        """Mock AI call"""
        return '{"actions": [{"method": "document", "action": "extract", "parameters": {"documentList": ["test"], "aiPrompt": "Test prompt"}, "resultLabel": "task1_action1_test", "description": "Test action"}]}'

    async def callAiTextAdvanced(self, prompt):
        """Mock advanced AI call"""
        return '{"overview": "Test plan", "tasks": [{"id": "task_1", "description": "Test task", "dependencies": [], "expected_outputs": ["output1"], "success_criteria": ["criteria1"], "required_documents": [], "estimated_complexity": "low", "ai_prompt": "Test prompt"}]}'

    async def executeAction(self, methodName, actionName, parameters):
        """Mock action execution"""
        class MockResult:
            def __init__(self):
                self.success = True
                self.data = {
                    "result": "Mock execution result",
                    "documents": []
                }
                self.error = None

        return MockResult()

async def test_retry_enhancement():
    """Test the retry enhancement functionality"""
    logger.info("Testing retry enhancement in managerChat.py")

    # Create mock objects
    mock_user = User(id="test_user", username="testuser", email="test@example.com", mandateId="test_mandate")
    mock_chat_objects = MockChatObjects()
    mock_workflow = ChatWorkflow(
        id="test_workflow",
        userId="test_user",
        status="active",
        messages=[],
        createdAt="2024-01-01T00:00:00Z",
        updatedAt="2024-01-01T00:00:00Z",
        mandateId="test_mandate",
        currentRound=1,
        lastActivity="2024-01-01T00:00:00Z",
        startedAt="2024-01-01T00:00:00Z"
    )

    # Create chat manager
    chat_manager = ChatManager(mock_user, mock_chat_objects)

    # Mock the service container directly instead of initializing
    chat_manager.service = MockServiceContainer(mock_user, mock_workflow)
    chat_manager.workflow = mock_workflow

    # Test 1: Basic action definition without retry
    logger.info("Test 1: Basic action definition")
    task_step = {
        "id": "task_1",
        "description": "Test task",
        "expected_outputs": ["output1"],
        "success_criteria": ["criteria1"],
        "ai_prompt": "Test AI prompt"
    }

    actions = await chat_manager.defineTaskActions(task_step, mock_workflow, [])
    logger.info(f"Generated {len(actions)} actions without retry context")

    # Test 2: Action definition with retry context
    logger.info("Test 2: Action definition with retry context")
    enhanced_context = {
        'task_step': task_step,
        'workflow': mock_workflow,
        'workflow_id': mock_workflow.id,
        'available_documents': ["test_doc.txt"],
        'previous_results': ["task0_action1_results"],
        'improvements': "Previous attempt failed - ensure comprehensive extraction",
        'retry_count': 1,
        'previous_action_results': [
            {
                'actionMethod': 'document',
                'actionName': 'extract',
                'status': 'failed',
                'error': 'Empty result returned',
                'result': 'No content extracted',
                'resultLabel': 'task1_action1_failed'
            }
        ],
        'previous_review_result': {
            'status': 'retry',
            'reason': 'Incomplete extraction',
            'quality_score': 3,
            'missing_outputs': ['detailed_analysis'],
            'unmet_criteria': ['comprehensive_coverage']
        }
    }

    retry_actions = await chat_manager.defineTaskActions(task_step, mock_workflow, [], enhanced_context)
    logger.info(f"Generated {len(retry_actions)} actions with retry context")

    # Test 3: Verify retry context is properly handled
    logger.info("Test 3: Verifying retry context handling")

    # Create a test prompt to see if retry context is included
    test_prompt = await chat_manager._createActionDefinitionPrompt(enhanced_context)

    # Check if retry context is in the prompt
    if "RETRY CONTEXT" in test_prompt:
        logger.info("✓ Retry context properly included in prompt")
    else:
        logger.error("✗ Retry context not found in prompt")

    if "Previous action results that failed" in test_prompt:
        logger.info("✓ Previous action results included in prompt")
    else:
        logger.error("✗ Previous action results not found in prompt")

    if "Previous review feedback" in test_prompt:
        logger.info("✓ Previous review feedback included in prompt")
    else:
        logger.error("✗ Previous review feedback not found in prompt")

    if "Previous attempt failed" in test_prompt:
        logger.info("✓ Improvements needed included in prompt")
    else:
        logger.error("✗ Improvements needed not found in prompt")

    # Test 4: Verify fallback actions with retry context
    logger.info("Test 4: Testing fallback actions with retry context")
    fallback_actions = chat_manager._createFallbackActions(task_step, enhanced_context)
    logger.info(f"Generated {len(fallback_actions)} fallback actions with retry context")

    # Check if fallback actions include retry information
    if any("retry" in action.get("resultLabel", "") for action in fallback_actions):
        logger.info("✓ Fallback actions include retry information")
    else:
        logger.error("✗ Fallback actions missing retry information")

    logger.info("Retry enhancement test completed successfully!")


if __name__ == "__main__":
    asyncio.run(test_retry_enhancement())
47 test_sample_document.txt Normal file

@@ -0,0 +1,47 @@
PowerOn System Architecture Overview

This document provides a comprehensive overview of the PowerOn system architecture, including its key components, data flow, and technical specifications.

MAJOR TOPICS:

1. System Architecture
   - Frontend Agents: Web-based user interface components
   - Gateway: Central API and workflow management system
   - Database: JSON-based data storage with component interfaces
   - AI Integration: Anthropic and OpenAI connectors for intelligent processing

2. Core Components
   - Document Manager: Handles file processing and content extraction
   - Workflow Manager: Orchestrates complex business processes
   - Service Container: Provides unified access to all system services
   - Neutralizer: Data anonymization and privacy protection

3. Data Flow Architecture
   - User authentication and authorization
   - Document upload and processing pipeline
   - AI-powered content analysis and extraction
   - Workflow execution and task management
   - Result generation and storage

4. Technical Specifications
   - Python-based backend with async/await support
   - RESTful API design with JSON data exchange
   - Modular component architecture
   - Extensible method system for business logic
   - Comprehensive logging and monitoring

5. Security Features
   - Multi-authentication authority support (Local, Microsoft, Google)
   - Token-based session management
   - Data encryption and anonymization
   - Role-based access control
   - Audit trail and compliance features

6. Integration Capabilities
   - SharePoint document management
   - Email system integration (Outlook)
   - Web crawling and data collection
   - AI service integration (Anthropic, OpenAI)
   - Custom method development framework

The PowerOn system is designed to provide a comprehensive platform for intelligent document processing, workflow automation, and AI-powered business process management. It combines modern web technologies with advanced AI capabilities to deliver a robust and scalable solution for enterprise document management and workflow automation.