# Document Management Refactoring Specification

## Overview

This specification describes the refactoring of document management in the system. It focuses on three goals: clear separation of the document models, a single centralized path for content extraction, and neutralization integrated behind a feature flag so that it can be enabled later without further restructuring.

## Model Structure

### Base Document Models

```python
import uuid
from typing import List, Optional

from pydantic import BaseModel, Field

# ModelMixin is a project-internal mixin and is assumed to be imported elsewhere.


class ContentMetadata(BaseModel, ModelMixin):
    """Metadata for content items"""
    size: int = Field(description="Content size in bytes")
    pages: Optional[int] = Field(None, description="Number of pages for multi-page content")
    error: Optional[str] = Field(None, description="Processing error if any")
    # Media-specific attributes
    width: Optional[int] = Field(None, description="Width in pixels for images/videos")
    height: Optional[int] = Field(None, description="Height in pixels for images/videos")
    colorMode: Optional[str] = Field(None, description="Color mode (e.g., RGB, CMYK, grayscale)")
    fps: Optional[float] = Field(None, description="Frames per second for videos")
    durationSec: Optional[float] = Field(None, description="Duration in seconds for videos/audio")


class ContentItem(BaseModel, ModelMixin):
    """Individual content item from a document"""
    label: str = Field(description="Content label (e.g., tab name, tag name)")
    data: str = Field(description="Text content")
    metadata: ContentMetadata = Field(description="Content metadata")


class ChatDocument(BaseModel, ModelMixin):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    fileId: str
    filename: str
    fileSize: int
    mimeType: str


class TaskDocument(BaseModel, ModelMixin):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    filename: str
    fileSize: int
    mimeType: str
    data: str  # Base64 encoded file data


class ExtractedContent(BaseModel, ModelMixin):
    objectId: str  # Reference to source document
    objectType: str = Field(description="Type of source object ('ChatDocument' or 'TaskDocument')")
    contents: List[ContentItem]
```
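
As a rough illustration of how these models might fit together, the sketch below builds an `ExtractedContent` holding one `ContentItem` per sheet of a spreadsheet. To stay dependency-free it uses plain dataclasses as stand-ins for the `BaseModel` classes above and leaves out `ModelMixin` and the media attributes. The `makeItem` helper, the filename, and the sample values are invented for the example and are not part of the specification.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ContentMetadata:
    size: int
    pages: Optional[int] = None
    error: Optional[str] = None


@dataclass
class ContentItem:
    label: str
    data: str
    metadata: ContentMetadata


@dataclass
class TaskDocument:
    filename: str
    fileSize: int
    mimeType: str
    data: str  # Base64 encoded file data
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class ExtractedContent:
    objectId: str
    objectType: str
    contents: List[ContentItem]


def makeItem(label: str, text: str) -> ContentItem:
    """Hypothetical helper: size is the UTF-8 byte length of the text."""
    size = len(text.encode("utf-8"))
    return ContentItem(label=label, data=text, metadata=ContentMetadata(size=size))


# Hypothetical two-sheet workbook; the Base64 payload is left empty here.
doc = TaskDocument(
    filename="report.xlsx",
    fileSize=2048,
    mimeType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    data="",
)
extracted = ExtractedContent(
    objectId=doc.id,
    objectType="TaskDocument",
    contents=[makeItem("Summary", "total,42"), makeItem("Details", "a,1\nb,2")],
)

# asdict() gives a nested dict, roughly what a serialized model would look like.
print(asdict(extracted)["contents"][0]["label"])
```

Using the sheet name as `label` follows the "tab name" hint in the `ContentItem` field description; other document types would need their own labelling convention.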

## Service Layer Structure

### Document Service

```python
class DocumentService:
    def __init__(self, service_container):
        self.service = service_container
        self.neutralizer_enabled = False  # Flag for neutralization feature

    async def extractFromChatDocument(self, prompt: str, document: ChatDocument) -> ExtractedContent:
        """
        Extract content from a ChatDocument by converting it to TaskDocument first.
        """
        # Convert ChatDocument to TaskDocument
        task_doc = await self._convertToTaskDocument(document)
        return await self.getDocumentContent(task_doc, prompt)

    async def extractFromTaskDocument(self, prompt: str, document: TaskDocument) -> ExtractedContent:
        """
        Extract content directly from a TaskDocument.
        """
        return await self.getDocumentContent(document, prompt)

    async def getDocumentContent(self, document: TaskDocument, prompt: str) -> ExtractedContent:
        """
        Helper function for centralized content extraction.
        Handles the actual content extraction and optional neutralization.
        """
        # Extract content based on mimeType
        content = await self._extractRawContent(document)

        # Apply neutralization if enabled
        if self.neutralizer_enabled:
            from modules.neutralizer import neutralizer
            content = await neutralizer.process_content(content)

        # Process content with AI using prompt
        processed_content = await self._processWithAI(content, prompt)

        # Note: for ChatDocument input, this is the id of the converted TaskDocument
        return ExtractedContent(
            objectId=document.id,
            objectType="TaskDocument",
            contents=processed_content
        )
```
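
The specification does not define `_extractRawContent`, only that it extracts content "based on mimeType". One way this could work is a small registry that maps MIME types to extractor functions, as sketched below. The registry, the function names, and the list of supported types are all assumptions for illustration. The sketch is also synchronous and takes the two relevant fields directly rather than a `TaskDocument`, purely to keep it short.

```python
import base64
from typing import Callable, Dict


def decodeText(raw: bytes) -> str:
    """Decode text content, replacing undecodable bytes instead of failing."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical registry: MIME type -> function turning raw bytes into text.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "text/plain": decodeText,
    "text/csv": decodeText,
    "text/markdown": decodeText,
}


def extractRawContent(mimeType: str, data: str) -> str:
    """Look up the extractor for the MIME type, then decode the Base64 payload."""
    extractor = EXTRACTORS.get(mimeType)
    if extractor is None:
        raise ValueError(f"Unsupported mimeType: {mimeType}")
    raw = base64.b64decode(data)
    return extractor(raw)


# Usage: TaskDocument.data carries the file as Base64 text.
payload = base64.b64encode("hello, world".encode("utf-8")).decode("ascii")
print(extractRawContent("text/plain", payload))
```

A registry keeps the per-format logic out of `getDocumentContent`, so supporting a new file type means adding one entry rather than another branch.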

## Implementation Steps

1. **Model Cleanup**
   - Create new model classes in `serviceChatModel.py`
   - Remove deprecated models:
     - `DocumentExtraction`
     - `DocumentContext`
     - `ProcessedDocument`
     - `ChatContent` (replaced by `ContentItem`)
   - Update `ChatDocument` to remove the `contents` attribute
   - Convert all snake_case identifiers to camelCase in `manager*.py` and `method*.py`

2. **Service Implementation**
   - Create new `DocumentService` class in `serviceDocument.py`
   - Implement the three main methods:
     - `extractFromChatDocument`
     - `extractFromTaskDocument`
     - `getDocumentContent` (helper function)
   - Add neutralization integration with feature flag

3. **UserInput Processing**
   - Update `UserInputRequest` processing to use `ChatMessage`
   - Implement `processFileIds` in `serviceChatClass.py`
   - Update all references to use new model structure

4. **Method Module Updates**
   - Update all `method*.py` modules to use new service layer
   - Remove direct file access
   - Implement proper error handling and logging

5. **Testing and Validation**
   - Create unit tests for new models and services
   - Test document processing with various file types
   - Validate content extraction and neutralization
   - Test error handling and edge cases
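
Step 4 asks for proper error handling and logging without prescribing a shape. Since `ContentMetadata` already carries an `error` field, one option is to catch extraction failures, log them, and return a `ContentItem` that records the error instead of raising. The sketch below assumes that approach. `safeExtract` and the logger name are hypothetical, and the dataclasses are trimmed stand-ins for the real models.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("documentService")  # hypothetical logger name


@dataclass
class ContentMetadata:
    size: int
    error: Optional[str] = None


@dataclass
class ContentItem:
    label: str
    data: str
    metadata: ContentMetadata


def safeExtract(label: str, extractor: Callable[[], str]) -> ContentItem:
    """Run an extractor; on failure, log it and report it via metadata.error."""
    try:
        text = extractor()
    except Exception as exc:
        # logger.exception also records the traceback of the active exception.
        logger.exception("Extraction failed for %s", label)
        return ContentItem(label=label, data="", metadata=ContentMetadata(size=0, error=str(exc)))
    size = len(text.encode("utf-8"))
    return ContentItem(label=label, data=text, metadata=ContentMetadata(size=size))


def brokenExtractor() -> str:
    raise ValueError("corrupt file")


ok = safeExtract("Sheet1", lambda: "abc")
failed = safeExtract("Sheet2", brokenExtractor)
print(ok.metadata.error, failed.metadata.error)
```

Whether a failed item should be returned or the whole extraction aborted is a design decision the specification leaves open; this variant lets one bad sheet or page fail without discarding the rest.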

## Files to be Removed/Modified

### To be Removed

1. `DocumentExtraction` class from `serviceChatModel.py`
2. `DocumentContext` class from `serviceChatModel.py`
3. `ProcessedDocument` class from `serviceChatModel.py`
4. `ChatContent` class from `serviceChatModel.py`
5. Direct file access methods from `method*.py` modules

### To be Modified

1. `serviceChatModel.py`
   - Add new model classes
   - Remove deprecated classes
   - Update existing classes

2. `managerDocument.py`
   - Move core functionality to `DocumentService`
   - Update to use new model structure
   - Remove redundant methods

3. `method*.py` modules
   - Update to use `DocumentService`
   - Remove direct file access
   - Update error handling

4. `serviceChatClass.py`
   - Implement `processFileIds`
   - Update document handling

## Neutralization Integration

Neutralization is integrated into the `getDocumentContent` method behind the `neutralizer_enabled` feature flag. When the flag is set, the extracted content is passed through the neutralizer before it is sent to AI processing; when it is not set, the content goes to AI processing unchanged.

```python
# In getDocumentContent method
if self.neutralizer_enabled:
    from modules.neutralizer import neutralizer
    content = await neutralizer.process_content(content)
```
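
Because the check sits behind a single flag inside `getDocumentContent`, the feature can be switched on or off in one place, and neutralization can be extended later without changing the extraction methods that call it.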
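
To make the flag's effect concrete, here is a minimal runnable sketch. The specification does not say what the neutralizer actually does, so `StubNeutralizer` and its e-mail masking rule are purely assumptions. `DocumentServiceSketch` and `prepareContent` are likewise hypothetical names covering only the neutralization step, and the neutralizer is passed in rather than imported so the example runs on its own.

```python
import asyncio
import re


class StubNeutralizer:
    """Hypothetical stand-in for modules.neutralizer; masks e-mail addresses."""

    async def process_content(self, content: str) -> str:
        return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "[EMAIL]", content)


class DocumentServiceSketch:
    """Only the neutralization step of getDocumentContent, nothing else."""

    def __init__(self, neutralizer, neutralizer_enabled: bool = False):
        self.neutralizer = neutralizer
        self.neutralizer_enabled = neutralizer_enabled

    async def prepareContent(self, content: str) -> str:
        # Content passes through unchanged unless the flag is set.
        if self.neutralizer_enabled:
            content = await self.neutralizer.process_content(content)
        return content


async def demo() -> None:
    text = "Contact bob@example.com for details"
    for enabled in (False, True):
        service = DocumentServiceSketch(StubNeutralizer(), neutralizer_enabled=enabled)
        print(enabled, await service.prepareContent(text))


asyncio.run(demo())
```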

## Migration Strategy

1. Create new models and services
2. Implement new functionality alongside existing code
3. Gradually migrate method modules to use new services
4. Remove deprecated code once migration is complete
5. Enable neutralization feature when ready

## Testing Requirements

1. Unit tests for all new model classes
2. Integration tests for `DocumentService`
3. Tests for content extraction with various file types
4. Tests for neutralization integration
5. Performance tests for large file handling
6. Error handling and edge case tests
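
As one possible starting point for the `DocumentService` tests, the sketch below checks that `extractFromChatDocument` converts the document before delegating to `getDocumentContent`. It uses the standard library's `unittest.IsolatedAsyncioTestCase`. `DocumentServiceUnderTest` is a hypothetical trimmed copy of the control flow with stubbed helpers; a real test would more likely subclass the actual `DocumentService` and override its private methods.

```python
import unittest
from types import SimpleNamespace


class DocumentServiceUnderTest:
    """Hypothetical trimmed copy of the spec's control flow with stubbed helpers."""

    def __init__(self):
        self.calls = []

    async def _convertToTaskDocument(self, document):
        self.calls.append("convert")
        return SimpleNamespace(id="task-1")

    async def getDocumentContent(self, document, prompt):
        self.calls.append("extract")
        return {"objectId": document.id, "prompt": prompt}

    async def extractFromChatDocument(self, prompt, document):
        task_doc = await self._convertToTaskDocument(document)
        return await self.getDocumentContent(task_doc, prompt)


class ExtractFromChatDocumentTest(unittest.IsolatedAsyncioTestCase):
    async def test_converts_before_extracting(self):
        service = DocumentServiceUnderTest()
        chat_doc = SimpleNamespace(id="chat-1")

        result = await service.extractFromChatDocument("summarize", chat_doc)

        # Conversion must happen first, and the result refers to the task document.
        self.assertEqual(service.calls, ["convert", "extract"])
        self.assertEqual(result["objectId"], "task-1")
        self.assertEqual(result["prompt"], "summarize")
```

Saved in a test module, this can be run with `python -m unittest`.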