Document Management Refactoring Specification
Overview
This specification describes the refactoring of document management in the system. It covers three areas: separating the document models, centralizing content extraction in a single service, and integrating neutralization behind a feature flag.
Model Structure
Base Document Models
import uuid
from typing import List, Optional

from pydantic import BaseModel, Field

# ModelMixin is the project's own mixin; its import path is not given in this specification.


class ContentMetadata(BaseModel, ModelMixin):
    """Metadata for content items"""
    size: int = Field(description="Content size in bytes")
    pages: Optional[int] = Field(None, description="Number of pages for multi-page content")
    error: Optional[str] = Field(None, description="Processing error if any")
    # Media-specific attributes
    width: Optional[int] = Field(None, description="Width in pixels for images/videos")
    height: Optional[int] = Field(None, description="Height in pixels for images/videos")
    colorMode: Optional[str] = Field(None, description="Color mode (e.g., RGB, CMYK, grayscale)")
    fps: Optional[float] = Field(None, description="Frames per second for videos")
    durationSec: Optional[float] = Field(None, description="Duration in seconds for videos/audio")


class ContentItem(BaseModel, ModelMixin):
    """Individual content item from a document"""
    label: str = Field(description="Content label (e.g., tab name, tag name)")
    data: str = Field(description="Text content")
    metadata: ContentMetadata = Field(description="Content metadata")


class ChatDocument(BaseModel, ModelMixin):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    fileId: str
    filename: str
    fileSize: int
    mimeType: str


class TaskDocument(BaseModel, ModelMixin):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    filename: str
    fileSize: int
    mimeType: str
    data: str  # Base64 encoded file data


class ExtractedContent(BaseModel, ModelMixin):
    objectId: str  # Reference to source document
    objectType: str = Field(description="Type of source object ('ChatDocument' or 'TaskDocument')")
    contents: List[ContentItem]
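To illustrate how these models fit together, the sketch below wraps raw bytes in a TaskDocument and packages the decoded text as an ExtractedContent. It uses standard-library dataclasses as stand-ins for the pydantic models so that it runs without dependencies, and it omits the media-specific metadata fields. The helper makeTaskDocument is hypothetical and not part of this specification.

```python
import base64
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


# Dataclass stand-ins for the pydantic models above (assumption: same field names).
@dataclass
class ContentMetadata:
    size: int
    pages: Optional[int] = None
    error: Optional[str] = None


@dataclass
class ContentItem:
    label: str
    data: str
    metadata: ContentMetadata


@dataclass
class TaskDocument:
    filename: str
    fileSize: int
    mimeType: str
    data: str  # Base64 encoded file data
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class ExtractedContent:
    objectId: str
    objectType: str
    contents: List[ContentItem]


def makeTaskDocument(filename: str, mimeType: str, raw: bytes) -> TaskDocument:
    """Hypothetical helper: wrap raw bytes in a TaskDocument."""
    return TaskDocument(
        filename=filename,
        fileSize=len(raw),
        mimeType=mimeType,
        data=base64.b64encode(raw).decode("ascii"),
    )


doc = makeTaskDocument("notes.txt", "text/plain", b"hello world")

# Decode the payload and record its size in bytes as metadata.
text = base64.b64decode(doc.data).decode("utf-8")
item = ContentItem(
    label="body",
    data=text,
    metadata=ContentMetadata(size=len(text.encode("utf-8"))),
)
extracted = ExtractedContent(objectId=doc.id, objectType="TaskDocument", contents=[item])
```

Keeping the Base64 payload only on TaskDocument means a ChatDocument stays a lightweight reference (fileId plus file attributes) until extraction is actually requested.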
Service Layer Structure
Document Service
class DocumentService:
    def __init__(self, service_container):
        self.service = service_container
        self.neutralizer_enabled = False  # Flag for neutralization feature

    async def extractFromChatDocument(self, prompt: str, document: ChatDocument) -> ExtractedContent:
        """
        Extract content from a ChatDocument by converting it to TaskDocument first.
        """
        # Convert ChatDocument to TaskDocument
        task_doc = await self._convertToTaskDocument(document)
        extracted = await self.getDocumentContent(task_doc, prompt)
        # Reference the source ChatDocument, not the intermediate TaskDocument
        extracted.objectId = document.id
        extracted.objectType = "ChatDocument"
        return extracted

    async def extractFromTaskDocument(self, prompt: str, document: TaskDocument) -> ExtractedContent:
        """
        Extract content directly from a TaskDocument.
        """
        return await self.getDocumentContent(document, prompt)

    async def getDocumentContent(self, document: TaskDocument, prompt: str) -> ExtractedContent:
        """
        Helper function for centralized content extraction.
        Handles the actual content extraction and optional neutralization.
        """
        # Extract content based on mimeType
        content = await self._extractRawContent(document)

        # Apply neutralization if enabled
        if self.neutralizer_enabled:
            from modules.neutralizer import neutralizer
            content = await neutralizer.process_content(content)

        # Process content with AI using prompt
        processed_content = await self._processWithAI(content, prompt)

        return ExtractedContent(
            objectId=document.id,
            objectType="TaskDocument",
            contents=processed_content
        )
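The specification does not define _extractRawContent beyond "extract content based on mimeType". One possible shape is a registry that maps MIME types to extractor functions, sketched below with the standard library only. The registry, the three extractor functions, and the supported MIME types are assumptions made for illustration; the real service would return ContentItem objects rather than (label, text) tuples.

```python
import base64
import csv
import io
import json
from typing import Callable, Dict, List, Tuple

# (label, text) pairs; in the real service these would become ContentItem objects.
RawContent = List[Tuple[str, str]]


def _fromPlainText(raw: bytes) -> RawContent:
    return [("text", raw.decode("utf-8"))]


def _fromJson(raw: bytes) -> RawContent:
    # Re-serialize so downstream consumers see normalized JSON.
    return [("json", json.dumps(json.loads(raw), indent=2, sort_keys=True))]


def _fromCsv(raw: bytes) -> RawContent:
    rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))
    return [("table", "\n".join(" | ".join(row) for row in rows))]


# Hypothetical registry; the specification does not list supported MIME types.
EXTRACTORS: Dict[str, Callable[[bytes], RawContent]] = {
    "text/plain": _fromPlainText,
    "application/json": _fromJson,
    "text/csv": _fromCsv,
}


def extractRawContent(mimeType: str, base64Data: str) -> RawContent:
    """Decode the Base64 payload and dispatch on MIME type."""
    extractor = EXTRACTORS.get(mimeType)
    if extractor is None:
        raise ValueError(f"Unsupported MIME type: {mimeType}")
    return extractor(base64.b64decode(base64Data))
```

With a registry, supporting a new file type means adding one function and one dictionary entry, which keeps getDocumentContent itself unchanged.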
Implementation Steps
- Model Cleanup
  - Create new model classes in interfaceChatModel.py
  - Remove deprecated models:
    - DocumentExtraction
    - DocumentContext
    - ProcessedDocument
    - ChatContent (replaced by ContentItem)
  - Update ChatDocument to remove contents attribute
  - Convert all snake_case to camelCase in manager*.py and method*.py
- Service Implementation
  - Create new DocumentService class in serviceDocument.py
  - Implement the three main methods:
    - extractFromChatDocument
    - extractFromTaskDocument
    - getDocumentContent (helper function)
  - Add neutralization integration with feature flag
- UserInput Processing
  - Update UserInputRequest processing to use ChatMessage
  - Implement processFileIds in interfaceChatObjects
  - Update all references to use new model structure
- Method Module Updates
  - Update all method*.py modules to use new service layer
  - Remove direct file access
  - Implement proper error handling and logging
- Testing and Validation
  - Create unit tests for new models and services
  - Test document processing with various file types
  - Validate content extraction and neutralization
  - Test error handling and edge cases
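The snake_case to camelCase conversion named under Model Cleanup is mechanical, so a small helper can generate the rename list for review. The following is a minimal sketch; the function name snakeToCamel is hypothetical, and the choice to preserve leading underscores (private helpers) and dunder names is an assumption about the desired convention.

```python
import re

# An underscore run followed by a letter or digit marks a word boundary.
_SNAKE_PART = re.compile(r"_+([a-zA-Z0-9])")


def snakeToCamel(name: str) -> str:
    """Convert snake_case to camelCase, preserving leading underscores."""
    stripped = name.lstrip("_")
    prefix = name[: len(name) - len(stripped)]
    camel = _SNAKE_PART.sub(lambda m: m.group(1).upper(), stripped)
    return prefix + camel


for oldName in ("task_doc", "neutralizer_enabled", "_extract_raw_content"):
    print(oldName, "->", snakeToCamel(oldName))
```

A helper like this only proposes names. The actual renames in manager*.py and method*.py still need a refactoring tool or careful review so that string literals and external API fields are not changed by accident.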
Files to be Removed/Modified
To be Removed
- DocumentExtraction class from interfaceChatModel.py
- DocumentContext class from interfaceChatModel.py
- ProcessedDocument class from interfaceChatModel.py
- ChatContent class from interfaceChatModel.py
- Direct file access methods from method*.py modules
To be Modified
- interfaceChatModel.py
  - Add new model classes
  - Remove deprecated classes
  - Update existing classes
- managerDocument.py
  - Move core functionality to DocumentService
  - Update to use new model structure
  - Remove redundant methods
- method*.py modules
  - Update to use DocumentService
  - Remove direct file access
  - Update error handling
- interfaceChatObjects.py
  - Implement processFileIds
  - Update document handling
Neutralization Integration
The neutralization feature is integrated into the getDocumentContent method behind a feature flag (neutralizer_enabled, which defaults to False). When the flag is enabled, content passes through the neutralizer before it is sent to AI processing.
# In getDocumentContent method
if self.neutralizer_enabled:
    from modules.neutralizer import neutralizer
    content = await neutralizer.process_content(content)
This allows for easy enabling/disabling of the feature and future expansion of neutralization capabilities.
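To make the flag's effect concrete, the sketch below runs the same content through a service with the flag off and on. The Neutralizer class here is a hypothetical stand-in for modules.neutralizer that only masks e-mail addresses; the specification does not say what the real neutralizer does. DocumentServiceSketch and prepareContent are likewise illustrative names.

```python
import asyncio
import re
from typing import List

_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class Neutralizer:
    """Hypothetical stand-in for modules.neutralizer; masks e-mail addresses only."""

    async def process_content(self, content: List[str]) -> List[str]:
        return [_EMAIL.sub("[EMAIL]", text) for text in content]


class DocumentServiceSketch:
    def __init__(self, neutralizer_enabled: bool = False):
        self.neutralizer_enabled = neutralizer_enabled
        self.neutralizer = Neutralizer()

    async def prepareContent(self, content: List[str]) -> List[str]:
        # Mirrors the flag check in getDocumentContent.
        if self.neutralizer_enabled:
            content = await self.neutralizer.process_content(content)
        return content


raw = ["Contact jane.doe@example.com for details."]
unchanged = asyncio.run(DocumentServiceSketch(False).prepareContent(raw))
masked = asyncio.run(DocumentServiceSketch(True).prepareContent(raw))
```

Because neutralization runs before the AI step, sensitive values never reach the model when the flag is on, and the rest of the pipeline is identical in both modes.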
Migration Strategy
- Create new models and services
- Implement new functionality alongside existing code
- Gradually migrate method modules to use new services
- Remove deprecated code once migration is complete
- Enable neutralization feature when ready
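One way to run old and new code side by side during the migration is to keep each legacy entry point as a thin shim that delegates to DocumentService and emits a deprecation warning. The sketch below assumes a legacy function named extract_document, which is hypothetical, and uses a minimal stand-in for the service with plain dictionaries in place of the models.

```python
import asyncio
import warnings


class DocumentService:
    """Minimal stand-in for the new service."""

    async def extractFromTaskDocument(self, prompt: str, document: dict) -> dict:
        return {"objectId": document["id"], "objectType": "TaskDocument", "contents": []}


_service = DocumentService()


async def extract_document(document: dict, prompt: str) -> dict:
    """Hypothetical legacy entry point kept during migration."""
    warnings.warn(
        "extract_document is deprecated; use DocumentService.extractFromTaskDocument",
        DeprecationWarning,
        stacklevel=2,
    )
    # Delegate to the new service so both paths share one implementation.
    return await _service.extractFromTaskDocument(prompt, document)


result = asyncio.run(extract_document({"id": "d1"}, "Summarize the document"))
```

Once no caller triggers the warning any longer, the shim can be deleted as part of the "remove deprecated code" step.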
Testing Requirements
- Unit tests for all new model classes
- Integration tests for DocumentService
- Tests for content extraction with various file types
- Tests for neutralization integration
- Performance tests for large file handling
- Error handling and edge case tests