# Document Management Refactoring Specification

## Overview

This specification outlines the refactoring of document management in the system, focusing on proper model separation, centralized content extraction, and future-proof neutralization integration.

## Model Structure

### Base Document Models

```python
import uuid
from typing import List, Optional

from pydantic import BaseModel, Field

# ModelMixin is the project's own mixin class; its import path is not given
# in this specification and must be added alongside the imports above.


class ContentMetadata(BaseModel, ModelMixin):
    """Metadata for content items"""
    size: int = Field(description="Content size in bytes")
    pages: Optional[int] = Field(None, description="Number of pages for multi-page content")
    error: Optional[str] = Field(None, description="Processing error if any")

    # Media-specific attributes
    width: Optional[int] = Field(None, description="Width in pixels for images/videos")
    height: Optional[int] = Field(None, description="Height in pixels for images/videos")
    colorMode: Optional[str] = Field(None, description="Color mode (e.g., RGB, CMYK, grayscale)")
    fps: Optional[float] = Field(None, description="Frames per second for videos")
    durationSec: Optional[float] = Field(None, description="Duration in seconds for videos/audio")


class ContentItem(BaseModel, ModelMixin):
    """Individual content item from a document"""
    label: str = Field(description="Content label (e.g., tab name, tag name)")
    data: str = Field(description="Text content")
    metadata: ContentMetadata = Field(description="Content metadata")


class ChatDocument(BaseModel, ModelMixin):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    fileId: str
    filename: str
    fileSize: int
    mimeType: str


class TaskDocument(BaseModel, ModelMixin):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    filename: str
    fileSize: int
    mimeType: str
    data: str  # Base64 encoded file data


class ExtractedContent(BaseModel, ModelMixin):
    objectId: str  # Reference to source document
    objectType: str = Field(description="Type of source object ('ChatDocument' or 'TaskDocument')")
    contents: List[ContentItem]
```

## Service Layer Structure

### Document Service

```python
# ChatDocument, TaskDocument and ExtractedContent are the models defined above.
# The private helpers _convertToTaskDocument, _extractRawContent and
# _processWithAI are referenced here but not defined in this specification.


class DocumentService:
    def __init__(self, service_container):
        self.service = service_container
        self.neutralizer_enabled = False  # Flag for neutralization feature

    async def extractFromChatDocument(self, prompt: str, document: ChatDocument) -> ExtractedContent:
        """
        Extract content from a ChatDocument by converting it to TaskDocument first.
        """
        # Convert ChatDocument to TaskDocument
        task_doc = await self._convertToTaskDocument(document)
        return await self.getDocumentContent(task_doc, prompt)

    async def extractFromTaskDocument(self, prompt: str, document: TaskDocument) -> ExtractedContent:
        """
        Extract content directly from a TaskDocument.
        """
        return await self.getDocumentContent(document, prompt)

    async def getDocumentContent(self, document: TaskDocument, prompt: str) -> ExtractedContent:
        """
        Helper function for centralized content extraction.
        Handles the actual content extraction and optional neutralization.
        """
        # Extract content based on mimeType
        content = await self._extractRawContent(document)

        # Apply neutralization if enabled
        if self.neutralizer_enabled:
            from modules.neutralizer import neutralizer
            content = await neutralizer.process_content(content)

        # Process content with AI using prompt
        processed_content = await self._processWithAI(content, prompt)

        # objectType is always "TaskDocument" here, including for documents
        # that arrived via extractFromChatDocument.
        return ExtractedContent(
            objectId=document.id,
            objectType="TaskDocument",
            contents=processed_content
        )
```

## Implementation Steps

1. **Model Cleanup**
   - Create new model classes in `interfaceChatModel.py`
   - Remove deprecated models:
     - DocumentExtraction
     - DocumentContext
     - ProcessedDocument
     - ChatContent (replaced by ContentItem)
   - Remove the `contents` attribute from ChatDocument
   - Convert all snake_case to camelCase in manager*.py and method*.py

2. **Service Implementation**
   - Create new `DocumentService` class in `serviceDocument.py`
   - Implement the three main methods:
     - extractFromChatDocument
     - extractFromTaskDocument
     - getDocumentContent (helper function)
   - Add neutralization integration with feature flag

3. **UserInput Processing**
   - Update `UserInputRequest` processing to use `ChatMessage`
   - Implement `processFileIds` in `interfaceChatObjects`
   - Update all references to use new model structure

4. **Method Module Updates**
   - Update all method*.py modules to use new service layer
   - Remove direct file access
   - Implement proper error handling and logging

5. **Testing and Validation**
   - Create unit tests for new models and services
   - Test document processing with various file types
   - Validate content extraction and neutralization
   - Test error handling and edge cases

## Files to be Removed/Modified

### To be Removed

1. `DocumentExtraction` class from interfaceChatModel.py
2. `DocumentContext` class from interfaceChatModel.py
3. `ProcessedDocument` class from interfaceChatModel.py
4. `ChatContent` class from interfaceChatModel.py
5. Direct file access methods from method*.py modules

### To be Modified

1. `interfaceChatModel.py`
   - Add new model classes
   - Remove deprecated classes
   - Update existing classes

2. `managerDocument.py`
   - Move core functionality to DocumentService
   - Update to use new model structure
   - Remove redundant methods

3. `method*.py` modules
   - Update to use DocumentService
   - Remove direct file access
   - Update error handling

4. `interfaceChatObjects.py`
   - Implement processFileIds
   - Update document handling

## Neutralization Integration

The neutralization feature is integrated into the `getDocumentContent` method behind a feature flag. When the flag is enabled, content passes through the neutralizer before it is sent to AI processing.

```python
# In getDocumentContent method
if self.neutralizer_enabled:
    from modules.neutralizer import neutralizer
    content = await neutralizer.process_content(content)
```

This allows for easy enabling/disabling of the feature and future expansion of neutralization capabilities.

## Migration Strategy

1. Create new models and services
2. Implement new functionality alongside existing code
3. Gradually migrate method modules to use new services
4. Remove deprecated code once migration is complete
5. Enable neutralization feature when ready

## Testing Requirements

1. Unit tests for all new model classes
2. Integration tests for DocumentService
3. Tests for content extraction with various file types
4. Tests for neutralization integration
5. Performance tests for large file handling
6. Error handling and edge case tests
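
The neutralization test requirement could be approached along the lines of the sketch below. It is illustrative only and rests on several assumptions: the models are trimmed dataclass stand-ins rather than the pydantic classes from this specification, `_extractRawContent` and `_processWithAI` are stubbed, and `FakeNeutralizer` and `run` are hypothetical names introduced for the example. Only the import path `modules.neutralizer` and the `process_content` call are taken from `getDocumentContent`.

```python
import asyncio
import sys
import types
from dataclasses import dataclass, field
from typing import List
from unittest.mock import patch


@dataclass
class TaskDocument:
    """Trimmed stand-in for the specification's TaskDocument model."""
    id: str
    filename: str
    mimeType: str
    data: str


@dataclass
class ExtractedContent:
    """Trimmed stand-in; contents holds plain strings instead of ContentItem."""
    objectId: str
    objectType: str
    contents: List[str] = field(default_factory=list)


class FakeNeutralizer:
    """Hypothetical test double that masks one known e-mail address."""

    async def process_content(self, content: str) -> str:
        return content.replace("alice@example.com", "[REDACTED]")


# Fake modules matching the import path used inside getDocumentContent.
fake_package = types.ModuleType("modules")
fake_package.__path__ = []  # an empty __path__ marks the module as a package
fake_submodule = types.ModuleType("modules.neutralizer")
fake_submodule.neutralizer = FakeNeutralizer()
fake_package.neutralizer = fake_submodule
FAKE_MODULES = {"modules": fake_package, "modules.neutralizer": fake_submodule}


class DocumentService:
    """Copy of the getDocumentContent flow with stubbed private helpers."""

    def __init__(self):
        self.neutralizer_enabled = False

    async def _extractRawContent(self, document):
        return document.data  # stub: treat data as already-extracted text

    async def _processWithAI(self, content, prompt):
        return [content]  # stub: echo the content instead of calling a model

    async def getDocumentContent(self, document, prompt):
        content = await self._extractRawContent(document)
        if self.neutralizer_enabled:
            from modules.neutralizer import neutralizer
            content = await neutralizer.process_content(content)
        processed = await self._processWithAI(content, prompt)
        return ExtractedContent(document.id, "TaskDocument", processed)


def run(enabled: bool) -> ExtractedContent:
    """Run one extraction with the feature flag set to `enabled`."""
    service = DocumentService()
    service.neutralizer_enabled = enabled
    doc = TaskDocument("doc-1", "note.txt", "text/plain", "Contact alice@example.com")
    # patch.dict restores sys.modules afterwards, so the fake neutralizer
    # is only visible while this extraction runs.
    with patch.dict(sys.modules, FAKE_MODULES):
        return asyncio.run(service.getDocumentContent(doc, "summarize"))
```

Calling `run(False)` and `run(True)` and comparing the `contents` fields shows whether the flag actually gates the neutralizer. Because the import happens lazily inside the method, the disabled path never touches `modules.neutralizer`, which is the behaviour a real test would want to pin down.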