gateway/notes/data_specification.md
2025-06-13 00:41:51 +02:00

Document Management Refactoring Specification

Overview

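This specification describes the refactoring of document management in the system. It has three goals: separate the document models cleanly, centralize content extraction in a single service, and integrate neutralization as an optional step behind a feature flag so that it can be enabled later without further restructuring.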

Model Structure

Base Document Models

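import uuid
from typing import List, Optional

from pydantic import BaseModel, Field

# ModelMixin is a project-specific mixin; its import is not shown in this specification.

class ContentMetadata(BaseModel, ModelMixin):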
    """Metadata for content items"""
    size: int = Field(description="Content size in bytes")
    pages: Optional[int] = Field(None, description="Number of pages for multi-page content")
    error: Optional[str] = Field(None, description="Processing error if any")
    # Media-specific attributes
    width: Optional[int] = Field(None, description="Width in pixels for images/videos")
    height: Optional[int] = Field(None, description="Height in pixels for images/videos")
    colorMode: Optional[str] = Field(None, description="Color mode (e.g., RGB, CMYK, grayscale)")
    fps: Optional[float] = Field(None, description="Frames per second for videos")
    durationSec: Optional[float] = Field(None, description="Duration in seconds for videos/audio")

class ContentItem(BaseModel, ModelMixin):
    """Individual content item from a document"""
    label: str = Field(description="Content label (e.g., tab name, tag name)")
    data: str = Field(description="Text content")
    metadata: ContentMetadata = Field(description="Content metadata")

class ChatDocument(BaseModel, ModelMixin):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    fileId: str
    filename: str
    fileSize: int
    mimeType: str

class TaskDocument(BaseModel, ModelMixin):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    filename: str
    fileSize: int
    mimeType: str
    data: str  # Base64 encoded file data

class ExtractedContent(BaseModel, ModelMixin):
    objectId: str  # Reference to source document
    objectType: str = Field(description="Type of source object ('ChatDocument' or 'TaskDocument')")
    contents: List[ContentItem]
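
As a rough illustration of how these models fit together, the sketch below builds an ExtractedContent for a TaskDocument. It uses standard-library dataclasses as stand-ins for the Pydantic models (an assumption, so that the snippet runs without dependencies and without ModelMixin), keeps only the fields needed for the example, and uses an invented file name and text.

```python
import base64
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


# Stand-ins for the Pydantic models above (assumption: plain dataclasses,
# ModelMixin omitted, only the fields needed for this illustration).
@dataclass
class ContentMetadata:
    size: int
    pages: Optional[int] = None
    error: Optional[str] = None


@dataclass
class ContentItem:
    label: str
    data: str
    metadata: ContentMetadata


@dataclass
class TaskDocument:
    filename: str
    fileSize: int
    mimeType: str
    data: str  # Base64 encoded file data
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class ExtractedContent:
    objectId: str
    objectType: str
    contents: List[ContentItem]


raw = "Quarterly report: revenue up.".encode("utf-8")  # invented sample data
doc = TaskDocument(
    filename="report.txt",
    fileSize=len(raw),
    mimeType="text/plain",
    data=base64.b64encode(raw).decode("ascii"),
)

# Decode the payload and wrap it in a single content item.
text = base64.b64decode(doc.data).decode("utf-8")
extracted = ExtractedContent(
    objectId=doc.id,  # reference back to the source document
    objectType="TaskDocument",
    contents=[
        ContentItem(label="body", data=text, metadata=ContentMetadata(size=len(raw)))
    ],
)
```

ExtractedContent carries only a reference (objectId, objectType) to its source, so the Base64 payload is not duplicated in the extraction result.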

Service Layer Structure

Document Service

class DocumentService:
    def __init__(self, service_container):
        self.service = service_container
        self.neutralizer_enabled = False  # Flag for neutralization feature

    async def extractFromChatDocument(self, prompt: str, document: ChatDocument) -> ExtractedContent:
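        """
        Extract content from a ChatDocument by converting it to a TaskDocument first.
        The result references the original ChatDocument, not the temporary TaskDocument.
        """
        # Convert ChatDocument to TaskDocument
        task_doc = await self._convertToTaskDocument(document)
        extracted = await self.getDocumentContent(task_doc, prompt)
        # getDocumentContent labels its result as a TaskDocument; re-point it at the source
        return ExtractedContent(
            objectId=document.id,
            objectType="ChatDocument",
            contents=extracted.contents
        )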

    async def extractFromTaskDocument(self, prompt: str, document: TaskDocument) -> ExtractedContent:
        """
        Extract content directly from a TaskDocument.
        """
        return await self.getDocumentContent(document, prompt)

    async def getDocumentContent(self, document: TaskDocument, prompt: str) -> ExtractedContent:
        """
        Helper function for centralized content extraction.
        Handles the actual content extraction and optional neutralization.
        """
        # Extract content based on mimeType
        content = await self._extractRawContent(document)
        
        # Apply neutralization if enabled
        if self.neutralizer_enabled:
            from modules.neutralizer import neutralizer
            content = await neutralizer.process_content(content)
        
        # Process content with AI using prompt
        processed_content = await self._processWithAI(content, prompt)
        
        return ExtractedContent(
            objectId=document.id,
            objectType="TaskDocument",
            contents=processed_content
        )
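
The specification does not define _extractRawContent. One possible way to realize "extract content based on mimeType" is a dispatch table keyed by MIME type, sketched below as standalone functions. The extractor functions, the EXTRACTORS table, the two supported types, and the choice to report failures as an error string are all assumptions made for illustration.

```python
import asyncio
import base64
import json
from typing import Callable, Dict, List, Tuple

# (label, text) pairs; the real method would presumably build ContentItem objects.
RawContent = List[Tuple[str, str]]


def extractPlainText(payload: bytes) -> RawContent:
    return [("text", payload.decode("utf-8", errors="replace"))]


def extractJson(payload: bytes) -> RawContent:
    # One content item per top-level key, labelled with the key name.
    parsed = json.loads(payload.decode("utf-8"))
    if not isinstance(parsed, dict):
        return [("json", json.dumps(parsed))]
    return [(str(key), json.dumps(value)) for key, value in parsed.items()]


# Hypothetical registry: MIME type -> extractor function.
EXTRACTORS: Dict[str, Callable[[bytes], RawContent]] = {
    "text/plain": extractPlainText,
    "application/json": extractJson,
}


async def extractRawContent(mime_type: str, data_b64: str) -> Tuple[RawContent, str]:
    """Return (contents, error); error is an empty string on success."""
    extractor = EXTRACTORS.get(mime_type)
    if extractor is None:
        return [], f"Unsupported mimeType: {mime_type}"
    try:
        payload = base64.b64decode(data_b64, validate=True)
        return extractor(payload), ""
    except ValueError as exc:  # invalid Base64, invalid UTF-8, or invalid JSON
        return [], f"Extraction failed: {exc}"
```

In the models above, the error string would map naturally onto ContentMetadata.error, and supporting a new file type would only require a new entry in the table.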

Implementation Steps

  1. Model Cleanup

    • Create new model classes in interfaceChatModel.py
    • Remove deprecated models:
      • DocumentExtraction
      • DocumentContext
      • ProcessedDocument
      • ChatContent (replaced by ContentItem)
    • Remove the contents attribute from ChatDocument
    • Convert all snake_case to camelCase in manager*.py and method*.py
  2. Service Implementation

    • Create new DocumentService class in serviceDocument.py
    • Implement the three main methods:
      • extractFromChatDocument
      • extractFromTaskDocument
      • getDocumentContent (helper function)
    • Add neutralization integration with feature flag
  3. UserInput Processing

    • Update UserInputRequest processing to use ChatMessage
    • Implement processFileIds in interfaceChatObjects
    • Update all references to use new model structure
  4. Method Module Updates

    • Update all method*.py modules to use new service layer
    • Remove direct file access
    • Implement proper error handling and logging
  5. Testing and Validation

    • Create unit tests for new models and services
    • Test document processing with various file types
    • Validate content extraction and neutralization
    • Test error handling and edge cases
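
For step 4, a method module would call the service instead of opening files itself. The sketch below shows one possible shape of such a call, including error handling and logging. StubDocumentService, summarizeDocument, the prompt text, and the logger name are hypothetical; only the extractFromTaskDocument method name and its argument order come from this specification, and plain dictionaries stand in for the models.

```python
import asyncio
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger("methodDocument")  # hypothetical logger name


class StubDocumentService:
    """Stand-in for DocumentService so the sketch runs on its own."""

    async def extractFromTaskDocument(
        self, prompt: str, document: Dict[str, Any]
    ) -> Dict[str, Any]:
        if not document.get("data"):
            raise ValueError("document has no data")
        return {"objectId": document["id"], "objectType": "TaskDocument", "contents": []}


async def summarizeDocument(
    service: Any, document: Dict[str, Any]
) -> Optional[Dict[str, Any]]:
    """Hypothetical method-module function: no direct file access, only the service."""
    try:
        return await service.extractFromTaskDocument("Summarize this document", document)
    except Exception:
        # Log with traceback and return None so the caller can decide how to proceed.
        logger.exception("Extraction failed for document %s", document.get("id"))
        return None
```

Returning None on failure is only one option; a method module could equally re-raise a domain-specific exception after logging.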

Code to be Removed/Modified

To be Removed

  1. DocumentExtraction class from interfaceChatModel.py
  2. DocumentContext class from interfaceChatModel.py
  3. ProcessedDocument class from interfaceChatModel.py
  4. ChatContent class from interfaceChatModel.py
  5. Direct file access methods from method*.py modules

To be Modified

  1. interfaceChatModel.py

    • Add new model classes
    • Remove deprecated classes
    • Update existing classes
  2. managerDocument.py

    • Move core functionality to DocumentService
    • Update to use new model structure
    • Remove redundant methods
  3. method*.py modules

    • Update to use DocumentService
    • Remove direct file access
    • Update error handling
  4. interfaceChatObjects.py

    • Implement processFileIds
    • Update document handling

Neutralization Integration

The neutralization feature is integrated into the getDocumentContent method behind a feature flag (neutralizer_enabled, disabled by default). When the flag is enabled, the extracted content is passed through the neutralizer before it is sent to AI processing.

# In getDocumentContent method
if self.neutralizer_enabled:
    from modules.neutralizer import neutralizer
    content = await neutralizer.process_content(content)

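Because neutralization sits behind a single flag and a single call, the feature can be enabled or disabled without changes to callers, and the neutralizer's capabilities can be expanded later without touching the extraction flow.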
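
To make the ordering concrete, the sketch below reproduces the flag check with stand-in steps. The StubNeutralizer class and its e-mail masking rule are invented for illustration; this specification only states that the neutralizer exposes process_content. Raw extraction and the AI step are replaced by pass-throughs, and the neutralizer is passed in through the constructor rather than imported, which is also an assumption.

```python
import asyncio
import re


class StubNeutralizer:
    """Invented stand-in for modules.neutralizer; masks e-mail addresses."""

    async def process_content(self, content: str) -> str:
        return re.sub(r"[\w.+-]+@[\w-]+(\.[\w-]+)+", "[EMAIL]", content)


class SketchDocumentService:
    def __init__(self, neutralizer) -> None:
        self.neutralizer = neutralizer
        self.neutralizer_enabled = False  # same default as in the specification

    async def getDocumentContent(self, raw_text: str) -> str:
        content = raw_text  # stands in for _extractRawContent
        if self.neutralizer_enabled:
            content = await self.neutralizer.process_content(content)
        return content  # stands in for _processWithAI (pass-through here)


service = SketchDocumentService(StubNeutralizer())
text = "Contact jane.doe@example.com for details."

before = asyncio.run(service.getDocumentContent(text))  # flag off: unchanged
service.neutralizer_enabled = True
after = asyncio.run(service.getDocumentContent(text))   # flag on: address masked
```

With the flag off the text passes through unchanged; with it on, the AI step only ever sees the neutralized text.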

Migration Strategy

  1. Create new models and services
  2. Implement new functionality alongside existing code
  3. Gradually migrate method modules to use new services
  4. Remove deprecated code once migration is complete
  5. Enable neutralization feature when ready

Testing Requirements

  1. Unit tests for all new model classes
  2. Integration tests for DocumentService
  3. Tests for content extraction with various file types
  4. Tests for neutralization integration
  5. Performance tests for large file handling
  6. Error handling and edge case tests
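
A test for the neutralization integration (item 4) could look roughly like the sketch below, which uses the standard library's unittest and AsyncMock to check that the neutralizer is awaited only when the flag is enabled. ServiceUnderTest is a minimal stand-in that mirrors the flag check in getDocumentContent; it is not the actual DocumentService, and injecting the neutralizer through the constructor is an assumption made to keep the test isolated.

```python
import unittest
from unittest.mock import AsyncMock


class ServiceUnderTest:
    """Minimal stand-in mirroring the flag check in getDocumentContent."""

    def __init__(self, neutralizer) -> None:
        self.neutralizer = neutralizer
        self.neutralizer_enabled = False

    async def getDocumentContent(self, content: str) -> str:
        if self.neutralizer_enabled:
            content = await self.neutralizer.process_content(content)
        return content


class NeutralizationFlagTest(unittest.IsolatedAsyncioTestCase):
    async def asyncSetUp(self) -> None:
        # AsyncMock makes process_content awaitable and records each await.
        self.neutralizer = AsyncMock()
        self.neutralizer.process_content.return_value = "neutral"
        self.service = ServiceUnderTest(self.neutralizer)

    async def test_flag_off_skips_neutralizer(self) -> None:
        result = await self.service.getDocumentContent("raw")
        self.assertEqual(result, "raw")
        self.neutralizer.process_content.assert_not_awaited()

    async def test_flag_on_calls_neutralizer_once(self) -> None:
        self.service.neutralizer_enabled = True
        result = await self.service.getDocumentContent("raw")
        self.assertEqual(result, "neutral")
        self.neutralizer.process_content.assert_awaited_once_with("raw")
```

The same pattern extends to the error-handling tests in item 6, for example by setting side_effect on the mock to simulate a failing neutralizer.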