gateway/notes/data_specification.md

# Document Management Refactoring Specification

## Overview
This specification describes the refactoring of document management in the system. It has three goals: clear separation of the document models, content extraction centralized in a single service, and a neutralization step that can be switched on later through a feature flag.

## Model Structure

### Base Document Models
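```python
import uuid
from typing import List, Optional

from pydantic import BaseModel, Field

# ModelMixin is a project-internal mixin; import it from wherever it is defined.

class ContentMetadata(BaseModel, ModelMixin):
    """Metadata for content items"""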
    size: int = Field(description="Content size in bytes")
    pages: Optional[int] = Field(None, description="Number of pages for multi-page content")
    error: Optional[str] = Field(None, description="Processing error if any")
    # Media-specific attributes
    width: Optional[int] = Field(None, description="Width in pixels for images/videos")
    height: Optional[int] = Field(None, description="Height in pixels for images/videos")
    colorMode: Optional[str] = Field(None, description="Color mode (e.g., RGB, CMYK, grayscale)")
    fps: Optional[float] = Field(None, description="Frames per second for videos")
    durationSec: Optional[float] = Field(None, description="Duration in seconds for videos/audio")

class ContentItem(BaseModel, ModelMixin):
    """Individual content item from a document"""
    label: str = Field(description="Content label (e.g., tab name, tag name)")
    data: str = Field(description="Text content")
    metadata: ContentMetadata = Field(description="Content metadata")

class ChatDocument(BaseModel, ModelMixin):
    """Document attached to a chat; references its file via fileId"""
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    fileId: str
    filename: str
    fileSize: int
    mimeType: str

class TaskDocument(BaseModel, ModelMixin):
    """Document that carries its file data inline"""
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    filename: str
    fileSize: int
    mimeType: str
    data: str  # Base64-encoded file data

class ExtractedContent(BaseModel, ModelMixin):
    """Content extracted from a ChatDocument or TaskDocument"""
    objectId: str  # ID of the source document
    objectType: str = Field(description="Type of source object ('ChatDocument' or 'TaskDocument')")
    contents: List[ContentItem]
```
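
To make the relationship between these models concrete, the following is a minimal sketch of how a `TaskDocument` payload could be built from raw bytes and how an `ExtractedContent` points back to it. It uses standard-library dataclasses as stand-ins for the Pydantic models so that it runs without the project's `ModelMixin`. The helper `buildTaskDocument` is hypothetical, and treating `fileSize` as the size of the decoded bytes is an assumption, not something this spec states.

```python
import base64
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentMetadata:
    size: int
    pages: Optional[int] = None
    error: Optional[str] = None


@dataclass
class ContentItem:
    label: str
    data: str
    metadata: ContentMetadata


@dataclass
class TaskDocument:
    filename: str
    fileSize: int
    mimeType: str
    data: str  # Base64-encoded file data
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class ExtractedContent:
    objectId: str
    objectType: str
    contents: List[ContentItem]


def buildTaskDocument(filename: str, mimeType: str, raw: bytes) -> TaskDocument:
    """Hypothetical helper: wrap raw bytes in a TaskDocument."""
    return TaskDocument(
        filename=filename,
        fileSize=len(raw),  # assumption: size of the decoded bytes, not of the Base64 string
        mimeType=mimeType,
        data=base64.b64encode(raw).decode("ascii"),
    )


raw = "Hello, world".encode("utf-8")
doc = buildTaskDocument("hello.txt", "text/plain", raw)

# Decoding the payload gives back the original text.
text = base64.b64decode(doc.data).decode("utf-8")

# The extraction result refers to its source through objectId and objectType.
extracted = ExtractedContent(
    objectId=doc.id,
    objectType="TaskDocument",
    contents=[ContentItem(label="body", data=text, metadata=ContentMetadata(size=len(raw)))],
)
```

Only `objectId` and `objectType` link an `ExtractedContent` to its source, so the result stays valid regardless of how the source document was stored.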

## Service Layer Structure

### Document Service
```python
class DocumentService:
    def __init__(self, service_container):
        self.service = service_container
        self.neutralizer_enabled = False  # Flag for neutralization feature

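    async def extractFromChatDocument(self, prompt: str, document: ChatDocument) -> ExtractedContent:
        """
        Extract content from a ChatDocument by converting it to a TaskDocument first.
        """
        # Convert ChatDocument to TaskDocument
        task_doc = await self._convertToTaskDocument(document)
        extracted = await self.getDocumentContent(task_doc, prompt)
        # Point the result at the original ChatDocument rather than the
        # intermediate TaskDocument, as ExtractedContent.objectType requires
        extracted.objectId = document.id
        extracted.objectType = "ChatDocument"
        return extracted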

    async def extractFromTaskDocument(self, prompt: str, document: TaskDocument) -> ExtractedContent:
        """
        Extract content directly from a TaskDocument.
        """
        return await self.getDocumentContent(document, prompt)

    async def getDocumentContent(self, document: TaskDocument, prompt: str) -> ExtractedContent:
        """
        Helper function for centralized content extraction.
        Handles the actual content extraction and optional neutralization.
        """
        # Extract content based on mimeType
        content = await self._extractRawContent(document)

        # Apply neutralization if enabled
        if self.neutralizer_enabled:
            from modules.neutralizer import neutralizer
            content = await neutralizer.process_content(content)

        # Process content with AI using prompt
        processed_content = await self._processWithAI(content, prompt)

        return ExtractedContent(
            objectId=document.id,
            objectType="TaskDocument",
            contents=processed_content
        )
```
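
The service above relies on helpers such as `_convertToTaskDocument` and `_extractRawContent` that this spec does not define. The sketch below shows one way they might behave, under stated assumptions: files are looked up by `fileId` in a hypothetical in-memory `FILE_STORE`, the document ID is carried over during conversion, and only `text/plain` is handled. Dataclasses stand in for the Pydantic models.

```python
import asyncio
import base64
from dataclasses import dataclass
from typing import Dict


@dataclass
class ChatDocument:
    id: str
    fileId: str
    filename: str
    fileSize: int
    mimeType: str


@dataclass
class TaskDocument:
    id: str
    filename: str
    fileSize: int
    mimeType: str
    data: str  # Base64-encoded file data


# Hypothetical in-memory file store keyed by fileId (stands in for real storage).
FILE_STORE: Dict[str, bytes] = {"file-1": b"quarterly report"}


async def convertToTaskDocument(document: ChatDocument) -> TaskDocument:
    """Load the referenced file and embed it as Base64 (assumed behaviour)."""
    raw = FILE_STORE[document.fileId]
    return TaskDocument(
        id=document.id,  # assumption: the ID is carried over for traceability
        filename=document.filename,
        fileSize=len(raw),
        mimeType=document.mimeType,
        data=base64.b64encode(raw).decode("ascii"),
    )


async def extractRawContent(document: TaskDocument) -> str:
    """Stub extractor: handles text/plain only and raises for other types."""
    if document.mimeType != "text/plain":
        raise ValueError(f"Unsupported mimeType: {document.mimeType}")
    return base64.b64decode(document.data).decode("utf-8")


async def demo() -> str:
    chatDoc = ChatDocument("doc-1", "file-1", "report.txt", 16, "text/plain")
    taskDoc = await convertToTaskDocument(chatDoc)
    return await extractRawContent(taskDoc)
```

Running `asyncio.run(demo())` walks a `ChatDocument` through conversion and raw extraction. A real extractor would dispatch on `mimeType` to format-specific parsers.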

## Implementation Steps

1. **Model Cleanup**
   - Create new model classes in `serviceChatModel.py`
   - Remove deprecated models:
     - DocumentExtraction
     - DocumentContext
     - ProcessedDocument
     - ChatContent (replaced by ContentItem)
   - Remove the `contents` attribute from `ChatDocument`
   - Convert all snake_case identifiers to camelCase in `manager*.py` and `method*.py`

2. **Service Implementation**
   - Create new `DocumentService` class in `serviceDocument.py`
   - Implement the three main methods:
     - extractFromChatDocument
     - extractFromTaskDocument
     - getDocumentContent (helper function)
   - Add neutralization integration with feature flag

3. **UserInput Processing**
   - Update `UserInputRequest` processing to use `ChatMessage`
   - Implement `processFileIds` in `serviceChatClass.py`
   - Update all references to use new model structure

4. **Method Module Updates**
   - Update all method*.py modules to use new service layer
   - Remove direct file access
   - Implement proper error handling and logging

5. **Testing and Validation**
   - Create unit tests for new models and services
   - Test document processing with various file types
   - Validate content extraction and neutralization
   - Test error handling and edge cases
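
For the error handling and logging called for in step 4, one possible pattern is sketched below: a method module calls the service, logs any failure, and returns a result whose `ContentMetadata.error` carries the message. The wrapper `extractSafely`, the logger name, and the `FailingService` stand-in are all hypothetical. Whether failures should be returned this way or re-raised is a design decision this spec leaves open.

```python
import asyncio
import logging
from dataclasses import dataclass, field
from typing import List, Optional

logger = logging.getLogger("methodExample")  # hypothetical logger name


@dataclass
class ContentMetadata:
    size: int
    error: Optional[str] = None


@dataclass
class ContentItem:
    label: str
    data: str
    metadata: ContentMetadata


@dataclass
class ExtractedContent:
    objectId: str
    objectType: str
    contents: List[ContentItem] = field(default_factory=list)


class FailingService:
    """Stand-in for DocumentService whose extraction always fails."""

    async def extractFromTaskDocument(self, prompt, document):
        raise ValueError("unsupported mimeType")


async def extractSafely(service, prompt: str, documentId: str, document) -> ExtractedContent:
    """Call the service; on failure, log and return an item carrying the error."""
    try:
        return await service.extractFromTaskDocument(prompt, document)
    except Exception as exc:  # broad on purpose: callers get a uniform result type
        logger.exception("Extraction failed for document %s", documentId)
        errorItem = ContentItem(
            label="error",
            data="",
            metadata=ContentMetadata(size=0, error=str(exc)),
        )
        return ExtractedContent(objectId=documentId, objectType="TaskDocument", contents=[errorItem])


result = asyncio.run(extractSafely(FailingService(), "Summarize", "doc-1", None))
```

Returning the error inside the model keeps the caller's control flow simple, at the cost of requiring every consumer to check `metadata.error`.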

## Files to be Removed/Modified

### To be Removed
1. `DocumentExtraction` class from serviceChatModel.py
2. `DocumentContext` class from serviceChatModel.py
3. `ProcessedDocument` class from serviceChatModel.py
4. `ChatContent` class from serviceChatModel.py
5. Direct file access methods from method*.py modules

### To be Modified
1. `serviceChatModel.py`
   - Add new model classes
   - Remove deprecated classes
   - Update existing classes

2. `managerDocument.py`
   - Move core functionality to DocumentService
   - Update to use new model structure
   - Remove redundant methods

3. `method*.py` modules
   - Update to use DocumentService
   - Remove direct file access
   - Update error handling

4. `serviceChatClass.py`
   - Implement processFileIds
   - Update document handling

## Neutralization Integration

The neutralization step lives in the `getDocumentContent` method behind the `neutralizer_enabled` feature flag, which defaults to `False`. When the flag is set, the raw extracted content passes through the neutralizer before it is sent to AI processing.

```python
# In getDocumentContent method
if self.neutralizer_enabled:
    from modules.neutralizer import neutralizer
    content = await neutralizer.process_content(content)
```

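Because the neutralizer is imported and called only when the flag is set, the feature can be switched on or off without touching the rest of the extraction path, and the neutralizer itself can be extended independently.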
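
The effect of the flag can be illustrated with a reduced sketch. This spec does not define what the neutralizer does, so `StubNeutralizer` below is purely a placeholder that masks email-like strings. `FlaggedPipeline` is a hypothetical stand-in for the relevant part of `DocumentService`, with the neutralizer injected instead of imported.

```python
import asyncio
import re


class StubNeutralizer:
    """Placeholder neutralizer: masks email-like strings (illustrative only)."""

    _EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    async def process_content(self, content: str) -> str:
        return self._EMAIL.sub("[REDACTED]", content)


class FlaggedPipeline:
    """Reduced stand-in for DocumentService showing only the flag check."""

    def __init__(self, neutralizer, neutralizer_enabled: bool = False):
        self.neutralizer = neutralizer
        self.neutralizer_enabled = neutralizer_enabled

    async def prepare(self, content: str) -> str:
        # Content passes through unchanged unless the flag is set.
        if self.neutralizer_enabled:
            content = await self.neutralizer.process_content(content)
        return content


text = "Contact alice@example.com today"
plain = asyncio.run(FlaggedPipeline(StubNeutralizer()).prepare(text))
masked = asyncio.run(FlaggedPipeline(StubNeutralizer(), neutralizer_enabled=True).prepare(text))
```

With the flag off, `plain` equals the input. With it on, `masked` has the address replaced by the placeholder token.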

## Migration Strategy

1. Create new models and services
2. Implement new functionality alongside existing code
3. Gradually migrate method modules to use new services
4. Remove deprecated code once migration is complete
5. Enable neutralization feature when ready

## Testing Requirements

1. Unit tests for all new model classes
2. Integration tests for DocumentService
3. Tests for content extraction with various file types
4. Tests for neutralization integration
5. Performance tests for large file handling
6. Error handling and edge case tests
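
As a starting point for the neutralization tests, the sketch below uses the standard library's `unittest.IsolatedAsyncioTestCase` and `AsyncMock` to check that the neutralizer is awaited only when the flag is set. `MiniService` is a reduced, hypothetical stand-in for `DocumentService`; real tests would exercise the actual class with its helpers stubbed out.

```python
import unittest
from unittest.mock import AsyncMock


class MiniService:
    """Reduced stand-in for DocumentService (not the real class)."""

    def __init__(self, neutralizer, neutralizer_enabled=False):
        self.neutralizer = neutralizer
        self.neutralizer_enabled = neutralizer_enabled

    async def getDocumentContent(self, content: str) -> str:
        if self.neutralizer_enabled:
            content = await self.neutralizer.process_content(content)
        return content


class NeutralizationFlagTest(unittest.IsolatedAsyncioTestCase):
    async def test_disabled_flag_skips_neutralizer(self):
        neutralizer = AsyncMock()
        service = MiniService(neutralizer, neutralizer_enabled=False)
        self.assertEqual(await service.getDocumentContent("raw"), "raw")
        neutralizer.process_content.assert_not_awaited()

    async def test_enabled_flag_calls_neutralizer(self):
        neutralizer = AsyncMock()
        neutralizer.process_content.return_value = "clean"
        service = MiniService(neutralizer, neutralizer_enabled=True)
        self.assertEqual(await service.getDocumentContent("raw"), "clean")
        neutralizer.process_content.assert_awaited_once_with("raw")


def runSuite() -> unittest.TestResult:
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NeutralizationFlagTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Mocking the neutralizer keeps these tests independent of its implementation, so they remain valid as neutralization capabilities change.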