gateway/modules/neutralizer/readme.md

# Neutralizer Module Structure

This module provides GDPR-compliant (German: DSGVO) data anonymization for AI agent systems. The code is split into specialized sub-modules to improve maintainability and code reuse.

## Module Overview

### Core Module
- **`neutralizer.py`** - Main DataAnonymizer class that orchestrates all processing

### Specialized Processors
- **`subProcessText.py`** - Handles plain text processing without header information
- **`subProcessList.py`** - Handles structured data with headers (CSV, JSON, XML)
- **`subProcessBinary.py`** - Handles binary data types (images, audio, video, etc.)

### Utility Modules
- **`subParseString.py`** - String parsing and replacement utilities for emails, phones, addresses, IDs and names
- **`subProcessCommon.py`** - Common utilities and data structures shared across modules
- **`patterns.py`** - Pattern definitions for data anonymization

## Key Features

### 1. Modular Architecture
- **Separation of Concerns**: Each module handles a specific type of data processing
- **Code Reuse**: Common functionality is centralized in utility modules
- **Maintainability**: Easier to modify and extend individual components

### 2. Processing Order
1. **Pattern-based matches** (emails, phones, addresses, etc.) are processed FIRST
2. **Custom names** from the user list are processed SECOND
3. **Already anonymized content** (placeholders) is skipped
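The ordering above can be illustrated with a minimal sketch. It is not the module's actual implementation: `EMAIL_RE`, `PLACEHOLDER_RE`, `replace_outside_placeholders` and `anonymize` are hypothetical names, only an email pattern is included, and the tag names `email` and `name` are assumptions.

```python
import re
import uuid

# Assumed patterns for illustration; the real definitions live in patterns.py.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PLACEHOLDER_RE = re.compile(r"(\[[a-z]+\.[0-9a-f-]{36}\])")


def replace_outside_placeholders(text, regex, tag, mapping):
    """Apply `regex` only to segments that are not already placeholders."""
    def to_placeholder(match):
        original = match.group(0)
        # Reuse the placeholder if this value was seen before.
        if original not in mapping:
            mapping[original] = f"[{tag}.{uuid.uuid4()}]"
        return mapping[original]

    # The capturing group makes re.split keep the placeholders in the result.
    segments = PLACEHOLDER_RE.split(text)
    return "".join(
        seg if PLACEHOLDER_RE.fullmatch(seg) else regex.sub(to_placeholder, seg)
        for seg in segments
    )


def anonymize(text, names, mapping):
    # 1. Pattern-based matches first (only emails in this sketch).
    text = replace_outside_placeholders(text, EMAIL_RE, "email", mapping)
    # 2. Custom names second.
    for name in names:
        name_re = re.compile(r"\b" + re.escape(name) + r"\b")
        text = replace_outside_placeholders(text, name_re, "name", mapping)
    return text
```

Because existing placeholders are split out before each pass, running `anonymize` a second time on its own output leaves the text unchanged.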

### 3. Supported Data Types
- **Text**: Plain text documents, emails, etc.
- **Structured Data**: CSV, JSON, XML with headers
- **Binary Data**: Images, audio, video (framework ready, implementation pending)
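One way to picture how a content type maps to a processor is a small dispatch table. This is a sketch under assumptions: the functions below are hypothetical stand-ins for `TextProcessor`, `ListProcessor` and `BinaryProcessor`, and the accepted type strings other than `'text'` and `'csv'` are guesses.

```python
# Hypothetical stand-ins for the real processor classes.
def process_text(content):
    return ("text", content)


def process_list(content):
    return ("list", content)


def process_binary(content):
    # Mirrors the "implementation pending" status of binary support.
    raise NotImplementedError("binary processing is not implemented yet")


# Assumed mapping from content type to processor.
DISPATCH = {
    "text": process_text,
    "csv": process_list,
    "json": process_list,
    "xml": process_list,
    "binary": process_binary,
}


def route(content, content_type):
    """Pick the processor for `content_type` and run it on `content`."""
    try:
        handler = DISPATCH[content_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported content type: {content_type!r}") from None
    return handler(content)
```

A table like this keeps the orchestrating class free of type-specific logic; adding a new data type means adding one entry and one processor.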

### 4. Placeholder Protection
- Prevents re-anonymization of already processed content
- Uses format `[tag.uuid]` for placeholders
- Validates placeholder format before processing
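A format check for `[tag.uuid]` placeholders could look roughly like the following. The helper name `is_placeholder` is hypothetical, and the sketch assumes tags are lowercase ASCII letters and the UUID is in the standard 36-character hyphenated form; the module's real validation may differ.

```python
import re
import uuid

# Assumed shape: "[" + lowercase tag + "." + 36-character UUID + "]".
PLACEHOLDER_RE = re.compile(r"\[(?P<tag>[a-z]+)\.(?P<uuid>[0-9a-fA-F-]{36})\]")


def is_placeholder(token):
    """Return True if `token` is exactly one well-formed placeholder."""
    match = PLACEHOLDER_RE.fullmatch(token)
    if match is None:
        return False
    try:
        # Let the standard library decide whether the UUID part parses.
        uuid.UUID(match.group("uuid"))
    except ValueError:
        return False
    return True
```

Checking the format before processing lets a processor skip tokens that are already placeholders instead of anonymizing them a second time.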

## Usage Example

```python
from modules.neutralizer import DataAnonymizer

# Initialize with custom names to anonymize
anonymizer = DataAnonymizer(names_to_parse=['John Doe', 'Jane Smith'])

# Process plain text by stating the content type explicitly
text_content = "Contact John Doe at john.doe@example.com"
result = anonymizer.process_content(text_content, content_type='text')

# Structured content works the same way with its own content type
csv_content = "name,email\nJane Smith,jane.smith@example.com"
result = anonymizer.process_content(csv_content, content_type='csv')

# Get mapping of original values to placeholders
mapping = anonymizer.get_mapping()
```

## Module Dependencies

```
neutralizer.py
├── subProcessCommon.py (ProcessResult, CommonUtils)
├── subProcessText.py (TextProcessor)
├── subProcessList.py (ListProcessor)
├── subProcessBinary.py (BinaryProcessor)
└── patterns.py (Pattern definitions)

subProcessText.py
└── subParseString.py (StringParser)

subProcessList.py
├── subParseString.py (StringParser)
└── patterns.py (HeaderPatterns)

subProcessBinary.py
└── (standalone)

subParseString.py
└── patterns.py (DataPatterns)
```

## Benefits of the New Structure

1. **Single Responsibility**: Each module has one clear purpose
2. **DRY Principle**: Shared functionality lives in the utility modules instead of being duplicated across processors
3. **Testability**: Individual modules can be tested in isolation
4. **Extensibility**: Easy to add new data types or processing methods
5. **Maintainability**: Changes stay local to one module as long as its interface is unchanged
6. **Performance**: Each processor handles only its own data type, so it can be optimized independently