gateway/modules/neutralizer/readme.md
2025-09-22 00:39:15 +02:00


# Neutralizer Module Structure
This module provides GDPR-compliant (German: DSGVO) data anonymization for AI agent systems. The code is split into specialized sub-modules for better maintainability and code reuse.
## Module Overview
### Core Module
- **`neutralizer.py`** - Main DataAnonymizer class that orchestrates all processing
### Specialized Processors
- **`subProcessText.py`** - Handles plain text processing without header information
- **`subProcessList.py`** - Handles structured data with headers (CSV, JSON, XML)
- **`subProcessBinary.py`** - Entry point for binary data types (images, audio, video, etc.); framework only, implementation pending
### Utility Modules
- **`subParseString.py`** - String parsing and replacement utilities for emails, phones, addresses, IDs and names
- **`subProcessCommon.py`** - Common utilities and data structures shared across modules
- **`patterns.py`** - Pattern definitions for data anonymization
## Key Features
### 1. Modular Architecture
- **Separation of Concerns**: Each module handles a specific type of data processing
- **Code Reuse**: Common functionality is centralized in utility modules
- **Maintainability**: Easier to modify and extend individual components
### 2. Processing Order
1. **Pattern-based matches** (emails, phones, addresses, etc.) are processed FIRST
2. **Custom names** from the user list are processed SECOND
3. **Already anonymized content** (placeholders) is skipped
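The processing order above can be sketched as follows. This is a minimal illustration, not the module's actual code: the `PATTERNS` table, the `PLACEHOLDER` regex and the `anonymize` function are hypothetical names, and the real patterns live in `patterns.py`. Splitting the text on existing placeholders before each pass is one way to make sure that, for example, the digits inside a placeholder's UUID are never mistaken for a phone number.

```python
import re
import uuid

# Hypothetical patterns for illustration; the real ones live in patterns.py.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d[\d\s/-]{6,}\d"),
}
# Matches placeholders of the documented form [tag.uuid].
PLACEHOLDER = re.compile(r"\[[a-z]+\.[0-9a-f-]{36}\]")


def anonymize(text, names, mapping):
    """Replace pattern matches first, then custom names; skip placeholders."""

    def substitute(segment, regex, tag):
        def repl(match):
            original = match.group(0)
            # Reuse the placeholder if this value was seen before.
            if original not in mapping:
                mapping[original] = f"[{tag}.{uuid.uuid4()}]"
            return mapping[original]
        return regex.sub(repl, segment)

    def outside_placeholders(current, regex, tag):
        # Split on placeholders (kept via the capturing group) and only
        # rewrite the parts that are not placeholders themselves.
        parts = re.split(f"({PLACEHOLDER.pattern})", current)
        return "".join(
            part if PLACEHOLDER.fullmatch(part) else substitute(part, regex, tag)
            for part in parts
        )

    # Step 1: pattern-based matches
    for tag, regex in PATTERNS.items():
        text = outside_placeholders(text, regex, tag)
    # Step 2: custom names from the user list
    for name in names:
        text = outside_placeholders(text, re.compile(re.escape(name)), "name")
    return text
```

Running the sketch a second time on its own output leaves the text unchanged, which is the behaviour step 3 describes.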
### 3. Supported Data Types
- **Text**: Plain text documents, emails, etc.
- **Structured Data**: CSV, JSON, XML with headers
- **Binary Data**: Images, audio, video (framework ready, implementation pending)
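For structured data, the headers can decide which columns are anonymized. The sketch below shows the idea for CSV only, under stated assumptions: `SENSITIVE_HEADERS`, `classify_header` and `anonymize_csv` are hypothetical names, the header regexes are deliberately simplistic, and the actual header patterns are defined in `patterns.py`.

```python
import csv
import io
import re
import uuid

# Hypothetical header patterns; the module keeps its own in patterns.py.
SENSITIVE_HEADERS = {
    "email": re.compile(r"e-?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "name": re.compile(r"name", re.IGNORECASE),
}


def classify_header(header):
    """Return the placeholder tag for a sensitive header, else None."""
    for tag, regex in SENSITIVE_HEADERS.items():
        if regex.search(header):
            return tag
    return None


def anonymize_csv(raw, mapping):
    """Replace every cell in a sensitive column with a [tag.uuid] placeholder."""
    rows = list(csv.reader(io.StringIO(raw)))
    if not rows:
        return raw
    tags = [classify_header(header) for header in rows[0]]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(rows[0])  # headers are kept as-is
    for row in rows[1:]:
        new_row = []
        for index, cell in enumerate(row):
            tag = tags[index] if index < len(tags) else None
            if tag and cell:
                # Identical values share one placeholder via the mapping.
                if cell not in mapping:
                    mapping[cell] = f"[{tag}.{uuid.uuid4()}]"
                cell = mapping[cell]
            new_row.append(cell)
        writer.writerow(new_row)
    return out.getvalue()
```

Columns whose header matches no pattern pass through untouched, so identifiers and non-personal fields stay usable downstream.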
### 4. Placeholder Protection
- Prevents re-anonymization of already processed content
- Uses format `[tag.uuid]` for placeholders
- Validates placeholder format before processing
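A minimal sketch of how placeholders in the `[tag.uuid]` format could be created and validated. The helper names `make_placeholder` and `is_placeholder` are hypothetical, and the exact tag alphabet and UUID version are assumptions (lowercase letters and `uuid4` here); the module's own rules may differ.

```python
import re
import uuid

# Assumed shape: lowercase tag, a dot, then a 36-character UUID string.
_PLACEHOLDER_RE = re.compile(r"\[([a-z]+)\.([0-9a-fA-F-]{36})\]")


def make_placeholder(tag: str) -> str:
    """Build a fresh placeholder such as [email.<uuid4>]."""
    return f"[{tag}.{uuid.uuid4()}]"


def is_placeholder(value: str) -> bool:
    """Return True only if the whole string is a well-formed placeholder."""
    match = _PLACEHOLDER_RE.fullmatch(value)
    if match is None:
        return False
    candidate = match.group(2)
    try:
        # Round-trip through uuid.UUID to reject malformed UUID strings.
        return str(uuid.UUID(candidate)) == candidate.lower()
    except ValueError:
        return False
```

A processor can call a check like `is_placeholder` before replacing a value, so content that was already anonymized is left alone.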
## Usage Example
```python
from modules.neutralizer import DataAnonymizer

# Example input; replace with your own content
content = "Contact John Doe at john.doe@example.com"

# Initialize with custom names
anonymizer = DataAnonymizer(names_to_parse=['John Doe', 'Jane Smith'])

# Process content as plain text
result = anonymizer.process_content(content, content_type='text')
# Or pass a different content type for structured data, e.g. CSV
result = anonymizer.process_content(content, content_type='csv')

# Get mapping of original values to placeholders
mapping = anonymizer.get_mapping()
```
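The mapping can also be used to put original values back into text that still contains placeholders, for example a reply from an AI agent. The sketch below assumes `get_mapping()` returns a plain `dict` from original value to placeholder; that return shape is an assumption, and `restore` is a hypothetical helper, not part of the module.

```python
def restore(text: str, mapping: dict[str, str]) -> str:
    """Swap each placeholder in `text` back to its original value.

    Assumes `mapping` maps original value -> placeholder.
    """
    for original, placeholder in mapping.items():
        text = text.replace(placeholder, original)
    return text
```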
## Module Dependencies
```
neutralizer.py
├── subProcessCommon.py (ProcessResult, CommonUtils)
├── subProcessText.py (TextProcessor)
├── subProcessList.py (ListProcessor)
├── subProcessBinary.py (BinaryProcessor)
└── patterns.py (Pattern definitions)
subProcessText.py
└── subParseString.py (StringParser)
subProcessList.py
├── subParseString.py (StringParser)
└── patterns.py (HeaderPatterns)
subProcessBinary.py
└── (standalone)
subParseString.py
└── patterns.py (DataPatterns)
```
## Benefits of New Structure
1. **Single Responsibility**: Each module has one clear purpose
2. **DRY Principle**: Shared logic lives in the utility modules instead of being duplicated across processors
3. **Testability**: Individual modules can be tested in isolation
4. **Extensibility**: Easy to add new data types or processing methods
5. **Maintainability**: Changes inside one module rarely require changes in the others
6. **Performance**: Specialized processors are optimized for their data types