# Neutralizer Module Structure

This module provides GDPR (DSGVO)-compliant data anonymization for AI agent systems. The code has been refactored into specialized sub-modules for better maintainability and code reuse.

## Module Overview

### Core Module
- **`neutralizer.py`** - Main DataAnonymizer class that orchestrates all processing

### Specialized Processors
- **`subProcessText.py`** - Handles plain text processing without header information
- **`subProcessList.py`** - Handles structured data with headers (CSV, JSON, XML)
- **`subProcessBinary.py`** - Handles binary data types (images, audio, video, etc.)

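To illustrate what header-driven processing of structured data can look like, here is a minimal sketch for CSV input. It is not the code in `subProcessList.py`: the `SENSITIVE_HEADERS` table, `tag_for_header`, and `anonymize_csv` are hypothetical names, and the header regexes are simplified assumptions. Only the idea of using headers to decide which columns to anonymize, and the `[tag.uuid]` placeholder format, come from this README.

```python
import csv
import io
import re
import uuid

# Hypothetical header patterns: column names that suggest personal data.
SENSITIVE_HEADERS = {
    "email": re.compile(r"e-?mail", re.IGNORECASE),
    "name": re.compile(r"name", re.IGNORECASE),
    "phone": re.compile(r"phone|tel", re.IGNORECASE),
}


def tag_for_header(header):
    """Return the placeholder tag for a column header, or None if not sensitive."""
    for tag, pattern in SENSITIVE_HEADERS.items():
        if pattern.search(header):
            return tag
    return None


def anonymize_csv(csv_text):
    """Replace values in sensitive columns; return (new_csv_text, mapping)."""
    mapping = {}  # original value -> placeholder
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text, mapping

    header, *data = rows
    tags = [tag_for_header(column) for column in header]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for row in data:
        new_row = []
        # Sketch limitation: cells beyond the header width are dropped by zip().
        for tag, value in zip(tags, row):
            if tag and value:
                # Reuse the same placeholder when a value repeats.
                if value not in mapping:
                    mapping[value] = f"[{tag}.{uuid.uuid4()}]"
                value = mapping[value]
            new_row.append(value)
        writer.writerow(new_row)
    return out.getvalue(), mapping
```

Because the decision is made per column, a value such as a city name in a non-sensitive column is left untouched, while a repeated name maps to one stable placeholder.
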
### Utility Modules
- **`subParseString.py`** - String parsing and replacement utilities for emails, phones, addresses, IDs and names
- **`subProcessCommon.py`** - Common utilities and data structures shared across modules
- **`patterns.py`** - Pattern definitions for data anonymization

## Key Features

### 1. Modular Architecture
- **Separation of Concerns**: Each module handles a specific type of data processing
- **Code Reuse**: Common functionality is centralized in utility modules
- **Maintainability**: Easier to modify and extend individual components

### 2. Processing Order
1. **Pattern-based matches** (emails, phones, addresses, etc.) are processed FIRST
2. **Custom names** from the user list are processed SECOND
3. **Already anonymized content** (placeholders) is skipped

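The ordering above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual code: `anonymize_text` and its helper are hypothetical names, the email regex is deliberately simplified, and the `email` and `name` tags are invented for the example. Only the pattern-first, names-second, skip-placeholders order and the `[tag.uuid]` format come from this README.

```python
import re
import uuid

# Assumed placeholder shape: [tag.uuid] with a lowercase tag and a 36-character UUID.
PLACEHOLDER_RE = re.compile(r"\[[a-z]+\.[0-9a-f-]{36}\]")
# Deliberately simplified email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def _replace_outside_placeholders(text, pattern, tag, mapping):
    """Replace matches of `pattern`, leaving existing placeholders untouched."""

    def substitute(match):
        original = match.group(0)
        if original not in mapping:
            mapping[original] = f"[{tag}.{uuid.uuid4()}]"
        return mapping[original]

    # Splitting on a capturing group keeps the placeholders as odd-indexed parts.
    parts = re.split(f"({PLACEHOLDER_RE.pattern})", text)
    return "".join(
        part if index % 2 else pattern.sub(substitute, part)
        for index, part in enumerate(parts)
    )


def anonymize_text(text, names):
    """Return (anonymized_text, mapping of original value -> placeholder)."""
    mapping = {}
    # Step 1: pattern-based matches are handled first.
    text = _replace_outside_placeholders(text, EMAIL_RE, "email", mapping)
    # Step 2: custom names second, longest first so full names win over parts.
    for name in sorted(names, key=len, reverse=True):
        name_re = re.compile(re.escape(name))
        text = _replace_outside_placeholders(text, name_re, "name", mapping)
    # Step 3 is implicit: the helper never rewrites an existing placeholder.
    return text, mapping
```

Running patterns first matters when a name also occurs inside a pattern match: in `"Ask Smith via Smith@corp.example"`, the address is replaced as one unit before the name `Smith` is considered, so the email is never split into a half-anonymized string.
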
### 3. Supported Data Types
- **Text**: Plain text documents, emails, etc.
- **Structured Data**: CSV, JSON, XML with headers
- **Binary Data**: Images, audio, video (framework ready, implementation pending)

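One way such content types could be routed to the specialized processors is a simple lookup table. The sketch below is an assumption about the general shape, not the dispatch logic in `neutralizer.py`: `route_content` and the three handler functions are hypothetical stand-ins that only report what they received, and the binary handler raises to mirror the "implementation pending" status.

```python
def process_text(content):
    # Stand-in for the plain-text processor.
    return f"text:{len(content)} chars"


def process_structured(content):
    # Stand-in for the header-aware processor (CSV, JSON, XML).
    return f"structured:{len(content)} chars"


def process_binary(content):
    # Mirrors the README: the framework exists, the implementation is pending.
    raise NotImplementedError("binary processing is not implemented yet")


# Hypothetical routing table from content type to handler.
_HANDLERS = {
    "text": process_text,
    "csv": process_structured,
    "json": process_structured,
    "xml": process_structured,
    "binary": process_binary,
}


def route_content(content, content_type):
    """Dispatch `content` to the handler registered for `content_type`."""
    try:
        handler = _HANDLERS[content_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported content type: {content_type!r}") from None
    return handler(content)
```

A table like this keeps the orchestrating class free of type-specific branches, so supporting a new format means registering one more handler.
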
### 4. Placeholder Protection
- Prevents re-anonymization of already processed content
- Uses format `[tag.uuid]` for placeholders
- Validates placeholder format before processing

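A format check for `[tag.uuid]` placeholders might look like the sketch below. The README does not specify the exact grammar, so two assumptions are made here and repeated in the code: the tag consists of lowercase letters and underscores, and the UUID is written in the canonical 8-4-4-4-12 hexadecimal form. `make_placeholder` and `is_placeholder` are hypothetical helper names.

```python
import re
import uuid

# Assumptions: tag = lowercase letters/underscores; uuid = canonical 8-4-4-4-12 hex form.
_PLACEHOLDER_RE = re.compile(
    r"\[(?P<tag>[a-z_]+)\."
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\]"
)


def make_placeholder(tag):
    """Build a fresh placeholder such as [email.<random uuid>]."""
    return f"[{tag}.{uuid.uuid4()}]"


def is_placeholder(value):
    """Return True only if the whole string is one well-formed placeholder."""
    return _PLACEHOLDER_RE.fullmatch(value) is not None
```

Using `fullmatch` rather than `search` means a string that merely contains a placeholder is not mistaken for one, which keeps the check strict enough to decide whether a value can be skipped.
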
## Usage Example

```python
from modules.neutralizer import DataAnonymizer

# Example input; replace with the data you want to anonymize
content = "Contact John Doe at john.doe@example.com"

# Initialize with custom names
anonymizer = DataAnonymizer(names_to_parse=['John Doe', 'Jane Smith'])

# Process content as plain text
result = anonymizer.process_content(content, content_type='text')

# Or pass a different content type for structured input, e.g. CSV
result = anonymizer.process_content(content, content_type='csv')

# Get mapping of original values to placeholders
mapping = anonymizer.get_mapping()
```

## Module Dependencies

```
neutralizer.py
├── subProcessCommon.py (ProcessResult, CommonUtils)
├── subProcessText.py (TextProcessor)
├── subProcessList.py (ListProcessor)
├── subProcessBinary.py (BinaryProcessor)
└── patterns.py (Pattern definitions)

subProcessText.py
└── subParseString.py (StringParser)

subProcessList.py
├── subParseString.py (StringParser)
└── patterns.py (HeaderPatterns)

subProcessBinary.py
└── (standalone)

subParseString.py
└── patterns.py (DataPatterns)
```

## Benefits of New Structure

1. **Single Responsibility**: Each module has one clear purpose
2. **DRY Principle**: Shared logic lives in the utility modules instead of being duplicated
3. **Testability**: Individual modules can be tested in isolation
4. **Extensibility**: Easy to add new data types or processing methods
5. **Maintainability**: Changes stay local to a module as long as its interface is unchanged
6. **Performance**: Each specialized processor can be tuned for its own data type