gateway/modules/neutralizer/readme.md
2025-09-22 00:39:15 +02:00

Neutralizer Module Structure

This module provides GDPR-compliant (German: DSGVO) data anonymization for AI agent systems. The code is split into specialized sub-modules to improve maintainability and code reuse.

Module Overview

Core Module

  • neutralizer.py - Main DataAnonymizer class that orchestrates all processing

Specialized Processors

  • subProcessText.py - Handles plain text processing without header information
  • subProcessList.py - Handles structured data with headers (CSV, JSON, XML)
  • subProcessBinary.py - Entry point for binary data types (images, audio, video, etc.); the framework is in place, the implementation is pending

Utility Modules

  • subParseString.py - String parsing and replacement utilities for emails, phones, addresses, IDs and names
  • subProcessCommon.py - Common utilities and data structures shared across modules
  • patterns.py - Pattern definitions for data anonymization

Key Features

1. Modular Architecture

  • Separation of Concerns: Each module handles a specific type of data processing
  • Code Reuse: Common functionality is centralized in utility modules
  • Maintainability: Easier to modify and extend individual components

2. Processing Order

  1. Pattern-based matches (emails, phones, addresses, etc.) are processed FIRST
  2. Custom names from the user list are processed SECOND
  3. Already anonymized content (placeholders) is skipped
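
The ordering above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function names, the simplified email pattern, and the exact placeholder regex are assumptions made for the example. It shows why placeholders are skipped: each pass only touches the text between existing placeholders.

```python
import re
import uuid

# Assumed placeholder shape: [tag.uuid], e.g. [email.<36-char uuid>]
PLACEHOLDER = re.compile(r"\[[a-z]+\.[0-9a-f-]{36}\]")
# Deliberately simplified email pattern, for illustration only
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def _substitute(segment, pattern, tag, mapping):
    """Replace every match in a placeholder-free segment of text."""
    def repl(match):
        original = match.group(0)
        if original not in mapping:
            mapping[original] = f"[{tag}.{uuid.uuid4()}]"
        return mapping[original]
    return pattern.sub(repl, segment)


def replace_outside_placeholders(text, pattern, tag, mapping):
    """Apply `pattern` only to text between existing placeholders."""
    out, pos = [], 0
    for ph in PLACEHOLDER.finditer(text):
        out.append(_substitute(text[pos:ph.start()], pattern, tag, mapping))
        out.append(ph.group(0))  # step 3: keep the placeholder untouched
        pos = ph.end()
    out.append(_substitute(text[pos:], pattern, tag, mapping))
    return "".join(out)


def anonymize(text, names):
    mapping = {}
    # Step 1: pattern-based matches (only emails in this sketch)
    text = replace_outside_placeholders(text, EMAIL, "email", mapping)
    # Step 2: custom names from the user list
    for name in names:
        name_pattern = re.compile(re.escape(name))
        text = replace_outside_placeholders(text, name_pattern, "name", mapping)
    return text, mapping
```

Running the name pass second means a name such as "doe" cannot split an email address like john.doe@example.com, because that address has already been replaced by a placeholder and is skipped.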

3. Supported Data Types

  • Text: Plain text documents, emails, etc.
  • Structured Data: CSV, JSON, XML with headers
  • Binary Data: Images, audio, video (framework ready, implementation pending)
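
For structured data, the header row can decide which columns are anonymized. The sketch below illustrates that idea for CSV using only the standard library. The header patterns and function names are hypothetical and are not taken from patterns.py or subProcessList.py; rows are assumed to have the same number of fields as the header.

```python
import csv
import io
import re
import uuid

# Hypothetical header patterns; the real ones live in patterns.py
HEADER_PATTERNS = {
    "email": re.compile(r"e-?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|tel", re.IGNORECASE),
    "name": re.compile(r"name", re.IGNORECASE),
}


def tag_for_header(header):
    """Return the placeholder tag for a sensitive header, else None."""
    for tag, pattern in HEADER_PATTERNS.items():
        if pattern.search(header):
            return tag
    return None


def anonymize_csv(csv_text):
    """Replace values in columns whose header looks sensitive."""
    mapping = {}
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text, mapping
    header, *body = rows
    tags = [tag_for_header(h) for h in header]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for row in body:
        new_row = []
        for tag, value in zip(tags, row):
            if tag and value:
                if value not in mapping:
                    mapping[value] = f"[{tag}.{uuid.uuid4()}]"
                value = mapping[value]
            new_row.append(value)
        writer.writerow(new_row)
    return out.getvalue(), mapping
```

Columns without a matching header, such as a city column, pass through unchanged in this sketch.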

4. Placeholder Protection

  • Prevents re-anonymization of already processed content
  • Uses format [tag.uuid] for placeholders
  • Validates placeholder format before processing
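
A placeholder check along these lines could look like the sketch below. The regular expression, the lowercase-tag rule, and the function names are assumptions for illustration; the README only states that the format is [tag.uuid].

```python
import re
import uuid

# Assumed shape: "[" + lowercase tag + "." + 36-character UUID + "]"
PLACEHOLDER_RE = re.compile(r"\[([a-z]+)\.([0-9a-fA-F-]{36})\]")


def make_placeholder(tag):
    """Build a placeholder such as [email.<uuid4>]."""
    return f"[{tag}.{uuid.uuid4()}]"


def is_placeholder(token):
    """Return True if `token` is exactly one well-formed placeholder."""
    match = PLACEHOLDER_RE.fullmatch(token)
    if match is None:
        return False
    try:
        # Let the standard library confirm the UUID part is parseable
        uuid.UUID(match.group(2))
    except ValueError:
        return False
    return True
```

A processor would call is_placeholder (or search with the same pattern) before replacing a value, so that a placeholder produced in an earlier run is never anonymized a second time.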

Usage Example

from modules.neutralizer import DataAnonymizer

# Initialize with custom names
anonymizer = DataAnonymizer(names_to_parse=['John Doe', 'Jane Smith'])

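# Process plain text (the content type is passed explicitly)
content = "Contact John Doe at john.doe@example.com"
result = anonymizer.process_content(content, content_type='text')

# Process structured data with headers, for example CSV
csv_content = "name,email\nJane Smith,jane.smith@example.com"
result = anonymizer.process_content(csv_content, content_type='csv')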

# Get mapping of original values to placeholders
mapping = anonymizer.get_mapping()
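
One use of the mapping is to restore original values in text that comes back from an AI agent. The helper below is a hypothetical sketch, not part of the module, and it assumes that get_mapping() returns a dict keyed by original value with the placeholder as the value.

```python
def restore(text, mapping):
    """Replace placeholders in `text` with their original values.

    `mapping` is assumed to map original value -> placeholder.
    """
    for original, placeholder in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

Because each placeholder embeds a UUID, a plain string replacement is unlikely to hit unrelated text.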

Module Dependencies

neutralizer.py
├── subProcessCommon.py (ProcessResult, CommonUtils)
├── subProcessText.py (TextProcessor)
├── subProcessList.py (ListProcessor)
├── subProcessBinary.py (BinaryProcessor)
└── patterns.py (Pattern definitions)

subProcessText.py
└── subParseString.py (StringParser)

subProcessList.py
├── subParseString.py (StringParser)
└── patterns.py (HeaderPatterns)

subProcessBinary.py
└── (standalone)

subParseString.py
└── patterns.py (DataPatterns)
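
The tree above can also be written down as data and checked for circular dependencies, for example in a unit test. This is a sketch using the standard library's graphlib (Python 3.9+); the DEPENDENCIES dict simply restates the tree, and load_order is a name chosen for this example.

```python
from graphlib import TopologicalSorter

# Each module maps to the set of modules it depends on (from the tree above)
DEPENDENCIES = {
    "neutralizer.py": {
        "subProcessCommon.py",
        "subProcessText.py",
        "subProcessList.py",
        "subProcessBinary.py",
        "patterns.py",
    },
    "subProcessText.py": {"subParseString.py"},
    "subProcessList.py": {"subParseString.py", "patterns.py"},
    "subProcessBinary.py": set(),
    "subParseString.py": {"patterns.py"},
}


def load_order(dependencies):
    """Return modules with dependencies first; raises CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())
```

Since neutralizer.py depends directly or indirectly on every other module, it comes last in any valid order.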

Benefits of the New Structure

  1. Single Responsibility: Each module has one clear purpose
  2. DRY Principle: Shared logic lives in the utility modules instead of being duplicated across processors
  3. Testability: Individual modules can be tested in isolation
  4. Extensibility: Easy to add new data types or processing methods
  5. Maintainability: Changes inside one module stay local as long as its interface is unchanged
  6. Performance: Each specialized processor can be optimized for its own data type