Parallel Processing Refactoring Concept
Current State (Sequential)
Chapter Sections Structure Generation (_generateChapterSectionsStructure)
- Current: Processes chapters sequentially, one after another
- Flow:
- Iterate through documents
- For each document, iterate through chapters
- For each chapter, generate sections structure using AI
- Update progress after each chapter
Section Content Generation (_fillChapterSections)
- Current: Processes chapters sequentially, sections within each chapter sequentially
- Flow:
- Iterate through documents
- For each document, iterate through chapters
- For each chapter, iterate through sections
- For each section, generate content using AI
- Update progress after each section
Desired State (Parallel)
Chapter Sections Structure Generation
- Target: Process all chapters in parallel
- Requirements:
- Maintain chapter order in final result
- Each chapter can be processed independently
- Progress updates should reflect parallel processing
- Errors in one chapter should not stop others
Section Content Generation
- Target: Process sections within each chapter in parallel
- Requirements:
- Maintain section order within each chapter
- Sections within a chapter can be processed independently
- Chapters still processed sequentially (to maintain order)
- Progress updates should reflect parallel processing
- Errors in one section should not stop others
Implementation Strategy
Phase 1: Chapter Sections Structure Generation Parallelization
Step 1.1: Extract Single Chapter Processing
- Create: `_generateSingleChapterSectionsStructure()` method
- Purpose: Process one chapter independently
- Parameters:
  - `chapter`: Chapter dict
  - `chapterIndex`: Index for ordering
  - `chapterId`, `chapterLevel`, `chapterTitle`: Chapter metadata
  - `generationHint`: Generation instructions
  - `contentPartIds`, `contentPartInstructions`: Content part info
  - `contentParts`: Full content parts list
  - `userPrompt`: User's original prompt
  - `language`: Language for generation
  - `parentOperationId`: For progress logging
- Returns: None (modifies chapter dict in place)
- Error Handling: Logs errors, raises exception to be caught by caller
Step 1.2: Refactor Main Method
- Modify: `_generateChapterSectionsStructure()`
- Changes:
  - Collect all chapters with their indices
  - Create async tasks for each chapter using `_generateSingleChapterSectionsStructure`
  - Use `asyncio.gather()` to execute all tasks in parallel
  - Process results in order (using `zip` with original order)
  - Handle errors per chapter (don't fail entire operation)
  - Update progress after each chapter completes
Step 1.3: Progress Reporting
- Maintain: Overall progress tracking
- Update: Progress after each chapter completes (not sequentially)
- Format: "Chapter X/Y completed" or "Chapter X/Y error"
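The Phase 1 steps above can be sketched roughly as follows. This is a minimal illustration, not the real implementation: `_generate_single_chapter` stands in for the per-chapter AI call, and the chapter dicts are simplified.

```python
import asyncio
from typing import Any, Dict, List

async def _generate_single_chapter(chapter: Dict[str, Any], index: int) -> None:
    # Placeholder for the real AI call; modifies the chapter dict in place.
    await asyncio.sleep(0)
    chapter["sections"] = [{"title": f"Section {index + 1}.1"}]

async def generate_chapter_sections_structure(
    chapters: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    # One task per chapter; gather() returns results in task-list order,
    # so they line up with the original chapter order.
    tasks = [_generate_single_chapter(ch, i) for i, ch in enumerate(chapters)]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    for chapter, result in zip(chapters, results):
        if isinstance(result, Exception):
            # An error in one chapter must not stop the others.
            chapter["error"] = str(result)
    return chapters
```

Because `asyncio.gather()` preserves the order of its argument list, no extra sorting is needed to keep chapters aligned with their results.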
Phase 2: Section Content Generation Parallelization
Step 2.1: Extract Single Section Processing
- Create: `_processSingleSection()` method
- Purpose: Process one section independently
- Parameters:
  - `section`: Section dict
  - `sectionIndex`: Index for ordering
  - `totalSections`: Total sections in chapter
  - `chapterIndex`: Chapter index
  - `totalChapters`: Total chapters
  - `chapterId`: Chapter ID
  - `chapterOperationId`: Chapter progress operation ID
  - `fillOperationId`: Overall fill operation ID
  - `contentParts`: Full content parts list
  - `userPrompt`: User's original prompt
  - `all_sections_list`: All sections for context
  - `language`: Language for generation
  - `calculateOverallProgress`: Function to calculate overall progress
- Returns: `List[Dict[str, Any]]` (elements for the section)
- Error Handling: Returns error element instead of raising
Step 2.2: Extract Section Processing Logic
- Create: Helper methods for different processing paths:
  - `_processSectionAggregation()`: Handle aggregation path (multiple parts)
  - `_processSectionGeneration()`: Handle generation without parts (only generationHint)
  - `_processSectionParts()`: Handle individual part processing
- Purpose: Keep logic organized and reusable
Step 2.3: Refactor Main Method
- Modify: `_fillChapterSections()`
- Changes:
  - Keep sequential chapter processing (maintains order)
  - For each chapter, collect all sections with indices
  - Create async tasks for each section using `_processSingleSection`
  - Use `asyncio.gather()` to execute all section tasks in parallel
  - Process results in order (using `zip` with original order)
  - Assign elements to sections in correct order
  - Update progress after each section completes
  - Handle errors per section (don't fail entire chapter)
Step 2.4: Progress Reporting
- Maintain: Hierarchical progress tracking
- Update:
- Section progress: After each section completes
- Chapter progress: After all sections in chapter complete
- Overall progress: After each section/chapter completes
- Format: "Chapter X/Y, Section A/B completed"
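The Phase 2 flow (sequential chapters, parallel sections) can be sketched as follows. Names and dict shapes are illustrative; `_process_single_section` stands in for the real content-generation call.

```python
import asyncio
from typing import Any, Dict, List

async def _process_single_section(section: Dict[str, Any]) -> List[Dict[str, Any]]:
    # Placeholder for the real content generation; returns the section's elements.
    try:
        await asyncio.sleep(0)
        return [{"type": "paragraph", "text": f"Content for {section['title']}"}]
    except Exception as exc:
        # Return an error element instead of raising, so sibling sections continue.
        return [{"type": "error", "message": str(exc)}]

async def fill_chapter_sections(
    chapters: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    for chapter in chapters:  # chapters stay sequential to preserve order
        sections = chapter.get("sections", [])
        tasks = [_process_single_section(s) for s in sections]
        results = await asyncio.gather(*tasks)  # sections run in parallel
        for section, elements in zip(sections, results):
            section["elements"] = elements  # assign in original section order
    return chapters
```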
Key Considerations
Order Preservation
- Chapters: Must maintain document order (structure generation runs chapters in parallel; content fill keeps chapters sequential)
- Sections: Must maintain their order within each chapter, even when generated in parallel
- Solution: Use `asyncio.gather()` with an ordered task list, then `zip` results with the original order
Error Handling
- Chapters: Error in one chapter should not stop others
- Sections: Error in one section should not stop others
- Solution: Use `return_exceptions=True` in `asyncio.gather()`, check `isinstance(result, Exception)`
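A minimal demonstration of this pattern; `flaky` is a hypothetical task where one invocation fails:

```python
import asyncio

async def flaky(i: int) -> int:
    if i == 1:
        raise ValueError(f"task {i} failed")
    await asyncio.sleep(0)
    return i * 10

async def run_all() -> list:
    # return_exceptions=True turns a raised exception into a result value,
    # so one failure does not cancel the sibling tasks.
    results = await asyncio.gather(*(flaky(i) for i in range(3)),
                                   return_exceptions=True)
    return ["error" if isinstance(r, Exception) else r for r in results]
```

The failed task's slot still appears at its original position in the results list, which is what keeps order preservation and per-task error handling compatible.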
Progress Reporting
- Challenge: Progress updates happen out of order
- Solution: Update progress when each task completes, not sequentially
- Format: Show completed count, not sequential position
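One way to report a completed count from parallel tasks is to wrap each task so it increments a shared counter when it finishes. A sketch (the `work` coroutine and message format are illustrative):

```python
import asyncio
from typing import List

async def work(i: int) -> int:
    await asyncio.sleep(0.001 * (3 - i))  # later tasks may finish first
    return i

async def run_with_progress(n: int) -> List[str]:
    messages: List[str] = []
    done = 0

    async def tracked(coro):
        nonlocal done
        result = await coro
        done += 1
        # Report the completed count, not the task's position in the list.
        messages.append(f"Chapter {done}/{n} completed")
        return result

    await asyncio.gather(*(tracked(work(i)) for i in range(n)))
    return messages
```

Because the counter updates inside the single-threaded event loop, no lock is needed; the messages always count up monotonically even though tasks finish out of order.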
Shared State
- Chapters: Modify chapter dicts in place (safe, each chapter is independent)
- Sections: Return elements, assign to sections in order (safe, each section is independent)
- Content Parts: Read-only, passed to all tasks (safe)
Dependencies
- Chapters: No dependencies between chapters
- Sections: No dependencies between sections (each is self-contained)
- Solution: All tasks can run truly in parallel
Implementation Steps
Step 1: Clean Current Code
- Ensure current sequential implementation is correct
- Fix any existing bugs
- Verify all tests pass
Step 2: Implement Chapter Parallelization
- Create `_generateSingleChapterSectionsStructure()` method
- Extract chapter processing logic
- Refactor `_generateChapterSectionsStructure()` to use parallel processing
- Test with single chapter
- Test with multiple chapters
- Verify order preservation
- Verify error handling
Step 3: Implement Section Parallelization
- Create `_processSingleSection()` method
- Extract section processing logic into helper methods
- Refactor `_fillChapterSections()` to use parallel processing for sections
- Test with single section
- Test with multiple sections
- Test with multiple chapters
- Verify order preservation
- Verify error handling
Step 4: Testing & Validation
- Test with various document structures
- Test error scenarios
- Verify progress reporting accuracy
- Performance testing (compare sequential vs parallel)
- Verify final output order matches input order
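A performance comparison can be sketched with a fake AI call that simulates network latency; real numbers will depend on the actual API, so this only demonstrates the measurement approach:

```python
import asyncio
import time

async def fake_ai_call() -> None:
    await asyncio.sleep(0.05)  # stand-in for a network-bound AI request

async def sequential(n: int) -> float:
    start = time.perf_counter()
    for _ in range(n):
        await fake_ai_call()
    return time.perf_counter() - start

async def parallel(n: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_ai_call() for _ in range(n)))
    return time.perf_counter() - start
```

For I/O-bound calls like these, parallel wall-clock time should approach the duration of a single call rather than the sum of all calls.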
Code Structure
New Methods to Create
async def _generateSingleChapterSectionsStructure(
self,
chapter: Dict[str, Any],
chapterIndex: int,
chapterId: str,
chapterLevel: int,
chapterTitle: str,
generationHint: str,
contentPartIds: List[str],
contentPartInstructions: Dict[str, Any],
contentParts: List[ContentPart],
userPrompt: str,
language: str,
parentOperationId: str
) -> None:
"""Generate sections structure for a single chapter (used for parallel processing)."""
# Extract logic from current sequential loop
# Modify chapter dict in place
# Handle errors internally, raise if critical
async def _processSingleSection(
self,
section: Dict[str, Any],
sectionIndex: int,
totalSections: int,
chapterIndex: int,
totalChapters: int,
chapterId: str,
chapterOperationId: str,
fillOperationId: str,
contentParts: List[ContentPart],
userPrompt: str,
all_sections_list: List[Dict[str, Any]],
language: str,
calculateOverallProgress: Callable
) -> List[Dict[str, Any]]:
"""Process a single section and return its elements."""
# Extract logic from current sequential loop
# Return elements list
# Return error element on failure (don't raise)
async def _processSectionAggregation(
self,
section: Dict[str, Any],
sectionId: str,
sectionTitle: str,
sectionIndex: int,
totalSections: int,
chapterId: str,
chapterOperationId: str,
fillOperationId: str,
contentPartIds: List[str],
contentFormats: Dict[str, str],
contentParts: List[ContentPart],
userPrompt: str,
generationHint: str,
all_sections_list: List[Dict[str, Any]],
language: str
) -> List[Dict[str, Any]]:
"""Process section with aggregation (multiple parts together)."""
# Extract aggregation logic
# Return elements list
async def _processSectionGeneration(
self,
section: Dict[str, Any],
sectionId: str,
sectionTitle: str,
sectionIndex: int,
totalSections: int,
chapterId: str,
chapterOperationId: str,
fillOperationId: str,
contentType: str,
userPrompt: str,
generationHint: str,
all_sections_list: List[Dict[str, Any]],
language: str
) -> List[Dict[str, Any]]:
"""Process section generation without content parts (only generationHint)."""
# Extract generation logic
# Return elements list
async def _processSectionParts(
self,
section: Dict[str, Any],
sectionId: str,
sectionTitle: str,
sectionIndex: int,
totalSections: int,
chapterId: str,
chapterOperationId: str,
fillOperationId: str,
contentPartIds: List[str],
contentFormats: Dict[str, str],
contentParts: List[ContentPart],
contentType: str,
useAiCall: bool,
generationHint: str,
userPrompt: str,
all_sections_list: List[Dict[str, Any]],
language: str
) -> List[Dict[str, Any]]:
"""Process individual parts in a section."""
# Extract individual part processing logic
# Return elements list
Modified Methods
async def _generateChapterSectionsStructure(
self,
chapterStructure: Dict[str, Any],
contentParts: List[ContentPart],
userPrompt: str,
parentOperationId: str
) -> Dict[str, Any]:
"""Generate sections structure for all chapters in parallel."""
# Collect chapters with indices
# Create tasks
# Execute in parallel
# Process results in order
# Update progress
async def _fillChapterSections(
self,
chapterStructure: Dict[str, Any],
contentParts: List[ContentPart],
userPrompt: str,
fillOperationId: str
) -> Dict[str, Any]:
"""Fill sections with content, processing sections in parallel within each chapter."""
# Process chapters sequentially
# For each chapter, process sections in parallel
# Maintain order
# Update progress
Testing Strategy
Unit Tests
- Test `_generateSingleChapterSectionsStructure` independently
- Test `_processSingleSection` independently
- Test helper methods independently
Integration Tests
- Test parallel chapter processing with multiple chapters
- Test parallel section processing with multiple sections
- Test error handling (one chapter/section fails)
- Test order preservation
Performance Tests
- Measure sequential vs parallel execution time
- Verify parallel processing is faster
- Check resource usage (memory, CPU)
Risk Mitigation
Risks
- Order not preserved: Use `zip` with original order
- Race conditions: No shared mutable state between tasks
- Progress reporting incorrect: Update progress when tasks complete
- Errors not handled: Use `return_exceptions=True` and check results
- Performance degradation: Test and measure, fall back to sequential if needed
Safety Measures
- Keep sequential implementation as fallback (commented out)
- Add feature flag to enable/disable parallel processing
- Extensive logging for debugging
- Gradual rollout (test with small datasets first)
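The feature flag plus sequential fallback could look like this; the flag name and dict shape are placeholders:

```python
import asyncio
from typing import Any, Dict, List

PARALLEL_SECTIONS_ENABLED = True  # feature flag; flip off to fall back

async def process_section(section: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder for the real per-section work.
    await asyncio.sleep(0)
    return {**section, "done": True}

async def process_sections(
    sections: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    if PARALLEL_SECTIONS_ENABLED:
        return list(await asyncio.gather(*(process_section(s) for s in sections)))
    # Sequential fallback kept behind the flag for safe rollback.
    return [await process_section(s) for s in sections]
```

Keeping both paths behind one flag makes an A/B comparison and an emergency rollback a one-line change, rather than relying on commented-out code.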
Migration Path
- Phase 1: Implement chapter parallelization, test thoroughly
- Phase 2: Implement section parallelization, test thoroughly
- Phase 3: Enable both in production with monitoring
- Phase 4: Remove sequential fallback code (if stable)
Notes
- All async methods must use `await` correctly
- Progress updates happen asynchronously (may appear out of order in logs)
- Final result order is guaranteed by processing results in order
- Error handling is per-task, not global
- No shared mutable state between parallel tasks (read-only contentParts, independent chapter/section dicts)