# Document Generation One-Path Refactoring Plan

## Overview

This document outlines the refactoring plan to unify the document generation system by eliminating the dual-path approach (single-file vs multi-file) and implementing a unified multi-file approach that handles both single and multiple document generation seamlessly.

## Current State Analysis

### Current Dual-Path Structure
- **Single File Path**: `_callAiWithSingleFileGeneration()`
- **Multi File Path**: `_callAiWithMultiFileGeneration()`
- **Code Duplication**: ~80% of the functionality is duplicated across the two paths
- **Maintenance Overhead**: Two separate code paths to maintain

### Key Differences to Address
1. **Prompt Generation**: `getExtractionPrompt` vs `getAdaptiveExtractionPrompt`
2. **Result Structure**: Single object vs array structure
3. **Validation Logic**: Different validation rules for single vs multi-file
4. **Processing Pipeline**: Separate processing flows

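The result-structure difference (item 2) is the one callers are most likely to notice. Below is a minimal sketch of how a single-object response could be wrapped into the array-based structure. It assumes the legacy shape is a dict with top-level `title` and `sections` keys; that shape and the helper name `normalizeToDocuments` are hypothetical and not taken from the existing code.

```python
from typing import Any, Dict


def normalizeToDocuments(response: Dict[str, Any], fallbackTitle: str = "Document") -> Dict[str, Any]:
    """Wrap a single-document response into the {"documents": [...]} shape.

    Hypothetical helper: the legacy single-object shape (top-level "title"
    and "sections") is an assumption made for illustration.
    """
    # Already in the unified shape: return it unchanged.
    if isinstance(response.get("documents"), list):
        return response

    # Otherwise treat the whole object as exactly one document.
    return {
        "documents": [
            {
                "id": "doc_1",
                "title": response.get("title", fallbackTitle),
                "sections": response.get("sections", []),
            }
        ]
    }


legacy = {"title": "Report", "sections": [{"id": "s1"}]}
unified = normalizeToDocuments(legacy)
print(len(unified["documents"]))
```

With a wrapper like this at the boundary, everything downstream only ever needs to handle the array form.
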
## Refactoring Goals

1. **Unify Code Paths**: Single processing pipeline for all document generation
2. **Eliminate Duplication**: Remove ~200 lines of duplicate code
3. **Improve Maintainability**: Single code path to maintain and test
4. **Enhance Flexibility**: Dynamic switching between single/multi based on content
5. **Preserve Functionality**: Maintain all existing capabilities

## Implementation Plan

### Phase 1: Prompt Generation Unification

#### 1.1 Modify `getAdaptiveExtractionPrompt` to Handle Single File
```python
async def getAdaptiveExtractionPrompt(
    self,
    outputFormat: str,
    userPrompt: str,
    title: str,
    promptAnalysis: Dict[str, Any],
    aiService: AiService
) -> str:
    """
    Unified extraction prompt that handles both single and multi-file cases.
    Hides multi-file specific parts when single file is requested.
    """

    # Base prompt structure
    basePrompt = f"""
Generate a structured document in {outputFormat} format based on the user request.

User Request: {userPrompt}
Title: {title}
"""

    # Add multi-file logic only if needed
    if promptAnalysis.get("is_multi_file", False):
        multiFileSection = f"""

MULTI-FILE GENERATION:
- Split strategy: {promptAnalysis.get("strategy", "custom")}
- Split criteria: {promptAnalysis.get("criteria", "content-based")}
- File naming pattern: {promptAnalysis.get("file_naming_pattern", "document_{index}")}

Return JSON structure:
{{
    "documents": [
        {{
            "id": "doc_1",
            "title": "Document Title",
            "filename": "document_1.{outputFormat}",
            "sections": [...]
        }}
    ]
}}
"""
        basePrompt += multiFileSection
    else:
        singleFileSection = f"""

SINGLE-FILE GENERATION:
Return JSON structure:
{{
    "documents": [
        {{
            "id": "doc_1",
            "title": "{title}",
            "filename": "{title}.{outputFormat}",
            "sections": [...]
        }}
    ]
}}
"""
        basePrompt += singleFileSection

    # Add chunking support for large documents
    chunkingSection = """

CHUNKING SUPPORT:
If the document is too large to generate in one response, include:
- "continue": true
- "continuation_context": {
    "last_section_id": "section_id",
    "last_element_index": 0,
    "remaining_requirements": "description"
}

The system will automatically request continuation chunks until complete.
"""
    basePrompt += chunkingSection

    return basePrompt
```

#### 1.2 Remove `getExtractionPrompt` Method
- Delete the single-file specific prompt generation method
- Update all references to use `getAdaptiveExtractionPrompt`

### Phase 2: Unified Processing Pipeline

#### 2.1 Create Unified `callAiWithDocumentGeneration` Method
```python
async def callAiWithDocumentGeneration(
    self,
    prompt: str,
    documents: Optional[List[ChatDocument]],
    options: AiCallOptions,
    outputFormat: str,
    title: Optional[str]
) -> Dict[str, Any]:
    """
    Unified document generation method that handles both single and multi-file cases.
    Always uses multi-file approach internally.
    """
    try:
        # 1. Analyze prompt intent
        promptAnalysis = await self._analyzePromptIntent(prompt, self)
        logger.info(f"Prompt analysis result: {promptAnalysis}")

        # 2. Get unified extraction prompt
        from modules.services.serviceGeneration.mainServiceGeneration import GenerationService
        generationService = GenerationService(self.services)

        extractionPrompt = await generationService.getAdaptiveExtractionPrompt(
            outputFormat=outputFormat,
            userPrompt=prompt,
            title=title,
            promptAnalysis=promptAnalysis,
            aiService=self
        )

        # 3. Process with unified pipeline (always multi-file approach)
        aiResponse = await self._processDocumentsUnified(
            documents, extractionPrompt, options, outputFormat, title, promptAnalysis
        )

        # 4. Return unified result structure
        # The result builder awaits _processDocument internally, so it must be awaited here
        return await self._buildUnifiedResult(aiResponse, outputFormat, title, promptAnalysis)

    except Exception as e:
        logger.error(f"Error in unified document generation: {str(e)}")
        return self._buildErrorResult(str(e), outputFormat, title)
```

#### 2.2 Create Unified Processing Method
```python
async def _processDocumentsUnified(
    self,
    documents: Optional[List[ChatDocument]],
    extractionPrompt: str,
    options: AiCallOptions,
    outputFormat: str,
    title: str,
    promptAnalysis: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Unified document processing that handles both single and multi-file cases.
    Always processes as multi-file structure internally.
    """
    import time

    # Create progress logger
    workflow = self.services.currentWorkflow
    progressLogger = self.services.workflow.createProgressLogger(workflow)
    operationId = f"docGenUnified_{workflow.id}_{int(time.time())}"

    try:
        # Start progress tracking
        progressLogger.startOperation(
            operationId,
            "Generate",
            "Unified Document Generation",
            f"Processing {len(documents) if documents else 0} documents"
        )

        # Update progress - generating extraction prompt
        progressLogger.updateProgress(operationId, 0.1, "Generating prompt")

        # Process with unified JSON pipeline
        aiResponse = await self.documentProcessor.processDocumentsPerChunkJsonWithPrompt(
            documents, extractionPrompt, options
        )

        # Update progress - AI processing completed
        progressLogger.updateProgress(operationId, 0.6, "Processing done")

        # Validate response structure
        if not self._validateUnifiedResponseStructure(aiResponse):
            raise Exception("AI response is not valid unified document structure")

        # Emit raw extracted data as a chat message attachment
        try:
            await self._postRawDataChatMessage(aiResponse, label="raw_extraction_unified")
        except Exception:
            logger.warning("Failed to emit raw extraction chat message (unified)")

        # Complete progress tracking
        progressLogger.completeOperation(operationId, True)

        return aiResponse

    except Exception as e:
        logger.error(f"Error in unified document processing: {str(e)}")
        progressLogger.completeOperation(operationId, False)
        raise
```

### Phase 3: Unified Validation System

#### 3.1 Create Unified Validation Method
```python
def _validateUnifiedResponseStructure(self, response: Dict[str, Any]) -> bool:
    """
    Unified validation that checks for multi-file structure.
    Validates that response has documents array and each document has sections.
    """
    try:
        if not isinstance(response, dict):
            logger.warning(f"Response validation failed: Response is not a dict, got {type(response)}")
            return False

        # Check for documents array
        hasDocuments = "documents" in response
        isDocumentsList = isinstance(response.get("documents"), list)

        if not (hasDocuments and isDocumentsList):
            logger.warning(f"Unified validation failed: documents key present={hasDocuments}, documents is list={isDocumentsList}")
            logger.warning(f"Available keys: {list(response.keys())}")
            return False

        documents = response.get("documents", [])
        if not documents:
            logger.warning("Unified validation failed: documents array is empty")
            return False

        # Validate each document individually
        validDocuments = 0
        for i, doc in enumerate(documents):
            if self._validateDocumentStructure(doc, i):
                validDocuments += 1
            else:
                logger.warning(f"Document {i} failed validation, but continuing with others")

        # Process succeeds if at least one document is valid
        if validDocuments == 0:
            logger.error("Unified validation failed: no valid documents found")
            return False

        logger.info(f"Unified validation passed: {validDocuments}/{len(documents)} documents valid")
        return True

    except Exception as e:
        logger.warning(f"Unified response validation failed with exception: {str(e)}")
        return False

def _validateDocumentStructure(self, document: Dict[str, Any], documentIndex: int) -> bool:
    """
    Validate individual document structure.
    Returns True if document is valid, False otherwise.
    Does not fail the entire process if one document is invalid.
    """
    try:
        if not isinstance(document, dict):
            logger.warning(f"Document {documentIndex} validation failed: not a dict")
            return False

        # Check for required fields
        hasTitle = "title" in document
        hasSections = "sections" in document
        isSectionsList = isinstance(document.get("sections"), list)

        if not (hasTitle and hasSections and isSectionsList):
            logger.warning(f"Document {documentIndex} validation failed: title={hasTitle}, sections={hasSections}, sections_list={isSectionsList}")
            return False

        sections = document.get("sections", [])
        if not sections:
            logger.warning(f"Document {documentIndex} validation failed: sections array is empty")
            return False

        logger.info(f"Document {documentIndex} validation passed")
        return True

    except Exception as e:
        logger.warning(f"Document {documentIndex} validation failed with exception: {str(e)}")
        return False
```

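Because validation passes as soon as one document is valid, any invalid entries stay in `response["documents"]` and are still handed to later stages. One option is to filter them out before processing. The sketch below assumes the same per-document rules as `_validateDocumentStructure`; the standalone function names `isValidDocument` and `selectValidDocuments` are hypothetical.

```python
from typing import Any, Dict, List


def isValidDocument(document: Any) -> bool:
    """Mirror the per-document rules: a dict with a title and a non-empty sections list."""
    return (
        isinstance(document, dict)
        and "title" in document
        and isinstance(document.get("sections"), list)
        and len(document["sections"]) > 0
    )


def selectValidDocuments(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return only the documents that pass validation, preserving their order."""
    documents = response.get("documents")
    if not isinstance(documents, list):
        return []
    return [doc for doc in documents if isValidDocument(doc)]


response = {
    "documents": [
        {"title": "Valid", "sections": [{"id": "s1"}]},
        {"title": "No sections", "sections": []},
        "not a dict",
    ]
}
print([doc["title"] for doc in selectValidDocuments(response)])
```

Filtering up front keeps the "at least one valid document" rule while avoiding wasted AI and rendering calls on entries that are already known to be malformed.
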
#### 3.2 Remove Old Validation Methods
- Delete `_validateResponseStructure` method
- Update all references to use `_validateUnifiedResponseStructure`

### Phase 4: Unified Result Structure

#### 4.1 Create Unified Result Builder
```python
# Declared async because it awaits _processDocument; callers must await it
async def _buildUnifiedResult(
    self,
    aiResponse: Dict[str, Any],
    outputFormat: str,
    title: str,
    promptAnalysis: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Build unified result structure that always returns array-based format.
    Content is always a multi-document structure.
    """
    try:
        # Process all documents uniformly
        generatedDocuments = []
        documents = aiResponse.get("documents", [])

        for i, docData in enumerate(documents):
            try:
                processedDocument = await self._processDocument(
                    docData, outputFormat, title, promptAnalysis, i
                )
                generatedDocuments.append(processedDocument)
            except Exception as e:
                logger.warning(f"Failed to process document {i}: {str(e)}, skipping")
                continue

        if not generatedDocuments:
            raise Exception("No documents could be processed successfully")

        # Build unified result
        result = {
            "success": True,
            "content": aiResponse,  # Always multi-document structure
            "documents": generatedDocuments,  # Always array
            "is_multi_file": len(generatedDocuments) > 1,
            "format": outputFormat,
            "title": title,
            "split_strategy": promptAnalysis.get("strategy", "single"),
            "total_documents": len(documents),  # Includes documents that were skipped
            "processed_documents": len(generatedDocuments)
        }

        return result

    except Exception as e:
        logger.error(f"Error building unified result: {str(e)}")
        return self._buildErrorResult(str(e), outputFormat, title)

async def _processDocument(
    self,
    docData: Dict[str, Any],
    outputFormat: str,
    title: str,
    promptAnalysis: Dict[str, Any],
    documentIndex: int
) -> Dict[str, Any]:
    """
    Process individual document with content enhancement and rendering.
    """
    try:
        # Get generation service
        from modules.services.serviceGeneration.mainServiceGeneration import GenerationService
        generationService = GenerationService(self.services)

        # Use AI generation to enhance the extracted JSON before rendering
        enhancedContent = docData  # Default to original
        if docData.get("sections"):
            try:
                # Get generation prompt
                generationPrompt = await generationService.getGenerationPrompt(
                    outputFormat=outputFormat,
                    userPrompt=title,
                    title=docData.get("title", title),
                    aiService=self
                )

                # Prepare the AI call
                from modules.datamodels.datamodelAi import AiCallRequest, AiCallOptions, OperationType
                requestOptions = AiCallOptions()
                requestOptions.operationType = OperationType.GENERAL

                # Create context with the extracted JSON content
                import json
                context = f"Extracted JSON content:\n{json.dumps(docData, indent=2)}"

                request = AiCallRequest(
                    prompt=generationPrompt,
                    context=context,
                    options=requestOptions
                )

                # Call AI to enhance the content
                response = await self.aiObjects.call(request)

                if response and response.content:
                    # Parse the AI response as JSON
                    try:
                        import re
                        result = response.content.strip()

                        # Extract JSON from markdown if present
                        jsonMatch = re.search(r'```json\s*\n(.*?)\n```', result, re.DOTALL)
                        if jsonMatch:
                            result = jsonMatch.group(1).strip()
                        elif result.startswith('```json'):
                            result = re.sub(r'^```json\s*', '', result)
                            result = re.sub(r'\s*```$', '', result)
                        elif result.startswith('```'):
                            result = re.sub(r'^```\s*', '', result)
                            result = re.sub(r'\s*```$', '', result)

                        # Try to parse JSON
                        enhancedContent = json.loads(result)
                        logger.info(f"AI enhanced JSON content successfully for document {documentIndex}")

                    except json.JSONDecodeError as e:
                        logger.warning(f"AI generation returned invalid JSON for document {documentIndex}: {str(e)}, using original content")
                        enhancedContent = docData
                else:
                    logger.warning(f"AI generation returned empty response for document {documentIndex}, using original content")
                    enhancedContent = docData

            except Exception as e:
                logger.warning(f"AI generation failed for document {documentIndex}: {str(e)}, using original content")
                enhancedContent = docData

        # Render the enhanced JSON content
        renderedContent, mimeType = await generationService.renderReport(
            extractedContent=enhancedContent,
            outputFormat=outputFormat,
            title=docData.get("title", title),
            userPrompt=title,
            aiService=self
        )

        # Generate proper filename
        baseFilename = docData.get("filename", f"document_{documentIndex + 1}")
        if '.' in baseFilename:
            baseFilename = baseFilename.rsplit('.', 1)[0]

        # Add proper extension based on output format
        if outputFormat.lower() == "docx":
            filename = f"{baseFilename}.docx"
        elif outputFormat.lower() == "pdf":
            filename = f"{baseFilename}.pdf"
        elif outputFormat.lower() == "html":
            filename = f"{baseFilename}.html"
        else:
            filename = f"{baseFilename}.{outputFormat}"

        return {
            "documentName": filename,
            "documentData": renderedContent,
            "mimeType": mimeType,
            "title": docData.get("title", title),
            "documentIndex": documentIndex
        }

    except Exception as e:
        logger.error(f"Error processing document {documentIndex}: {str(e)}")
        raise

def _buildErrorResult(self, errorMessage: str, outputFormat: str, title: str) -> Dict[str, Any]:
    """
    Build error result with unified structure.
    """
    return {
        "success": False,
        "error": errorMessage,
        "content": {},
        "documents": [],
        "is_multi_file": False,
        "format": outputFormat,
        "title": title,
        "split_strategy": "error",
        "total_documents": 0,
        "processed_documents": 0
    }
```

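The fence-stripping logic inside `_processDocument` is easier to test when pulled into a small pure function. The sketch below is one possible extraction and not part of the plan itself: the name `extractJsonPayload` is hypothetical, it uses a single lenient pattern rather than the three branches above, and it additionally rejects JSON that is not an object. The fence marker is built by string repetition only to keep literal backtick runs out of the example.

```python
import json
import re
from typing import Any, Dict, Optional

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker

# Optional "json" language tag, then the fenced body (captured lazily).
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extractJsonPayload(text: str) -> Optional[Dict[str, Any]]:
    """Parse a JSON object from a raw or fenced AI response; return None on failure."""
    candidate = text.strip()
    match = _FENCED.search(candidate)
    if match:
        candidate = match.group(1)
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    # Only a JSON object can be a document; lists or scalars are rejected.
    return parsed if isinstance(parsed, dict) else None


fenced = "Here is the result:\n" + FENCE + 'json\n{"title": "A"}\n' + FENCE
print(extractJsonPayload(fenced))
print(extractJsonPayload("not json"))
```

Returning `None` instead of raising lets the caller keep the existing "fall back to the original content" behavior with a single `if` check.
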
### Phase 5: Remove Legacy Methods

#### 5.1 Delete Legacy Methods
```python
# Remove these methods:
#   - _callAiWithSingleFileGeneration()
#   - _callAiWithMultiFileGeneration()
#   - _validateResponseStructure()
#   - getExtractionPrompt() (in GenerationService)
```

#### 5.2 Update Method References
- Update all callers to use `callAiWithDocumentGeneration()`
- Update tests to use unified approach
- Update documentation

### Phase 6: Testing and Validation

#### 6.1 Unit Tests
```python
# `aiService` and `options` are assumed to be provided by the test setup (e.g. fixtures)

async def test_unified_single_file_generation():
    """Test that single file generation works with unified approach"""
    result = await aiService.callAiWithDocumentGeneration(
        prompt="Generate a single document",
        documents=None,
        options=options,
        outputFormat="html",
        title="Test Document"
    )

    assert result["success"] is True
    assert result["is_multi_file"] is False
    assert len(result["documents"]) == 1
    assert isinstance(result["content"], dict)
    assert "documents" in result["content"]

async def test_unified_multi_file_generation():
    """Test that multi file generation works with unified approach"""
    result = await aiService.callAiWithDocumentGeneration(
        prompt="Generate multiple documents",
        documents=None,
        options=options,
        outputFormat="html",
        title="Test Documents"
    )

    assert result["success"] is True
    assert result["is_multi_file"] is True
    assert len(result["documents"]) > 1
    assert isinstance(result["content"], dict)
    assert "documents" in result["content"]

async def test_unified_validation_partial_failure():
    """Test that partial document failure doesn't fail entire process"""
    # Mock scenario where one document fails validation
    # Should process remaining documents successfully
    pass
```

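The partial-failure case can be exercised without a real AI backend by stubbing the per-document step. The sketch below is illustrative only: `processAll` is a hypothetical stand-in for the skip-on-failure loop in `_buildUnifiedResult`, and the per-document call is replaced by `unittest.mock.AsyncMock`, so it demonstrates the shape of the test rather than testing the real service.

```python
import asyncio
from typing import Any, Dict, List
from unittest.mock import AsyncMock


async def processAll(documents: List[Dict[str, Any]], processDocument) -> List[Dict[str, Any]]:
    """Stand-in for the result-builder loop: skip documents whose processing fails."""
    generated = []
    for index, docData in enumerate(documents):
        try:
            generated.append(await processDocument(docData, index))
        except Exception:
            continue  # one failing document must not abort the batch
    if not generated:
        raise RuntimeError("No documents could be processed successfully")
    return generated


async def test_partial_failure_keeps_remaining_documents() -> None:
    # The second call raises; the first and third return a rendered stub.
    processDocument = AsyncMock(side_effect=[
        {"documentName": "a.html"},
        ValueError("render failed"),
        {"documentName": "c.html"},
    ])
    result = await processAll([{}, {}, {}], processDocument)
    assert [doc["documentName"] for doc in result] == ["a.html", "c.html"]
    assert processDocument.await_count == 3  # every document was attempted


asyncio.run(test_partial_failure_keeps_remaining_documents())
```

In the real test suite the same idea would apply by patching `_processDocument` on the service instance, assuming the test framework in use supports async tests.
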
#### 6.2 Integration Tests
- Test with various document types
- Test with different output formats
- Test chunking functionality
- Test error handling scenarios

## Migration Strategy

### Step 1: Implement Unified Methods
1. Create new unified methods alongside existing ones
2. Add feature flag to switch between old and new approaches
3. Test new methods thoroughly

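The feature flag in Step 1 can be as small as a switch that picks which entry point to call. The sketch below assumes an environment variable named `DOCGEN_USE_UNIFIED`; that name, and the helpers `isUnifiedEnabled` and `selectGenerationPath`, are hypothetical and only illustrate the idea.

```python
import os
from typing import Callable, Mapping

USE_UNIFIED_ENV = "DOCGEN_USE_UNIFIED"  # hypothetical environment variable name


def isUnifiedEnabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Anything other than an explicit '1', 'true', or 'yes' keeps the legacy path."""
    return environ.get(USE_UNIFIED_ENV, "").strip().lower() in {"1", "true", "yes"}


def selectGenerationPath(
    unified: Callable,
    legacy: Callable,
    environ: Mapping[str, str] = os.environ,
) -> Callable:
    """Return the unified entry point when the flag is on, otherwise the legacy one."""
    return unified if isUnifiedEnabled(environ) else legacy


def unifiedPath() -> str:
    return "unified"


def legacyPath() -> str:
    return "legacy"


# Passing the environment explicitly keeps the example independent of the real process env.
chosen = selectGenerationPath(unifiedPath, legacyPath, {USE_UNIFIED_ENV: "true"})
print(chosen())
```

Defaulting to the legacy path when the flag is unset or malformed means a misconfiguration cannot silently enable the new code.
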
### Step 2: Update Callers
1. Update all callers to use unified approach
2. Update tests to use new methods
3. Verify functionality is preserved

### Step 3: Remove Legacy Code
1. Remove old single-file and multi-file methods
2. Remove old validation methods
3. Clean up unused imports and references

### Step 4: Final Testing
1. Run full test suite
2. Test with real-world scenarios
3. Performance testing
4. Documentation updates

## Benefits of Unified Approach

### Code Quality
- **Reduced Duplication**: ~200 lines of duplicate code removed
- **Single Code Path**: Easier to maintain and debug
- **Consistent Behavior**: Same logic for all document types

### Performance
- **Better CPU Cache Usage**: Single code path, though the gain is likely marginal and should be confirmed by performance testing
- **Reduced Memory Footprint**: No duplicate code
- **Faster Development**: Changes affect all cases automatically

### Maintainability
- **Single Point of Truth**: All document generation logic in one place
- **Easier Testing**: One code path to test
- **Simpler Debugging**: Single call stack to trace

### Flexibility
- **Dynamic Switching**: Can switch between single/multi based on content
- **Easy Extensions**: New features automatically work for all cases
- **Better Error Handling**: Unified error handling approach

## Risk Mitigation

### Backward Compatibility
- **No Backward Compatibility Required**: As specified
- **Clean Migration**: Complete replacement of old system

### Testing
- **Comprehensive Testing**: Unit and integration tests
- **Real-World Testing**: Test with actual use cases
- **Performance Testing**: Ensure no performance regression

### Rollback Plan
- **Feature Flag**: Can quickly switch back to old system if needed
- **Gradual Migration**: Can migrate callers one by one
- **Monitoring**: Monitor for any issues during migration

## Conclusion

This refactoring plan provides a clear path to unify the document generation system, eliminating code duplication while preserving all existing functionality. The unified approach is more maintainable, performant, and flexible than the current dual-path system.

The key insight is that single-file generation is just a special case of multi-file generation with exactly one document, so one unified pipeline is simpler and easier to maintain than two separate code paths.