wiki/appdoc/json_string_accumulation_concept.md
2025-12-03 23:02:58 +01:00

# JSON String Accumulation Concept for Iterative AI Generation
## Problem Statement
Currently, the AI service processes each iteration's JSON string independently, then merges parsed objects. However, the real-world behavior is:
1. AI delivers a **STRING** containing JSON (not a parsed JSON object)
2. First iteration: AI delivers a JSON string that's cut off somewhere (broken/incomplete)
3. Subsequent iterations: AI delivers MORE JSON string fragments that need to be **APPENDED** to the previous JSON string
4. Challenge: How to handle incomplete JSON strings and merge them correctly
## Core Principle
- **If iteration 1 returns complete, valid JSON** → Use it directly (no accumulation needed)
- **If iteration 1 returns incomplete/broken JSON** → Enter accumulation mode
## State Management
State class is defined in `datamodelAi.py`:
```python
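from typing import Any, Dict, List, Optional
from pydantic import BaseModel  # assumes BaseModel is pydantic's

class JsonAccumulationState(BaseModel):
    accumulatedJsonString: str  # Raw accumulated JSON string
    isAccumulationMode: bool  # True if we're accumulating fragments
    lastParsedResult: Optional[Dict[str, Any]]  # Last successfully parsed result (for prompt context)
    allSections: List[Dict[str, Any]]  # Sections extracted so far (for prompt context)
    lastKpi: Optional[int] = None  # Last KPI percentage from the AI (read by validateKpiProgression)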
```
## Flow Logic
### Phase 1: First Iteration Check
```
1. Receive JSON string from AI
2. Try to parse:
   - SUCCESS + Complete → Extract sections → DONE (no accumulation)
   - FAILURE or INCOMPLETE → Enter accumulation mode
```
### Phase 2: Accumulation Mode (if needed)
```
For each iteration:
1. Receive newFragmentString
2. Concatenate with overlap handling:
   accumulatedJsonString = mergeJsonStringsWithOverlap(
       accumulatedJsonString,
       newFragmentString
   )
3. Try to parse accumulatedJsonString:
   - SUCCESS → Go to Phase 3 (completion)
   - FAILURE → Continue accumulation
4. Extract partial sections (for prompt context):
   - Use repairBrokenJson() to get best partial structure
   - Extract sections from partial structure
   - Update allSections (for next prompt)
5. Build continuation context for next prompt:
   - Extract deliveredSummary: Count of items/rows/lines per section
   - Extract cutOffElement: Incomplete element where JSON was cut off
   - Extract elementBeforeCutoff: Last complete element before cut-off
   - Store lastRawJson: Raw JSON string for reference
6. Keep accumulatedJsonString for next iteration
```
### Phase 3: Completion (when parsing succeeds)
```
1. Analyze completeness:
   - Check if all structures are closed
   - Identify missing closing elements
2. Add closing elements if needed:
   - Close unclosed arrays/objects
   - Ensure proper JSON structure
3. Repair if corrupted:
   - Fix any remaining corruption
4. Extract final sections:
   - extractSectionsFromDocument()
5. DONE
```
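If the closing-element analysis in step 1 is run against the raw accumulated string, a single scan that tracks open brackets and string state is enough to find what is missing. The sketch below is one possible approach under that assumption, not the project's implementation; `computeClosingSuffix` is a hypothetical name that does not appear elsewhere in this document.

```python
import json


def computeClosingSuffix(jsonString: str) -> str:
    """Return the characters needed to close open strings, arrays and objects."""
    stack = []          # expected closing characters, innermost last
    inString = False    # True while the scan is inside a JSON string literal
    escaped = False     # True right after a backslash inside a string
    for ch in jsonString:
        if inString:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                inString = False
            continue
        if ch == '"':
            inString = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack and stack[-1] == ch:
            stack.pop()
    # Close an open string first, then the brackets from innermost to outermost.
    return ('"' if inString else "") + "".join(reversed(stack))


truncated = '{"a": [1, 2, {"b": "x'
closed = truncated + computeClosingSuffix(truncated)
parsed = json.loads(closed)
```

This only restores bracket and quote balance. A string cut off right after a comma or a colon still will not parse after closing, so a repair step such as `repairBrokenJson()` remains necessary for those cases.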
## Function Design
### Main Function: `accumulateAndParseJsonFragments`
```python
@staticmethod
def accumulateAndParseJsonFragments(
    accumulatedJsonString: str,
    newFragmentString: str,
    allSections: List[Dict[str, Any]],
    iteration: int
) -> Tuple[str, List[Dict[str, Any]], bool, Optional[Dict[str, Any]]]:
    """
    Accumulate JSON fragments and parse when complete.

    GENERIC function that handles:
    1. Concatenating JSON strings with overlap detection
    2. Parsing the accumulated string
    3. Extracting sections (partial if incomplete, final if complete)
    4. Determining completion status

    Args:
        accumulatedJsonString: Previously accumulated JSON string
        newFragmentString: New fragment string from current iteration
        allSections: Sections extracted so far (for prompt context)
        iteration: Current iteration number

    Returns:
        Tuple of:
        - accumulatedJsonString: Updated accumulated string
        - sections: Extracted sections (partial if incomplete, final if complete)
        - isComplete: True if JSON is complete and valid
        - parsedResult: Parsed JSON object (if parsing succeeded)
    """
    # The helpers below are static methods of the same class, so they are called qualified.
    # Step 1: Clean encoding issues from accumulated string (check end of first delivered part)
    cleanedAccumulated = JsonResponseHandler.cleanEncodingIssues(accumulatedJsonString)
    # Step 2: Clean encoding issues from new fragment
    cleanedFragment = JsonResponseHandler.cleanEncodingIssues(newFragmentString)
    # Step 3: Concatenate with overlap handling
    combinedString = JsonResponseHandler.mergeJsonStringsWithOverlap(
        cleanedAccumulated,
        cleanedFragment
    )
    # Step 4: Try to parse
    try:
        extracted = extractJsonString(combinedString)
        parsedResult = json.loads(extracted)
        # Step 5: Parsing succeeded - check completeness
        isComplete = JsonResponseHandler.isJsonComplete(parsedResult)
        if isComplete:
            # Step 6: Complete JSON - finalize
            finalizedJson = JsonResponseHandler.finalizeJson(parsedResult)
            sections = extractSectionsFromDocument(finalizedJson)
            return combinedString, sections, True, finalizedJson
        else:
            # Step 7: Incomplete but parseable - extract partial sections
            sections = extractSectionsFromDocument(parsedResult)
            return combinedString, sections, False, parsedResult
    except json.JSONDecodeError:
        # Step 8: Still broken - repair and extract partial sections
        repaired = repairBrokenJson(combinedString)
        if repaired:
            sections = extractSectionsFromDocument(repaired)
            return combinedString, sections, False, repaired
        else:
            # Repair failed - continue with data BEFORE merging the problematic piece
            # Return previous accumulated string (before adding new fragment)
            # This ensures we don't lose previously accumulated data
            logger.warning(f"Iteration {iteration}: Repair failed, continuing with previous accumulated data")
            return accumulatedJsonString, [], False, None
```
## Helper Functions Needed
### 1. `mergeJsonStringsWithOverlap`
```python
@staticmethod
def mergeJsonStringsWithOverlap(
    accumulated: str,
    newFragment: str
) -> str:
    """
    GENERIC function to merge two JSON strings, handling overlaps intelligently.
    Works for ANY JSON structure - no specific logic for content types.

    Overlap scenarios (all handled generically):
    - Exact continuation: newFragment starts exactly where accumulated ends
    - Partial overlap: newFragment overlaps with end of accumulated
    - Full overlap: newFragment is subset of accumulated

    Strategy:
    1. Find longest common suffix/prefix match (string-based comparison)
    2. Remove duplicate content
    3. Concatenate remaining parts

    Args:
        accumulated: Previously accumulated JSON string
        newFragment: New fragment string to append

    Returns:
        Combined JSON string with overlaps removed
    """
    # Implementation:
    # - Find longest common suffix/prefix match
    # - Remove overlapping part
    # - Concatenate: accumulated + newFragment[overlapEnd:]
    pass
```
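The strategy described in the docstring can be sketched as a longest suffix/prefix match. This is a minimal illustration under stated assumptions rather than the final implementation: the function name `mergeWithOverlap` and the `minOverlap` parameter are hypothetical, and the comparison is purely string-based, as the concept requires.

```python
import json


def mergeWithOverlap(accumulated: str, newFragment: str, minOverlap: int = 1) -> str:
    """Append newFragment to accumulated, dropping the longest overlapping part."""
    maxLen = min(len(accumulated), len(newFragment))
    # Try the longest candidate first so the largest duplicate is removed.
    for size in range(maxLen, minOverlap - 1, -1):
        if size > 0 and accumulated.endswith(newFragment[:size]):
            return accumulated + newFragment[size:]
    # No overlap found: treat the fragment as an exact continuation.
    return accumulated + newFragment


# Partial overlap: ", 3" is repeated at the start of the fragment.
partial = mergeWithOverlap('{"items": [1, 2, 3', ', 3, 4]}')
# Exact continuation: nothing is repeated.
exact = mergeWithOverlap('{"a": [1,', ' 2]}')
# Full overlap: the fragment is already the tail of the accumulated string.
full = mergeWithOverlap('{"a": 1}', '1}')
```

Very short matches can be coincidental (for example a single `"` or `,`), so a caller may want to raise `minOverlap`. The quadratic worst case is unlikely to matter next to the duration of an AI call.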
### 2. `isJsonComplete`
```python
@staticmethod
def isJsonComplete(parsedJson: Dict[str, Any]) -> bool:
    """
    GENERIC function to check if parsed JSON structure is complete.
    Works for ANY JSON structure - no specific logic for content types.

    Completeness checks (all generic):
    - All arrays are properly closed
    - All objects are properly closed
    - No incomplete structures
    - Recursive validation of nested structures

    Args:
        parsedJson: Parsed JSON object

    Returns:
        True if JSON is complete, False otherwise
    """
    # Implementation:
    # - Recursively check all structures
    # - Verify no incomplete arrays/objects
    # - Generic validation (no content-type-specific logic)
    pass
```
### 3. `finalizeJson`
```python
@staticmethod
def finalizeJson(parsedJson: Dict[str, Any]) -> Dict[str, Any]:
    """
    GENERIC function to finalize complete JSON by adding missing closing elements and repairing corruption.
    Works for ANY JSON structure - no specific logic for content types.

    Steps (all generic):
    1. Analyze structure for missing closing elements (recursively)
    2. Add closing brackets/braces where needed
    3. Repair any remaining corruption
    4. Validate final structure

    Args:
        parsedJson: Parsed JSON object that needs finalization

    Returns:
        Finalized JSON object
    """
    # Implementation:
    # - Check for incomplete structures (generic recursive)
    # - Add missing closing elements
    # - Repair corruption using existing repair logic
    # - Return finalized structure
    pass
```
### 4. `cleanEncodingIssues`
```python
@staticmethod
def cleanEncodingIssues(jsonString: str) -> str:
    """
    GENERIC function to remove problematic encoding parts from JSON string.
    Works for ANY JSON structure - removes problematic characters/bytes.

    Args:
        jsonString: JSON string that may have encoding issues

    Returns:
        Cleaned JSON string
    """
    try:
        # Try to decode/encode to detect issues
        jsonString.encode('utf-8').decode('utf-8')
        return jsonString
    except UnicodeError:
        # Remove problematic parts
        cleaned = jsonString.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')
        logger.warning("Removed encoding issues from JSON string")
        return cleaned
```
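Because a Python `str` is already decoded, the UTF-8 round trip above fails only on characters that cannot be encoded, which in practice means lone surrogates (for example half of an emoji pair cut off at the end of a fragment). The short demonstration below illustrates that case; `stripUnencodable` is a hypothetical stand-in for the cleaning branch, not a function from the codebase.

```python
def stripUnencodable(text: str) -> str:
    """Drop characters that cannot be encoded as UTF-8 (lone surrogates)."""
    return text.encode("utf-8", errors="ignore").decode("utf-8")


# A fragment that ends with a high surrogate whose low half was cut off.
broken = '{"title": "caf\u00e9 \ud83d'

try:
    broken.encode("utf-8")
    raisedError = False
except UnicodeEncodeError:
    # This is the condition that triggers the cleaning branch.
    raisedError = True

cleaned = stripUnencodable(broken)
```

Ordinary non-ASCII text such as `é` passes through unchanged; only the unpaired surrogate is removed.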
### 5. `extractKpiFromResponse`
```python
@staticmethod
def extractKpiFromResponse(aiResponse: str) -> Optional[int]:
    """
    Extract KPI percentage from AI response.

    AI is asked: "Based on the delivered data so far, approximately what percentage (%)
    of the total required content has been delivered? Respond with an integer between 0-100."

    Args:
        aiResponse: AI response string that may contain percentage

    Returns:
        Integer percentage (0-100) or None if not found
    """
    # Implementation:
    # - Look for percentage pattern in response (e.g., "45%", "45 percent", "45")
    # - Extract integer value
    # - Validate range (0-100)
    # - Return integer or None
    pass
```
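One way to implement the steps listed in the stub is a regular-expression search followed by a range check. The sketch below is an assumption about how the patterns could be matched, not the project's code: `extractKpi` is a hypothetical name, and a bare integer is accepted only when it is the whole response.

```python
import re
from typing import Optional


def extractKpi(aiResponse: str) -> Optional[int]:
    """Return the reported percentage (0-100), or None if none is found."""
    # Prefer an explicit percentage such as "45%", "45 %" or "45 percent".
    match = re.search(r"\b(\d{1,3})\s*(?:%|percent\b)", aiResponse, re.IGNORECASE)
    if match is None:
        # ASSUMPTION: a bare number counts only if it is the entire response.
        match = re.fullmatch(r"\s*(\d{1,3})\s*", aiResponse)
    if match is None:
        return None
    value = int(match.group(1))
    # Reject values outside the range the prompt asks for.
    return value if 0 <= value <= 100 else None
```

If the percentage is searched for in the same response that carries the JSON fragment, numbers inside the JSON content could match by accident, so asking the AI to put the percentage in a clearly delimited place would make extraction more reliable.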
### 6. `validateKpiProgression`
```python
@staticmethod
def validateKpiProgression(
    accumulationState: JsonAccumulationState,
    currentKpi: int
) -> bool:
    """
    Validate KPI progression from AI response.

    Validation rules:
    - If % goes DOWN → Error (e.g., no data received, started new) → Return False
    - If % doesn't move (increment < 1%) → Error (no progress) → Return False
    - If % goes UP (increment >= 1%) → Good progress → Return True

    Args:
        accumulationState: Current accumulation state (contains lastKpi)
        currentKpi: Current KPI percentage from AI (integer 0-100)

    Returns:
        True if KPI progression is valid, False if error detected
    """
    # Implementation:
    # - Get lastKpi from accumulationState
    # - Calculate increment = currentKpi - lastKpi
    # - If increment < 0: return False (went down - error)
    # - If increment < 1: return False (no progress - error)
    # - If increment >= 1: return True (progress - good)
    pass
```
## Continuation Context for Next Prompt
### What is Delivered for Next Iteration Prompt
When accumulating JSON fragments, the system needs to provide context to the AI for the next iteration. This is handled by `buildContinuationContext()` which extracts:
1. **deliveredSummary**: Summary of all sections with counts
   - Per section: content type, item/row/line counts
   - Example: `- bullet_list with 20 items`, `- table "section_table" with 8 rows`
   - Truncated if too long (first 10 + last 10 items)
2. **cutOffElement**: The incomplete element where JSON was cut off
   - Extracted from `lastRawResponse` (raw JSON string)
   - Shows AI where generation stopped
   - Used as reference point for continuation
3. **elementBeforeCutoff**: The last complete element before the cut-off
   - Provides context of what was completed
   - Helps AI understand structure
4. **lastRawJson**: Raw JSON string from last iteration
   - Stored for reference
   - Used to detect fragments vs. full JSON structures
5. **kpiQuestion**: Question for AI to answer with percentage delivered
   - "Based on the delivered data so far, approximately what percentage (%) of the total required content has been delivered? Respond with an integer between 0-100."
   - AI must respond with integer percentage (0-100)
### Logic Flow
```
After each accumulation iteration:
1. Extract sections from accumulated JSON (even if incomplete)
2. Build continuation context:
   - Count items/rows/lines per section (for deliveredSummary)
   - Find incomplete section from allSections
   - Extract cut-off point from lastRawResponse
3. Pass context to prompt builder for next iteration
4. AI uses context to continue from cut-off point
```
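The counting step for `deliveredSummary` could look roughly like the sketch below. The section schema is not specified in this document, so the keys `contentType`, `id`, `items`, `rows` and `lines` are assumptions made purely for illustration, as is the function name `buildDeliveredSummary`. The truncation rule is read here as keeping the first 10 and last 10 summary entries.

```python
from typing import Any, Dict, List


def buildDeliveredSummary(sections: List[Dict[str, Any]], keep: int = 10) -> str:
    """Summarize delivered sections as one line per section with its count."""
    lines: List[str] = []
    for section in sections:
        # ASSUMPTION: "contentType" and "id" are hypothetical section keys.
        contentType = section.get("contentType", "unknown")
        sectionId = section.get("id")
        label = f'{contentType} "{sectionId}"' if sectionId else str(contentType)
        # ASSUMPTION: countable content is a list stored under one of these keys.
        for key in ("items", "rows", "lines"):
            value = section.get(key)
            if isinstance(value, list):
                lines.append(f"- {label} with {len(value)} {key}")
                break
        else:
            # No countable content found: list the section without a count.
            lines.append(f"- {label}")
    # Keep only the first and last `keep` entries when the summary is long.
    if len(lines) > 2 * keep:
        omitted = len(lines) - 2 * keep
        lines = lines[:keep] + [f"- ... {omitted} sections omitted ..."] + lines[-keep:]
    return "\n".join(lines)


exampleSections = [
    {"contentType": "bullet_list", "items": list(range(20))},
    {"contentType": "table", "id": "section_table", "rows": [[0]] * 8},
]
summary = buildDeliveredSummary(exampleSections)
```

With the two example sections, the summary reproduces the lines quoted in the list above.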
## Integration Point
### Modified `_extractSectionsFromResponse` in `mainServiceAi.py`
```python
def _extractSectionsFromResponse(
    self,
    result: str,
    iteration: int,
    debugPrefix: str,
    allSections: Optional[List[Dict[str, Any]]] = None,
    accumulationState: Optional[JsonAccumulationState] = None  # NEW: Track accumulation state
) -> Tuple[List[Dict[str, Any]], bool, Optional[Dict[str, Any]], Optional[JsonAccumulationState]]:
    """
    Extract sections from AI response, handling both valid and broken JSON.

    NEW BEHAVIOR:
    - First iteration: Check if complete, if not start accumulation
    - Subsequent iterations: Accumulate strings, parse when complete

    Returns:
        Tuple of:
        - sections: Extracted sections
        - wasJsonComplete: True if JSON is complete
        - parsedResult: Parsed JSON object
        - updatedAccumulationState: Updated accumulation state (None if not in accumulation mode)
    """
    allSections = allSections or []  # Avoid None when concatenating below
    if iteration == 1:
        # First iteration - check if complete
        try:
            extracted = extractJsonString(result)
            parsed = json.loads(extracted)
            # Check completeness
            if JsonResponseHandler.isJsonComplete(parsed):
                # Complete JSON - no accumulation needed
                sections = extractSectionsFromDocument(parsed)
                return sections, True, parsed, None  # No accumulation
        except Exception:
            # Any parse or extraction failure falls through to accumulation mode
            pass
        # Incomplete - start accumulation
        logger.info("Iteration 1: Incomplete JSON detected, starting accumulation mode")
        accumulationState = JsonAccumulationState(
            accumulatedJsonString=result,
            isAccumulationMode=True,
            lastParsedResult=None,
            allSections=[]
        )
        return [], False, None, accumulationState
    else:
        # Subsequent iterations - accumulate
        if accumulationState and accumulationState.isAccumulationMode:
            accumulated, sections, isComplete, parsedResult = \
                JsonResponseHandler.accumulateAndParseJsonFragments(
                    accumulationState.accumulatedJsonString,
                    result,
                    allSections,
                    iteration
                )
            # Update accumulation state
            accumulationState.accumulatedJsonString = accumulated
            accumulationState.lastParsedResult = parsedResult
            accumulationState.allSections = allSections + sections if sections else allSections
            accumulationState.isAccumulationMode = not isComplete
            return sections, isComplete, parsedResult, accumulationState
        else:
            # No accumulation mode - process normally (shouldn't happen)
            logger.warning(f"Iteration {iteration}: No accumulation state but iteration > 1")
            return [], False, None, None
```
### Modified Loop in `mainServiceAi.py`
```python
# In the iteration loop:
accumulationState = None  # Track accumulation state
for iteration in range(1, maxIterations + 1):
    # ... AI call ...
    # Extract sections with accumulation support
    extractedSections, wasJsonComplete, parsedResult, accumulationState = \
        self._extractSectionsFromResponse(
            result,
            iteration,
            debugPrefix,
            allSections,
            accumulationState  # Pass accumulation state object
        )
    # Update allSections for prompt context
    if extractedSections:
        allSections = JsonResponseHandler.mergeSectionsIntelligently(
            allSections,
            extractedSections,
            iteration
        )
    # Build continuation context for next prompt (if needed)
    if not wasJsonComplete and (allSections or result):
        continuationContext = buildContinuationContext(allSections, result)
        # Add KPI question for AI to answer (percentage delivered)
        continuationContext["kpiQuestion"] = "Based on the delivered data so far, approximately what percentage (%) of the total required content has been delivered? Respond with an integer between 0-100."
        # Use continuationContext in next prompt
    # Extract KPI from AI response and validate progression
    if accumulationState and accumulationState.isAccumulationMode:
        currentKpi = JsonResponseHandler.extractKpiFromResponse(result)  # Extract percentage from AI response
        if currentKpi is not None:
            if not JsonResponseHandler.validateKpiProgression(accumulationState, currentKpi):
                logger.warning(f"Iteration {iteration}: KPI validation failed, stopping accumulation")
                break
            # Store KPI in accumulation state
            accumulationState.lastKpi = currentKpi
    # Check completion
    if wasJsonComplete:
        break  # Done
```
## Key Considerations
### 1. Overlap Detection Strategy
**Question:** How to detect overlaps between accumulated string and new fragment?
**GENERIC Approach:**
- Compare end of accumulated string with start of new fragment
- Find longest matching suffix/prefix (string-based comparison)
- Remove duplicate content
- Works for ANY JSON structure (no content-type-specific logic)
### 2. Partial Section Extraction
**Question:** Should we extract sections from incomplete JSON for prompt context?
**Answer:** Yes, with generic approach:
- Extract what's available (even if incomplete) - works for ANY content type
- Use for continuation prompts (via `buildContinuationContext()`)
- Build delivered summary with counts per section (generic counting)
- Extract cut-off point from raw JSON string (generic detection)
- Keep accumulated string separate (for next append)
### 3. State Storage
**Question:** Where to store `accumulatedJsonString`?
**Answer:** Store in `JsonAccumulationState` object for traceability
- Use `JsonAccumulationState` class from `datamodelAi.py`
- Store accumulated string, mode flag, parsed result, and sections
- Better traceability and debugging
- Can be logged/persisted if needed
### 4. Completion Detection
**Question:** When is JSON considered "complete"?
**GENERIC Criteria:**
- Parses successfully without errors
- All structures are properly closed (recursive check)
- No incomplete arrays/objects
- Generic validation (no content-type-specific checks)
### 5. Error Handling
**Scenarios:**
- Repair fails → Continue accumulation (don't stop)
- Parsing fails after accumulation → Try repair, continue if repair succeeds
- Merge fails → Log error, continue with best available data
## Implementation Steps
1. **Add state class** in `datamodelAi.py`:
   - `JsonAccumulationState` (camelStyle naming)
2. **Create helper functions** in `subJsonResponseHandling.py`:
   - `mergeJsonStringsWithOverlap()` (generic, camelStyle)
   - `isJsonComplete()` (generic, camelStyle)
   - `finalizeJson()` (generic, camelStyle)
3. **Create main function** in `subJsonResponseHandling.py`:
   - `accumulateAndParseJsonFragments()` (generic, camelStyle)
4. **Modify `_extractSectionsFromResponse`** in `mainServiceAi.py`:
   - Add `accumulationState` parameter (JsonAccumulationState object)
   - Add first iteration check
   - Call accumulation function for subsequent iterations
   - Update accumulation state object
5. **Update iteration loop** in `mainServiceAi.py`:
   - Track `accumulationState` object (JsonAccumulationState)
   - Pass to `_extractSectionsFromResponse`
   - Build continuation context using `buildContinuationContext()`
   - Add KPI question to continuation context
   - Extract KPI from AI response and validate progression
   - Handle return values
6. **Create test file**:
   - Test string accumulation with overlaps
   - Test completion detection
   - Test partial section extraction
   - Test continuation context building
## Testing Strategy
### Test Cases
1. **Complete JSON on first iteration:**
   - Should NOT enter accumulation mode
   - Should extract sections directly
2. **Incomplete JSON on first iteration:**
   - Should enter accumulation mode
   - Should store string for next iteration
3. **Fragment with exact continuation:**
   - Should concatenate without duplicates
   - Should parse successfully
4. **Fragment with overlap:**
   - Should detect and remove overlap
   - Should concatenate correctly
5. **Fragment with full overlap:**
   - Should handle duplicate content
   - Should not add duplicates
6. **Multiple iterations:**
   - Should accumulate across all iterations
   - Should extract partial sections for prompts
   - Should complete when JSON is valid
## Open Questions - Answers
### 1. How to handle very large accumulated strings? (Memory concerns)
**Answer:** No memory problems expected
- System handles files up to ~1GB
- String accumulation is acceptable for this size
- No special memory management needed
### 2. Should we limit accumulation attempts? (Prevent infinite loops)
**Answer:** Yes, use KPI-based stopping
- Add a generic KPI question to the iteration prompt asking what percentage of the required content has been delivered
- KPI is AI-provided (not calculated by system): the AI answers the percentage question
- Stop if the percentage goes down or does not increase by at least 1 percentage point compared with the previous iteration
- Simple integer comparison for validation (no fuzzy AI calculation)
- The iteration loop is additionally bounded by `maxIterations`
**KPI Question for Iteration Prompt:**
```
=== PROGRESS INDICATOR ===
Based on the delivered data so far, approximately what percentage (%) of the total
required content has been delivered?
Respond with an integer between 0-100.
⚠️ IMPORTANT:
- If percentage goes DOWN in next iteration → Generation will stop (error detected)
- If percentage doesn't increase by at least 1% → Generation will stop (no progress)
- Only continue if percentage increases by 1% or more
```
**KPI Validation Logic:**
```python
def validateKpiProgression(
    accumulationState: JsonAccumulationState,
    currentKpi: int
) -> bool:
    """
    Validate KPI progression from AI response.

    Validation rules:
    - If % goes DOWN → Error (e.g., no data received, started new) → Return False
    - If % doesn't move (increment < 1%) → Error (no progress) → Return False
    - If % goes UP (increment >= 1%) → Good progress → Return True

    Args:
        accumulationState: Current accumulation state (contains lastKpi)
        currentKpi: Current KPI percentage from AI (integer 0-100)

    Returns:
        True if KPI progression is valid, False if error detected
    """
    lastKpi = accumulationState.lastKpi if accumulationState.lastKpi else 0
    increment = currentKpi - lastKpi
    if increment < 0:
        return False  # Went down - error
    if increment < 1:
        return False  # No progress - error
    return True  # Progress - good
```
### 3. How to handle encoding issues in string concatenation?
**Answer:** Remove problematic parts
- Detect encoding errors during concatenation
- Remove problematic characters/bytes
- Continue with cleaned string
- Acceptable to lose some data rather than fail completely
**Implementation:**
```python
def cleanEncodingIssues(jsonString: str) -> str:
    """
    Remove problematic encoding parts from JSON string.

    Generic approach:
    - Detect encoding errors
    - Remove problematic characters/bytes
    - Return cleaned string
    """
    try:
        # Try to decode/encode to detect issues
        jsonString.encode('utf-8').decode('utf-8')
        return jsonString
    except UnicodeError:
        # Remove problematic parts
        cleaned = jsonString.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')
        logger.warning("Removed encoding issues from JSON string")
        return cleaned
```
### 4. Should overlap detection be configurable? (Performance vs. accuracy)
**Answer:** No, automated mode only
- AI calls take 30-180 seconds (plenty of time for overlap detection)
- No performance concerns
- Always use automated overlap detection
- No configuration needed