# JSON String Accumulation Concept for Iterative AI Generation

## Problem Statement

Currently, the AI service processes each iteration's JSON string independently, then merges the parsed objects. However, the real-world behavior is:

1. AI delivers a **STRING** containing JSON (not a parsed JSON object)
2. First iteration: AI delivers a JSON string that's cut off somewhere (broken/incomplete)
3. Subsequent iterations: AI delivers MORE JSON string fragments that need to be **APPENDED** to the previous JSON string
4. Challenge: How to handle incomplete JSON strings and merge them correctly

## Core Principle

- **If iteration 1 returns complete, valid JSON** → Use it directly (no accumulation needed)
- **If iteration 1 returns incomplete/broken JSON** → Enter accumulation mode

## State Management

State class is defined in `datamodelAi.py`:

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel


class JsonAccumulationState(BaseModel):
    accumulatedJsonString: str  # Raw accumulated JSON string
    isAccumulationMode: bool  # True if we're accumulating fragments
    lastParsedResult: Optional[Dict[str, Any]]  # Last successfully parsed result (for prompt context)
    allSections: List[Dict[str, Any]]  # Sections extracted so far (for prompt context)
    lastKpi: Optional[int] = None  # Last KPI percentage reported by the AI (read by validateKpiProgression)
```

## Flow Logic

### Phase 1: First Iteration Check

```
1. Receive JSON string from AI
2. Try to parse:
   - SUCCESS + Complete → Extract sections → DONE (no accumulation)
   - FAILURE or INCOMPLETE → Enter accumulation mode
```

### Phase 2: Accumulation Mode (if needed)

```
For each iteration:
  1. Receive newFragmentString
  2. Concatenate with overlap handling:
     accumulatedJsonString = mergeJsonStringsWithOverlap(
         accumulatedJsonString,
         newFragmentString
     )
  3. Try to parse accumulatedJsonString:
     - SUCCESS → Go to Phase 3 (completion)
     - FAILURE → Continue accumulation
  4. Extract partial sections (for prompt context):
     - Use repairBrokenJson() to get best partial structure
     - Extract sections from partial structure
     - Update allSections (for next prompt)
  5. Build continuation context for next prompt:
     - Extract delivered_summary: Count of items/rows/lines per section
     - Extract cut_off_element: Incomplete element where JSON was cut off
     - Extract element_before_cutoff: Last complete element before cut-off
     - Store last_raw_json: Raw JSON string for reference
  6. Keep accumulatedJsonString for next iteration
```

### Phase 3: Completion (when parsing succeeds)

```
1. Analyze completeness:
   - Check if all structures are closed
   - Identify missing closing elements
2. Add closing elements if needed:
   - Close unclosed arrays/objects
   - Ensure proper JSON structure
3. Repair if corrupted:
   - Fix any remaining corruption
4. Extract final sections:
   - extractSectionsFromDocument()
5. DONE
```

## Function Design

### Main Function: `accumulateAndParseJsonFragments`

```python
@staticmethod
def accumulateAndParseJsonFragments(
    accumulatedJsonString: str,
    newFragmentString: str,
    allSections: List[Dict[str, Any]],
    iteration: int
) -> Tuple[str, List[Dict[str, Any]], bool, Optional[Dict[str, Any]]]:
    """
    Accumulate JSON fragments and parse when complete.

    GENERIC function that handles:
    1. Concatenating JSON strings with overlap detection
    2. Parsing the accumulated string
    3. Extracting sections (partial if incomplete, final if complete)
    4. Determining completion status

    Args:
        accumulatedJsonString: Previously accumulated JSON string
        newFragmentString: New fragment string from current iteration
        allSections: Sections extracted so far (for prompt context)
        iteration: Current iteration number

    Returns:
        Tuple of:
        - accumulatedJsonString: Updated accumulated string
        - sections: Extracted sections (partial if incomplete, final if complete)
        - isComplete: True if JSON is complete and valid
        - parsedResult: Parsed JSON object (if parsing succeeded)
    """
    # Step 1: Clean encoding issues from accumulated string (check end of first delivered part)
    cleanedAccumulated = cleanEncodingIssues(accumulatedJsonString)

    # Step 2: Clean encoding issues from new fragment
    cleanedFragment = cleanEncodingIssues(newFragmentString)

    # Step 3: Concatenate with overlap handling
    combinedString = mergeJsonStringsWithOverlap(
        cleanedAccumulated,
        cleanedFragment
    )

    # Step 4: Try to parse
    try:
        extracted = extractJsonString(combinedString)
        parsedResult = json.loads(extracted)

        # Step 5: Parsing succeeded - check completeness
        isComplete = isJsonComplete(parsedResult)

        if isComplete:
            # Step 6: Complete JSON - finalize
            finalizedJson = finalizeJson(parsedResult)
            sections = extractSectionsFromDocument(finalizedJson)
            return combinedString, sections, True, finalizedJson
        else:
            # Step 7: Incomplete but parseable - extract partial sections
            sections = extractSectionsFromDocument(parsedResult)
            return combinedString, sections, False, parsedResult

    except json.JSONDecodeError:
        # Step 8: Still broken - repair and extract partial sections
        repaired = repairBrokenJson(combinedString)
        if repaired:
            sections = extractSectionsFromDocument(repaired)
            return combinedString, sections, False, repaired
        else:
            # Repair failed - continue with data BEFORE merging the problematic piece
            # Return previous accumulated string (before adding new fragment)
            # This ensures we don't lose previously accumulated data
            logger.warning(f"Iteration {iteration}: Repair failed, continuing with previous accumulated data")
            return accumulatedJsonString, [], False, None
```

## Helper Functions Needed

### 1. `mergeJsonStringsWithOverlap`

```python
@staticmethod
def mergeJsonStringsWithOverlap(
    accumulated: str,
    newFragment: str
) -> str:
    """
    GENERIC function to merge two JSON strings, handling overlaps intelligently.
    Works for ANY JSON structure - no specific logic for content types.

    Overlap scenarios (all handled generically):
    - Exact continuation: newFragment starts exactly where accumulated ends
    - Partial overlap: newFragment overlaps with end of accumulated
    - Full overlap: newFragment is subset of accumulated

    Strategy:
    1. Find longest common suffix/prefix match (string-based comparison)
    2. Remove duplicate content
    3. Concatenate remaining parts

    Args:
        accumulated: Previously accumulated JSON string
        newFragment: New fragment string to append

    Returns:
        Combined JSON string with overlaps removed
    """
    # Implementation:
    # - Find longest common suffix/prefix match
    # - Remove overlapping part
    # - Concatenate: accumulated + newFragment[overlapEnd:]
    pass
```

### 2. `isJsonComplete`

```python
@staticmethod
def isJsonComplete(parsedJson: Dict[str, Any]) -> bool:
    """
    GENERIC function to check if parsed JSON structure is complete.
    Works for ANY JSON structure - no specific logic for content types.

    Completeness checks (all generic):
    - All arrays are properly closed
    - All objects are properly closed
    - No incomplete structures
    - Recursive validation of nested structures

    Args:
        parsedJson: Parsed JSON object

    Returns:
        True if JSON is complete, False otherwise
    """
    # Implementation:
    # - Recursively check all structures
    # - Verify no incomplete arrays/objects
    # - Generic validation (no content-type-specific logic)
    pass
```

### 3. `finalizeJson`

```python
@staticmethod
def finalizeJson(parsedJson: Dict[str, Any]) -> Dict[str, Any]:
    """
    GENERIC function to finalize complete JSON by adding missing closing elements
    and repairing corruption.
    Works for ANY JSON structure - no specific logic for content types.

    Steps (all generic):
    1. Analyze structure for missing closing elements (recursively)
    2. Add closing brackets/braces where needed
    3. Repair any remaining corruption
    4. Validate final structure

    Args:
        parsedJson: Parsed JSON object that needs finalization

    Returns:
        Finalized JSON object
    """
    # Implementation:
    # - Check for incomplete structures (generic recursive)
    # - Add missing closing elements
    # - Repair corruption using existing repair logic
    # - Return finalized structure
    pass
```

### 4. `cleanEncodingIssues`

```python
@staticmethod
def cleanEncodingIssues(jsonString: str) -> str:
    """
    GENERIC function to remove problematic encoding parts from JSON string.
    Works for ANY JSON structure - removes problematic characters/bytes.

    Args:
        jsonString: JSON string that may have encoding issues

    Returns:
        Cleaned JSON string
    """
    try:
        # Try to encode/decode to detect issues
        jsonString.encode('utf-8').decode('utf-8')
        return jsonString
    except UnicodeError:
        # Remove problematic parts
        cleaned = jsonString.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')
        logger.warning("Removed encoding issues from JSON string")
        return cleaned
```

### 5. `extractKpiFromResponse`

```python
@staticmethod
def extractKpiFromResponse(aiResponse: str) -> Optional[int]:
    """
    Extract KPI percentage from AI response.

    AI is asked: "Based on the delivered data so far, approximately what percentage (%)
    of the total required content has been delivered? Respond with an integer between 0-100."

    Args:
        aiResponse: AI response string that may contain percentage

    Returns:
        Integer percentage (0-100) or None if not found
    """
    # Implementation:
    # - Look for percentage pattern in response (e.g., "45%", "45 percent", "45")
    # - Extract integer value
    # - Validate range (0-100)
    # - Return integer or None
    pass
```

### 6. `validateKpiProgression`

```python
@staticmethod
def validateKpiProgression(
    accumulationState: JsonAccumulationState,
    currentKpi: int
) -> bool:
    """
    Validate KPI progression from AI response.

    Validation rules:
    - If % goes DOWN → Error (e.g., no data received, started new) → Return False
    - If % doesn't move (increment < 1%) → Error (no progress) → Return False
    - If % goes UP (increment >= 1%) → Good progress → Return True

    Args:
        accumulationState: Current accumulation state (contains lastKpi)
        currentKpi: Current KPI percentage from AI (integer 0-100)

    Returns:
        True if KPI progression is valid, False if error detected
    """
    # Implementation:
    # - Get lastKpi from accumulationState
    # - Calculate increment = currentKpi - lastKpi
    # - If increment < 0: return False (went down - error)
    # - If increment < 1: return False (no progress - error)
    # - If increment >= 1: return True (progress - good)
    pass
```

## Continuation Context for Next Prompt

### What is Delivered for Next Iteration Prompt

When accumulating JSON fragments, the system needs to provide context to the AI for the next iteration. This is handled by `buildContinuationContext()`, which extracts:

1. **deliveredSummary**: Summary of all sections with counts
   - Per section: content type, item/row/line counts
   - Example: `- bullet_list with 20 items`, `- table "section_table" with 8 rows`
   - Truncated if too long (first 10 + last 10 items)
2. **cutOffElement**: The incomplete element where JSON was cut off
   - Extracted from `lastRawResponse` (raw JSON string)
   - Shows AI where generation stopped
   - Used as reference point for continuation
3. **elementBeforeCutoff**: The last complete element before the cut-off
   - Provides context of what was completed
   - Helps AI understand structure
4. **lastRawJson**: Raw JSON string from last iteration
   - Stored for reference
   - Used to detect fragments vs. full JSON structures
5. **kpiQuestion**: Question for AI to answer with percentage delivered
   - "Based on the delivered data so far, approximately what percentage (%) of the total required content has been delivered? Respond with an integer between 0-100."
   - AI must respond with integer percentage (0-100)

### Logic Flow

```
After each accumulation iteration:
1. Extract sections from accumulated JSON (even if incomplete)
2. Build continuation context:
   - Count items/rows/lines per section (for deliveredSummary)
   - Find incomplete section from allSections
   - Extract cut-off point from lastRawResponse
3. Pass context to prompt builder for next iteration
4. AI uses context to continue from cut-off point
```

## Integration Point

### Modified `_extractSectionsFromResponse` in `mainServiceAi.py`

```python
def _extractSectionsFromResponse(
    result: str,
    iteration: int,
    debugPrefix: str,
    allSections: Optional[List[Dict[str, Any]]] = None,
    accumulationState: Optional[JsonAccumulationState] = None  # NEW: Track accumulation state
) -> Tuple[List[Dict[str, Any]], bool, Optional[Dict[str, Any]], Optional[JsonAccumulationState]]:
    """
    Extract sections from AI response, handling both valid and broken JSON.

    NEW BEHAVIOR:
    - First iteration: Check if complete, if not start accumulation
    - Subsequent iterations: Accumulate strings, parse when complete

    Returns:
        Tuple of:
        - sections: Extracted sections
        - wasJsonComplete: True if JSON is complete
        - parsedResult: Parsed JSON object
        - updatedAccumulationState: Updated accumulation state (None if not in accumulation mode)
    """
    if iteration == 1:
        # First iteration - check if complete
        try:
            extracted = extractJsonString(result)
            parsed = json.loads(extracted)

            # Check completeness
            if JsonResponseHandler.isJsonComplete(parsed):
                # Complete JSON - no accumulation needed
                sections = extractSectionsFromDocument(parsed)
                return sections, True, parsed, None  # No accumulation
        except Exception:
            # Extraction or parsing failed - fall through to accumulation mode
            pass

        # Incomplete - start accumulation
        logger.info("Iteration 1: Incomplete JSON detected, starting accumulation mode")
        accumulationState = JsonAccumulationState(
            accumulatedJsonString=result,
            isAccumulationMode=True,
            lastParsedResult=None,
            allSections=[]
        )
        return [], False, None, accumulationState

    else:
        # Subsequent iterations - accumulate
        if accumulationState and accumulationState.isAccumulationMode:
            accumulated, sections, isComplete, parsedResult = \
                JsonResponseHandler.accumulateAndParseJsonFragments(
                    accumulationState.accumulatedJsonString,
                    result,
                    allSections,
                    iteration
                )

            # Update accumulation state (allSections may be None on the first call)
            accumulationState.accumulatedJsonString = accumulated
            accumulationState.lastParsedResult = parsedResult
            accumulationState.allSections = (allSections or []) + sections
            accumulationState.isAccumulationMode = not isComplete

            return sections, isComplete, parsedResult, accumulationState
        else:
            # No accumulation mode - process normally (shouldn't happen)
            logger.warning(f"Iteration {iteration}: No accumulation state but iteration > 1")
            return [], False, None, None
```

### Modified Loop in `mainServiceAi.py`

```python
# In the iteration loop:
accumulationState = None  # Track accumulation state

for iteration in range(1, maxIterations + 1):
    # ... AI call ...

    # Extract sections with accumulation support
    extractedSections, wasJsonComplete, parsedResult, accumulationState = \
        self._extractSectionsFromResponse(
            result,
            iteration,
            debugPrefix,
            allSections,
            accumulationState  # Pass accumulation state object
        )

    # Update allSections for prompt context
    if extractedSections:
        allSections = JsonResponseHandler.mergeSectionsIntelligently(
            allSections,
            extractedSections,
            iteration
        )

    # Build continuation context for next prompt (if needed)
    if not wasJsonComplete and (allSections or result):
        continuationContext = buildContinuationContext(allSections, result)
        # Add KPI question for AI to answer (percentage delivered)
        continuationContext["kpiQuestion"] = (
            "Based on the delivered data so far, approximately what percentage (%) "
            "of the total required content has been delivered? "
            "Respond with an integer between 0-100."
        )
        # Use continuationContext in next prompt

    # Extract KPI from AI response and validate progression
    if accumulationState and accumulationState.isAccumulationMode:
        currentKpi = JsonResponseHandler.extractKpiFromResponse(result)  # Extract percentage from AI response
        if currentKpi is not None:
            if not JsonResponseHandler.validateKpiProgression(accumulationState, currentKpi):
                logger.warning(f"Iteration {iteration}: KPI validation failed, stopping accumulation")
                break
            # Store KPI in accumulation state
            accumulationState.lastKpi = currentKpi

    # Check completion
    if wasJsonComplete:
        break  # Done
```

## Key Considerations

### 1. Overlap Detection Strategy

**Question:** How to detect overlaps between accumulated string and new fragment?

**GENERIC Approach:**
- Compare end of accumulated string with start of new fragment
- Find longest matching suffix/prefix (string-based comparison)
- Remove duplicate content
- Works for ANY JSON structure (no content-type-specific logic)

### 2. Partial Section Extraction

**Question:** Should we extract sections from incomplete JSON for prompt context?

**Answer:** Yes, with a generic approach:
- Extract what's available (even if incomplete) - works for ANY content type
- Use for continuation prompts (via `buildContinuationContext()`)
- Build delivered summary with counts per section (generic counting)
- Extract cut-off point from raw JSON string (generic detection)
- Keep accumulated string separate (for next append)

### 3. State Storage

**Question:** Where to store `accumulatedJsonString`?

**Answer:** Store in `JsonAccumulationState` object for traceability
- Use `JsonAccumulationState` class from `datamodelAi.py`
- Store accumulated string, mode flag, parsed result, and sections
- Better traceability and debugging
- Can be logged/persisted if needed

### 4. Completion Detection

**Question:** When is JSON considered "complete"?

**GENERIC Criteria:**
- Parses successfully without errors
- All structures are properly closed (recursive check)
- No incomplete arrays/objects
- Generic validation (no content-type-specific checks)

### 5. Error Handling

**Scenarios:**
- Repair fails → Continue accumulation (don't stop)
- Parsing fails after accumulation → Try repair, continue if repair succeeds
- Merge fails → Log error, continue with best available data

## Implementation Steps

1. **Add state class** in `datamodelAi.py`:
   - `JsonAccumulationState` (camelCase naming)
2. **Create helper functions** in `subJsonResponseHandling.py`:
   - `mergeJsonStringsWithOverlap()` (generic, camelCase)
   - `isJsonComplete()` (generic, camelCase)
   - `finalizeJson()` (generic, camelCase)
   - `cleanEncodingIssues()` (generic, camelCase)
   - `extractKpiFromResponse()` and `validateKpiProgression()` (KPI handling, camelCase)
3. **Create main function** in `subJsonResponseHandling.py`:
   - `accumulateAndParseJsonFragments()` (generic, camelCase)
4. **Modify `_extractSectionsFromResponse`** in `mainServiceAi.py`:
   - Add `accumulationState` parameter (JsonAccumulationState object)
   - Add first iteration check
   - Call accumulation function for subsequent iterations
   - Update accumulation state object
5. **Update iteration loop** in `mainServiceAi.py`:
   - Track `accumulationState` object (JsonAccumulationState)
   - Pass to `_extractSectionsFromResponse`
   - Build continuation context using `buildContinuationContext()`
   - Add KPI question to continuation context
   - Extract KPI from AI response and validate progression
   - Handle return values
6. **Create test file**:
   - Test string accumulation with overlaps
   - Test completion detection
   - Test partial section extraction
   - Test continuation context building

## Testing Strategy

### Test Cases

1. **Complete JSON on first iteration:**
   - Should NOT enter accumulation mode
   - Should extract sections directly
2. **Incomplete JSON on first iteration:**
   - Should enter accumulation mode
   - Should store string for next iteration
3. **Fragment with exact continuation:**
   - Should concatenate without duplicates
   - Should parse successfully
4. **Fragment with overlap:**
   - Should detect and remove overlap
   - Should concatenate correctly
5. **Fragment with full overlap:**
   - Should handle duplicate content
   - Should not add duplicates
6. **Multiple iterations:**
   - Should accumulate across all iterations
   - Should extract partial sections for prompts
   - Should complete when JSON is valid

## Open Questions - Answers

### 1. How to handle very large accumulated strings? (Memory concerns)

**Answer:** No memory problems expected
- System handles files up to ~1GB
- String accumulation is acceptable for this size
- No special memory management needed

### 2. Should we limit accumulation attempts? (Prevent infinite loops)

**Answer:** Yes, use KPI-based stopping
- Add generic KPI to iteration prompt showing remaining elements needed
- KPI calculation: Compare expected vs. delivered counts per section type
- Stop if KPI doesn't decrease in 3 consecutive iterations
- KPI is AI-provided (not calculated by system) - AI answers percentage question
- Simple integer comparison for validation (no fuzzy AI calculation)

**KPI Question for Iteration Prompt:**

```
=== PROGRESS INDICATOR ===
Based on the delivered data so far, approximately what percentage (%) of the total required content has been delivered?
Respond with an integer between 0-100.

⚠️ IMPORTANT:
- If percentage goes DOWN in next iteration → Generation will stop (error detected)
- If percentage doesn't increase by at least 1% → Generation will stop (no progress)
- Only continue if percentage increases by 1% or more
```

**KPI Validation Logic:**

```python
def validateKpiProgression(
    accumulationState: JsonAccumulationState,
    currentKpi: int
) -> bool:
    """
    Validate KPI progression from AI response.

    Validation rules:
    - If % goes DOWN → Error (e.g., no data received, started new) → Return False
    - If % doesn't move (increment < 1%) → Error (no progress) → Return False
    - If % goes UP (increment >= 1%) → Good progress → Return True

    Args:
        accumulationState: Current accumulation state (contains lastKpi)
        currentKpi: Current KPI percentage from AI (integer 0-100)

    Returns:
        True if KPI progression is valid, False if error detected
    """
    # Treat a missing lastKpi (first KPI received) as 0
    lastKpi = accumulationState.lastKpi if accumulationState.lastKpi else 0
    increment = currentKpi - lastKpi

    if increment < 0:
        return False  # Went down - error
    if increment < 1:
        return False  # No progress - error
    return True  # Progress - good
```

### 3. How to handle encoding issues in string concatenation?

**Answer:** Remove problematic parts
- Detect encoding errors during concatenation
- Remove problematic characters/bytes
- Continue with cleaned string
- Acceptable to lose some data rather than fail completely

**Implementation:**

```python
def cleanEncodingIssues(jsonString: str) -> str:
    """
    Remove problematic encoding parts from JSON string.

    Generic approach:
    - Detect encoding errors
    - Remove problematic characters/bytes
    - Return cleaned string
    """
    try:
        # Try to encode/decode to detect issues
        jsonString.encode('utf-8').decode('utf-8')
        return jsonString
    except UnicodeError:
        # Remove problematic parts
        cleaned = jsonString.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')
        logger.warning("Removed encoding issues from JSON string")
        return cleaned
```

### 4. Should overlap detection be configurable? (Performance vs. accuracy)

**Answer:** No, automated mode only
- AI calls take 30-180 seconds (plenty of time for overlap detection)
- No performance concerns
- Always use automated overlap detection
- No configuration needed
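
To make the automated overlap detection concrete, the following is a minimal sketch of how `mergeJsonStringsWithOverlap` could find the longest suffix/prefix match using plain string comparison. It is an illustration under assumptions, not the final implementation: the `minOverlap` parameter is hypothetical (it is not part of the signature defined in this concept), and a full overlap is only recognized when the fragment repeats the end of the accumulated string.

```python
import json


def mergeJsonStringsWithOverlap(accumulated: str, newFragment: str, minOverlap: int = 1) -> str:
    """Sketch: append newFragment to accumulated, dropping the duplicated overlap.

    minOverlap is a hypothetical parameter (assumption, not in the concept):
    matches shorter than this are treated as coincidence and ignored.
    """
    # Nothing to compare if either side is empty
    if not accumulated or not newFragment:
        return accumulated + newFragment

    # The overlap cannot be longer than the shorter of the two strings
    maxOverlap = min(len(accumulated), len(newFragment))

    # Try the longest candidate first so the longest suffix/prefix match wins
    for size in range(maxOverlap, minOverlap - 1, -1):
        if accumulated.endswith(newFragment[:size]):
            # Drop the duplicated prefix of the fragment and append the rest
            return accumulated + newFragment[size:]

    # No overlap found: exact continuation, plain concatenation
    return accumulated + newFragment


if __name__ == "__main__":
    first = '{"items": ["a", "b", "c'
    second = '"b", "c", "d"]}'  # repeats the tail of the first delivery
    merged = mergeJsonStringsWithOverlap(first, second)
    print(merged)  # {"items": ["a", "b", "c", "d"]}
    print(json.loads(merged)["items"])
```

The scan is quadratic in the worst case, which fits the statement above that overlap detection time is negligible next to a 30-180 second AI call. One known hazard of purely string-based matching is a short accidental match (for example a single quote character at the boundary); raising the hypothetical `minOverlap` threshold trades missed short overlaps for fewer false positives.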