wiki/appdoc/json_string_accumulation_concept.md
2025-12-03 23:02:58 +01:00

JSON String Accumulation Concept for Iterative AI Generation

Problem Statement

Currently, the AI service parses each iteration's JSON string independently and then merges the parsed objects. The AI's actual behavior is different:

  1. AI delivers a STRING containing JSON (not a parsed JSON object)
  2. First iteration: AI delivers a JSON string that's cut off somewhere (broken/incomplete)
  3. Subsequent iterations: AI delivers MORE JSON string fragments that need to be APPENDED to the previous JSON string
  4. Challenge: How to handle incomplete JSON strings and merge them correctly

Core Principle

  • If iteration 1 returns complete, valid JSON → Use it directly (no accumulation needed)
  • If iteration 1 returns incomplete/broken JSON → Enter accumulation mode

State Management

State class is defined in datamodelAi.py:

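from typing import Any, Dict, List, Optional
from pydantic import BaseModel  # Assumption: BaseModel is pydantic's

class JsonAccumulationState(BaseModel):
    accumulatedJsonString: str  # Raw accumulated JSON string
    isAccumulationMode: bool   # True if we're accumulating fragments
    lastParsedResult: Optional[Dict[str, Any]]  # Last successfully parsed result (for prompt context)
    allSections: List[Dict[str, Any]]  # Sections extracted so far (for prompt context)
    lastKpi: Optional[int] = None  # Last KPI percentage from the AI (read by validateKpiProgression)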

Flow Logic

Phase 1: First Iteration Check

1. Receive JSON string from AI
2. Try to parse:
   - SUCCESS + Complete → Extract sections → DONE (no accumulation)
   - FAILURE or INCOMPLETE → Enter accumulation mode

Phase 2: Accumulation Mode (if needed)

For each iteration:
  1. Receive newFragmentString
  2. Concatenate with overlap handling:
     accumulatedJsonString = mergeJsonStringsWithOverlap(
         accumulatedJsonString, 
         newFragmentString
     )
  3. Try to parse accumulatedJsonString:
     - SUCCESS → Go to Phase 3 (completion)
     - FAILURE → Continue accumulation
  4. Extract partial sections (for prompt context):
     - Use repairBrokenJson() to get best partial structure
     - Extract sections from partial structure
     - Update allSections (for next prompt)
  5. Build continuation context for next prompt:
     - Extract deliveredSummary: Count of items/rows/lines per section
     - Extract cutOffElement: Incomplete element where JSON was cut off
     - Extract elementBeforeCutoff: Last complete element before cut-off
     - Store lastRawJson: Raw JSON string for reference
  6. Keep accumulatedJsonString for next iteration

Phase 3: Completion (when parsing succeeds)

1. Analyze completeness:
   - Check if all structures are closed
   - Identify missing closing elements
2. Add closing elements if needed:
   - Close unclosed arrays/objects
   - Ensure proper JSON structure
3. Repair if corrupted:
   - Fix any remaining corruption
4. Extract final sections:
   - extractSectionsFromDocument()
5. DONE

Function Design

Main Function: accumulateAndParseJsonFragments

@staticmethod
def accumulateAndParseJsonFragments(
    accumulatedJsonString: str,
    newFragmentString: str,
    allSections: List[Dict[str, Any]],
    iteration: int
) -> Tuple[str, List[Dict[str, Any]], bool, Optional[Dict[str, Any]]]:
    """
    Accumulate JSON fragments and parse when complete.
    
    GENERIC function that handles:
    1. Concatenating JSON strings with overlap detection
    2. Parsing the accumulated string
    3. Extracting sections (partial if incomplete, final if complete)
    4. Determining completion status
    
    Args:
        accumulatedJsonString: Previously accumulated JSON string
        newFragmentString: New fragment string from current iteration
        allSections: Sections extracted so far (for prompt context)
        iteration: Current iteration number
    
    Returns:
        Tuple of:
        - accumulatedJsonString: Updated accumulated string
        - sections: Extracted sections (partial if incomplete, final if complete)
        - isComplete: True if JSON is complete and valid
        - parsedResult: Parsed JSON object (if parsing succeeded)
    """
    
    # Step 1: Clean encoding issues from accumulated string (check end of first delivered part)
    cleanedAccumulated = cleanEncodingIssues(accumulatedJsonString)
    
    # Step 2: Clean encoding issues from new fragment
    cleanedFragment = cleanEncodingIssues(newFragmentString)
    
    # Step 3: Concatenate with overlap handling
    combinedString = mergeJsonStringsWithOverlap(
        cleanedAccumulated, 
        cleanedFragment
    )
    
    # Step 4: Try to parse
    try:
        extracted = extractJsonString(combinedString)
        parsedResult = json.loads(extracted)
        
        # Step 5: Parsing succeeded - check completeness
        isComplete = isJsonComplete(parsedResult)
        
        if isComplete:
            # Step 6: Complete JSON - finalize
            finalizedJson = finalizeJson(parsedResult)
            sections = extractSectionsFromDocument(finalizedJson)
            return combinedString, sections, True, finalizedJson
        else:
            # Step 7: Incomplete but parseable - extract partial sections
            sections = extractSectionsFromDocument(parsedResult)
            return combinedString, sections, False, parsedResult
            
    except json.JSONDecodeError:
        # Step 8: Still broken - repair and extract partial sections
        repaired = repairBrokenJson(combinedString)
        if repaired:
            sections = extractSectionsFromDocument(repaired)
            return combinedString, sections, False, repaired
        else:
            # Repair failed - continue with data BEFORE merging the problematic piece
            # Return previous accumulated string (before adding new fragment)
            # This ensures we don't lose previously accumulated data
            logger.warning(f"Iteration {iteration}: Repair failed, continuing with previous accumulated data")
            return accumulatedJsonString, [], False, None

Helper Functions Needed

1. mergeJsonStringsWithOverlap

@staticmethod
def mergeJsonStringsWithOverlap(
    accumulated: str,
    newFragment: str
) -> str:
    """
    GENERIC function to merge two JSON strings, handling overlaps intelligently.
    
    Works for ANY JSON structure - no specific logic for content types.
    
    Overlap scenarios (all handled generically):
    - Exact continuation: newFragment starts exactly where accumulated ends
    - Partial overlap: newFragment overlaps with end of accumulated
    - Full overlap: newFragment is subset of accumulated
    
    Strategy:
    1. Find longest common suffix/prefix match (string-based comparison)
    2. Remove duplicate content
    3. Concatenate remaining parts
    
    Args:
        accumulated: Previously accumulated JSON string
        newFragment: New fragment string to append
    
    Returns:
        Combined JSON string with overlaps removed
    """
    # Implementation:
    # - Find longest common suffix/prefix match
    # - Remove overlapping part
    # - Concatenate: accumulated + newFragment[overlapEnd:]
    pass
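The merge strategy described in the docstring could be sketched as follows. This is a minimal illustration, not the project's implementation: it treats "full overlap" as the case where the fragment repeats the tail of the accumulated string, and the `minOverlap` parameter is a hypothetical addition that is not part of the signature above.

```python
def mergeJsonStringsWithOverlap(
    accumulated: str,
    newFragment: str,
    minOverlap: int = 1,  # hypothetical knob, not in the original signature
) -> str:
    """Append newFragment to accumulated, dropping a duplicated overlap."""
    if not accumulated:
        return newFragment
    if not newFragment:
        return accumulated

    # Longest suffix of `accumulated` that is also a prefix of `newFragment`.
    # When the whole fragment matches, nothing new is appended (full overlap).
    maxOverlap = min(len(accumulated), len(newFragment))
    for size in range(maxOverlap, minOverlap - 1, -1):
        if accumulated.endswith(newFragment[:size]):
            return accumulated + newFragment[size:]

    # No common part: exact continuation, plain concatenation.
    return accumulated + newFragment
```

A purely string-based match cannot tell a real overlap from a coincidence: `[1, 1` followed by `1, 2]` could mean `[1, 1, 2]` or `[1, 11, 2]`. Raising `minOverlap` trades missed short overlaps for fewer accidental ones.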

2. isJsonComplete

@staticmethod
def isJsonComplete(parsedJson: Dict[str, Any]) -> bool:
    """
    GENERIC function to check if parsed JSON structure is complete.
    
    Works for ANY JSON structure - no specific logic for content types.
    
    Completeness checks (all generic):
    - All arrays are properly closed
    - All objects are properly closed
    - No incomplete structures
    - Recursive validation of nested structures
    
    Args:
        parsedJson: Parsed JSON object
    
    Returns:
        True if JSON is complete, False otherwise
    """
    # Implementation:
    # - Recursively check all structures
    # - Verify no incomplete arrays/objects
    # - Generic validation (no content-type-specific logic)
    pass

3. finalizeJson

@staticmethod
def finalizeJson(parsedJson: Dict[str, Any]) -> Dict[str, Any]:
    """
    GENERIC function to finalize complete JSON by adding missing closing elements and repairing corruption.
    
    Works for ANY JSON structure - no specific logic for content types.
    
    Steps (all generic):
    1. Analyze structure for missing closing elements (recursively)
    2. Add closing brackets/braces where needed
    3. Repair any remaining corruption
    4. Validate final structure
    
    Args:
        parsedJson: Parsed JSON object that needs finalization
    
    Returns:
        Finalized JSON object
    """
    # Implementation:
    # - Check for incomplete structures (generic recursive)
    # - Add missing closing elements
    # - Repair corruption using existing repair logic
    # - Return finalized structure
    pass
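The "add closing brackets/braces" step is easiest to picture at the string level, before parsing. The sketch below is an assumption about how that could look: `computeClosingSuffix` is a hypothetical helper, not one of the functions named in this document, and it only balances delimiters.

```python
def computeClosingSuffix(rawJson: str) -> str:
    """Return the characters needed to close all open strings/arrays/objects.

    An empty result means every structure in rawJson is already closed.
    """
    closers = []       # stack of closing characters still owed
    inString = False
    escaped = False
    for ch in rawJson:
        if inString:
            if escaped:
                escaped = False      # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                inString = False
        elif ch == '"':
            inString = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    # Close an unterminated string first, then containers innermost-first.
    return ('"' if inString else "") + "".join(reversed(closers))
```

Appending the suffix is not always enough to get parseable JSON: a fragment cut after a comma, after a key, or in the middle of an escape sequence still needs the repair step.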

4. cleanEncodingIssues

@staticmethod
def cleanEncodingIssues(jsonString: str) -> str:
    """
    GENERIC function to remove problematic encoding parts from JSON string.
    
    Works for ANY JSON structure - removes problematic characters/bytes.
    
    Args:
        jsonString: JSON string that may have encoding issues
    
    Returns:
        Cleaned JSON string
    """
    try:
        # Try to decode/encode to detect issues
        jsonString.encode('utf-8').decode('utf-8')
        return jsonString
    except UnicodeError:
        # Remove problematic parts
        cleaned = jsonString.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')
        logger.warning("Removed encoding issues from JSON string")
        return cleaned
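For a Python `str`, the `except` branch above is reached only when the string contains code points that UTF-8 cannot encode, typically lone surrogates. The following self-contained example restates the function in compact form (without logging) to show that behavior; the sample input is made up for illustration.

```python
def cleanEncodingIssues(jsonString: str) -> str:
    """Drop characters that cannot be encoded as UTF-8 (e.g. lone surrogates)."""
    try:
        jsonString.encode("utf-8")
        return jsonString
    except UnicodeError:
        return jsonString.encode("utf-8", errors="ignore").decode("utf-8")


# A lone high surrogate (\ud83d without its low half) cannot be encoded.
broken = '{"title": "caf\u00e9 \ud83d"}'
cleaned = cleanEncodingIssues(broken)
```

Valid non-ASCII characters such as `é` pass through unchanged; only the unencodable code point is removed.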

5. extractKpiFromResponse

@staticmethod
def extractKpiFromResponse(aiResponse: str) -> Optional[int]:
    """
    Extract KPI percentage from AI response.
    
    AI is asked: "Based on the delivered data so far, approximately what percentage (%) 
    of the total required content has been delivered? Respond with an integer between 0-100."
    
    Args:
        aiResponse: AI response string that may contain percentage
    
    Returns:
        Integer percentage (0-100) or None if not found
    """
    # Implementation:
    # - Look for percentage pattern in response (e.g., "45%", "45 percent", "45")
    # - Extract integer value
    # - Validate range (0-100)
    # - Return integer or None
    pass
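One possible way to implement the steps listed in the comments is a pair of regular expressions. This is a sketch under assumptions: a bare integer is accepted only when it is the entire response, because a response that also carries JSON will contain many unrelated numbers.

```python
import re
from typing import Optional

# "45%", "45 %", "45 percent" (not preceded by another digit or a decimal point)
_PERCENT_PATTERN = re.compile(r"(?<![\d.])(\d{1,3})\s*(?:%|percent\b)", re.IGNORECASE)
# A response that consists of nothing but an integer, e.g. "45"
_BARE_INT_PATTERN = re.compile(r"^\s*(\d{1,3})\s*$")


def extractKpiFromResponse(aiResponse: str) -> Optional[int]:
    """Return the first percentage found in aiResponse, or None."""
    match = _PERCENT_PATTERN.search(aiResponse) or _BARE_INT_PATTERN.match(aiResponse)
    if match is None:
        return None
    value = int(match.group(1))
    # Reject values outside the documented 0-100 range.
    return value if 0 <= value <= 100 else None
```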

6. validateKpiProgression

@staticmethod
def validateKpiProgression(
    accumulationState: JsonAccumulationState,
    currentKpi: int
) -> bool:
    """
    Validate KPI progression from AI response.
    
    Validation rules:
    - If % goes DOWN → Error (e.g., no data received, started new) → Return False
    - If % doesn't move (increment < 1%) → Error (no progress) → Return False
    - If % goes UP (increment >= 1%) → Good progress → Return True
    
    Args:
        accumulationState: Current accumulation state (contains lastKpi)
        currentKpi: Current KPI percentage from AI (integer 0-100)
    
    Returns:
        True if KPI progression is valid, False if error detected
    """
    # Implementation:
    # - Get lastKpi from accumulationState
    # - Calculate increment = currentKpi - lastKpi
    # - If increment < 0: return False (went down - error)
    # - If increment < 1: return False (no progress - error)
    # - If increment >= 1: return True (progress - good)
    pass

Continuation Context for Next Prompt

What is Delivered for Next Iteration Prompt

When accumulating JSON fragments, the system needs to provide context to the AI for the next iteration. buildContinuationContext() handles this by extracting:

  1. deliveredSummary: Summary of all sections with counts

    • Per section: content type, item/row/line counts
    • Example: - bullet_list with 20 items, - table "section_table" with 8 rows
    • Truncated if too long (first 10 + last 10 items)
  2. cutOffElement: The incomplete element where JSON was cut off

    • Extracted from lastRawResponse (raw JSON string)
    • Shows AI where generation stopped
    • Used as reference point for continuation
  3. elementBeforeCutoff: The last complete element before the cut-off

    • Provides context of what was completed
    • Helps AI understand structure
  4. lastRawJson: Raw JSON string from last iteration

    • Stored for reference
    • Used to detect fragments vs. full JSON structures
  5. kpiQuestion: Question for AI to answer with percentage delivered

    • "Based on the delivered data so far, approximately what percentage (%) of the total required content has been delivered? Respond with an integer between 0-100."
    • AI must respond with integer percentage (0-100)
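The "first 10 + last 10" truncation rule for deliveredSummary could look like the sketch below. `truncateSummaryLines` is a hypothetical helper, and it assumes the summary is held as a list of text entries; the document does not specify either detail.

```python
from typing import List


def truncateSummaryLines(lines: List[str], head: int = 10, tail: int = 10) -> List[str]:
    """Keep the first `head` and last `tail` entries, marking what was dropped."""
    if len(lines) <= head + tail:
        return list(lines)
    omitted = len(lines) - head - tail
    marker = f"... ({omitted} more entries omitted) ..."
    # Slice from an explicit index so that tail=0 keeps nothing from the end.
    return lines[:head] + [marker] + lines[len(lines) - tail:]
```

Keeping both ends gives the AI the overall structure from the beginning and the most recent entries next to the cut-off point.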

Logic Flow

After each accumulation iteration:
1. Extract sections from accumulated JSON (even if incomplete)
2. Build continuation context:
   - Count items/rows/lines per section (for deliveredSummary)
   - Find incomplete section from allSections
   - Extract cut-off point from lastRawResponse
3. Pass context to prompt builder for next iteration
4. AI uses context to continue from cut-off point

Integration Point

Modified _extractSectionsFromResponse in mainServiceAi.py

def _extractSectionsFromResponse(
    self,  # Called as self._extractSectionsFromResponse(...) in the iteration loop
    result: str,
    iteration: int,
    debugPrefix: str,
    allSections: Optional[List[Dict[str, Any]]] = None,
    accumulationState: Optional[JsonAccumulationState] = None  # NEW: Track accumulation state
) -> Tuple[List[Dict[str, Any]], bool, Optional[Dict[str, Any]], Optional[JsonAccumulationState]]:
    """
    Extract sections from AI response, handling both valid and broken JSON.
    
    NEW BEHAVIOR:
    - First iteration: Check if complete, if not start accumulation
    - Subsequent iterations: Accumulate strings, parse when complete
    
    Returns:
        Tuple of:
        - sections: Extracted sections
        - wasJsonComplete: True if JSON is complete
        - parsedResult: Parsed JSON object
        - updatedAccumulationState: Updated accumulation state (None if not in accumulation mode)
    """
    
    if iteration == 1:
        # First iteration - check if complete
        try:
            extracted = extractJsonString(result)
            parsed = json.loads(extracted)
            
            # Check completeness
            if JsonResponseHandler.isJsonComplete(parsed):
                # Complete JSON - no accumulation needed
                sections = extractSectionsFromDocument(parsed)
                return sections, True, parsed, None  # No accumulation
        except Exception:
            # Extraction or parsing failed - fall through to accumulation mode
            pass
        
        # Incomplete - start accumulation
        logger.info("Iteration 1: Incomplete JSON detected, starting accumulation mode")
        accumulationState = JsonAccumulationState(
            accumulatedJsonString=result,
            isAccumulationMode=True,
            lastParsedResult=None,
            allSections=[]
        )
        return [], False, None, accumulationState
    
    else:
        # Subsequent iterations - accumulate
        if accumulationState and accumulationState.isAccumulationMode:
            accumulated, sections, isComplete, parsedResult = \
                JsonResponseHandler.accumulateAndParseJsonFragments(
                    accumulationState.accumulatedJsonString,
                    result,
                    allSections,
                    iteration
                )
            
            # Update accumulation state
            accumulationState.accumulatedJsonString = accumulated
            accumulationState.lastParsedResult = parsedResult
            accumulationState.allSections = (allSections or []) + (sections or [])  # allSections may be None
            accumulationState.isAccumulationMode = not isComplete
            
            return sections, isComplete, parsedResult, accumulationState
        else:
            # No accumulation state although iteration > 1 (shouldn't happen) - return an empty result
            logger.warning(f"Iteration {iteration}: No accumulation state but iteration > 1")
            return [], False, None, None

Modified Loop in mainServiceAi.py

# In the iteration loop:
accumulationState = None  # Track accumulation state

for iteration in range(1, maxIterations + 1):
    # ... AI call ...
    
    # Extract sections with accumulation support
    extractedSections, wasJsonComplete, parsedResult, accumulationState = \
        self._extractSectionsFromResponse(
            result, 
            iteration, 
            debugPrefix, 
            allSections,
            accumulationState  # Pass accumulation state object
        )
    
    # Update allSections for prompt context
    if extractedSections:
        allSections = JsonResponseHandler.mergeSectionsIntelligently(
            allSections, 
            extractedSections, 
            iteration
        )
    
    # Build continuation context for next prompt (if needed)
    if not wasJsonComplete and (allSections or result):
        continuationContext = buildContinuationContext(allSections, result)
        # Add KPI question for AI to answer (percentage delivered)
        continuationContext["kpiQuestion"] = "Based on the delivered data so far, approximately what percentage (%) of the total required content has been delivered? Respond with an integer between 0-100."
        # Use continuationContext in next prompt
    
    # Extract KPI from AI response and validate progression
    if accumulationState and accumulationState.isAccumulationMode:
        currentKpi = JsonResponseHandler.extractKpiFromResponse(result)  # Extract percentage from AI response
        if currentKpi is not None:
            if not JsonResponseHandler.validateKpiProgression(accumulationState, currentKpi):
                logger.warning(f"Iteration {iteration}: KPI validation failed, stopping accumulation")
                break
            # Store KPI in accumulation state
            accumulationState.lastKpi = currentKpi
    
    # Check completion
    if wasJsonComplete:
        break  # Done

Key Considerations

1. Overlap Detection Strategy

Question: How to detect overlaps between accumulated string and new fragment?

GENERIC Approach:

  • Compare end of accumulated string with start of new fragment
  • Find longest matching suffix/prefix (string-based comparison)
  • Remove duplicate content
  • Works for ANY JSON structure (no content-type-specific logic)

2. Partial Section Extraction

Question: Should we extract sections from incomplete JSON for prompt context?

Answer: Yes, with generic approach:

  • Extract what's available (even if incomplete) - works for ANY content type
  • Use for continuation prompts (via buildContinuationContext())
  • Build delivered summary with counts per section (generic counting)
  • Extract cut-off point from raw JSON string (generic detection)
  • Keep accumulated string separate (for next append)

3. State Storage

Question: Where to store accumulatedJsonString?

Answer: Store in JsonAccumulationState object for traceability

  • Use JsonAccumulationState class from datamodelAi.py
  • Store accumulated string, mode flag, parsed result, and sections
  • Better traceability and debugging
  • Can be logged/persisted if needed

4. Completion Detection

Question: When is JSON considered "complete"?

GENERIC Criteria:

  • Parses successfully without errors
  • All structures are properly closed (recursive check)
  • No incomplete arrays/objects
  • Generic validation (no content-type-specific checks)

5. Error Handling

Scenarios:

  • Repair fails → Continue accumulation (don't stop)
  • Parsing fails after accumulation → Try repair, continue if repair succeeds
  • Merge fails → Log error, continue with best available data

Implementation Steps

  1. Add state class in datamodelAi.py:

    • JsonAccumulationState (camelCase field naming)
  2. Create helper functions in subJsonResponseHandling.py:

    • mergeJsonStringsWithOverlap() (generic, camelCase)
    • isJsonComplete() (generic, camelCase)
    • finalizeJson() (generic, camelCase)
    • cleanEncodingIssues() (generic, camelCase)
    • extractKpiFromResponse() (camelCase)
    • validateKpiProgression() (camelCase)
  3. Create main function in subJsonResponseHandling.py:

    • accumulateAndParseJsonFragments() (generic, camelCase)
  4. Modify _extractSectionsFromResponse in mainServiceAi.py:

    • Add accumulationState parameter (JsonAccumulationState object)
    • Add first iteration check
    • Call accumulation function for subsequent iterations
    • Update accumulation state object
  5. Update iteration loop in mainServiceAi.py:

    • Track accumulationState object (JsonAccumulationState)
    • Pass to _extractSectionsFromResponse
    • Build continuation context using buildContinuationContext()
    • Add KPI question to continuation context
    • Extract KPI from AI response and validate progression
    • Handle return values
  6. Create test file:

    • Test string accumulation with overlaps
    • Test completion detection
    • Test partial section extraction
    • Test continuation context building

Testing Strategy

Test Cases

  1. Complete JSON on first iteration:

    • Should NOT enter accumulation mode
    • Should extract sections directly
  2. Incomplete JSON on first iteration:

    • Should enter accumulation mode
    • Should store string for next iteration
  3. Fragment with exact continuation:

    • Should concatenate without duplicates
    • Should parse successfully
  4. Fragment with overlap:

    • Should detect and remove overlap
    • Should concatenate correctly
  5. Fragment with full overlap:

    • Should handle duplicate content
    • Should not add duplicates
  6. Multiple iterations:

    • Should accumulate across all iterations
    • Should extract partial sections for prompts
    • Should complete when JSON is valid

Open Questions - Answers

1. How to handle very large accumulated strings? (Memory concerns)

Answer: No memory problems expected

  • System handles files up to ~1GB
  • String accumulation is acceptable for this size
  • No special memory management needed

2. Should we limit accumulation attempts? (Prevent infinite loops)

Answer: Yes, use KPI-based stopping

  • Add a generic KPI question to the iteration prompt asking what percentage of the required content has been delivered
  • KPI is AI-provided (not calculated by system) - AI answers percentage question
  • Stop as soon as the KPI goes down or fails to increase by at least 1 percentage point
  • maxIterations in the iteration loop remains the hard upper bound
  • Simple integer comparison for validation (no fuzzy AI calculation)

KPI Question for Iteration Prompt:

=== PROGRESS INDICATOR ===
Based on the delivered data so far, approximately what percentage (%) of the total 
required content has been delivered? 

Respond with an integer between 0-100.

⚠️ IMPORTANT: 
- If percentage goes DOWN in next iteration → Generation will stop (error detected)
- If percentage doesn't increase by at least 1% → Generation will stop (no progress)
- Only continue if percentage increases by 1% or more

KPI Validation Logic:

def validateKpiProgression(
    accumulationState: JsonAccumulationState,
    currentKpi: int
) -> bool:
    """
    Validate KPI progression from AI response.
    
    Validation rules:
    - If % goes DOWN → Error (e.g., no data received, started new) → Return False
    - If % doesn't move (increment < 1%) → Error (no progress) → Return False
    - If % goes UP (increment >= 1%) → Good progress → Return True
    
    Args:
        accumulationState: Current accumulation state (contains lastKpi)
        currentKpi: Current KPI percentage from AI (integer 0-100)
    
    Returns:
        True if KPI progression is valid, False if error detected
    """
    lastKpi = accumulationState.lastKpi if accumulationState.lastKpi else 0
    increment = currentKpi - lastKpi
    
    if increment < 0:
        return False  # Went down - error
    if increment < 1:
        return False  # No progress - error
    return True  # Progress - good
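To see how these rules play out over several iterations, the following worked example runs the same check over a sequence of reported percentages. `KpiState` is a hypothetical stand-in for JsonAccumulationState (only the `lastKpi` field matters here), and `countAcceptedIterations` is an illustrative helper, not part of the design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KpiState:  # hypothetical stand-in for JsonAccumulationState
    lastKpi: Optional[int] = None


def validateKpiProgression(state: KpiState, currentKpi: int) -> bool:
    """True only if the KPI rose by at least one percentage point."""
    lastKpi = state.lastKpi if state.lastKpi is not None else 0
    return currentKpi - lastKpi >= 1


def countAcceptedIterations(reportedKpis: List[int]) -> int:
    """Count how many iterations pass before the loop would stop."""
    state = KpiState()
    accepted = 0
    for kpi in reportedKpis:
        if not validateKpiProgression(state, kpi):
            break  # mirrors the `break` in the iteration loop
        state.lastKpi = kpi
        accepted += 1
    return accepted
```

One consequence of defaulting the missing `lastKpi` to 0 is that a first report of 0% is already treated as "no progress" and stops generation.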

3. How to handle encoding issues in string concatenation?

Answer: Remove problematic parts

  • Detect encoding errors during concatenation
  • Remove problematic characters/bytes
  • Continue with cleaned string
  • Acceptable to lose some data rather than fail completely

Implementation:

def cleanEncodingIssues(jsonString: str) -> str:
    """
    Remove problematic encoding parts from JSON string.
    
    Generic approach:
    - Detect encoding errors
    - Remove problematic characters/bytes
    - Return cleaned string
    """
    try:
        # Try to decode/encode to detect issues
        jsonString.encode('utf-8').decode('utf-8')
        return jsonString
    except UnicodeError:
        # Remove problematic parts
        cleaned = jsonString.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')
        logger.warning("Removed encoding issues from JSON string")
        return cleaned

4. Should overlap detection be configurable? (Performance vs. accuracy)

Answer: No, automated mode only

  • AI calls take 30-180 seconds (plenty of time for overlap detection)
  • No performance concerns
  • Always use automated overlap detection
  • No configuration needed