wiki/appdoc/json_string_accumulation_concept.md

# JSON String Accumulation Concept for Iterative AI Generation

## Problem Statement

Currently, the AI service parses each iteration's JSON string independently and then merges the parsed objects. In practice, the AI behaves differently:

1. AI delivers a **STRING** containing JSON (not a parsed JSON object)
2. First iteration: AI delivers a JSON string that's cut off somewhere (broken/incomplete)
3. Subsequent iterations: AI delivers MORE JSON string fragments that need to be **APPENDED** to the previous JSON string
4. Challenge: How to handle incomplete JSON strings and merge them correctly

## Core Principle

- **If iteration 1 returns complete, valid JSON** → Use it directly (no accumulation needed)
- **If iteration 1 returns incomplete/broken JSON** → Enter accumulation mode

## State Management

State class is defined in `datamodelAi.py`:

```python
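from typing import Any, Dict, List, Optional
from pydantic import BaseModel  # assumption: BaseModel is pydantic's

class JsonAccumulationState(BaseModel):
    accumulatedJsonString: str  # Raw accumulated JSON string
    isAccumulationMode: bool   # True if we're accumulating fragments
    lastParsedResult: Optional[Dict[str, Any]]  # Last successfully parsed result (for prompt context)
    allSections: List[Dict[str, Any]]  # Sections extracted so far (for prompt context)
    lastKpi: Optional[int] = None  # Last AI-reported progress percentage (read by validateKpiProgression)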
```

## Flow Logic

### Phase 1: First Iteration Check

```
1. Receive JSON string from AI
2. Try to parse:
   - SUCCESS + Complete → Extract sections → DONE (no accumulation)
   - FAILURE or INCOMPLETE → Enter accumulation mode
```
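
The Phase 1 decision can be sketched as follows. This is a minimal illustration, assuming the response is already bare JSON text (the real flow first runs `extractJsonString` and also calls `isJsonComplete`, both omitted here). The helper name `needsAccumulation` is hypothetical.

```python
import json


def needsAccumulation(rawResponse: str) -> bool:
    """Return True if the first response cannot be parsed as JSON yet."""
    try:
        json.loads(rawResponse)
    except json.JSONDecodeError:
        # Cut-off or otherwise broken JSON: enter accumulation mode
        return True
    # Parsed cleanly: use it directly, no accumulation needed
    return False


print(needsAccumulation('{"sections": [{"id": 1}, {"id"'))
print(needsAccumulation('{"sections": [{"id": 1}]}'))
```

The first call represents a response that was cut off mid-object, the second a complete one.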

### Phase 2: Accumulation Mode (if needed)

```
For each iteration:
  1. Receive newFragmentString
  2. Concatenate with overlap handling:
     accumulatedJsonString = mergeJsonStringsWithOverlap(
         accumulatedJsonString,
         newFragmentString
     )
  3. Try to parse accumulatedJsonString:
     - SUCCESS → Go to Phase 3 (completion)
     - FAILURE → Continue accumulation
  4. Extract partial sections (for prompt context):
     - Use repairBrokenJson() to get best partial structure
     - Extract sections from partial structure
     - Update allSections (for next prompt)
  5. Build continuation context for next prompt:
     - Extract deliveredSummary: Count of items/rows/lines per section
     - Extract cutOffElement: Incomplete element where JSON was cut off
     - Extract elementBeforeCutoff: Last complete element before cut-off
     - Store lastRawJson: Raw JSON string for reference
  6. Keep accumulatedJsonString for next iteration
```

### Phase 3: Completion (when parsing succeeds)

```
1. Analyze completeness:
   - Check if all structures are closed
   - Identify missing closing elements
2. Add closing elements if needed:
   - Close unclosed arrays/objects
   - Ensure proper JSON structure
3. Repair if corrupted:
   - Fix any remaining corruption
4. Extract final sections:
   - extractSectionsFromDocument()
5. DONE
```

## Function Design

### Main Function: `accumulateAndParseJsonFragments`

```python
@staticmethod
def accumulateAndParseJsonFragments(
    accumulatedJsonString: str,
    newFragmentString: str,
    allSections: List[Dict[str, Any]],
    iteration: int
) -> Tuple[str, List[Dict[str, Any]], bool, Optional[Dict[str, Any]]]:
    """
    Accumulate JSON fragments and parse when complete.

    GENERIC function that handles:
    1. Concatenating JSON strings with overlap detection
    2. Parsing the accumulated string
    3. Extracting sections (partial if incomplete, final if complete)
    4. Determining completion status

    Args:
        accumulatedJsonString: Previously accumulated JSON string
        newFragmentString: New fragment string from current iteration
        allSections: Sections extracted so far (for prompt context)
        iteration: Current iteration number

    Returns:
        Tuple of:
        - accumulatedJsonString: Updated accumulated string
        - sections: Extracted sections (partial if incomplete, final if complete)
        - isComplete: True if JSON is complete and valid
        - parsedResult: Parsed JSON object (if parsing succeeded)
    """

    # Step 1: Clean encoding issues from accumulated string (check end of first delivered part)
    cleanedAccumulated = cleanEncodingIssues(accumulatedJsonString)

    # Step 2: Clean encoding issues from new fragment
    cleanedFragment = cleanEncodingIssues(newFragmentString)

    # Step 3: Concatenate with overlap handling
    combinedString = mergeJsonStringsWithOverlap(
        cleanedAccumulated,
        cleanedFragment
    )

    # Step 4: Try to parse
    try:
        extracted = extractJsonString(combinedString)
        parsedResult = json.loads(extracted)

        # Step 5: Parsing succeeded - check completeness
        isComplete = isJsonComplete(parsedResult)

        if isComplete:
            # Step 6: Complete JSON - finalize
            finalizedJson = finalizeJson(parsedResult)
            sections = extractSectionsFromDocument(finalizedJson)
            return combinedString, sections, True, finalizedJson
        else:
            # Step 7: Incomplete but parseable - extract partial sections
            sections = extractSectionsFromDocument(parsedResult)
            return combinedString, sections, False, parsedResult

    except json.JSONDecodeError:
        # Step 8: Still broken - repair and extract partial sections
        repaired = repairBrokenJson(combinedString)
        if repaired:
            sections = extractSectionsFromDocument(repaired)
            return combinedString, sections, False, repaired
        else:
            # Repair failed - continue with data BEFORE merging the problematic piece
            # Return previous accumulated string (before adding new fragment)
            # This ensures we don't lose previously accumulated data
            logger.warning(f"Iteration {iteration}: Repair failed, continuing with previous accumulated data")
            return accumulatedJsonString, [], False, None
```

## Helper Functions Needed

### 1. `mergeJsonStringsWithOverlap`

```python
@staticmethod
def mergeJsonStringsWithOverlap(
    accumulated: str,
    newFragment: str
) -> str:
    """
    GENERIC function to merge two JSON strings, handling overlaps intelligently.

    Works for ANY JSON structure - no specific logic for content types.

    Overlap scenarios (all handled generically):
    - Exact continuation: newFragment starts exactly where accumulated ends
    - Partial overlap: newFragment overlaps with end of accumulated
    - Full overlap: newFragment is subset of accumulated

    Strategy:
    1. Find the longest suffix of accumulated that is also a prefix of newFragment (string-based comparison)
    2. Remove duplicate content
    3. Concatenate remaining parts

    Args:
        accumulated: Previously accumulated JSON string
        newFragment: New fragment string to append

    Returns:
        Combined JSON string with overlaps removed
    """
    # Implementation:
    # - Find longest common suffix/prefix match
    # - Remove overlapping part
    # - Concatenate: accumulated + newFragment[overlapEnd:]
    pass
```
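
One way the strategy described in the docstring could look is sketched below. It is not the final implementation: it only handles overlaps at the tail of the accumulated string, and the `minOverlap` parameter is a hypothetical addition that is not part of the signature above.

```python
def mergeJsonStringsWithOverlap(
    accumulated: str,
    newFragment: str,
    minOverlap: int = 1,  # hypothetical: shortest overlap treated as a duplicate
) -> str:
    """Append newFragment to accumulated, dropping a duplicated overlap."""
    if not accumulated:
        return newFragment
    if not newFragment:
        return accumulated

    # Try the longest candidate first: a suffix of accumulated that is
    # also a prefix of newFragment. A fragment that only repeats the tail
    # of accumulated (full overlap) matches at its full length.
    maxLen = min(len(accumulated), len(newFragment))
    for size in range(maxLen, max(minOverlap, 1) - 1, -1):
        if accumulated.endswith(newFragment[:size]):
            return accumulated + newFragment[size:]

    # No overlap found: exact continuation, plain concatenation
    return accumulated + newFragment


merged = mergeJsonStringsWithOverlap(
    '{"items": ["alpha", "be',
    '"alpha", "beta"]}',
)
print(merged)
```

A purely string-based match cannot tell a real repetition from a coincidence, such as a fragment that happens to start with the same quote character the accumulated string ends with. Raising `minOverlap` reduces such false matches at the cost of missing very short genuine overlaps.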

### 2. `isJsonComplete`

```python
@staticmethod
def isJsonComplete(parsedJson: Dict[str, Any]) -> bool:
    """
    GENERIC function to check if parsed JSON structure is complete.

    Works for ANY JSON structure - no specific logic for content types.

    Completeness checks (all generic):
    - All arrays are properly closed
    - All objects are properly closed
    - No incomplete structures
    - Recursive validation of nested structures

    Args:
        parsedJson: Parsed JSON object

    Returns:
        True if JSON is complete, False otherwise
    """
    # Implementation:
    # - Recursively check all structures
    # - Verify no incomplete arrays/objects
    # - Generic validation (no content-type-specific logic)
    pass
```
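
Because `json.loads` only succeeds once every bracket is balanced, the closure checks listed above could also be run against the raw string before parsing. The following is a minimal sketch under that assumption. `findUnclosedStructures` is a hypothetical name, not part of the existing code.

```python
from typing import List


def findUnclosedStructures(jsonString: str) -> List[str]:
    """Return the closing characters still missing, innermost first."""
    stack: List[str] = []
    inString = False
    escaped = False

    for ch in jsonString:
        if inString:
            # Inside a string literal brackets do not count
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                inString = False
            continue
        if ch == '"':
            inString = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack and stack[-1] == ch:
            stack.pop()

    closers = list(reversed(stack))
    if inString:
        # The cut-off happened inside a string literal
        closers.insert(0, '"')
    return closers


print(findUnclosedStructures('{"a": [1, 2, {"b": "x'))
```

An empty result means every object, array, and string is closed. A non-empty result lists what a repair step would have to append, although a cut-off directly after a comma or a key would still need additional repair.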

### 3. `finalizeJson`

```python
@staticmethod
def finalizeJson(parsedJson: Dict[str, Any]) -> Dict[str, Any]:
    """
    GENERIC function to finalize complete JSON by adding missing closing elements and repairing corruption.

    Works for ANY JSON structure - no specific logic for content types.

    Steps (all generic):
    1. Analyze structure for missing closing elements (recursively)
    2. Add closing brackets/braces where needed
    3. Repair any remaining corruption
    4. Validate final structure

    Args:
        parsedJson: Parsed JSON object that needs finalization

    Returns:
        Finalized JSON object
    """
    # Implementation:
    # - Check for incomplete structures (generic recursive)
    # - Add missing closing elements
    # - Repair corruption using existing repair logic
    # - Return finalized structure
    pass
```

### 4. `cleanEncodingIssues`

```python
@staticmethod
def cleanEncodingIssues(jsonString: str) -> str:
    """
    GENERIC function to remove problematic encoding parts from JSON string.

    Works for ANY JSON structure - removes problematic characters/bytes.

    Args:
        jsonString: JSON string that may have encoding issues

    Returns:
        Cleaned JSON string
    """
    try:
        # A Python str only fails UTF-8 encoding if it contains lone surrogates
        jsonString.encode('utf-8').decode('utf-8')
        return jsonString
    except UnicodeError:
        # Drop the characters that cannot be encoded
        cleaned = jsonString.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')
        logger.warning("Removed encoding issues from JSON string")
        return cleaned
```

### 5. `extractKpiFromResponse`

```python
@staticmethod
def extractKpiFromResponse(aiResponse: str) -> Optional[int]:
    """
    Extract KPI percentage from AI response.

    AI is asked: "Based on the delivered data so far, approximately what percentage (%)
    of the total required content has been delivered? Respond with an integer between 0-100."

    Args:
        aiResponse: AI response string that may contain percentage

    Returns:
        Integer percentage (0-100) or None if not found
    """
    # Implementation:
    # - Look for percentage pattern in response (e.g., "45%", "45 percent", "45")
    # - Extract integer value
    # - Validate range (0-100)
    # - Return integer or None
    pass
```
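
The extraction steps listed in the comments might be sketched as below. One assumption is made loudly here: a bare number is only accepted when it is the entire response, because the response may also carry JSON that contains unrelated numbers. The regular expression and that fallback rule are illustrative choices, not confirmed behavior.

```python
import re
from typing import Optional

# One to three digits followed by "%" or the word "percent"
_PERCENT_PATTERN = re.compile(r"(?<![\d.])(\d{1,3})\s*(?:%|percent\b)", re.IGNORECASE)


def extractKpiFromResponse(aiResponse: str) -> Optional[int]:
    """Return the reported percentage (0-100), or None if none is found."""
    match = _PERCENT_PATTERN.search(aiResponse)
    if match:
        value = int(match.group(1))
    else:
        # Assumed fallback: the whole response is just an integer
        bare = re.fullmatch(r"\d{1,3}", aiResponse.strip())
        if bare is None:
            return None
        value = int(bare.group(0))

    # Validate range; anything outside 0-100 is treated as not found
    return value if 0 <= value <= 100 else None


print(extractKpiFromResponse("Progress so far: 45%"))
print(extractKpiFromResponse("no figure given"))
```

Returning `None` for out-of-range values keeps the caller's `if currentKpi is not None` check as the single place where a missing KPI is handled.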

### 6. `validateKpiProgression`

```python
@staticmethod
def validateKpiProgression(
    accumulationState: JsonAccumulationState,
    currentKpi: int
) -> bool:
    """
    Validate KPI progression from AI response.

    Validation rules:
    - If % goes DOWN → Error (e.g., no data received, started new) → Return False
    - If % doesn't move (increment < 1%) → Error (no progress) → Return False
    - If % goes UP (increment >= 1%) → Good progress → Return True

    Args:
        accumulationState: Current accumulation state (contains lastKpi)
        currentKpi: Current KPI percentage from AI (integer 0-100)

    Returns:
        True if KPI progression is valid, False if error detected
    """
    # Implementation:
    # - Get lastKpi from accumulationState
    # - Calculate increment = currentKpi - lastKpi
    # - If increment < 0: return False (went down - error)
    # - If increment < 1: return False (no progress - error)
    # - If increment >= 1: return True (progress - good)
    pass
```

## Continuation Context for Next Prompt

### What is Delivered for Next Iteration Prompt

When accumulating JSON fragments, the system needs to provide context to the AI for the next iteration. This is handled by `buildContinuationContext()` which extracts:

1. **deliveredSummary**: Summary of all sections with counts
   - Per section: content type, item/row/line counts
   - Example: `- bullet_list with 20 items`, `- table "section_table" with 8 rows`
   - Truncated if too long (first 10 + last 10 items)

2. **cutOffElement**: The incomplete element where JSON was cut off
   - Extracted from `lastRawResponse` (raw JSON string)
   - Shows AI where generation stopped
   - Used as reference point for continuation

3. **elementBeforeCutoff**: The last complete element before the cut-off
   - Provides context of what was completed
   - Helps AI understand structure

4. **lastRawJson**: Raw JSON string from last iteration
   - Stored for reference
   - Used to detect fragments vs. full JSON structures

5. **kpiQuestion**: Question for AI to answer with percentage delivered
   - "Based on the delivered data so far, approximately what percentage (%) of the total required content has been delivered? Respond with an integer between 0-100."
   - AI must respond with integer percentage (0-100)

### Logic Flow

```
After each accumulation iteration:
1. Extract sections from accumulated JSON (even if incomplete)
2. Build continuation context:
   - Count items/rows/lines per section (for deliveredSummary)
   - Find incomplete section from allSections
   - Extract cut-off point from lastRawResponse
3. Pass context to prompt builder for next iteration
4. AI uses context to continue from cut-off point
```
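
The `deliveredSummary` part of this flow, including the "first 10 + last 10" truncation, could look roughly like the sketch below. The section keys `content_type` and `elements`, the function name `buildDeliveredSummary`, and the wording of the omission marker are all assumptions made for illustration. The actual section schema is not specified in this document.

```python
from typing import Any, Dict, List


def buildDeliveredSummary(sections: List[Dict[str, Any]], keep: int = 10) -> str:
    """Summarize delivered sections as one line per section with counts."""
    lines = []
    for section in sections:
        # Hypothetical keys: "content_type" and "elements"
        contentType = section.get("content_type", "unknown")
        count = len(section.get("elements", []))
        lines.append(f"- {contentType} with {count} items")

    # Truncate long summaries: keep the first and last `keep` lines
    if len(lines) > 2 * keep:
        omitted = len(lines) - 2 * keep
        lines = lines[:keep] + [f"... ({omitted} sections omitted) ..."] + lines[-keep:]

    return "\n".join(lines)


exampleSections = [{"content_type": "bullet_list", "elements": list(range(20))}]
print(buildDeliveredSummary(exampleSections))
```

Keeping both ends of a long summary preserves the document's opening structure and the most recently delivered sections, which are the ones closest to the cut-off point.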

## Integration Point

### Modified `_extractSectionsFromResponse` in `mainServiceAi.py`

```python
def _extractSectionsFromResponse(
    self,  # called as self._extractSectionsFromResponse(...) in the iteration loop
    result: str,
    iteration: int,
    debugPrefix: str,
    allSections: Optional[List[Dict[str, Any]]] = None,
    accumulationState: Optional[JsonAccumulationState] = None  # NEW: Track accumulation state
) -> Tuple[List[Dict[str, Any]], bool, Optional[Dict[str, Any]], Optional[JsonAccumulationState]]:
    """
    Extract sections from AI response, handling both valid and broken JSON.

    NEW BEHAVIOR:
    - First iteration: Check if complete, if not start accumulation
    - Subsequent iterations: Accumulate strings, parse when complete

    Returns:
        Tuple of:
        - sections: Extracted sections
        - wasJsonComplete: True if JSON is complete
        - parsedResult: Parsed JSON object
        - updatedAccumulationState: Updated accumulation state (None if not in accumulation mode)
    """

    if iteration == 1:
        # First iteration - check if complete
        try:
            extracted = extractJsonString(result)
            parsed = json.loads(extracted)

            # Check completeness
            if JsonResponseHandler.isJsonComplete(parsed):
                # Complete JSON - no accumulation needed
                sections = extractSectionsFromDocument(parsed)
                return sections, True, parsed, None  # No accumulation
        except Exception:
            # Parsing failed: fall through and start accumulation mode
            pass

        # Incomplete - start accumulation
        logger.info("Iteration 1: Incomplete JSON detected, starting accumulation mode")
        accumulationState = JsonAccumulationState(
            accumulatedJsonString=result,
            isAccumulationMode=True,
            lastParsedResult=None,
            allSections=[]
        )
        return [], False, None, accumulationState

    else:
        # Subsequent iterations - accumulate
        if accumulationState and accumulationState.isAccumulationMode:
            accumulated, sections, isComplete, parsedResult = \
                JsonResponseHandler.accumulateAndParseJsonFragments(
                    accumulationState.accumulatedJsonString,
                    result,
                    allSections,
                    iteration
                )

            # Update accumulation state
            accumulationState.accumulatedJsonString = accumulated
            accumulationState.lastParsedResult = parsedResult
            # allSections defaults to None, so normalize before concatenating
            accumulationState.allSections = (allSections or []) + sections
            accumulationState.isAccumulationMode = not isComplete

            return sections, isComplete, parsedResult, accumulationState
        else:
            # No accumulation mode - process normally (shouldn't happen)
            logger.warning(f"Iteration {iteration}: No accumulation state but iteration > 1")
            return [], False, None, None
```

### Modified Loop in `mainServiceAi.py`

```python
# In the iteration loop:
accumulationState = None  # Track accumulation state

for iteration in range(1, maxIterations + 1):
    # ... AI call ...

    # Extract sections with accumulation support
    extractedSections, wasJsonComplete, parsedResult, accumulationState = \
        self._extractSectionsFromResponse(
            result,
            iteration,
            debugPrefix,
            allSections,
            accumulationState  # Pass accumulation state object
        )

    # Update allSections for prompt context
    if extractedSections:
        allSections = JsonResponseHandler.mergeSectionsIntelligently(
            allSections,
            extractedSections,
            iteration
        )

    # Build continuation context for next prompt (if needed)
    if not wasJsonComplete and (allSections or result):
        continuationContext = buildContinuationContext(allSections, result)
        # Add KPI question for AI to answer (percentage delivered)
        continuationContext["kpiQuestion"] = "Based on the delivered data so far, approximately what percentage (%) of the total required content has been delivered? Respond with an integer between 0-100."
        # Use continuationContext in next prompt

    # Extract KPI from AI response and validate progression
    if accumulationState and accumulationState.isAccumulationMode:
        currentKpi = JsonResponseHandler.extractKpiFromResponse(result)  # Extract percentage from AI response
        if currentKpi is not None:
            if not JsonResponseHandler.validateKpiProgression(accumulationState, currentKpi):
                logger.warning(f"Iteration {iteration}: KPI validation failed, stopping accumulation")
                break
            # Store KPI in accumulation state
            accumulationState.lastKpi = currentKpi

    # Check completion
    if wasJsonComplete:
        break  # Done
```

## Key Considerations

### 1. Overlap Detection Strategy

**Question:** How to detect overlaps between accumulated string and new fragment?

**GENERIC Approach:**
- Compare end of accumulated string with start of new fragment
- Find the longest suffix of the accumulated string that matches a prefix of the new fragment (string-based comparison)
- Remove duplicate content
- Works for ANY JSON structure (no content-type-specific logic)

### 2. Partial Section Extraction

**Question:** Should we extract sections from incomplete JSON for prompt context?

**Answer:** Yes, with generic approach:
- Extract what's available (even if incomplete) - works for ANY content type
- Use for continuation prompts (via `buildContinuationContext()`)
- Build delivered summary with counts per section (generic counting)
- Extract cut-off point from raw JSON string (generic detection)
- Keep accumulated string separate (for next append)

### 3. State Storage

**Question:** Where to store `accumulatedJsonString`?

**Answer:** Store in `JsonAccumulationState` object for traceability
- Use `JsonAccumulationState` class from `datamodelAi.py`
- Store accumulated string, mode flag, parsed result, and sections
- Better traceability and debugging
- Can be logged/persisted if needed

### 4. Completion Detection

**Question:** When is JSON considered "complete"?

**GENERIC Criteria:**
- Parses successfully without errors
- All structures are properly closed (recursive check)
- No incomplete arrays/objects
- Generic validation (no content-type-specific checks)

### 5. Error Handling

**Scenarios:**
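- Repair fails → Discard the new fragment, keep the previously accumulated string, and continue accumulation (don't stop)
- Parsing fails after accumulation → Try repair; if repair succeeds, use the repaired structure for partial section extraction and continue
- Merge fails → Log error, continue with best available data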

## Implementation Steps

1. **Add state class** in `datamodelAi.py`:
   - `JsonAccumulationState` (camelCase field names)

2. **Create helper functions** in `subJsonResponseHandling.py`:
   - `mergeJsonStringsWithOverlap()` (generic, camelCase)
   - `isJsonComplete()` (generic, camelCase)
   - `finalizeJson()` (generic, camelCase)
   - `cleanEncodingIssues()` (generic, camelCase)
   - `extractKpiFromResponse()` (camelCase)
   - `validateKpiProgression()` (camelCase)

3. **Create main function** in `subJsonResponseHandling.py`:
   - `accumulateAndParseJsonFragments()` (generic, camelCase)

4. **Modify `_extractSectionsFromResponse`** in `mainServiceAi.py`:
   - Add `accumulationState` parameter (JsonAccumulationState object)
   - Add first iteration check
   - Call accumulation function for subsequent iterations
   - Update accumulation state object

5. **Update iteration loop** in `mainServiceAi.py`:
   - Track `accumulationState` object (JsonAccumulationState)
   - Pass to `_extractSectionsFromResponse`
   - Build continuation context using `buildContinuationContext()`
   - Add KPI question to continuation context
   - Extract KPI from AI response and validate progression
   - Handle return values

6. **Create test file**:
   - Test string accumulation with overlaps
   - Test completion detection
   - Test partial section extraction
   - Test continuation context building

## Testing Strategy

### Test Cases

1. **Complete JSON on first iteration:**
   - Should NOT enter accumulation mode
   - Should extract sections directly

2. **Incomplete JSON on first iteration:**
   - Should enter accumulation mode
   - Should store string for next iteration

3. **Fragment with exact continuation:**
   - Should concatenate without duplicates
   - Should parse successfully

4. **Fragment with overlap:**
   - Should detect and remove overlap
   - Should concatenate correctly

5. **Fragment with full overlap:**
   - Should handle duplicate content
   - Should not add duplicates

6. **Multiple iterations:**
   - Should accumulate across all iterations
   - Should extract partial sections for prompts
   - Should complete when JSON is valid


## Open Questions - Answers

### 1. How to handle very large accumulated strings? (Memory concerns)

**Answer:** No memory problems expected
- System handles files up to ~1GB
- String accumulation is acceptable for this size
- No special memory management needed

### 2. Should we limit accumulation attempts? (Prevent infinite loops)

**Answer:** Yes, use KPI-based stopping
- Add a generic KPI question to the iteration prompt asking what percentage of the total required content has been delivered
- KPI is AI-provided (not calculated by system): the AI answers the percentage question
- Stop as soon as the percentage goes down or fails to increase by at least 1 percentage point (see the validation logic below)
- Simple integer comparison for validation (no fuzzy AI calculation)
- `maxIterations` in the iteration loop remains the hard upper bound

**KPI Question for Iteration Prompt:**

```
=== PROGRESS INDICATOR ===
Based on the delivered data so far, approximately what percentage (%) of the total
required content has been delivered?

Respond with an integer between 0-100.

⚠️ IMPORTANT:
- If percentage goes DOWN in next iteration → Generation will stop (error detected)
- If percentage doesn't increase by at least 1% → Generation will stop (no progress)
- Only continue if percentage increases by 1% or more
```

**KPI Validation Logic:**
```python
def validateKpiProgression(
    accumulationState: JsonAccumulationState,
    currentKpi: int
) -> bool:
    """
    Validate KPI progression from AI response.

    Validation rules:
    - If % goes DOWN → Error (e.g., no data received, started new) → Return False
    - If % doesn't move (increment < 1%) → Error (no progress) → Return False
    - If % goes UP (increment >= 1%) → Good progress → Return True

    Args:
        accumulationState: Current accumulation state (contains lastKpi)
        currentKpi: Current KPI percentage from AI (integer 0-100)

    Returns:
        True if KPI progression is valid, False if error detected
    """
    lastKpi = accumulationState.lastKpi if accumulationState.lastKpi else 0
    increment = currentKpi - lastKpi

    if increment < 0:
        return False  # Went down - error
    if increment < 1:
        return False  # No progress - error
    return True  # Progress - good
```

### 3. How to handle encoding issues in string concatenation?

**Answer:** Remove problematic parts
- Detect encoding errors during concatenation
- Remove problematic characters/bytes
- Continue with cleaned string
- Acceptable to lose some data rather than fail completely

**Implementation:**
```python
def cleanEncodingIssues(jsonString: str) -> str:
    """
    Remove problematic encoding parts from JSON string.

    Generic approach:
    - Detect encoding errors
    - Remove problematic characters/bytes
    - Return cleaned string
    """
    try:
        # A Python str only fails UTF-8 encoding if it contains lone surrogates
        jsonString.encode('utf-8').decode('utf-8')
        return jsonString
    except UnicodeError:
        # Drop the characters that cannot be encoded
        cleaned = jsonString.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')
        logger.warning("Removed encoding issues from JSON string")
        return cleaned
```

### 4. Should overlap detection be configurable? (Performance vs. accuracy)

**Answer:** No, automated mode only
- AI calls take 30-180 seconds (plenty of time for overlap detection)
- No performance concerns
- Always use automated overlap detection
- No configuration needed