gateway/docs/WEBSEARCH_FIXES.md

Web Search Content Extraction Fixes

Problem Summary

The Tavily web search integration was failing to extract content from search results, causing web research to return empty or incomplete data. The root causes were explicit None values in API responses and incomplete error recovery.

Main Issues Fixed

1. Incomplete Content Extraction from Search Results

Problem:

  • When the Tavily API returned search results, some results had raw_content (and sometimes content) present but explicitly set to None, rather than missing entirely
  • The code used result.get("raw_content") or result.get("content", "") — the .get() default of "" only applies when a key is missing, so a content key that existed with the value None was returned as None
  • This caused None values to propagate through the system instead of defaulting to an empty string

Fix: Changed the content extraction in aicorePluginTavily.py to properly handle None values:

# Before (line 344):
rawContent=result.get("raw_content") or result.get("content", "")

# After:
rawContent=result.get("raw_content") or result.get("content") or ""

This ensures that if raw_content is None, it falls back to content, and if that's also None, it defaults to an empty string.
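The difference between the two patterns can be reproduced in isolation — dict.get() only applies its default when the key is missing, not when it is present with an explicit None:

```python
# Minimal repro: a result where both keys exist but hold explicit None.
result = {"raw_content": None, "content": None}

# Old pattern: the default "" never kicks in because "content" exists.
old = result.get("raw_content") or result.get("content", "")
assert old is None  # None leaks through

# Fixed pattern: a trailing `or ""` catches the explicit None.
new = result.get("raw_content") or result.get("content") or ""
assert new == ""
```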

Additional Fix: Added defensive checks in the webSearch method to safely extract content even when result objects have unexpected structures:

# Safely extract content with multiple fallbacks
content = ""
if hasattr(result, 'rawContent'):
    content = result.rawContent or ""
if not content and hasattr(result, 'content'):
    content = result.content or ""
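The same defensive chain can be exercised against result objects of varying shapes. This is a sketch wrapping the checks above in a helper (extractContent is an illustrative name, and SimpleNamespace stands in for the real result objects):

```python
from types import SimpleNamespace

def extractContent(result):
    # Same fallback chain as above: rawContent, then content, then "".
    content = ""
    if hasattr(result, 'rawContent'):
        content = result.rawContent or ""
    if not content and hasattr(result, 'content'):
        content = result.content or ""
    return content

# Handles None attributes, missing attributes, and normal results alike.
assert extractContent(SimpleNamespace(rawContent=None, content="body")) == "body"
assert extractContent(SimpleNamespace(content="only content")) == "only content"
assert extractContent(object()) == ""
```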

2. NoneType Error When Logging Content Length

Problem:

  • Code attempted to check len(first_result.get('raw_content', '')) for logging
  • When raw_content key existed but value was None, .get() returned None instead of the default ''
  • This caused len(None) to fail with TypeError: object of type 'NoneType' has no len()

Fix: Changed the logging code to safely handle None values:

# Before (line 338):
logger.debug(f"First result has raw_content: {'raw_content' in first_result}, content length: {len(first_result.get('raw_content', ''))}")

# After:
raw_content = first_result.get('raw_content') or ''
logger.debug(f"First result has raw_content: {'raw_content' in first_result}, content length: {len(raw_content)}")

3. Missing Error Recovery in Content Extraction

Problem:

  • When processing search results, if one result failed to extract, the entire extraction could fail
  • No recovery mechanism to extract at least URLs even when content extraction failed
  • Errors were logged but processing stopped, losing potentially useful data

Fix: Added per-result error handling with recovery:

for result in searchResults:
    try:
        # Extract URL, content, title safely
        ...  # extraction logic elided
    except Exception as resultError:
        logger.warning(f"Error processing individual search result: {resultError}")
        # Continue processing other results instead of failing completely
        continue

Also added recovery at the extraction level:

except Exception as extractionError:
    logger.error(f"Error extracting URLs and content from search results: {extractionError}")
    # Try to recover at least URLs
    try:
        urls = [result.url for result in searchResults if hasattr(result, 'url') and result.url]
        logger.info(f"Recovered {len(urls)} URLs after extraction error")
    except Exception:
        logger.error("Failed to recover any URLs from search results")
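The per-result loop can be sketched as a self-contained function, with the elided extraction logic replaced by the same attribute fallbacks used earlier (extractResults and the SimpleNamespace result shapes are illustrative, not the real code):

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)

def extractResults(searchResults):
    """Per-result recovery: one bad result must not sink the whole batch."""
    extracted = []
    for result in searchResults:
        try:
            url = result.url  # raises AttributeError on malformed results
            content = (getattr(result, 'rawContent', None)
                       or getattr(result, 'content', None) or "")
            extracted.append((url, content))
        except Exception as resultError:
            logger.warning(f"Error processing individual search result: {resultError}")
            continue  # keep processing the remaining results
    return extracted

good = SimpleNamespace(url="https://example.com", rawContent="page body")
bad = object()  # no url attribute -> raises, gets skipped
assert extractResults([good, bad]) == [("https://example.com", "page body")]
```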

4. Incomplete Crawl Result Processing

Problem:

  • When crawl returned results but individual page processing failed, entire crawl was lost
  • No fallback to extract at least URLs from failed crawl results
  • Missing content fields could cause errors when formatting results

Fix: Added error handling for individual page processing:

for i, result in enumerate(crawlResults, 1):
    try:
        # Format page content
        ...  # formatting logic elided
    except Exception as pageError:
        logger.warning(f"Error formatting page {i} from crawl: {pageError}")
        # Try to add at least the URL
        try:
            pageUrls.append(result.url if hasattr(result, 'url') and result.url else webCrawlPrompt.url)
        except Exception:
            pass

Also ensured all result fields have safe defaults:

results.append(WebCrawlResult(
    url=result_url or url,  # Fallback to base URL
    content=result_content,  # Already ensured to be string
    title=result_title      # Already ensured to be string
))
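The safe-defaults pattern can be shown end to end with a stand-in model (this WebCrawlResult dataclass and the buildResult helper are illustrative sketches, not the real definitions):

```python
from dataclasses import dataclass

@dataclass
class WebCrawlResult:  # illustrative stand-in for the real model
    url: str
    content: str
    title: str

def buildResult(result_url, result_content, result_title, base_url):
    # Coerce every field before construction so downstream
    # formatting never sees None.
    return WebCrawlResult(
        url=result_url or base_url,
        content=result_content or "",
        title=result_title or "",
    )

r = buildResult(None, None, None, "https://example.com")
assert (r.url, r.content, r.title) == ("https://example.com", "", "")
```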

Impact

These fixes ensure that:

  1. Content is always extracted - Even when raw_content is None, the system falls back to the content field or an empty string
  2. Partial results are preserved - If some results fail, others are still processed and returned
  3. URLs are recovered - Even when content extraction fails completely, URLs can still be extracted for crawling
  4. No crashes from None values - All None values are properly handled before operations like len() are called

Testing Recommendations

  • Test with Tavily search results that have raw_content set to None
  • Test with mixed results (some with content, some without)
  • Test error recovery when individual results fail
  • Verify that URLs are still extracted even when content extraction fails
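The first two recommendations can be sketched as unit tests (pytest style); extractContent here is a stand-in for the fixed fallback chain described above:

```python
def extractContent(result: dict) -> str:
    # Fixed pattern: trailing `or ""` catches explicit None values.
    return result.get("raw_content") or result.get("content") or ""

def test_raw_content_explicit_none():
    assert extractContent({"raw_content": None, "content": "fallback"}) == "fallback"

def test_mixed_results():
    results = [
        {"raw_content": "full page"},
        {"raw_content": None, "content": "summary"},
        {"raw_content": None, "content": None},
    ]
    assert [extractContent(r) for r in results] == ["full page", "summary", ""]
```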