Web Search Content Extraction Fixes
Problem Summary
The Tavily web search integration was failing to extract content from search results, causing web research to return empty or incomplete data. The main issues involved mishandled `None` values and incomplete error recovery.
Main Issues Fixed
1. Incomplete Content Extraction from Search Results
Problem:
- When the Tavily API returned search results, some results had `raw_content` set to `None` (not missing, but explicitly `None`)
- The code used `result.get("raw_content") or result.get("content", "")`, which still produced `None` when the `content` key existed but was explicitly `None`, because the `.get()` default only applies when the key is missing
- This caused `None` values to propagate through the system instead of falling back to the `content` field or an empty string
Fix:
Changed the content extraction in `aicorePluginTavily.py` to handle `None` values properly:

```python
# Before (line 344):
rawContent = result.get("raw_content") or result.get("content", "")
# After:
rawContent = result.get("raw_content") or result.get("content") or ""
```

This ensures that if `raw_content` is `None`, extraction falls back to `content`, and if that is also `None`, it defaults to an empty string.
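The difference can be seen in a minimal sketch, with plain dictionaries standing in for Tavily result entries:

```python
# Minimal illustration of why the original expression leaked None.
result = {"raw_content": None, "content": None}

# Before: .get("content", "") only applies the default when the key is
# missing, so an explicit None slips through the `or` chain.
before = result.get("raw_content") or result.get("content", "")
assert before is None  # the bug: None propagates downstream

# After: a final `or ""` catches any remaining falsy value.
after = result.get("raw_content") or result.get("content") or ""
assert after == ""
```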
Additional Fix:
Added defensive checks in the `webSearch` method to safely extract content even when result objects have unexpected structures:

```python
# Safely extract content with multiple fallbacks
content = ""
if hasattr(result, 'rawContent'):
    content = result.rawContent or ""
if not content and hasattr(result, 'content'):
    content = result.content or ""
```
2. NoneType Error When Logging Content Length
Problem:
- The code attempted to evaluate `len(first_result.get('raw_content', ''))` for logging
- When the `raw_content` key existed but its value was `None`, `.get()` returned `None` instead of the default `''`
- This caused `len(None)` to fail with `TypeError: object of type 'NoneType' has no len()`
Fix:
Changed the logging code to handle `None` values safely:

```python
# Before (line 338):
logger.debug(f"First result has raw_content: {'raw_content' in first_result}, content length: {len(first_result.get('raw_content', ''))}")
# After:
raw_content = first_result.get('raw_content') or ''
logger.debug(f"First result has raw_content: {'raw_content' in first_result}, content length: {len(raw_content)}")
```
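The failure mode and the fixed pattern can be reproduced in isolation:

```python
# Reproduction of the failure: an explicit None defeats .get()'s default.
first_result = {"raw_content": None}

failed = False
try:
    len(first_result.get('raw_content', ''))  # .get() returns None here
except TypeError:
    failed = True  # TypeError: object of type 'NoneType' has no len()
assert failed

# Fixed pattern: normalise to '' before calling len()
raw_content = first_result.get('raw_content') or ''
assert len(raw_content) == 0
```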
3. Missing Error Recovery in Content Extraction
Problem:
- When processing search results, if one result failed to extract, the entire extraction could fail
- No recovery mechanism to extract at least URLs even when content extraction failed
- Errors were logged but processing stopped, losing potentially useful data
Fix: Added per-result error handling with recovery:

```python
for result in searchResults:
    try:
        # Extract URL, content, title safely
        # ... extraction logic ...
    except Exception as resultError:
        logger.warning(f"Error processing individual search result: {resultError}")
        # Continue processing other results instead of failing completely
        continue
```
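The per-result recovery behaviour can be sketched end to end. The `SimpleNamespace` objects and the extraction logic below are stand-ins for the real Tavily result objects, not the plugin's actual code:

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)

# Hypothetical result objects; the last entry is deliberately malformed
# so that its extraction raises.
searchResults = [
    SimpleNamespace(url="https://example.com/a", rawContent="page A"),
    SimpleNamespace(url="https://example.com/b", rawContent=None),
    "not-a-result-object",  # attribute access on this will fail
]

extracted = []
for result in searchResults:
    try:
        # Require a url attribute; fall back to "" for missing content
        content = getattr(result, "rawContent", None) or ""
        extracted.append((result.url, content))
    except Exception as resultError:
        logger.warning(f"Error processing individual search result: {resultError}")
        continue  # keep processing the remaining results

# Two of the three results survive; the malformed one is skipped.
assert extracted == [("https://example.com/a", "page A"),
                     ("https://example.com/b", "")]
```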
Also added recovery at the extraction level:

```python
except Exception as extractionError:
    logger.error(f"Error extracting URLs and content from search results: {extractionError}")
    # Try to recover at least the URLs
    try:
        urls = [result.url for result in searchResults if hasattr(result, 'url') and result.url]
        logger.info(f"Recovered {len(urls)} URLs after extraction error")
    except Exception:
        logger.error("Failed to recover any URLs from search results")
```
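The recovery comprehension keeps only results with a truthy `url` attribute, which can be checked against a few hypothetical result shapes:

```python
from types import SimpleNamespace

# Hypothetical result objects: one valid, one with url=None, one with
# no url attribute at all.
searchResults = [
    SimpleNamespace(url="https://example.com/a", rawContent="page A"),
    SimpleNamespace(url=None, rawContent="page B"),
    SimpleNamespace(rawContent="page C"),
]

# Same comprehension as the recovery path: keep only truthy url attributes.
urls = [result.url for result in searchResults if hasattr(result, 'url') and result.url]
assert urls == ["https://example.com/a"]
```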
4. Incomplete Crawl Result Processing
Problem:
- When a crawl returned results but individual page processing failed, the entire crawl was lost
- There was no fallback to extract at least URLs from failed crawl results
- Missing content fields could cause errors when formatting results
Fix: Added error handling for individual page processing:

```python
for i, result in enumerate(crawlResults, 1):
    try:
        # Format page content
        # ... formatting logic ...
    except Exception as pageError:
        logger.warning(f"Error formatting page {i} from crawl: {pageError}")
        # Try to add at least the URL
        try:
            pageUrls.append(result.url if hasattr(result, 'url') and result.url else webCrawlPrompt.url)
        except Exception:
            pass
```
Also ensured all result fields have safe defaults:

```python
results.append(WebCrawlResult(
    url=result_url or url,    # Fallback to the base URL
    content=result_content,   # Already ensured to be a string
    title=result_title        # Already ensured to be a string
))
```
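The "safe defaults" pattern can be sketched with a minimal stand-in for `WebCrawlResult`; the dataclass shape and field names here are assumptions based on the snippet above, not the plugin's actual definition:

```python
from dataclasses import dataclass

# Hypothetical shape of WebCrawlResult, inferred from the snippet above.
@dataclass
class WebCrawlResult:
    url: str
    content: str
    title: str

# Values pulled from a crawl result; any of them may be None.
result_url, result_content, result_title = None, None, None
url = "https://example.com"  # base URL used as the fallback

crawl = WebCrawlResult(
    url=result_url or url,         # fall back to the base URL
    content=result_content or "",  # coerce None to an empty string
    title=result_title or "",      # same for the title
)
assert crawl.url == "https://example.com"
assert crawl.content == "" and crawl.title == ""
```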
Impact
These fixes ensure that:
- Content is always extracted: even when `raw_content` is `None`, the system falls back to the `content` field or an empty string
- Partial results are preserved: if some results fail, the others are still processed and returned
- URLs are recovered: even when content extraction fails completely, URLs can still be extracted for crawling
- No crashes from `None` values: all `None` values are handled properly before operations like `len()` are called
Testing Recommendations
- Test with Tavily search results that have `raw_content` set to `None`
- Test with mixed results (some with content, some without)
- Test error recovery when individual results fail
- Verify that URLs are still extracted even when content extraction fails
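The first two recommendations can be captured as simple assertions; `extract_content` below is a sketch mirroring the fixed fallback chain, not the actual plugin function:

```python
# Minimal checks for the first two recommendations above.
def extract_content(result: dict) -> str:
    # Mirrors the fixed fallback chain from the content-extraction fix
    return result.get("raw_content") or result.get("content") or ""

# raw_content explicitly None -> falls back to content
assert extract_content({"raw_content": None, "content": "body"}) == "body"
# both fields None -> empty string rather than None
assert extract_content({"raw_content": None, "content": None}) == ""
# mixed results: raw_content wins when present
assert extract_content({"raw_content": "raw", "content": "body"}) == "raw"
```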