# Web Search Content Extraction Fixes
## Problem Summary

The Tavily web search integration was failing to extract content from search results, causing web research to return empty or incomplete data. The main issues stemmed from mishandled `None` values and incomplete error recovery.
## Main Issues Fixed
### 1. Incomplete Content Extraction from Search Results

**Problem:**

- When the Tavily API returned search results, some results had `raw_content` explicitly set to `None` (the key was present, but its value was `None`)
- The code used `result.get("raw_content") or result.get("content", "")`; the `or` already skips a `None` `raw_content`, but `result.get("content", "")` still returns `None` when the `content` key is present with a `None` value, because `dict.get` only applies its default for a *missing* key
- This let `None` values propagate through the system instead of falling back to an empty string
**Fix:**

Changed the content extraction in `aicorePluginTavily.py` to properly handle `None` values:

```python
# Before (line 344):
rawContent=result.get("raw_content") or result.get("content", "")

# After:
rawContent=result.get("raw_content") or result.get("content") or ""
```

This ensures that if `raw_content` is `None`, extraction falls back to `content`, and if that is also `None`, it defaults to an empty string.
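The subtlety is that `dict.get(key, default)` only returns the default when the key is *absent*, not when its value is `None`. A minimal, standalone illustration:

```python
# A result dict shaped like a Tavily response: both keys exist, but hold None.
result = {"raw_content": None, "content": None}

# Old expression: .get() with a default does not guard against an explicit None.
broken = result.get("raw_content") or result.get("content", "")

# Fixed expression: a trailing `or ""` guarantees a string.
fixed = result.get("raw_content") or result.get("content") or ""
```

Here `broken` ends up as `None` (the `content` key exists, so the `''` default is never used), while `fixed` ends up as `""`.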
**Additional Fix:**

Added defensive checks in the `webSearch` method to safely extract content even when result objects have unexpected structures:

```python
# Safely extract content with multiple fallbacks
content = ""
if hasattr(result, 'rawContent'):
    content = result.rawContent or ""
if not content and hasattr(result, 'content'):
    content = result.content or ""
```
### 2. NoneType Error When Logging Content Length

**Problem:**

- Code attempted to evaluate `len(first_result.get('raw_content', ''))` for logging
- When the `raw_content` key existed but its value was `None`, `.get()` returned `None` instead of the default `''`
- This caused `len(None)` to fail with `TypeError: object of type 'NoneType' has no len()`
**Fix:**

Changed the logging code to safely handle `None` values:

```python
# Before (line 338):
logger.debug(f"First result has raw_content: {'raw_content' in first_result}, content length: {len(first_result.get('raw_content', ''))}")

# After:
raw_content = first_result.get('raw_content') or ''
logger.debug(f"First result has raw_content: {'raw_content' in first_result}, content length: {len(raw_content)}")
```
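The failure and the fix can be reproduced in isolation:

```python
first_result = {"raw_content": None}  # key present, value explicitly None

# Old pattern: .get() ignores the default because the key exists, so len(None) raises.
try:
    length = len(first_result.get('raw_content', ''))
except TypeError:
    length = -1  # TypeError: object of type 'NoneType' has no len()

# Fixed pattern: coerce None to '' before calling len().
raw_content = first_result.get('raw_content') or ''
safe_length = len(raw_content)
```

The first branch always hits the `TypeError` handler for this input, while `safe_length` is simply `0`.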
### 3. Missing Error Recovery in Content Extraction

**Problem:**

- When processing search results, if one result failed to extract, the entire extraction could fail
- There was no recovery mechanism to extract at least URLs when content extraction failed
- Errors were logged but processing stopped, losing potentially useful data
**Fix:**

Added per-result error handling with recovery:

```python
for result in searchResults:
    try:
        # Extract URL, content, title safely
        # ... extraction logic ...
    except Exception as resultError:
        logger.warning(f"Error processing individual search result: {resultError}")
        # Continue processing other results instead of failing completely
        continue
```
Also added recovery at the extraction level:

```python
except Exception as extractionError:
    logger.error(f"Error extracting URLs and content from search results: {extractionError}")
    # Try to recover at least URLs
    try:
        urls = [result.url for result in searchResults if hasattr(result, 'url') and result.url]
        logger.info(f"Recovered {len(urls)} URLs after extraction error")
    except Exception:
        logger.error("Failed to recover any URLs from search results")
```
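Putting the per-result pattern together, a runnable sketch (the dict-shaped results and the `extractResults` helper are illustrative stand-ins, not the plugin's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def extractResults(searchResults):
    """Process each result independently so one malformed entry cannot fail the batch."""
    extracted = []
    for result in searchResults:
        try:
            url = result["url"]  # raises KeyError on malformed entries
            content = result.get("raw_content") or result.get("content") or ""
            extracted.append({"url": url, "content": content})
        except Exception as resultError:
            logger.warning(f"Error processing individual search result: {resultError}")
            continue  # keep going instead of failing the whole batch
    return extracted

results = extractResults([
    {"url": "https://a.example", "raw_content": None, "content": "summary A"},
    {"title": "no url key"},  # malformed: logged and skipped, not fatal
    {"url": "https://b.example", "raw_content": "full B"},
])
```

The malformed middle entry is skipped with a warning, so two of the three results survive.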
### 4. Incomplete Crawl Result Processing

**Problem:**

- When a crawl returned results but processing of an individual page failed, the entire crawl was lost
- There was no fallback to extract at least URLs from failed crawl results
- Missing content fields could cause errors when formatting results
**Fix:**

Added error handling for individual page processing:

```python
for i, result in enumerate(crawlResults, 1):
    try:
        # Format page content
        # ... formatting logic ...
    except Exception as pageError:
        logger.warning(f"Error formatting page {i} from crawl: {pageError}")
        # Try to add at least the URL
        try:
            pageUrls.append(result.url if hasattr(result, 'url') and result.url else webCrawlPrompt.url)
        except Exception:
            pass
```
Also ensured all result fields have safe defaults:

```python
results.append(WebCrawlResult(
    url=result_url or url,         # Fallback to base URL
    content=result_content,        # Already ensured to be a string
    title=result_title             # Already ensured to be a string
))
```
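A standalone sketch of the safe-defaults idea, with a simplified `WebCrawlResult` dataclass and a hypothetical `buildCrawlResult` helper standing in for the real model:

```python
from dataclasses import dataclass

@dataclass
class WebCrawlResult:  # simplified stand-in for the real result model
    url: str
    content: str
    title: str

def buildCrawlResult(result_url, result_content, result_title, baseUrl):
    """Coerce every field to a string before constructing the result."""
    return WebCrawlResult(
        url=result_url or baseUrl,     # fall back to the crawl's base URL
        content=result_content or "",  # never None
        title=result_title or "",      # never None
    )

page = buildCrawlResult(None, None, "Home", "https://example.com")
```

Even with `None` for both the URL and the content, the constructed result carries the base URL and an empty string rather than propagating `None`.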
## Impact

These fixes ensure that:

1. **Content is always extracted** - Even when `raw_content` is `None`, the system falls back to the `content` field or an empty string
2. **Partial results are preserved** - If some results fail, the others are still processed and returned
3. **URLs are recovered** - Even when content extraction fails completely, URLs can still be extracted for crawling
4. **No crashes from `None` values** - All `None` values are handled before operations such as `len()` are called
## Testing Recommendations

- Test with Tavily search results that have `raw_content` set to `None`
- Test with mixed results (some with content, some without)
- Test error recovery when individual results fail
- Verify that URLs are still extracted even when content extraction fails
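The first two cases can be pinned down with plain assertions (the `extractText` helper is a hypothetical isolation of the fixed fallback expression, not a function from the plugin):

```python
def extractText(result):
    """Hypothetical isolation of the fixed fallback chain from aicorePluginTavily.py."""
    return result.get("raw_content") or result.get("content") or ""

# raw_content explicitly None: falls back to content.
assert extractText({"raw_content": None, "content": "summary"}) == "summary"
# Mixed worst case, both None: empty string, never None.
assert extractText({"raw_content": None, "content": None}) == ""
# raw_content populated: preferred over content.
assert extractText({"raw_content": "full", "content": "summary"}) == "full"
```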