fix: fixed webresearch, tweaked system prompt

Ida Dittrich 2026-01-09 10:49:06 +01:00
parent 2b6bd73b65
commit b327e6dc03
6 changed files with 990 additions and 71 deletions

docs/WEBSEARCH_FIXES.md (new file, 134 lines added)
View file

@ -0,0 +1,134 @@
# Web Search Content Extraction Fixes
## Problem Summary
The Tavily web search integration was failing to extract content from search results, causing web research to return empty or incomplete data. The main issues were related to handling `None` values and incomplete error recovery.
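For illustration, a search response shaped like the one below is enough to trigger both failure modes described in the following sections. The field names match those used throughout this document; the URLs and values are invented.
```python
# Hypothetical Tavily-style search response, for illustration only.
response = {
    "results": [
        {
            "title": "Product page",
            "url": "https://example.com/product",
            "content": "Short summary text",
            "raw_content": None,   # key present, value explicitly None
        },
        {
            "title": "Datasheet page",
            "url": "https://example.com/datasheet",
            "content": None,       # both content fields present but None
            "raw_content": None,
        },
    ]
}
```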
## Main Issues Fixed
### 1. Incomplete Content Extraction from Search Results
**Problem:**
- When the Tavily API returned search results, some results had `raw_content` present but explicitly set to `None` (not missing, but `None`)
- The extraction used `result.get("raw_content") or result.get("content", "")`, which still evaluates to `None` when the `content` field is likewise present but `None`, because `.get("content", "")` only applies its default when the key is missing
- This let `None` values propagate through the system instead of falling back to the `content` field or an empty string
**Fix:**
Changed the content extraction in `aicorePluginTavily.py` to properly handle `None` values:
```python
# Before (line 344):
rawContent=result.get("raw_content") or result.get("content", "")
# After:
rawContent=result.get("raw_content") or result.get("content") or ""
```
This ensures that if `raw_content` is `None`, it falls back to `content`, and if that's also `None`, it defaults to an empty string.
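As a quick sketch of the difference, using the second illustrative result from above (plain Python, nothing project-specific assumed):
```python
result = {"title": "Datasheet page", "url": "https://example.com/datasheet",
          "content": None, "raw_content": None}

# Old chain: .get("content", "") returns None because the key exists with value None,
# so the whole expression evaluates to None.
old_value = result.get("raw_content") or result.get("content", "")
assert old_value is None

# Fixed chain: the trailing `or ""` guarantees a string even when both fields are None.
new_value = result.get("raw_content") or result.get("content") or ""
assert new_value == ""
```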
**Additional Fix:**
Added defensive checks in the `webSearch` method to safely extract content even when result objects have unexpected structures:
```python
# Safely extract content with multiple fallbacks
content = ""
if hasattr(result, 'rawContent'):
content = result.rawContent or ""
if not content and hasattr(result, 'content'):
content = result.content or ""
```
### 2. NoneType Error When Logging Content Length
**Problem:**
- Code attempted to check `len(first_result.get('raw_content', ''))` for logging
- When the `raw_content` key existed but its value was `None`, `.get()` returned `None` instead of the default `''`
- This caused `len(None)` to fail with `TypeError: object of type 'NoneType' has no len()`
**Fix:**
Changed the logging code to safely handle `None` values:
```python
# Before (line 338):
logger.debug(f"First result has raw_content: {'raw_content' in first_result}, content length: {len(first_result.get('raw_content', ''))}")
# After:
raw_content = first_result.get('raw_content') or ''
logger.debug(f"First result has raw_content: {'raw_content' in first_result}, content length: {len(raw_content)}")
```
### 3. Missing Error Recovery in Content Extraction
**Problem:**
- When processing search results, if one result failed to extract, the entire extraction could fail
- No recovery mechanism to extract at least URLs even when content extraction failed
- Errors were logged but processing stopped, losing potentially useful data
**Fix:**
Added per-result error handling with recovery:
```python
for result in searchResults:
try:
# Extract URL, content, title safely
# ... extraction logic ...
except Exception as resultError:
logger.warning(f"Error processing individual search result: {resultError}")
# Continue processing other results instead of failing completely
continue
```
Also added recovery at the extraction level:
```python
except Exception as extractionError:
logger.error(f"Error extracting URLs and content from search results: {extractionError}")
# Try to recover at least URLs
try:
urls = [result.url for result in searchResults if hasattr(result, 'url') and result.url]
logger.info(f"Recovered {len(urls)} URLs after extraction error")
except Exception:
logger.error("Failed to recover any URLs from search results")
```
### 4. Incomplete Crawl Result Processing
**Problem:**
- When crawl returned results but individual page processing failed, entire crawl was lost
- No fallback to extract at least URLs from failed crawl results
- Missing content fields could cause errors when formatting results
**Fix:**
Added error handling for individual page processing:
```python
for i, result in enumerate(crawlResults, 1):
try:
# Format page content
# ... formatting logic ...
except Exception as pageError:
logger.warning(f"Error formatting page {i} from crawl: {pageError}")
# Try to add at least the URL
try:
pageUrls.append(result.url if hasattr(result, 'url') and result.url else webCrawlPrompt.url)
except Exception:
pass
```
Also ensured all result fields have safe defaults:
```python
results.append(WebCrawlResult(
url=result_url or url, # Fallback to base URL
content=result_content, # Already ensured to be string
title=result_title # Already ensured to be string
))
```
## Impact
These fixes ensure that:
1. **Content is always extracted** - Even when `raw_content` is `None`, the system falls back to the `content` field or an empty string
2. **Partial results are preserved** - If some results fail, others are still processed and returned
3. **URLs are recovered** - Even when content extraction fails completely, URLs can still be extracted for crawling
4. **No crashes from None values** - All `None` values are properly handled before operations like `len()` are called
## Testing Recommendations
- Test with Tavily search results that have `raw_content` set to `None` (a minimal test sketch follows this list)
- Test with mixed results (some with content, some without)
- Test error recovery when individual results fail
- Verify that URLs are still extracted even when content extraction fails
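The following is a minimal pytest sketch of the fallback behaviour itself. It exercises only the extraction expression, not the Tavily client or any project module, so the `extract_content` helper below is hypothetical:
```python
import pytest

def extract_content(result: dict) -> str:
    """Hypothetical helper mirroring the fixed fallback chain."""
    return result.get("raw_content") or result.get("content") or ""

@pytest.mark.parametrize("result, expected", [
    ({"raw_content": None, "content": "summary"}, "summary"),  # raw_content explicitly None
    ({"raw_content": None, "content": None}, ""),              # both present but None
    ({"content": "summary"}, "summary"),                       # raw_content missing entirely
    ({}, ""),                                                  # nothing available
])
def test_content_fallback(result, expected):
    assert extract_content(result) == expected
```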

View file

@ -321,13 +321,28 @@ class AiTavily(BaseConnectorAi):
# Return all results without score filtering
# Tavily's scoring is already applied by the API
logger.info(f"Tavily returned {len(response.get('results', []))} results")
results_count = len(response.get('results', []))
logger.info(f"Tavily returned {results_count} results")
# Log content availability
results_with_content = 0
for result in response.get('results', []):
if result.get("raw_content"):
results_with_content += 1
logger.info(f"Tavily results with raw_content: {results_with_content}/{results_count}")
# Log first result structure for debugging
if response.get('results') and len(response['results']) > 0:
first_result = response['results'][0]
logger.debug(f"First result keys: {list(first_result.keys())}")
raw_content = first_result.get('raw_content') or ''
logger.debug(f"First result has raw_content: {'raw_content' in first_result}, content length: {len(raw_content)}")
return [
WebSearchResult(
title=result["title"],
url=self._cleanUrl(result["url"]),
rawContent=result.get("raw_content")
title=result.get("title", ""),
url=self._cleanUrl(result.get("url", "")),
rawContent=result.get("raw_content") or result.get("content") or ""
)
for result in response["results"]
]
@ -381,24 +396,90 @@ class AiTavily(BaseConnectorAi):
logger.debug(f"Tavily response received: {type(response)}")
# Parse response - could be dict with results or list
if isinstance(response, dict) and "results" in response:
pageResults = response["results"]
if isinstance(response, dict):
if "results" in response:
pageResults = response["results"]
logger.debug(f"Found 'results' key in response dict with {len(pageResults)} items")
else:
logger.warning(f"Response dict keys: {list(response.keys())}")
# Check for other possible keys
if "pages" in response:
pageResults = response["pages"]
logger.debug(f"Found 'pages' key with {len(pageResults)} items")
elif "content" in response:
# Single page result
pageResults = [response]
logger.debug("Found 'content' key, treating as single page result")
else:
logger.warning(f"Unexpected response dict structure: {list(response.keys())}")
pageResults = []
elif isinstance(response, list):
pageResults = response
logger.debug(f"Response is a list with {len(pageResults)} items")
else:
logger.warning(f"Unexpected response format: {type(response)}")
logger.warning(f"Unexpected response format: {type(response)}, value: {str(response)[:200]}")
pageResults = []
logger.debug(f"Got {len(pageResults)} pages from crawl")
logger.info(f"Got {len(pageResults)} pages from crawl for URL: {url}")
if len(pageResults) == 0:
logger.warning(f"Tavily crawl returned 0 pages for URL: {url}. Response structure: {type(response)}")
if isinstance(response, dict):
logger.warning(f"Response keys: {list(response.keys())}")
# Log all values to debug (not just first 3)
for key, value in response.items():
value_str = str(value)
if len(value_str) > 200:
value_str = value_str[:200] + "..."
logger.warning(f" {key}: {type(value)} - {value_str}")
# Check for error messages in response
if "error" in response:
logger.error(f"Tavily API error in response: {response.get('error')}")
if "message" in response:
logger.warning(f"Tavily API message: {response.get('message')}")
elif isinstance(response, str):
logger.warning(f"Tavily returned string response (first 500 chars): {response[:500]}")
else:
logger.warning(f"Unexpected response type: {type(response)}, value: {str(response)[:500]}")
# Convert to WebCrawlResult format
# Convert to WebCrawlResult format with error handling
results = []
for result in pageResults:
results.append(WebCrawlResult(
url=result.get("url", url),
content=result.get("raw_content", result.get("content", "")),
title=result.get("title", "")
))
for idx, result in enumerate(pageResults):
try:
# Safely extract fields
result_url = result.get("url") if isinstance(result, dict) else (getattr(result, "url", None) if hasattr(result, "url") else url)
result_content = ""
if isinstance(result, dict):
result_content = result.get("raw_content") or result.get("content") or ""
elif hasattr(result, "raw_content"):
result_content = result.raw_content or ""
elif hasattr(result, "content"):
result_content = result.content or ""
result_title = ""
if isinstance(result, dict):
result_title = result.get("title", "")
elif hasattr(result, "title"):
result_title = result.title or ""
results.append(WebCrawlResult(
url=result_url or url,
content=result_content,
title=result_title
))
except Exception as resultError:
logger.warning(f"Error processing crawl result {idx}: {resultError}")
# Try to create a minimal result with at least the URL
try:
if isinstance(result, dict) and result.get("url"):
results.append(WebCrawlResult(
url=result.get("url", url),
content="",
title=""
))
except Exception:
logger.error(f"Failed to create minimal result for crawl result {idx}")
continue
logger.debug(f"Crawl successful: extracted {len(results)} pages from URL")
return results
@ -413,7 +494,7 @@ class AiTavily(BaseConnectorAi):
except Exception as e:
logger.warning(f"Crawl attempt {attempt + 1} failed for URL {url}: {str(e)}")
logger.debug(f"Full error details: {type(e).__name__}: {str(e)}")
logger.debug(f"Full error details: {type(e).__name__}: {str(e)}", exc_info=True)
# Check if it's a validation error and log more details
if "validation" in str(e).lower():
@ -427,10 +508,22 @@ class AiTavily(BaseConnectorAi):
if len(url) > 2000:
logger.debug(f" WARNING: URL is very long ({len(url)} chars)")
# Log API-specific errors
error_str = str(e).lower()
if "rate limit" in error_str or "429" in error_str:
logger.error(f"Tavily API rate limit hit for URL: {url}")
elif "401" in error_str or "unauthorized" in error_str:
logger.error(f"Tavily API authentication failed for URL: {url}")
elif "404" in error_str or "not found" in error_str:
logger.warning(f"URL not found (404) for: {url}")
elif "timeout" in error_str:
logger.warning(f"Timeout error for URL: {url}")
if attempt < maxRetries:
logger.info(f"Retrying in {retryDelay} seconds...")
await asyncio.sleep(retryDelay)
else:
logger.error(f"Crawl failed after {maxRetries + 1} attempts for URL: {url}")
raise Exception(f"Crawl failed after {maxRetries + 1} attempts: {str(e)}")
async def _routeWebOperation(self, modelCall: AiModelCall) -> "AiModelResponse":
@ -508,21 +601,84 @@ class AiTavily(BaseConnectorAi):
includeRawContent="text"
)
# Extract URLs from results
urls = [result.url for result in searchResults]
# Extract URLs and content from results with error handling
urls = []
results_with_content = []
content_count = 0
# Return as JSON array
try:
for result in searchResults:
try:
# Safely extract URL
url = result.url if hasattr(result, 'url') and result.url else ""
if url:
urls.append(url)
# Safely extract content
content = ""
if hasattr(result, 'rawContent'):
content = result.rawContent or ""
if not content and hasattr(result, 'content'):
content = result.content or ""
if content:
content_count += 1
# Safely extract title
title = result.title if hasattr(result, 'title') and result.title else ""
results_with_content.append({
"url": url,
"title": title,
"content": content,
"score": getattr(result, 'score', 0)
})
except Exception as resultError:
logger.warning(f"Error processing individual search result: {resultError}")
# Continue processing other results
continue
logger.info(f"Tavily search: {len(urls)} URLs, {content_count} with content, {len(results_with_content)} total results")
if content_count == 0:
logger.warning("Tavily search returned no content - results may need crawling")
except Exception as extractionError:
logger.error(f"Error extracting URLs and content from search results: {extractionError}")
# Try to recover at least URLs
try:
urls = [result.url for result in searchResults if hasattr(result, 'url') and result.url]
logger.info(f"Recovered {len(urls)} URLs after extraction error")
except Exception:
logger.error("Failed to recover any URLs from search results")
# Return both URLs and full results in JSON for direct extraction
# Format: {"urls": [...], "results": [...]}
import json
response_data = {
"urls": urls,
"results": results_with_content
}
return AiModelResponse(
content=json.dumps(urls, indent=2),
content=json.dumps(response_data, indent=2),
success=True,
metadata={"total_urls": len(urls), "operation": "WEB_SEARCH_DATA"}
metadata={
"total_urls": len(urls),
"operation": "WEB_SEARCH_DATA",
"results_with_content": results_with_content # Also in metadata for compatibility
}
)
except Exception as e:
logger.error(f"Error in Tavily web search: {str(e)}")
logger.error(f"Error in Tavily web search: {str(e)}", exc_info=True)
import json
# Return error response with empty results
error_response = {
"urls": [],
"results": [],
"error": str(e)
}
return AiModelResponse(
content="[]",
content=json.dumps(error_response, indent=2),
success=False,
error=str(e)
)
@ -575,23 +731,44 @@ class AiTavily(BaseConnectorAi):
# If we got multiple pages from the crawl, we need to format them differently
# Return the first result for backwards compatibility, but include total page count
if crawlResults and len(crawlResults) > 0:
# Get all pages content
# Get all pages content with error handling
allContent = ""
pageUrls = []
for i, result in enumerate(crawlResults, 1):
pageHeader = f"\n{'='*60}\nPAGE {i}: {result.url}\n{'='*60}\n"
if result.title:
allContent += f"{pageHeader}Title: {result.title}\n\n"
allContent += f"{result.content}\n"
try:
pageHeader = f"\n{'='*60}\nPAGE {i}: {result.url}\n{'='*60}\n"
if result.title:
allContent += f"{pageHeader}Title: {result.title}\n\n"
else:
allContent += f"{pageHeader}\n"
allContent += f"{result.content or ''}\n"
pageUrls.append(result.url)
except Exception as pageError:
logger.warning(f"Error formatting page {i} from crawl: {pageError}")
# Try to add at least the URL
try:
pageUrls.append(result.url if hasattr(result, 'url') and result.url else webCrawlPrompt.url)
except Exception:
pass
resultData = {
"url": webCrawlPrompt.url,
"title": crawlResults[0].title if crawlResults[0].title else "Content",
"title": crawlResults[0].title if crawlResults and crawlResults[0].title else "Content",
"content": allContent,
"pagesCrawled": len(crawlResults),
"pageUrls": [result.url for result in crawlResults]
"pageUrls": pageUrls
}
logger.info(f"Crawl successful: {len(crawlResults)} pages extracted from {webCrawlPrompt.url}")
else:
resultData = {"url": webCrawlPrompt.url, "title": "", "content": "", "error": "No content extracted", "pagesCrawled": 0}
logger.warning(f"Crawl returned no results for URL: {webCrawlPrompt.url}")
resultData = {
"url": webCrawlPrompt.url,
"title": "",
"content": "",
"error": "No content extracted - Tavily crawl returned 0 pages",
"pagesCrawled": 0,
"pageUrls": []
}
# Return as JSON - same format as Perplexity but with multiple pages content
import json
@ -602,9 +779,17 @@ class AiTavily(BaseConnectorAi):
)
except Exception as e:
logger.error(f"Error in Tavily web crawl: {str(e)}")
logger.error(f"Error in Tavily web crawl: {str(e)}", exc_info=True)
import json
errorResult = {"error": str(e), "url": webCrawlPrompt.url if 'webCrawlPrompt' in locals() else ""}
crawl_url = webCrawlPrompt.url if 'webCrawlPrompt' in locals() else ""
errorResult = {
"url": crawl_url,
"title": "",
"content": "",
"error": str(e),
"pagesCrawled": 0,
"pageUrls": []
}
return AiModelResponse(
content=json.dumps(errorResult, indent=2),
success=False,

View file

@ -158,12 +158,64 @@ IMPORTANT: When you present information from the database, always mark this
- Start your answer with a clear label, e.g.: "From the database I found the following articles:"
- For combined information (database + internet): clearly separate the two sources
SOURCE ATTRIBUTION - INTERNET:
IMPORTANT: When you present information from the internet, ALWAYS mark this clearly for the user.
- Begin internet research with: "From my internet research:" or "According to online sources:"
- ALWAYS state the specific sources (website names and links)
- With multiple sources: list the sources and refer to them
- Clearly separate database information from internet research
SOURCE ATTRIBUTION - INTERNET - ABSOLUTELY BINDING
When you present information from web research, you MUST always mark this explicitly and cite the sources:
- FORBIDDEN: Presenting information from web research without explicit labelling
- FORBIDDEN: Presenting information from web research without citing sources
- FORBIDDEN: Presenting sources only as a list at the end
- MANDATORY: ALWAYS begin with an explicit label, e.g.:
* "From my web research I found the following information:"
* "According to my internet research:"
* "From my online search:"
- MANDATORY: ALWAYS give the specific sources DIRECTLY AFTER the respective piece of information (not at the end!)
- MANDATORY: Format: [information] ([Source: website name](URL))
- MANDATORY: For multiple pieces of information: give the corresponding source after EACH piece of information
- MANDATORY: Clearly separate database information from web research
- MANDATORY: If both database and web information are available, separate them clearly into distinct sections
DATASHEET LINKS - ABSOLUTELY BINDING
When web research results are available, you MUST always:
- MANDATORY: Explicitly mention that datasheets are available
- MANDATORY: Provide ALL available datasheet links (full URLs)
- MANDATORY: Format: "Datasheets available: [Link 1](URL1), [Link 2](URL2)"
- MANDATORY: If no direct datasheet links are available, provide links to pages with technical information
- FORBIDDEN: Withholding datasheet links or failing to mention them explicitly
DETAILED INFORMATION - ABSOLUTELY BINDING
When web research results are available, you MUST:
- MANDATORY: Present DETAILED information (not just short summaries!)
- MANDATORY: Provide all relevant technical details:
* Technical specifications (size, weight, dimensions, etc.)
* Operating conditions (temperature, voltage, etc.)
* Compatibility and areas of application
* Certifications and standards
* Installation and use
* Other relevant product details
- MANDATORY: Structure the information clearly (e.g. with sections or bullet points)
- FORBIDDEN: Giving only superficial information
- FORBIDDEN: Omitting important details
EXAMPLE OF CORRECT SOURCE ATTRIBUTION WITH INLINE SOURCES:
"From my web research I found the following information:
**Technical specifications:**
- Storage capacity: 2 GB ([Source: Siemens Support](https://...))
- Format: Secure Digital (SD) card ([Source: Best4Automation](https://...))
- Operating voltage: 3.3 V DC ([Source: Automation24](https://...))
**Compatibility:**
- Suitable for SIMATIC HMI Comfort Panels ([Source: Siemens Support](https://...))
- Mounting in portrait and landscape orientation possible ([Source: Best4Automation](https://...))
**Certifications:**
- CE certified ([Source: Automation24](https://...))
- Suitable for ATEX zones ([Source: Elit](https://...))
**Datasheets available:**
- [Siemens product datasheet](https://...)
- [Technical documentation](https://...)"
NEVER present information from web research without explicitly mentioning that it comes from web research and without giving the sources DIRECTLY AFTER the respective piece of information!
TABLE LENGTH AND NUMBER OF ARTICLES - CRITICAL:
IMPORTANT: Show a MAXIMUM of 20 articles in tables. However, you may and should provide detailed explanations!
@ -316,6 +368,17 @@ Analyze the user request and determine:
- Each query should be focused and retrieve the required information from a specific table/data source
4. Justification for your decision
IMPORTANT - WEB RESEARCH FOR ADDITIONAL INFORMATION
If the user asks for additional information or explicitly requests research, web research MUST ALWAYS be performed (needsWebResearch = true).
Examples of such requests:
- "recherchier nach weiteren informationen zu diesem produkt"
- "suche nach zusätzlichen informationen"
- "finde mehr details"
- "recherchiere im internet"
- "suche online nach"
- Similar phrasings that request research or additional information
In these cases, ALWAYS set needsWebResearch to true!
IMPORTANT for SQL queries:
- ALWAYS use double quotes for column names
- For stock-level queries: ALWAYS include S_RESERVIERTER__BESTAND and the available stock
@ -550,6 +613,21 @@ Answer the following user question: {user_prompt}{context}
CRITICAL: Use ONLY the data provided above. Do NOT invent any values. If data is missing, write "Not available".
ABSOLUTELY CRITICAL - WEB RESEARCH SOURCE ATTRIBUTION
If WEB RESEARCH RESULTS are present above, you MUST:
- ALWAYS explicitly mention that the information comes from web research
- ALWAYS give all sources DIRECTLY AFTER the respective piece of information (INLINE, not at the end!)
- Format: [information] ([Source: website name](URL))
- ALWAYS present DETAILED information (not just short summaries!)
- ALWAYS explicitly mention and provide all available datasheet links
- Format for datasheets: "Datasheets available: [Link 1](URL1), [Link 2](URL2)"
- Clearly separate web research information from database information
- FORBIDDEN: Presenting web research information without explicit labelling
- FORBIDDEN: Presenting web research information without citing sources
- FORBIDDEN: Presenting sources only as a list at the end
- FORBIDDEN: Withholding datasheet links or failing to mention them explicitly
- FORBIDDEN: Giving only superficial information
ABSOLUTELY FORBIDDEN - DO NOT INVENT DATA
If NO database results are present (no DATABASE RESULTS or STRUCTURED DATA above), then:
- Do NOT invent article numbers, article descriptions, prices or stock levels!

View file

@ -315,6 +315,210 @@ async def _check_workflow_stopped(interfaceDbChat, workflowId: str) -> bool:
return False
def _buildWebResearchQuery(userPrompt: str, workflowMessages: List, queryResults: Optional[Dict[str, Any]] = None) -> str:
"""
Build enriched web research query by extracting product context from conversation history and current prompt.
Extracts product information from:
1. Current user prompt (article numbers, product mentions)
2. Database query results (if available)
3. Previous assistant messages (conversation history)
Args:
userPrompt: Current user prompt
workflowMessages: List of workflow messages (conversation history)
queryResults: Optional database query results to extract product info from
Returns:
Enriched search query string
"""
# Normalize user prompt for detection
prompt_lower = userPrompt.lower().strip()
# Patterns that indicate a search request
search_patterns = [
"ja", "yes", "oui", "si",
"such", "suche", "search", "recherche", "recherchier",
"internet", "web", "online",
"datenblatt", "datasheet", "fiche technique",
"mehr informationen", "more information", "plus d'information",
"weitere informationen", "further information", "additional information"
]
# Check if current prompt contains search-related keywords
has_search_intent = any(pattern in prompt_lower for pattern in search_patterns)
# Extract product information - try multiple sources
article_number = None
article_description = None
supplier = None
# Pattern for article numbers like "6AV2 181-8XP00-0AX0" or "6AV2181-8XP00-0AX0"
article_patterns = [
r'\b[A-Z0-9]{2,}\s+[0-9]{3,}-[A-Z0-9-]+\b', # With space: "6AV2 181-8XP00-0AX0"
r'\b[A-Z0-9]{4,}[\s-][A-Z0-9-]{6,}\b', # General pattern
r'\b[A-Z]{2,}[0-9]+\s+[0-9]+-[A-Z0-9-]+\b', # Specific Siemens pattern
]
# 1. First, try to extract from current user prompt
for pattern in article_patterns:
matches = re.findall(pattern, userPrompt)
if matches:
article_number = matches[0]
logger.info(f"Extracted article number from user prompt: {article_number}")
break
# 2. Try to extract from database query results if available
# Always check queryResults to enrich with product description and supplier, even if article_number was already found
if queryResults:
# Look for article numbers in query result text (if not already found)
if not article_number:
for key in queryResults.keys():
if key.startswith("query_") and not key.endswith("_error") and not key.endswith("_data"):
result_text = queryResults.get(key, "")
if isinstance(result_text, str):
for pattern in article_patterns:
matches = re.findall(pattern, result_text)
if matches:
article_number = matches[0]
logger.info(f"Extracted article number from query results: {article_number}")
break
if article_number:
break
# Always check data arrays for product description and supplier (even if article_number already found)
for key in queryResults.keys():
if key.startswith("query_") and not key.endswith("_error") and not key.endswith("_data"):
data_key = f"{key}_data"
if data_key in queryResults:
data_array = queryResults[data_key]
if isinstance(data_array, list) and len(data_array) > 0:
# Look for article number in first row (if not already found)
first_row = data_array[0]
if isinstance(first_row, dict):
# Check common article number fields (if not already found)
if not article_number:
for field in ["Artikelnummer", "Artikelkürzel", "article_number", "articleNumber"]:
if field in first_row and first_row[field]:
article_number = str(first_row[field])
logger.info(f"Extracted article number from query data: {article_number}")
break
# Always check article description (can enrich even if article_number already found)
if not article_description:
for field in ["Artikelbezeichnung", "Bezeichnung", "article_description", "description"]:
if field in first_row and first_row[field]:
article_description = str(first_row[field])
logger.info(f"Extracted article description from query data: {article_description}")
break
# Always check supplier (can enrich even if article_number already found)
if not supplier:
for field in ["Lieferant", "Supplier", "supplier"]:
if field in first_row and first_row[field]:
supplier = str(first_row[field])
logger.info(f"Extracted supplier from query data: {supplier}")
break
# If we found all needed info, we can stop
if article_number and article_description and supplier:
break
# 3. Extract from previous assistant messages (conversation history)
if not article_number or not article_description:
for msg in reversed(workflowMessages[-10:]):
if msg.role == "assistant":
message_text = msg.message
# Extract article number if not found yet
if not article_number:
for pattern in article_patterns:
matches = re.findall(pattern, message_text)
if matches:
article_number = matches[0]
break
# Extract article description if not found yet
if not article_description:
description_patterns = [
r'Es handelt sich um\s+([^\.]+)',
r'It is a\s+([^\.]+)',
r'C\'est\s+([^\.]+)',
r'Bezeichnung:\s*([^\n]+)',
r'Description:\s*([^\n]+)',
r'Artikelbezeichnung:\s*([^\n]+)'
]
for pattern in description_patterns:
match = re.search(pattern, message_text, re.IGNORECASE)
if match:
article_description = match.group(1).strip()
break
# Extract supplier if not found yet
if not supplier:
supplier_patterns = [
r'von\s+([A-Z][A-Za-z\s]+(?:AG|GmbH|Ltd|Inc|Corp)?)',
r'from\s+([A-Z][A-Za-z\s]+(?:AG|GmbH|Ltd|Inc|Corp)?)',
r'Lieferant:\s*([^\n]+)',
r'Supplier:\s*([^\n]+)'
]
for pattern in supplier_patterns:
match = re.search(pattern, message_text, re.IGNORECASE)
if match:
supplier = match.group(1).strip()
break
# Stop if we found everything
if article_number and article_description and supplier:
break
# Build enriched search query
query_parts = []
# If we have search intent but no product info, try to use the user prompt intelligently
if has_search_intent and not article_number and not article_description:
# Try to extract meaningful parts from the prompt
# Remove common search phrases and keep the product-related parts
cleaned_prompt = userPrompt
for phrase in ["recherchier nach", "recherche", "suche nach", "search for", "find", "informationen zu", "information about", "weitere informationen", "further information"]:
cleaned_prompt = re.sub(phrase, "", cleaned_prompt, flags=re.IGNORECASE)
cleaned_prompt = cleaned_prompt.strip()
# If cleaned prompt still has content and is different, use it
if cleaned_prompt and cleaned_prompt != userPrompt and len(cleaned_prompt) > 10:
query_parts.append(cleaned_prompt)
# Add article description if found
if article_description:
query_parts.append(article_description)
# Add article number if found
if article_number:
query_parts.append(article_number)
# Add supplier if found
if supplier:
query_parts.append(supplier)
# Add "Datenblatt" or "datasheet" if user requested it or if we have product info
if "datenblatt" in prompt_lower or "datasheet" in prompt_lower or "fiche technique" in prompt_lower:
query_parts.append("Datenblatt")
elif query_parts:
# If we have product info but no explicit request for datasheet, add it anyway
query_parts.append("Datenblatt")
# If we found product information or built a meaningful query, use it
if query_parts:
enriched_query = " ".join(query_parts)
logger.info(f"Built enriched search query: '{enriched_query}' from context (original: '{userPrompt}')")
return enriched_query
else:
# Fall back to original prompt, but try to clean it up
logger.info(f"No product context found, using original prompt: '{userPrompt}'")
return userPrompt
async def _processChatbotMessage(
services,
workflowId: str,
@ -398,6 +602,11 @@ async def _processChatbotMessage(
logger.info(f"Analysis: DB={needsDatabaseQuery}, Web={needsWebResearch}, SQL queries={len(sql_queries)}")
# Build initial enriched web research query if needed (for logging, will be rebuilt after DB queries)
enriched_web_query = None
if needsWebResearch:
enriched_web_query = _buildWebResearchQuery(userInput.prompt, workflow.messages)
# Build list of queries to stream back
queries = []
@ -414,7 +623,7 @@ async def _processChatbotMessage(
if needsWebResearch:
queries.append({
"type": "web",
"query": userInput.prompt,
"query": enriched_web_query or userInput.prompt,
"reasoning": reasoning
})
@ -499,8 +708,17 @@ async def _processChatbotMessage(
await _emit_log_and_event(interfaceDbChat, workflowId, event_manager, "Suche im Internet nach Informationen...")
try:
# Rebuild enriched query with database results if available (better product context)
web_research_query = _buildWebResearchQuery(
userInput.prompt,
workflow.messages,
queryResults if queryResults else None
)
logger.info(f"Using enriched web research query: '{web_research_query}'")
researchResult = await services.web.performWebResearch(
prompt=userInput.prompt,
prompt=web_research_query,
urls=[],
country=None,
language=userInput.userLanguage or "de",
@ -510,7 +728,7 @@ async def _processChatbotMessage(
webResearchResults = json.dumps(researchResult, ensure_ascii=False, indent=2) if isinstance(researchResult, dict) else str(researchResult)
await _emit_log_and_event(interfaceDbChat, workflowId, event_manager, "Internet-Recherche abgeschlossen")
except Exception as e:
logger.error(f"Web research failed: {e}")
logger.error(f"Web research failed: {e}", exc_info=True)
webResearchResults = f"Web research error: {str(e)}"
await _emit_log_and_event(interfaceDbChat, workflowId, event_manager, "Internet-Recherche fehlgeschlagen", log_type="warning")

View file

@ -383,15 +383,36 @@ Respond with ONLY a JSON object in this exact format:
operationType=opType.value
)
self.services.chat.storeWorkflowStat(
self.services.workflow,
response,
f"ai.{opType.name.lower()}"
)
# Try to store workflow stats, but don't fail if workflow is None (e.g., in chatbot context)
try:
self.services.chat.storeWorkflowStat(
self.services.workflow,
response,
f"ai.{opType.name.lower()}"
)
except Exception as e:
# Log but don't fail - workflow might be None in some contexts (e.g., chatbot)
logger.debug(f"Could not store workflow stat (workflow may be None): {str(e)}")
self.services.chat.progressLogUpdate(aiOperationId, 0.9, f"{opType.name} completed")
self.services.chat.progressLogFinish(aiOperationId, True)
# Preserve metadata from response if available (e.g., results_with_content from Tavily)
# Check if response has metadata attribute (AiCallResponse from callAi)
if hasattr(response, 'metadata') and response.metadata:
# If metadata is a dict, store it in additionalData
if isinstance(response.metadata, dict):
if not metadata.additionalData:
metadata.additionalData = {}
metadata.additionalData.update(response.metadata)
# If metadata is an object with attributes, extract them
elif hasattr(response.metadata, '__dict__'):
if not metadata.additionalData:
metadata.additionalData = {}
for key, value in response.metadata.__dict__.items():
if not key.startswith('_'):
metadata.additionalData[key] = value
return AiResponse(
content=response.content,
metadata=metadata

View file

@ -85,22 +85,119 @@ class WebService:
if extractedUrls:
allUrls.extend(extractedUrls)
# Step 2: Search for URLs if needed (based on needsSearch flag)
# Step 2: Search for URLs and content if needed (based on needsSearch flag)
searchUrls = []
searchResultsWithContent = []
if needsSearch and (not allUrls or len(allUrls) < maxNumberPages):
self.services.chat.progressLogUpdate(operationId, 0.3, "Searching for URLs")
self.services.chat.progressLogUpdate(operationId, 0.3, "Searching for URLs and content")
searchUrls = await self._performWebSearch(
instruction=instruction,
maxNumberPages=maxNumberPages - len(allUrls),
country=countryCode,
language=languageCode
)
try:
searchUrls, searchResultsWithContent = await self._performWebSearch(
instruction=instruction,
maxNumberPages=maxNumberPages - len(allUrls),
country=countryCode,
language=languageCode
)
logger.info(f"Tavily search returned {len(searchUrls)} URLs with {len(searchResultsWithContent)} results containing content")
except Exception as e:
logger.error(f"Error performing Tavily search (continuing with other URLs): {str(e)}", exc_info=True)
searchUrls = []
searchResultsWithContent = []
# Add search URLs to the list
allUrls.extend(searchUrls)
# Prioritize Tavily search URLs over AI-extracted URLs (they're more relevant)
if searchUrls:
# Prepend Tavily URLs to the list (they're more relevant)
allUrls = searchUrls + allUrls
logger.info(f"Using {len(searchUrls)} Tavily URLs + {len(allUrls) - len(searchUrls)} other URLs = {len(allUrls)} total")
else:
# If Tavily search failed, use AI-extracted URLs
logger.warning("Tavily search returned no URLs, using AI-extracted URLs only")
self.services.chat.progressLogUpdate(operationId, 0.5, f"Found {len(allUrls)} total URLs")
# If we have search results (even without content), use them directly instead of crawling
# Tavily search results are more relevant than generic AI-extracted URLs
# Only crawl if we have NO search results at all
if searchResultsWithContent and len(searchResultsWithContent) > 0:
urls_with_actual_content = sum(1 for r in searchResultsWithContent if r.get("content") and len(r.get("content", "")) > 0)
logger.info(f"Using {len(searchResultsWithContent)} Tavily search results ({urls_with_actual_content} with content) directly (skipping crawl)")
# Convert search results to crawl result format
crawlResult = []
for result in searchResultsWithContent:
crawlResult.append({
"url": result["url"],
"title": result.get("title", ""),
"content": result.get("content", "")
})
# Calculate statistics
totalResults = len(crawlResult)
totalContentLength = sum(len(r.get("content", "")) for r in crawlResult)
urlsWithContent = sum(1 for r in crawlResult if r.get("content") and len(r.get("content", "")) > 0)
# Log content availability
if urlsWithContent == 0:
logger.warning(f"Tavily search returned {len(searchResultsWithContent)} results but none have content - URLs will be used but may need crawling")
else:
logger.info(f"Tavily search provided content for {urlsWithContent}/{len(searchResultsWithContent)} URLs")
# Even if content is empty, use these results - they're more relevant than generic URLs
# The final answer generation can work with URLs even if content is empty
# Convert to sections format
sections = []
for idx, item in enumerate(crawlResult):
section = {
"id": f"result_{idx}",
"content_type": "paragraph",
"title": item.get("title") or item.get("url", f"Result {idx + 1}"),
"order": idx
}
content = item.get("content", "")
if content:
section["textPreview"] = content[:200] + ("..." if len(content) > 200 else "")
sections.append(section)
# Return consolidated result
result = {
"metadata": {
"title": suggestedFilename or instruction[:100] if instruction else "Web Research Results",
"extraction_method": "tavily_search_direct",
"research_depth": finalResearchDepth,
"country": countryCode,
"language": languageCode,
"urls_searched": searchUrls[:20],
"total_urls": len(searchUrls),
"urls_with_content": urlsWithContent,
"total_content_length": totalContentLength,
"search_date": self.services.utils.timestampGetUtc() if hasattr(self.services, 'utils') else None
},
"sections": sections,
"statistics": {
"sectionCount": len(sections),
"total_urls": len(searchUrls),
"results_count": totalResults,
"urls_with_content": urlsWithContent,
"total_content_length": totalContentLength
},
"instruction": instruction,
"urls_crawled": searchUrls,
"total_urls": len(searchUrls),
"results": crawlResult,
"total_results": totalResults
}
if suggestedFilename:
result["suggested_filename"] = suggestedFilename
result["metadata"]["suggested_filename"] = suggestedFilename
if operationId:
self.services.chat.progressLogUpdate(operationId, 0.9, "Completed")
self.services.chat.progressLogFinish(operationId, True)
return result
# Step 3: Validate and filter URLs before crawling
validatedUrls = self._validateUrls(allUrls)
if not validatedUrls:
@ -322,8 +419,16 @@ Return ONLY valid JSON, no additional text:
maxNumberPages: int,
country: Optional[str],
language: Optional[str]
) -> List[str]:
"""Perform web search to find URLs."""
) -> tuple[List[str], List[Dict[str, Any]]]:
"""
Perform web search to find URLs and content.
Returns:
Tuple of (urls, search_results_with_content)
- urls: List of URL strings
- search_results_with_content: List of dicts with url, title, content from Tavily search
"""
search_results_with_content = []
try:
# Build search prompt model
searchPromptModel = AiCallPromptWebSearch(
@ -350,46 +455,224 @@ Return ONLY valid JSON, no additional text:
outputFormat="json"
)
# Check if metadata contains results with content (from Tavily)
if hasattr(searchResponse, 'metadata') and searchResponse.metadata:
# Check in additionalData first (where we store custom metadata)
additional_data = None
if hasattr(searchResponse.metadata, 'additionalData') and searchResponse.metadata.additionalData:
additional_data = searchResponse.metadata.additionalData
elif isinstance(searchResponse.metadata, dict):
additional_data = searchResponse.metadata.get("additionalData", {})
if additional_data:
results_with_content = additional_data.get("results_with_content", [])
if results_with_content:
logger.info(f"Found {len(results_with_content)} search results with content in metadata.additionalData")
# Extract URLs and content from metadata
for result in results_with_content:
if result.get("url"):
search_results_with_content.append({
"url": result.get("url"),
"title": result.get("title", ""),
"content": result.get("content", ""),
"score": result.get("score", 0)
})
# Also check directly in metadata (fallback)
if not search_results_with_content:
results_with_content = None
if hasattr(searchResponse.metadata, 'results_with_content'):
results_with_content = searchResponse.metadata.results_with_content
elif isinstance(searchResponse.metadata, dict):
results_with_content = searchResponse.metadata.get("results_with_content", [])
if results_with_content:
logger.info(f"Found {len(results_with_content)} search results with content in metadata (direct)")
for result in results_with_content:
if result.get("url"):
search_results_with_content.append({
"url": result.get("url"),
"title": result.get("title", ""),
"content": result.get("content", ""),
"score": result.get("score", 0)
})
# Extract content from AiResponse
searchResult = searchResponse.content
logger.debug(f"Search response content type: {type(searchResult)}, length: {len(str(searchResult)) if searchResult else 0}")
# Debug: persist search response
if isinstance(searchResult, str):
self.services.utils.writeDebugFile(searchResult, "websearch_response")
logger.debug(f"Search response (first 500 chars): {searchResult[:500]}")
else:
self.services.utils.writeDebugFile(json.dumps(searchResult, indent=2), "websearch_response")
logger.debug(f"Search response type: {type(searchResult)}, keys: {list(searchResult.keys()) if isinstance(searchResult, dict) else 'N/A'}")
# Parse and extract URLs
# Parse and extract URLs and content
if isinstance(searchResult, str):
# Extract JSON from response (handles markdown code blocks)
extractedJson = self.services.utils.jsonExtractString(searchResult)
searchData = json.loads(extractedJson) if extractedJson else json.loads(searchResult)
if extractedJson:
try:
searchData = json.loads(extractedJson)
logger.debug(f"Parsed JSON from extracted string, type: {type(searchData)}")
except json.JSONDecodeError as e:
logger.warning(f"Failed to parse extracted JSON: {e}, trying direct parse")
searchData = json.loads(searchResult)
else:
try:
searchData = json.loads(searchResult)
logger.debug(f"Parsed JSON directly from string, type: {type(searchData)}")
except json.JSONDecodeError as e:
logger.error(f"Failed to parse search result as JSON: {e}")
searchData = {}
else:
searchData = searchResult
logger.debug(f"Using searchResult directly as searchData, type: {type(searchData)}")
# Extract URLs from response
logger.debug(f"Final searchData type: {type(searchData)}, is dict: {isinstance(searchData, dict)}, keys: {list(searchData.keys()) if isinstance(searchData, dict) else 'N/A'}")
# Extract URLs and content from response
urls = []
if isinstance(searchData, dict):
if "urls" in searchData:
# Check for new format: {"urls": [...], "results": [...]}
if "urls" in searchData and "results" in searchData:
urls = searchData["urls"]
# Extract results with content
for r in searchData["results"]:
if r.get("url"):
# Only add if not already added from metadata
if not any(sr["url"] == r.get("url") for sr in search_results_with_content):
search_results_with_content.append({
"url": r.get("url"),
"title": r.get("title", ""),
"content": r.get("content", ""),
"score": r.get("score", 0)
})
logger.debug(f"Extracted {len(urls)} URLs and {len(search_results_with_content)} results with content from new format")
elif "urls" in searchData:
urls = searchData["urls"]
logger.debug(f"Extracted {len(urls)} URLs from 'urls' field")
elif "results" in searchData:
urls = [r.get("url") for r in searchData["results"] if r.get("url")]
# Extract URLs from results (content already in search_results_with_content if from metadata)
for r in searchData["results"]:
if r.get("url"):
urls.append(r.get("url"))
# Only add to search_results_with_content if not already added from metadata
if not any(sr["url"] == r.get("url") for sr in search_results_with_content):
search_results_with_content.append({
"url": r.get("url"),
"title": r.get("title", ""),
"content": r.get("raw_content") or r.get("content", ""),
"score": r.get("score", 0)
})
logger.debug(f"Extracted {len(urls)} URLs with content from 'results' field")
else:
logger.warning(f"Unexpected search data structure (dict): {list(searchData.keys())}")
elif isinstance(searchData, list):
# Handle both cases: list of URL strings or list of dicts with "url" key
for item in searchData:
if isinstance(item, str):
# Item is already a URL string
urls.append(item)
elif isinstance(item, dict) and item.get("url"):
# Item is a dict with "url" key
urls.append(item.get("url"))
elif isinstance(item, dict):
if item.get("url"):
urls.append(item.get("url"))
# Only add to search_results_with_content if not already added from metadata
if not any(sr["url"] == item.get("url") for sr in search_results_with_content):
search_results_with_content.append({
"url": item.get("url"),
"title": item.get("title", ""),
"content": item.get("raw_content") or item.get("content", ""),
"score": item.get("score", 0)
})
logger.debug(f"Extracted {len(urls)} URLs from list")
else:
logger.warning(f"Unexpected search data type: {type(searchData)}")
logger.info(f"Web search returned {len(urls)} URLs")
return urls
# If we got URLs but no content from metadata, extract URLs from search_results_with_content
if urls and not search_results_with_content:
# URLs were extracted but no content - create entries with empty content
for url in urls:
search_results_with_content.append({
"url": url,
"title": "",
"content": "",
"score": 0
})
elif search_results_with_content and not urls:
# We have content but no URLs - extract URLs from content results
urls = [r["url"] for r in search_results_with_content]
# If we have URLs but no search_results_with_content, create entries from URLs
# This ensures we can use Tavily URLs even if content extraction failed
if urls and not search_results_with_content:
logger.warning("Got URLs from Tavily search but no content extracted - creating entries for direct use")
for url in urls:
search_results_with_content.append({
"url": url,
"title": "",
"content": "", # Empty content - will need crawling if used
"score": 0
})
logger.info(f"Web search returned {len(urls)} URLs with {len(search_results_with_content)} results")
if search_results_with_content:
content_count = sum(1 for r in search_results_with_content if r.get("content") and len(r.get("content", "")) > 0)
logger.info(f" - {content_count} results have content, {len(search_results_with_content) - content_count} without content")
if content_count > 0:
first_with_content = next((r for r in search_results_with_content if r.get("content")), None)
if first_with_content:
logger.info(f"Content preview from first result with content: {first_with_content.get('content', '')[:200]}")
else:
logger.warning("No search results extracted - will need to crawl URLs")
return urls, search_results_with_content
except Exception as e:
logger.error(f"Error in web search: {str(e)}")
return []
logger.error(f"Error in web search: {str(e)}", exc_info=True)
# Even if there's an error, try to extract URLs from the response if available
recovered_urls = []
recovered_results = []
try:
if 'searchResponse' in locals() and searchResponse:
logger.info(f"Attempting to extract URLs from error response: {type(searchResponse)}")
# Try to get content from response
if hasattr(searchResponse, 'content'):
errorContent = searchResponse.content
if isinstance(errorContent, str):
# Try to parse as JSON
try:
errorData = json.loads(errorContent)
if isinstance(errorData, dict):
if "urls" in errorData:
recovered_urls = errorData["urls"]
elif "results" in errorData:
recovered_urls = [r.get("url") for r in errorData["results"] if r.get("url")]
recovered_results = [{"url": r.get("url"), "title": r.get("title", ""), "content": r.get("content", ""), "score": 0} for r in errorData["results"]]
elif isinstance(errorData, list):
recovered_urls = [item if isinstance(item, str) else item.get("url", "") for item in errorData if item]
if recovered_urls:
logger.info(f"Recovered {len(recovered_urls)} URLs from error response")
# Create entries for recovered URLs
if not recovered_results:
for url in recovered_urls:
recovered_results.append({"url": url, "title": "", "content": "", "score": 0})
return recovered_urls, recovered_results
except Exception as parseError:
logger.debug(f"Failed to parse error response: {parseError}")
except Exception as recoverError:
logger.debug(f"Failed to recover URLs from error: {recoverError}")
# If we have URLs from earlier extraction, return them
if 'urls' in locals() and urls:
logger.info(f"Returning {len(urls)} URLs extracted before error occurred")
# Create entries from URLs
results_from_urls = [{"url": url, "title": "", "content": "", "score": 0} for url in urls]
return urls, results_from_urls
return [], []
def _validateUrls(self, urls: List[str]) -> List[str]:
"""