Re-indexing the same file always triggered a full embedding run —
ingestion.skipped.duplicate never fired. Two independent causes:
1. _computeIngestionHash included contentObjectId in its payload, but
extractors generate a fresh uuid4() on every run, which made the hash
a per-run nonce. It is now computed over (contentType, data) in
extractor order: stable across re-extractions, sensitive to content,
ordering, and type changes.
2. _autoIndexFile upserted the fresh pre-scan FileContentIndex before
requestIngestion's duplicate check, wiping structure._ingestion
and status=indexed from the prior run. The pre-upsert now merges
the existing _ingestion metadata and preserves the indexed status.
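
A minimal sketch of the merge described in item 2, assuming the index
record is a plain dict with "status" and "structure" keys. The helper
name merge_prescan and the record shape are hypothetical, not the
actual FileContentIndex API:

```python
from typing import Any, Dict, Optional

Record = Dict[str, Any]


def merge_prescan(existing: Optional[Record], fresh: Record) -> Record:
    """Build the record to upsert before the duplicate check runs.

    Hypothetical helper: carries the prior run's _ingestion metadata and
    indexed status over to the fresh pre-scan record instead of wiping them.
    """
    merged = dict(fresh)
    # Copy the nested dict so the caller's fresh record is not mutated.
    merged["structure"] = dict(fresh.get("structure", {}))
    if existing is None:
        return merged

    prior = existing.get("structure", {}).get("_ingestion")
    if prior is not None:
        merged["structure"]["_ingestion"] = prior
    if existing.get("status") == "indexed":
        merged["status"] = "indexed"
    return merged
```

With the prior metadata still in place, a later duplicate check has a
stored hash to compare against.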
Verified end-to-end: second PATCH /scope on an already-indexed file
logs and returns in ~2s with zero embedding API calls.
Adds test_ingestion_hash_stability.py (5 cases).
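
The stability property can be sketched as follows. This is an
illustration, not the contents of the test file: compute_ingestion_hash
and extract are hypothetical stand-ins, parts are assumed to be dicts
with string data, and SHA-256 over a JSON encoding is an assumed
choice of digest and serialization.

```python
import hashlib
import json
from uuid import uuid4


def compute_ingestion_hash(parts):
    """Hash only (contentType, data) pairs, in extractor order."""
    # contentObjectId is deliberately left out: it changes on every run.
    payload = [[p["contentType"], p["data"]] for p in parts]
    encoded = json.dumps(payload, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def extract(texts, content_type="text"):
    """Stand-in extractor: assigns a fresh uuid4() to every part per run."""
    return [
        {"contentObjectId": str(uuid4()), "contentType": content_type, "data": t}
        for t in texts
    ]


first = compute_ingestion_hash(extract(["intro", "body"]))
second = compute_ingestion_hash(extract(["intro", "body"]))
reordered = compute_ingestion_hash(extract(["body", "intro"]))
retyped = compute_ingestion_hash(extract(["intro", "body"], content_type="image"))
```

Here first and second match even though every contentObjectId differs,
while reordered and retyped each produce a different digest.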
The default MergeStrategy concatenates every extracted text part into a
single ContentPart, collapsing a 500-page PDF into one chunk with a
blurred average embedding — RAG retrieval was effectively broken.
- ExtractionOptions.mergeStrategy is now Optional[MergeStrategy]; passing
None preserves per-part granularity. Default factory kept for
backward compatibility.
- routeDataFiles._autoIndexFile, _workspaceTools.readFile, and
_documentTools.describeImage explicitly pass mergeStrategy=None.
- Agent tools no longer carry redundant extraction + requestIngestion
fallback paths: the unified ingestion lane owns all corpus writes,
and readFile/describeImage are pure consumers of the knowledge store.
- Unit test asserts runExtraction(mergeStrategy=None) keeps every part.
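
A minimal sketch of the Optional mergeStrategy behaviour, under stated
assumptions: ContentPart, ExtractionOptions, and run_extraction below
are simplified stand-ins for the real classes, MergeStrategy is
modelled as a plain callable, and concat_all is a hypothetical default
that mimics the concatenating strategy.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ContentPart:
    contentType: str
    data: str


# Assumption: a merge strategy maps a list of parts to a list of parts.
MergeStrategy = Callable[[List[ContentPart]], List[ContentPart]]


def concat_all(parts: List[ContentPart]) -> List[ContentPart]:
    """Default-like strategy: collapse every part into one ContentPart."""
    return [ContentPart("text", "\n".join(p.data for p in parts))]


@dataclass
class ExtractionOptions:
    # The default factory keeps the old merging behaviour for existing callers.
    mergeStrategy: Optional[MergeStrategy] = field(
        default_factory=lambda: concat_all
    )


def run_extraction(
    parts: List[ContentPart], options: ExtractionOptions
) -> List[ContentPart]:
    """Apply the merge strategy, or keep every part when it is None."""
    if options.mergeStrategy is None:
        return list(parts)
    return options.mergeStrategy(parts)


pages = [ContentPart("text", f"page {i}") for i in range(3)]
merged = run_extraction(pages, ExtractionOptions())
granular = run_extraction(pages, ExtractionOptions(mergeStrategy=None))
```

In this sketch merged holds a single part and granular holds three, so
each page can be chunked and embedded on its own.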