fix(memory/holographic): sanitize FTS5 queries for natural-language recall #11333

Open: cyb3rwr3n wants to merge 1 commit into NousResearch:main from cyb3rwr3n:fix/holographic-fts-sanitize
@cyb3rwr3n cyb3rwr3n commented Apr 17, 2026

Summary

The holographic memory provider's FactRetriever._fts_candidates passes the raw user query directly to FTS5's MATCH operator. FTS5 defaults to AND-between-tokens, which means any multi-word prose query requires every token to co-occur in a fact. For the prefetch() path where the query comes straight from the user message, this reduces recall to near-zero on natural-language prompts.

Example, before this fix:

query: "what happened with the deployment rollback"
FTS5 MATCH: "what AND happened AND with AND the AND deployment AND rollback"
results: 0  (nothing has all six tokens)

query: "deployment OR rollback"
results: 5  (normal recall)

The prefetch hook in run_agent.py that injects memory context on every turn was therefore silently missing relevant facts whenever the user phrased their message in prose.
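
The implicit-AND behavior can be reproduced outside the plugin with a few lines of Python. This is a minimal sketch, not the plugin's code: the `facts` table, its single `content` column, and the sample rows are hypothetical, and it assumes the local SQLite build has FTS5 compiled in.

```python
import sqlite3

# Hypothetical table and rows for illustration; the real fact-store schema
# is not shown in this PR.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE facts USING fts5(content)")
conn.executemany(
    "INSERT INTO facts(content) VALUES (?)",
    [
        ("deployment rollback completed on staging",),
        ("rollback triggered by failed health check",),
        ("user prefers dark mode",),
    ],
)


def match_count(query: str) -> int:
    """Count rows whose content matches the given FTS5 query string."""
    row = conn.execute(
        "SELECT count(*) FROM facts WHERE facts MATCH ?", (query,)
    ).fetchone()
    return row[0]


# Bare words are implicitly AND-ed, so every token must appear in one row.
prose_hits = match_count("what happened with the deployment rollback")

# OR-joined phrase literals match rows containing any of the tokens.
or_hits = match_count('"deployment" OR "rollback"')

print(prose_hits, or_hits)
```

With these sample rows, the prose query matches nothing because no row contains "what", while the OR-joined form matches both rollback rows.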

Fix

Add _sanitize_fts_query() that:

  • tokenizes and lowercases the query
  • drops standard English stopwords and <2-char tokens
  • strips FTS5 operator characters from each remaining token
  • OR-joins the survivors as phrase literals: "tok1" OR "tok2" OR ...
  • falls back to the raw query if nothing survives sanitization (pathological inputs)
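
The steps above could look roughly like the following. This is a sketch under stated assumptions rather than the code in this PR: `sanitize_fts_query` is written as a plain function instead of a classmethod, `STOPWORDS` is a small illustrative subset of an English list, and tokenizing on word characters folds the "strip operator characters" step into the tokenizer instead of running it per token.

```python
import re

# Illustrative subset only (assumption); the PR bakes in a fuller English list.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "at", "did", "do", "for", "in", "is", "it",
    "of", "on", "or", "the", "to", "was", "what", "with",
})


def sanitize_fts_query(query: str) -> str:
    """Turn a prose query into an OR-joined FTS5 expression of phrase literals."""
    # Tokenize on runs of word characters; this also discards FTS5 operator
    # characters such as quotes, stars, parens, carets, colons and hyphens.
    tokens = re.findall(r"\w+", query.lower())

    # Drop stopwords and tokens shorter than two characters.
    survivors = [t for t in tokens if len(t) >= 2 and t not in STOPWORDS]

    # Pathological input (all stopwords, empty): fall back to the raw query.
    if not survivors:
        return query

    # Quote each token so FTS5 treats it as a literal, then OR-join.
    return " OR ".join(f'"{t}"' for t in survivors)


sanitized = sanitize_fts_query("what happened with the deployment rollback")
fallback = sanitize_fts_query("the and of")
print(sanitized)
print(fallback)
```

One design point worth noting about this sketch: because the fallback returns the raw string untouched, an input made only of operator characters would still reach MATCH unchanged, so the caller may need to handle a query error on that path.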

No changes to the HRR + Jaccard + trust reranking — those keep precision high once the candidate pool isn't empty.

Test plan

Ships with 10 new tests in tests/plugins/memory/test_holographic_retrieval.py:

  • parametrized sanitizer unit tests (stopword drop, single content word, pure-stopword fallback, FTS5-special stripping, empty input)
  • FTS5 crash-safety test against problematic inputs (quotes, stars, parens, carets, colons, hyphens, long strings)
  • integration tests against an in-memory MemoryStore:
    • natural-language prose query recovers the relevant fact (the exact regression this fix targets)
    • single-keyword query still works
    • pure-stopword query returns [] without crashing
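
A crash-safety check in the spirit of the tests above might be sketched as follows. It is not the test file from this PR: the `facts` table, the `phrase_query` helper, and the list of inputs are hypothetical, and it assumes an FTS5-enabled SQLite build. It feeds the same problematic strings to MATCH twice, once raw and once as OR-joined phrase literals.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and row, for illustration only.
conn.execute("CREATE VIRTUAL TABLE facts USING fts5(content)")
conn.execute("INSERT INTO facts(content) VALUES ('deployment rollback completed')")


def phrase_query(text: str) -> str:
    """OR-join the word tokens of `text` as quoted FTS5 phrase literals."""
    tokens = re.findall(r"\w+", text.lower())
    return " OR ".join(f'"{t}"' for t in tokens)


def search(match_expr: str) -> list:
    return conn.execute(
        "SELECT content FROM facts WHERE facts MATCH ?", (match_expr,)
    ).fetchall()


# Every input here contains at least one word token, so phrase_query()
# never produces an empty expression.
problematic = [
    '"unbalanced quote',
    "(deployment",
    "deployment:rollback",
    "roll-back ^now",
    "x" * 5000,
]

# Count how many raw inputs FTS5 rejects outright.
raw_errors = 0
for text in problematic:
    try:
        search(text)
    except sqlite3.OperationalError:
        raw_errors += 1

# The quoted form should run for every input without raising.
quoted_results = [search(phrase_query(text)) for text in problematic]
print(raw_errors, [len(rows) for rows in quoted_results])
```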

Existing memory/fact-store test suite (329 tests) still passes.

pytest tests/plugins/memory/test_holographic_retrieval.py  # 10 passed
pytest tests/ -k "memory or fact_store or retriev or holographic"  # 329 passed, 4 skipped

Notes

  • Pure bug fix, no API surface change, no config change.
  • Stopword list is the standard English set (baked in). Could be made configurable later if multi-language is desired, but that's out of scope here.
  • The sanitizer is a @classmethod so tests can call it directly without instantiating a retriever + store.

…ecall

The FactRetriever's _fts_candidates passed the raw query string directly
to FTS5's MATCH operator. FTS5 defaults to AND-between-tokens, which
means any multi-word prose query like 'what happened with the deployment
rollback' required every single token to co-occur in a fact — dropping
recall to zero on the kind of queries agents actually issue via prefetch().

Fix: add _sanitize_fts_query() that:
- tokenizes the query and drops English stopwords
- strips FTS5 operator characters per token
- OR-joins the remaining content tokens as phrase literals

For pathological inputs (all stopwords, empty), falls back to the raw
query so the caller sees zero results instead of a SQL error.

This is a pure-retrieval-quality fix — the HRR + Jaccard reranking
stages still keep precision high. Ships with 10 tests covering the
sanitizer and retrieval integration.
@cyb3rwr3n cyb3rwr3n force-pushed the fix/holographic-fts-sanitize branch from 9a65f4c to 2c930a3 Compare April 19, 2026 00:10
@alt-glitch alt-glitch added labels on Apr 25, 2026: type/bug (Something isn't working), P1 (High: major feature broken, no workaround), tool/memory (Memory tool and memory providers), comp/plugins (Plugin system and bundled plugins)
@alt-glitch

Collaborator

Related to #14033, #14262, #14794 — all address holographic FTS5 query sanitization. This PR is the most comprehensive (stopwords + OR-join + crash safety), but check for conflicts with those PRs.
