Skip to content

fix(tools): auto-quote hyphenated tokens in fact_store FTS5 queries#14794

Open
Tranquil-Flow wants to merge 1 commit into
NousResearch:mainfrom
Tranquil-Flow:fix/fact-store-fts5-hyphen
Open

fix(tools): auto-quote hyphenated tokens in fact_store FTS5 queries#14794
Tranquil-Flow wants to merge 1 commit into
NousResearch:mainfrom
Tranquil-Flow:fix/fact-store-fts5-hyphen

Conversation

@Tranquil-Flow

@Tranquil-Flow Tranquil-Flow commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

FTS5's default unicode61 tokenizer treats hyphens as separators, so a query like "pve-01" is parsed as a column-restriction operator and fails with OperationalError: no such column: 01. Since the exception was silently caught, users got empty results with no error indication. Adds automatic quoting of hyphenated tokens.

Related Issue

Fixes #14024

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • plugins/memory/holographic/retrieval.py: Auto-quote hyphenated tokens in FTS5 queries
  • tests/plugins/memory/test_fts5_hyphen_sanitizer.py: 14 tests

How to Test

  1. Run:
    python -m pytest -o 'addopts=' tests/plugins/memory/test_fts5_hyphen_sanitizer.py -v
  2. Confirm result: 14 passed.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15 (Darwin 24.6.0)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

python -m pytest -o 'addopts=' tests/plugins/memory/test_fts5_hyphen_sanitizer.py -v
# 14 passed

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists tool/memory Memory tool and memory providers comp/plugins Plugin system and bundled plugins labels Apr 23, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #14024 (issue), #14033 and #14262 (competing fix PRs) — FTS5 unicode61 tokenizer treats hyphens as column-restriction operators.

@perlowja perlowja left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — careful, well-scoped fix.

I read this closely because I'm working in an adjacent memory plugin (#14109, plugins/memory/hindsight/), so the plugins/memory/* subsystem is fresh in my head.

Things I verified:

  • The (?<!")...(?!") lookaround guard on _HYPHENATED_TOKEN_RE makes _sanitize_fts5_query idempotent — a second pass on an already-quoted token is a no-op. Test test_already_quoted_preserved locks that in.
  • FTS5 operator allowlist (AND, OR, NOT, NEAR) is the right set — confirmed against SQLite FTS5 syntax docs. Covered by test_fts5_operators_preserved + test_or_operator_preserved.
  • Integration test layer (TestFTS5Integration with a real in-memory sqlite FTS5 table) directly pins the before/after behavior: the raw query raises OperationalError: no such column, the sanitized query returns rows. That's the strongest possible confirmation the bug is real and this fix resolves it.
  • The sanitizer runs before params append in _fts_candidates (line 524-ish), so the MATCH argument is always post-sanitization. Correct integration point.

One small edge case worth noting (not blocking):

Only ASCII hyphens (U+002D) are caught. If a user pastes content containing Unicode hyphens/dashes (U+2010 hyphen, U+2013 en-dash, U+2014 em-dash) — which can happen with copy-paste from notes apps, emails, etc. — those tokens won't be quoted and will still hit the same FTS5 error. Could be a follow-up: either extend the regex to \w+(?:[-‐–—]\w+)+ or explicitly normalize dashes upstream. Not urgent — the common case (logs, hostnames, identifiers) is ASCII-only — just wanted to flag it.

Non-concern: \w includes underscore, which means tokens like db_name-01 get quoted too. That's actually correct behavior (bare db_name-01 would hit the same FTS5 error).

Nice work — the test coverage is notable (14 unit + 4 integration, per the PR body's summary). Happy to see the integration tests use a real FTS5 table rather than mocking SQLite.

@Tranquil-Flow

Copy link
Copy Markdown
Contributor Author

Thanks for the careful read @perlowja. I tested the Unicode-dash edge case empirically (against an empty FTS5 table to isolate the query parser from match results) and the result is actually the opposite of what we'd expect — only ASCII hyphen U+002D triggers the bug. Unicode dashes (U+2010 / U+2011 / U+2013 / U+2014 / U+2212, plus U+00AD / U+FE63 / U+FF0D / U+2043) all parse cleanly with no quoting needed.

Reason: ASCII - is FTS5's NOT-operator prefix, which is what routes pve-01 through the column-restriction parse path that produces no such column: 01. Other Unicode dashes aren't query operators and pass straight to unicode61 as ordinary separators.

Mixed copy-paste content also works without changes — pve-01–02 sanitizes to "pve-01"–02 (ASCII portion quoted, en-dash portion passes through), parses fine. So the ASCII-only scope of the regex is precisely the scope of the bug.

One open suggestion for maintainers: a small regression test asserting that bare Unicode-dash queries don't raise could be worth adding to lock in current FTS5 behavior. Happy to push it as a follow-up commit if useful.

@perlowja

Copy link
Copy Markdown
Contributor

Excellent test — thank you for running that empirically rather than just reasoning about it. The "ASCII hyphen is the NOT-operator prefix but other Unicode dashes are just tokenizer separators" framing is exactly the right way to scope the fix. The mixed-content example (pve-01–02"pve-01"–02) is a nice illustration that the regex boundary holds where it needs to.

Yes please on the Unicode-dash regression test as a follow-up — even if the current behavior is "by luck of FTS5 not treating en/em-dashes as operators," locking it down with an assertion means a future SQLite or unicode61 tokenizer change couldn't silently regress us into the same no such column: <chunk> failure mode this PR closes for ASCII.

Approval still stands; the empirical answer reinforces it.

@Tranquil-Flow

Copy link
Copy Markdown
Contributor Author

Pushed the Unicode-dash regression as 040949835 — thanks again @perlowja for the careful read and the framing.

The new tests are integration-level (against a real :memory: FTS5 table) and cover:

  • 9 parametrized cases asserting bare unsanitized queries pve<dash>01 do not raise OperationalError: no such column, for U+2010 / U+2011 / U+2013 / U+2014 / U+2212 / U+00AD / U+FE63 / U+FF0D / U+2043.
  • One mixed-content case locking in pve-01–02"pve-01"–02 and that the sanitizer's output parses cleanly.

The intent matches your suggestion exactly: if a future SQLite or unicode61 update starts treating any of these dashes as operator prefixes, these tests fail loudly rather than silently regressing us back into the same no such column: <chunk> failure mode this PR closes for ASCII. The sanitizer itself is left ASCII-only by design — only U+002D is FTS5's NOT-operator prefix, so widening the regex would be over-fitting; the assertion belongs on the parser path, not the sanitizer scope.

@perlowja

Copy link
Copy Markdown
Contributor

Thanks @Tranquil-Flow040949835 looks good. Empirical :memory: FTS5 testing is the right shape for query-parser semantics; reasoning-only tests on tokenizer behavior have bitten me before. Happy to see this land.

Cyrene963 pushed a commit to Cyrene963/hermes-agent that referenced this pull request May 5, 2026
…ousResearch#14794)

FTS5 unicode61 tokenizer treats hyphens as separators, so queries like
'pve-01' fail. This adds auto-quoting for hyphenated tokens.
@Tranquil-Flow Tranquil-Flow force-pushed the fix/fact-store-fts5-hyphen branch from 0409498 to ec640fd Compare May 25, 2026 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P2 Medium — degraded but workaround exists tool/memory Memory tool and memory providers type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: fact_store search action fails with "no such column" error on hyphenated queries (FTS5 tokenization)

3 participants