fix(tools): auto-quote hyphenated tokens in fact_store FTS5 queries#14794
fix(tools): auto-quote hyphenated tokens in fact_store FTS5 queries#14794Tranquil-Flow wants to merge 1 commit into
Conversation
perlowja
left a comment
There was a problem hiding this comment.
LGTM — careful, well-scoped fix.
I read this closely because I'm working in an adjacent memory plugin (#14109, plugins/memory/hindsight/), so the plugins/memory/* subsystem is fresh in my head.
Things I verified:
- The
(?<!")...(?!")lookaround guard on_HYPHENATED_TOKEN_REmakes_sanitize_fts5_queryidempotent — a second pass on an already-quoted token is a no-op. Testtest_already_quoted_preservedlocks that in. - FTS5 operator allowlist (
AND,OR,NOT,NEAR) is the right set — confirmed against SQLite FTS5 syntax docs. Covered bytest_fts5_operators_preserved+test_or_operator_preserved. - Integration test layer (
TestFTS5Integrationwith a real in-memory sqlite FTS5 table) directly pins the before/after behavior: the raw query raisesOperationalError: no such column, the sanitized query returns rows. That's the strongest possible confirmation the bug is real and this fix resolves it. - The sanitizer runs before params append in
_fts_candidates(line 524-ish), so the MATCH argument is always post-sanitization. Correct integration point.
One small edge case worth noting (not blocking):
Only ASCII hyphens (U+002D) are caught. If a user pastes content containing Unicode hyphens/dashes (U+2010 hyphen, U+2013 en-dash, U+2014 em-dash) — which can happen with copy-paste from notes apps, emails, etc. — those tokens won't be quoted and will still hit the same FTS5 error. Could be a follow-up: either extend the regex to \w+(?:[-‐–—]\w+)+ or explicitly normalize dashes upstream. Not urgent — the common case (logs, hostnames, identifiers) is ASCII-only — just wanted to flag it.
Non-concern: \w includes underscore, which means tokens like db_name-01 get quoted too. That's actually correct behavior (bare db_name-01 would hit the same FTS5 error).
Nice work — the test coverage is notable (14 unit + 4 integration, per the PR body's summary). Happy to see the integration tests use a real FTS5 table rather than mocking SQLite.
|
Thanks for the careful read @perlowja. I tested the Unicode-dash edge case empirically (against an empty FTS5 table to isolate the query parser from match results) and the result is actually the opposite of what we'd expect — only ASCII hyphen U+002D triggers the bug. Unicode dashes (U+2010 / U+2011 / U+2013 / U+2014 / U+2212, plus U+00AD / U+FE63 / U+FF0D / U+2043) all parse cleanly with no quoting needed. Reason: ASCII Mixed copy-paste content also works without changes — One open suggestion for maintainers: a small regression test asserting that bare Unicode-dash queries don't raise could be worth adding to lock in current FTS5 behavior. Happy to push it as a follow-up commit if useful. |
|
Excellent test — thank you for running that empirically rather than just reasoning about it. The "ASCII hyphen is the NOT-operator prefix but other Unicode dashes are just tokenizer separators" framing is exactly the right way to scope the fix. The mixed-content example ( Yes please on the Unicode-dash regression test as a follow-up — even if the current behavior is "by luck of FTS5 not treating en/em-dashes as operators," locking it down with an assertion means a future SQLite or unicode61 tokenizer change couldn't silently regress us into the same Approval still stands; the empirical answer reinforces it. |
|
Pushed the Unicode-dash regression as 040949835 — thanks again @perlowja for the careful read and the framing. The new tests are integration-level (against a real
The intent matches your suggestion exactly: if a future SQLite or |
|
Thanks @Tranquil-Flow — |
…ousResearch#14794) FTS5 unicode61 tokenizer treats hyphens as separators, so queries like 'pve-01' fail. This adds auto-quoting for hyphenated tokens.
0409498 to
ec640fd
Compare
What does this PR do?
FTS5's default unicode61 tokenizer treats hyphens as separators, so a query like "pve-01" is parsed as a column-restriction operator and fails with
OperationalError: no such column: 01. Since the exception was silently caught, users got empty results with no error indication. Adds automatic quoting of hyphenated tokens.Related Issue
Fixes #14024
Type of Change
Changes Made
plugins/memory/holographic/retrieval.py: Auto-quote hyphenated tokens in FTS5 queriestests/plugins/memory/test_fts5_hyphen_sanitizer.py: 14 testsHow to Test
python -m pytest -o 'addopts=' tests/plugins/memory/test_fts5_hyphen_sanitizer.py -vChecklist
Code
fix(scope):,feat(scope):, etc.)pytest tests/ -qand all tests passDocumentation & Housekeeping
docs/, docstrings) — or N/Acli-config.yaml.exampleif I added/changed config keys — or N/ACONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — or N/AScreenshots / Logs