Fix hyphenated tokens in FTS5 lex queries #463

Merged
tobi merged 1 commit into tobi:main from goldsr09:fix/hyphenated-lex-queries
Mar 28, 2026

@goldsr09

Summary

Fixes #414, #383, #384, #390, #417

Hyphenated tokens like multi-agent, DEC-0054, and gpt-4 were broken in lex search because:

  1. sanitizeFTS5Term() strips all non-alphanumeric characters, turning multi-agent into multiagent
  2. The parser treated -agent as a negation prefix, so multi-agent became multi NOT agent
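
The two failure modes above can be reproduced in a few lines. This is a hypothetical reconstruction for illustration only: `stripNonAlphanumeric` and `naiveTokenize` are made-up names, and the real `sanitizeFTS5Term()` and parser may differ in detail.

```typescript
// Hypothetical reconstruction of the two failure modes; not the project's code.

// Failure mode 1: dropping every non-alphanumeric character glues the parts together.
function stripNonAlphanumeric(term: string): string {
  return term.replace(/[^a-zA-Z0-9]/g, "").toLowerCase();
}

// Failure mode 2: a tokenizer that accepts an optional leading "-" on any word
// run splits "multi-agent" into "multi" and a negated "-agent".
function naiveTokenize(query: string): string[] {
  return query.match(/-?\w+/g) ?? [];
}

const glued = stripNonAlphanumeric("DEC-0054");
const tokens = naiveTokenize("multi-agent memory");
console.log(glued);  // dec0054
console.log(tokens); // [ 'multi', '-agent', 'memory' ]
```

An index that tokenized `DEC-0054` into `dec` and `0054` has no `dec0054` token, so the glued form finds nothing.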

This PR:

  • Adds isHyphenatedToken() to detect compound words with internal hyphens
  • Adds sanitizeHyphenatedTerm() to split them into FTS5 phrase queries
  • multi-agent now becomes "multi agent" (phrase match), not "multiagent"*
  • Explicit negation (-sports) still works as before
  • Negated hyphenated terms (-multi-agent) also work correctly

Examples

| Input | Before | After |
| --- | --- | --- |
| `multi-agent` | `"multiagent"*` | `"multi agent"` |
| `DEC-0054` | `"dec0054"*` | `"dec 0054"` |
| `gpt-4` | `"gpt4"*` | `"gpt 4"` |
| `multi-agent memory` | `"multi"* NOT "agent"* AND "memory"*` | `"multi agent" AND "memory"*` |
| `-sports` (negation) | `NOT "sports"*` | `NOT "sports"*` (unchanged) |
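
A minimal sketch of how the After column could be produced. The names `isHyphenatedToken` and `sanitizeHyphenatedTerm` come from the summary above, but their bodies here are assumptions, and `sanitizeTerm` and `buildQuerySketch` are hypothetical stand-ins rather than the project's `sanitizeFTS5Term()` and `buildFTS5Query`. The output for `-multi-agent` is this sketch's behavior, not something the PR states.

```typescript
// Sketch only: function names come from the PR summary, bodies are assumptions.

// True for compound tokens with internal hyphens, e.g. "multi-agent" or "gpt-4".
// A leading hyphen ("-sports") does not match, so it stays a negation marker.
function isHyphenatedToken(token: string): boolean {
  return /^[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+$/.test(token);
}

// Plain-term cleanup (hypothetical): keep ASCII letters and digits, lowercased.
function sanitizeTerm(term: string): string {
  return term.replace(/[^A-Za-z0-9]/g, "").toLowerCase();
}

// Turns multi-agent into the quoted phrase "multi agent", with no prefix star.
function sanitizeHyphenatedTerm(token: string): string {
  const parts = token.split("-").map(sanitizeTerm).filter((p) => p.length > 0);
  return `"${parts.join(" ")}"`;
}

// Hypothetical stand-in for the real query builder.
function buildQuerySketch(query: string): string {
  const positive: string[] = [];
  const negative: string[] = [];
  for (const raw of query.trim().split(/\s+/)) {
    // Only a hyphen at the very start of a token means negation.
    const negated = raw.length > 1 && raw[0] === "-";
    const token = negated ? raw.slice(1) : raw;
    let expr: string;
    if (isHyphenatedToken(token)) {
      expr = sanitizeHyphenatedTerm(token);
    } else {
      const clean = sanitizeTerm(token);
      if (clean.length === 0) continue; // nothing searchable left
      expr = `"${clean}"*`;
    }
    (negated ? negative : positive).push(expr);
  }
  const head = positive.join(" AND ");
  const tail = negative.map((n) => `NOT ${n}`).join(" ");
  return [head, tail].filter((s) => s.length > 0).join(" ");
}

console.log(buildQuerySketch("multi-agent memory")); // "multi agent" AND "memory"*
console.log(buildQuerySketch("-multi-agent"));       // NOT "multi agent"
```

Stripping the leading hyphen before testing for internal hyphens is what lets `-multi-agent` be both negated and treated as a phrase.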

Test plan

  • 8 new test cases covering hyphenated terms, mixed queries, negation disambiguation
  • All 12 existing buildFTS5Query tests pass (no regressions)
  • Full structured-search test suite passes

Hyphenated terms like multi-agent, DEC-0054, gpt-4 were being stripped
of hyphens and concatenated (e.g., "multiagent") which missed matches.
Now they're split into FTS5 phrase queries ("multi agent") so the porter
tokenizer matches them correctly.
zeattacker pushed a commit to zeattacker/qmd that referenced this pull request Mar 26, 2026
Fixes multi-agent, DEC-0054, gpt-4 etc being broken in lex search.

Cherry-picked from tobi#463

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
zeattacker pushed a commit to zeattacker/qmd that referenced this pull request Mar 26, 2026
Merges dev-upstream-fixes (cherry-picked PRs tobi#462, tobi#463, tobi#455, tobi#418,
tobi#456, tobi#442, tobi#453) into dev. Resolved mcp/server.ts bind conflict —
keep 0.0.0.0 for Docker container accessibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tobi tobi merged commit dd27f49 into tobi:main Mar 28, 2026
jaylfc added a commit to jaylfc/qmd that referenced this pull request Apr 5, 2026
Fix hyphenated tokens in FTS5 lex queries
@rymalia

rymalia commented Apr 7, 2026

Contributor

Thanks for the lex fix — hyphenated tokens in FTS5 queries work correctly now.

However, this PR's Fixes line claims to resolve #383, #384, #390, and #414, which are all about validateSemanticQuery() rejecting vec/hyde queries containing hyphenated words. That's a different function (store.ts:2910) in a different code path — it wasn't modified by this PR.

The /-\w/ regex in validateSemanticQuery is still unchanged in v2.1.0, so queries like vec: "multi-agent orchestration" or vec: "AST-aware chunking" are still rejected. #384 has the fix for that.
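
To illustrate why that pattern trips on compound words: `/-\w/` matches a hyphen followed by a word character anywhere in the string, including mid-word. The sketch below compares it with a stricter check that only looks at hyphens starting a whitespace-separated token. Only the `/-\w/` pattern is taken from the comment above; `hasLeadingHyphenToken` is a hypothetical helper and is not necessarily the change proposed in #384.

```typescript
// The /-\w/ pattern is quoted from the comment; everything else is hypothetical.
const hyphenThenWord = /-\w/;

// Hypothetical stricter check: only a hyphen at the start of a
// whitespace-separated token counts as a negation attempt.
function hasLeadingHyphenToken(query: string): boolean {
  return /(?:^|\s)-\w/.test(query);
}

const samples = ["multi-agent orchestration", "AST-aware chunking", "news -sports"];
for (const q of samples) {
  // Columns: query, matched by /-\w/, matched by the stricter check
  console.log(q, hyphenThenWord.test(q), hasLeadingHyphenToken(q));
}
// multi-agent orchestration true false
// AST-aware chunking true false
// news -sports true true
```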

tanarchytan referenced this pull request in tanarchytan/lotl Apr 8, 2026
Fix hyphenated tokens in FTS5 lex queries
rymalia added a commit to rymalia/qmd that referenced this pull request May 27, 2026
CLAUDE.local.md version bump v2.0.1 → v2.1.0. Two session summaries:
upstream rebase sync (45 commits integrated via rebase) and v2.1.0
testing session (PR tobi#533 submitted, bugs found in JSON line field
and multi-collection search, comments on tobi#383, tobi#463, tobi#217, tobi#241).
lucndm pushed a commit to lucndm/qmd that referenced this pull request Jun 7, 2026
Fix hyphenated tokens in FTS5 lex queries
Development

Successfully merging this pull request may close these issues.

vec/hyde queries reject hyphenated compound words as negation operators

3 participants