fix: handle hyphenated and underscore terms in lex and vec/hyde queries #418

Closed
fxstein wants to merge 1 commit into tobi:main from fxstein:fix/hyphenated-query-handling

fxstein commented Mar 16, 2026

Contributor

Problem

Hyphens and underscores in compound words and identifiers break both lex and vec/hyde search:

Lex — hyphens (#417): Hyphenated identifiers like DEC-0054, RFC-0011, CVE-2024-1234 are unsearchable:

  • Bare DEC-0054 → parsed as negation (DEC minus 0054) → 0 results
  • Quoted "DEC-0054" → sanitizeFTS5Term() strips the hyphen → dec0054 → doesn't match FTS5 unicode61 tokens (dec, 0054) → 0 results

Lex — underscores (#305): Snake_case identifiers like apply_secrets, __init__, my_variable are unsearchable:

  • sanitizeFTS5Term() strips underscores → applysecrets → doesn't match FTS5 unicode61 tokens (apply, secrets) → 0 results

Vec/hyde (#414): validateSemanticQuery() uses /-\w/, which matches any hyphen followed by a word character, so compound words like multi-agent, role-based, chain-of-thought are rejected as negation syntax.

Root Cause

FTS5's unicode61 tokenizer splits on hyphens and underscores at index time (DEC-0054 → dec + 0054, apply_secrets → apply + secrets). But sanitizeFTS5Term() strips these characters entirely, concatenating the parts into a single token (dec0054, applysecrets) that can never match the index.
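
The mismatch described above can be reproduced without SQLite. The sketch below is illustrative only: roughUnicode61Tokens is a hypothetical helper that approximates unicode61 by splitting on every character that is not a letter or digit (the real tokenizer also folds diacritics, and porter adds stemming). The two sanitizers use the character classes from the removed and added lines in the diff further down; their names are made up for this example.

```typescript
// Rough stand-in for unicode61 splitting: any character that is not a letter
// or digit acts as a separator. Hypothetical helper, not QMD code.
function roughUnicode61Tokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(new RegExp("[^\\p{L}\\p{N}]+", "u"))
    .filter((token) => token.length > 0);
}

// Behaviour before this PR: separators are deleted outright.
// Same character class as the removed diff line, built with the RegExp constructor.
function sanitizeStripping(term: string): string {
  return term.replace(new RegExp("[^\\p{L}\\p{N}']", "gu"), "").toLowerCase();
}

// Behaviour proposed by this PR: hyphens and underscores survive sanitization.
function sanitizePreserving(term: string): string {
  return term.replace(new RegExp("[^\\p{L}\\p{N}'_-]", "gu"), "").toLowerCase();
}

for (const term of ["DEC-0054", "apply_secrets"]) {
  const indexed = roughUnicode61Tokens(term); // what the index roughly holds
  const before = roughUnicode61Tokens(sanitizeStripping(term)); // query tokens, old sanitizer
  const after = roughUnicode61Tokens(sanitizePreserving(term)); // query tokens, new sanitizer
  console.log(term, { indexed, before, after });
}
```

For DEC-0054 the old sanitizer yields the single token dec0054, while the indexed form and the separator-preserving form both yield dec and 0054.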

Fix

Two changes in store.ts:

  1. sanitizeFTS5Term() — preserve hyphens and underscores:
- return term.replace(/[^\p{L}\p{N}']/gu, '').toLowerCase();
+ return term.replace(/[^\p{L}\p{N}'_-]/gu, '').toLowerCase();

FTS5 applies the same tokenizer to query strings as to indexed content. Preserving the separator lets FTS5 split the query symmetrically — producing precise adjacency/phrase matches. This is simpler and more accurate than splitting at the JS level (which would produce AND terms matching anywhere in the document, not just adjacent occurrences).

  2. validateSemanticQuery(): match only hyphens preceded by whitespace or the start of the string (actual negation), not internal hyphens in compound words.
- if (/-\w/.test(query) || /-"/.test(query)) {
+ if (/(?:^|\s)-[\w"]/.test(query)) {

No changes to buildFTS5Query() — FTS5 handles the splitting correctly when the separator characters are preserved.
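
To make the effect of the validateSemanticQuery() change concrete, here is a minimal sketch that isolates just the two regular-expression checks from the diff above. The function names oldLooksLikeNegation and newLooksLikeNegation are hypothetical; the rest of validateSemanticQuery() is not shown in this PR and is not reproduced here.

```typescript
// Check before this PR: any hyphen followed by a word character or a quote.
function oldLooksLikeNegation(query: string): boolean {
  return /-\w/.test(query) || /-"/.test(query);
}

// Check proposed by this PR: the hyphen must open the query or follow whitespace.
function newLooksLikeNegation(query: string): boolean {
  return /(?:^|\s)-[\w"]/.test(query);
}

const samples = [
  "multi-agent orchestration", // compound word, should pass validation
  "chain-of-thought",          // compound word, should pass validation
  "DEC-0054",                  // hyphenated identifier, should pass validation
  "spawn -orchestrator",       // real negation, should still be flagged
  '-"exact phrase"',           // negated phrase at string start, should still be flagged
];

for (const query of samples) {
  console.log(query, {
    old: oldLooksLikeNegation(query),
    new: newLooksLikeNegation(query),
  });
}
```

With the anchored pattern, only the last two samples are treated as negation; the old pattern flags all five.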

Testing

Verified against a real FTS5 index using the porter unicode61 tokenizer (an in-memory database and a 7,665-document production index):

| Query | Before | After |
| --- | --- | --- |
| lex "DEC-0054" | 0 results | ✅ 93% top hit |
| lex DEC-0054 | 0 results | ✅ 93% top hit |
| lex apply_secrets | 0 results | ✅ match (phrase, adjacent only) |
| lex __init__ | 0 results | ✅ match |
| lex my-app_v2 | 0 results | ✅ match (mixed separators) |
| vec "multi-agent orchestration" | ❌ Negation error | ✅ 88% top hit |
| lex spawn -orchestrator | ✅ Works | ✅ Still works (no regression) |

Precision verified: "apply_secrets" produces a phrase match (1 hit for adjacent apply + secrets), not an AND match (which would also hit documents containing both words non-adjacently). FTS5's symmetric tokenization gives us adjacency for free.

Relationship to #404

Complements #404, which also addresses #305 (underscores) via a similar sanitizeFTS5Term change. This PR extends the fix to hyphens and adds the validateSemanticQuery fix for vec/hyde false positives, which #404 does not cover.

Fixes #305, fixes #414, fixes #417
Related: #404

Environment

  • QMD: v2.0.1
  • Platform: macOS (Apple Silicon)
  • Node: v24.2.0

Preserve hyphens and underscores in sanitizeFTS5Term so FTS5's unicode61
tokenizer can split them symmetrically at query time, producing precise
phrase matches. Also fix validateSemanticQuery false positive that rejected
hyphenated terms like DEC-0054 as negation syntax in vec/hyde queries.

Complements tobi#404 (underscore-only fix) by also covering hyphens.
Refs: tobi#305, tobi#417
fxstein force-pushed the fix/hyphenated-query-handling branch from 889b0e3 to b5f4286 on March 18, 2026, 12:20
fxstein changed the title from "fix: handle hyphenated terms in lex and vec/hyde queries" to "fix: handle hyphenated and underscore terms in lex and vec/hyde queries" on Mar 18, 2026
zeattacker pushed a commit to zeattacker/qmd that referenced this pull request Mar 26, 2026
Merges dev-upstream-fixes (cherry-picked PRs tobi#462, tobi#463, tobi#455, tobi#418,
tobi#456, tobi#442, tobi#453) into dev. Resolved mcp/server.ts bind conflict —
keep 0.0.0.0 for Docker container accessibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

tobi commented Apr 5, 2026

Owner

Closing — the underscore handling landed in #404. The remaining hyphen + validateSemanticQuery changes conflict with main and would need a rebase. Happy to revisit if you want to open a focused follow-up PR. Thanks!

fxstein commented Apr 23, 2026

Contributor Author

@tobi Opened a focused follow-up PR as suggested: #601

idanariav pushed a commit to idanariav/qmd that referenced this pull request Apr 24, 2026
validateSemanticQuery uses /-\w/ to flag negation syntax, but that regex
also matches hyphens embedded inside identifiers like "DEC-0054" or
"ui-kit". Semantic queries containing these entirely reasonable tokens
are rejected with a confusing 'Negation (-term) is not supported' error.

Anchor the pattern to the start of the query or a whitespace boundary so
that only true negation tokens ("-word", '-"phrase"' at the start of a
word) trigger the validation, while mid-term hyphens pass through.

Adds test coverage for DEC-0054, scoped npm packages, compound adjectives,
and token-based identifiers.

Refs: tobi/qmd#418 (prior attempt, scope-reduced to the negation fix only)
Refs: tobi/qmd#305, tobi/qmd#417