fix: allow mid-term hyphens in vec/hyde queries (DEC-0054, ui-kit) #601
fxstein wants to merge 1 commit into
validateSemanticQuery uses /-\w/ to flag negation syntax, but that regex
also matches hyphens embedded inside identifiers like "DEC-0054" or
"ui-kit". Semantic queries containing these entirely reasonable tokens
are rejected with a confusing 'Negation (-term) is not supported' error.
Anchor the pattern to the start of the query or a whitespace boundary so
that only true negation tokens ("-word", '-"phrase"' at the start of a
word) trigger the validation, while mid-term hyphens pass through.
Adds test coverage for DEC-0054, scoped npm packages, compound adjectives,
and token-based identifiers.
Refs: tobi#418 (prior attempt, scope-reduced to the negation fix only)
Refs: tobi#305, tobi#417
Heads-up on the red check: I traced the root cause of the single failing test and opened a separate, focused fix in #602. Once that lands, this PR's CI goes green automatically (verified: #602 itself is the first fully green CI run on this codebase in two weeks). No action is needed on this PR; I am just flagging that the red check here is pre-existing and orthogonal.
Closing because the fix for the underlying vec/hyde hyphen false-positive bug has already shipped in the v2.5.x release line via the structured-search fix. Thanks for the report and fix direction.
Problem
`validateSemanticQuery()` rejects semantic (vec/hyde) queries that contain mid-term hyphens inside compound words or identifiers, with a misleading error message.
The regex `/-\w/` matches any hyphen followed by a word character, so perfectly reasonable tokens trigger false positives:
- `DEC-0054`, `RFC-0011`, `CVE-2024-1234`
- `@scope/ui-kit`, `material-ui`
- `state-of-the-art`, `role-based`, `multi-agent`, `chain-of-thought`
- `token-based`, `context-aware`, `fine-tuned`
Users see a confusing "Negation is not supported" error for queries that contain no intentional negation at all.
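The false positive is easy to reproduce in isolation. The sketch below applies only the `/-\w/` pattern quoted above to a few of the listed tokens; the variable names are illustrative and not taken from the codebase.

```typescript
// The unanchored pattern quoted in this PR description.
const unanchoredNegation = /-\w/;

// Tokens from the list above; none of them is intended as negation.
const tokens: string[] = [
  "DEC-0054",
  "CVE-2024-1234",
  "@scope/ui-kit",
  "state-of-the-art",
  "fine-tuned",
];

// Collect every token the unanchored pattern would flag as negation.
const falsePositives = tokens.filter((t) => unanchoredNegation.test(t));

// Each token contains a hyphen followed by a word character, so all are flagged.
console.log(`${falsePositives.length} of ${tokens.length} tokens flagged`);
```

Because the pattern has no notion of where the hyphen sits within a word, any hyphenated identifier is indistinguishable from `-word` negation.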
Root Cause
The regex does not distinguish between true negation (`-word` at the start of a query or after whitespace, i.e. syntax borrowed from lex) and internal hyphens in compound words (`multi-agent`, `DEC-0054`). Both match `/-\w/`.
Fix
One-line change in `validateSemanticQuery()`: anchor the negation regex to the start of the query or a whitespace boundary.
Now only true negation tokens (`-word` or `-"phrase"` at the start of a word) match. Mid-term hyphens pass through unchanged.
Testing
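As a rough illustration of what an anchored check accepts and rejects, here is a minimal sketch run against the queries this PR lists as test cases. The pattern `/(^|\s)-[\w"]/` and the helper name `looksLikeNegation` are assumptions made for illustration; the actual one-line diff is not reproduced in this description and may differ.

```typescript
// Assumed anchored pattern (not the PR's actual diff): a hyphen counts as
// negation only at the start of the query or right after whitespace, and only
// when followed by a word character or an opening double quote.
const anchoredNegation = /(^|\s)-[\w"]/;

// Hypothetical helper standing in for the negation check.
function looksLikeNegation(query: string): boolean {
  return anchoredNegation.test(query);
}

// [query, expected to be treated as negation]
const cases: Array<[string, boolean]> = [
  ["DEC-0054 architecture decision", false],
  ["how does @scope/ui-kit work", false],
  ["state-of-the-art retrieval", false],
  ["token-based chunking", false],
  ["performance -sports", true],
  ["foo -bar baz", true],
  ['-"exact phrase"', true],
];

for (const [query, expected] of cases) {
  const actual = looksLikeNegation(query);
  console.log(`${JSON.stringify(query)}: negation=${actual}, expected=${expected}`);
}
```

The `(^|\s)` group is what separates the two situations: a mid-term hyphen is always preceded by a non-whitespace character, so it can never satisfy the anchor.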
Added four new test cases in `test/structured-search.test.ts` covering common identifier patterns:
- `"DEC-0054 architecture decision"`
- `"how does @scope/ui-kit work"`
- `"state-of-the-art retrieval"`
- `"token-based chunking"`
- `"performance -sports"` (true negation)
- `"foo -bar baz"` (true negation)
- `'-"exact phrase"'` (true negation)
Also verified manually against a 7,665-document production index: `vec: "multi-agent orchestration"` returns results (88% top hit) instead of the negation error.
Relationship to #418
This is a scope-reduced follow-up to #418, which you closed on 2026-04-05 with:
#418 bundled two independent changes:
1. `sanitizeFTS5Term()`: preserve hyphens so the `unicode61` tokenizer splits symmetrically at query time.
2. `validateSemanticQuery()` negation regex fix: this PR.
Since #418 was closed you took a different (and arguably cleaner) approach for (1): `sanitizeHyphenatedTerm()`, which splits on `-` into separate tokens. That change landed on `main` and already addresses the lex-side hyphen problem.
Piece (2), the `validateSemanticQuery()` false positive, is an independent bug still present on current `main` and is unaffected by `sanitizeHyphenatedTerm()`. This PR isolates only that fix, with no overlap with your hyphenated-term work.
Fixes #414
Environment
`main` (post-rebase, includes `sanitizeHyphenatedTerm` and the new `rerank` parameter)