Problem
Searching for Python/Ruby/Rust snake_case identifiers returns 0 results via BM25 search.
Example: documents contain atomic_write_json as a function name, but:
qmd search "atomic_write_json" # → 0 results
qmd search "atomic write json" # → 3 results, 88% score
Root Cause
In store.js, sanitizeFTS5Term replaces non-alphanumeric characters with an empty string instead of a space:
// Current (line ~1511):
function sanitizeFTS5Term(term) {
  return term.replace(/[^\p{L}\p{N}']/gu, '').toLowerCase();
}
This collapses atomic_write_json → atomicwritejson (single token).
However, at index time, FTS5's unicode61 tokenizer correctly splits on _, storing atomic, write, json as separate tokens. The mismatch means the query can never match.
Additionally, buildFTS5Query splits only on \s+ (whitespace), so even if _ were preserved, it would still be treated as a single term.
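The mismatch can be reproduced outside qmd with a minimal sketch. sanitizeFTS5Term is copied from the snippet above; approxUnicode61Tokens is a hypothetical stand-in for index-time tokenization that only splits on non-alphanumeric characters and lowercases, so it approximates unicode61 rather than reimplementing it:

```javascript
// Current query-side behavior, as quoted in this report.
function sanitizeFTS5Term(term) {
  return term.replace(/[^\p{L}\p{N}']/gu, '').toLowerCase();
}

// Hypothetical approximation of index-time tokenization: every run of
// characters that are not letters or digits acts as a separator.
// The real unicode61 tokenizer does more (e.g. diacritic folding).
function approxUnicode61Tokens(text) {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter(Boolean);
}

const indexed = approxUnicode61Tokens('def atomic_write_json(path, data):');
const queryTerm = sanitizeFTS5Term('atomic_write_json');

console.log(indexed);                     // [ 'def', 'atomic', 'write', 'json', 'path', 'data' ]
console.log(queryTerm);                   // atomicwritejson
console.log(indexed.includes(queryTerm)); // false
```

The index holds three short tokens while the query asks for one long token, so no row can match.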
Fix (2-line change)
 function sanitizeFTS5Term(term) {
-  return term.replace(/[^\p{L}\p{N}']/gu, '').toLowerCase();
+  return term.replace(/[^\p{L}\p{N}']/gu, ' ').toLowerCase().trim();
 }
 function buildFTS5Query(query) {
-  const terms = query.split(/\s+/)
+  const terms = query.split(/[\s_]+/)
Why this works: replacing non-alphanumeric characters with a space (instead of an empty string) turns atomic_write_json → atomic write json → 3 AND terms, which match the tokens stored in the FTS5 index. Adding _ to the split pattern covers edge cases where sanitizing alone doesn't split the input (e.g., already-sanitized input).
No regressions: normal multi-word queries ("davinci assembler render") continue to work identically since spaces are already the split delimiter.
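A minimal sketch of the patched behavior follows. Only the two changed lines come from this report; the rest of buildFTS5Query (re-splitting each sanitized term on whitespace, quoting each term, joining with AND, returning null for an empty query) is an assumption made for illustration and may differ from the real store.js:

```javascript
// Patched sanitizer from the proposed fix: punctuation becomes a space.
function sanitizeFTS5Term(term) {
  return term.replace(/[^\p{L}\p{N}']/gu, ' ').toLowerCase().trim();
}

// Assumed shape of the query builder; only the split pattern is taken
// from the proposed fix.
function buildFTS5Query(query) {
  const terms = query
    .split(/[\s_]+/)
    // Assumption: re-split after sanitizing, so "foo-bar" becomes two
    // terms rather than one term containing a space.
    .flatMap((t) => sanitizeFTS5Term(t).split(/\s+/))
    .filter((t) => t.length > 0);
  if (terms.length === 0) return null;
  return terms.map((t) => `"${t}"`).join(' AND ');
}

console.log(buildFTS5Query('atomic_write_json'));
// "atomic" AND "write" AND "json"
console.log(buildFTS5Query('davinci assembler render'));
// "davinci" AND "assembler" AND "render"
```

Under these assumptions the snake_case query and its space-separated equivalent produce the same FTS5 expression, and plain multi-word queries are unchanged.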
Context
This matters most for code indexing use cases, where project_code collections contain Python/Ruby/Rust source files full of snake_case identifiers, which are the primary search vocabulary for developers.
Environment
- qmd 1.0.7 (a0bd077aaf)
- macOS, Homebrew install