BM25 search fails on snake_case identifiers (sanitizeFTS5Term strips underscores) #305

@Raymondrrb

Problem

Searching for Python/Ruby/Rust snake_case identifiers returns 0 results via BM25 search.

Example: documents contain atomic_write_json as a function name, but:

qmd search "atomic_write_json"    # → 0 results
qmd search "atomic write json"    # → 3 results, 88% score

Root Cause

In store.js, sanitizeFTS5Term replaces every character that is not a letter, digit, or apostrophe with an empty string instead of a space:

// Current (line ~1511):
function sanitizeFTS5Term(term) {
    return term.replace(/[^\p{L}\p{N}']/gu, '').toLowerCase();
}

This collapses atomic_write_json → atomicwritejson (a single token).

However, at index time, FTS5's unicode61 tokenizer correctly splits on _, storing atomic, write, json as separate tokens. The mismatch means the query can never match.

Additionally, buildFTS5Query splits only on \s+ (whitespace), so even if _ were preserved, it would still be treated as a single term.
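The mismatch can be reproduced outside qmd with a minimal sketch. `sanitizeFTS5Term` is copied from the snippet above. `approximateUnicode61` is a hypothetical stand-in that only imitates the default separator behavior of FTS5's unicode61 tokenizer (it is not the real tokenizer and does no diacritic folding), and `splitQuery` is a hypothetical helper that mirrors just the `\s+` split described for `buildFTS5Query`.

```javascript
// Current behavior, as quoted in this issue.
function sanitizeFTS5Term(term) {
    return term.replace(/[^\p{L}\p{N}']/gu, '').toLowerCase();
}

// Hypothetical approximation of index-time tokenization: treat every
// non-letter, non-digit character as a separator. Not the real tokenizer.
function approximateUnicode61(text) {
    return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Hypothetical helper: mirrors only the whitespace split attributed
// to buildFTS5Query, then sanitizes each term.
function splitQuery(query) {
    return query.split(/\s+/).map(sanitizeFTS5Term).filter((t) => t.length > 0);
}

const indexed = approximateUnicode61('def atomic_write_json(path, data):');
const queried = splitQuery('atomic_write_json');

console.log(indexed); // [ 'def', 'atomic', 'write', 'json', 'path', 'data' ]
console.log(queried); // [ 'atomicwritejson' ]
console.log(queried.every((t) => indexed.includes(t))); // false
```

The query side produces one fused token that never appears among the indexed tokens, which is consistent with the 0-result search shown above.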

Fix (2-line change)

 function sanitizeFTS5Term(term) {
-    return term.replace(/[^\p{L}\p{N}']/gu, '').toLowerCase();
+    return term.replace(/[^\p{L}\p{N}']/gu, ' ').toLowerCase().trim();
 }
 function buildFTS5Query(query) {
-    const terms = query.split(/\s+/)
+    const terms = query.split(/[\s_]+/)

Why this works: replacing non-alnum with space (instead of empty) makes atomic_write_json → atomic write json → 3 AND terms, matching the FTS5 index. The _ in the split pattern handles edge cases where sanitize alone doesn't split (e.g., already-sanitized input).

No regressions: normal multi-word queries ("davinci assembler render") continue to work identically since spaces are already the split delimiter.
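A minimal sketch of the patched behavior, usable as a quick regression check. Only the two changed lines come from the diff above. The rest of `buildFTS5Query` is not shown in this issue, so the filtering, quoting, and `AND` join below are assumptions, and the `Patched` suffixes are hypothetical names.

```javascript
// Patched sanitizer: the replacement line is taken from the diff above.
function sanitizeFTS5TermPatched(term) {
    return term.replace(/[^\p{L}\p{N}']/gu, ' ').toLowerCase().trim();
}

// Only the split pattern comes from the diff. Everything after it is an
// assumed shape for illustration, not qmd's actual implementation.
function buildFTS5QueryPatched(query) {
    const terms = query
        .split(/[\s_]+/)
        .map(sanitizeFTS5TermPatched)
        .filter((t) => t.length > 0);
    if (terms.length === 0) return null;
    // Quote each term and require all of them to match.
    return terms.map((t) => `"${t}"`).join(' AND ');
}

console.log(buildFTS5QueryPatched('atomic_write_json'));
// "atomic" AND "write" AND "json"
console.log(buildFTS5QueryPatched('davinci assembler render'));
// "davinci" AND "assembler" AND "render"
```

Under these assumptions, a term containing other punctuation, such as `foo-bar`, passes through the split intact and is sanitized to `foo bar`, so it ends up as a single quoted phrase rather than two separate AND terms.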

Context

This is especially impactful for code indexing use cases where project_code collections contain Python/Ruby/Rust source files full of snake_case identifiers — the primary search vocabulary for developers.

Environment

  • qmd 1.0.7 (a0bd077aaf)
  • macOS, Homebrew install
