Skip to content

perf: use FTS5 BM25 ranking for search results + fix phrase matching #293

@100yenadmin

Description

@100yenadmin

Context

A deep investigation into LCM's search architecture reveals that the recency-only ordering in lcm_grep was a deliberate design choice, not an oversight. The DAG structure itself is the relevance filter — terms that survive multi-level compression are already high-signal. The "smart" retrieval path (lcm_expand_query) uses sub-agent LLM reasoning for relevance, which is far more sophisticated than BM25.

However, for users with very long histories (200K+ messages), the inability to surface relevant older summaries when token budgets are tight is a real gap. FTS5 already computes BM25 scores — they're retrieved from the DB but discarded at sort time.

The Three Search Paths

Tool Purpose Sort Order Who Uses It
lcm_grep Keyword lookup → returns IDs for follow-up Recency Main agent (routing)
lcm_describe Inspect a summary node's subtree structure N/A Sub-agent
lcm_expand_query Autonomous sub-agent does multi-step retrieval Sub-agent's judgment Main agent (deep recall)
lcm_expand Low-level DAG traversal By depth Sub-agent only

Why Recency Was Chosen (Intentional)

  1. DAG = relevance filter. Terms surviving multi-level compression are already high-signal. BM25 on top double-counts.
  2. lcm_grep is a routing tool. Returns IDs + snippets for follow-up, not final answers. Recency is a reasonable routing default.
  3. lcm_expand_query handles "smart" retrieval. Sub-agent LLM reasoning >> BM25 heuristics.

Where BM25 Would Help (Without Conflicting)

1. Add configurable sort to lcm_grep (PRIMARY RECOMMENDATION)

Add a sort parameter: "recency" (default), "relevance" (ORDER BY FTS5 rank), or "hybrid".

  • Preserves backward compatibility (recency default)
  • RAG+LCM hybrid users get relevance option
  • Pure LCM users keep current behavior
  • FTS5 rank is already computed — just needs to be used in ORDER BY

2. Flip sort priority in describeAndExpand (INTERNAL OPTIMIZATION)

expansion.ts line 186-194 currently sorts recency-first, rank-as-tiebreaker. Flip to rank-first, recency-tiebreaker. This determines which summaries consume the expansion token budget — putting the most relevant ones first improves expand quality.

3. Fix phrase matching in fts5-sanitize.ts

"error handling" is split into "error" "handling" (AND of two terms) instead of preserved as a phrase. Small sanitizer fix.

4. Add periodic FTS optimization

After compaction, call INSERT INTO messages_fts(messages_fts) VALUES('optimize') to defragment. Prevents gradual search degradation.

5. Add messages_fts_cjk trigram table

CJK trigram search only covers summaries. Messages fall back to slow LIKE.

Where BM25 Could Hurt

  • Stale depth-2 summaries displacing fresh context — BM25 doesn't understand DAG depth
  • Double-counting importance — DAG compression + BM25 frequency compounds bias toward older compressed nodes
  • Breaking agent expectations — agents expect chronological grep results

LCM vs RAG

Aspect RAG LCM
Retrieval Vector similarity (finds paraphrases) Keyword + DAG traversal + sub-agent LLM
Ranking Embedding distance Recency + BM25-lite (assembler only)
Losslessness Lossy (chunks may miss context) Lossless (every message recoverable)
Smart retrieval Top-K by similarity Sub-agent with multi-step reasoning

LCM's key advantage: every message is recoverable via DAG traversal. RAG loses context at chunking boundaries. LCM's gap: keyword search can't find semantic matches (synonyms, paraphrases).

Proposed Implementation

Files to change

  • src/tools/lcm-grep-tool.ts — add sort parameter to schema
  • src/retrieval.ts — add sort option to GrepInput, apply in grep()
  • src/expansion.ts — flip sort priority in describeAndExpand
  • src/store/fts5-sanitize.ts — fix phrase matching
  • src/db/migration.ts — add messages_fts_cjk table + periodic FTS optimize

Configuration

The sort parameter would be per-tool-call (agent chooses per query), not a global config. This lets the agent use recency for "what just happened?" and relevance for "find the discussion about X from last week."

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions