search: low recall — relevant files omitted entirely (2 results where word finds 2658 hits) #539

@justrach

Problem

codedb search <term> has very low recall: it returns a tiny candidate set
and omits files that are the right answer, even when the term is pervasive in the
index. This is distinct from #537: that issue is about re-ranking the
candidates codedb returns; this one is about codedb never returning the relevant
file at all, so no re-ranker can recover it from search output.

engram's codedb-report, which uses git history as ground truth, flags 29/30
changed files as "not a lexical hit" on codedb's own repo: in each of those
cases the file a commit touched does not appear in codedb search's results.
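
To make the 29/30 figure concrete, here is a minimal sketch of this kind of check. It is not engram's implementation, which is not described in this issue; the function name lexical_miss_report and the input shape are assumptions made for illustration.

```python
def lexical_miss_report(cases):
    """Count ground-truth files that are absent from their search results.

    cases: iterable of (changed_file, search_results) pairs, where
    changed_file is the path a commit touched (ground truth) and
    search_results is the list of paths the search returned.
    Returns (misses, total).
    """
    cases = list(cases)
    misses = sum(1 for changed, results in cases if changed not in results)
    return misses, len(cases)


def recall(misses, total):
    """Fraction of ground-truth files that the search did return."""
    return (total - misses) / total if total else 0.0
```

Under this definition, 29 misses out of 30 corresponds to a recall of 1/30, roughly 3%.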

Failing test case (red)

Per CONTRIBUTING.md ("Red-To-Green"), an exact repro on this repo:

codedb . search content
#   ✓ 2 results for "content"
#     src/mcp.zig:208 ...
#     src/test_core.zig:588 ...
codedb . word content
#   ✓ 2658 hits for 'content'

  • The content commit changed src/snapshot.zig, but codedb search content
    omits it entirely (only 2 results) while codedb word content shows the
    term is pervasive (2658 hits).
  • Expected: search surfaces the relevant file(s) as candidates.
  • Actual: the relevant file is absent from search output.

Reproduce the aggregate with engram codedb-report . 30 → "29/30 ... isn't a
lexical hit".

Suggested fix

Broaden search recall so relevant files surface as candidates:

  • when trigram/line search returns few results, fall back to / blend the word
    and symbol inverted indexes, and/or
  • fold structural signals (import-centrality, symbol graph) into search recall,
    not just ranking.
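
A minimal sketch of the first option, written in Python for brevity rather than in codedb's own code. Everything here is hypothetical: the names build_word_index and search_with_fallback, the MIN_CANDIDATES threshold, and the in-memory index layout are assumptions for illustration, not codedb's API or data structures.

```python
from collections import defaultdict

# Hypothetical threshold: below this many primary hits, consult the word index.
MIN_CANDIDATES = 10


def build_word_index(files):
    """Map each lowercased word to {path: occurrence count}."""
    index = defaultdict(dict)
    for path, text in files.items():
        for raw in text.split():
            word = raw.strip(".,;:()[]{}\"'").lower()
            if word:
                index[word][path] = index[word].get(path, 0) + 1
    return index


def search_with_fallback(term, primary_hits, word_index,
                         min_candidates=MIN_CANDIDATES):
    """Return candidate paths: primary hits first, then word-index files.

    primary_hits stands in for whatever the trigram/line search returned.
    If it is already large enough, it is returned unchanged; otherwise
    files from the word index are appended, most occurrences first.
    """
    candidates = list(dict.fromkeys(primary_hits))  # de-duplicate, keep order
    if len(candidates) >= min_candidates:
        return candidates
    postings = word_index.get(term.lower(), {})
    seen = set(candidates)
    extra = sorted((p for p in postings if p not in seen),
                   key=lambda p: (-postings[p], p))
    return candidates + extra
```

With toy file contents (invented here, not the real sources), a file that the primary search missed is recovered as a candidate:

```python
files = {
    "src/mcp.zig": "content type header",
    "src/test_core.zig": "expect content",
    "src/snapshot.zig": "write content then read content back",
}
index = build_word_index(files)
print(search_with_fallback("content", ["src/mcp.zig", "src/test_core.zig"], index))
# ['src/mcp.zig', 'src/test_core.zig', 'src/snapshot.zig']
```

Appending rather than interleaving keeps the existing results stable and leaves the final ordering to the re-ranker.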

Pair with #537 (re-ranking) so recovered candidates also rank well.

Labels: bug (Something isn't working), priority:p2 (Medium priority)
