Problem
codedb search <term> has very low recall: it returns a tiny candidate set
and omits files that are the right answer, even when the term is pervasive in the
index. This is distinct from #537 — that issue is about re-ranking the
candidates codedb returns; this is about codedb never returning the relevant
file at all, so no re-ranker can recover it from search output.
engram's codedb-report (git history as ground truth) flags 29/30 changed
files as "not a lexical hit" on codedb's own repo: the file a commit touched
does not appear among the results codedb search returns.
Failing test case (red)
Per CONTRIBUTING.md ("Red-To-Green"), an exact repro on this repo:
codedb . search content
# ✓ 2 results for "content"
# src/mcp.zig:208 ...
# src/test_core.zig:588 ...
codedb . word content
# ✓ 2658 hits for 'content'
- The content commit changed src/snapshot.zig, but codedb search content
omits it entirely (only 2 results) while codedb word content shows the
term is pervasive (2658 hits).
- Expected: search surfaces the relevant file(s) as candidates.
- Actual: the relevant file is absent from search output.
Reproduce the aggregate with engram codedb-report . 30 → "29/30 ... isn't a
lexical hit".
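For readers who want to check the aggregate by hand, here is a rough sketch of the kind of recall check described above. It is not engram's implementation: the function name recall_report, the (term, changed_file) pair shape, and the way a query term is chosen per commit are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def recall_report(
    commits: Iterable[Tuple[str, str]],
    search: Callable[[str], List[str]],
) -> Tuple[int, int, List[Tuple[str, str]]]:
    """Count commits whose changed file is missing from search results.

    commits: hypothetical (query_term, changed_file) pairs taken from git history.
    search:  any callable mapping a term to the file paths it returns.
    Returns (missed, total, list_of_missed_pairs).
    """
    commits = list(commits)
    misses = [
        (term, path)
        for term, path in commits
        if path not in set(search(term))
    ]
    return len(misses), len(commits), misses


if __name__ == "__main__":
    # Stand-in for `codedb . search content`, using the two files from the repro.
    def fake_search(term: str) -> List[str]:
        return ["src/mcp.zig", "src/test_core.zig"] if term == "content" else []

    missed, total, pairs = recall_report(
        [("content", "src/snapshot.zig")], fake_search
    )
    print(f"{missed}/{total} changed files not returned by search: {pairs}")
```

With the repro data, src/snapshot.zig counts as a miss because it is not among the two files search returns.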
Suggested fix
Broaden search recall so relevant files surface as candidates:
- when trigram/line search returns few results, fall back to / blend the word
and symbol inverted indexes, and/or
- fold structural signals (import-centrality, symbol graph) into search recall,
not just ranking.
Pair with #537 (re-ranking) so recovered candidates also rank well.
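To make the first option concrete, a minimal sketch of the fallback idea follows, in Python for brevity rather than the repo's own language. Everything here is hypothetical: search_with_fallback, the min_results threshold, and the lookup callables stand in for codedb's trigram/line search and its word and symbol indexes, whose real interfaces are not shown in this issue.

```python
from typing import Callable, List, Sequence

Lookup = Callable[[str], List[str]]  # term -> candidate file paths


def search_with_fallback(
    term: str,
    primary: Lookup,
    fallbacks: Sequence[Lookup],
    min_results: int = 10,  # assumed threshold; would need tuning
) -> List[str]:
    """Return primary results, widened with fallback indexes when they are sparse."""
    # Deduplicate primary results while preserving their order.
    results = list(dict.fromkeys(primary(term)))
    if len(results) >= min_results:
        return results  # enough candidates, leave the result set alone

    seen = set(results)
    for lookup in fallbacks:
        for path in lookup(term):
            if path not in seen:
                seen.add(path)
                results.append(path)  # appended after the primary hits
    return results


if __name__ == "__main__":
    # Toy stand-ins for the trigram/line search and the word index.
    trigram = lambda t: ["src/mcp.zig", "src/test_core.zig"]
    word_index = lambda t: ["src/snapshot.zig", "src/mcp.zig"]
    print(search_with_fallback("content", trigram, [word_index]))
```

This only widens the candidate set. The order of the appended files is arbitrary, which is why it would still need the re-ranking work in #537.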