Context
codedb's search is already multi-signal: searchContentRanked (src/explore.zig:3135) ranks with BM25/BM25+ (lines ~3230-3260) × call-graph centrality (centralityBoost, line 3286; built in ensureCallGraph, ~2957) × path/test/doc penalties (pathRelevanceMultiplier, ~2906-2944), and context blends BM25 × symbol-definition boost (src/mcp.zig). So this is not "add structure to ranking" — it's two signals that are genuinely absent and that target the hard case: disambiguating the target file among many same-keyword hits.
Motivation: in an ALMA-style retrieval experiment (engram re-ranking codedb features), the retrieval ceiling on a large repo (openclaw, 13.6k files) sat at ~0.30 MRR — a signal limit: when a query identifier matches many files, the current features (lexical, global centrality, name-match, degree) can't tell which file a change actually targets. Both missing signals below are query-specific, unlike codedb's current global signals.
1. Query-specific call-graph distance
codedb builds the call graph and uses it for global centrality (per-file importance) — but ranking has no query-specific graph signal: how close (in call/import hops) is this candidate to the symbols the query matched? findCallPath (src/explore.zig:3087) exists but only for navigation, not ranking. Folding "graph distance to matched symbols" into the score would prefer files structurally near the query's definitions over distant same-keyword files.
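One way to picture the signal described above is a multi-source breadth-first search from the files that define the query's matched symbols, with hop counts turned into a multiplicative boost. The sketch below is in Python for brevity, not codedb's Zig. The function names, the undirected treatment of call/import edges, the hop cap, and the boost constants are all assumptions for illustration.

```python
from collections import deque

def undirected(edges):
    """Build an adjacency map that treats call/import edges as undirected (assumption)."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set()).add(src)
    return graph

def hops_to_matched(graph, matched_files, max_hops=3):
    """Multi-source BFS: hop distance from each reachable file to the nearest matched file.

    Files farther than max_hops, or not connected at all, are absent from the result.
    """
    dist = {f: 0 for f in matched_files}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand past the hop cap
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def distance_boost(hops, max_boost=1.5, decay=0.5):
    """Hypothetical boost: max_boost at 0 hops, decaying toward 1.0; None means no boost."""
    if hops is None:
        return 1.0
    return 1.0 + (max_boost - 1.0) * (decay ** hops)

# Toy graph: a -> b -> c -> d, plus an unrelated pair x -> y.
edges = [("a.zig", "b.zig"), ("b.zig", "c.zig"), ("c.zig", "d.zig"), ("x.zig", "y.zig")]
dist = hops_to_matched(undirected(edges), ["a.zig"])
boosts = {f: distance_boost(dist.get(f)) for f in ("a.zig", "b.zig", "c.zig", "x.zig")}
```

In this toy graph `b.zig` sits one hop from the matched file and gets a larger boost than `c.zig`, while `x.zig` is unreachable and keeps a neutral multiplier of 1.0. Since the existing score is already a product of factors, a factor of this shape could in principle sit alongside them.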
2. Git co-change
codedb is git-history-blind: it reads only the HEAD SHA (src/git.zig, for snapshot invalidation) and uses file mtime as its only temporal signal — no commit-log / co-change. "Files historically changed together" is a strong signal for which file a task touches. Parsing git log into a co-change graph adds a high-value, query-relevant signal. (Needs real history; shallow clones won't have it.)
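A minimal sketch of the co-change idea, in Python and not codedb's Zig: parse `git log --name-only` output into per-commit file sets, count how often each pair of files appears in the same commit, and score a candidate by its co-change count with the files the query already matched. The `@@` commit marker, the bulk-commit cutoff, and every function name here are assumptions for illustration.

```python
import subprocess
from collections import Counter
from itertools import combinations

COMMIT_MARK = "@@"  # hypothetical prefix so commit lines are easy to tell from file paths

def read_git_log(repo_dir):
    """Run git log and return its text; a shallow clone yields only a short history."""
    cmd = ["git", "-C", repo_dir, "log", "--name-only",
           f"--pretty=format:{COMMIT_MARK}%H"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def parse_commits(log_text):
    """Split log text into one set of changed file paths per commit."""
    commits, current = [], None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(COMMIT_MARK):
            current = set()
            commits.append(current)
        elif line and current is not None:
            current.add(line)
    return commits

def co_change_counts(commits, max_files=50):
    """Count commits in which each unordered pair of files changed together."""
    pair_counts = Counter()
    for files in commits:
        if len(files) < 2 or len(files) > max_files:
            continue  # skip single-file commits and bulk commits (assumed cutoff)
        for a, b in combinations(sorted(files), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def co_change_score(pair_counts, candidate, anchors):
    """Sum of co-change counts between a candidate and the query-matched anchor files."""
    return sum(pair_counts[tuple(sorted((candidate, a)))]
               for a in anchors if a != candidate)

sample_log = """@@c1
src/a.zig
src/b.zig

@@c2
src/a.zig
src/b.zig
src/c.zig

@@c3
src/d.zig
"""
counts = co_change_counts(parse_commits(sample_log))
score_b = co_change_score(counts, "src/b.zig", ["src/a.zig"])
score_d = co_change_score(counts, "src/d.zig", ["src/a.zig"])
```

In the sample history `src/b.zig` changed alongside `src/a.zig` in two commits, so it outscores `src/d.zig`, which never changed with it. Raw counts favor frequently edited files, so a real integration would probably normalize them, for example by each file's total commit count.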
(minor) Richer file-role
Role handling today is binary doc/code (isDocLanguage, src/explore.zig:191) + a heuristic test penalty (0.6x, ~2921). A richer multi-class role (config / generated / test / impl) used in ranking could route queries (a "config" query → .toml/.json; a "runtime" query → impl).
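As a rough illustration of a multi-class role, the Python sketch below classifies a path as generated, test, doc, config, or impl and maps the role to a ranking multiplier. The extension sets, directory names, and every multiplier except the existing 0.6x test penalty are placeholder assumptions, and `classify_role` and `role_multiplier` are hypothetical names, not codedb functions.

```python
from pathlib import PurePosixPath

# Placeholder heuristics; a real classifier would be tuned per ecosystem.
CONFIG_EXTS = {".toml", ".json", ".yaml", ".yml", ".ini"}
DOC_EXTS = {".md", ".rst", ".txt"}
GENERATED_DIRS = {"generated", "gen", "dist", "build"}
TEST_DIRS = {"test", "tests", "spec", "__tests__"}

# Only the 0.6 test penalty comes from the current ranking; the rest are assumed.
BASE_MULTIPLIER = {"impl": 1.0, "config": 0.8, "doc": 0.7, "test": 0.6, "generated": 0.4}

def classify_role(path):
    """Return one of: generated, test, doc, config, impl."""
    p = PurePosixPath(path)
    dirs = {part.lower() for part in p.parts[:-1]}
    name = p.name.lower()
    suffix = p.suffix.lower()
    if dirs & GENERATED_DIRS or ".generated." in name or name.endswith(".min.js"):
        return "generated"
    if (dirs & TEST_DIRS or name.startswith("test_")
            or "_test." in name or ".test." in name or ".spec." in name):
        return "test"
    if suffix in DOC_EXTS:
        return "doc"
    if suffix in CONFIG_EXTS:
        return "config"
    return "impl"

def role_multiplier(role, query_role=None):
    """Boost files whose role matches the query's inferred role; otherwise use the base table."""
    if query_role is not None and role == query_role:
        return 1.3  # hypothetical boost for a role match
    return BASE_MULTIPLIER.get(role, 1.0)

roles = {p: classify_role(p) for p in (
    "config/app.toml", "src/explore.zig", "src/tests/explore_test.zig",
    "dist/bundle.min.js", "docs/README.md")}
```

With a table like this, a "config" query would lift `config/app.toml` above implementation files, while a query with no inferred role falls back to the base penalties. How the query's role is inferred is left open here.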
Why now
All three extend machinery codedb already has (call graph, BM25, path heuristics) and attack the disambiguation ceiling that BM25 and global centrality cannot break on their own.
Surfaced via an ALMA-style retrieval experiment over codedb features.