Enhancement: query-specific retrieval signals (call-graph distance to matched symbols; git co-change) #550

@justrach

Description

Context

codedb's search is already multi-signal. searchContentRanked (src/explore.zig:3135) multiplies three factors: BM25/BM25+ (lines ~3230-3260), call-graph centrality (centralityBoost, line 3286; built in ensureCallGraph, ~2957), and path/test/doc penalties (pathRelevanceMultiplier, ~2906-2944). Separately, context blends BM25 × symbol-definition boost (src/mcp.zig). So this is not "add structure to ranking": it's two signals that are genuinely absent and that target the hard case, disambiguating the target file among many same-keyword hits.

Motivation: in an ALMA-style retrieval experiment (engram re-ranking codedb features), the retrieval ceiling on a large repo (openclaw, 13.6k files) sat at ~0.30 MRR — a signal limit: when a query identifier matches many files, the current features (lexical, global centrality, name-match, degree) can't tell which file a change actually targets. Both missing signals below are query-specific, unlike codedb's current global signals.

1. Query-specific call-graph distance

codedb builds the call graph and uses it for global centrality (per-file importance) — but ranking has no query-specific graph signal: how close (in call/import hops) is this candidate to the symbols the query matched? findCallPath (src/explore.zig:3087) exists but only for navigation, not ranking. Folding "graph distance to matched symbols" into the score would prefer files structurally near the query's definitions over distant same-keyword files.
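A minimal sketch of how such a signal could be computed, written in Python for brevity rather than codedb's Zig. It assumes a file-level adjacency map with edges treated as undirected, plus a hop budget and a decay constant. All names and constants below are hypothetical and do not exist in codedb. A multi-source BFS from the files that define the matched symbols gives each candidate a hop count, which is then mapped to a multiplicative boost:

```python
from collections import deque

def hop_distances(graph, seeds, max_hops=3):
    """Multi-source BFS: hops from the nearest seed file to each reachable file.

    graph: dict mapping file -> iterable of neighbouring files (call/import
           edges, treated as undirected here; that is an assumption).
    seeds: files that define the symbols the query matched.
    """
    dist = {s: 0 for s in seeds}
    queue = deque(dist)  # iterating the dict de-duplicates the seeds
    while queue:
        node = queue.popleft()
        if dist[node] >= max_hops:
            continue  # do not expand beyond the hop budget
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def distance_boost(dist, candidate, decay=0.5, floor=1.0):
    """Hypothetical boost: 1 + decay**hops if reachable, else the neutral floor."""
    if candidate not in dist:
        return floor
    return 1.0 + decay ** dist[candidate]

# Toy graph: a <-> b <-> c, with d disconnected.
graph = {
    "a.zig": ["b.zig"],
    "b.zig": ["a.zig", "c.zig"],
    "c.zig": ["b.zig"],
    "d.zig": [],
}
dist = hop_distances(graph, ["a.zig"])
boosts = {f: distance_boost(dist, f) for f in graph}
```

Because the boost is multiplicative and falls back to 1.0 for unreachable files, it could sit alongside the existing centrality and path multipliers without penalising candidates the graph knows nothing about. Whether an exponential decay is the right shape is an open question for tuning.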

2. Git co-change

codedb is git-history-blind: it reads only the HEAD SHA (src/git.zig, for snapshot invalidation) and uses file mtime as its only temporal signal — no commit-log / co-change. "Files historically changed together" is a strong signal for which file a task touches. Parsing git log into a co-change graph adds a high-value, query-relevant signal. (Needs real history; shallow clones won't have it.)
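A rough sketch of the co-change idea, again in Python rather than Zig. It assumes history is read with `git log --name-only --pretty=format:--commit--`, where the `--commit--` marker is an arbitrary separator chosen for this example. The function names, the commit limit, and the bulk-commit threshold are hypothetical. Each commit becomes a list of touched files, and every unordered file pair in a commit increments a counter:

```python
import subprocess
from collections import Counter
from itertools import combinations

MARKER = "--commit--"  # arbitrary separator chosen for this sketch

def read_git_log(repo_path, limit=2000):
    """Return raw `git log` output: one MARKER line per commit, then its files."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--max-count={limit}",
         "--name-only", f"--pretty=format:{MARKER}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def parse_commits(log_text):
    """Split the log text into one list of file paths per commit."""
    commits = []
    for chunk in log_text.split(MARKER):
        files = [line.strip() for line in chunk.splitlines() if line.strip()]
        if files:
            commits.append(files)
    return commits

def co_change_counts(commits, max_files=50):
    """Count how often each unordered pair of files shares a commit."""
    pairs = Counter()
    for files in commits:
        unique = sorted(set(files))
        if len(unique) > max_files:
            continue  # skip bulk commits (mass renames, formatting); threshold is an assumption
        pairs.update(combinations(unique, 2))
    return pairs

# Canned log text so the example runs without a repository.
sample = (
    "--commit--\nsrc/a.zig\nsrc/b.zig\n\n"
    "--commit--\nsrc/a.zig\nsrc/b.zig\nsrc/c.zig\n\n"
    "--commit--\nREADME.md\n"
)
counts = co_change_counts(parse_commits(sample))
```

At query time, the counts for pairs involving files that already rank highly (or that define matched symbols) could be normalised and folded into the score. On a shallow clone the log is truncated, so the counter would simply be sparse or empty.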

(minor) Richer file-role

Role handling today is binary doc/code (isDocLanguage, src/explore.zig:191) + a heuristic test penalty (0.6x, ~2921). A richer multi-class role (config / generated / test / impl) used in ranking could route queries (a "config" query → .toml/.json; a "runtime" query → impl).
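For illustration, a multi-class role could start as a path-based classifier like the Python sketch below. The suffix sets, directory names, and generated-file markers are assumptions chosen for the example, not rules taken from codedb, and a real version would likely need per-language tuning:

```python
from pathlib import PurePosixPath

# Hypothetical rule tables; none of these come from codedb.
CONFIG_SUFFIXES = {".toml", ".json", ".yaml", ".yml", ".ini"}
DOC_SUFFIXES = {".md", ".rst", ".txt"}
GENERATED_MARKERS = (".generated.", ".pb.", ".min.")
TEST_DIRS = {"test", "tests", "__tests__"}

def file_role(path):
    """Classify a repo-relative path as generated / test / doc / config / impl."""
    p = PurePosixPath(path)
    name = p.name.lower()
    dirs = {part.lower() for part in p.parts[:-1]}
    suffix = p.suffix.lower()
    if "generated" in dirs or any(m in name for m in GENERATED_MARKERS):
        return "generated"
    if dirs & TEST_DIRS or name.startswith("test_") or "_test." in name or ".test." in name:
        return "test"
    if suffix in DOC_SUFFIXES:
        return "doc"
    if suffix in CONFIG_SUFFIXES:
        return "config"
    return "impl"  # default: treat everything else as implementation code

roles = {path: file_role(path) for path in [
    "src/explore.zig",
    "config/app.toml",
    "tests/test_search.py",
    "api/user.pb.go",
    "README.md",
]}
```

A query classified as "config" or "runtime" could then apply a per-role multiplier, in the same way the existing test penalty is applied today.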

Why now

All three extend machinery codedb already has (call graph, BM25, path heuristics) and attack the disambiguation ceiling that BM25 + global centrality can't break on their own.

Surfaced via an ALMA-style retrieval experiment over codedb features.
