feat: graph-aware ranking via call-graph centrality (+15% MRR) #523
…lity)
Phase 1 of the graphify-informed precision work: deterministic call-site extraction, name resolution into weighted edges, in-degree centrality. Isolated + tested (673/673); not yet wired into ranking. Paused for the mcpsync incident; resume to wire centrality into searchContentRanked.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…entRanked
Wires the codegraph foundation into the real index + ranking (Phase 1b of the graphify-informed precision work).
On first ranked search, builds a per-file "call centrality" map once: resolve each function's call sites (codegraph extractCallees) through the function symbol table, accumulate weighted in-degree per callee, aggregate per file. searchContentRanked multiplies each candidate's score by 1 + 0.15*log(1+centrality). This is an ADDITIVE boost in effect (the multiplier is never below 1), never a filter, so a misresolved edge can only nudge a heavily-called (central) file up, never drop a real result. Build is mutex-guarded + idempotent; reads under the existing shared lock. On by default; CODEDB_NO_CENTRALITY disables.
MRR-gated on the codedb repo (18 labeled multi-word queries, A/B via the env toggle): MRR 0.819 -> 0.944 (+0.125), P@1 12 -> 16, recall@5 18/18 unchanged; 4 queries' correct file jumped to rank 1, zero regressed. 673/673 tests pass.
Follow-up: persist centrality in the snapshot to remove the one-time first-query build cost on large repos.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
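The boost described in this commit message can be illustrated with a minimal Python sketch (the PR itself is Zig). It assumes `log` means the natural logarithm and that any non-empty value of `CODEDB_NO_CENTRALITY` disables the boost; neither detail is stated in the PR. The names `centrality_boost` and `rerank` are hypothetical.

```python
import math
import os

BOOST_WEIGHT = 0.15  # coefficient quoted in the commit message


def centrality_boost(centrality: float) -> float:
    """Multiplier 1 + 0.15*log(1 + centrality); exactly 1.0 at zero centrality.

    Assumption: natural log. Negative inputs are clamped to 0 so the
    multiplier can never fall below 1.
    """
    return 1.0 + BOOST_WEIGHT * math.log1p(max(centrality, 0.0))


def rerank(candidates, centrality_by_file, env=None):
    """candidates: list of (path, score). Returns a new list sorted by score.

    No candidate is ever removed; files absent from the map get centrality 0.
    """
    env = os.environ if env is None else env
    if env.get("CODEDB_NO_CENTRALITY"):  # assumed toggle semantics
        boosted = list(candidates)
    else:
        boosted = [
            (path, score * centrality_boost(centrality_by_file.get(path, 0.0)))
            for path, score in candidates
        ]
    return sorted(boosted, key=lambda ps: ps[1], reverse=True)
```

Because the multiplier is at least 1 and the candidate list is never filtered, a wrong edge can only change ordering, which matches the "never drop a real result" claim.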
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 542590dcc5
    // Aggregate weighted in-degree per file. Keys borrow stable outlines keys.
    var cmap = std.StringHashMap(f32).init(self.allocator);
    for (node_path.items, in_degree) |path, deg| {
        if (deg == 0) continue;
        const gop = cmap.getOrPut(path) catch continue;
Invalidate centrality when indexed files change
Because this cache stores borrowed outlines keys and is built only once, an incremental update after the first ranked search leaves call_centrality stale; in the delete case removeFile frees the same stable path slice, so later centralityBoost lookups can probe a StringHashMap containing dangling keys in a long-running watcher/MCP process. Please clear/rebuild this map whenever commitParsedFileOwnedOutline, removeFile, or word-index replacement changes the indexed file set.
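One way to address this feedback is to pair the lazy build with an explicit invalidation hook. The Python sketch below illustrates that idea only: `CentralityCache` and `build_fn` are hypothetical names, and Python cannot reproduce the dangling-slice hazard itself. In Zig the analogous fix would presumably be to clear the map (or have it own duplicated keys) before any path slice is freed.

```python
import threading


class CentralityCache:
    """Lazily built per-file centrality map, dropped when the file set changes."""

    def __init__(self, build_fn):
        self._build_fn = build_fn  # hypothetical: returns {path: centrality}
        self._lock = threading.Lock()
        self._map = None  # None means "needs (re)build"

    def invalidate(self):
        """Call from every code path that adds, replaces, or removes a file."""
        with self._lock:
            self._map = None

    def get(self, path):
        with self._lock:
            if self._map is None:
                # Take a private copy so later index mutations cannot
                # change the cached map behind our back.
                self._map = dict(self._build_fn())
            return self._map.get(path, 0.0)
```

The rebuild stays lazy, so a burst of file changes costs one rebuild on the next ranked search rather than one per change.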
Benchmark Regression Report
Thresholds: 10.00% and 50,000 ns absolute delta
Summary
The graphify-informed precision work: a deterministic, LLM-free resolved call graph whose centrality feeds an additive ranking boost. Two commits — the reusable foundation + the index/ranking integration.
Foundation: src/codegraph.zig
- extractCallees(body): walks a function body for call sites (an identifier followed by "("), filters cross-language keywords and control flow, and dedups.
- buildEdges(funcs, resolve): resolves callee names into weighted edges (an ambiguous name is split 1/N across its candidates, graphify's confidence idea).
- inDegreeCentrality(edges): weighted "who's called most" (graphify's "god node" signal).
Integration: Explorer/searchContentRanked
- On first ranked search, ensureCallCentrality builds a per-file centrality map once (mutex-guarded, idempotent): resolve each function's call sites through the function symbol table, accumulate weighted in-degree per callee, aggregate per file.
- searchContentRanked multiplies each candidate's score by 1 + 0.15·log(1+centrality).
- Always additive in effect (the multiplier is never below 1), never a filter: a misresolved edge can only nudge a heavily-called file up, never drop a real result.
- On by default; CODEDB_NO_CENTRALITY disables.
MRR-gated (kept only because it lifts)
Codedb repo, 18 labeled multi-word queries, A/B via the env toggle (centrality off -> on):
- MRR: 0.819 -> 0.944
- P@1: 12 -> 16
- recall@5: 18/18 -> 18/18
+0.125 MRR (+15%), +4 P@1, no recall loss: 4 queries' correct file jumped to rank 1, zero regressed. Reproduced on a clean cold index.
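For readers unfamiliar with the metrics, the sketch below shows how MRR, P@1, and recall@k are conventionally computed from per-query ranks, and how the quoted +15% follows from the two MRR values. The function names are hypothetical, the rank list is toy data (not the PR's 18 queries), and the PR's own evaluation harness is not shown.

```python
def mrr(ranks):
    """Mean reciprocal rank. ranks: 1-based rank of the correct file per
    query, or None when the correct file was not returned."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)


def precision_at_1(ranks):
    """Number of queries whose correct file is ranked first."""
    return sum(1 for r in ranks if r == 1)


def recall_at_k(ranks, k):
    """Number of queries whose correct file appears in the top k."""
    return sum(1 for r in ranks if r and r <= k)


def relative_lift(before, after):
    """Relative change, e.g. 0.819 -> 0.944 is about +15%."""
    return (after - before) / before


toy_ranks = [1, 2, 1, 4]  # illustrative only
toy_mrr = mrr(toy_ranks)  # (1 + 1/2 + 1 + 1/4) / 4 = 0.6875
lift = relative_lift(0.819, 0.944)
```

With the reported figures, `relative_lift(0.819, 0.944)` is roughly 0.153, which is where the "+15%" in the title comes from.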
zig build test → 673/673 (adds codegraph unit tests; existing ranked-search tests are unaffected because their tiny corpora yield ~zero centrality).
Follow-up
Persist centrality in the snapshot to remove the one-time first-query build cost on large repos.
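For reference, the Foundation steps described in this summary (call-site extraction, 1/N edge weighting, weighted in-degree, per-file aggregation) can be sketched in Python as follows. This is an illustration under assumptions, not the code in src/codegraph.zig: the keyword list, the regex, the function-id format, and the `per_file` helper are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed keyword filter; the PR's actual list is not shown.
KEYWORDS = {"if", "for", "while", "switch", "return", "catch", "fn", "sizeof"}
CALL_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def extract_callees(body):
    """Unique identifiers followed by '(' in first-seen order, minus keywords."""
    seen = []
    for name in CALL_RE.findall(body):
        if name not in KEYWORDS and name not in seen:
            seen.append(name)
    return seen


def build_edges(funcs, symbols):
    """funcs: {caller_id: body}; symbols: {name: [candidate callee ids]}.

    A name with N candidates yields N edges of weight 1/N; an unresolved
    name yields no edge.
    """
    edges = []
    for caller, body in funcs.items():
        for name in extract_callees(body):
            targets = symbols.get(name, [])
            for target in targets:
                edges.append((caller, target, 1.0 / len(targets)))
    return edges


def in_degree_centrality(edges):
    """Sum of incoming edge weights per callee."""
    degree = defaultdict(float)
    for _caller, callee, weight in edges:
        degree[callee] += weight
    return dict(degree)


def per_file(centrality, file_of):
    """Aggregate function-level centrality into a per-file total."""
    totals = defaultdict(float)
    for func_id, value in centrality.items():
        totals[file_of[func_id]] += value
    return dict(totals)
```

Splitting an ambiguous name 1/N keeps the total weight contributed by one call site at 1, so a common name such as `init` cannot inflate every file that defines it.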