Problem
The existing #363a fix established a per-file cap of max(1, max_results / 5) in searchContent Tier 0 (src/explore.zig:1525-1554) to keep a single hot doc file from saturating results. That works in the small-corpus regime (#363a's test indexed 4 docs × 4 mentions, max_results=10).
But in the high-density regime the cap doesn't help. With max_results=50 (the CLI default) the per-file cap is 10. Five markdown files (CHANGELOG.md, benchmarks/v0.2.572.md, benchmarks/v0.2.58x.md, design docs, …) that each mention the query 10+ times will collectively contribute 50 entries — filling result_list to max_results before Tier 0 reaches the canonical source file's posting-list entries. Tier 0 then returns and the source file never appears.
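The saturation arithmetic can be checked in isolation. A minimal sketch, assuming a hypothetical helper `perFileCap` that mirrors the `max(1, max_results / 5)` formula quoted above (the real code in src/explore.zig may compute it inline):

```zig
const std = @import("std");

// Hypothetical helper mirroring the cap formula from #363a: max(1, max_results / 5).
fn perFileCap(max_results: usize) usize {
    return @max(1, max_results / 5);
}

test "five capped doc files fill the default quota" {
    const max_results: usize = 50; // CLI default, per this issue
    const cap = perFileCap(max_results);
    try std.testing.expectEqual(@as(usize, 10), cap);
    // Five doc files that each hit the cap consume the whole result budget,
    // leaving no slot for any file processed after them.
    try std.testing.expectEqual(max_results, 5 * cap);
}
```

Because the divisor is fixed at 5, any five files that each reach the cap are enough to exhaust the budget, whatever max_results is set to (as long as it is a multiple of 5).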
Validation finding from the eval against codedb 0.2.5805: codedb search "searchContent" returns 5 markdown files in the top-5 with src/explore.zig (the def site, 9 occurrences) completely absent from results. Same behaviour after the #429 rerank fix, because the rerank can only reorder retrieved results — it can't fix recall.
Failing Test
test "issue-430: Tier 0 markdown dominance starves canonical source file" {
    var arena = std.heap.ArenaAllocator.init(testing.allocator);
    defer arena.deinit();
    var explorer = Explorer.init(arena.allocator());

    // Five doc files with 10 mentions each: 5 files x per-file cap of 10 = 50 = max_results.
    const md_block = "fooBar mentioned here.\nfooBar mentioned here.\n" ** 5;
    var i: usize = 0;
    while (i < 5) : (i += 1) {
        var path_buf: [64]u8 = undefined;
        const path = try std.fmt.bufPrint(&path_buf, "docs/notes_{d}.md", .{i});
        try explorer.indexFile(path, md_block);
    }

    // Canonical source file: one definition plus three call sites.
    try explorer.indexFile("src/foo.zig",
        "pub fn fooBar() void {}\n" ++
        "pub fn caller1() void { fooBar(); }\n" ++
        "pub fn caller2() void { fooBar(); }\n" ++
        "pub fn caller3() void { fooBar(); }\n");

    const results = try explorer.searchContent("fooBar", testing.allocator, 50);
    // ...defer free...

    // The source file must appear somewhere in the results.
    var found_source = false;
    for (results) |r| {
        if (std.mem.eql(u8, r.path, "src/foo.zig")) found_source = true;
    }
    try testing.expect(found_source);
}
Failing test on branch issue-430-failing-test (commit 6267580).
$ zig build test 2>&1 | rg "issue-430"
error: 'tests.test.issue-430: Tier 0 markdown dominance starves canonical source file' failed
/Users/.../src/tests.zig:10512: try testing.expect(found_source);
Expected
When a query matches both source files and doc files, source files should be retrieved before doc files saturate the quota.
Fix
In searchContent Tier 0 (src/explore.zig:1525-1554), partition word_hits by language priority and process code-language hits first, then doc-language hits. Concretely:
// Bucket word hits by whether their file is a code language.
var code_hits: std.ArrayList(WordHit) = .empty;
var doc_hits: std.ArrayList(WordHit) = .empty;
defer code_hits.deinit(allocator);
defer doc_hits.deinit(allocator);
for (word_hits) |hit| {
    const hp = self.word_index.hitPath(hit);
    const lang = detectLanguage(hp);
    if (isDocLanguage(lang)) {
        try doc_hits.append(allocator, hit);
    } else {
        try code_hits.append(allocator, hit);
    }
}
// Process code first, then docs, with the same per-file cap and search/saturation logic for both passes.
isDocLanguage returns true for .markdown, .json, .yaml, and .unknown. It is the same predicate concept as #426's langHasCallSites, just inverted. Effort: ~30 lines.
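One possible shape for the predicate, as a sketch only. The `Language` enum below is a hypothetical stand-in for whatever detectLanguage actually returns; only the four doc-side tags come from this issue, and the code-side tags are placeholders:

```zig
const std = @import("std");

// Hypothetical language tag; the real enum in codedb may have different members.
const Language = enum { zig, python, markdown, json, yaml, unknown };

// Assumed predicate: prose, data formats, and unrecognised files count as docs;
// everything else is treated as code and processed in the first pass.
fn isDocLanguage(lang: Language) bool {
    return switch (lang) {
        .markdown, .json, .yaml, .unknown => true,
        else => false,
    };
}

test "doc and code languages land in different buckets" {
    try std.testing.expect(isDocLanguage(.markdown));
    try std.testing.expect(isDocLanguage(.unknown));
    try std.testing.expect(!isDocLanguage(.zig));
}
```

Putting .unknown on the doc side means files with unrecognised extensions are only considered in the second pass, so they can still be crowded out when code hits alone fill max_results.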
Related
This unlocks more wins for the #429 rerank — the rerank can only rank what's been retrieved, so fixing recall here exposes the cases where rerankSignalScore would correctly promote source files anyway.
Eval context
Surfaced by the issue-429-fix validation run (Sonnet 4.6 agent against the codedb codebase). Pre-existing on main; not introduced by #429.