Problem
Explorer.searchContent (src/explore.zig:1509) is the function handleSearch calls in production. Its Tier 1 (trigram candidates) sorts the candidate file list by file content length ascending at src/explore.zig:1590-1598:
const SortCtx = struct {
    contents: *const std.StringHashMap([]const u8),
    pub fn lessThan(ctx: @This(), a: []const u8, b: []const u8) bool {
        const a_len = if (ctx.contents.get(a)) |c| c.len else std.math.maxInt(usize);
        const b_len = if (ctx.contents.get(b)) |c| c.len else std.math.maxInt(usize);
        return a_len < b_len;
    }
};
It then applies a per-file cap of max(1, max_results / estimated_total) (line 1601). With many candidates relative to max_results, that cap is 1. Because candidates are scanned shortest-first, small unrelated files contribute one hit each, saturate result_list to max_results, and trigger the early return at line 1607 before the larger canonical file is ever scanned.
Concrete reproduction (eval, codedb 0.2.5805): codedb_search(query="trigram_index", max_results=20) returns hits from adversarial_tests.zig (2 occurrences) and main.zig but does not return src/explore.zig, which has 15 occurrences and is the definition site of the symbol. With max_results=50, explore.zig finally appears but is then re-truncated to 5 lines by handleSearch's per-file cap.
The post-search frequency rerank at src/explore.zig:1681-1693 operates only on the lines already collected, so it cannot recover a file that was never read.
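The starvation described above can be reproduced in isolation. The following is a minimal sketch, not the code in src/explore.zig: Candidate, shorterFirst, perFileCap, and filesScanned are hypothetical names, and it assumes estimated_total equals the candidate count and that the early return fires as soon as max_results hits are collected.

```zig
const std = @import("std");

// Hypothetical stand-in for a Tier 1 candidate: a path plus its content.
const Candidate = struct {
    path: []const u8,
    content: []const u8,
};

// Same ordering as the SortCtx above: shorter content sorts first.
fn shorterFirst(_: void, a: Candidate, b: Candidate) bool {
    return a.content.len < b.content.len;
}

// The cap described above: max(1, max_results / estimated_total).
fn perFileCap(max_results: usize, estimated_total: usize) usize {
    return @max(1, max_results / @max(1, estimated_total));
}

// Sorts candidates shortest-first and returns how many are scanned
// before the result list is full and the loop exits early.
fn filesScanned(cands: []Candidate, needle: []const u8, max_results: usize) usize {
    std.mem.sort(Candidate, cands, {}, shorterFirst);
    const cap = perFileCap(max_results, cands.len);
    var collected: usize = 0;
    var scanned: usize = 0;
    for (cands) |c| {
        if (collected >= max_results) break; // models the early return
        scanned += 1;
        const hits = std.mem.count(u8, c.content, needle);
        collected += @min(hits, cap);
    }
    return scanned;
}

test "length-ascending scan never reaches the longest candidate" {
    var cands = [_]Candidate{
        .{ .path = "canonical.zig", .content = "widgetX widgetX widgetX widgetX plus padding" },
        .{ .path = "a.zig", .content = "widgetX" },
        .{ .path = "b.zig", .content = "widgetX;" },
        .{ .path = "c.zig", .content = "widgetX;;" },
    };
    const scanned = filesScanned(&cands, "widgetX", 3);
    // Three short files fill the list; the fourth (canonical) is never read.
    try std.testing.expectEqual(@as(usize, 3), scanned);
    try std.testing.expectEqualStrings("canonical.zig", cands[3].path);
}
```

With four candidates and max_results = 3, the cap is 1, so the three short files each contribute one hit and the loop stops before canonical.zig, even though it holds the most occurrences.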
Failing Test
test "issue-427: searchContent Tier 1 sort starves the definition-dense file" {
    var arena = std.heap.ArenaAllocator.init(testing.allocator);
    defer arena.deinit();
    var explorer = Explorer.init(arena.allocator());

    const small_count: usize = 8;
    var i: usize = 0;
    while (i < small_count) : (i += 1) {
        var path_buf: [32]u8 = undefined;
        const path = try std.fmt.bufPrint(&path_buf, "small_{d}.zig", .{i});
        try explorer.indexFile(path, "fn s() void { _ = widgetX; }\n");
    }

    const canonical_content =
        "fn canonical() void {\n" ++
        "    _ = widgetX;\n" ++
        "    _ = widgetX;\n" ++
        "    _ = widgetX;\n" ++
        "    _ = widgetX;\n" ++
        "    // padding line ...\n" ++
        "    // padding line ...\n" ++
        "    // padding line ...\n" ++
        "    _ = 0;\n" ++
        "}\n";
    try explorer.indexFile("canonical.zig", canonical_content);

    const results = try explorer.searchContent("widgetX", testing.allocator, 5);
    // ...defer free...

    var found_canonical = false;
    for (results) |r| {
        if (std.mem.eql(u8, r.path, "canonical.zig")) found_canonical = true;
    }
    try testing.expect(found_canonical);
}
Failing test on branch issue-427-failing-test (commit 7b7495e).
$ zig build test 2>&1 | rg "issue-427"
error: 'tests.test.issue-427: searchContent Tier 1 sort starves the definition-dense file' failed
/Users/.../src/tests.zig:10515: try testing.expect(found_canonical);
Expected
The file with the most occurrences of the term should appear in the result set. The ranking should not silently exclude the canonical file in favor of unrelated small files just because their contents are shorter.
Fix
Two complementary changes in Explorer.searchContent (src/explore.zig:1587-1693):
- Replace the file-length sort with a relevance-first order. Prefer files that the word index identifies as having the term in a symbol-definition context, then by total per-file occurrences if available. Effort: small (~10 lines).
- Aggregate per-file occurrence counts before truncating to max_results. Run searchInContent for every candidate (still bounded by the per-file cap), collect counts into a file-keyed map, then sort the result list by (per_file_total desc, per_line_count desc, path asc, line_num asc). Drop the early-return at line 1607 in favor of post-aggregation truncation. Effort: medium.
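The aggregate-then-truncate ordering from the second change could look roughly like the sketch below. It is illustrative only: Hit and its field names, fillFileTotals, hitLessThan, and rankAndTruncate are hypothetical and do not reflect the real result type in src/explore.zig.

```zig
const std = @import("std");

// Hypothetical result row; the real search result type may differ.
const Hit = struct {
    path: []const u8,
    line_num: u32,
    per_line_count: u32,
    per_file_total: u32 = 0,
};

// Sums per-line counts into a file-keyed map, then writes each file's
// total back onto every hit that belongs to it.
fn fillFileTotals(allocator: std.mem.Allocator, hits: []Hit) !void {
    var totals = std.StringHashMap(u32).init(allocator);
    defer totals.deinit();
    for (hits) |h| {
        const entry = try totals.getOrPut(h.path);
        if (!entry.found_existing) entry.value_ptr.* = 0;
        entry.value_ptr.* += h.per_line_count;
    }
    for (hits) |*h| h.per_file_total = totals.get(h.path).?;
}

// Key: per_file_total desc, per_line_count desc, path asc, line_num asc.
fn hitLessThan(_: void, a: Hit, b: Hit) bool {
    if (a.per_file_total != b.per_file_total) return a.per_file_total > b.per_file_total;
    if (a.per_line_count != b.per_line_count) return a.per_line_count > b.per_line_count;
    switch (std.mem.order(u8, a.path, b.path)) {
        .lt => return true,
        .gt => return false,
        .eq => {},
    }
    return a.line_num < b.line_num;
}

// Sorts every collected hit first and truncates only afterwards.
fn rankAndTruncate(hits: []Hit, max_results: usize) []Hit {
    std.mem.sort(Hit, hits, {}, hitLessThan);
    return hits[0..@min(hits.len, max_results)];
}

test "dense file ranks ahead of small files after aggregation" {
    var hits = [_]Hit{
        .{ .path = "small_0.zig", .line_num = 1, .per_line_count = 1 },
        .{ .path = "canonical.zig", .line_num = 2, .per_line_count = 1 },
        .{ .path = "canonical.zig", .line_num = 3, .per_line_count = 1 },
        .{ .path = "small_1.zig", .line_num = 1, .per_line_count = 1 },
    };
    try fillFileTotals(std.testing.allocator, &hits);
    const top = rankAndTruncate(&hits, 2);
    try std.testing.expectEqualStrings("canonical.zig", top[0].path);
    try std.testing.expectEqual(@as(u32, 2), top[0].line_num);
}
```

Because truncation happens after the sort, the scan order of candidates no longer decides which files survive. The cost is that every candidate is read, which is why the issue rates this change as medium effort.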
A simpler intermediate stop-gap: when max_per_file == 1 and there are more candidates than max_results, skip the length sort and use word-index hit count per file as the primary key. This alone would fix the reported case.
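One possible shape for the stop-gap is to extend the existing sort context with a hit-count map and a flag. This is a sketch under assumptions: hit_counts stands in for whatever per-file counts the word index can supply, and by_hits is a hypothetical field. Neither exists in the current SortCtx.

```zig
const std = @import("std");

const SortCtx = struct {
    contents: *const std.StringHashMap([]const u8),
    // Hypothetical: per-file hit counts taken from the word index.
    hit_counts: *const std.StringHashMap(u32),
    // Set when max_per_file == 1 and candidates outnumber max_results.
    by_hits: bool,

    pub fn lessThan(ctx: @This(), a: []const u8, b: []const u8) bool {
        if (ctx.by_hits) {
            const a_hits = ctx.hit_counts.get(a) orelse 0;
            const b_hits = ctx.hit_counts.get(b) orelse 0;
            // More hits sorts first; equal counts fall through to length.
            if (a_hits != b_hits) return a_hits > b_hits;
        }
        const a_len = if (ctx.contents.get(a)) |c| c.len else std.math.maxInt(usize);
        const b_len = if (ctx.contents.get(b)) |c| c.len else std.math.maxInt(usize);
        return a_len < b_len;
    }
};

test "stop-gap ordering puts the most-hit file first" {
    const allocator = std.testing.allocator;
    var contents = std.StringHashMap([]const u8).init(allocator);
    defer contents.deinit();
    var hit_counts = std.StringHashMap(u32).init(allocator);
    defer hit_counts.deinit();

    try contents.put("small.zig", "widgetX");
    try contents.put("canonical.zig", "widgetX widgetX widgetX widgetX");
    try hit_counts.put("small.zig", 1);
    try hit_counts.put("canonical.zig", 4);

    var paths = [_][]const u8{ "small.zig", "canonical.zig" };
    const max_per_file: usize = 1;
    const max_results: usize = 1;
    const ctx = SortCtx{
        .contents = &contents,
        .hit_counts = &hit_counts,
        .by_hits = max_per_file == 1 and paths.len > max_results,
    };
    std.mem.sort([]const u8, &paths, ctx, SortCtx.lessThan);
    try std.testing.expectEqualStrings("canonical.zig", paths[0]);
}
```

When by_hits is false the comparator behaves exactly like the current length-ascending sort, so the change stays confined to the saturated case.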
The repo already has searchContentRanked (BM25) at src/explore.zig:1703, which does proper document-level ranking. handleSearch could be routed to that path for queries with multiple word tokens, leaving searchContent as the substring-match fast path with a fixed Tier 1 sort.
Eval context
Found by an automated codedb evaluation against codedb 0.2.5805. Filed alongside #425 (substring leakage in callers) and #426 (non-code files in callers).