explore: Tier 0 markdown dominance hides canonical source files at high density #430

@justrach

Problem

The existing #363a fix established a per-file cap of max(1, max_results / 5) in searchContent Tier 0 (src/explore.zig:1525-1554) to keep a single hot doc file from saturating results. That works in the small-corpus regime (#363a's test indexed 4 docs × 4 mentions, max_results=10).

But in the high-density regime the cap doesn't help. With max_results=50 (the CLI default) the per-file cap is 10. Five markdown files (CHANGELOG.md, benchmarks/v0.2.572.md, benchmarks/v0.2.58x.md, design docs, …) that each mention the query 10+ times will collectively contribute 50 entries — filling result_list to max_results before Tier 0 reaches the canonical source file's posting-list entries. Tier 0 then returns and the source file never appears.
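The saturation arithmetic above can be checked with a small model. This is a sketch only: perFileCap restates the max(1, max_results / 5) formula from #363a, while slotsLeftAfterDocs is a hypothetical helper (not part of src/explore.zig) that assumes every doc file is visited before the source file.

```zig
const std = @import("std");

/// Per-file cap as described for the #363a fix: max(1, max_results / 5).
fn perFileCap(max_results: usize) usize {
    return @max(1, max_results / 5);
}

/// Hypothetical model: result slots still free after `doc_files` doc files
/// have each contributed up to the per-file cap. Assumes the doc files are
/// all processed before the canonical source file.
fn slotsLeftAfterDocs(max_results: usize, doc_files: usize, mentions_per_file: usize) usize {
    const cap: usize = perFileCap(max_results);
    const per_file: usize = @min(cap, mentions_per_file);
    const used: usize = @min(max_results, doc_files * per_file);
    return max_results - used;
}
```

Under this model, slotsLeftAfterDocs(50, 5, 10) is 0, so the source file has no slot left, whereas the #363a-sized case slotsLeftAfterDocs(10, 4, 4) still leaves 2 slots.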

Finding from the validation eval against codedb 0.2.5805: codedb search "searchContent" returns 5 markdown files in the top 5, with src/explore.zig (the definition site, 9 occurrences) completely absent from the results. The behaviour is the same after the #429 rerank fix, because the rerank can only reorder retrieved results; it can't fix recall.

Failing Test

test "issue-430: Tier 0 markdown dominance starves canonical source file" {
    var arena = std.heap.ArenaAllocator.init(testing.allocator);
    defer arena.deinit();
    var explorer = Explorer.init(arena.allocator());

    const md_block = "fooBar mentioned here.\nfooBar mentioned here.\n" ** 5;
    var i: usize = 0;
    while (i < 5) : (i += 1) {
        var path_buf: [64]u8 = undefined;
        const path = try std.fmt.bufPrint(&path_buf, "docs/notes_{d}.md", .{i});
        try explorer.indexFile(path, md_block);
    }
    try explorer.indexFile("src/foo.zig",
        "pub fn fooBar() void {}\n" ++
        "pub fn caller1() void { fooBar(); }\n" ++
        "pub fn caller2() void { fooBar(); }\n" ++
        "pub fn caller3() void { fooBar(); }\n");

    const results = try explorer.searchContent("fooBar", testing.allocator, 50);
    // ...defer free...
    var found_source = false;
    for (results) |r| {
        if (std.mem.eql(u8, r.path, "src/foo.zig")) found_source = true;
    }
    try testing.expect(found_source);
}

The failing test is on branch issue-430-failing-test (commit 6267580).

$ zig build test 2>&1 | rg "issue-430"
error: 'tests.test.issue-430: Tier 0 markdown dominance starves canonical source file' failed
       /Users/.../src/tests.zig:10512: try testing.expect(found_source);

Expected

When a query matches both source files and doc files, source-file hits should be retrieved before doc-file hits can fill the max_results quota.

Fix

In searchContent Tier 0 (src/explore.zig:1525-1554), partition word_hits by language priority and process code-language hits first, then doc-language hits. Concretely:

// Bucket word hits by whether their file is a code language.
var code_hits: std.ArrayList(WordHit) = .empty;
var doc_hits: std.ArrayList(WordHit) = .empty;
defer code_hits.deinit(allocator);
defer doc_hits.deinit(allocator);
for (word_hits) |hit| {
    const hp = self.word_index.hitPath(hit);
    const lang = detectLanguage(hp);
    if (isDocLanguage(lang)) try doc_hits.append(allocator, hit) else try code_hits.append(allocator, hit);
}
// Process code first, then docs — same per-file cap and search/saturation logic for both passes.

isDocLanguage returns true for .markdown, .json, .yaml, and .unknown. It is the same predicate concept as #426's langHasCallSites, just inverted. Estimated effort: ~30 lines.
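A minimal sketch of the predicate and the code-first ordering, to make the proposal concrete. Everything beyond the four doc tags named above is an assumption: the Language enum, the FileHits struct, and the consume and codeSlotsTwoPass helpers are hypothetical stand-ins, and per-file match counts replace the real posting-list walk.

```zig
const std = @import("std");

/// Hypothetical stand-in for the project's language enum.
const Language = enum { zig, markdown, json, yaml, unknown };

/// Doc predicate with the four tags listed in the issue.
fn isDocLanguage(lang: Language) bool {
    return switch (lang) {
        .markdown, .json, .yaml, .unknown => true,
        else => false,
    };
}

/// Hypothetical per-file summary: language plus number of matching lines.
const FileHits = struct { lang: Language, matches: usize };

/// Takes up to `cap` results from each file whose doc-ness equals `want_doc`,
/// drawing from the shared `remaining` budget. Returns how many were taken.
fn consume(files: []const FileHits, want_doc: bool, cap: usize, remaining: *usize) usize {
    var taken: usize = 0;
    for (files) |f| {
        if (isDocLanguage(f.lang) != want_doc) continue;
        const n: usize = @min(f.matches, cap, remaining.*);
        remaining.* -= n;
        taken += n;
    }
    return taken;
}

/// Result slots that go to code files when the code pass runs before the
/// doc pass, with the same per-file cap applied to both passes.
fn codeSlotsTwoPass(files: []const FileHits, max_results: usize) usize {
    const cap: usize = @max(1, max_results / 5);
    var remaining: usize = max_results;
    const code_taken = consume(files, false, cap, &remaining);
    _ = consume(files, true, cap, &remaining);
    return code_taken;
}
```

With five markdown files at 10 matches each followed by one Zig file at 4 matches (the shape of the failing test), codeSlotsTwoPass(&files, 50) gives 4 in this model: the source file gets its slots regardless of where it sits in the input order, and the docs fill the remaining 46.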

Related

This unlocks more wins for the #429 rerank — the rerank can only rank what's been retrieved, so fixing recall here exposes the cases where rerankSignalScore would correctly promote source files anyway.

Eval context

Surfaced by the issue-429-fix validation run (Sonnet 4.6 agent against the codedb codebase). Pre-existing on main; not introduced by #429.

Labels: bug (Something isn't working), priority:p2 (Medium priority)
