explore: searchContent rerank should compose multiple signals (basename match, path prior, symbol-def) #429

@justrach

Problem

The post-pass rerank in Explorer.searchContent (src/explore.zig:1700-1712) is a single-signal scorer:

for (result_list.items) |*r| {
    r.score = countOccurrences(r.line_text, query);
}
std.sort.block(SearchResult, result_list.items, {}, struct {
    pub fn lessThan(_: void, a: SearchResult, b: SearchResult) bool {
        if (a.score != b.score) return a.score > b.score;
        const ord = std.mem.order(u8, a.path, b.path);
        if (ord != .eq) return ord == .lt;
        return a.line_num < b.line_num;
    }
}.lessThan);

It counts occurrences inside one line and tiebreaks by (path asc, line_num asc). That ignores three signals an experienced reader would weight heavily:

  1. Basename match. When the query is widgetX and one of the candidate files is src/widgetX.zig, the developer is almost certainly asking about that file. Today the alphabetic tiebreaker promotes src/unrelated.zig over src/widgetX.zig when both have one occurrence.
  2. Path prior. Hits in examples/, tests/, vendor/, node_modules/ are usually less relevant than hits in src/ or lib/. Today examples/... outranks src/... simply because e < s.
  3. Symbol-definition lines. A line that defines a symbol named after the query is the canonical hit. A passing comment mention with the same per-line count should rank below it. Today both score 1 and the alphabetic tiebreaker decides.
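
The first two cases can be reproduced without the explorer by running the comparator above on hand-built results. This is a minimal sketch: the trimmed SearchResult below is a hypothetical stand-in carrying only the fields the comparator reads, not the real struct in src/explore.zig.

```zig
const std = @import("std");

// Hypothetical, trimmed-down stand-in for the real SearchResult.
const SearchResult = struct {
    path: []const u8,
    line_num: u32,
    score: usize,
};

// Same ordering as the quoted comparator: score desc, path asc, line_num asc.
fn lessThan(_: void, a: SearchResult, b: SearchResult) bool {
    if (a.score != b.score) return a.score > b.score;
    const ord = std.mem.order(u8, a.path, b.path);
    if (ord != .eq) return ord == .lt;
    return a.line_num < b.line_num;
}

test "equal scores fall through to alphabetical path order" {
    // Case 1: the basename match loses because 'u' sorts before 'w'.
    var basename_case = [_]SearchResult{
        .{ .path = "src/widgetX.zig", .line_num = 1, .score = 1 },
        .{ .path = "src/unrelated.zig", .line_num = 1, .score = 1 },
    };
    std.sort.block(SearchResult, &basename_case, {}, lessThan);
    try std.testing.expectEqualStrings("src/unrelated.zig", basename_case[0].path);

    // Case 2: examples/ wins over src/ because 'e' sorts before 's'.
    var prior_case = [_]SearchResult{
        .{ .path = "src/sample.zig", .line_num = 1, .score = 1 },
        .{ .path = "examples/sample.zig", .line_num = 1, .score = 1 },
    };
    std.sort.block(SearchResult, &prior_case, {}, lessThan);
    try std.testing.expectEqualStrings("examples/sample.zig", prior_case[0].path);
}
```

Saved to a standalone file, this passes under `zig test`, which confirms the undesired ordering comes from the tiebreaker alone.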

Failing Tests

Three tests on branch issue-429-failing-test (commit c4c9056), each demonstrating one signal in isolation. All three fail today:

test "issue-429-a: searchContent rerank boosts files whose basename matches the query" {
    // src/unrelated.zig and src/widgetX.zig, both with one hit. Expected:
    // src/widgetX.zig (basename match) ranks first. Today it doesn't.
    try testing.expectEqualStrings("src/widgetX.zig", results[0].path);
}

test "issue-429-b: searchContent rerank penalizes test/vendor/examples paths" {
    // examples/sample.zig and src/sample.zig, both with one hit. Expected:
    // src/sample.zig ranks first.
    try testing.expectEqualStrings("src/sample.zig", results[0].path);
}

test "issue-429-c: searchContent rerank boosts lines that are symbol definitions" {
    // aaa.zig (comment mention) and zzz_def.zig (`pub fn fooSym() void {}`).
    // Expected: zzz_def.zig ranks first (symbol-def boost).
    try testing.expectEqualStrings("zzz_def.zig", results[0].path);
}

Expected

searchContent's rerank composes (at minimum):

  • Per-line occurrence count (existing)
  • Basename-match boost
  • Path-prior penalty for examples/, tests/, vendor/, node_modules/
  • Symbol-definition boost (when the line is a symbol definition for the query, looked up via outline)

with weights so the existing per-line frequency signal still wins on its own when no other signal applies.
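
The two path-based signals need nothing beyond string operations on the result path. A minimal sketch follows, assuming paths use '/' separators and that a segment must match a whole path component. The helper names basenameStem and hasSegment are taken from the pseudocode in the Fix section; their bodies here are illustrative, not the actual implementation.

```zig
const std = @import("std");

/// File name without its directory or final extension,
/// e.g. "src/widgetX.zig" -> "widgetX".
fn basenameStem(path: []const u8) []const u8 {
    const name = std.fs.path.basename(path);
    const dot = std.mem.lastIndexOfScalar(u8, name, '.') orelse return name;
    // Keep dotfiles such as ".gitignore" intact instead of returning "".
    if (dot == 0) return name;
    return name[0..dot];
}

/// True when `segment` is exactly one '/'-separated component of `path`,
/// so "tests" matches "src/tests/a.zig" but not "src/tests_util.zig".
fn hasSegment(path: []const u8, segment: []const u8) bool {
    var it = std.mem.splitScalar(u8, path, '/');
    while (it.next()) |part| {
        if (std.mem.eql(u8, part, segment)) return true;
    }
    return false;
}

test "path signals on the paths used in the failing tests" {
    try std.testing.expectEqualStrings("widgetX", basenameStem("src/widgetX.zig"));
    try std.testing.expect(hasSegment("examples/sample.zig", "examples"));
    try std.testing.expect(!hasSegment("src/sample.zig", "examples"));
    try std.testing.expect(!hasSegment("src/tests_util.zig", "tests"));
}
```

Matching whole components rather than substrings avoids penalizing files such as src/tests_util.zig that merely contain a penalized directory name.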

Fix

Replace the single-pass countOccurrences-only score with a composed scorer in searchContent, sketched here as pseudocode:

fn scoreResult(r: SearchResult, query: []const u8, outline_lookup: ...) f32 {
    var score: f32 = @floatFromInt(countOccurrences(r.line_text, query));
    if (basenameStem(r.path) matches query) score += 10.0;
    if (hasSegment(r.path, "tests") or hasSegment(r.path, "test")) score *= 0.6;
    if (hasSegment(r.path, "examples")) score *= 0.6;
    if (hasSegment(r.path, "vendor") or hasSegment(r.path, "node_modules")) score *= 0.4;
    if (lineDefinesSymbolNamed(r.path, r.line_num, query)) score += 5.0;
    return score;
}

Constants are tuneable; the goal is for each signal alone to flip the order on these failing tests.
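
To check that the proposed constants behave as intended, the weighting can be isolated from signal detection. The sketch below is an assumption-laden illustration: Signals and composeScore are hypothetical names, and the booleans stand in for the basename check, the path-segment checks, and the outline lookup. The arithmetic and its order follow the pseudocode above.

```zig
const std = @import("std");

/// Hypothetical bundle of per-result signals. In the real code these would be
/// derived from the SearchResult, the query, and the outline lookup.
const Signals = struct {
    occurrences: usize,
    basename_match: bool = false,
    in_test_path: bool = false,
    in_examples_path: bool = false,
    in_vendor_path: bool = false,
    defines_symbol: bool = false,
};

/// Applies the weights in the same order as the pseudocode: basename boost,
/// then multiplicative path penalties, then the symbol-definition boost.
fn composeScore(s: Signals) f32 {
    var score: f32 = @floatFromInt(s.occurrences);
    if (s.basename_match) score += 10.0;
    if (s.in_test_path) score *= 0.6;
    if (s.in_examples_path) score *= 0.6;
    if (s.in_vendor_path) score *= 0.4;
    if (s.defines_symbol) score += 5.0;
    return score;
}

test "each signal alone flips a one-hit tie" {
    const plain = composeScore(.{ .occurrences = 1 });
    try std.testing.expect(composeScore(.{ .occurrences = 1, .basename_match = true }) > plain);
    try std.testing.expect(composeScore(.{ .occurrences = 1, .in_examples_path = true }) < plain);
    try std.testing.expect(composeScore(.{ .occurrences = 1, .defines_symbol = true }) > plain);
}

test "frequency alone still orders results when no other signal applies" {
    try std.testing.expect(composeScore(.{ .occurrences = 3 }) > composeScore(.{ .occurrences = 1 }));
}
```

One consequence of the statement order is worth keeping in mind when tuning: the path penalties scale the basename boost but not the symbol-definition boost, because the latter is added after the multiplications.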

Related

Companion to #427 (Tier 1 file-length sort). Together they cover both candidate selection and post-rank ordering.

Metadata

Labels: bug (Something isn't working), priority:p2 (Medium priority)
