index: disk-loaded word index never removes stale postings — re-index appends forever, deletes leave ghosts #583

@justrach

Description

Problem

WordIndex.readFromDisk and WordIndex.mmapFromDisk both set skip_file_words = true (src/index.zig:790, :819), and promoteIfBorrowed keeps it that way. With file_words empty, removeFile hits self.file_words.fetchRemove(path) orelse return (src/index.zig:108) and becomes a silent no-op for every disk-loaded index — exactly the mode the warm daemon and CLI fast paths run in.
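The early return described above can be illustrated with a toy model. This is only a sketch: `ToyIndex`, `tracked_path`, and `posting_count` are hypothetical stand-ins invented for illustration, not the real `WordIndex` layout. Only the control flow (remove first, bail out silently on a miss) follows the description.

```zig
const std = @import("std");

/// Toy stand-in for WordIndex. Only the control flow mirrors the issue;
/// the fields are hypothetical simplifications.
const ToyIndex = struct {
    skip_file_words: bool,
    /// Stands in for file_words: the one path we hold per-file data for.
    tracked_path: ?[]const u8 = null,
    /// Stands in for the postings map: how many postings are live.
    posting_count: usize = 0,

    fn indexFile(self: *ToyIndex, path: []const u8, new_postings: usize) void {
        self.removeFile(path); // the real indexFile also removes first
        if (!self.skip_file_words) self.tracked_path = path;
        self.posting_count += new_postings;
    }

    fn removeFile(self: *ToyIndex, path: []const u8) void {
        // Same shape as `fetchRemove(path) orelse return`: a miss exits silently.
        const tracked = self.tracked_path orelse return;
        if (!std.mem.eql(u8, tracked, path)) return;
        self.tracked_path = null;
        self.posting_count = 0; // single-file toy: dropping the file empties the index
    }
};

test "re-index replaces on a scratch index but appends after a disk load" {
    var scratch = ToyIndex{ .skip_file_words = false };
    scratch.indexFile("src/a.zig", 2);
    scratch.indexFile("src/a.zig", 2);
    try std.testing.expectEqual(@as(usize, 2), scratch.posting_count);

    // As if loaded from disk: postings present, file_words skipped.
    var loaded = ToyIndex{ .skip_file_words = true, .posting_count = 2 };
    loaded.indexFile("src/a.zig", 2);
    try std.testing.expectEqual(@as(usize, 4), loaded.posting_count);
}
```

In the scratch case the second `indexFile` finds the tracked path and clears the old postings. In the loaded case the lookup misses, so the new postings are stacked on top of the old ones.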

Consequences, all on the post-load write path (indexFile calls removeFile first, src/index.zig:186):

  • Re-indexing a file appends new postings while every stale one survives: terms deleted from the file keep hitting it (at stale line numbers), and (doc, line) duplicates inflate BM25 term frequency.
  • Deleting a file leaves all of its postings live — ghost hits with a valid-looking path.
  • Unbounded postings growth: in a long-running watch/daemon session every file save grows the index, and the memory is never reclaimed, so RSS keeps climbing. #400 ("index: add BM25 ranking for content search") fixed the total_tokens/doc_lengths counters in this mode but not the postings themselves.
  • In pure zero-copy mmap mode removeFile is doubly a no-op: path_to_id is empty too, and unlike indexFile it never promotes the mmap to heap.
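To give a rough sense of the BM25 inflation mentioned in the first bullet, here is a sketch of the standard BM25 term-frequency factor with IDF left out. The `k1 = 1.2` and `b = 0.75` values are the conventional textbook defaults, assumed here for illustration; the project's actual parameters and scoring code are not shown in this issue, and `bm25TfFactor` is a hypothetical helper.

```zig
const std = @import("std");

/// Standard BM25 term-frequency saturation factor (IDF omitted).
/// k1 and b are conventional defaults, NOT values taken from the project.
fn bm25TfFactor(tf: f64, doc_len: f64, avg_doc_len: f64) f64 {
    const k1: f64 = 1.2;
    const b: f64 = 0.75;
    const norm = k1 * (1.0 - b + b * doc_len / avg_doc_len);
    return tf * (k1 + 1.0) / (tf + norm);
}

test "a duplicated posting inflates the BM25 term factor" {
    // One genuine occurrence in an average-length document.
    const clean = bm25TfFactor(1.0, 100.0, 100.0);
    // Same document after one re-index left the stale (doc, line) posting behind.
    const inflated = bm25TfFactor(2.0, 100.0, 100.0);
    try std.testing.expectApproxEqAbs(@as(f64, 1.0), clean, 1e-9);
    try std.testing.expectApproxEqAbs(@as(f64, 1.375), inflated, 1e-9);
}
```

Under these assumptions a single stale duplicate raises the term's contribution by 37.5%, and each further save of the file pushes it closer to the saturation limit of `k1 + 1`.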

Failing Test

test_index.zig — fails on current release tip:

test "issue-582: disk-loaded word index — re-index and removeFile must drop stale postings" {
    const alloc = testing.allocator;
    var wi = WordIndex.init(alloc);
    defer wi.deinit();
    try wi.indexFile("src/a.zig", "pub fn alphaToken() void {}\n");
    try wi.indexFile("src/b.zig", "pub fn betaToken() void {}\n");

    var tmp = testing.tmpDir(.{});
    defer tmp.cleanup();
    var path_buf: [std.fs.max_path_bytes]u8 = undefined;
    const dir_path_len = try tmp.dir.realPathFile(io, ".", &path_buf);
    const dir_path = path_buf[0..dir_path_len];
    try wi.writeToDisk(io, dir_path, null);

    // Heap fast-load: re-indexing a file must drop its old postings.
    var loaded = WordIndex.readFromDisk(io, dir_path, alloc).?;
    defer loaded.deinit();
    try loaded.indexFile("src/a.zig", "pub fn gammaToken() void {}\n");
    const stale = try loaded.searchDeduped("alphaToken", alloc);
    defer alloc.free(stale);
    try testing.expectEqual(@as(usize, 0), stale.len);
    const fresh = try loaded.searchDeduped("gammaToken", alloc);
    defer alloc.free(fresh);
    try testing.expectEqual(@as(usize, 1), fresh.len);

    // Deleting a file must drop its postings outright.
    loaded.removeFile("src/b.zig");
    const ghost = try loaded.searchDeduped("betaToken", alloc);
    defer alloc.free(ghost);
    try testing.expectEqual(@as(usize, 0), ghost.len);

    // Zero-copy mmap load: removeFile is a write — it must promote, not no-op.
    var mloaded = WordIndex.mmapFromDisk(io, dir_path, alloc).?;
    defer mloaded.deinit();
    mloaded.removeFile("src/a.zig");
    const mghost = try mloaded.searchDeduped("alphaToken", alloc);
    defer alloc.free(mghost);
    try testing.expectEqual(@as(usize, 0), mghost.len);
}

Expected

After a disk fast-load, indexFile of changed content replaces a file's postings, and removeFile drops them — same observable behavior as a scratch-built index.

Fix

Give removeFile a slow path for the case where file_words has no entry. When path_to_id knows the path:

  • sweep index for the doc_id, pruning buckets that become empty;
  • fix doc_lengths and total_tokens;
  • blank and free the id_to_path slot (the owner of the path in skip mode).

In mmap mode, promote first (a remove is a write), but only if the path is actually tracked. The sweep is O(index) but runs at most once per file edit after a load; the fast path is untouched.
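A minimal sketch of what such a slow path could look like, using fixed-size arrays so it stays allocation-free. Everything here (`Bucket`, `ToyIndex`, `sweepDoc`, `removeFileSlow`, the linear path lookup standing in for path_to_id) is hypothetical and simplified. It does not reproduce the real `WordIndex` internals, the mmap promotion step, or freeing of the path string.

```zig
const std = @import("std");

const Posting = struct { doc_id: u32, line: u32 };

/// One term's posting list; fixed capacity keeps the sketch allocation-free.
const Bucket = struct {
    term: []const u8,
    items: [4]Posting = undefined,
    len: usize = 0,
};

fn bucketOf(term: []const u8, postings: []const Posting) Bucket {
    var b = Bucket{ .term = term };
    for (postings) |p| {
        b.items[b.len] = p;
        b.len += 1;
    }
    return b;
}

/// Drop every posting for doc_id and compact away buckets that became empty.
/// Returns the new bucket count. Cost is linear in the total number of postings.
fn sweepDoc(buckets: []Bucket, doc_id: u32) usize {
    var live: usize = 0;
    var i: usize = 0;
    while (i < buckets.len) : (i += 1) {
        var kept: usize = 0;
        var j: usize = 0;
        while (j < buckets[i].len) : (j += 1) {
            const p = buckets[i].items[j];
            if (p.doc_id != doc_id) {
                buckets[i].items[kept] = p;
                kept += 1;
            }
        }
        buckets[i].len = kept;
        if (kept > 0) {
            if (live != i) buckets[live] = buckets[i];
            live += 1;
        }
    }
    return live;
}

const ToyIndex = struct {
    buckets: [3]Bucket,
    bucket_count: usize,
    doc_lengths: [2]u32,
    total_tokens: u64,
    id_to_path: [2][]const u8,

    /// Linear scan standing in for a path_to_id lookup.
    fn docIdFor(self: *const ToyIndex, path: []const u8) ?u32 {
        var id: u32 = 0;
        while (id < self.id_to_path.len) : (id += 1) {
            if (std.mem.eql(u8, self.id_to_path[id], path)) return id;
        }
        return null;
    }

    fn removeFileSlow(self: *ToyIndex, path: []const u8) void {
        const doc_id = self.docIdFor(path) orelse return; // untracked path: nothing to do
        self.bucket_count = sweepDoc(self.buckets[0..self.bucket_count], doc_id);
        self.total_tokens -= self.doc_lengths[doc_id];
        self.doc_lengths[doc_id] = 0;
        self.id_to_path[doc_id] = ""; // blank the slot; real code would also free it
    }
};

test "slow-path sweep drops a file's postings and prunes empty buckets" {
    var idx = ToyIndex{
        .buckets = [3]Bucket{
            bucketOf("alphaToken", &[_]Posting{.{ .doc_id = 0, .line = 1 }}),
            bucketOf("betaToken", &[_]Posting{.{ .doc_id = 1, .line = 1 }}),
            bucketOf("pub", &[_]Posting{ .{ .doc_id = 0, .line = 1 }, .{ .doc_id = 1, .line = 1 } }),
        },
        .bucket_count = 3,
        .doc_lengths = [2]u32{ 2, 2 },
        .total_tokens = 4,
        .id_to_path = [2][]const u8{ "src/a.zig", "src/b.zig" },
    };
    idx.removeFileSlow("src/a.zig");
    try std.testing.expectEqual(@as(usize, 2), idx.bucket_count); // alphaToken bucket pruned
    try std.testing.expectEqualStrings("betaToken", idx.buckets[0].term);
    try std.testing.expectEqual(@as(usize, 1), idx.buckets[1].len); // "pub" keeps only doc 1
    try std.testing.expectEqual(@as(u64, 2), idx.total_tokens);
}
```

The sweep touches every bucket, which is why the issue restricts it to the case where the fast path has nothing to work with. An untracked path still returns early without doing any work.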

Metadata

Labels: bug (Something isn't working), priority:p2 (Medium priority)
