Skip to content

Bloom-augmented trigram postings (nextMask + locMask) #17

@justrach

Description

@justrach

Summary

Augment the trigram index with two 8-bit bloom filter masks per (trigram, file) posting, inspired by GitHub's Project Blackbird design.

Current State

TrigramIndex in src/index.zig maps Trigram (u24) → StringHashMap(void) — just file presence, no position or next-character info.

Proposed Change

Replace StringHashMap(void) with StringHashMap(PostingMask) where:

const PostingMask = struct {
    next_mask: u8,  // bloom filter of chars following this trigram
    loc_mask: u8,   // bit mask of (position % 8) where trigram appears
};

At index time:

  • For each trigram at position i, set loc_mask |= 1 << (i % 8)
  • Hash the character at i+3 (if exists) into a bit position, set next_mask |= 1 << (hash(c) % 8)

At query time:

  • Extract trigrams from query as before
  • For consecutive trigram pairs, check nextMask of first contains the 4th char
  • Check locMask adjacency: (locMask[tri1] << 1) & locMask[tri2] != 0
  • Only brute-force scan files that pass both filters

Impact

Dramatically reduces false-positive candidate files, especially for common trigrams. Effectively gives quadgram-level selectivity with trigram-level storage.

Reference

Cursor blog: Fast regex search — "Trigram Queries with Probabilistic Masks" section

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions