Summary
Augment the trigram index with two 8-bit bloom filter masks per (trigram, file) posting, inspired by GitHub's Project Blackbird design.
Current State
TrigramIndex in src/index.zig maps Trigram (u24) → StringHashMap(void) — just file presence, no position or next-character info.
Proposed Change
Replace StringHashMap(void) with StringHashMap(PostingMask) where:
const PostingMask = struct {
next_mask: u8, // bloom filter of chars following this trigram
loc_mask: u8, // bit mask of (position % 8) where trigram appears
};
At index time:
- For each trigram at position
i, set loc_mask |= 1 << (i % 8)
- Hash the character at
i+3 (if exists) into a bit position, set next_mask |= 1 << (hash(c) % 8)
At query time:
- Extract trigrams from query as before
- For consecutive trigram pairs, check
nextMask of first contains the 4th char
- Check
locMask adjacency: (locMask[tri1] << 1) & locMask[tri2] != 0
- Only brute-force scan files that pass both filters
Impact
Dramatically reduces false-positive candidate files, especially for common trigrams. Effectively gives quadgram-level selectivity with trigram-level storage.
Reference
Cursor blog: Fast regex search — "Trigram Queries with Probabilistic Masks" section
Summary
Augment the trigram index with two 8-bit bloom filter masks per (trigram, file) posting, inspired by GitHub's Project Blackbird design.
Current State
TrigramIndexinsrc/index.zigmapsTrigram (u24) → StringHashMap(void)— just file presence, no position or next-character info.Proposed Change
Replace
StringHashMap(void)withStringHashMap(PostingMask)where:At index time:
i, setloc_mask |= 1 << (i % 8)i+3(if exists) into a bit position, setnext_mask |= 1 << (hash(c) % 8)At query time:
nextMaskof first contains the 4th charlocMaskadjacency:(locMask[tri1] << 1) & locMask[tri2] != 0Impact
Dramatically reduces false-positive candidate files, especially for common trigrams. Effectively gives quadgram-level selectivity with trigram-level storage.
Reference
Cursor blog: Fast regex search — "Trigram Queries with Probabilistic Masks" section