VACUUM: add deleted-document bitmap for efficient dead tuple tracking

## Summary

Currently, `tp_bulkdelete` iterates all indexed CTIDs and calls the VACUUM callback but **discards the return value**—dead entries are never removed from the index. This causes a correctness bug when Postgres reuses CTIDs after VACUUM: the index can return completely wrong rows for queries.

### The CTID reuse problem

1. Row A at CTID (42,5) is indexed with term "database"
2. Row A is deleted, plain VACUUM runs
3. `ambulkdelete` is supposed to remove dead index entries **before** VACUUM frees heap line pointers for reuse—but we don't
4. New row B gets CTID (42,5), contains "networking"
5. Index says (42,5) matches "database"—Postgres fetches row B, visibility check **passes** (row B is live), wrong row returned

For `ORDER BY ... LIMIT` queries (our most common pattern), Postgres trusts the index ordering and doesn't re-evaluate the `<@>` operator against the heap tuple. This is a false positive, not just a stale score.

### Immediate fix (separate PR)

Rebuild affected segments during `ambulkdelete` using the existing bulk-indexing codepath. Correct but expensive for large segments with few deletions.

### This ticket: deleted-document bitmap

A more efficient long-term solution:

1. During `ambulkdelete`, record dead doc IDs in a per-segment bitmap (Roaring or simple bitset)
2. Store the bitmap alongside the segment (e.g., dedicated pages after the segment data)
3. During scan/BMW iteration, check the bitmap to skip dead docs
4. During segment merge/compaction, use the bitmap to omit dead entries from the output segment—this is the natural cleanup point

### Benefits over segment rebuild

- O(dead_docs) work during VACUUM instead of O(all_docs_in_segment)
- No re-tokenization of live documents
- Fixes the LIMIT shortfall (over-fetch by bitmap cardinality to compensate for filtered docs)
- Improves query performance (skip dead docs during BMW iteration rather than scoring them)

### Statistics considerations (not blocking)

- `total_docs` and per-term `doc_freq` in segment dictionaries are not updated after deletes, which affects BM25 IDF calculation
- For small numbers of deletes this is negligible; mass deletion of documents with specific terms can skew ranking
- Fixing this properly requires either updating segment metadata during VACUUM or recomputing during merge

### References

- OPTIMIZATION_ROADMAP.md "DELETE Handling" section and Phase 4.1 (Roaring Bitmaps)
- Lucene uses a similar approach: per-segment live-docs bitset, merged during segment compaction
- Tantivy: `DeleteBitSet` per segment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VACUUM: add deleted-document bitmap for efficient dead tuple tracking #264

Summary

The CTID reuse problem

Immediate fix (separate PR)

This ticket: deleted-document bitmap

Benefits over segment rebuild

Statistics considerations (not blocking)

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

VACUUM: add deleted-document bitmap for efficient dead tuple tracking #264

Description

Summary

The CTID reuse problem

Immediate fix (separate PR)

This ticket: deleted-document bitmap

Benefits over segment rebuild

Statistics considerations (not blocking)

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions