perf: split L6 data into separate obsolete, live files #847
Description
In discussing compactions to remove obsolete keys from L6, Peter had an idea:
> I wonder if we could arrange for any key written to a bottommost level that is pinned by a snapshot to be written to a separate sstable. That is, L6 would be separated into L6-live and L6-obsolete where L6-obsolete contains records that are pinned by a snapshot. With this setup, we simply have to wait until snapshots are released and then can perform a compaction which deletes the L6-obsolete sstable.
L6 files may contain obsolete records that must be preserved because they're pinned by an open snapshot. When the pinning snapshot eventually closes, we want to reclaim the disk space occupied by these obsolete records (#838), but L6 tables are large and expensive to compact. If the obsolete keys were segmented into a separate file, the obsolete file could cheaply be dropped by a simple manifest edit once the pinning snapshots were released. This would prevent unnecessary read and write IO.
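The "cheap drop" could work roughly as follows: an L6-obsolete file becomes deletable once no open snapshot can observe any record in it, at which point only a manifest (version) edit is needed. A minimal sketch of that check, with hypothetical names (`obsoleteFile`, `droppable`) that are not Pebble's actual API:

```go
package main

import "fmt"

// obsoleteFile is a simplified stand-in for the metadata of an
// L6-obsolete sstable. Field names are hypothetical.
type obsoleteFile struct {
	fileNum        int
	smallestSeqNum uint64 // smallest sequence number of any record in the file
}

// droppable lists the L6-obsolete files that no open snapshot can observe.
// A snapshot at sequence number S sees records with seqnum <= S, so a file
// whose smallest seqnum exceeds every open snapshot is invisible to all of
// them and can be removed by a manifest edit alone, with no read/write IO.
func droppable(files []obsoleteFile, openSnapshots []uint64) []int {
	var newest uint64
	for _, s := range openSnapshots {
		if s > newest {
			newest = s
		}
	}
	var out []int
	for _, f := range files {
		if newest < f.smallestSeqNum {
			out = append(out, f.fileNum)
		}
	}
	return out
}

func main() {
	files := []obsoleteFile{
		{fileNum: 7, smallestSeqNum: 50},
		{fileNum: 8, smallestSeqNum: 130},
	}
	fmt.Println(droppable(files, []uint64{100})) // snapshot@100 pins file 7
	fmt.Println(droppable(files, nil))           // no snapshots: both droppable
}
```

In practice the pinning condition is more refined (a record is pinned only by snapshots between its seqnum and the shadowing record's seqnum), but the file-granularity check above captures the core idea: releasing the last pinning snapshot turns space reclamation into a metadata-only operation.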
The L6 obsolete files form an additional level beneath L6. Because records in the L6-obsolete level are known to be obsolete (vs ordinary levels where records might be obsolete), reads may skip any L6-obsolete files containing only sequence numbers strictly less than the iterator sequence number. Read amplification for reads at recent sequence numbers is unaffected, and range iterators at these recent sequence numbers avoid needing to skip over obsolete keys. Read amplification for reads at old sequence numbers is increased by 1.
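The read-path skip described above reduces to a simple per-file predicate. A sketch, again with hypothetical names (`fileMetadata`, `mustReadFile`) rather than Pebble's actual internals:

```go
package main

import "fmt"

// fileMetadata is a simplified stand-in for per-sstable metadata.
// Field names are hypothetical.
type fileMetadata struct {
	isObsolete    bool   // true if the file is in the L6-obsolete sublevel
	largestSeqNum uint64 // largest sequence number of any record in the file
}

// mustReadFile reports whether an iterator reading at iterSeqNum needs to
// visit the file. Every record in an L6-obsolete file is known to be
// shadowed by a newer version, so if the file's largest seqnum is strictly
// below the iterator's seqnum, the iterator would never surface any of its
// records and the file can be skipped entirely.
func mustReadFile(f fileMetadata, iterSeqNum uint64) bool {
	if f.isObsolete && f.largestSeqNum < iterSeqNum {
		return false
	}
	return true
}

func main() {
	obsolete := fileMetadata{isObsolete: true, largestSeqNum: 100}
	fmt.Println(mustReadFile(obsolete, 200)) // recent read: skip
	fmt.Println(mustReadFile(obsolete, 50))  // old snapshot read: must visit
}
```

This is why read amplification is unchanged for recent reads (the obsolete file is always skipped) but grows by one for reads at old sequence numbers (the predicate fails and the file must be consulted).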
We would need to experiment to get a sense of how much obsolete data is written to L6 sstables in practice to understand if such a large undertaking is worthwhile.
Jira issue: PEBBLE-215