storage: avoid excessively wide range tombstones during Raft snapshot reception #44048
Description
kvBatchSnapshotStrategy blindly adds a range tombstone to the sstables it generates during Raft snapshot reception, one for each of the three key spans of a range. For most ranges this is perfectly fine, but for the last range in the keyspace it ends up adding a range tombstone spanning [<start>,/Max]. That key range overlaps any future range split off the end of the keyspace.
Why is this a problem? Such a wide range tombstone acts as a "block" in the RocksDB/Pebble LSM, preventing ingestion into lower levels. We see this in TPCC imports. At startup, the range ending in /Max is upreplicated from n1 to two other nodes, which each ingest an sstable with a range tombstone ending in /Max. This range is subsequently split many times during the import, and when the import tries to ingest sstables on these follower nodes, the ingestion hits L0 rather than L6. That in turn increases compaction pressure, allowing more sstables to build up in L0. The evidence so far suggests a downward spiral results.
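To make the failure mode concrete, here is a toy model of ingestion-level selection: an ingested sstable is placed at the deepest level it can reach without overlapping data in a shallower-or-equal level (overlap would break the newest-data-on-top invariant). The `span`, `overlaps`, and `targetLevel` names are illustrative only, not RocksDB/Pebble's actual ingest logic; keys are plain strings and `"\xff"` stands in for /Max.

```go
package main

import "fmt"

// span is a half-open key range [start, end).
type span struct{ start, end string }

func overlaps(a, b span) bool {
	return a.start < b.end && b.start < a.end
}

// targetLevel is a toy model of LSM ingestion-level selection: walk
// down from L1 and stop just above the first level containing
// overlapping data. This is a sketch, not Pebble's real algorithm.
func targetLevel(levels [][]span, sst span) int {
	target := 0
	for lvl := 1; lvl < len(levels); lvl++ {
		blocked := false
		for _, s := range levels[lvl] {
			if overlaps(s, sst) {
				blocked = true
				break
			}
		}
		if blocked {
			break
		}
		target = lvl
	}
	return target
}

func main() {
	// A wide range tombstone [a, \xff) ("ending in /Max") sits in L1.
	withWide := make([][]span, 7)
	withWide[1] = []span{{"a", "\xff"}}
	// The same LSM with a narrowed tombstone [a, b) instead.
	narrowed := make([][]span, 7)
	narrowed[1] = []span{{"a", "b"}}

	sst := span{"p", "q"} // sstable for a range split off the end
	fmt.Println(targetLevel(withWide, sst)) // blocked: lands in L0
	fmt.Println(targetLevel(narrowed, sst)) // clear path: lands in L6
}
```

The wide tombstone overlaps every later split-off range's keyspace, so their ingested sstables pile up in L0; with a narrowed tombstone the same sstable drops straight to L6.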
While kvBatchSnapshotStrategy appears to be the proximate culprit of such wide range tombstones, @nvanbenschoten speculates that Replica.clearSubsumedReplicaDiskData could have this same problem.
The suggestion is to introduce an additional check that narrows the scope of the range tombstone. Specifically, Store.receiveSnapshot can create an iterator and SeekLT on the end key of the range. The last key found this way, rather than the range's upper bound, is then used to derive the upper bound of the range tombstone.
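A minimal sketch of that narrowing step, with a sorted in-memory key slice standing in for an engine iterator's SeekLT; `narrowedTombstoneEnd` is a hypothetical helper name, not an existing CockroachDB function:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// narrowedTombstoneEnd returns a tighter upper bound for the range
// tombstone [start, end). Instead of blindly using end (which may be
// /Max for the last range), it finds the last existing key below end
// and returns the key just past it. The sorted keys slice stands in
// for an engine iterator; SeekLT(end) is emulated with a binary search.
func narrowedTombstoneEnd(keys [][]byte, start, end []byte) []byte {
	// Emulate iter.SeekLT(end): index of the last key < end.
	i := sort.Search(len(keys), func(i int) bool {
		return bytes.Compare(keys[i], end) >= 0
	}) - 1
	if i < 0 || bytes.Compare(keys[i], start) < 0 {
		// No data in [start, end): the tombstone collapses to empty,
		// i.e. its upper bound equals its lower bound.
		return start
	}
	// Tombstone end = immediate successor of the last key (key + \x00).
	return append(append([]byte{}, keys[i]...), 0)
}

func main() {
	keys := [][]byte{[]byte("a"), []byte("c"), []byte("m")}
	// Data present: the bound shrinks from "zzz" to just past "m".
	fmt.Printf("%q\n", narrowedTombstoneEnd(keys, []byte("a"), []byte("zzz")))
	// No data in the span: the tombstone collapses entirely.
	fmt.Printf("%q\n", narrowedTombstoneEnd(keys, []byte("n"), []byte("zzz")))
}
```

The key property is that the tombstone's upper bound now tracks the data actually present, so it can never extend over keyspace that later splits will carve off.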