storage: optimize MVCCGarbageCollect for large number of undeleted version#51462
Conversation
Not all implementations of Iterator actually support reverse iteration. In particular, the rocksdb batch does not. This commit augments the interface by asking implementers to report whether they do indeed support reverse iteration. Release note: None
…hmark We don't use this in the real code, why use it in the benchmark? Release note: None
Prior to this change, MVCCGarbageCollect performed a linear scan of all versions of a key, not just the versions being garbage collected. Given the pagination of deleting versions above this call, the linear behavior can result in quadratic runtime of GC when the number of versions vastly exceeds the page size. The benchmark results demonstrate the change's effectiveness. It's worth noting that for a single key with a single version, the change has a negative performance impact. I suspect this is due to the allocation of a key in order to construct the iterator. In cases involving more keys, I theorize the positive change is due to the fact that now the iterator is never seeked backwards due to the sorting of the keys. It's worth noting that since 20.1, the GC queue has been sending keys in the GC request in reverse order. I anticipate that this sorting is likely a good thing in that case too. The stepping optimization seemed important in the microbenchmarks for cases where most of the data was garbage. Without it, the change had small negative impact on performance. ``` name old time/op new time/op delta MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=2/deleteVersions=1-24 4.69µs ± 0% 5.12µs ± 1% +9.12% (p=0.004 n=5+6) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1-24 342µs ± 1% 337µs ± 0% -1.35% (p=0.004 n=6+5) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=16-24 348µs ± 2% 347µs ± 1% ~ (p=0.699 n=6+6) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=32-24 359µs ± 4% 359µs ± 4% ~ (p=0.699 n=6+6) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=512-24 572µs ± 2% 580µs ± 3% ~ (p=0.132 n=6+6) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1015-24 799µs ± 1% 811µs ± 3% ~ (p=0.126 n=5+6) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1023-24 793µs ± 0% 819µs ± 2% +3.19% (p=0.004 n=5+6) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=2/deleteVersions=1-24 3.38ms ± 2% 3.11ms ± 2% -8.07% (p=0.002 n=6+6) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1-24 433ms ± 2% 374ms ± 2% -13.70% (p=0.002 n=6+6) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=16-24 443ms ± 1% 380ms ± 1% -14.16% (p=0.004 n=6+5) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=32-24 449ms ± 0% 386ms ± 1% -14.22% (p=0.002 n=6+6) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=512-24 727ms ± 2% 653ms ± 2% -10.17% (p=0.002 n=6+6) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1015-24 1.00s ± 1% 0.95s ± 3% -5.84% (p=0.004 n=5+6) MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1023-24 1.02s ± 2% 0.98s ±10% ~ (p=0.180 n=6+6) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=2/deleteVersions=1-24 2.09µs ± 3% 2.42µs ± 1% +15.56% (p=0.004 n=6+5) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1-24 140µs ± 1% 5µs ± 1% -96.78% (p=0.010 n=6+4) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=16-24 142µs ± 1% 9µs ± 6% -93.59% (p=0.002 n=6+6) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=32-24 144µs ± 0% 14µs ± 2% -90.50% (p=0.004 n=5+6) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=512-24 216µs ± 1% 156µs ± 1% -27.80% (p=0.004 n=5+6) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1015-24 290µs ± 1% 303µs ± 1% +4.52% (p=0.002 n=6+6) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1023-24 294µs ± 1% 304µs ± 2% +3.24% (p=0.002 n=6+6) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=2/deleteVersions=1-24 1.93ms ± 2% 1.70ms ± 2% -11.80% (p=0.002 n=6+6) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1-24 390ms ± 1% 7ms ±25% -98.26% (p=0.008 n=5+5) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=16-24 394ms ± 2% 20ms ±22% -95.00% (p=0.002 n=6+6) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=32-24 391ms ± 1% 38ms ± 7% -90.38% (p=0.004 n=6+5) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=512-24 782ms ± 1% 394ms ± 0% -49.59% (p=0.008 n=5+5) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1015-24 1.59s ± 4% 0.80s ± 4% -49.65% (p=0.002 n=6+6) MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1023-24 1.58s ± 2% 0.80s ± 2% -49.23% (p=0.002 n=6+6) ``` Release note (performance improvement): Improved the efficiency of garbage collection when there are a large number of versions of a single key, commonly found when utilizing sequences.
2d36cfb to
8585416
Compare
itsbilal
left a comment
There was a problem hiding this comment.
really like the test in particular! And a nice optimization overall.
It might also be a good idea running any relevant roachtests with the COCKROACH_STORAGE_ENGINE=rocksdb env variable set in the environment calling the roachtest command, just to double-sanity-check the behaviour with rocksdb. But it looks good to me.
Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 2 of 2 files at r3.
Reviewable status:complete! 1 of 0 LGTMs obtained
pkg/storage/mvcc.go, line 3270 at r3 (raw file):
// importantly, this optimization mitigated the overhead of the Seek // approach when almost all of the versions are garbage. const nextsBeforeSeekLT = 4
Some food for thought in the future: This iterate-a-couple-times-through-versions-then-seek idiom has started to show up a bit more regularly. pebbleMVCCScanner uses a dynamic value for this parameter that gets adjusted between [1, 10] depending on how many versions it encounters in practice. Maybe if we abstracted away some of this logic, we could 1) use it in more places, and 2) be able to take advantage of dynamically adjusting this variable in all those places too.
ajwerner
left a comment
There was a problem hiding this comment.
I ran YCSB with a short GC TTL. GC happened and there were no problems. Thanks again for the review!
bors r=itsbilal
Reviewable status:
complete! 1 of 0 LGTMs obtained
pkg/storage/mvcc.go, line 3270 at r3 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Some food for thought in the future: This iterate-a-couple-times-through-versions-then-seek idiom has started to show up a bit more regularly.
pebbleMVCCScanneruses a dynamic value for this parameter that gets adjusted between[1, 10]depending on how many versions it encounters in practice. Maybe if we abstracted away some of this logic, we could 1) use it in more places, and 2) be able to take advantage of dynamically adjusting this variable in all those places too.
Interesting. It's not totally obvious to me how to extract this logic correctly.
Build succeeded |
This is a better-tested take-2 of #51184.
SupportsPrevmethod to theIteratorinterface.The win is now only pronounced on pebble, as might be expected.