storage: batch Raft entry application across ready struct #37426
Description
Old prototype: #15648.
The Raft proposal pipeline currently applies Raft entries one-by-one to our storage engine in handleRaftReadyRaftMuLocked. We've seen time and time again that batching writes to RocksDB provides a sizable speedup, so this is a perfect candidate for batching. This will come with three concrete wins:
- batching the writes reduces the total amount of work incident on the system
- batching the writes reduces the number of cgo calls performed by Raft processing
- batching the writes should speed up a single iteration of handleRaftReadyRaftMuLocked, allowing the Range to schedule and run the next iteration sooner
Taken together, this should meaningfully increase the concurrency supported by the Raft proposal pipeline.
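To make the first two wins concrete, here is a minimal sketch (hypothetical types, not CockroachDB's actual engine interfaces) contrasting a commit per entry with a single commit for the whole batch. Under RocksDB each commit is a cgo call, so the batched strategy issues one engine call where the per-entry strategy issues N:

```go
package main

import "fmt"

// mockEngine stands in for the storage engine; a hypothetical sketch, not
// CockroachDB's actual engine.Engine interface.
type mockEngine struct {
	commits int // engine commits issued (each one a cgo call under RocksDB)
}

// batch buffers writes so they reach the engine in a single commit.
type batch struct {
	eng     *mockEngine
	pending []string
}

func (b *batch) Put(key string) { b.pending = append(b.pending, key) }

func (b *batch) Commit() {
	b.eng.commits++ // one engine call regardless of how many entries were staged
	b.pending = nil
}

// commitCounts applies the same entries both ways and returns the number of
// engine commits each strategy performs.
func commitCounts(entries []string) (perEntry, batched int) {
	// Before: one batch (and one commit) per entry.
	before := &mockEngine{}
	for _, e := range entries {
		b := &batch{eng: before}
		b.Put(e)
		b.Commit()
	}

	// After: all entries staged in one batch, committed once.
	after := &mockEngine{}
	b := &batch{eng: after}
	for _, e := range entries {
		b.Put(e)
	}
	b.Commit()

	return before.commits, after.commits
}

func main() {
	per, one := commitCounts([]string{"e1", "e2", "e3", "e4"})
	fmt.Println(per, one) // 4 1
}
```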
Implementation notes
To make this change we will need to transpose the order of operations in the committed entries loop and in processRaftCommand. Instead of the control flow looking like:
```
for each entry:
    check if failed
    if not:
        apply to engine
    ack client, release latches
```
it will look something like:
```
for each entry:
    check if failed
    if not:
        apply to batch
commit batch
for each entry:
    ack client, release latches
```
Care will be needed to correctly handle the in-memory side effects of Raft entries under this new ordering.
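The transposed control flow above can be sketched as follows. All names here are hypothetical stand-ins for the real processRaftCommand machinery; the point is only the ordering: failed entries are skipped, non-failed entries are staged into one batch, the batch is committed once, and only then are clients acked and latches released (in entry order):

```go
package main

import "fmt"

// entry is a toy stand-in for a committed Raft entry.
type entry struct {
	key    string
	failed bool // e.g. the proposal was rejected during the "check if failed" step
}

// applyCommitted mirrors the transposed loop: stage all non-failed entries
// in one batch, commit once, then ack clients. It returns the keys applied
// to the engine and the keys acked, in order.
func applyCommitted(entries []entry) (applied, acked []string) {
	// First loop: check each entry and apply the survivors to the batch.
	var batchKeys []string
	for _, e := range entries {
		if e.failed {
			continue // failed entries never reach the batch
		}
		batchKeys = append(batchKeys, e.key) // apply to batch
	}

	// Commit batch: a single engine write covers every staged entry.
	applied = append(applied, batchKeys...)

	// Second loop: only after the batch is committed do we ack clients and
	// release latches; in-memory side effects would also be handled here,
	// in entry order.
	for _, k := range batchKeys {
		acked = append(acked, k)
	}
	return applied, acked
}

func main() {
	entries := []entry{{key: "a"}, {key: "b", failed: true}, {key: "c"}}
	applied, acked := applyCommitted(entries)
	fmt.Println(applied, acked) // [a c] [a c]
}
```

Note that no client is acked until the entire batch is durable, which is what makes handling in-memory side effects delicate: they can no longer be interleaved with each entry's engine write.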
Predicted win
The improvement on single-range throughput as predicted by rafttoy is:
```
name                        old time/op    new time/op    delta
Raft/conc=256/bytes=256-16    15.6µs ± 5%    14.6µs ±15%  -6.87%  (p=0.046 n=16+20)

name                        old speed      new speed      delta
Raft/conc=256/bytes=256-16  16.4MB/s ± 5%  17.7MB/s ±17%  +8.30%  (p=0.043 n=16+20)
```
It's worth noting that rafttoy uses pebble instead of RocksDB. Writing to pebble doesn't incur a cgo call, so batching writes isn't quite as critical there. It's therefore reasonable to expect the throughput improvement in CockroachDB to be even greater than this prediction.