storage: Use batches for direct RocksDB mutations#55708
storage: Use batches for direct RocksDB mutations#55708craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
petermattis
left a comment
There was a problem hiding this comment.
Presuming this fixes the bug, let's disable the DBImpl::{Put,Merge,Delete,SingleDelete,DeleteRange} code paths, or at least reverting the use of sync = true there.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, @petermattis, and @tbg)
pkg/storage/rocksdb.go, line 516 at r1 (raw file):
return err } return b.Commit(true)
Let's annotate these trues with Commit(true /* sync */).
935c1cd to
e0a0e7a
Compare
itsbilal
left a comment
There was a problem hiding this comment.
TFTR!
Removed the use of sync = true in those code paths. Can't remove them entirely as some tests depend on them
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jbowens, @petermattis, and @tbg)
pkg/storage/rocksdb.go, line 516 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Let's annotate these
trueswithCommit(true /* sync */).
Done.
itsbilal
left a comment
There was a problem hiding this comment.
Re: bugfix, I've run engine/switch/nodes=3 ~50 times with no repro, and engine/switch/encrypted ~20 times. Given the latter was failing approx. 50% of the time before this change, I'm pretty confident this fixes it.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jbowens, @petermattis, and @tbg)
petermattis
left a comment
There was a problem hiding this comment.
Were you able to reproduce the engine/switch/nodes=3 failure without this PR?
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @tbg)
pkg/storage/rocksdb.go, line 561 at r2 (raw file):
// It is safe to modify the contents of the arguments after ApplyBatchRepr // returns. func (r *RocksDB) ApplyBatchRepr(repr []byte, sync bool) error {
Can we get rid of the sync argument here? I don't think it is every used. Or perhaps we should just ignore this as we're remove RocksDB shortly anyways.
|
LGTM |
Currently, doing direct mutations on a RocksDB instance bypasses custom batching / syncing logic that we've built on top of it. This, or something internal to RocksDB, started leading to some bugs when all direct mutations started passing in WriteOptions.sync = true (see cockroachdb#55240 for when this change went in). In this change, direct mutations still commit the batch with sync=true to guarantee WAL syncing, but they go through the batch commit pipeline too, just like the vast majority of operations already do. Fixes cockroachdb#55362. Release note: None.
e0a0e7a to
8978797
Compare
|
Yes, I was able to repro it after ~50 runs. It's a lot less frequent than the |
|
TFTRs! bors r+ |
|
We'll want to backport this PR to 20.2, 20.1, and 19.2. |
|
Build succeeded: |
Currently, doing direct mutations on a RocksDB instance bypasses
custom batching / syncing logic that we've built on top of it.
This, or something internal to RocksDB, started leading to some bugs
when all direct mutations started passing in WriteOptions.sync = true
(see #55240 for when this change went in).
In this change, direct mutations still commit the batch with sync=true
to guarantee WAL syncing, but they go through the batch commit pipeline
too, just like the vast majority of operations already do.
Fixes #55362.
Release note: None.