storage/engine: add pebbleReadOnly#41859
Conversation
3803e62 to
af02589
Compare
petermattis
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @itsbilal, and @sumeerbhola)
pkg/storage/engine/batch_test.go, line 1015 at r2 (raw file):
// With Pebble, writes to the distinct batch are readable by the // distinct batch. This semantic difference is due to not buffering // writes in a builder.
I don't think this difference between Pebble and RocksDB distinct batches will matter, but we all should be aware of it.
pkg/storage/engine/batch_test.go, line 1203 at r2 (raw file):
switch engineImpl.name { case "pebble": // Reverse iteration in batches works on Pebble.
A less important difference between Pebble and RocksDB. We don't support reverse iteration on RocksDB batches due to implementation limitations with iteration on rocksdb::WriteBatchWithIndex. No such limitation with Pebble, but this isn't something CRDB uses.
4878a5d to
7d17a26
Compare
itsbilal
left a comment
There was a problem hiding this comment.
Looks good overall, just some questions / concerns.
Reviewed 2 of 5 files at r1, 2 of 4 files at r3, 2 of 2 files at r4.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @petermattis, and @sumeerbhola)
pkg/storage/engine/pebble_batch.go, line 114 at r4 (raw file):
} } else if !p.batch.Indexed() { r = p.db
Is this just to mimic RocksDB behaviour, or do we actually do reads on distinct batches and expect them to fall through to the underlying db?
pkg/storage/engine/pebble_batch.go, line 146 at r4 (raw file):
// Use the cached iterator. // // TODO(itsbilal): Investigate if it's equally or more efficient to just call
Are we confident enough in ditching this TODO? Both performance-wise and correctness wise?
For eg. I just uncovered a minor issue in this area: I noticed that we sometimes end up running into the "iterator already in use" case where we expect a batch to be able to spawn two iterators at once. As a particular example, resolveLocalIntents in batchEval spins up one non-prefix iterator on a batch (the associated Close is in a defer), then calls SetAbortSpan which requests a prefix iterator on the same batch. This is fine in the RocksDB batch case since we cache a prefix and a non-prefix iterator. We could either mimic that setup here and cache two iterators, or just do no caching if the performance impact of that is negligible or none.
pkg/storage/engine/pebble_iterator.go, line 519 at r4 (raw file):
p.reusable = false p.inuse = false p.Close()
Calling Close() here with reusable = false is pretty unsafe, since it will put the iterator back in the pebbleIterPool, while the batch it is in will end up in pebbleBatchPool, resulting in cases where two threads could concurrently be using the same pebbleIterator. I actually made the same mistake a couple weeks ago, which is probably a testament to how managing reusable iterators can be confusing and why it's a good idea to consider ditching them altogether.
For now, you can probably just copy the relevant parts of Close (i.e. *p = pebbleIterator{...}) here and that's good enough.
7d17a26 to
cff7cd5
Compare
petermattis
left a comment
There was a problem hiding this comment.
TFTR!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @itsbilal, and @sumeerbhola)
pkg/storage/engine/pebble_batch.go, line 114 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Is this just to mimic RocksDB behaviour, or do we actually do reads on distinct batches and expect them to fall through to the underlying db?
We definitely use this for iteration. The change here was made to mimic the RocksDB behavior for tests in batch_test.go. I don't think we actually use this in any production code.
pkg/storage/engine/pebble_batch.go, line 146 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Are we confident enough in ditching this TODO? Both performance-wise and correctness wise?
For eg. I just uncovered a minor issue in this area: I noticed that we sometimes end up running into the "iterator already in use" case where we expect a batch to be able to spawn two iterators at once. As a particular example,
resolveLocalIntentsin batchEval spins up one non-prefix iterator on a batch (the associatedCloseis in a defer), then callsSetAbortSpanwhich requests a prefix iterator on the same batch. This is fine in the RocksDB batch case since we cache a prefix and a non-prefix iterator. We could either mimic that setup here and cache two iterators, or just do no caching if the performance impact of that is negligible or none.
Yeah, I'm confident of this performance-wise. Correctness is actually part of the motivation here. Creation of an iterator grabs an implicit snapshot. I'd prefer to mimic the behavior of RocksDB that we're getting by caching the iterator rather than having subtly different semantics.
Can you elaborate on the resolveLocalIntents problem? I haven't seen any test failures due to this. And the change here doesn't actually change the iterator already in use checking.
pkg/storage/engine/pebble_iterator.go, line 519 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Calling
Close()here withreusable = falseis pretty unsafe, since it will put the iterator back in thepebbleIterPool, while the batch it is in will end up inpebbleBatchPool, resulting in cases where two threads could concurrently be using the samepebbleIterator. I actually made the same mistake a couple weeks ago, which is probably a testament to how managing reusable iterators can be confusing and why it's a good idea to consider ditching them altogether.For now, you can probably just copy the relevant parts of
Close(i.e.*p = pebbleIterator{...}) here and that's good enough.
Good catch. I've reorganized the code so that Close() now calls destroy() instead. PTAL.
itsbilal
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r6.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @ajkr and @sumeerbhola)
pkg/storage/engine/pebble_batch.go, line 146 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Yeah, I'm confident of this performance-wise. Correctness is actually part of the motivation here. Creation of an iterator grabs an implicit snapshot. I'd prefer to mimic the behavior of RocksDB that we're getting by caching the iterator rather than having subtly different semantics.
Can you elaborate on the
resolveLocalIntentsproblem? I haven't seen any test failures due to this. And the change here doesn't actually change theiterator already in usechecking.
That makes sense, then - let's keep it this way and remove the TODO.
The issue I mentioned would fall under the umbrella of that TODO, but I could address it in a separate PR since it's unrelated to this change. It's not very well tested, but pops up every once in a while on a long running cluster
Just skim through resolveLocalIntents in batcheval/cmd_end_transaction.go, and you'll notice that there's an open batch iterator for the lifetime of that function. There's one case inside it where it could also call SetAbortSpan, which does an MVCCGetProto, which does an MVCCGet, which creates a short-lived prefix iterator on the batch.
In the RocksDB case that's fine since the two are cached independently and we don't end up reusing the same iterator, but in our current implementatino we only cache one iterator, and flip prefix to true as necessary. I guess if preserving semantics is our priority, we probably should just cache two iterators and call it a day.
pkg/storage/engine/pebble_iterator.go, line 519 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Good catch. I've reorganized the code so that
Close()now callsdestroy()instead. PTAL.
LGTM. There's a subtlety currently where calling destroy then Close could still leak a reusable iterator into the iterator pool, but looking at the places where we call destroy directly, I can't imagine any cases where it would happen. Plus destroy is unexported, so this seems safe.
petermattis
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @ajkr and @sumeerbhola)
pkg/storage/engine/pebble_batch.go, line 146 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
That makes sense, then - let's keep it this way and remove the TODO.
The issue I mentioned would fall under the umbrella of that TODO, but I could address it in a separate PR since it's unrelated to this change. It's not very well tested, but pops up every once in a while on a long running cluster
Just skim through
resolveLocalIntentsinbatcheval/cmd_end_transaction.go, and you'll notice that there's an open batch iterator for the lifetime of that function. There's one case inside it where it could also callSetAbortSpan, which does an MVCCGetProto, which does an MVCCGet, which creates a short-lived prefix iterator on the batch.In the RocksDB case that's fine since the two are cached independently and we don't end up reusing the same iterator, but in our current implementatino we only cache one iterator, and flip prefix to true as necessary. I guess if preserving semantics is our priority, we probably should just cache two iterators and call it a day.
Hmm. Interesting. It might be worth separating the cached prefix and normal iterators like we do for RocksDB. I'll leave this to a follow-on PR.
pkg/storage/engine/pebble_iterator.go, line 519 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
LGTM. There's a subtlety currently where calling
destroythenClosecould still leak a reusable iterator into the iterator pool, but looking at the places where we calldestroydirectly, I can't imagine any cases where it would happen. Plusdestroyis unexported, so this seems safe.
True. I've add a small mitigation which is to maintain the state of p.reusable. If Close() is subsequently called on a reusable iterator, we'll fall into the if p.reusable block.
Reimplement `Pebble.ReadOnly` with a specific `pebbleReadOnly`
implementation. This removes a lot of `if p.readOnly` checks from
`Pebble`, and also allows for `pebbleReadOnly.NewIterator` to use a
reusable iterator. This brings the behavior in line with
`rocksDBReadOnly`.
Added `pebbleIterator.{reusable,inuse,setOptions}`. Replaced
`pebbleBatchIterator` with a `pebbleIterator` marked as reusable.
Release note: None
Release note: None
cff7cd5 to
8b01e76
Compare
`pebbleBatch.NewIterator` was setting `pebbleBatch.iter.inuse = true`, and then calling `pebbleIterator.init` which was clearing that field. This was broken by cockroachdb#41859 which refactored how `pebbleBatch` iterator reuse works. Fixes cockroachdb#41899
sumeerbhola
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajkr, @itsbilal, and @petermattis)
pkg/storage/engine/pebble.go, line 578 at r8 (raw file):
} if opts.MinTimestampHint != (hlc.Timestamp{}) {
is hlc.Timestamp{} the zero timestamp? Is it possible for the above to be false but the MaxTimestampHint to be provided?
petermattis
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajkr, @itsbilal, and @sumeerbhola)
pkg/storage/engine/pebble.go, line 578 at r8 (raw file):
Previously, sumeerbhola wrote…
is hlc.Timestamp{} the zero timestamp? Is it possible for the above to be false but the MaxTimestampHint to be provided?
Yes, hlc.Timestamp{} is the zero timestamp. That's a Go-ism.
The logic here is copied from the RocksDB. I'm not sure why only MinTimestampHint is checked. That seems fragile, but we always set both timestamps whenever time-bound iteration is requested.
`pebbleBatch.NewIterator` was setting `pebbleBatch.iter.inuse = true`, and then calling `pebbleIterator.init` which was clearing that field. This was broken by cockroachdb#41859 which refactored how `pebbleBatch` iterator reuse works. Added separate cached prefix and normal iterators to `pebble{Batch,ReadOnly}`. Various bits of higher-level code expect to be able to have a prefix and normal iterator open at the same time. In particularly, this comes up during intent resolution. This also mimics our usage of RocksDB which seems desirable in the short term even though the semantics of having two cached iterators is slightly odd. Fixes cockroachdb#41899
Reimplement
Pebble.ReadOnlywith a specificpebbleReadOnlyimplementation. This removes a lot of
if p.readOnlychecks fromPebble, and also allows forpebbleReadOnly.NewIteratorto use areusable iterator. This brings the behavior in line with
rocksDBReadOnly.Added
pebbleIterator.{reusable,inuse,setOptions}. ReplacedpebbleBatchIteratorwith apebbleIteratormarked as reusable.Release note: None