kvserver, batcheval: pin Engine state during read-only command evaluation#76312
Conversation
Hey @nvanbenschoten, does this match your expectation of what we had discussed in the last few weeks? This may need a few tests, but I wanted to confirm that the approach is close to what you had in mind. I confirmed that pinning the engine state eagerly (and doing nothing else) fails. cc @arulajmani
Force-pushed from 63fd520 to 0e68cf0
We'd discussed that we should measure the overhead of the pinning we do in this patch, with and without the additional overhead of calling `Replica.GetCurrentClosedTimestamp()`. Looks like the pinning alone has a pretty significant cost (I recall @sumeerbhola had been mildly concerned about something like this?). I'll try diffing the CPU profiles between the two.
sumeerbhola
left a comment
Looks like the pinning alone has a pretty significant cost
Which benchmark is this?
The difference is that before, we were creating a `pebble.Iterator` when creating the first `pebbleIterator` and then cloning it for future `pebbleIterator`s: 1 create + (n-1) clones.
Now we are doing 1 create + n clones, and most of the time n=1. An optimization would be to store a bool in `pebbleReadOnly` and `pebbleBatch` recording whether the `{pebbleReadOnly,pebbleBatch}.iter` has been used before, and if not, skip the clone.
There is also an existing bug where we may not close the iter:

```go
// Setting iter to nil is sufficient since it will be closed by one of the
// subsequent destroy calls.
p.iter = nil
```

This is not true if no one has created an `MVCCIterator` or `EngineIterator`. The bool will also help for fixing this.
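The suggested bool and the leak fix can be sketched together. The types below are simplified stand-ins for `pebbleReadOnly` and `pebble.Iterator`, not the actual CockroachDB code; `pin`, `getIter`, and the `iter` struct are hypothetical names for illustration:

```go
package main

import "fmt"

// iter is a stand-in for pebble.Iterator; Close must be called exactly once.
type iter struct{ closed bool }

func (i *iter) Close() { i.closed = true }

// readOnly is a simplified stand-in for pebbleReadOnly. The cached iter is
// created when the engine state is pinned; iterUnused records whether any
// caller has taken it yet.
type readOnly struct {
	iter       *iter
	iterUnused bool
}

// pin eagerly creates the cached iterator, fixing the engine state that all
// subsequent iterators will observe.
func (r *readOnly) pin() {
	if r.iter == nil {
		r.iter = &iter{} // 1 create
		r.iterUnused = true
	}
}

// getIter hands out the cached iterator on first use (avoiding a redundant
// clone) and a clone of it on subsequent uses.
func (r *readOnly) getIter() *iter {
	r.pin()
	if r.iterUnused {
		r.iterUnused = false
		return r.iter
	}
	return &iter{} // a clone of r.iter in the real code (simplified here)
}

// Close fixes the leak: if the pinned iter was never handed out, no
// iterator's destroy path will ever close it, so close it here.
func (r *readOnly) Close() {
	if r.iter != nil && r.iterUnused {
		r.iter.Close()
	}
	r.iter = nil
}

func main() {
	r := &readOnly{}
	r.pin()   // engine state pinned eagerly
	r.Close() // no iterator was created: Close must free the pinned iter
	fmt.Println("ok")
}
```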
Reviewed 4 of 14 files at r1, 4 of 6 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @sumeerbhola)
pkg/kv/kvserver/replica_eval_context.go, line 55 at r2 (raw file):
```go
// GetClosedTimestamp implements the EvalContext interface.
func (ec *evalContextImpl) GetClosedTimestamp(_ context.Context) hlc.Timestamp {
```
this is implementing `ImmutableEvalContext`, yes?
Odd that it is named immutable, but the closedTS field is protected by a mutex. Who modifies it?
Force-pushed from 1fa3a1d to 099c920
aayushshah15
left a comment
Which benchmark is this?
I should've mentioned it before, but it was just kv95 running for 5 mins.
```
./cockroach workload run kv --init --tolerate-errors --concurrency=192 --splits=1000 --ramp=1m --duration=5m --read-percent=95 --min-block-bytes=4096 --max-block-bytes=4096 {pgurl:1-5}
```
I added the bookkeeping you mentioned to avoid redundantly cloning the iterator an extra time. Here are the numbers with that change (I'm not sure if they're directly comparable to the numbers I posted above, since I don't remember what cluster size I ran the previous numbers on):
```
name \ ops/sec  master.txt  pin-with-closedts.txt  pin-wo-closedts.txt
.               108k ± 2%   112k ± 1%              114k ± 0%

name \ p50      master.txt  pin-with-closedts.txt  pin-wo-closedts.txt
.               1.17 ± 6%   1.10 ± 0%              1.10 ± 0%

name \ p95      master.txt  pin-with-closedts.txt  pin-wo-closedts.txt
.               5.93 ± 2%   5.58 ± 4%              5.50 ± 0%

name \ p99      master.txt  pin-with-closedts.txt  pin-wo-closedts.txt
.               11.7 ± 3%   11.0 ± 0%              11.0 ± 0%
```
Surprisingly, it doesn't seem like any of the locking inside `Replica.GetCurrentClosedTimestamp()` has much of an effect on the workload. cc @nvanbenschoten
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/replica_eval_context.go, line 55 at r2 (raw file):
Previously, sumeerbhola wrote…
this is implementing `ImmutableEvalContext`, yes?
Odd that it is named immutable, but the `closedTS` field is protected by a mutex. Who modifies it?
You're right, we didn't need the mutex here.
@nvanbenschoten mentioned that we should run the numbers with/without the call into `Replica.GetCurrentClosedTimestamp()`. I ran them again for 10 iterations this time, and here they are.
Force-pushed from c0a1980 to b5a3994
Force-pushed from b5a3994 to ac8da77
In cockroachdb#76312, we're making it such that a replica's closed timestamp is computed during the creation of the batch request's evaluation context. This is to ensure that a replica's closed timestamp is captured before we pin the state of its storage engine (since otherwise, the closed timestamp could be _newer_ than the state of the engine the command is evaluating over). See: cockroachdb#70974.

However, computing the replica's current closed timestamp on every command evaluation can be expensive due to the mutual exclusion involved (it requires that we hold `Replica.mu` as well as a few locks inside the closed timestamp side transport). Since only the `QueryResolvedTimestampRequest` and `SubsumeRequest` need to read a replica's closed timestamp, this commit makes it such that we elide this work for every other request.

Release note: None
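Since only `QueryResolvedTimestampRequest` and `SubsumeRequest` read the replica's closed timestamp, the elision described in this commit message amounts to a per-request-type gate. A minimal sketch, with hypothetical names (`Method`, `needsClosedTSSnapshot`) standing in for the actual request-flag machinery:

```go
package main

import "fmt"

// Method is a stand-in for roachpb request method types.
type Method int

const (
	Get Method = iota
	Scan
	QueryResolvedTimestamp
	Subsume
)

// needsClosedTSSnapshot reports whether evaluating a request of this type
// reads the replica's closed timestamp, and therefore whether the relatively
// expensive closed-timestamp capture must happen when the EvalContext is
// created. All other requests skip the work entirely.
func needsClosedTSSnapshot(m Method) bool {
	switch m {
	case QueryResolvedTimestamp, Subsume:
		return true
	default:
		return false
	}
}

func main() {
	for _, m := range []Method{Get, Scan, QueryResolvedTimestamp, Subsume} {
		fmt.Println(m, needsClosedTSSnapshot(m))
	}
}
```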
Force-pushed from ac8da77 to 5795da6
Thanks for gathering those extra numbers. Is the only difference between the two the call into `Replica.GetCurrentClosedTimestamp()`?
Yep.
Yes, I've added a commit to this PR that makes it such that most requests won't call into `Replica.GetCurrentClosedTimestamp()`.
|
For even more completeness, do you have this direct comparison between master and the last commit of this PR?
Here it is.
nvb
left a comment
Reviewed 2 of 14 files at r1, 1 of 6 files at r2, 11 of 15 files at r3, 4 of 4 files at r4, 4 of 4 files at r5, 7 of 7 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @sumeerbhola)
pkg/kv/kvserver/replica_eval_context.go, line 39 at r4 (raw file):
```go
func newEvalContextImpl(r *Replica) *evalContextImpl {
	return &evalContextImpl{
```
Should we pool these objects to avoid the new heap allocation? Their lifetime should be pretty well defined.
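Pooling as suggested could look like the following minimal sketch using `sync.Pool`. The `evalContextImpl` here is a simplified stand-in for the PR's struct (which embeds `*Replica` and an `hlc.Timestamp`), and `release` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"sync"
)

// evalContextImpl is a stand-in for the PR's struct; in the real code it
// embeds *Replica and carries the closed timestamp captured at creation.
type evalContextImpl struct {
	closedTS int64
}

var evalCtxPool = sync.Pool{
	New: func() interface{} { return &evalContextImpl{} },
}

// newEvalContextImpl draws from the pool instead of heap-allocating on every
// command evaluation; the caller must release the context when done.
func newEvalContextImpl(closedTS int64) *evalContextImpl {
	ec := evalCtxPool.Get().(*evalContextImpl)
	ec.closedTS = closedTS
	return ec
}

// release zeroes the struct and returns it to the pool. This is safe only
// because the context's lifetime ends when command evaluation completes.
func (ec *evalContextImpl) release() {
	*ec = evalContextImpl{}
	evalCtxPool.Put(ec)
}

func main() {
	ec := newEvalContextImpl(42)
	fmt.Println(ec.closedTS)
	ec.release()
}
```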
pkg/kv/kvserver/replica_eval_context.go, line 41 at r4 (raw file):
```go
	return &evalContextImpl{
		Replica:  r,
		closedTS: r.GetCurrentClosedTimestamp(context.TODO()),
```
Let's plumb a ctx to here.
pkg/kv/kvserver/batcheval/eval_context.go, line 142 at r4 (raw file):
```go
}

// ImmutableEvalContext is like EvalContext, but it encapsulates state that
```
The idea was to pass only this interface to read-only command evaluation functions, right? To do that, we may need to pull the EvalContext out of CommandArgs. Even if that is the plan, we can do it in a separate PR.
pkg/kv/kvserver/batcheval/eval_context.go, line 147 at r4 (raw file):
```go
// GetClosedTimestamp returns the current closed timestamp on the range.
// It is expected that a caller will have performed some action (either
// calling RevokeLease or WatchForMerge) to freeze further progression of
```
This comment reminds me that GetClosedTimestamp is also used during range subsumption and lease transfers. In those cases, we want the latest closed timestamp, not the closed timestamp at the time of the storage snapshot. The ordering is very important for those cases, as they need the closed timestamp to be computed after their call to WatchForMerge or RevokeLease. Do you have thoughts on what we should do here? Maybe expose two methods?
pkg/storage/pebble.go, line 1681 at r4 (raw file):
```go
	iter   cloneableIter
	unused bool
```
nit: should this be called iterUnused? Is it clear what it's referring to? Same question below in pebbleBatch.
sumeerbhola
left a comment
Reviewed 5 of 15 files at r3, 3 of 4 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, and @sumeerbhola)
pkg/storage/pebble.go, line 1681 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: should this be called `iterUnused`? Is it clear what it's referring to? Same question below in `pebbleBatch`.
+1 to iterUnused
…luation

This commit introduces a change to the way certain types of read-only requests are evaluated. Traditionally, read-only requests have held their latches throughout their execution. This commit allows certain qualifying reads to release their latches earlier.

At a high level, reads may attempt to resolve all conflicts upfront by performing a sort of "validation" phase before they perform their MVCC scan. This validation phase performs a scan of the lock table keyspace in order to find any conflicting intents that may need to be resolved before the actual evaluation of the request over the MVCC keyspace. If no conflicting intents are found, then (since cockroachdb#76312) the request is guaranteed to be fully isolated against all other concurrent requests and can be allowed to release its latches at this point. This allows the actual evaluation of the read (over the MVCC part of the keyspace) to proceed without latches being held, which is the main motivation of this work.

This validation phase could be thought of as an extension to the validation that the concurrency manager already performs when requests are sequenced through it, by trying to detect any conflicting intents that have already been pulled into the in-memory lock table.

Additionally, for certain types of requests that can drop their latches early and do not need to access the `IntentHistory` for any of their parent txn's intents, this commit attempts to make their MVCC scan cheaper by eliminating the need for an `intentInterleavingIterator`. This is enabled by the observation that once the validation phase is complete, the only remaining intents in the read's declared span must be intents belonging to the reader's transaction. So if the reader doesn't need to read an intent that isn't the latest intent on a key, then it doesn't need access to the key's `IntentHistory` (which lives in the lock-table keyspace), and doesn't need to use an `intentInterleavingIterator`.

Release note (performance improvement): certain types of reads will now have a far smaller contention footprint with conflicting concurrent writers.

Resolves cockroachdb#66485

Release justification: high benefit change to existing functionality, part of the 22.2 roadmap.
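The validate-then-evaluate flow in this commit message can be sketched roughly as follows. Everything here (`intent`, `evalReadOnly`, the callback) is a hypothetical simplification for illustration, not the actual CockroachDB code:

```go
package main

import "fmt"

// intent is a stand-in for a write intent found in the lock table keyspace.
type intent struct {
	txnID, key string
}

// evalReadOnly sketches the two-phase flow: first scan the lock table span
// for conflicting intents; only if none are found can latches be released
// before the MVCC scan runs. Returns whether latches were dropped early.
func evalReadOnly(readerTxn string, lockTable []intent, releaseLatches func()) bool {
	// Validation phase: look for intents from other transactions in the
	// read's declared span.
	for _, in := range lockTable {
		if in.txnID != readerTxn {
			// Conflicting intent: it must be resolved first, so the read
			// keeps holding its latches through evaluation.
			return false
		}
	}
	// No conflicts: the read is fully isolated from concurrent writers,
	// so its latches can be dropped before the MVCC scan.
	releaseLatches()
	// ... the MVCC scan would run here, latch-free ...
	return true
}

func main() {
	released := false
	ok := evalReadOnly("txn1",
		[]intent{{txnID: "txn1", key: "a"}}, // only the reader's own intent
		func() { released = true })
	fmt.Println(ok, released)
}
```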
This commit makes it such that we eagerly pin the engine state of the `Reader` created during the evaluation of read-only requests.

Generally, reads will hold latches throughout the course of their evaluation (particularly, while they do their `MVCCScan`). Mainly, this commit paves the way for us to move to a world where we avoid holding latches during the `MVCCScan`. Additionally, it also lets us make MVCC garbage collection latchless as described in #55293.

There are a few notable changes in this patch:

- Pinning the engine state eagerly runs into "kv: QueryResolvedTimestamp iterator snapshot may be captured before closed timestamp" #70974. To resolve this, the closed timestamp of the `Replica` is now captured at the time the `EvalContext` is created, and not during the command evaluation of `QueryResolvedTimestampRequest`.
- `EvalContext` now has an `ImmutableEvalContext` embedded into it. The `ImmutableEvalContext` is supposed to encapsulate state that must not change after the `EvalContext` is created. The closed timestamp of the replica is part of the `ImmutableEvalContext`.
- `Replica` no longer fully implements the `EvalContext` interface. Instead, it implements everything but `GetClosedTimestamp()` (which is implemented by `ImmutableEvalContext` instead).

Relates to #55293
Resolves #55461
Resolves #70974

Release note: None
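The embedding arrangement described in this commit message can be sketched as follows. All types here are simplified stand-ins for the PR's `EvalContext`, `ImmutableEvalContext`, and `Replica`, with an `int64` in place of `hlc.Timestamp`:

```go
package main

import "fmt"

// ImmutableEvalContext is a stand-in interface: state captured once, at
// EvalContext creation time, that must not change afterwards.
type ImmutableEvalContext interface {
	GetClosedTimestamp() int64
}

// EvalContext embeds ImmutableEvalContext alongside the mutable,
// replica-backed surface (elided here).
type EvalContext interface {
	ImmutableEvalContext
	// ... the rest of the replica-backed methods ...
}

// replica is a stand-in for Replica; note that it does NOT implement
// GetClosedTimestamp itself anymore.
type replica struct {
	currentClosedTS int64
}

// evalContextImpl couples the replica with the closed timestamp captured
// when the context was created.
type evalContextImpl struct {
	*replica
	closedTS int64
}

func (ec *evalContextImpl) GetClosedTimestamp() int64 { return ec.closedTS }

func newEvalContext(r *replica) EvalContext {
	// Capture the closed timestamp BEFORE pinning the engine state, so it
	// can never be newer than the engine snapshot the command reads from.
	return &evalContextImpl{replica: r, closedTS: r.currentClosedTS}
}

func main() {
	r := &replica{currentClosedTS: 100}
	ec := newEvalContext(r)
	r.currentClosedTS = 200 // later advancement is not observed
	fmt.Println(ec.GetClosedTimestamp())
}
```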