kvserver: check GC threshold after acquiring a storage snapshot #78980
craig[bot] merged 2 commits into cockroachdb:master
In the last discussion with @nvanbenschoten, he realized that the fact that this patch moves the lease check to happen after the pebble snapshot was captured is a problem. This is because the in-memory side effect of lease changes is applied in the "post-apply" stage. Consider the following scenario:
Next steps: this also means that we will now be splitting the single critical section under `Replica.mu` into two.
I ran kv95 with the following repro steps and compared the results:
@nvanbenschoten this is ready for a look.
nvb left a comment:
However, we spoke in person about using more iterations (maybe 25) to confirm that there is no statistically significant impact on kv95.
I'm also curious about `--min-block-bytes=4096 --max-block-bytes=4096`. If you haven't run the tests again, consider using the default block size to avoid making these operations too "heavy" and drowning out the potential for lock contention.
Also, consider `--gce-machine-type=n2-standard-32 --gce-min-cpu-platform='Intel Ice Lake'` to run on a single NUMA node.
Reviewed 4 of 7 files at r1, 4 of 4 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)
I'd spun up a few runs right after our meeting (so they were running with the same repro steps as listed in #78980 (comment), but for 25 iterations). Here were the results for that:

I'm spinning up re-runs of these with Ice Lake and without the block size constraints now.
Here are these numbers: What do you think, @nvanbenschoten?
Am I reading this correctly that we're seeing a speedup with this change? If so, isn't that surprising?
Previously, we would check whether a given read could proceed with execution _before_ we acquired a snapshot of the storage engine state. In particular, we would check the replica's in-memory GC threshold before the state of the engine had been pinned. This meant that the following scenario was possible:

1. A request comes in, checks the replica's in-memory GC threshold, and determines that it is safe to proceed.
2. A pair of GC requests bump the GC threshold and garbage-collect the expired data.
3. The read acquires a snapshot of the storage engine state.
4. The read continues with its execution and sees an incorrect result.

This commit makes it such that we now check the GC threshold after we acquire a snapshot of the storage engine state. It does so by lifting the GC threshold check out of `checkExecutionCanProceed()` and splitting the single critical section under `Replica.mu` into two.

NB: this commit only fixes our current model of issuing `GCRequest`s, which is that we first issue a `GCRequest` that simply bumps the GC threshold and then issue another `GCRequest` that actually performs the garbage collection. If a single `GCRequest` were to do both of these things, we'd still have an issue with reads potentially seeing incorrect results, since the in-memory GC threshold is currently bumped as a "post-apply" side effect that takes effect after the expired data has already been garbage collected. This will be handled in a future patch.

Release note: none
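The ordering fix above can be illustrated with a minimal sketch. All names here (`Replica`, `pinSnapshot`, `read`) are simplified stand-ins for kvserver's real types and methods, not CockroachDB's actual API; the point is only that the snapshot is pinned before the threshold is consulted, so a concurrent GC cannot slip in between the check and the snapshot acquisition:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type Timestamp int64

// Replica is a hypothetical, simplified stand-in for kvserver's Replica.
type Replica struct {
	mu struct {
		sync.Mutex
		gcThreshold Timestamp
	}
	engine map[string]string // stand-in for the storage engine
}

// Snapshot models a pinned, immutable view of engine state, as a Pebble
// snapshot would provide.
type Snapshot struct{ data map[string]string }

func (r *Replica) pinSnapshot() Snapshot {
	snap := Snapshot{data: make(map[string]string, len(r.engine))}
	for k, v := range r.engine {
		snap.data[k] = v
	}
	return snap
}

// read checks the GC threshold only *after* the snapshot is pinned. Any GC
// that bumps the threshold and deletes data after this point cannot affect
// the pinned view, so the check is race-free with respect to the read.
func (r *Replica) read(ts Timestamp, key string) (string, error) {
	snap := r.pinSnapshot() // 1. pin engine state first

	r.mu.Lock() // 2. then check the threshold in its own critical section
	threshold := r.mu.gcThreshold
	r.mu.Unlock()

	if ts < threshold {
		return "", errors.New("batch timestamp must be after replica GC threshold")
	}
	return snap.data[key], nil // 3. evaluate against the pinned snapshot
}

func main() {
	r := &Replica{engine: map[string]string{"a": "1"}}
	r.mu.gcThreshold = 10
	if _, err := r.read(5, "a"); err != nil {
		fmt.Println("read below GC threshold rejected:", err)
	}
	if v, err := r.read(20, "a"); err == nil {
		fmt.Println("read above GC threshold sees:", v)
	}
}
```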
kvserver: unskip a subset of `TestGCThresholdRacesWithRead`

This commit unskips a subset of `TestGCThresholdRacesWithRead`, which is now possible because of cockroachdb#76312 and the first commit in this patch. See cockroachdb#76410 (comment).

Relates to cockroachdb#55293.

Release note: none
I was wondering if it was because of the changes mentioned in https://cockroachlabs.slack.com/archives/C021WMUML04/p1655549375961439. I rebased and re-ran both sets of numbers off of 5541cf8:
Thanks for re-running. That's a small enough performance delta to be justified by the improvement here. We may also be able to offset it if we can remove …
Thanks for taking a look!

bors r+
Build succeeded: |
84106: kv: avoid heap alloc of RangeAppliedState per Raft apply batch r=erikgrinaker a=nvanbenschoten
This commit hangs a `RangeAppliedState` struct off of `replicaAppBatch`
so that a Raft application batch can avoid a heap allocation during its
call to `addAppliedStateKeyToBatch`.
```
name old time/op new time/op delta
KV/Insert/Native/rows=1-10 46.3µs ± 7% 46.3µs ± 4% ~ (p=0.553 n=19+18)
KV/Insert/SQL/rows=1-10 141µs ±18% 133µs ± 7% ~ (p=0.075 n=19+18)
name old alloc/op new alloc/op delta
KV/Insert/Native/rows=1-10 15.6kB ± 2% 15.4kB ± 1% -1.46% (p=0.000 n=20+20)
KV/Insert/SQL/rows=1-10 43.0kB ± 0% 42.9kB ± 0% -0.43% (p=0.000 n=19+19)
name old allocs/op new allocs/op delta
KV/Insert/Native/rows=1-10 131 ± 0% 130 ± 0% -0.76% (p=0.000 n=19+20)
KV/Insert/SQL/rows=1-10 350 ± 0% 349 ± 0% -0.29% (p=0.000 n=16+19)
```
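The allocation pattern behind this win can be sketched as follows. The type and field names (`RangeAppliedState`, `replicaAppBatch`, `asAlloc`) mirror the commit's description but are simplified stand-ins, not the real kvserver definitions: embedding a scratch struct by value in the batch lets each apply reuse it instead of heap-allocating a fresh one per call:

```go
package main

import "fmt"

// RangeAppliedState is a simplified stand-in for the kvserver proto of the
// same name.
type RangeAppliedState struct {
	RaftAppliedIndex  uint64
	LeaseAppliedIndex uint64
}

// replicaAppBatch embeds a RangeAppliedState by value. The scratch struct
// lives inside the batch, so returning a pointer to it does not force a
// per-apply heap allocation the way `&RangeAppliedState{...}` would.
type replicaAppBatch struct {
	asAlloc RangeAppliedState
}

func (b *replicaAppBatch) addAppliedStateKeyToBatch(raftIdx, leaseIdx uint64) *RangeAppliedState {
	as := &b.asAlloc
	// Reset and reuse the scratch struct on every apply.
	*as = RangeAppliedState{RaftAppliedIndex: raftIdx, LeaseAppliedIndex: leaseIdx}
	return as
}

func main() {
	b := &replicaAppBatch{}
	as := b.addAppliedStateKeyToBatch(7, 3)
	fmt.Println(as.RaftAppliedIndex, as.LeaseAppliedIndex)
}
```

This works because the batch outlives the pointer's use within a single apply, so the compiler never needs to move the struct to the heap.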
84108: kv/batcheval: replace command hash-map with array r=nvanbenschoten a=nvanbenschoten
The per-request hash-map lookup showed up in CPU profiles while running TPC-E:
```
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
0.03s 60.00% | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).collectSpans github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:1093 (inline)
0.02s 40.00% | github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:483 (inline)
0 0% 0% 0.05s 0.029% | github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.LookupCommand github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/command.go:116
0.02s 40.00% | runtime.mapaccess2_fast64 GOROOT/src/runtime/map_fast64.go:57
0.02s 40.00% | runtime.mapaccess2_fast64 GOROOT/src/runtime/map_fast64.go:84
0.01s 20.00% | runtime.mapaccess2_fast64 GOROOT/src/runtime/map_fast64.go:69
----------------------------------------------------------+-------------
```
It's easy enough to get rid of by replacing the hash-map with an array, like we do in other parts of the code.
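A minimal sketch of that map-to-array swap, with hypothetical `Method` and `Command` types standing in for the real batcheval definitions: since request methods are small, dense integer constants, a fixed-size array indexed by the method replaces the hash-and-probe of `mapaccess2_fast64` with a single bounds-checked load:

```go
package main

import "fmt"

// Method models the request-method enum; real values live in roachpb.
type Method int

const (
	Get Method = iota
	Put
	Scan
	NumMethods // sentinel: array length
)

// Command is a stand-in for batcheval's per-request command descriptor.
type Command struct{ name string }

// Before: a hash-map lookup on every request.
var cmdMap = map[Method]Command{Get: {"Get"}, Put: {"Put"}, Scan: {"Scan"}}

// After: a fixed-size array, populated once at init.
var cmdArr [NumMethods]Command

func init() {
	for m, c := range cmdMap {
		cmdArr[m] = c
	}
}

// LookupCommand indexes the array directly; no hashing, no probing.
func LookupCommand(m Method) (Command, bool) {
	if m < 0 || m >= NumMethods {
		return Command{}, false
	}
	return cmdArr[m], true
}

func main() {
	if c, ok := LookupCommand(Scan); ok {
		fmt.Println("found:", c.name)
	}
}
```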
84110: kv: use atomic bool to avoid read lock in Replica.IsInitialized r=knz,erikgrinaker a=nvanbenschoten
According to a CPU profile acquired while running TPC-E, the read lock acquired in `Replica.IsInitialized` is the second most expensive read lock in terms of CPU cycles consumed. This doesn't mean that it's the most expensive in terms of blocking, but it still indicates that the lock is frequently acquired.
```
----------------------------------------------------------+-------------
0.29s 43.28% | github.com/cockroachdb/pebble/internal/cache.(*shard).Get github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:117 (inline)
0.05s 7.46% | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).IsInitialized github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:249 (inline)
...
0.67s 0.39% 24.00% 0.67s 0.39% | sync.(*RWMutex).RLock GOROOT/src/sync/rwmutex.go:61
----------------------------------------------------------+-------------
```
The two main callers of this method are `Store.Send` (on each request) and `Replica.handleRaftReadyRaftMuLocked` (on each raft ready iteration):
```
----------------------------------------------------------+-------------
0.04s 80.00% | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:176
0.01s 20.00% | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:829
0.04s 0.023% 0.023% 0.05s 0.029% | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).IsInitialized github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:251
0.01s 20.00% | sync.(*RWMutex).RUnlock GOROOT/src/sync/rwmutex.go:81
----------------------------------------------------------+-------------
```
This commit makes the method cheaper by replacing the lock with an atomic read. Once a replica is initialized, it stays initialized, so this state is well suited to be stored in an atomic value.
At a minimum, this offsets the new rlock acquired by #78980.
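The pattern can be sketched like this (field names are illustrative, not the exact kvserver ones): because "initialized" is monotonic, flipping once from false to true and never back, a single atomic load can safely replace taking the read lock on every call:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Replica is a simplified stand-in; the real struct guards far more state
// under its mutex.
type Replica struct {
	mu          sync.RWMutex
	initialized int32 // accessed atomically; set to 1 once initialized
}

// setInitialized runs under the write lock that also updates the rest of
// the replica's state; the atomic store publishes the flag to lock-free
// readers.
func (r *Replica) setInitialized() {
	r.mu.Lock()
	defer r.mu.Unlock()
	atomic.StoreInt32(&r.initialized, 1)
}

// IsInitialized avoids acquiring the RLock entirely. This is safe only
// because the flag is monotonic: once true, it never becomes false again.
func (r *Replica) IsInitialized() bool {
	return atomic.LoadInt32(&r.initialized) == 1
}

func main() {
	r := &Replica{}
	fmt.Println("before:", r.IsInitialized())
	r.setInitialized()
	fmt.Println("after:", r.IsInitialized())
}
```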
84186: sql/schemachanger: detect unsupported types for unique indexes r=fqazi a=fqazi
Fixes: #84185
Previously, when adding a column with a unique index, the declarative
schema changer did not generate the appropriate error when an
unsupported type was encountered. As a result, we would wait until
execution and require a rollback during the execution phase. To address
this, this patch reports the unsupported type directly within the
builder.
Release note: None
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>