
kvserver: check GC threshold after acquiring a storage snapshot#78980

Merged
craig[bot] merged 2 commits into cockroachdb:master from aayushshah15:20220328_mvccGCTest
Jun 30, 2022

Conversation

Contributor

@aayushshah15 aayushshah15 commented Mar 29, 2022

Previously, we would check if a given read could proceed with execution
before we acquired a snapshot of the storage engine state. In particular, we
would check the replica's in-memory GC threshold before the state of the engine
had been pinned.

This meant that the following scenario was possible:

  1. Request comes in, checks the replica's in-memory GC threshold and determines
    that it is safe to proceed.
  2. A pair of GC requests bump the GC threshold and garbage-collect the expired data.
  3. The read acquires a snapshot of the storage engine state.
  4. The read continues with its execution and sees an incorrect result.

This commit makes it such that we now check the GC threshold after we acquire a
snapshot of the storage engine state.

NB: this commit only fixes our current model of issuing GCRequests
-- which is that we first issue a GCRequest that simply bumps the GC threshold
and then issue another GCRequest that actually performs the garbage collection.
If a single GCRequest were to do both of these things, we'd still have an issue
with reads potentially seeing incorrect results since, currently, the in-memory
GC threshold is bumped as a "post-apply" side effect that takes effect after
the expired data has already been garbage collected. This will be handled in a
future patch.

Release note: none

Relates to #55293.

@aayushshah15 aayushshah15 requested review from a team as code owners March 29, 2022 18:26
@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15 aayushshah15 changed the title 20220328 mvccGCTest kvserver: check GC threshold after acquiring a storage snapshot Mar 29, 2022
@aayushshah15 aayushshah15 requested review from arulajmani and nvb and removed request for a team March 29, 2022 18:27
@aayushshah15 aayushshah15 force-pushed the 20220328_mvccGCTest branch from 1bdebb2 to 60906cb Compare May 3, 2022 19:19
@aayushshah15 aayushshah15 force-pushed the 20220328_mvccGCTest branch from 60906cb to ff9b412 Compare June 8, 2022 18:19
@aayushshah15
Contributor Author

In the last discussion, @nvanbenschoten pointed out that a problem with this patch is that it moves the lease check to after the pebble snapshot is captured.

This is because the in-memory side effect of lease changes is applied in the "post-apply" stage. Consider the following scenario:

  1. A reader comes in and acquires a pebble iterator (which now happens before the lease check).
  2. The lease gets transferred away to some other node, which accepts a few writes.
  3. The lease gets transferred back to the reader's node, so the lease check passes for the reader.
  4. Boom -- the reader serves results off of a stale pebble snapshot.

@aayushshah15
Contributor Author

aayushshah15 commented Jun 21, 2022

Next steps:

This means that we need to split checkExecutionCanProceed into two steps -- one that happens before the storage engine snapshot is captured, and another that happens after. The lease check needs to happen before the storage engine snapshot is captured, because of the hazard described above, while the GC threshold needs to be checked after the storage snapshot is captured, which was the original intent of this PR.

This also means that we will now be splitting the single Replica.mu acquisition (which currently happens just once in checkExecutionCanProceed()) into two. This warrants some validation to ensure that we're not regressing on performance.

@aayushshah15 aayushshah15 force-pushed the 20220328_mvccGCTest branch 3 times, most recently from bcad0d1 to dabe257 Compare June 22, 2022 04:18
@aayushshah15
Contributor Author

I ran kv95 with the following repro steps:

```
rp destroy $CLUSTER
rp create $CLUSTER --nodes=6 --lifetime=3h --gce-machine-type=n1-standard-32
for i in {1..5}
do
        rp wipe $CLUSTER
        rp stage $CLUSTER cockroach
        rp start $CLUSTER:1-5
        mkdir -p ../benchlogs/kv95_${DATE}_$CLUSTER
        rp run $CLUSTER:6 -- './cockroach workload run kv --init --tolerate-errors --concurrency=192 --splits=1000 --ramp=1m --duration=5m --read-percent=95 --min-block-bytes=4096 --max-block-bytes=4096 {pgurl:1-5}' 2>&1 | tee ../benchlogs/kv95_${DATE}_$CLUSTER/${i}_kv.log
done
```

I compared master with the latest revision of this patch and saw the following:

```
benchstat ../benchlogs/master ../benchlogs/acquire_replica_mu_twice
name  old ops/sec  new ops/sec  delta
.       109k ± 4%    108k ± 1%   ~     (p=0.841 n=5+5)

name  old p50      new p50      delta
.       1.20 ± 0%    1.20 ± 0%   ~     (all equal)

name  old p95      new p95      delta
.       5.80 ± 0%    5.80 ± 0%   ~     (all equal)

name  old p99      new p99      delta
.       11.5 ± 0%    11.5 ± 0%   ~     (all equal)
```

@aayushshah15 aayushshah15 force-pushed the 20220328_mvccGCTest branch from dabe257 to 246eda9 Compare June 22, 2022 16:08
@aayushshah15
Contributor Author

@nvanbenschoten this is ready for a look.

Contributor

@nvb nvb left a comment


LGTM

However, we spoke in person about using more iterations (maybe 25) to confirm that there is no statistically significant impact on kv95.

I'm also curious about the --min-block-bytes=4096 --max-block-bytes=4096. If you haven't run the tests again, consider using the default block size to avoid making these operations too "heavy" and drowning out the potential for lock contention.

Also, consider --gce-machine-type=n2-standard-32 --gce-min-cpu-platform='Intel Ice Lake' to run on a single NUMA node.

Reviewed 4 of 7 files at r1, 4 of 4 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)

@aayushshah15
Contributor Author

I spun up a few runs right after our meeting (same repro steps as listed in #78980 (comment), but with 25 iterations). Here are the results:

```
benchstat ../benchlogs/master ../benchlogs/split_mutex_acq
name  old ops/sec  new ops/sec  delta
.       110k ± 8%    113k ± 6%  +2.81%  (p=0.025 n=25+24)

name  old p50      new p50      delta
.       1.20 ± 0%    1.10 ± 0%  -8.33%  (p=0.000 n=19+24)

name  old p95      new p95      delta
.       5.54 ±10%    5.65 ± 6%    ~     (p=0.225 n=25+24)

name  old p99      new p99      delta
.       11.4 ± 8%    11.0 ± 5%  -3.55%  (p=0.002 n=25+23)
```

I'm spinning up re-runs of these with Ice Lake and without the block size constraints now.

@aayushshah15
Contributor Author

> I'm spinning up re-runs of these with Ice Lake and without the block size constraints now.

Here are these numbers:

```
benchstat ../benchlogs/master ../benchlogs/split_mutexes_acq
name  old ops/sec  new ops/sec  delta
.       228k ± 0%    232k ± 1%  +1.91%  (p=0.000 n=23+24)

name  old p50      new p50      delta
.       0.70 ± 0%    0.70 ± 0%    ~     (all equal)

name  old p95      new p95      delta
.       2.16 ± 3%    2.20 ± 0%  +1.85%  (p=0.001 n=25+25)

name  old p99      new p99      delta
.       5.50 ± 0%    5.50 ± 0%    ~     (all equal)
```

What do you think @nvanbenschoten?

@nvb
Contributor

nvb commented Jun 28, 2022

Am I reading this correctly that we're seeing a speedup with this change? If so, isn't that surprising?

Previously, we would check if a given read could proceed with execution
_before_ we acquired a snapshot of the storage engine state. In particular, we
would check the replica's in-memory GC threshold before the state of the engine
had been pinned.

This meant that the following scenario was possible:
1. Request comes in, checks the replica's in-memory GC threshold and determines
that it is safe to proceed.
2. A pair of GC requests bump the GC threshold and garbage-collect the expired data.
3. The read acquires a snapshot of the storage engine state.
4. The read continues with its execution and sees an incorrect result.

This commit makes it such that we now check the GC threshold after we acquire a
snapshot of the storage engine state. It does so by lifting the GC
threshold check out of `checkExecutionCanProceed()` and splitting up the
single critical section under `Replica.mu` into two.

NB: this commit only fixes our current model of issuing `GCRequest`s
-- which is that we first issue a GCRequest that simply bumps the GC threshold
and then issue another GCRequest that actually performs the garbage collection.
If a single GCRequest were to do both of these things, we'd still have an issue
with reads potentially seeing incorrect results since, currently, the in-memory
GC threshold is bumped as a "post-apply" side effect that takes effect after
the expired data has already been garbage collected. This will be handled in a
future patch.

Release note: none
…ldRacesWithRead`

This commit unskips a subset of `TestGCThresholdRacesWithRead`, which is now
possible because of cockroachdb#76312 and the first commit in this patch. See
cockroachdb#76410 (comment)

Relates to cockroachdb#55293.

Release note: none
@aayushshah15 aayushshah15 force-pushed the 20220328_mvccGCTest branch from 246eda9 to efa68b9 Compare June 28, 2022 21:25
@aayushshah15
Contributor Author

> If so, isn't that surprising?

I was wondering if it was because of the changes mentioned in https://cockroachlabs.slack.com/archives/C021WMUML04/p1655549375961439. I rebased onto 5541cf8 and re-ran both sets of numbers:

```
# master -> 5541cf8cc701fee4f9c90de01b5bda826d8d3c24
# split_mutexes_acq -> 5541cf8cc701fee4f9c90de01b5bda826d8d3c24 + this PR

benchstat ../benchlogs/master ../benchlogs/split_mutexes_acq
name  old ops/sec  new ops/sec  delta
.       228k ± 0%    227k ± 1%  -0.18%  (p=0.030 n=25+25)

name  old p50      new p50      delta
.       0.70 ± 0%    0.70 ± 0%    ~     (all equal)

name  old p95      new p95      delta
.       2.17 ± 3%    2.16 ± 3%    ~     (p=0.551 n=25+25)

name  old p99      new p99      delta
.       5.50 ± 0%    5.50 ± 0%    ~     (all equal)
```

@nvb
Contributor

nvb commented Jun 30, 2022

Thanks for re-running. That's a small enough performance delta to be justified by the improvement here. We may also be able to offset it if we can remove readOnlyCmdMu.

@aayushshah15
Contributor Author

Thanks for taking a look!

bors r+

@craig
Contributor

craig bot commented Jun 30, 2022

Build succeeded.

@craig craig bot merged commit b630c08 into cockroachdb:master Jun 30, 2022
craig bot pushed a commit that referenced this pull request Jul 12, 2022
84106: kv: avoid heap alloc of RangeAppliedState per Raft apply batch r=erikgrinaker a=nvanbenschoten

This commit hangs a `RangeAppliedState` struct off of `replicaAppBatch`
so that a Raft application batch can avoid a heap allocation during its
call to `addAppliedStateKeyToBatch`.

```
name                        old time/op    new time/op    delta
KV/Insert/Native/rows=1-10    46.3µs ± 7%    46.3µs ± 4%    ~     (p=0.553 n=19+18)
KV/Insert/SQL/rows=1-10        141µs ±18%     133µs ± 7%    ~     (p=0.075 n=19+18)

name                        old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1-10    15.6kB ± 2%    15.4kB ± 1%  -1.46%  (p=0.000 n=20+20)
KV/Insert/SQL/rows=1-10       43.0kB ± 0%    42.9kB ± 0%  -0.43%  (p=0.000 n=19+19)

name                        old allocs/op  new allocs/op  delta
KV/Insert/Native/rows=1-10       131 ± 0%       130 ± 0%  -0.76%  (p=0.000 n=19+20)
KV/Insert/SQL/rows=1-10          350 ± 0%       349 ± 0%  -0.29%  (p=0.000 n=16+19)
```

84108: kv/batcheval: replace command hash-map with array r=nvanbenschoten a=nvanbenschoten

The per-request hash-map lookup showed up in CPU profiles while running TPC-E:
```
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                                0.03s 60.00% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).collectSpans github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:1093 (inline)
                                                0.02s 40.00% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:483 (inline)
            0     0%     0%      0.05s 0.029%                | github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.LookupCommand github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/command.go:116
                                                0.02s 40.00% |   runtime.mapaccess2_fast64 GOROOT/src/runtime/map_fast64.go:57
                                                0.02s 40.00% |   runtime.mapaccess2_fast64 GOROOT/src/runtime/map_fast64.go:84
                                                0.01s 20.00% |   runtime.mapaccess2_fast64 GOROOT/src/runtime/map_fast64.go:69
----------------------------------------------------------+-------------
```

It's easy enough to get rid of by replacing the hash-map with an array, like we do in other parts of the code.

84110: kv: use atomic bool to avoid read lock in Replica.IsInitialized r=knz,erikgrinaker a=nvanbenschoten

According to a CPU profile acquired while running TPC-E, the read lock acquired in `Replica.IsInitialized` is the second most expensive read lock in terms of CPU cycles consumed. This doesn't mean that it's the most expensive in terms of blocking, but it still indicates that the lock is frequently acquired.

```
----------------------------------------------------------+-------------
                                             0.29s 43.28% |   github.com/cockroachdb/pebble/internal/cache.(*shard).Get github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:117 (inline)
                                             0.05s  7.46% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).IsInitialized github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:249 (inline)
...
     0.67s  0.39% 24.00%      0.67s  0.39%                | sync.(*RWMutex).RLock GOROOT/src/sync/rwmutex.go:61
----------------------------------------------------------+-------------
```

The two main callers of this method are `Store.Send` (on each request) and `Replica.handleRaftReadyRaftMuLocked` (on each raft ready iteration):

```
----------------------------------------------------------+-------------
                                             0.04s 80.00% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:176
                                             0.01s 20.00% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:829
     0.04s 0.023% 0.023%      0.05s 0.029%                | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).IsInitialized github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:251
                                             0.01s 20.00% |   sync.(*RWMutex).RUnlock GOROOT/src/sync/rwmutex.go:81
----------------------------------------------------------+-------------
```

This commit makes the method cheaper by replacing the lock with an atomic read. Once a replica is initialized, it stays initialized, so this state is well suited to be stored in an atomic value.

At a minimum, this offsets the new rlock acquired by #78980.

84186: sql/schemachanger: detect unsupported types for unique indexes r=fqazi a=fqazi

Fixes: #84185

Previously, the declarative schema changer when adding
a column with a unique index did not generate the
appropriate error when an unsupported type was encountered.
As a result, we would wait till execution and require a
rollback during execution phase. To address this, this patch
will report the unsupported type directly within the builder.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>