
kvserver: check GC threshold after acquiring a storage snapshot#78980

Merged
craig[bot] merged 2 commits into cockroachdb:master from aayushshah15:20220328_mvccGCTest
Jun 30, 2022

Conversation

Contributor

@aayushshah15 aayushshah15 commented Mar 29, 2022

Previously, we would check if a given read could proceed with execution
before we acquired a snapshot of the storage engine state. In particular, we
would check the replica's in-memory GC threshold before the state of the engine
had been pinned.

This meant that the following scenario was possible:

  1. Request comes in, checks the replica's in-memory GC threshold and determines
    that it is safe to proceed.
  2. A pair of GC requests bump the GC threshold and garbage-collect the expired data.
  3. The read acquires a snapshot of the storage engine state.
  4. The read continues with its execution and sees an incorrect result.

This commit makes it such that we now check the GC threshold after we acquire a
snapshot of the storage engine state.

NB: this commit only fixes our current model of issuing GCRequests
-- which is that we first issue a GCRequest that simply bumps the GC threshold
and then issue another GCRequest that actually performs the garbage collection.
If a single GCRequest were to do both of these things, we'd still have an issue
with reads potentially seeing incorrect results since, currently, the in-memory
GC threshold is bumped as a "post-apply" side effect that takes effect after
the expired data has already been garbage collected. This will be handled in a
future patch.

Release note: none

Relates to #55293.

@aayushshah15 aayushshah15 requested review from a team as code owners March 29, 2022 18:26
@cockroach-teamcity
Member

This change is Reviewable

@aayushshah15 aayushshah15 changed the title 20220328 mvccGCTest kvserver: check GC threshold after acquiring a storage snapshot Mar 29, 2022
@aayushshah15 aayushshah15 requested review from arulajmani and nvb and removed request for a team March 29, 2022 18:27
@aayushshah15 aayushshah15 force-pushed the 20220328_mvccGCTest branch from 1bdebb2 to 60906cb Compare May 3, 2022 19:19
@aayushshah15 aayushshah15 force-pushed the 20220328_mvccGCTest branch from 60906cb to ff9b412 Compare June 8, 2022 18:19
@aayushshah15
Contributor Author

In the last discussion, @nvanbenschoten pointed out that a problem with this patch is that it moves the lease check to after the pebble snapshot is captured.

This is because the in-memory side effect of lease changes is applied in the "post-apply" stage. Consider the following scenario:

  1. A reader comes in and acquires a pebble iterator (which now happens before the lease check).
  2. The lease gets transferred away to some other node, which accepts a few writes.
  3. The lease gets transferred back to the reader's node, so the lease check passes for the reader.
  4. Boom -- the reader serves results off of a stale pebble snapshot.

@aayushshah15
Contributor Author

aayushshah15 commented Jun 21, 2022

Next steps:

This means that we need to split checkExecutionCanProceed into two steps -- one that happens before the storage engine snapshot is captured, and another that happens after. The lease check needs to happen before the storage engine snapshot is captured, because of the hazard described above, while the GC threshold needs to be checked after the storage snapshot is captured, which was the original intent of this PR.

This also means that we will now be splitting the single Replica.mu acquisition (which currently happens just once in checkExecutionCanProceed()) into two. This warrants some validation to ensure that we're not regressing on performance.

@aayushshah15 aayushshah15 force-pushed the 20220328_mvccGCTest branch 3 times, most recently from bcad0d1 to dabe257 Compare June 22, 2022 04:18
@aayushshah15
Contributor Author

I ran kv95 with the following repro steps:

```
rp destroy $CLUSTER
rp create $CLUSTER --nodes=6 --lifetime=3h --gce-machine-type=n1-standard-32
for i in {1..5}
do
        rp wipe $CLUSTER
        rp stage $CLUSTER cockroach
        rp start $CLUSTER:1-5
        mkdir -p ../benchlogs/kv95_${DATE}_$CLUSTER
        rp run $CLUSTER:6 -- './cockroach workload run kv --init --tolerate-errors --concurrency=192 --splits=1000 --ramp=1m --duration=5m --read-percent=95 --min-block-bytes=4096 --max-block-bytes=4096 {pgurl:1-5}' 2>&1 | tee ../benchlogs/kv95_${DATE}_$CLUSTER/${i}_kv.log
done
```

I compared master with the latest revision of this patch and saw the following:

```
benchstat ../benchlogs/master ../benchlogs/acquire_replica_mu_twice
name  old ops/sec  new ops/sec  delta
.       109k ± 4%    108k ± 1%   ~     (p=0.841 n=5+5)

name  old p50      new p50      delta
.       1.20 ± 0%    1.20 ± 0%   ~     (all equal)

name  old p95      new p95      delta
.       5.80 ± 0%    5.80 ± 0%   ~     (all equal)

name  old p99      new p99      delta
.       11.5 ± 0%    11.5 ± 0%   ~     (all equal)
```

@aayushshah15 aayushshah15 force-pushed the 20220328_mvccGCTest branch from dabe257 to 246eda9 Compare June 22, 2022 16:08
@aayushshah15
Contributor Author

@nvanbenschoten this is ready for a look.

Contributor

@nvb nvb left a comment


LGTM

However, we spoke in person about using more iterations (maybe 25) to confirm that there is no statistically significant impact on kv95.

I'm also curious about the --min-block-bytes=4096 --max-block-bytes=4096. If you haven't run the tests again, consider using the default block size to avoid making these operations too "heavy" and drowning out the potential for lock contention.

Also, consider --gce-machine-type=n2-standard-32 --gce-min-cpu-platform='Intel Ice Lake' to run on a single NUMA node.

Reviewed 4 of 7 files at r1, 4 of 4 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)

@aayushshah15
Contributor Author

I spun up a few runs right after our meeting (same repro steps as listed in #78980 (comment), but with 25 iterations). Here are the results:

```
benchstat ../benchlogs/master ../benchlogs/split_mutex_acq
name  old ops/sec  new ops/sec  delta
.       110k ± 8%    113k ± 6%  +2.81%  (p=0.025 n=25+24)

name  old p50      new p50      delta
.       1.20 ± 0%    1.10 ± 0%  -8.33%  (p=0.000 n=19+24)

name  old p95      new p95      delta
.       5.54 ±10%    5.65 ± 6%    ~     (p=0.225 n=25+24)

name  old p99      new p99      delta
.       11.4 ± 8%    11.0 ± 5%  -3.55%  (p=0.002 n=25+23)
```

I'm spinning up re-runs of these with Ice Lake and without the block size constraints now.

@aayushshah15
Contributor Author

> I'm spinning up re-runs of these with Ice Lake and without the block size constraints now.

Here are these numbers:

```
benchstat ../benchlogs/master ../benchlogs/split_mutexes_acq
name  old ops/sec  new ops/sec  delta
.       228k ± 0%    232k ± 1%  +1.91%  (p=0.000 n=23+24)

name  old p50      new p50      delta
.       0.70 ± 0%    0.70 ± 0%    ~     (all equal)

name  old p95      new p95      delta
.       2.16 ± 3%    2.20 ± 0%  +1.85%  (p=0.001 n=25+25)

name  old p99      new p99      delta
.       5.50 ± 0%    5.50 ± 0%    ~     (all equal)
```

What do you think @nvanbenschoten?

@nvb
Contributor

nvb commented Jun 28, 2022

Am I reading this correctly that we're seeing a speedup with this change? If so, isn't that surprising?

Previously, we would check if a given read could proceed with execution
_before_ we acquired a snapshot of the storage engine state. In particular, we
would check the replica's in-memory GC threshold before the state of the engine
had been pinned.

This meant that the following scenario was possible:
1. Request comes in, checks the replica's in-memory GC threshold and determines
that it is safe to proceed.
2. A pair of GC requests bump the GC threshold and garbage-collect the expired data.
3. The read acquires a snapshot of the storage engine state.
4. The read continues with its execution and sees an incorrect result.

This commit makes it such that we now check the GC threshold after we acquire a
snapshot of the storage engine state. It does so by lifting the GC
threshold check out of `checkExecutionCanProceed()` and splitting up the
single critical section under `Replica.mu` into two.

NB: this commit only fixes our current model of issuing `GCRequest`s
-- which is that we first issue a GCRequest that simply bumps the GC threshold
and then issue another GCRequest that actually performs the garbage collection.
If a single GCRequest were to do both of these things, we'd still have an issue
with reads potentially seeing incorrect results since, currently, the in-memory
GC threshold is bumped as a "post-apply" side effect that takes effect after
the expired data has already been garbage collected. This will be handled in a
future patch.

Release note: none
…ldRacesWithRead`

This commit unskips a subset of `TestGCThresholdRacesWithRead`, which is now
possible because of cockroachdb#76312 and the first commit in this patch. See
cockroachdb#76410 (comment)

Relates to cockroachdb#55293.

Release note: none
@aayushshah15 aayushshah15 force-pushed the 20220328_mvccGCTest branch from 246eda9 to efa68b9 Compare June 28, 2022 21:25
@aayushshah15
Contributor Author

> If so, isn't that surprising?

I was wondering if it was because of the changes mentioned in https://cockroachlabs.slack.com/archives/C021WMUML04/p1655549375961439. I rebased onto 5541cf8 and re-ran both sets of numbers:

```
# master -> 5541cf8cc701fee4f9c90de01b5bda826d8d3c24
# split_mutexes_acq -> 5541cf8cc701fee4f9c90de01b5bda826d8d3c24 + this PR

benchstat ../benchlogs/master ../benchlogs/split_mutexes_acq
name  old ops/sec  new ops/sec  delta
.       228k ± 0%    227k ± 1%  -0.18%  (p=0.030 n=25+25)

name  old p50      new p50      delta
.       0.70 ± 0%    0.70 ± 0%    ~     (all equal)

name  old p95      new p95      delta
.       2.17 ± 3%    2.16 ± 3%    ~     (p=0.551 n=25+25)

name  old p99      new p99      delta
.       5.50 ± 0%    5.50 ± 0%    ~     (all equal)
```

@nvb
Contributor

nvb commented Jun 30, 2022

Thanks for re-running. That's a small enough performance delta to be justified by the improvement here. We may also be able to offset it if we can remove readOnlyCmdMu.

@aayushshah15
Contributor Author

Thanks for taking a look!

bors r+

@craig
Contributor

craig bot commented Jun 30, 2022

Build succeeded.

@craig craig bot merged commit b630c08 into cockroachdb:master Jun 30, 2022
craig bot pushed a commit that referenced this pull request Jul 12, 2022
84106: kv: avoid heap alloc of RangeAppliedState per Raft apply batch r=erikgrinaker a=nvanbenschoten

This commit hangs a `RangeAppliedState` struct off of `replicaAppBatch`
so that a Raft application batch can avoid a heap allocation during its
call to `addAppliedStateKeyToBatch`.

```
name                        old time/op    new time/op    delta
KV/Insert/Native/rows=1-10    46.3µs ± 7%    46.3µs ± 4%    ~     (p=0.553 n=19+18)
KV/Insert/SQL/rows=1-10        141µs ±18%     133µs ± 7%    ~     (p=0.075 n=19+18)

name                        old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1-10    15.6kB ± 2%    15.4kB ± 1%  -1.46%  (p=0.000 n=20+20)
KV/Insert/SQL/rows=1-10       43.0kB ± 0%    42.9kB ± 0%  -0.43%  (p=0.000 n=19+19)

name                        old allocs/op  new allocs/op  delta
KV/Insert/Native/rows=1-10       131 ± 0%       130 ± 0%  -0.76%  (p=0.000 n=19+20)
KV/Insert/SQL/rows=1-10          350 ± 0%       349 ± 0%  -0.29%  (p=0.000 n=16+19)
```

84108: kv/batcheval: replace command hash-map with array r=nvanbenschoten a=nvanbenschoten

The per-request hash-map lookup showed up in CPU profiles while running TPC-E:
```
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                                0.03s 60.00% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).collectSpans github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:1093 (inline)
                                                0.02s 40.00% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:483 (inline)
            0     0%     0%      0.05s 0.029%                | github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.LookupCommand github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/command.go:116
                                                0.02s 40.00% |   runtime.mapaccess2_fast64 GOROOT/src/runtime/map_fast64.go:57
                                                0.02s 40.00% |   runtime.mapaccess2_fast64 GOROOT/src/runtime/map_fast64.go:84
                                                0.01s 20.00% |   runtime.mapaccess2_fast64 GOROOT/src/runtime/map_fast64.go:69
----------------------------------------------------------+-------------
```

It's easy enough to get rid of by replacing the hash-map with an array, like we do in other parts of the code.

84110: kv: use atomic bool to avoid read lock in Replica.IsInitialized r=knz,erikgrinaker a=nvanbenschoten

According to a CPU profile acquired while running TPC-E, the read lock acquired in `Replica.IsInitialized` is the second most expensive read lock in terms of CPU cycles consumed. This doesn't mean that it's the most expensive in terms of blocking, but it still indicates that the lock is frequently acquired.

```
----------------------------------------------------------+-------------
                                             0.29s 43.28% |   github.com/cockroachdb/pebble/internal/cache.(*shard).Get github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:117 (inline)
                                             0.05s  7.46% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).IsInitialized github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:249 (inline)
...
     0.67s  0.39% 24.00%      0.67s  0.39%                | sync.(*RWMutex).RLock GOROOT/src/sync/rwmutex.go:61
----------------------------------------------------------+-------------
```

The two main callers of this method are `Store.Send` (on each request) and `Replica.handleRaftReadyRaftMuLocked` (on each raft ready iteration):

```
----------------------------------------------------------+-------------
                                             0.04s 80.00% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:176
                                             0.01s 20.00% |   github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:829
     0.04s 0.023% 0.023%      0.05s 0.029%                | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).IsInitialized github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go:251
                                             0.01s 20.00% |   sync.(*RWMutex).RUnlock GOROOT/src/sync/rwmutex.go:81
----------------------------------------------------------+-------------
```

This commit makes the method cheaper by replacing the lock with an atomic read. Once a replica is initialized, it stays initialized, so this state is well suited to be stored in an atomic value.

At a minimum, this offsets the new rlock acquired by #78980.

84186: sql/schemachanger: detect unsupported types for unique indexes r=fqazi a=fqazi

Fixes: #84185

Previously, the declarative schema changer when adding
a column with a unique index did not generate the
appropriate error when an unsupported type was encountered.
As a result, we would wait till execution and require a
rollback during execution phase. To address this, this patch
will report the unsupported type directly within the builder.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>