
kv: disallow GC requests that bump GC threshold and GC expired versions#76410

Merged
craig[bot] merged 1 commit into cockroachdb:master from nvb:nvanbenschoten/gcThreshRace on Feb 17, 2022

Conversation

@nvb
Contributor

@nvb nvb commented Feb 11, 2022

Related to #55293.

This commit adds a safeguard to GC requests that prevents them from
bumping the GC threshold at the same time that they GC individual MVCC
versions. This was found to be unsafe in #55293 because performing both
of these actions at the same time could lead to a race where a read
request is allowed to evaluate without error while also failing to see
MVCC versions that are concurrently GCed.

This race is possible because foreground traffic consults the in-memory
version of the GC threshold (r.mu.state.GCThreshold), which is updated
after (in handleGCThresholdResult), not atomically with, the application
of the GC request's WriteBatch to the LSM (in ApplyToStateMachine). This
allows a read request to see the effect of a GC on MVCC state without seeing
its effect on the in-memory GC threshold.
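The ordering described above can be illustrated with a toy Go model (hypothetical types for illustration only, not CockroachDB's actual code): the GC's effect lands in storage first, and the in-memory threshold is bumped only afterwards, so a reader that checks the threshold inside that window passes the check yet misses the version.

```go
package main

import "fmt"

// Toy model of the race window. The in-memory GC threshold (an analogue of
// r.mu.state.GCThreshold) is bumped only *after* the GC's effect on MVCC state.
type replica struct {
	gcThreshold int64            // in-memory threshold, checked by readers
	storage     map[int64]string // timestamp -> value (stands in for the LSM)
}

// applyGC mirrors the application order: storage mutation first (the
// WriteBatch), threshold bump second (the post-apply side effect).
func (r *replica) applyGC(newThreshold int64) {
	for ts := range r.storage { // step 1: remove expired versions
		if ts <= newThreshold {
			delete(r.storage, ts)
		}
	}
	r.gcThreshold = newThreshold // step 2: too late for a racing reader
}

// checkThreshold is the reader's admission check.
func (r *replica) checkThreshold(ts int64) error {
	if ts <= r.gcThreshold {
		return fmt.Errorf("read at %d below GC threshold %d", ts, r.gcThreshold)
	}
	return nil
}

func main() {
	r := &replica{storage: map[int64]string{5: "old", 20: "new"}}
	// Reader passes the threshold check while the threshold is still 0...
	err := r.checkThreshold(5)
	// ...then GC applies in the window before the reader consults storage.
	r.applyGC(10)
	_, found := r.storage[5]
	// Passed the check, but the version is gone: a silent gap, not an error.
	fmt.Printf("check err=%v, version found=%v\n", err, found)
}
```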

The latches acquired by GC requests look like they will help with this race,
but in practice they do not, for two reasons:

  1. the latches do not protect timestamps below the GC request's batch
    timestamp. This means that they only conflict with concurrent writes,
    but not all concurrent reads.
  2. the read could be served off a follower, which could be applying the
    GC request's effect from the raft log. Latches held on the leaseholder
    would have no impact on a follower read.
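Reason 1 follows from timestamp-aware ("MVCC") latching, sketched below in simplified form. This is an illustration of the conflict rule, not CockroachDB's actual spanlatch implementation.

```go
package main

import "fmt"

// latchesConflict is a simplified sketch of MVCC latch conflicts: reads only
// conflict with writes at or below their timestamp, so a write latch declared
// at a high timestamp does not block reads at lower timestamps.
func latchesConflict(aWrite bool, aTS int64, bWrite bool, bTS int64) bool {
	switch {
	case !aWrite && !bWrite:
		return false // reads never conflict with reads
	case aWrite && bWrite:
		return true // writes always conflict with writes
	}
	readTS, writeTS := aTS, bTS
	if aWrite {
		readTS, writeTS = bTS, aTS
	}
	// A read can only observe writes at or below its timestamp,
	// so only those read/write pairs conflict.
	return readTS >= writeTS
}

func main() {
	const gcBatchTS = 100 // GC declares write latches at its batch timestamp
	fmt.Println(latchesConflict(true, gcBatchTS, false, 50))  // read below batch ts slips through
	fmt.Println(latchesConflict(true, gcBatchTS, false, 150)) // read above batch ts waits
	fmt.Println(latchesConflict(true, gcBatchTS, true, 50))   // concurrent writes always wait
}
```

This is why the GC request's latches exclude concurrent writes but not reads below the batch timestamp, which is exactly the population of requests that can observe the GCed versions.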

Thankfully, the GC queue has split these two steps for the past few
releases, at least since 87e85eb, so we do not have a bug today.

The commit also adds a test that reliably exercises the bug with a few
well-placed calls to time.Sleep. The test contains a variant where the
read is performed on the leaseholder and a variant where it is performed
on a follower. Both fail by default. If we switch the GC request to
acquire non-MVCC latches then the leaseholder variant passes, but the
follower read variant still fails.

@nvb nvb requested a review from aayushshah15 February 11, 2022 00:36
@nvb nvb requested a review from a team as a code owner February 11, 2022 00:36
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@aayushshah15 aayushshah15 left a comment


:lgtm_strong: nice test, thanks for writing it.

I have a general question about the way mvcc GC works and I'm likely missing something fundamental. Why do we need to compute per-key gc timestamps inside processReplicatedKeyRange here?

isNewest := s.curIsNewest()
if isGarbage(threshold, s.cur, s.next, isNewest) {
    keyBytes := int64(s.cur.Key.EncodedSize())
    batchGCKeysBytes += keyBytes
    haveGarbageForThisKey = true
    gcTimestampForThisKey = s.cur.Key.Timestamp
    info.AffectedVersionsKeyBytes += keyBytes
    info.AffectedVersionsValBytes += int64(len(s.cur.Value))
}

Looking at MVCCGarbageCollect, why can't we avoid all this and tell it to GC all keys that are lower than the new GC threshold?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/replica_test.go, line 8684 at r1 (raw file):

						Store: &StoreTestingKnobs{
							// Disable the GC queue so the test is the only one issuing GC
							// requests.

request*

@nvb nvb force-pushed the nvanbenschoten/gcThreshRace branch from 61c2fd6 to eee017b on February 15, 2022 18:40
Contributor Author

@nvb nvb left a comment


TFTR!

bors r+

I have a general question about the way mvcc GC works and I'm likely missing something fundamental. Why do we need to compute per-key gc timestamps inside processReplicatedKeyRange here?

I don't think there's a fundamental reason why we do things this way vs. pushing the determination of all of the versions to GC through the GC request. The only real difference I can think of is that keeping the decision in the MVCC GC queue allows it to batch at a finer granularity and more accurately predict + limit the size of a single GC batch to KeyVersionChunkBytes.
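The batching being described can be sketched as byte-limited chunking. The helper below is hypothetical and for illustration only: KeyVersionChunkBytes is the real constant this mirrors in spirit, but the function is not the queue's actual code.

```go
package main

import "fmt"

// chunkGarbage splits candidate GC keys into batches whose total encoded
// size stays under a byte limit, the way the MVCC GC queue bounds a single
// GC batch to roughly KeyVersionChunkBytes.
func chunkGarbage(keys []string, limitBytes int) [][]string {
	var batches [][]string
	var cur []string
	size := 0
	for _, k := range keys {
		// Close out the current batch if adding this key would exceed the limit.
		if len(cur) > 0 && size+len(k) > limitBytes {
			batches = append(batches, cur)
			cur, size = nil, 0
		}
		cur = append(cur, k)
		size += len(k)
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	keys := []string{"a/1", "a/2", "b/1", "b/2", "c/1"}
	// With a 7-byte limit and 3-byte keys, at most two keys fit per batch.
	fmt.Println(chunkGarbage(keys, 7))
}
```

Keeping the garbage decision in the queue lets it size these chunks accurately before the request is ever proposed; pushing the decision into the GC request would force the limit to be enforced during evaluation instead.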

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/replica_test.go, line 8684 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

request*

I think this should be "requests", no?

craig bot pushed a commit that referenced this pull request Feb 15, 2022
73586: rfc: optimize the draining process with connection_wait and relevant reporting systems r=ZhouXing19 a=ZhouXing19

This is a proposal for optimizing the draining process to be more legible for
customers, and introduce a new step, connection_wait, to the draining process,
which allows the server to early exit when all connections are closed.

Release note: None

76410: kv: disallow GC requests that bump GC threshold and GC expired versions r=nvanbenschoten a=nvanbenschoten

(Commit message identical to the PR description above.)

76417: ccl/sqlproxyccl: add connector component and support for session revival token r=JeffSwenson a=jaylim-crl

Informs #76000.

Previously, all the connection establishment logic is coupled with the handler
function within proxy_handler.go. This makes connecting to a new SQL pod during
connection migration difficult. This commit refactors all of those connection
logic out of the proxy handler into a connector component, as described in the
connection migration RFC. At the same time, we also add support for the session
revival token within this connector component.

Note that the overall behavior of the SQL proxy should be unchanged with this
commit.

Release note: None

76545: cmd/reduce: add -tlp option r=yuzefovich a=yuzefovich

**cmd/reduce: remove stdin option and require -file argument**

We tend to not use the option of passing input SQL via stdin, so this
commit removes it. An additional argument in favor of doing that is that
the follow-up commit will introduce another mode of behavior that
requires `-file` argument to be specified, so it's just cleaner to
always require it now.

Release note: None

**cmd/reduce: add -tlp option**

This commit adds `-tlp` boolean flag that changes the behavior of
`reduce`. It is required that `-file` is specified whenever the `-tlp`
flag is used. The behavior is such that the last two queries (delimited
by empty lines) in the file contain unpartitioned and partitioned queries
that return different results although they are equivalent.

If TLP check is requested, then we remove the last two queries from the
input which we use then to construct a special TLP check query that
results in an error if two removed queries return different results.

We do not just include the TLP check query into the input string because
the reducer would then reduce the check query itself, making the
reduction meaningless.

Release note: None

76598: server: use channel for DisableAutomaticVersionUpgrade r=RaduBerinde a=RaduBerinde

DisableAutomaticVersionUpgrade is an atomic integer which is rechecked
in a retry loop. This is not a very clean mechanism, and can lead to
issues where you're unknowingly dealing with a copy of the knobs and
setting the wrong atomic. The retry loop can also add unnecessary
delays in tests.

This commit changes DisableAutomaticVersionUpgrade from an atomic
integer to a channel. If the channel is set, auto-upgrade waits until
the channel is closed.

Release note: None

Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Jay <jay@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
@craig
Contributor

craig bot commented Feb 15, 2022

Build failed (retrying...):

@nvb
Contributor Author

nvb commented Feb 15, 2022

bors r-

@craig
Contributor

craig bot commented Feb 15, 2022

Canceled.

@nvb
Contributor Author

nvb commented Feb 15, 2022

Under stress:

------- Stdout: -------
=== RUN   TestGCThresholdRacesWithRead/followerRead=true/thresholdFirst=true
    replica_test.go:8797: read won race: header:<>
    replica_test.go:8799:
          Error Trace:  replica_test.go:8799
                              replica_test.go:8813
                              asm_amd64.s:1581
          Error:        Expected value not to be nil.
          Test:         TestGCThresholdRacesWithRead/followerRead=true/thresholdFirst=true
        --- FAIL: TestGCThresholdRacesWithRead/followerRead=true/thresholdFirst=true (1.25s)

I wonder if this found a real bug.

@nvb
Contributor Author

nvb commented Feb 16, 2022

I'm actually able to hit this test failure under stress even in the /followerRead=false/thresholdFirst=true case. So all four variants are flaky.

I dug into what's going wrong, and it appears that GC is broken even when we bump the GC threshold in a separate request from the one that GCs old MVCC versions. The reason is the incorrect latching that we noticed in #55293, combined with the lazy LSM snapshot we capture during request evaluation, discussed in #55461. This can lead to races like the following:

sequenceDiagram
    participant Reader
    participant Replica
    participant GC Requests
    participant MVCC GC Queue
    Reader->>Replica: acquire latches
    Reader->>Replica: checkTSAboveGCThresholdRLocked
    Replica->>Reader: clear to proceed
    MVCC GC Queue->>GC Requests: bump GC threshold
    GC Requests->>Replica: acquire latches
    GC Requests->>Replica: bump GC threshold
    GC Requests->>MVCC GC Queue: success
    MVCC GC Queue->>GC Requests: GC expired MVCC version
    GC Requests->>Replica: acquire latches
    GC Requests->>Replica: GC expired MVCC version
    GC Requests->>MVCC GC Queue: success
    Reader->>Replica: lazily capture LSM snapshot
    Reader->>Reader: evaluate using snapshot, fail to observe version

The fact that the read request and the second GC request were able to both evaluate concurrently is due to the incorrect latching. However, as discussed above, correcting the latching is insufficient to fix this bug for follower reads. We can't rely on latching to solve this problem, because the GC request's latches are only acquired on the leaseholder. The real fix is to acquire the LSM snapshot eagerly (#55461), before checking the GC threshold, to ensure that the in-memory GC threshold we check is always equal to or greater than the corresponding persistent GC threshold present in the LSM snapshot that we operate over. So to fix this bug, we essentially need to address #55293, which promotes that performance issue into a correctness issue.
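The lazy-vs-eager snapshot distinction can be made concrete with a toy Go model (hypothetical types and function names for illustration, not the actual evaluation code). With the lazy ordering, a GC that lands between the threshold check and the snapshot produces a silent gap; with the eager ordering, the threshold the reader checks can only lag the snapshot it reads from, so the same race produces a loud threshold error instead.

```go
package main

import "fmt"

// Toy engine: versions keyed by timestamp, plus an in-memory GC threshold
// that is bumped only after the data mutation (as in the real post-apply
// side effect).
type engine struct {
	data      map[int64]string
	threshold int64
}

func (e *engine) applyGC(newThreshold int64) {
	for ts := range e.data {
		if ts <= newThreshold {
			delete(e.data, ts)
		}
	}
	e.threshold = newThreshold
}

func (e *engine) snapshot() map[int64]string {
	snap := make(map[int64]string, len(e.data))
	for ts, v := range e.data {
		snap[ts] = v
	}
	return snap
}

// readLazy mirrors the buggy ordering: threshold check first, snapshot
// later. raceHook marks the window in which GC can apply.
func readLazy(e *engine, ts int64, raceHook func()) (string, bool, error) {
	if ts <= e.threshold {
		return "", false, fmt.Errorf("below GC threshold")
	}
	raceHook() // GC applies here: versions vanish, then the threshold bumps
	v, ok := e.snapshot()[ts]
	return v, ok, nil // ok=false with err=nil: a silent gap
}

// readEager mirrors the fix: capture the snapshot first, check the
// threshold second. A GC that races in can only make the check fail,
// never make the snapshot miss versions the check allowed.
func readEager(e *engine, ts int64, raceHook func()) (string, bool, error) {
	snap := e.snapshot()
	raceHook()
	if ts <= e.threshold {
		return "", false, fmt.Errorf("below GC threshold")
	}
	v, ok := snap[ts]
	return v, ok, nil
}

func main() {
	e1 := &engine{data: map[int64]string{5: "old"}}
	_, ok, err := readLazy(e1, 5, func() { e1.applyGC(10) })
	fmt.Printf("lazy:  ok=%v err=%v\n", ok, err) // silent miss

	e2 := &engine{data: map[int64]string{5: "old"}}
	_, ok, err = readEager(e2, 5, func() { e2.applyGC(10) })
	fmt.Printf("eager: ok=%v err=%v\n", ok, err) // rejected loudly
}
```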

Note that this would fix the thresholdFirst=true variants, but it would still not fix the thresholdFirst=false variants. To fix those cases, we need to move the call to handleGCThresholdResult from after LSM application (a "post-apply" side effect) to before LSM application (a "pre-apply" side effect). We also need to make sure we get this right during Raft snapshot application. For now, it seems safer to continue with the 2-phase protocol.

@aayushshah15 let me know whether this all makes sense to you. If so, I think I'll just merge this PR with the entire test skipped for now, with a reference to #55293. We can then set one of the goals of #55293 to be fixing and unskipping the thresholdFirst=true variants of this test.

(Force-pushed commit message identical to the PR description above.)
@nvb
Contributor Author

nvb commented Feb 16, 2022

bors r+

@craig
Contributor

craig bot commented Feb 17, 2022

Build succeeded:

@craig craig bot merged commit 8487632 into cockroachdb:master Feb 17, 2022
@nvb nvb deleted the nvanbenschoten/gcThreshRace branch March 12, 2022 23:59
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jun 22, 2022
…ldRacesWithRead`

This commit unskips a subset of `TestGCThresholdRacesWithRead`, which is now
possible because of cockroachdb#76312 and the first commit in this patch. See
cockroachdb#76410 (comment)

Relates to cockroachdb#55293.

Release note: none
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jun 28, 2022
…ldRacesWithRead`

This commit unskips a subset of `TestGCThresholdRacesWithRead`, which is now
possible because of cockroachdb#76312 and the first commit in this patch. See
cockroachdb#76410 (comment)

Relates to cockroachdb#55293.

Release note: none