kv: disallow GC requests that bump GC threshold and GC expired versions#76410
Conversation
nice test, thanks for writing it.
I have a general question about the way mvcc GC works and I'm likely missing something fundamental. Why do we need to compute per-key gc timestamps inside processReplicatedKeyRange here?
cockroach/pkg/kv/kvserver/gc/gc.go
Lines 371 to 379 in 299abc8
Looking at MVCCGarbageCollect, why can't we avoid all this and tell it to GC all keys that are lower than the new GC threshold?
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)
pkg/kv/kvserver/replica_test.go, line 8684 at r1 (raw file):
```go
Store: &StoreTestingKnobs{
    // Disable the GC queue so the test is the only one issuing GC
    // requests.
```
request*
Force-pushed 61c2fd6 to eee017b
nvb left a comment
TFTR!
bors r+
I have a general question about the way mvcc GC works and I'm likely missing something fundamental. Why do we need to compute per-key gc timestamps inside processReplicatedKeyRange here?
I don't think there's a fundamental reason why we do things this way vs. pushing the determination of all of the versions to GC through the GC request. The only real difference I can think of is that keeping the decision in the MVCC GC queue allows it to batch at a finer granularity and more accurately predict + limit the size of a single GC batch to KeyVersionChunkBytes.
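The batching point above can be illustrated with a small, hypothetical sketch. The `version` type and `chunkVersions` helper are invented for illustration and are not the real `gc.go` code; the real queue caps a single GC batch at `KeyVersionChunkBytes` in a similar spirit:

```go
package main

import "fmt"

// version is a simplified stand-in for an expired MVCC version that the
// GC queue has decided to collect: a key plus its encoded size in bytes.
type version struct {
	key  string
	size int
}

// chunkVersions groups expired versions into batches whose cumulative
// size stays at or under limit, mirroring how the MVCC GC queue can
// predict and limit the size of a single GC batch. Names here are
// illustrative, not the real gc.go identifiers.
func chunkVersions(versions []version, limit int) [][]version {
	var batches [][]version
	var cur []version
	curSize := 0
	for _, v := range versions {
		// Start a new batch if adding v would exceed the limit.
		if len(cur) > 0 && curSize+v.size > limit {
			batches = append(batches, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, v)
		curSize += v.size
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	vs := []version{{"a", 40}, {"b", 30}, {"c", 50}, {"d", 20}}
	for i, b := range chunkVersions(vs, 64) {
		fmt.Printf("batch %d: %d versions\n", i, len(b))
	}
}
```

Pushing the version determination into the GC request itself would lose this fine-grained control over batch sizes, which is the trade-off nvb describes.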
pkg/kv/kvserver/replica_test.go, line 8684 at r1 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
request*
I think this should be "requests", no?
73586: rfc: optimize the draining process with connection_wait and relevant reporting systems r=ZhouXing19 a=ZhouXing19

This is a proposal for optimizing the draining process to be more legible for customers, and to introduce a new step, connection_wait, to the draining process, which allows the server to exit early once all connections are closed.

Release note: None

76410: kv: disallow GC requests that bump GC threshold and GC expired versions r=nvanbenschoten a=nvanbenschoten

Related to #55293.

This commit adds a safeguard to GC requests that prevents them from bumping the GC threshold at the same time that they GC individual MVCC versions. This was found to be unsafe in #55293 because performing both of these actions at the same time could lead to a race where a read request is allowed to evaluate without error while also failing to see MVCC versions that are concurrently GCed.

This race is possible because foreground traffic consults the in-memory version of the GC threshold (`r.mu.state.GCThreshold`), which is updated after (in `handleGCThresholdResult`), not atomically with, the application of the GC request's WriteBatch to the LSM (in `ApplyToStateMachine`). This allows a read request to see the effect of a GC on MVCC state without seeing its effect on the in-memory GC threshold.

The latches acquired by GC requests look like they will help with this race, but in practice they do not, for two reasons:
1. The latches do not protect timestamps below the GC request's batch timestamp. This means that they only conflict with concurrent writes, but not all concurrent reads.
2. The read could be served off a follower, which could be applying the GC request's effect from the raft log. Latches held on the leaseholder would have no impact on a follower read.

Thankfully, the GC queue has split these two steps for the past few releases, at least since 87e85eb, so we do not have a bug today.

The commit also adds a test that reliably exercises the bug with a few well-placed calls to `time.Sleep`. The test contains a variant where the read is performed on the leaseholder and a variant where it is performed on a follower. Both fail by default. If we switch the GC request to acquire non-MVCC latches then the leaseholder variant passes, but the follower read variant still fails.

76417: ccl/sqlproxyccl: add connector component and support for session revival token r=JeffSwenson a=jaylim-crl

Informs #76000. Previously, all the connection establishment logic was coupled with the handler function within proxy_handler.go. This made connecting to a new SQL pod during connection migration difficult. This commit refactors all of that connection logic out of the proxy handler into a connector component, as described in the connection migration RFC. At the same time, we also add support for the session revival token within this connector component. Note that the overall behavior of the SQL proxy is unchanged with this commit.

Release note: None

76545: cmd/reduce: add -tlp option r=yuzefovich a=yuzefovich

**cmd/reduce: remove stdin option and require -file argument**

We tend not to use the option of passing input SQL via stdin, so this commit removes it. An additional argument in favor of doing that is that the follow-up commit will introduce another mode of behavior that requires the `-file` argument to be specified, so it's just cleaner to always require it now.

Release note: None

**cmd/reduce: add -tlp option**

This commit adds a `-tlp` boolean flag that changes the behavior of `reduce`. It is required that `-file` is specified whenever the `-tlp` flag is used. The behavior is such that the last two queries (delimited by empty lines) in the file contain unpartitioned and partitioned queries that return different results although they are equivalent.

If the TLP check is requested, then we remove the last two queries from the input, which we then use to construct a special TLP check query that results in an error if the two removed queries return different results. We do not just include the TLP check query in the input string because the reducer would then reduce the check query itself, making the reduction meaningless.

Release note: None

76598: server: use channel for DisableAutomaticVersionUpgrade r=RaduBerinde a=RaduBerinde

DisableAutomaticVersionUpgrade is an atomic integer which is rechecked in a retry loop. This is not a very clean mechanism, and can lead to issues where you're unknowingly dealing with a copy of the knobs and setting the wrong atomic. The retry loop can also add unnecessary delays in tests. This commit changes DisableAutomaticVersionUpgrade from an atomic integer to a channel. If the channel is set, auto-upgrade waits until the channel is closed.

Release note: None

Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Jay <jay@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
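The DisableAutomaticVersionUpgrade change in 76598 follows a common Go pattern. Here is a hedged sketch, with invented names rather than the actual server knobs, of how a channel-valued testing knob replaces an atomic integer polled in a retry loop:

```go
package main

import (
	"fmt"
	"time"
)

// TestingKnobs sketches the pattern from 76598: instead of an atomic
// integer rechecked in a retry loop, the knob is a channel. A nil
// channel means "not disabled"; a non-nil channel parks auto-upgrade
// until the test closes it. Names are illustrative only.
type TestingKnobs struct {
	DisableAutomaticVersionUpgrade chan struct{}
}

// maybeUpgrade waits on the knob channel (if set) before upgrading,
// so there is no polling and no copied-knobs footgun.
func maybeUpgrade(knobs TestingKnobs, upgrade func()) {
	if ch := knobs.DisableAutomaticVersionUpgrade; ch != nil {
		<-ch // blocks until the channel is closed
	}
	upgrade()
}

func main() {
	ch := make(chan struct{})
	knobs := TestingKnobs{DisableAutomaticVersionUpgrade: ch}
	done := make(chan struct{})
	go maybeUpgrade(knobs, func() {
		fmt.Println("upgraded")
		close(done)
	})
	time.Sleep(10 * time.Millisecond) // upgrade is parked on the knob
	close(ch)                         // re-enable auto-upgrade
	<-done
}
```

Closing a channel releases every waiter at once, which avoids both the retry-loop delay and the wrong-atomic issue the commit message describes.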
Build failed (retrying...):

bors r-

Canceled.

Under stress: I wonder if this found a real bug.
I'm actually able to hit this test failure under stress. I dug into what's going wrong, and it appears that GC is broken even when we bump the GC threshold in a separate request from the one that GCs old MVCC versions. The reason is the incorrect latching that we noticed in #55293, combined with the lazy LSM snapshot capture during request evaluation discussed in #55461. This can lead to races like the following:

```mermaid
sequenceDiagram
    participant Reader
    participant Replica
    participant GC Requests
    participant MVCC GC Queue
    Reader->>Replica: acquire latches
    Reader->>Replica: checkTSAboveGCThresholdRLocked
    Replica->>Reader: clear to proceed
    MVCC GC Queue->>GC Requests: bump GC threshold
    GC Requests->>Replica: acquire latches
    GC Requests->>Replica: bump GC threshold
    GC Requests->>MVCC GC Queue: success
    MVCC GC Queue->>GC Requests: GC expired MVCC version
    GC Requests->>Replica: acquire latches
    GC Requests->>Replica: GC expired MVCC version
    GC Requests->>MVCC GC Queue: success
    Reader->>Replica: lazily capture LSM snapshot
    Reader->>Reader: evaluate using snapshot, fail to observe version
```
The fact that the read request and the second GC request were both able to evaluate concurrently is due to the incorrect latching. However, as discussed above, correcting the latching is insufficient to fix this bug for follower reads. We can't rely on latching to solve this problem, because the GC request's latches are only acquired on the leaseholder.

The real fix is to acquire the LSM snapshot eagerly (#55461), before checking the GC threshold, to ensure that the in-memory GC threshold we check is always equal to or greater than the corresponding persistent GC threshold present in the LSM snapshot that we operate over. So to fix this bug, we essentially need to address #55293, which promotes that performance issue into a correctness issue. Note that this would fix the `TestGCThresholdRacesWithRead` failures.

@aayushshah15 let me know whether this all makes sense to you. If so, I think I'll just merge this PR with the entire test skipped for now, with a reference to #55293. We can then set one of the goals of #55293 to be fixing and unskipping the `TestGCThresholdRacesWithRead` test.
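To make the ordering concrete, here is a minimal, hypothetical Go sketch of the eager-snapshot fix described above. It is not the actual kvserver code; `replica`, `readEager`, and the integer timestamps are all invented for illustration. The point is the ordering: capturing the storage snapshot before consulting the in-memory threshold guarantees the snapshot is never newer than the threshold the check observed.

```go
package main

import (
	"fmt"
	"sync"
)

// replica sketches the two views of GC state discussed above: the
// "persistent" data that a storage snapshot captures, and the in-memory
// GC threshold consulted by reads. Names are illustrative, not the
// real kvserver fields.
type replica struct {
	mu          sync.Mutex
	data        map[string]bool // version -> still present in the "LSM"
	gcThreshold int             // in-memory threshold, bumped after apply
}

// snapshot captures the current "LSM" contents.
func (r *replica) snapshot() map[string]bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	snap := make(map[string]bool, len(r.data))
	for k, v := range r.data {
		snap[k] = v
	}
	return snap
}

func (r *replica) checkThreshold(ts int) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if ts <= r.gcThreshold {
		return fmt.Errorf("read at %d below GC threshold %d", ts, r.gcThreshold)
	}
	return nil
}

// readEager is the safe ordering: capture the snapshot BEFORE checking
// the threshold. Any GC that applies after the capture can only raise
// the threshold we subsequently check, so a read that passes the check
// can never be handed a snapshot missing versions it is entitled to see.
func (r *replica) readEager(key string, ts int) (bool, error) {
	snap := r.snapshot()
	if err := r.checkThreshold(ts); err != nil {
		return false, err
	}
	return snap[key], nil
}

func main() {
	r := &replica{data: map[string]bool{"k@5": true}, gcThreshold: 3}
	ok, err := r.readEager("k@5", 10)
	fmt.Println(ok, err)
}
```

The lazy ordering in the diagram inverts this: the threshold check happens first, the snapshot capture last, and the GC can slip in between.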
Related to cockroachdb#55293.

This commit adds a safeguard to GC requests that prevents them from bumping the GC threshold at the same time that they GC individual MVCC versions. This was found to be unsafe in cockroachdb#55293 because performing both of these actions at the same time could lead to a race where a read request is allowed to evaluate without error while also failing to see MVCC versions that are concurrently GCed.

This race is possible because foreground traffic consults the in-memory version of the GC threshold (`r.mu.state.GCThreshold`), which is updated after (in `handleGCThresholdResult`), not atomically with, the application of the GC request's WriteBatch to the LSM (in `ApplyToStateMachine`). This allows a read request to see the effect of a GC on MVCC state without seeing its effect on the in-memory GC threshold.

The latches acquired by GC requests look like they will help with this race, but in practice they do not, for two reasons:
1. The latches do not protect timestamps below the GC request's batch timestamp. This means that they only conflict with concurrent writes, but not all concurrent reads.
2. The read could be served off a follower, which could be applying the GC request's effect from the raft log. Latches held on the leaseholder would have no impact on a follower read.

Thankfully, the GC queue has split these two steps for the past few releases, at least since 87e85eb, so we do not have a bug today.

The commit also adds a test that reliably exercises the bug with a few well-placed calls to `time.Sleep`. The test contains a variant where the read is performed on the leaseholder and a variant where it is performed on a follower. Both fail by default. If we switch the GC request to acquire non-MVCC latches then the leaseholder variant passes, but the follower read variant still fails.
Force-pushed eee017b to c048446
bors r+

Build succeeded:
…ldRacesWithRead` This commit unskips a subset of `TestGCThresholdRacesWithRead`, which is now possible because of cockroachdb#76312 and the first commit in this patch. See cockroachdb#76410 (comment) Relates to cockroachdb#55293. Release note: none
Related to #55293.
This commit adds a safeguard to GC requests that prevents them from
bumping the GC threshold at the same time that they GC individual MVCC
versions. This was found to be unsafe in #55293 because performing both
of these actions at the same time could lead to a race where a read
request is allowed to evaluate without error while also failing to see
MVCC versions that are concurrently GCed.
This race is possible because foreground traffic consults the in-memory version of the GC threshold (`r.mu.state.GCThreshold`), which is updated after (in `handleGCThresholdResult`), not atomically with, the application of the GC request's WriteBatch to the LSM (in `ApplyToStateMachine`). This allows a read request to see the effect of a GC on MVCC state without seeing its effect on the in-memory GC threshold.
The latches acquired by GC requests look like they will help with this race, but in practice they do not, for two reasons:
1. The latches do not protect timestamps below the GC request's batch timestamp. This means that they only conflict with concurrent writes, but not all concurrent reads.
2. The read could be served off a follower, which could be applying the GC request's effect from the raft log. Latches held on the leaseholder would have no impact on a follower read.
Thankfully, the GC queue has split these two steps for the past few
releases, at least since 87e85eb, so we do not have a bug today.
The commit also adds a test that reliably exercises the bug with a few
well-placed calls to `time.Sleep`. The test contains a variant where the
read is performed on the leaseholder and a variant where it is performed
on a follower. Both fail by default. If we switch the GC request to
acquire non-MVCC latches then the leaseholder variant passes, but the
follower read variant still fails.
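The safeguard this commit describes amounts to a simple request validation. The following is a hedged, illustrative Go sketch; `gcRequest` and `checkGCRequest` are invented names, not the real `roachpb.GCRequest` shape:

```go
package main

import (
	"errors"
	"fmt"
)

// gcRequest sketches the shape of a GC request for this discussion: it
// may bump the GC threshold, GC individual expired versions, or (before
// this commit) both. Field names are illustrative only.
type gcRequest struct {
	newThreshold int      // zero means "no threshold bump"
	keys         []string // expired versions to remove
}

// checkGCRequest mirrors the safeguard added by this commit: a single
// request may bump the threshold or GC versions, but never both. With
// the two actions split across requests, the in-memory threshold is
// always bumped by an earlier, fully applied request before any
// versions below it are removed.
func checkGCRequest(req gcRequest) error {
	if req.newThreshold != 0 && len(req.keys) > 0 {
		return errors.New("GC request cannot both bump the GC threshold and GC keys")
	}
	return nil
}

func main() {
	fmt.Println(checkGCRequest(gcRequest{newThreshold: 100}))
	fmt.Println(checkGCRequest(gcRequest{newThreshold: 100, keys: []string{"a"}}))
}
```

As the commit message notes, the GC queue already issues these as separate requests; the check turns that convention into an enforced invariant.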