
kv: disallow GC requests that bump GC threshold and GC expired versions#76410

Merged
craig[bot] merged 1 commit into cockroachdb:master from nvb:nvanbenschoten/gcThreshRace on Feb 17, 2022

Conversation

@nvb
Contributor

@nvb nvb commented Feb 11, 2022

Related to #55293.

This commit adds a safeguard to GC requests that prevents them from
bumping the GC threshold at the same time that they GC individual MVCC
versions. This was found to be unsafe in #55293 because performing both
of these actions at the same time could lead to a race where a read
request is allowed to evaluate without error while also failing to see
MVCC versions that are concurrently GCed.

This race is possible because foreground traffic consults the in-memory
version of the GC threshold (r.mu.state.GCThreshold), which is updated
after (in handleGCThresholdResult), not atomically with, the application
of the GC request's WriteBatch to the LSM (in ApplyToStateMachine). This
allows a read request to see the effect of a GC on MVCC state without seeing
its effect on the in-memory GC threshold.
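The ordering described above can be illustrated with a toy Go model (hypothetical types for illustration only, not CockroachDB's actual code): the GC's effect lands in storage first, and the in-memory threshold is bumped only afterwards, so a reader that checks the threshold inside that window passes the check yet misses the version.

```go
package main

import "fmt"

// Toy model of the race window. The in-memory GC threshold (an analogue of
// r.mu.state.GCThreshold) is bumped only *after* the GC's effect on MVCC state.
type replica struct {
	gcThreshold int64            // in-memory threshold, checked by readers
	storage     map[int64]string // timestamp -> value (stands in for the LSM)
}

// applyGC mirrors the application order: storage mutation first (the
// WriteBatch), threshold bump second (the post-apply side effect).
func (r *replica) applyGC(newThreshold int64) {
	for ts := range r.storage { // step 1: remove expired versions
		if ts <= newThreshold {
			delete(r.storage, ts)
		}
	}
	r.gcThreshold = newThreshold // step 2: too late for a racing reader
}

// checkThreshold is the reader's admission check.
func (r *replica) checkThreshold(ts int64) error {
	if ts <= r.gcThreshold {
		return fmt.Errorf("read at %d below GC threshold %d", ts, r.gcThreshold)
	}
	return nil
}

func main() {
	r := &replica{storage: map[int64]string{5: "old", 20: "new"}}
	// Reader passes the threshold check while the threshold is still 0...
	err := r.checkThreshold(5)
	// ...then GC applies in the window before the reader consults storage.
	r.applyGC(10)
	_, found := r.storage[5]
	// Passed the check, but the version is gone: a silent gap, not an error.
	fmt.Printf("check err=%v, version found=%v\n", err, found)
}
```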

The latches acquired by GC requests look like they will help with this race,
but in practice they do not, for two reasons:

  1. the latches do not protect timestamps below the GC request's batch
    timestamp. This means that they only conflict with concurrent writes,
    but not all concurrent reads.
  2. the read could be served off a follower, which could be applying the
    GC request's effect from the raft log. Latches held on the leaseholder
    would have no impact on a follower read.
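Reason 1 follows from timestamp-aware ("MVCC") latching, sketched below in simplified form. This is an illustration of the conflict rule, not CockroachDB's actual spanlatch implementation.

```go
package main

import "fmt"

// latchesConflict is a simplified sketch of MVCC latch conflicts: reads only
// conflict with writes at or below their timestamp, so a write latch declared
// at a high timestamp does not block reads at lower timestamps.
func latchesConflict(aWrite bool, aTS int64, bWrite bool, bTS int64) bool {
	switch {
	case !aWrite && !bWrite:
		return false // reads never conflict with reads
	case aWrite && bWrite:
		return true // writes always conflict with writes
	}
	readTS, writeTS := aTS, bTS
	if aWrite {
		readTS, writeTS = bTS, aTS
	}
	// A read can only observe writes at or below its timestamp,
	// so only those read/write pairs conflict.
	return readTS >= writeTS
}

func main() {
	const gcBatchTS = 100 // GC declares write latches at its batch timestamp
	fmt.Println(latchesConflict(true, gcBatchTS, false, 50))  // read below batch ts slips through
	fmt.Println(latchesConflict(true, gcBatchTS, false, 150)) // read above batch ts waits
	fmt.Println(latchesConflict(true, gcBatchTS, true, 50))   // concurrent writes always wait
}
```

This is why the GC request's latches exclude concurrent writes but not reads below the batch timestamp, which is exactly the population of requests that can observe the GCed versions.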

Thankfully, the GC queue has split these two steps for the past few
releases, at least since 87e85eb, so we do not have a bug today.

The commit also adds a test that reliably exercises the bug with a few
well-placed calls to time.Sleep. The test contains a variant where the
read is performed on the leaseholder and a variant where it is performed
on a follower. Both fail by default. If we switch the GC request to
acquire non-MVCC latches then the leaseholder variant passes, but the
follower read variant still fails.

@nvb nvb requested a review from aayushshah15 February 11, 2022 00:36
@nvb nvb requested a review from a team as a code owner February 11, 2022 00:36
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@aayushshah15 aayushshah15 left a comment


:lgtm_strong: nice test, thanks for writing it.

I have a general question about the way mvcc GC works and I'm likely missing something fundamental. Why do we need to compute per-key gc timestamps inside processReplicatedKeyRange here?

isNewest := s.curIsNewest()
if isGarbage(threshold, s.cur, s.next, isNewest) {
    keyBytes := int64(s.cur.Key.EncodedSize())
    batchGCKeysBytes += keyBytes
    haveGarbageForThisKey = true
    gcTimestampForThisKey = s.cur.Key.Timestamp
    info.AffectedVersionsKeyBytes += keyBytes
    info.AffectedVersionsValBytes += int64(len(s.cur.Value))
}

Looking at MVCCGarbageCollect, why can't we avoid all this and tell it to GC all keys that are lower than the new GC threshold?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/replica_test.go, line 8684 at r1 (raw file):

						Store: &StoreTestingKnobs{
							// Disable the GC queue so the test is the only one issuing GC
							// requests.

request*

@nvb nvb force-pushed the nvanbenschoten/gcThreshRace branch from 61c2fd6 to eee017b on February 15, 2022 18:40
Contributor Author

@nvb nvb left a comment


TFTR!

bors r+

I have a general question about the way mvcc GC works and I'm likely missing something fundamental. Why do we need to compute per-key gc timestamps inside processReplicatedKeyRange here?

I don't think there's a fundamental reason why we do things this way vs. pushing the determination of all of the versions to GC through the GC request. The only real difference I can think of is that keeping the decision in the MVCC GC queue allows it to batch at a finer granularity and more accurately predict + limit the size of a single GC batch to KeyVersionChunkBytes.
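The batching being described can be sketched as byte-limited chunking. The helper below is hypothetical and for illustration only: KeyVersionChunkBytes is the real constant this mirrors in spirit, but the function is not the queue's actual code.

```go
package main

import "fmt"

// chunkGarbage splits candidate GC keys into batches whose total encoded
// size stays under a byte limit, the way the MVCC GC queue bounds a single
// GC batch to roughly KeyVersionChunkBytes.
func chunkGarbage(keys []string, limitBytes int) [][]string {
	var batches [][]string
	var cur []string
	size := 0
	for _, k := range keys {
		// Close out the current batch if adding this key would exceed the limit.
		if len(cur) > 0 && size+len(k) > limitBytes {
			batches = append(batches, cur)
			cur, size = nil, 0
		}
		cur = append(cur, k)
		size += len(k)
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	keys := []string{"a/1", "a/2", "b/1", "b/2", "c/1"}
	// With a 7-byte limit and 3-byte keys, at most two keys fit per batch.
	fmt.Println(chunkGarbage(keys, 7))
}
```

Keeping the garbage decision in the queue lets it size these chunks accurately before the request is ever proposed; pushing the decision into the GC request would force the limit to be enforced during evaluation instead.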

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/replica_test.go, line 8684 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

request*

I think this should be "requests", no?

craig bot pushed a commit that referenced this pull request Feb 15, 2022
73586: rfc: optimize the draining process with connection_wait and relevant reporting systems r=ZhouXing19 a=ZhouXing19

This is a proposal for optimizing the draining process to be more legible for
customers, and introduce a new step, connection_wait, to the draining process,
which allows the server to early exit when all connections are closed.

Release note: None

76410: kv: disallow GC requests that bump GC threshold and GC expired versions r=nvanbenschoten a=nvanbenschoten

(Commit message identical to the PR description above.)

76417: ccl/sqlproxyccl: add connector component and support for session revival token r=JeffSwenson a=jaylim-crl

Informs #76000.

Previously, all the connection establishment logic is coupled with the handler
function within proxy_handler.go. This makes connecting to a new SQL pod during
connection migration difficult. This commit refactors all of those connection
logic out of the proxy handler into a connector component, as described in the
connection migration RFC. At the same time, we also add support for the session
revival token within this connector component.

Note that the overall behavior of the SQL proxy should be unchanged with this
commit.

Release note: None

76545: cmd/reduce: add -tlp option r=yuzefovich a=yuzefovich

**cmd/reduce: remove stdin option and require -file argument**

We tend to not use the option of passing input SQL via stdin, so this
commit removes it. An additional argument in favor of doing that is that
the follow-up commit will introduce another mode of behavior that
requires `-file` argument to be specified, so it's just cleaner to
always require it now.

Release note: None

**cmd/reduce: add -tlp option**

This commit adds `-tlp` boolean flag that changes the behavior of
`reduce`. It is required that `-file` is specified whenever the `-tlp`
flag is used. The behavior is such that the last two queries (delimited
by empty lines) in the file contain unpartitioned and partitioned queries
that return different results although they are equivalent.

If TLP check is requested, then we remove the last two queries from the
input which we use then to construct a special TLP check query that
results in an error if two removed queries return different results.

We do not just include the TLP check query into the input string because
the reducer would then reduce the check query itself, making the
reduction meaningless.

Release note: None

76598: server: use channel for DisableAutomaticVersionUpgrade r=RaduBerinde a=RaduBerinde

DisableAutomaticVersionUpgrade is an atomic integer which is rechecked
in a retry loop. This is not a very clean mechanism, and can lead to
issues where you're unknowingly dealing with a copy of the knobs and
setting the wrong atomic. The retry loop can also add unnecessary
delays in tests.

This commit changes DisableAutomaticVersionUpgrade from an atomic
integer to a channel. If the channel is set, auto-upgrade waits until
the channel is closed.

Release note: None

Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Jay <jay@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
@craig
Contributor

craig bot commented Feb 15, 2022

Build failed (retrying...):

@nvb
Contributor Author

nvb commented Feb 15, 2022

bors r-

@craig
Contributor

craig bot commented Feb 15, 2022

Canceled.

@nvb
Contributor Author

nvb commented Feb 15, 2022

Under stress:

------- Stdout: -------
=== RUN   TestGCThresholdRacesWithRead/followerRead=true/thresholdFirst=true
    replica_test.go:8797: read won race: header:<>
    replica_test.go:8799:
          Error Trace:  replica_test.go:8799
                              replica_test.go:8813
                              asm_amd64.s:1581
          Error:        Expected value not to be nil.
          Test:         TestGCThresholdRacesWithRead/followerRead=true/thresholdFirst=true
        --- FAIL: TestGCThresholdRacesWithRead/followerRead=true/thresholdFirst=true (1.25s)

I wonder if this found a real bug.

@nvb
Contributor Author

nvb commented Feb 16, 2022

I'm actually able to hit this test failure under stress even in the /followerRead=false/thresholdFirst=true case. So all four variants are flaky.

I dug into what's going wrong, and it appears that GC is broken even when we bump the GC threshold in a separate request from the one that GCs old MVCC versions. The reason is the incorrect latching that we noticed in #55293, combined with the lazy LSM snapshot we capture during request evaluation, discussed in #55461. This can lead to races like the following:

sequenceDiagram
    participant Reader
    participant Replica
    participant GC Requests
    participant MVCC GC Queue
    Reader->>Replica: acquire latches
    Reader->>Replica: checkTSAboveGCThresholdRLocked
    Replica->>Reader: clear to proceed
    MVCC GC Queue->>GC Requests: bump GC threshold
    GC Requests->>Replica: acquire latches
    GC Requests->>Replica: bump GC threshold
    GC Requests->>MVCC GC Queue: success
    MVCC GC Queue->>GC Requests: GC expired MVCC version
    GC Requests->>Replica: acquire latches
    GC Requests->>Replica: GC expired MVCC version
    GC Requests->>MVCC GC Queue: success
    Reader->>Replica: lazily capture LSM snapshot
    Reader->>Reader: evaluate using snapshot, fail to observe version

The fact that the read request and the second GC request were able to both evaluate concurrently is due to the incorrect latching. However, as discussed above, correcting the latching is insufficient to fix this bug for follower reads. We can't rely on latching to solve this problem, because the GC request's latches are only acquired on the leaseholder. The real fix is to acquire the LSM snapshot eagerly (#55461), before checking the GC threshold, to ensure that the in-memory GC threshold we check is always equal to or greater than the corresponding persistent GC threshold present in the LSM snapshot that we operate over. So to fix this bug, we essentially need to address #55293, which promotes that performance issue into a correctness issue.
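The lazy-vs-eager snapshot distinction can be made concrete with a toy Go model (hypothetical types and function names for illustration, not the actual evaluation code). With the lazy ordering, a GC that lands between the threshold check and the snapshot produces a silent gap; with the eager ordering, the threshold the reader checks can only lag the snapshot it reads from, so the same race produces a loud threshold error instead.

```go
package main

import "fmt"

// Toy engine: versions keyed by timestamp, plus an in-memory GC threshold
// that is bumped only after the data mutation (as in the real post-apply
// side effect).
type engine struct {
	data      map[int64]string
	threshold int64
}

func (e *engine) applyGC(newThreshold int64) {
	for ts := range e.data {
		if ts <= newThreshold {
			delete(e.data, ts)
		}
	}
	e.threshold = newThreshold
}

func (e *engine) snapshot() map[int64]string {
	snap := make(map[int64]string, len(e.data))
	for ts, v := range e.data {
		snap[ts] = v
	}
	return snap
}

// readLazy mirrors the buggy ordering: threshold check first, snapshot
// later. raceHook marks the window in which GC can apply.
func readLazy(e *engine, ts int64, raceHook func()) (string, bool, error) {
	if ts <= e.threshold {
		return "", false, fmt.Errorf("below GC threshold")
	}
	raceHook() // GC applies here: versions vanish, then the threshold bumps
	v, ok := e.snapshot()[ts]
	return v, ok, nil // ok=false with err=nil: a silent gap
}

// readEager mirrors the fix: capture the snapshot first, check the
// threshold second. A GC that races in can only make the check fail,
// never make the snapshot miss versions the check allowed.
func readEager(e *engine, ts int64, raceHook func()) (string, bool, error) {
	snap := e.snapshot()
	raceHook()
	if ts <= e.threshold {
		return "", false, fmt.Errorf("below GC threshold")
	}
	v, ok := snap[ts]
	return v, ok, nil
}

func main() {
	e1 := &engine{data: map[int64]string{5: "old"}}
	_, ok, err := readLazy(e1, 5, func() { e1.applyGC(10) })
	fmt.Printf("lazy:  ok=%v err=%v\n", ok, err) // silent miss

	e2 := &engine{data: map[int64]string{5: "old"}}
	_, ok, err = readEager(e2, 5, func() { e2.applyGC(10) })
	fmt.Printf("eager: ok=%v err=%v\n", ok, err) // rejected loudly
}
```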

Note that this would fix the thresholdFirst=true variants, but it would still not fix the thresholdFirst=false variants. To fix those cases, we need to move the call to handleGCThresholdResult from after LSM application (a "post-apply" side effect) to before LSM application (a "pre-apply" side effect). We also need to make sure we get this right during Raft snapshot application. For now, it seems safer to continue with the 2-phase protocol.

@aayushshah15 let me know whether this all makes sense to you. If so, I think I'll just merge this PR with the entire test skipped for now, with a reference to #55293. We can then set one of the goals of #55293 to be fixing and unskipping the thresholdFirst=true variants of this test.

(Force-pushed commit message identical to the PR description above.)
@nvb
Contributor Author

nvb commented Feb 16, 2022

bors r+

@craig
Contributor

craig bot commented Feb 17, 2022

Build succeeded:

@craig craig bot merged commit 8487632 into cockroachdb:master Feb 17, 2022
@nvb nvb deleted the nvanbenschoten/gcThreshRace branch March 12, 2022 23:59
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jun 22, 2022
…ldRacesWithRead`

This commit unskips a subset of `TestGCThresholdRacesWithRead`, which is now
possible because of cockroachdb#76312 and the first commit in this patch. See
cockroachdb#76410 (comment)

Relates to cockroachdb#55293.

Release note: none
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jun 28, 2022
…ldRacesWithRead`

This commit unskips a subset of `TestGCThresholdRacesWithRead`, which is now
possible because of cockroachdb#76312 and the first commit in this patch. See
cockroachdb#76410 (comment)

Relates to cockroachdb#55293.

Release note: none