kvserver: fix race in durability callback queueing in raftLogTruncator#77245
Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom Mar 2, 2022
Merged
kvserver: fix race in durability callback queueing in raftLogTruncator#77245craig[bot] merged 1 commit intocockroachdb:masterfrom
craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
Member
erikgrinaker
approved these changes
Mar 2, 2022
Contributor
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/kv/kvserver/raft_log_truncator.go, line 395 at r1 (raw file):
return } doneRunning := func() {
nit: this is only called in the RunAsyncTask error branch, let's inline it?
The existing code admitted the following interleaving between thread-1, running the async raft log truncation, and thread-2 which is running a new durabilityAdvancedCallback. thread-1: executes queued := t.mu.queuedDurabilityCB and sees queued is false thread-2: sees t.mu.runningTruncation is true and sets t.mu.queuedDurabilityCB = true thread-1: Sets t.mu.runningTruncation = false and returns Now the queued callback will never run. This can happen in tests that wait for truncation before doing the next truncation step, because they will stop waiting once the truncation is observed on a Replica, which happens before any of the steps listed above for thread-1. Fixes cockroachdb#77046 Release justification: Bug fix Release note: None
7d6e6c4 to
3ba6dcb
Compare
sumeerbhola
commented
Mar 2, 2022
Collaborator
Author
sumeerbhola
left a comment
There was a problem hiding this comment.
TFTR!
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker)
pkg/kv/kvserver/raft_log_truncator.go, line 395 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: this is only called in the
RunAsyncTaskerror branch, let's inline it?
Done
Collaborator
Author
|
bors r=erikgrinaker |
Contributor
|
Build succeeded: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The existing code admitted the following interleaving between
thread-1, running the async raft log truncation, and thread-2
which is running a new durabilityAdvancedCallback.
thread-1: executes queued := t.mu.queuedDurabilityCB and
sees queued is false
thread-2: sees t.mu.runningTruncation is true and sets
t.mu.queuedDurabilityCB = true
thread-1: Sets t.mu.runningTruncation = false and returns
Now the queued callback will never run. This can happen in tests
that wait for truncation before doing the next truncation step,
because they will stop waiting once the truncation is observed
on a Replica, which happens before any of the steps listed above
for thread-1.
Fixes #77046
Release justification: Bug fix
Release note: None