Skip to content

kvserver: fix race in durability callback queueing in raftLogTruncator#77245

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
sumeerbhola:truncator_race
Mar 2, 2022
Merged

kvserver: fix race in durability callback queueing in raftLogTruncator#77245
craig[bot] merged 1 commit intocockroachdb:masterfrom
sumeerbhola:truncator_race

Conversation

@sumeerbhola
Copy link
Copy Markdown
Collaborator

The existing code admitted the following interleaving between
thread-1, running the async raft log truncation, and thread-2
which is running a new durabilityAdvancedCallback.

thread-1: executes queued := t.mu.queuedDurabilityCB and
sees queued is false
thread-2: sees t.mu.runningTruncation is true and sets
t.mu.queuedDurabilityCB = true
thread-1: Sets t.mu.runningTruncation = false and returns

Now the queued callback will never run. This can happen in tests
that wait for truncation before doing the next truncation step,
because they will stop waiting once the truncation is observed
on a Replica, which happens before any of the steps listed above
for thread-1.

Fixes #77046

Release justification: Bug fix

Release note: None

@sumeerbhola sumeerbhola requested review from erikgrinaker and tbg March 1, 2022 21:20
@sumeerbhola sumeerbhola requested a review from a team as a code owner March 1, 2022 21:20
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/raft_log_truncator.go, line 395 at r1 (raw file):

		return
	}
	doneRunning := func() {

nit: this is only called in the RunAsyncTask error branch, let's inline it?

@tbg tbg removed their request for review March 2, 2022 13:15
The existing code admitted the following interleaving between
thread-1, running the async raft log truncation, and thread-2
which is running a new durabilityAdvancedCallback.

thread-1: executes queued := t.mu.queuedDurabilityCB and
 sees queued is false
thread-2: sees t.mu.runningTruncation is true and sets
 t.mu.queuedDurabilityCB = true
thread-1: Sets t.mu.runningTruncation = false and returns

Now the queued callback will never run. This can happen in tests
that wait for truncation before doing the next truncation step,
because they will stop waiting once the truncation is observed
on a Replica, which happens before any of the steps listed above
for thread-1.

Fixes cockroachdb#77046

Release justification: Bug fix

Release note: None
Copy link
Copy Markdown
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker)


pkg/kv/kvserver/raft_log_truncator.go, line 395 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: this is only called in the RunAsyncTask error branch, let's inline it?

Done

@sumeerbhola
Copy link
Copy Markdown
Collaborator Author

bors r=erikgrinaker

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 2, 2022

Build succeeded:

@craig craig bot merged commit 1cc1725 into cockroachdb:master Mar 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kvserver: TestRaftSSTableSideloadingTruncation/loosely-coupled=true seems flaky

3 participants