Skip to content

stop: break deadlock between Stopper.mu and Replica.mu#63867

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
irfansharif:210419.deadlock-store-close
Apr 20, 2021
Merged

stop: break deadlock between Stopper.mu and Replica.mu#63867
craig[bot] merged 1 commit intocockroachdb:masterfrom
irfansharif:210419.deadlock-store-close

Conversation

@irfansharif
Copy link
Copy Markdown
Contributor

Fixes #63761. As of #61279, it was possible for us to deadlock due to
inconsistent lock orderings between Stopper.mu and Replica.mu. We were
previously holding onto Stopper.mu while executing all closers,
including those that may acquire other locks. Because closers can be
defined anywhere (and may consequently grab any in-scope lock), we
should order Stopper.mu to come after all other locks in the system.

The closer added in #61279 iterated over all non-destroyed replicas,
locking Replica.mu to check for the replica's destroy status (in
Store.VisitReplicas). This deadlocked with the lease acquisition code
path that first grabs an exclusive lock over Replica.mu (see
InitOrJoinRequest), and uses the stopper to kick off an async task
acquiring the lease. The stopper internally locks Stopper.mu to check
whether or not it was already stopped.

Release note: None

@irfansharif irfansharif requested review from andreimatei and tbg April 19, 2021 19:14
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


pkg/kv/kvserver/store.go, line 474 at r1 (raw file):

	// obeyed: baseQueue.mu < Replica.raftMu < Replica.readOnlyCmdMu < Store.mu
	// < Replica.mu < Replica.unreachablesMu < Store.coalescedMu
	// < Store.scheduler.mu < Stopper.mu.

I don't think including stopper.mu in this list is a good idea... That's a library lock not directly related to replica processing.


pkg/util/stop/stopper.go, line 477 at r1 (raw file):

	s.mu.Lock()
	closers := s.mu.closers
	defer s.mu.Unlock()

mmm I don't think you've fixed what you wanted to fix... We're still calling the closer under the lock. I think you don't want to defer the unlock any more.

In fact I think you don't even need to take this lock any more, based on the same argument below about how new closers can't be added any more (and the freezing was done by this thread ).


pkg/util/stop/stopper.go, line 479 at r1 (raw file):

	defer s.mu.Unlock()

	// We can't hold on to Stopper.mu while executing closers. Closers are

s/Stopper.mu/s.mu (or just "the lock").


pkg/util/stop/stopper.go, line 481 at r1 (raw file):

	// We can't hold on to Stopper.mu while executing closers. Closers are
	// allowed to be defined anywhere, and are allowed to grab any lock
	// available. As far as lock ordering is concerned, Stopper.mu comes after

nit: what does "any lock available" mean? I think this note is generally too verbose; just say "run the closers without holding the lock". Avoiding calling callbacks with locks held is a very common thing.

@irfansharif irfansharif force-pushed the 210419.deadlock-store-close branch from 48eedde to 0c2383f Compare April 19, 2021 19:45
Copy link
Copy Markdown
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)


pkg/kv/kvserver/store.go, line 474 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't think including stopper.mu in this list is a good idea... That's a library lock not directly related to replica processing.

Done.


pkg/util/stop/stopper.go, line 477 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

mmm I don't think you've fixed what you wanted to fix... We're still calling the closer under the lock. I think you don't want to defer the unlock any more.

In fact I think you don't even need to take this lock any more, based on the same argument below about how new closers can't be added any more (and the freezing was done by this thread ).

Done.


pkg/util/stop/stopper.go, line 479 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

s/Stopper.mu/s.mu (or just "the lock").

Done.

Copy link
Copy Markdown
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)

Fixes cockroachdb#63761. As of cockroachdb#61279, it was possible for us to deadlock due to
inconsistent lock orderings between Stopper.mu and Replica.mu. We were
previously holding onto Stopper.mu while executing all closers,
including those that may acquire other locks. Because closers can be
defined anywhere (and may consequently grab any in-scope lock), we
should order Stopper.mu to come after all other locks in the system.

The closer added in cockroachdb#61279 iterated over all non-destroyed replicas,
locking Replica.mu to check for the replica's destroy status (in
Store.VisitReplicas). This deadlocked with the lease acquisition code
path that first grabs an exclusive lock over Replica.mu (see
InitOrJoinRequest), and uses the stopper to kick off an async task
acquiring the lease. The stopper internally locks Stopper.mu to check
whether or not it was already stopped.

Release note: None
@irfansharif irfansharif force-pushed the 210419.deadlock-store-close branch from 0c2383f to 79922e4 Compare April 20, 2021 04:46
@irfansharif
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 20, 2021

Build succeeded:

@craig craig bot merged commit d094f02 into cockroachdb:master Apr 20, 2021
@irfansharif irfansharif deleted the 210419.deadlock-store-close branch April 20, 2021 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kvserver: deadlock stopping server because of stopper<->lease acquisition cycle

3 participants