stop: break deadlock between Stopper.mu and Replica.mu#63867
stop: break deadlock between Stopper.mu and Replica.mu#63867craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
pkg/kv/kvserver/store.go, line 474 at r1 (raw file):
// obeyed: baseQueue.mu < Replica.raftMu < Replica.readOnlyCmdMu < Store.mu // < Replica.mu < Replica.unreachablesMu < Store.coalescedMu // < Store.scheduler.mu < Stopper.mu.
I don't think including stopper.mu in this list is a good idea... That's a library lock not directly related to replica processing.
pkg/util/stop/stopper.go, line 477 at r1 (raw file):
s.mu.Lock() closers := s.mu.closers defer s.mu.Unlock()
mmm I don't think you've fixed what you wanted to fix... We're still calling the closer under the lock. I think you don't want to defer the unlock any more.
In fact I think you don't even need to take this lock any more, based on the same argument below about how new closers can't be added any more (and the freezing was done by this thread ).
pkg/util/stop/stopper.go, line 479 at r1 (raw file):
defer s.mu.Unlock() // We can't hold on to Stopper.mu while executing closers. Closers are
s/Stopper.mu/s.mu (or just "the lock").
pkg/util/stop/stopper.go, line 481 at r1 (raw file):
// We can't hold on to Stopper.mu while executing closers. Closers are // allowed to be defined anywhere, and are allowed to grab any lock // available. As far as lock ordering is concerned, Stopper.mu comes after
nit: what does "any lock available" mean? I think this note is generally too verbose; just say "run the closers without holding the lock". Avoiding calling callbacks with locks held is a very common thing.
48eedde to
0c2383f
Compare
irfansharif
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)
pkg/kv/kvserver/store.go, line 474 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I don't think including
stopper.muin this list is a good idea... That's a library lock not directly related to replica processing.
Done.
pkg/util/stop/stopper.go, line 477 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
mmm I don't think you've fixed what you wanted to fix... We're still calling the closer under the lock. I think you don't want to
deferthe unlock any more.In fact I think you don't even need to take this lock any more, based on the same argument below about how new closers can't be added any more (and the freezing was done by this thread ).
Done.
pkg/util/stop/stopper.go, line 479 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/Stopper.mu/s.mu (or just "the lock").
Done.
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)
Fixes cockroachdb#63761. As of cockroachdb#61279, it was possible for us to deadlock due to inconsistent lock orderings between Stopper.mu and Replica.mu. We were previously holding onto Stopper.mu while executing all closers, including those that may acquire other locks. Because closers can be defined anywhere (and may consequently grab any in-scope lock), we should order Stopper.mu to come after all other locks in the system. The closer added in cockroachdb#61279 iterated over all non-destroyed replicas, locking Replica.mu to check for the replica's destroy status (in Store.VisitReplicas). This deadlocked with the lease acquisition code path that first grabs an exclusive lock over Replica.mu (see InitOrJoinRequest), and uses the stopper to kick off an async task acquiring the lease. The stopper internally locks Stopper.mu to check whether or not it was already stopped. Release note: None
0c2383f to
79922e4
Compare
|
bors r+ |
|
Build succeeded: |
Fixes #63761. As of #61279, it was possible for us to deadlock due to
inconsistent lock orderings between Stopper.mu and Replica.mu. We were
previously holding onto Stopper.mu while executing all closers,
including those that may acquire other locks. Because closers can be
defined anywhere (and may consequently grab any in-scope lock), we
should order Stopper.mu to come after all other locks in the system.
The closer added in #61279 iterated over all non-destroyed replicas,
locking Replica.mu to check for the replica's destroy status (in
Store.VisitReplicas). This deadlocked with the lease acquisition code
path that first grabs an exclusive lock over Replica.mu (see
InitOrJoinRequest), and uses the stopper to kick off an async task
acquiring the lease. The stopper internally locks Stopper.mu to check
whether or not it was already stopped.
Release note: None