kvserver: leaked replica mu #106568
Description
Describe the problem
In two instances, we saw a leaked replica mutex:
- https://github.com/cockroachlabs/support/issues/2387#issuecomment-1602346347
- roachtest: failover/chaos/read-only failed #106108
In both cases, the goroutine that acquired the mutex seems to have exited without releasing it. We made an extensive attempt to reproduce the second instance (~8k runs over a week), but the issue did not recur.
#106254 has a "deadlock detector" that also prints the stack trace of the mutex acquisition. However, it is likely too expensive to be always-on, especially for a hot mutex like Replica.mu.
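To illustrate where the cost comes from, here is a minimal sketch of a mutex that remembers the stack of its last acquisition. It is not the code from #106254; `tracedMutex`, `meta`, `holder`, and `HolderStack` are hypothetical names made up for this example. Every `Lock` pays for a `debug.Stack()` capture plus a second, small critical section, which is the kind of overhead that would be hard to justify on a hot mutex.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"sync"
)

// tracedMutex is a hypothetical wrapper (not the type from #106254) that
// records the stack trace of whichever goroutine last acquired it.
type tracedMutex struct {
	mu sync.Mutex

	// meta guards holder so that it can be read while mu is held elsewhere.
	meta   sync.Mutex
	holder []byte
}

// Lock acquires the mutex and records the caller's stack.
func (m *tracedMutex) Lock() {
	m.mu.Lock()
	stack := debug.Stack() // expensive: formats the goroutine's stack on every acquisition
	m.meta.Lock()
	m.holder = stack
	m.meta.Unlock()
}

// Unlock clears the recorded stack and releases the mutex.
func (m *tracedMutex) Unlock() {
	m.meta.Lock()
	m.holder = nil
	m.meta.Unlock()
	m.mu.Unlock()
}

// HolderStack returns the stack recorded by the current holder, or "" if the
// mutex is not held. A watchdog could print this when an acquisition stalls.
func (m *tracedMutex) HolderStack() string {
	m.meta.Lock()
	defer m.meta.Unlock()
	return string(m.holder)
}

func main() {
	var m tracedMutex
	m.Lock()
	fmt.Println("holder recorded:", m.HolderStack() != "")
	m.Unlock()
	fmt.Println("holder cleared:", m.HolderStack() == "")
}
```

If the holder leaks the mutex, `HolderStack` still returns the acquisition site, even though the goroutine itself is gone.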
Another angle could be #105366, i.e. proving through static analysis that all acquisitions are defer-unlocked (and thus in a deadlock scenario, the lock holder would still be around).
In my view we ought to use only a provably safe unlock pattern, so the second option (static analysis, #105366) seems appealing, especially given how rarely the bug occurs.
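As a sketch of why the defer-unlock pattern matters, the example below contrasts a manual unlock with a deferred one when the critical section panics and the panic is recovered further up the stack. This is only one assumed way a lock can leak; the issue does not establish that it is what happened in either instance. The `replica` type, `updateUnsafe`, `updateSafe`, and `runRecovered` are hypothetical names, not kvserver code.

```go
package main

import (
	"fmt"
	"sync"
)

// replica is a hypothetical stand-in for a struct guarded by a mutex.
type replica struct {
	mu    sync.Mutex
	state int
}

// updateUnsafe unlocks manually. If f panics, Unlock is never reached and the
// mutex stays locked even though the acquiring call has unwound.
func (r *replica) updateUnsafe(f func(int) int) {
	r.mu.Lock()
	r.state = f(r.state)
	r.mu.Unlock()
}

// updateSafe defers the unlock, so the mutex is released on every exit path,
// including a panic in f.
func (r *replica) updateSafe(f func(int) int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.state = f(r.state)
}

// runRecovered calls fn and swallows any panic, standing in for a recover
// somewhere higher up the call stack.
func runRecovered(fn func()) {
	defer func() { _ = recover() }()
	fn()
}

func main() {
	boom := func(int) int { panic("boom") }

	unsafeR, safeR := &replica{}, &replica{}
	runRecovered(func() { unsafeR.updateUnsafe(boom) })
	runRecovered(func() { safeR.updateSafe(boom) })

	// TryLock (Go 1.18+) fails if the mutex is still held, i.e. it leaked.
	fmt.Println("unsafe leaked:", !unsafeR.mu.TryLock()) // prints "unsafe leaked: true"
	fmt.Println("safe leaked:", !safeR.mu.TryLock())     // prints "safe leaked: false"
}
```

With the deferred form, a mutex can only remain held while the function that acquired it is still on some goroutine's stack, so a goroutine dump would show the holder.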
Jira issue: CRDB-29618