mds/MDSDaemon: unlock mds_lock while shutting down Beacon and others#60326
mds/MDSDaemon: unlock mds_lock while shutting down Beacon and others#60326
mds_lock while shutting down Beacon and others#60326Conversation
60b532c to
4211071
Compare
4211071 to
c1df8f3
Compare
True. What a weird piece of code! (and |
I think just calling unlock is fine with an appropriate comment. We don't expect the MDS to continue normally so it's fine. I am wondering about the wisdom of letting other threads in the MDS (like MDCache::upkeeper) continue; not all threads are gated by the |
|
@batrick, it's been a month. Is this going to be merged? |
|
@batrick, it's been two months now. |
Sorry, I think I was waiting to see if you were going to adjust the code to:
|
c1df8f3 to
0695334
Compare
I don't think that makes sense because the beacon state is protected by
I added a comment to the |
|
@batrick, do you agree or do you still want me to move the unlock? |
|
Can one of the admins verify this patch? |
|
It's been nearly half a year since I submitted this PR, and it fixes a very annoying Ceph bug. I'm running out of time to get my Ceph improvements merged. |
|
Sorry again for the wait.
The concern I have is the race which would exist between unlocking the We already have a
Thanks. If you can please just move the unlock after the |
0695334 to
aed19e4
Compare
done |
|
jenkins test api |
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
@vshankar @mchangir @kotreshhr for your next batch, please. |
|
This PR is under test in https://tracker.ceph.com/issues/72244. |
This fixes a deadlock bug during MDS shutdown: - the "signal_handler" thread receives the shutdown signal and invokes MDSDaemon::suicide() while holding `mds_lock` - MDSDaemon::suicide() invokes Beacon::send_and_wait() while still holding `mds_lock` - meanwhile, all "ms_dispatch" threads get stuck waiting for `mds_lock`, for example in MDCache::upkeep_main() or MDSDaemon::ms_dispatch2() - Beacon::send_and_wait() waits for a `MSG_MDS_BEACON` packet to be dispatched (via `cvar` with a timeout) At this point, even if a `MSG_MDS_BEACON` packet is received by one of the worker threads, they will put it in the `DispatchQueue`, but no dispatcher thread will be able to handle it because they are all stuck. The cvar.wait_for() call in Beacon::send_and_wait() will therefore time out and the `MSG_MDS_BEACON` will never be processed. The proper solution is to unlock `mds_lock` to avoid the dispatchers from getting stuck. And in general, we should be holding a lock strictly only when it is needed and never do blocking calls while holding a lock. Fixes: https://tracker.ceph.com/issues/68760 Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
aed19e4 to
c0fedb5
Compare
|
putting this to retest |
|
jenkins make check arm64 |
|
jenkins test make check arm64 |
This fixes a deadlock bug during MDS shutdown:
the "signal_handler" thread receives the shutdown signal and invokes
MDSDaemon::suicide() while holding
mds_lockMDSDaemon::suicide() invokes Beacon::send_and_wait() while still
holding
mds_lockmeanwhile, all "ms_dispatch" threads get stuck waiting for
mds_lock, for example in MDCache::upkeep_main() orMDSDaemon::ms_dispatch2()
Beacon::send_and_wait() waits for a
MSG_MDS_BEACONpacket to bedispatched (via
cvarwith a timeout)At this point, even if a
MSG_MDS_BEACONpacket is received by one ofthe worker threads, they will put it in the
DispatchQueue, but nodispatcher thread will be able to handle it because they are all
stuck. The cvar.wait_for() clal in Beacon::send_and_wait() will
therefore time out and the
MSG_MDS_BEACONwill never be processed.The proper solution is to unlock
mds_lockto avoid the dispatchersfrom getting stuck. And in general, we should be holding a lock
strictly only when it is needed and never do blocking calls while
holding a lock.
Fixes: https://tracker.ceph.com/issues/68760
This used to be part of #60320
Checklist