squid: mds/MDSDaemon: unlock mds_lock while shutting down Beacon and others #64886

Merged: neesingh-rh merged 1 commit into ceph:squid from vshankar:wip-72390-squid on Jan 19, 2026
@vshankar (Contributor) commented Aug 7, 2025

backport tracker: https://tracker.ceph.com/issues/72390


backport of #60326
parent tracker: https://tracker.ceph.com/issues/68760

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

This fixes a deadlock bug during MDS shutdown:

- the "signal_handler" thread receives the shutdown signal and invokes
  MDSDaemon::suicide() while holding `mds_lock`

- MDSDaemon::suicide() invokes Beacon::send_and_wait() while still
  holding `mds_lock`

- meanwhile, all "ms_dispatch" threads get stuck waiting for
  `mds_lock`, for example in MDCache::upkeep_main() or
  MDSDaemon::ms_dispatch2()

- Beacon::send_and_wait() waits for a `MSG_MDS_BEACON` packet to be
  dispatched (via `cvar` with a timeout)

At this point, even if one of the worker threads receives a
`MSG_MDS_BEACON` packet, it will put the packet in the
`DispatchQueue`, but no dispatcher thread will be able to handle it
because they are all stuck.  The cvar.wait_for() call in
Beacon::send_and_wait() will therefore time out, and the
`MSG_MDS_BEACON` will never be processed.

The proper solution is to unlock `mds_lock` so that the dispatchers
do not get stuck.  In general, we should hold a lock only for as long
as it is needed and never make blocking calls while holding it.
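
The scenario above can be sketched as a small toy model. This is not the
Ceph code: `ToyDaemon`, `big_lock`, `beacon_lock`, `dispatch_ack()`,
`wait_for_ack()` and `shutdown_sees_ack()` are hypothetical names chosen
for illustration, with `big_lock` standing in for `mds_lock` and the
spawned thread standing in for a dispatcher. It only illustrates why a
timed wait under the big lock times out and why releasing the lock first
lets the acknowledgement through.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Toy model only; every name here is a hypothetical stand-in, not Ceph API.
struct ToyDaemon {
  std::mutex big_lock;           // plays the role of mds_lock
  std::mutex beacon_lock;        // protects ack_seen
  std::condition_variable cvar;  // signalled when the ack is handled
  bool ack_seen = false;

  // Plays the role of a dispatcher thread: it must take big_lock
  // before it is allowed to handle the queued acknowledgement.
  void dispatch_ack() {
    std::lock_guard<std::mutex> big(big_lock);
    {
      std::lock_guard<std::mutex> l(beacon_lock);
      ack_seen = true;
    }
    cvar.notify_all();
  }

  // Plays the role of a send-and-wait call: block until the ack has
  // been handled or the timeout expires. Returns true if acked.
  bool wait_for_ack(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> l(beacon_lock);
    return cvar.wait_for(l, timeout, [this] { return ack_seen; });
  }
};

// Runs one simulated shutdown. Returns whether the ack was seen
// before the wait timed out.
bool shutdown_sees_ack(bool unlock_before_wait) {
  ToyDaemon d;
  // The "signal handler" enters shutdown holding the big lock.
  std::unique_lock<std::mutex> big(d.big_lock);
  std::thread dispatcher([&d] { d.dispatch_ack(); });

  if (unlock_before_wait) {
    big.unlock();  // the fix: never block while holding the big lock
  }
  bool acked = d.wait_for_ack(std::chrono::milliseconds(300));

  // Release the lock (if still held) so the dispatcher can finish
  // and the join below cannot hang.
  if (big.owns_lock()) {
    big.unlock();
  }
  dispatcher.join();
  return acked;
}
```

In this model, `shutdown_sees_ack(false)` waits the full timeout and
reports that no acknowledgement arrived, because the dispatcher cannot
acquire `big_lock`; `shutdown_sees_ack(true)` is expected to see the
acknowledgement well before the timeout. Build with thread support
enabled (for example `-pthread` with GCC or Clang).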

Fixes: https://tracker.ceph.com/issues/68760
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
(cherry picked from commit c0fedb5)
@vshankar vshankar added this to the squid milestone Aug 7, 2025
@vshankar vshankar added the cephfs Ceph File System label Aug 7, 2025
@joscollin (Member) commented:

jenkins retest this please

@joscollin (Member) commented:

This PR is under test in https://tracker.ceph.com/issues/73330.

@neesingh-rh (Contributor) left a comment

@neesingh-rh neesingh-rh merged commit b574f58 into ceph:squid Jan 19, 2026
10 checks passed