Bug #74953
tools/cephfs_mirror: Mirror daemon aborted due to lock ordering issue
Status: Closed
Description
The issue occurred while testing the multithreading mirroring PR: https://github.com/ceph/ceph/pull/66572
However, after analysis, it turned out not to be related to that PR; the issue should exist in the main branch as well.
Steps followed to reproduce:
1. 10 snapshots were synced.
2. While the snapshots were being mirrored, the following command was run in a loop in another window:
ceph --admin-daemon ./asok/client.mirror.20274.asok fs mirror peer status a@1 a7edb927-f127-4b45-88a1-c082198e93e7
3. After the 10 snapshots were synced, SIGSTOP was sent to the mirror daemon.
4. Sleep 40 seconds.
5. SIGCONT was sent to the mirror daemon.
6. The mirror daemon aborted.
Here is the traceback.
(gdb) bt
#0 0x0000ffffb02ab960 in __pthread_kill_implementation () from /lib64/libc.so.6
#1 0x0000ffffb025ac00 [PAC] in raise () from /lib64/libc.so.6
#2 0x0000aaaabc453b6c [PAC] in reraise_fatal (signum=signum@entry=6) at /lsandbox/upstream/ceph/src/global/signal_handler.cc:93
#3 0x0000aaaabc455004 in handle_oneshot_fatal_signal (signum=6) at /lsandbox/upstream/ceph/src/global/signal_handler.cc:372
#4 <signal handler called>
#5 0x0000ffffb02ab960 in __pthread_kill_implementation () from /lib64/libc.so.6
#6 0x0000ffffb025ac00 [PAC] in raise () from /lib64/libc.so.6
#7 0x0000ffffb0245988 [PAC] in abort () from /lib64/libc.so.6
#8 0x0000ffffb1678964 [PAC] in ceph::__ceph_abort (file=file@entry=0xffffb1c4bdd0 "/lsandbox/upstream/ceph/src/common/lockdep.cc", line=line@entry=345,
func=func@entry=0xffffb1c4bef0 "int lockdep_will_lock(const char*, int, bool, bool)", msg="abort() called") at /lsandbox/upstream/ceph/src/common/assert.cc:241
#9 0x0000ffffb170cf44 in lockdep_will_lock (name=0xaaaacc4c0300 "cephfs::mirror::fs_mirror", id=<optimized out>, force_backtrace=<optimized out>, recursive=recursive@entry=false)
at /lsandbox/upstream/ceph/src/common/lockdep.cc:345
#10 0x0000ffffb1710ce4 in ceph::mutex_debug_detail::mutex_debugging_base::_will_lock (this=this@entry=0xaaaad123a470, recursive=recursive@entry=false)
at /lsandbox/upstream/ceph/src/common/mutex_debug.cc:50
#11 0x0000aaaabc3feef8 in ceph::mutex_debug_detail::mutex_debug_impl<false>::lock (this=this@entry=0xaaaad123a470, no_lockdep=no_lockdep@entry=false)
at /lsandbox/upstream/ceph/src/common/mutex_debug.h:186
#12 0x0000aaaabc413de0 in std::scoped_lock<ceph::mutex_debug_detail::mutex_debug_impl<false> >::scoped_lock (this=0xffffa93ab6c8, __m=...) at /usr/include/c++/14/mutex:787
#13 cephfs::mirror::FSMirror::set_blocklisted_ts (this=0xaaaad123a400) at /lsandbox/upstream/ceph/src/tools/cephfs_mirror/FSMirror.h:81
#14 0x0000aaaabc413e10 in cephfs::mirror::FSMirror::TimestampListener::set_blocklisted_ts (this=<optimized out>) at /lsandbox/upstream/ceph/src/tools/cephfs_mirror/FSMirror.h:136
#15 0x0000aaaabc41576c in cephfs::mirror::InstanceWatcher::handle_rewatch_complete (this=0xaaaad1f62900, r=<optimized out>) at /lsandbox/upstream/ceph/src/tools/cephfs_mirror/InstanceWatcher.cc:120
#16 0x0000aaaabc446140 in cephfs::mirror::Watcher::handle_rewatch_callback (this=0xaaaad1f62900, r=-108) at /lsandbox/upstream/ceph/src/tools/cephfs_mirror/Watcher.cc:242
#17 0x0000aaaabc446a24 in cephfs::mirror::C_CallbackAdapter<cephfs::mirror::Watcher, &cephfs::mirror::Watcher::handle_rewatch_callback>::finish (this=<optimized out>, r=<optimized out>)
at /lsandbox/upstream/ceph/src/tools/cephfs_mirror/aio_utils.h:29
#18 0x0000aaaabc406920 in Context::complete (this=0xaaaad136a630, r=<optimized out>) at /lsandbox/upstream/ceph/src/include/Context.h:103
#19 0x0000aaaabc40a8c8 in ContextWQ::process (this=0xaaaad11eb300, ctx=<optimized out>) at /lsandbox/upstream/ceph/src/common/WorkQueue.h:569
#20 0x0000aaaabc406b54 in ThreadPool::PointerWQ<Context>::_void_process (this=<optimized out>, item=<optimized out>, handle=...) at /lsandbox/upstream/ceph/src/common/WorkQueue.h:355
#21 0x0000ffffb166454c in ThreadPool::worker (this=0xaaaacc4cac00, wt=<optimized out>) at /lsandbox/upstream/ceph/src/common/WorkQueue.cc:126
#22 0x0000ffffb1666f64 in ThreadPool::WorkThread::entry (this=<optimized out>) at /lsandbox/upstream/ceph/src/common/WorkQueue.h:409
#23 0x0000ffffb1651bd0 in Thread::entry_wrapper (this=0xaaaad11925c0) at /lsandbox/upstream/ceph/src/common/Thread.cc:88
#24 0x0000ffffb1651bf0 in Thread::_entry_func (arg=<optimized out>) at /lsandbox/upstream/ceph/src/common/Thread.cc:75
#25 0x0000ffffb02a9b68 in start_thread () from /lib64/libc.so.6
#26 0x0000ffffb031350c [PAC] in thread_start () from /lib64/libc.so.6
Updated by Kotresh Hiremath Ravishankar about 1 month ago
Analysed the traceback with the help of GitHub Copilot and found that there is indeed a lock ordering issue.
In the traceback, the lock order is as follows:
InstanceWatcher::m_lock ----> FSMirror::m_lock
InstanceWatcher::handle_rewatch_complete() acquires InstanceWatcher::m_lock (line 118)
and calls m_elistener.set_blocklisted_ts(), which tries to acquire FSMirror::m_lock.
The reverse order is as follows:
FSMirror::m_lock -----> InstanceWatcher::m_lock
The reverse order occurs during the mirror asok peer status command: FSMirror::mirror_status(Formatter *f) takes FSMirror::m_lock
and calls is_blocklisted(), which takes InstanceWatcher::m_lock.
Updated by Kotresh Hiremath Ravishankar about 1 month ago
- Subject changed from tools/cephfs_mirror: Mirror daemon aborted due lock ordering issue to tools/cephfs_mirror: Mirror daemon aborted due to lock ordering issue
Updated by Kotresh Hiremath Ravishankar about 1 month ago
- Pull request ID set to 66572
Updated by Kotresh Hiremath Ravishankar about 1 month ago
- Status changed from New to Fix Under Review
Updated by Upkeep Bot 24 days ago
- Status changed from Fix Under Review to Resolved
- Merge Commit set to 81cb89b3931967c0e90a7cb90e213346036db1f5
- Fixed In set to v20.3.0-5627-g81cb89b393
- Upkeep Timestamp set to 2026-02-25T12:26:30+00:00