Bug #68853
cephfs-mirror daemon is crashing when running in multi-threaded mode.
Description
The backtraces vary, but in general this looks like open file descriptors getting mixed up.
Observed when there are multiple directories to sync and cephfs_mirror_max_concurrent_directory_syncs > 1 (set to 32 in our case).
The issue started to appear after the upgrade from v17.2.7 to 18.2.4, so apparently this is a regression.
The most common backtrace/assertion looks like:

Oct 31 10:37:36 ceph-mds03 cephfs-mirror[3155735]: /home/abuild/rpmbuild/BUILD/ceph-18.2.4/src/client/Inode.cc: In function 'bool Inode::put_open_ref(int)' thread 7efd387b5700 tim>
Oct 31 10:37:36 ceph-mds03 cephfs-mirror[3155735]: /home/abuild/rpmbuild/BUILD/ceph-18.2.4/src/client/Inode.cc: 175: FAILED ceph_assert(ref > 0)

Thread 76 "replayer-27" received signal SIGABRT, Aborted.
[Switching to Thread 0x7efd387b5700 (LWP 3155835)]
0x00007efd5f319d2b in raise () from target:/lib64/libc.so.6
(gdb) bt
#0 0x00007efd5f319d2b in raise () from target:/lib64/libc.so.6
#1 0x00007efd5f31b3e5 in abort () from target:/lib64/libc.so.6
#2 0x00007efd5f9defd1 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () from target:/usr/lib64/ceph/libceph-common.so.2
#3 0x00007efd5f9df117 in ceph::__ceph_assert_fail(ceph::assert_data const&) () from target:/usr/lib64/ceph/libceph-common.so.2
#4 0x00007efd605cfc6e in Inode::put_open_ref(int) () from target:/usr/lib64/libcephfs.so.2
#5 0x00007efd605767ab in Client::_release_fh(Fh*) () from target:/usr/lib64/libcephfs.so.2
#6 0x00007efd60576b96 in Client::_close(int) () from target:/usr/lib64/libcephfs.so.2
#7 0x00007efd60577046 in Client::_closedir(dir_result_t*) () from target:/usr/lib64/libcephfs.so.2
#8 0x00007efd605774d2 in Client::closedir(dir_result_t*) () from target:/usr/lib64/libcephfs.so.2
#9 0x000056534c2f790d in cephfs::mirror::PeerReplayer::propagate_deleted_entries(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cephfs::mirror::PeerReplayer::FHandles const&) ()
#10 0x000056534c2f8adc in cephfs::mirror::PeerReplayer::do_synchronize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> const&, boost::optional<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> >) ()
#11 0x000056534c2fa361 in cephfs::mirror::PeerReplayer::synchronize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> const&, boost::optional<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> >) ()
#12 0x000056534c2fb135 in cephfs::mirror::PeerReplayer::do_sync_snaps(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#13 0x000056534c2fc79e in cephfs::mirror::PeerReplayer::sync_snaps(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_lock<std::mutex>&) ()
#14 0x000056534c2fd12d in cephfs::mirror::PeerReplayer::run(cephfs::mirror::PeerReplayer::SnapshotReplayerThread*) ()
#15 0x000056534c302a30 in cephfs::mirror::PeerReplayer::SnapshotReplayerThread::entry() ()
#16 0x00007efd5f7406ea in start_thread () from target:/lib64/libpthread.so.0
#17 0x00007efd5f3e758f in clone () from target:/lib64/libc.so.6
Updated by Jos Collin over 1 year ago
core dump file and logs uploaded via ceph-post-file, tag 416b4810-0275-4ac6-a9db-04171960bf0a
Updated by Venky Shankar over 1 year ago
- Category set to Correctness/Safety
- Status changed from New to Triaged
- Target version set to v20.0.0
- Source set to Community (dev)
- Backport set to quincy,reef,squid
Updated by Venky Shankar over 1 year ago
Igor Fedotov wrote:
There could be different back traces but generally this looks like a mess with open file descriptors.
Observed when there are multiple directories to sync and cephfs_mirror_max_concurrent_directory_syncs > 1 (32 has been configured in our case).
The issue started to appear after upgrade from v17.2.7 to 18.2.4 so apparently this is a regression.
Possibly a bug in the client library related to modifying ref counts without locks.
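For illustration only, here is a toy Python model of a per-mode open reference counter. It is not the libcephfs code: ToyInode, open_by_mode, and release_twice are hypothetical names. It just shows that any unbalanced release, whether it comes from an unlocked update or from an extra close, would trip a check shaped like the ref > 0 assertion in the backtrace.

```python
from collections import defaultdict


class ToyInode:
    """Hypothetical stand-in for an inode that counts open handles per mode."""

    def __init__(self):
        # mode -> number of open handles currently holding a reference
        self.open_by_mode = defaultdict(int)

    def get_open_ref(self, mode):
        self.open_by_mode[mode] += 1

    def put_open_ref(self, mode):
        # Same shape as the failed check in the report: ref must be > 0.
        if self.open_by_mode[mode] <= 0:
            raise AssertionError("FAILED assert(ref > 0)")
        self.open_by_mode[mode] -= 1
        # True when the last reference for this mode was dropped.
        return self.open_by_mode[mode] == 0


def release_twice():
    """Take one reference, then release it twice."""
    inode = ToyInode()
    inode.get_open_ref(mode=1)
    first = inode.put_open_ref(mode=1)  # balanced release
    try:
        inode.put_open_ref(mode=1)      # unbalanced release
    except AssertionError as exc:
        return first, str(exc)
    return first, None
```

In this model the first release succeeds and the second one fails the check, which is the same symptom as the abort in Inode::put_open_ref above.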
Updated by Jos Collin over 1 year ago
- Status changed from Triaged to In Progress
Hi Igor,
I have checked your branch again. It lacks https://github.com/ceph/ceph/pull/58985 and you're applying a follow up fix (Bad file descriptor).
Could you please get all the necessary reef patches from https://github.com/pulls?q=is%3Aopen+is%3Apr+author%3Ajoscollin+archived%3Afalse+milestone%3Areef and check this issue again?
Updated by Igor Fedotov over 1 year ago
Jos Collin wrote in #note-6:
Hi Igor,
I have checked your branch again. It lacks https://github.com/ceph/ceph/pull/58985 and you're applying a follow up fix (Bad file descriptor).
Could you please get all the necessary reef patches from https://github.com/pulls?q=is%3Aopen+is%3Apr+author%3Ajoscollin+archived%3Afalse+milestone%3Areef and check this issue again?
Hi Jos,
Before applying this fairly long list of changes on a production cluster (which is a bit scary), I'd like to get your feedback on the following PR: https://github.com/ceph/ceph/pull/60667
IMO a redundant ceph_close() call could be the culprit behind the file descriptor mess. Not yet tested; hopefully I will be able to do that tomorrow.
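To make the suspected failure mode concrete, here is a minimal Python sketch that uses plain POSIX descriptors instead of libcephfs. It assumes the client hands out descriptor numbers in a similar lowest-free fashion, which I have not verified, and demo_redundant_close is a hypothetical helper. The two "workers" run sequentially in one thread so the outcome is deterministic.

```python
import errno
import os


def demo_redundant_close():
    """Show how a second close of a stale fd number can hit another owner's fd."""
    # Worker A opens a descriptor and closes it once (the legitimate close).
    fd_a = os.open(os.devnull, os.O_RDONLY)
    os.close(fd_a)

    # Worker B opens next. POSIX returns the lowest free number, so B is
    # expected to receive the number that A just released.
    fd_b = os.open(os.devnull, os.O_RDONLY)
    reused = fd_b == fd_a

    # Worker A's redundant second close now lands on B's descriptor.
    try:
        os.close(fd_a)
    except OSError:
        pass  # the number was not reused, so the stale close just fails

    # Worker B tries to use its descriptor and records the error, if any.
    try:
        os.fstat(fd_b)
        victim_errno = None
        os.close(fd_b)  # still valid, so clean up
    except OSError as exc:
        victim_errno = exc.errno
    return reused, victim_errno
```

With several replayer threads opening and closing descriptors concurrently, the stale number is likely to be reused by another thread between the two closes, which would be consistent with the "Bad file descriptor" errors and the varying backtraces.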
Updated by Jos Collin over 1 year ago
- Assignee changed from Jos Collin to Igor Fedotov
- Pull request ID set to 60667
Updated by Jos Collin over 1 year ago
- Status changed from In Progress to Fix Under Review
Updated by Venky Shankar over 1 year ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #69244: reef: cephfs-mirror daemon is crashing when running in multi-threaded mode. added
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #69245: squid: cephfs-mirror daemon is crashing when running in multi-threaded mode. added
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #69246: quincy: cephfs-mirror daemon is crashing when running in multi-threaded mode. added
Updated by Upkeep Bot over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Jos Collin over 1 year ago
@Venky Shankar Shall we drop the quincy backport for this, since the quincy tests are failing? There are no further quincy releases either.
Updated by Upkeep Bot 9 months ago
- Merge Commit set to 23d38a1a4cf61245e68d4349f404d4753ac71f46
- Fixed In set to v19.3.0-6431-g23d38a1a4cf
- Upkeep Timestamp set to 2025-07-08T18:34:16+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-6431-g23d38a1a4cf to v19.3.0-6431-g23d38a1a4cf6
- Upkeep Timestamp changed from 2025-07-08T18:34:16+00:00 to 2025-07-14T15:44:21+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-6431-g23d38a1a4cf6 to v19.3.0-6431-g23d38a1a4c
- Upkeep Timestamp changed from 2025-07-14T15:44:21+00:00 to 2025-07-14T21:09:06+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~1501
- Upkeep Timestamp changed from 2025-07-14T21:09:06+00:00 to 2025-11-01T01:01:36+00:00