Bug #68853

open

cephfs-mirror daemon is crashing when running in multi-threaded mode.

Added by Igor Fedotov over 1 year ago. Updated 5 months ago.

Status: Pending Backport
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Community (dev)
Backport: quincy,reef,squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): cephfs-mirror
Labels (FS):
Pull request ID:
Tags (freeform): backport_processed
Fixed In: v19.3.0-6431-g23d38a1a4c
Released In: v20.2.0~1501
Upkeep Timestamp: 2025-11-01T01:01:36+00:00

Description

The backtraces vary, but in general this looks like a mix-up of open file descriptors.
Observed when there are multiple directories to sync and cephfs_mirror_max_concurrent_directory_syncs > 1 (32 is configured in our case).

The issue started to appear after the upgrade from v17.2.7 to 18.2.4, so apparently this is a regression.

The most common backtrace/assertion looks like:
Oct 31 10:37:36 ceph-mds03 cephfs-mirror[3155735]: /home/abuild/rpmbuild/BUILD/ceph-18.2.4/src/client/Inode.cc: In function 'bool Inode::put_open_ref(int)' thread 7efd387b5700 tim>
Oct 31 10:37:36 ceph-mds03 cephfs-mirror[3155735]: /home/abuild/rpmbuild/BUILD/ceph-18.2.4/src/client/Inode.cc: 175: FAILED ceph_assert(ref > 0)

Thread 76 "replayer-27" received signal SIGABRT, Aborted.
[Switching to Thread 0x7efd387b5700 (LWP 3155835)]
0x00007efd5f319d2b in raise () from target:/lib64/libc.so.6
(gdb) bt
#0  0x00007efd5f319d2b in raise () from target:/lib64/libc.so.6
#1  0x00007efd5f31b3e5 in abort () from target:/lib64/libc.so.6
#2  0x00007efd5f9defd1 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () from target:/usr/lib64/ceph/libceph-common.so.2
#3  0x00007efd5f9df117 in ceph::__ceph_assert_fail(ceph::assert_data const&) () from target:/usr/lib64/ceph/libceph-common.so.2
#4  0x00007efd605cfc6e in Inode::put_open_ref(int) () from target:/usr/lib64/libcephfs.so.2
#5  0x00007efd605767ab in Client::_release_fh(Fh*) () from target:/usr/lib64/libcephfs.so.2
#6  0x00007efd60576b96 in Client::_close(int) () from target:/usr/lib64/libcephfs.so.2
#7  0x00007efd60577046 in Client::_closedir(dir_result_t*) () from target:/usr/lib64/libcephfs.so.2
#8  0x00007efd605774d2 in Client::closedir(dir_result_t*) () from target:/usr/lib64/libcephfs.so.2
#9  0x000056534c2f790d in cephfs::mirror::PeerReplayer::propagate_deleted_entries(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cephfs::mirror::PeerReplayer::FHandles const&) ()
#10 0x000056534c2f8adc in cephfs::mirror::PeerReplayer::do_synchronize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> const&, boost::optional<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> >) ()
#11 0x000056534c2fa361 in cephfs::mirror::PeerReplayer::synchronize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> const&, boost::optional<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> >) ()
#12 0x000056534c2fb135 in cephfs::mirror::PeerReplayer::do_sync_snaps(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#13 0x000056534c2fc79e in cephfs::mirror::PeerReplayer::sync_snaps(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_lock<std::mutex>&) ()
#14 0x000056534c2fd12d in cephfs::mirror::PeerReplayer::run(cephfs::mirror::PeerReplayer::SnapshotReplayerThread*) ()
#15 0x000056534c302a30 in cephfs::mirror::PeerReplayer::SnapshotReplayerThread::entry() ()
#16 0x00007efd5f7406ea in start_thread () from target:/lib64/libpthread.so.0
#17 0x00007efd5f3e758f in clone () from target:/lib64/libc.so.6

Related issues 3 (1 open, 2 closed)

Copied to CephFS - Backport #69244: reef: cephfs-mirror daemon is crashing when running in multi-threaded mode. (In Progress, Jos Collin)
Copied to CephFS - Backport #69245: squid: cephfs-mirror daemon is crashing when running in multi-threaded mode. (Resolved, Igor Fedotov)
Copied to CephFS - Backport #69246: quincy: cephfs-mirror daemon is crashing when running in multi-threaded mode. (Rejected, Igor Fedotov)
Actions #1

Updated by Jos Collin over 1 year ago

  • Assignee set to Jos Collin
Actions #2

Updated by Jos Collin over 1 year ago

Core dump file and logs uploaded via ceph-post-file, tag 416b4810-0275-4ac6-a9db-04171960bf0a

Actions #3

Updated by Venky Shankar over 1 year ago

  • Description updated (diff)
Actions #4

Updated by Venky Shankar over 1 year ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Target version set to v20.0.0
  • Source set to Community (dev)
  • Backport set to quincy,reef,squid
Actions #5

Updated by Venky Shankar over 1 year ago

Igor Fedotov wrote:

The backtraces vary, but in general this looks like a mix-up of open file descriptors.
Observed when there are multiple directories to sync and cephfs_mirror_max_concurrent_directory_syncs > 1 (32 is configured in our case).

The issue started to appear after the upgrade from v17.2.7 to 18.2.4, so apparently this is a regression.

Possibly a bug in the client library related to modifying ref counts without locks.
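
For illustration only, here is a minimal Python sketch of the invariant behind the failed ceph_assert(ref > 0). It is a toy model, not the libcephfs code: the class OpenRefCounter, its lock, and the helper run_balanced are hypothetical names introduced here. It shows that when every increment and decrement happens under one lock and every put has a matching get, the counter returns to zero, and that a single unmatched put trips the same kind of assertion.

```python
import threading


class OpenRefCounter:
    """Toy per-mode open-reference counter (hypothetical, not Ceph code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._refs = {}  # open mode -> number of outstanding opens

    def get_open_ref(self, mode):
        # Take one reference for the given open mode.
        with self._lock:
            self._refs[mode] = self._refs.get(mode, 0) + 1

    def put_open_ref(self, mode):
        """Drop one reference; return True when the last one is gone."""
        with self._lock:
            ref = self._refs.get(mode, 0)
            # Same invariant as the assertion in the backtrace: ref > 0.
            if ref <= 0:
                raise AssertionError("put_open_ref without a matching get")
            self._refs[mode] = ref - 1
            return self._refs[mode] == 0

    def count(self, mode):
        with self._lock:
            return self._refs.get(mode, 0)


def balanced_worker(counter, mode, rounds):
    # Each get is paired with exactly one put.
    for _ in range(rounds):
        counter.get_open_ref(mode)
        counter.put_open_ref(mode)


def run_balanced(threads=8, rounds=1000, mode=1):
    """Run several balanced workers concurrently and return the counter."""
    counter = OpenRefCounter()
    workers = [
        threading.Thread(target=balanced_worker, args=(counter, mode, rounds))
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter
```

In this model, an extra put_open_ref() after run_balanced() raises AssertionError, which is the toy analogue of the abort seen in the replayer thread. Whether the real crash comes from unlocked updates or from an unbalanced release is exactly what this comment leaves open.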

Actions #6

Updated by Jos Collin over 1 year ago

  • Status changed from Triaged to In Progress

Hi Igor,

I have checked your branch again. It lacks https://github.com/ceph/ceph/pull/58985 and you're applying a follow-up fix (Bad file descriptor).
Could you please get all the necessary reef patches from https://github.com/pulls?q=is%3Aopen+is%3Apr+author%3Ajoscollin+archived%3Afalse+milestone%3Areef and check this issue again?

Actions #7

Updated by Igor Fedotov over 1 year ago

Jos Collin wrote in #note-6:

Hi Igor,

I have checked your branch again. It lacks https://github.com/ceph/ceph/pull/58985 and you're applying a follow-up fix (Bad file descriptor).
Could you please get all the necessary reef patches from https://github.com/pulls?q=is%3Aopen+is%3Apr+author%3Ajoscollin+archived%3Afalse+milestone%3Areef and check this issue again?

Hi Jos,
Before applying this fairly long list of changes on a production cluster (which is a bit scary), I'd like to get your feedback on the following PR: https://github.com/ceph/ceph/pull/60667
IMO a redundant ceph_close() call could be the culprit behind the file descriptor mess. Not yet tested; hopefully I will be able to do that tomorrow.
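
To make the suspected failure mode concrete, here is a small Python sketch of how a redundant close can disturb another thread's descriptor. It is a toy model under stated assumptions, not the cephfs-mirror or libcephfs code: ToyFdTable and redundant_close_scenario are hypothetical names, the table is assumed to hand out the lowest free descriptor number, and close returns a negative errno on failure. The interleaving of the two replayers is written out sequentially so the outcome is deterministic.

```python
import errno


class ToyFdTable:
    """Hypothetical fd table that hands out the lowest free number."""

    def __init__(self):
        self._open = {}  # fd number -> owner label

    def open(self, owner):
        # Assumption: the lowest unused number is reused first.
        fd = 0
        while fd in self._open:
            fd += 1
        self._open[fd] = owner
        return fd

    def close(self, fd):
        # Return 0 on success, -EBADF if the number is not open.
        if fd not in self._open:
            return -errno.EBADF
        del self._open[fd]
        return 0

    def owner(self, fd):
        return self._open.get(fd)


def redundant_close_scenario():
    """Replay one interleaving of two replayers sharing the table."""
    table = ToyFdTable()
    fd_a = table.open("replayer-A")
    first = table.close(fd_a)         # A's legitimate close
    fd_b = table.open("replayer-B")   # B is handed the same number
    second = table.close(fd_a)        # A's redundant close hits B's handle
    third = table.close(fd_b)         # B's own close now fails
    return fd_a, fd_b, first, second, third
```

In this model the redundant close succeeds silently because the number has been reused, and the error only surfaces later in the other replayer as -EBADF ("Bad file descriptor"). With a single sync thread the number is not reused between the two closes, so the second close would simply fail and nothing else is affected.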

Actions #8

Updated by Jos Collin over 1 year ago

  • Assignee changed from Jos Collin to Igor Fedotov
  • Pull request ID set to 60667
Actions #9

Updated by Jos Collin over 1 year ago

  • Status changed from In Progress to Fix Under Review
Actions #10

Updated by Venky Shankar over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #11

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #69244: reef: cephfs-mirror daemon is crashing when running in multi-threaded mode. added
Actions #12

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #69245: squid: cephfs-mirror daemon is crashing when running in multi-threaded mode. added
Actions #13

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #69246: quincy: cephfs-mirror daemon is crashing when running in multi-threaded mode. added
Actions #14

Updated by Upkeep Bot over 1 year ago

  • Tags (freeform) set to backport_processed
Actions #15

Updated by Jos Collin over 1 year ago

@Venky Shankar Shall we drop the quincy backport for this, as the quincy tests are failing? There are no further quincy releases either.

Actions #16

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to 23d38a1a4cf61245e68d4349f404d4753ac71f46
  • Fixed In set to v19.3.0-6431-g23d38a1a4cf
  • Upkeep Timestamp set to 2025-07-08T18:34:16+00:00
Actions #17

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-6431-g23d38a1a4cf to v19.3.0-6431-g23d38a1a4cf6
  • Upkeep Timestamp changed from 2025-07-08T18:34:16+00:00 to 2025-07-14T15:44:21+00:00
Actions #18

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-6431-g23d38a1a4cf6 to v19.3.0-6431-g23d38a1a4c
  • Upkeep Timestamp changed from 2025-07-14T15:44:21+00:00 to 2025-07-14T21:09:06+00:00
Actions #19

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~1501
  • Upkeep Timestamp changed from 2025-07-14T21:09:06+00:00 to 2025-11-01T01:01:36+00:00