Bug #71110
cephfs-mirror daemon crashes when cephfs_mirror_max_concurrent_directory_syncs > 1
Description
Similar to bug #65115, but with a different activity signature.
Since upgrading to 18.x, our cephfs-mirror setup repeatedly crashes when concurrency is enabled (cephfs_mirror_max_concurrent_directory_syncs > 1).
We did not have this issue with 17.2.7.
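For context, we raise the setting with something like the following (client.cephfs-mirror is the entity name our mirror daemon happens to authenticate as; adjust it for your deployment):

    # enable concurrent directory syncs for the cephfs-mirror daemon
    # (entity name below is specific to our deployment)
    ceph config set client.cephfs-mirror cephfs_mirror_max_concurrent_directory_syncs 4
    # we then restart the cephfs-mirror daemon so the new value is picked up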
Crash log from the daemon below (last few lines):
-6> 2025-04-28T14:57:34.930+0000 7f864b3b8640 5 cephfs::mirror::PeerReplayer(a5a7ce57-e05d-4955-a0f2-cebb1b4eed25) propagate_deleted_entries: mode matches for entry=sha512
-5> 2025-04-28T14:57:34.930+0000 7f864abb7640 5 cephfs::mirror::PeerReplayer(a5a7ce57-e05d-4955-a0f2-cebb1b4eed25) propagate_deleted_entries: mode matches for entry=sha512
-4> 2025-04-28T14:57:34.932+0000 7f864abb7640 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.5/rpm/el9/BUILD/ceph-18.2.5/src/client/Inode.cc: In function 'bool Inode::put_open_ref(int)' thread 7f864abb7640 time 2025-04-28T14:57:34.932811+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.5/rpm/el9/BUILD/ceph-18.2.5/src/client/Inode.cc: 175: FAILED ceph_assert(ref > 0)
ceph version 18.2.5 (a5b0e13f9c96f3b45f596a95ad098f51ca0ccce1) reef (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x11e) [0x7f8670e25d2c]
2: /usr/lib64/ceph/libceph-common.so.2(+0x16beeb) [0x7f8670e25eeb]
3: /lib64/libcephfs.so.2(+0x111dee) [0x7f8671940dee]
4: /lib64/libcephfs.so.2(+0xd48b6) [0x7f86719038b6]
5: /lib64/libcephfs.so.2(+0xd975b) [0x7f867190875b]
6: /lib64/libcephfs.so.2(+0xce775) [0x7f86718fd775]
7: ceph_closedir()
8: (cephfs::mirror::PeerReplayer::propagate_deleted_entries(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cephfs::mirror::PeerReplayer::FHandles const&)+0x106c) [0x55a75b94f0ac]
9: (cephfs::mirror::PeerReplayer::do_synchronize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> const&, boost::optional<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> >)+0xfb0) [0x55a75b950350]
10: (cephfs::mirror::PeerReplayer::synchronize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> const&, boost::optional<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> >)+0x622) [0x55a75b951d72]
11: (cephfs::mirror::PeerReplayer::do_sync_snaps(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x51e) [0x55a75b9529de]
12: (cephfs::mirror::PeerReplayer::sync_snaps(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_lock<std::mutex>&)+0x63) [0x55a75b953f63]
13: (cephfs::mirror::PeerReplayer::run(cephfs::mirror::PeerReplayer::SnapshotReplayerThread*)+0x4e5) [0x55a75b954785]
14: /usr/bin/cephfs-mirror(+0x44284) [0x55a75b93c284]
15: /lib64/libc.so.6(+0x8a0ca) [0x7f86708f70ca]
16: clone()
-3> 2025-04-28T14:57:34.934+0000 7f864abb7640 -1 *** Caught signal (Aborted) **
in thread 7f864abb7640 thread_name:replayer-0
ceph version 18.2.5 (a5b0e13f9c96f3b45f596a95ad098f51ca0ccce1) reef (stable)
1: /lib64/libc.so.6(+0x3ebf0) [0x7f86708abbf0]
2: /lib64/libc.so.6(+0x8be0c) [0x7f86708f8e0c]
3: raise()
4: abort()
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x178) [0x7f8670e25d86]
6: /usr/lib64/ceph/libceph-common.so.2(+0x16beeb) [0x7f8670e25eeb]
7: /lib64/libcephfs.so.2(+0x111dee) [0x7f8671940dee]
8: /lib64/libcephfs.so.2(+0xd48b6) [0x7f86719038b6]
9: /lib64/libcephfs.so.2(+0xd975b) [0x7f867190875b]
10: /lib64/libcephfs.so.2(+0xce775) [0x7f86718fd775]
11: ceph_closedir()
12: (cephfs::mirror::PeerReplayer::propagate_deleted_entries(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cephfs::mirror::PeerReplayer::FHandles const&)+0x106c) [0x55a75b94f0ac]
13: (cephfs::mirror::PeerReplayer::do_synchronize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> const&, boost::optional<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> >)+0xfb0) [0x55a75b950350]
14: (cephfs::mirror::PeerReplayer::synchronize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> const&, boost::optional<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long> >)+0x622) [0x55a75b951d72]
15: (cephfs::mirror::PeerReplayer::do_sync_snaps(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x51e) [0x55a75b9529de]
16: (cephfs::mirror::PeerReplayer::sync_snaps(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_lock<std::mutex>&)+0x63) [0x55a75b953f63]
17: (cephfs::mirror::PeerReplayer::run(cephfs::mirror::PeerReplayer::SnapshotReplayerThread*)+0x4e5) [0x55a75b954785]
18: /usr/bin/cephfs-mirror(+0x44284) [0x55a75b93c284]
19: /lib64/libc.so.6(+0x8a0ca) [0x7f86708f70ca]
20: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-2> 2025-04-28T14:57:34.944+0000 7f864cbbb640 5 cephfs::mirror::PeerReplayer(a5a7ce57-e05d-4955-a0f2-cebb1b4eed25) run: picked dir_root=/ec7p2/nethome/charlieb
-1> 2025-04-28T14:57:34.961+0000 7f86683f2640 10 monclient: tick
0> 2025-04-28T14:57:34.961+0000 7f86683f2640 10 monclient: _check_auth_tickets
Updated by Venky Shankar 11 months ago
- Category set to Correctness/Safety
- Status changed from New to Triaged
- Assignee set to Jos Collin
- Target version set to v21.0.0
- Source set to Community (user)
- Backport set to tentacle,squid,reef
Jos, please take this one.
Updated by Jos Collin 10 months ago
- Status changed from Triaged to Need More Info
Hi @stuartc_gc,
cephfs_mirror_max_concurrent_directory_syncs > 1 doesn't make much sense, because the default value of cephfs_mirror_max_concurrent_directory_syncs is already '3':
https://github.com/ceph/ceph/blob/a5b0e13f9c96f3b45f596a95ad098f51ca0ccce1/src/common/options/cephfs-mirror.yaml.in#L5
Could you please let us know the exact value that you've used for cephfs_mirror_max_concurrent_directory_syncs?
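For reference, the value actually in effect can be checked with something like the following (the client.cephfs-mirror entity name is only an example; use whatever entity your mirror daemon authenticates as):

    # report the value the cluster configuration database hands to the mirror daemon
    ceph config get client.cephfs-mirror cephfs_mirror_max_concurrent_directory_syncs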
Updated by Stuart Cornell 10 months ago
Jos Collin wrote:
cephfs_mirror_max_concurrent_directory_syncs > 1 doesn't make much sense, because the default value of cephfs_mirror_max_concurrent_directory_syncs is already '3':
https://github.com/ceph/ceph/blob/a5b0e13f9c96f3b45f596a95ad098f51ca0ccce1/src/common/options/cephfs-mirror.yaml.in#L5
Could you please let us know the exact value that you've used for cephfs_mirror_max_concurrent_directory_syncs?
What I mean is that if I use any value greater than 1 for this setting, I get the crash described above. Setting it to 1 allows the daemon to run without crashing, but of course with no concurrency.
Values of 2, 3, 4 and 8 have been tried; all cause crashes.
Updated by Jos Collin 10 months ago
- Status changed from Need More Info to Fix Under Review
- Backport changed from tentacle,squid,reef to reef
- Pull request ID set to 58985
This is happening only in reef at the moment. We already have a PR that fixes this issue in reef and it is in QA state.
Updated by Patrick Donnelly 9 months ago
- Backport changed from reef to tentacle,squid
Updated by Stuart Cornell 2 months ago
Jos Collin wrote in #note-4:
This is happening only in reef at the moment. We already have a PR that fixes this issue in reef and it is in QA state.
Is there any update on when this will be released please?
Updated by Venky Shankar 2 months ago
Stuart Cornell wrote in #note-6:
Jos Collin wrote in #note-4:
This is happening only in reef at the moment. We already have a PR that fixes this issue in reef and it is in QA state.
Is there any update on when this will be released please?
Hi Stuart, thanks for reaching out. Unfortunately, the upstream lab is under relocation and this has delayed point releases a bit.
Would you be able to upgrade to squid, which already has the necessary fixes?
Updated by Stuart Cornell 2 months ago
Venky Shankar wrote in #note-7:
Stuart Cornell wrote in #note-6:
Jos Collin wrote in #note-4:
This is happening only in reef at the moment. We already have a PR that fixes this issue in reef and it is in QA state.
Is there any update on when this will be released please?
Hi Stuart, thanks for reaching out. Unfortunately, the upstream lab is under relocation and this has delayed point releases a bit.
Would you be able to upgrade to squid, which already has the necessary fixes?
That is a definite possibility. We will do some testing to ensure compatibility with our clients.
Thank you.