Bug #70769
cephfs: crash in Finisher
Description
We see crashes in the Finisher thread in some workloads in our Quincy cluster.
The crashes arise from a potential null mdr object:

    ceph version 17.2.8-4 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
     1: /lib64/libpthread.so.0(+0x12990) [0x7fd41822d990]
     2: gsignal()
     3: abort()
     4: /lib64/libc.so.6(+0x21d39) [0x7fd416c47d39]
     5: /lib64/libc.so.6(+0x46e86) [0x7fd416c6ce86]
     6: /usr/bin/ceph-mds(+0x2bb455) [0x5570ecca9455]
     7: /usr/bin/ceph-mds(+0x33fb37) [0x5570ecd2db37]
     8: (MDSContext::complete(int)+0x5f) [0x5570ecf01aaf]
     9: (Finisher::finisher_thread_entry()+0x18d) [0x7fd4192d847d]
     10: /lib64/libpthread.so.0(+0x81ca) [0x7fd4182231ca]
     11: clone()
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this

Another example of a crash:

    /builddir/build/BUILD/ceph-17.2.8/src/mds/Server.cc: 2595: FAILED ceph_assert(!mdr->is_batch_head())
    ceph version 17.2.8-4 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
     1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f8d68c1f635]
     2: /usr/lib64/ceph/libceph-common.so.2(+0x26c7fb) [0x7f8d68c1f7fb]
     3: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0x10a5) [0x560171418915]
     4: (MDSContext::complete(int)+0x5f) [0x5601716b4aaf]
     5: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5601713554fd]
     6: (Locker::eval(CInode*, int, bool)+0x3d6) [0x56017156ed66]
     7: (Locker::handle_client_caps(boost::intrusive_ptr<MClientCaps const> const&)+0x2c8d) [0x56017157be4d]
     8: (Locker::dispatch(boost::intrusive_ptr<Message const> const&)+0x1ec) [0x56017157d15c]
     9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5db) [0x560171364ccb]
     10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x56017136530c]
     11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x56017134e1ef]
     12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7f8d68ea13b8]
     13: (DispatchQueue::entry()+0x50f) [0x7f8d68e9e7ff]
     14: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f8d68f68ac1]
     15: /lib64/libpthread.so.0(+0x81ca) [0x7f8d67c0a1ca]
     16: clone()
We believe this arises from https://github.com/ceph/ceph/pull/58843, which is a backport of https://github.com/ceph/ceph/pull/57553, which in turn was a refactor of https://github.com/ceph/ceph/pull/56941.

The original PR (https://github.com/ceph/ceph/pull/56941) would not have crashed, since it returned early on a null mdr; the current form passes the mdr on to the finisher, which blows up at some point with a SEGV or SIGABRT depending on the build.
For debug builds, though, https://tracker.ceph.com/issues/70624 would prevent this crash from appearing: the mutex issue indicates there is another problem hiding underneath, and in most cases we would crash at the mutex assert, which precedes the finisher call.
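To make the failure mode concrete, here is a minimal, self-contained C++ sketch. It is not the actual Ceph code: Request stands in for MDRequestImpl, Finisher stands in for ceph's Finisher thread, and every name in it is hypothetical. It only illustrates why checking the request before queueing (the original PR's pattern) differs from validating it inside the deferred completion (the refactored pattern), where the request may already be dead by the time the finisher runs.

    // Hypothetical sketch of the failure mode; not Ceph code.
    #include <functional>
    #include <iostream>
    #include <memory>
    #include <queue>

    struct Request {                     // stand-in for MDRequestImpl
        bool batch_head = false;
        bool is_batch_head() const { return batch_head; }
    };

    // Queues closures and runs them later, like a finisher thread would.
    struct Finisher {
        std::queue<std::function<void()>> q;
        void queue(std::function<void()> fn) { q.push(std::move(fn)); }
        void drain() { while (!q.empty()) { q.front()(); q.pop(); } }
    };

    int main() {
        Finisher finisher;
        auto mdr = std::make_shared<Request>();

        // Pattern resembling the original PR: validate up front and return
        // early, so nothing stale ever reaches the finisher. (Here mdr is
        // obviously valid; the check is illustrative only.)
        if (!mdr) return 0;

        // Pattern resembling the hazard: capture a raw pointer and defer
        // all validation into the queued completion. If the request is
        // torn down (e.g. its client is evicted) before the finisher runs,
        // the completion dereferences a dead pointer -> SEGV/SIGABRT.
        Request* raw = mdr.get();
        finisher.queue([raw] {
            std::cout << "batch head? " << raw->is_batch_head() << "\n";
        });

        mdr.reset();         // request destroyed before the finisher runs
        // finisher.drain(); // uncommenting this dereferences freed memory

        return 0;
    }

The sketch only shows that the early-return check and the deferred completion observe different lifetimes; the actual fix is in PR 62630 below.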
Updated by Abhishek Lekshmanan 12 months ago
Updated by Venky Shankar 12 months ago
- Category set to Correctness/Safety
- Status changed from New to Fix Under Review
- Assignee set to Abhishek Lekshmanan
- Target version set to v20.0.0
- Source set to Community (dev)
- Backport set to reef,squid
- Pull request ID set to 62630
Updated by Venky Shankar 8 months ago
@Abhishek Lekshmanan - https://github.com/ceph/ceph/pull/62630#issuecomment-3097278063
I believe a possible reproducer was shared (somewhere). Could you or @Milind Changire update it here please?
Updated by Milind Changire 8 months ago
Venky Shankar wrote in #note-3:
> @Abhishek Lekshmanan - https://github.com/ceph/ceph/pull/62630#issuecomment-3097278063
> I believe a possible reproducer was shared (somewhere). Could you or @Milind Changire update it here please?
Abhishek said:
For https://github.com/ceph/ceph/pull/62630 / https://tracker.ceph.com/issues/70769 you asked for a reproducer.
We see these crashes occurring when client eviction happens exactly while a GetAttr op is in progress. As a reproducer, what worked was having two mounts: one creating/deleting a large tree (for example, untar + rm of the Linux kernel) in a tight loop, and another read-only mount running many find -exec stat invocations.
Watching the MDS daemon's in-flight ops and evicting a client while a getattr is in progress should crash the MDS when a lone batch getattr gets queued (see the sketch after this exchange).
Milind said:
Do I have to manually evict the client using: ceph tell mds.0 client evict id=.. ?
Abhishek said:
Yes, at least for easy reproduction of the issue. In the production case we saw both auto-eviction and manual eviction triggering the crash.
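Putting the steps above together, a hypothetical end-to-end sketch of the reproducer might look like the following. The mount points, tarball path, and client id are assumptions, and the MDS commands assume the usual admin interface (ops dumps in-flight operations, client ls lists sessions; client evict is quoted from the discussion above):

    # Terminal 1: churn a large tree in a tight loop on a read-write mount.
    cd /mnt/cephfs-rw
    while true; do
        tar xf /tmp/linux.tar.xz
        rm -rf linux-*
    done

    # Terminal 2: stat everything repeatedly from a read-only mount.
    cd /mnt/cephfs-ro
    while true; do
        find . -exec stat {} \; >/dev/null 2>&1
    done

    # Terminal 3: watch MDS in-flight ops and evict the stat client while
    # a getattr is in progress.
    ceph tell mds.0 ops                   # look for a getattr client_request
    ceph tell mds.0 client ls             # find the read-only client's id
    ceph tell mds.0 client evict id=<id>  # evict while getattr is in flight

Since auto-eviction also triggered the crash in production, the manual evict timed against a visible getattr in the ops dump is just the most deterministic variant.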
Updated by Venky Shankar 4 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from reef,squid to tentacle,squid
Updated by Upkeep Bot 4 months ago
- Merge Commit set to b5ebfaf3add6737ea3b3b0efe2ad15cceb78bb56
- Fixed In set to v20.3.0-4185-gb5ebfaf3ad
- Upkeep Timestamp set to 2025-11-17T09:14:38+00:00
Updated by Upkeep Bot 4 months ago
- Copied to Backport #73872: squid: cephfs: crash in Finisher added
Updated by Upkeep Bot 4 months ago
- Copied to Backport #73873: tentacle: cephfs: crash in Finisher added