Bug #70769

cephfs: crash in Finisher

Added by Abhishek Lekshmanan 12 months ago. Updated 4 months ago.

Status: Pending Backport
Priority: Normal
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Community (dev)
Backport: tentacle,squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Tags (freeform): backport_processed
Fixed In: v20.3.0-4185-gb5ebfaf3ad
Released In:
Upkeep Timestamp: 2025-11-17T09:14:38+00:00

Description

We see crashes in the Finisher thread under some workloads on our Quincy cluster.

The crashes arise from a potentially null mdr object. Two example backtraces:

 ceph version 17.2.8-4 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12990) [0x7fd41822d990]
 2: gsignal()
 3: abort()
 4: /lib64/libc.so.6(+0x21d39) [0x7fd416c47d39]
 5: /lib64/libc.so.6(+0x46e86) [0x7fd416c6ce86]
 6: /usr/bin/ceph-mds(+0x2bb455) [0x5570ecca9455]
 7: /usr/bin/ceph-mds(+0x33fb37) [0x5570ecd2db37]
 8: (MDSContext::complete(int)+0x5f) [0x5570ecf01aaf]
 9: (Finisher::finisher_thread_entry()+0x18d) [0x7fd4192d847d]
 10: /lib64/libpthread.so.0(+0x81ca) [0x7fd4182231ca]
 11: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this
/builddir/build/BUILD/ceph-17.2.8/src/mds/Server.cc: 2595: FAILED ceph_assert(!mdr->is_batch_head())

 ceph version 17.2.8-4 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f8d68c1f635]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x26c7fb) [0x7f8d68c1f7fb]
 3: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0x10a5) [0x560171418915]
 4: (MDSContext::complete(int)+0x5f) [0x5601716b4aaf]
 5: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5601713554fd]
 6: (Locker::eval(CInode*, int, bool)+0x3d6) [0x56017156ed66]
 7: (Locker::handle_client_caps(boost::intrusive_ptr<MClientCaps const> const&)+0x2c8d) [0x56017157be4d]
 8: (Locker::dispatch(boost::intrusive_ptr<Message const> const&)+0x1ec) [0x56017157d15c]
 9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5db) [0x560171364ccb]
 10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x56017136530c]
 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x56017134e1ef]
 12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7f8d68ea13b8]
 13: (DispatchQueue::entry()+0x50f) [0x7f8d68e9e7ff]
 14: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f8d68f68ac1]
 15: /lib64/libpthread.so.0(+0x81ca) [0x7f8d67c0a1ca]
 16: clone()

We believe this arises from https://github.com/ceph/ceph/pull/58843, which is a backport of https://github.com/ceph/ceph/pull/57553, itself a refactor of https://github.com/ceph/ceph/pull/56941.
The original PR (https://github.com/ceph/ceph/pull/56941) would not have crashed, since it returned early on a null mdr; the current form passes the mdr on to the finisher, which eventually blows up with a SEGV/SIGABRT depending on the build.

For debug builds, however, https://tracker.ceph.com/issues/70624 would keep this crash from appearing: the mutex issue there indicates another problem hiding underneath, and in most cases we would crash at the mutex assert that precedes the finisher call.


Related issues 2 (1 open, 1 closed)

Copied to CephFS - Backport #73872: squid: cephfs: crash in Finisher (Resolved, Jos Collin)
Copied to CephFS - Backport #73873: tentacle: cephfs: crash in Finisher (QA Testing, Jos Collin)
#2

Updated by Venky Shankar 12 months ago

  • Category set to Correctness/Safety
  • Status changed from New to Fix Under Review
  • Assignee set to Abhishek Lekshmanan
  • Target version set to v20.0.0
  • Source set to Community (dev)
  • Backport set to reef,squid
  • Pull request ID set to 62630
#3

Updated by Venky Shankar 8 months ago

@Abhishek Lekshmanan - https://github.com/ceph/ceph/pull/62630#issuecomment-3097278063

I believe a possible reproducer was shared (somewhere). Could you or @Milind Changire update it here please?

#4

Updated by Milind Changire 8 months ago

Venky Shankar wrote in #note-3:

@Abhishek Lekshmanan - https://github.com/ceph/ceph/pull/62630#issuecomment-3097278063

I believe a possible reproducer was shared (somewhere). Could you or @Milind Changire update it here please?

Abhishek said:
For https://github.com/ceph/ceph/pull/62630 / https://tracker.ceph.com/issues/70769 you asked for a reproducer.

We see these crashes occurring when client eviction happens exactly while a GetAttr op is in progress. For a reproducer, what worked is having two mounts: one creating/deleting a large tree (for example, untar + rm of the Linux kernel) in a tight loop, and another read-only mount running many find -exec stat invocations.
Watching the MDS daemon's in-flight ops and evicting a client while a getattr is in progress should crash the MDS when a lone batch getattr gets queued.

Milind said:
Do I have to manually evict the client using ceph tell mds.0 client evict id=..?

Abhishek said:
Yes, at least for easy reproduction of the issue. In the production case we saw both auto-eviction and manual eviction trigger the crash.
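The reproducer described above can be sketched as a shell script. Everything here is a placeholder for a real test cluster: the mount points, the tarball, the MDS name `mds.a`, and `CLIENT_ID` are assumptions, and the `run` dry-run wrapper (which only echoes commands unless `RUN=1`) is added so the sketch is safe to execute as-is.

```shell
# Dry-run wrapper: echo each command unless RUN=1 is set in the environment.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

# Writer mount: churn a large tree in a tight loop (kernel untar + rm).
run sh -c 'while true; do tar -xf linux.tar -C /mnt/cephfs-rw; rm -rf /mnt/cephfs-rw/linux-*; done'

# Read-only mount: generate a steady stream of getattrs.
run sh -c 'while true; do find /mnt/cephfs-ro -exec stat {} + >/dev/null 2>&1; done'

# Watch in-flight ops, then evict the reading client while a getattr is live.
run ceph daemon mds.a dump_ops_in_flight
run ceph tell mds.0 client evict id=CLIENT_ID
```

`ceph daemon mds.<name> dump_ops_in_flight` is how to spot an in-flight getattr so the eviction can be timed against it.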

#5

Updated by Venky Shankar 4 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from reef,squid to tentacle,squid
#6

Updated by Upkeep Bot 4 months ago

  • Merge Commit set to b5ebfaf3add6737ea3b3b0efe2ad15cceb78bb56
  • Fixed In set to v20.3.0-4185-gb5ebfaf3ad
  • Upkeep Timestamp set to 2025-11-17T09:14:38+00:00
#7

Updated by Upkeep Bot 4 months ago

#8

Updated by Upkeep Bot 4 months ago

#9

Updated by Upkeep Bot 4 months ago

  • Tags (freeform) set to backport_processed
