Bug #70769
cephfs: crash in Finisher
Description
We see crashes in the Finisher thread in some workloads in our Quincy cluster.
The crashes arise from a potential null mdr object:

    ceph version 17.2.8-4 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
     1: /lib64/libpthread.so.0(+0x12990) [0x7fd41822d990]
     2: gsignal()
     3: abort()
     4: /lib64/libc.so.6(+0x21d39) [0x7fd416c47d39]
     5: /lib64/libc.so.6(+0x46e86) [0x7fd416c6ce86]
     6: /usr/bin/ceph-mds(+0x2bb455) [0x5570ecca9455]
     7: /usr/bin/ceph-mds(+0x33fb37) [0x5570ecd2db37]
     8: (MDSContext::complete(int)+0x5f) [0x5570ecf01aaf]
     9: (Finisher::finisher_thread_entry()+0x18d) [0x7fd4192d847d]
     10: /lib64/libpthread.so.0(+0x81ca) [0x7fd4182231ca]
     11: clone()
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this

Another example of a crash:

    /builddir/build/BUILD/ceph-17.2.8/src/mds/Server.cc: 2595: FAILED ceph_assert(!mdr->is_batch_head())
    ceph version 17.2.8-4 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
     1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f8d68c1f635]
     2: /usr/lib64/ceph/libceph-common.so.2(+0x26c7fb) [0x7f8d68c1f7fb]
     3: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0x10a5) [0x560171418915]
     4: (MDSContext::complete(int)+0x5f) [0x5601716b4aaf]
     5: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5601713554fd]
     6: (Locker::eval(CInode*, int, bool)+0x3d6) [0x56017156ed66]
     7: (Locker::handle_client_caps(boost::intrusive_ptr<MClientCaps const> const&)+0x2c8d) [0x56017157be4d]
     8: (Locker::dispatch(boost::intrusive_ptr<Message const> const&)+0x1ec) [0x56017157d15c]
     9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5db) [0x560171364ccb]
     10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x56017136530c]
     11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x56017134e1ef]
     12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7f8d68ea13b8]
     13: (DispatchQueue::entry()+0x50f) [0x7f8d68e9e7ff]
     14: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f8d68f68ac1]
     15: /lib64/libpthread.so.0(+0x81ca) [0x7f8d67c0a1ca]
     16: clone()
We believe this arises from https://github.com/ceph/ceph/pull/58843, which is a backport of https://github.com/ceph/ceph/pull/57553, which in turn was a refactor of https://github.com/ceph/ceph/pull/56941.

The original PR (https://github.com/ceph/ceph/pull/56941) would not have crashed, since it returned early on a null mdr; the current form passes the mdr on to the finisher, which blows up at some point with a SEGV or SIGABRT depending on the build.
For debug builds, though, https://tracker.ceph.com/issues/70624 would prevent this crash from appearing: the mutex issue indicates there is another problem hiding underneath, and in most cases we would crash at the mutex assert, which precedes the finisher call.
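To make the failure mode concrete, here is a minimal, self-contained C++ sketch. It is not the actual Ceph code: Request stands in for MDRequestImpl, Finisher stands in for ceph's Finisher thread, and every name in it is hypothetical. It only illustrates why checking the request before queueing (the original PR's pattern) differs from validating it inside the deferred completion (the refactored pattern), where the request may already be dead by the time the finisher runs.

    // Hypothetical sketch of the failure mode; not Ceph code.
    #include <functional>
    #include <iostream>
    #include <memory>
    #include <queue>

    struct Request {                     // stand-in for MDRequestImpl
        bool batch_head = false;
        bool is_batch_head() const { return batch_head; }
    };

    // Queues closures and runs them later, like a finisher thread would.
    struct Finisher {
        std::queue<std::function<void()>> q;
        void queue(std::function<void()> fn) { q.push(std::move(fn)); }
        void drain() { while (!q.empty()) { q.front()(); q.pop(); } }
    };

    int main() {
        Finisher finisher;
        auto mdr = std::make_shared<Request>();

        // Pattern resembling the original PR: validate up front and return
        // early, so nothing stale ever reaches the finisher. (Here mdr is
        // obviously valid; the check is illustrative only.)
        if (!mdr) return 0;

        // Pattern resembling the hazard: capture a raw pointer and defer
        // all validation into the queued completion. If the request is
        // torn down (e.g. its client is evicted) before the finisher runs,
        // the completion dereferences a dead pointer -> SEGV/SIGABRT.
        Request* raw = mdr.get();
        finisher.queue([raw] {
            std::cout << "batch head? " << raw->is_batch_head() << "\n";
        });

        mdr.reset();         // request destroyed before the finisher runs
        // finisher.drain(); // uncommenting this dereferences freed memory

        return 0;
    }

The sketch only shows that the early-return check and the deferred completion observe different lifetimes; the actual fix is in PR 62630 below.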
Updated by Abhishek Lekshmanan 12 months ago
Updated by Venky Shankar 12 months ago
- Category set to Correctness/Safety
- Status changed from New to Fix Under Review
- Assignee set to Abhishek Lekshmanan
- Target version set to v20.0.0
- Source set to Community (dev)
- Backport set to reef,squid
- Pull request ID set to 62630
Updated by Venky Shankar 8 months ago
@Abhishek Lekshmanan - https://github.com/ceph/ceph/pull/62630#issuecomment-3097278063
I believe a possible reproducer was shared (somewhere). Could you or @Milind Changire update it here please?
Updated by Milind Changire 8 months ago
Venky Shankar wrote in #note-3:
> @Abhishek Lekshmanan - https://github.com/ceph/ceph/pull/62630#issuecomment-3097278063
> I believe a possible reproducer was shared (somewhere). Could you or @Milind Changire update it here please?
Abhishek said:
For https://github.com/ceph/ceph/pull/62630 / https://tracker.ceph.com/issues/70769 you asked for a reproducer.
We see these crashes occurring when client eviction happens exactly while a GetAttr op is in progress. As a reproducer, what worked was having two mounts: one creating/deleting a large tree (for example, untar + rm of the Linux kernel) in a tight loop, and another read-only mount running many find -exec stat invocations.
Watching the MDS daemon's in-flight ops and evicting a client while a getattr is in progress should crash the MDS when a lone batch getattr gets queued (see the sketch after this exchange).
Milind said:
Do I have to manually evict the client using: ceph tell mds.0 client evict id=.. ?
Abhishek said:
Yes, at least for easy reproduction of the issue. In the production case we saw both auto-eviction and manual eviction triggering the crash.
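Putting the steps above together, a hypothetical end-to-end sketch of the reproducer might look like the following. The mount points, tarball path, and client id are assumptions, and the MDS commands assume the usual admin interface (ops dumps in-flight operations, client ls lists sessions; client evict is quoted from the discussion above):

    # Terminal 1: churn a large tree in a tight loop on a read-write mount.
    cd /mnt/cephfs-rw
    while true; do
        tar xf /tmp/linux.tar.xz
        rm -rf linux-*
    done

    # Terminal 2: stat everything repeatedly from a read-only mount.
    cd /mnt/cephfs-ro
    while true; do
        find . -exec stat {} \; >/dev/null 2>&1
    done

    # Terminal 3: watch MDS in-flight ops and evict the stat client while
    # a getattr is in progress.
    ceph tell mds.0 ops                   # look for a getattr client_request
    ceph tell mds.0 client ls             # find the read-only client's id
    ceph tell mds.0 client evict id=<id>  # evict while getattr is in flight

Since auto-eviction also triggered the crash in production, the manual evict timed against a visible getattr in the ops dump is just the most deterministic variant.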
Updated by Venky Shankar 4 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from reef,squid to tentacle,squid
Updated by Upkeep Bot 4 months ago
- Merge Commit set to b5ebfaf3add6737ea3b3b0efe2ad15cceb78bb56
- Fixed In set to v20.3.0-4185-gb5ebfaf3ad
- Upkeep Timestamp set to 2025-11-17T09:14:38+00:00
Updated by Upkeep Bot 4 months ago
- Copied to Backport #73872: squid: cephfs: crash in Finisher added
Updated by Upkeep Bot 4 months ago
- Copied to Backport #73873: tentacle: cephfs: crash in Finisher added