Bug #70624
Status: Closed
qa: assertion failure on context completion of C_MDS_RetryRequest
Description
2025-03-23T11:18:11.969 INFO:tasks.ceph.mds.b.smithi175.stderr:ceph-mds: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.0.0-658-gafdfc507/rpm/el9/BUILD/ceph-20.0.0-658-gafdfc507/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:195: T& boost::intrusive_ptr<T>::operator*() const [with T = MDRequestImpl]: Assertion `px != 0' failed.
*** Caught signal (Aborted) **
 in thread 7fe0b17d6640 thread_name:mds-rank-fin
 ceph version 20.0.0-658-gafdfc507 (afdfc507cade689e910bdf1c4519de30e576304e) tentacle (dev)
 1: /lib64/libc.so.6(+0x3ea60) [0x7fe0bd23ea60]
 2: /lib64/libc.so.6(+0x8c0fc) [0x7fe0bd28c0fc]
 3: raise()
 4: abort()
 5: /lib64/libc.so.6(+0x2875b) [0x7fe0bd22875b]
 6: /lib64/libc.so.6(+0x376f6) [0x7fe0bd2376f6]
 7: (C_MDS_RetryRequest::complete(int)+0x192) [0x5602d6af6470]
 8: (Finisher::finisher_thread_entry()+0x4e0) [0x7fe0bdbd618a]
 9: /usr/lib64/ceph/libceph-common.so.2(+0x1d6a7b) [0x7fe0bdbd6a7b]
 10: (Thread::entry_wrapper()+0x33) [0x7fe0bdc2bbf5]
 11: (Thread::_entry_func(void*)+0xd) [0x7fe0bdc2bc0b]
 12: /lib64/libc.so.6(+0x8a3b2) [0x7fe0bd28a3b2]
 13: /lib64/libc.so.6(+0x10f430) [0x7fe0bd30f430]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
After investigating, it appears that the MDRequestImpl object referenced by the mdr intrusive_ptr has already been freed by the time the C_MDS_RetryRequest is completed.
Sometimes this issue instead manifests as a failure of ceph_assert(ceph_mutex_is_locked_by_me(mds->mds_lock)) in MDSContext::complete().
Updated by Abhishek Lekshmanan 12 months ago
We see a similar crash with quincy
27: FAILED ceph_assert(((mds->mds_lock).is_locked_by_me()))
ceph version 17.2.8-4-gfdcded4f4fe (fdcded4f4fe652b62c46d2580aa45b936a5ab615) quincy (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x11c) [0x7ff4fb421919]
2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7ff4fb421b20]
3: (MDSContext::complete(int)+0xd1) [0x560ae1635f31]
4: (Finisher::finisher_thread_entry()+0x5e1) [0x7ff4fb3b0359]
5: (Finisher::FinisherThread::entry()+0xd) [0x560ae12c696b]
6: (Thread::entry_wrapper()+0x3f) [0x7ff4fb3f2849]
7: (Thread::_entry_func(void*)+0x9) [0x7ff4fb3f2861]
8: /lib64/libc.so.6(+0x897e2) [0x7ff4fa2897e2]
9: /lib64/libc.so.6(+0x10e800) [0x7ff4fa30e800]
-3> 2025-03-27T15:28:59.513+0100 7ff4f59f6640 10 monclient: handle_mon_command_ack 2 [{"prefix":"osd blocklist", "blocklistop":"add","addr":"188.185.19.81:0/2540936658"}]
-2> 2025-03-27T15:28:59.513+0100 7ff4f59f6640 10 monclient: _finish_command 2 = system:0 blocklisting 188.185.19.81:0/2540936658 until 2025-03-27T16:28:58.570465+0100 (3600 sec)
-1> 2025-03-27T15:28:59.513+0100 7ff4f61f7640 10 monclient: _send_mon_message to mon.c at v2:188.185.19.81:40165/0
0> 2025-03-27T15:28:59.515+0100 7ff4ef9ea640 -1 *** Caught signal (Aborted) **
in thread 7ff4ef9ea640 thread_name:MR_Finisher
ceph version 17.2.8-4-gfdcded4f4fe (fdcded4f4fe652b62c46d2580aa45b936a5ab615) quincy (stable)
1: /lib64/libc.so.6(+0x3e730) [0x7ff4fa23e730]
2: /lib64/libc.so.6(+0x8b52c) [0x7ff4fa28b52c]
3: raise()
4: abort()
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x263) [0x7ff4fb421a60]
6: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7ff4fb421b20]
7: (MDSContext::complete(int)+0xd1) [0x560ae1635f31]
8: (Finisher::finisher_thread_entry()+0x5e1) [0x7ff4fb3b0359]
9: (Finisher::FinisherThread::entry()+0xd) [0x560ae12c696b]
10: (Thread::entry_wrapper()+0x3f) [0x7ff4fb3f2849]
11: (Thread::_entry_func(void*)+0x9) [0x7ff4fb3f2861]
12: /lib64/libc.so.6(+0x897e2) [0x7ff4fa2897e2]
13: /lib64/libc.so.6(+0x10e800) [0x7ff4fa30e800]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Abhishek Lekshmanan 12 months ago
Abhishek Lekshmanan wrote in #note-3:
We see a similar crash with quincy
[...]
created https://tracker.ceph.com/issues/70769
The mutex crash likely masks the BatchGetAttr issue I reported: the assert points at a further potential problem, where MDSContext::complete() expects mds_lock to be held but the caller does not hold it. The BatchGetAttr issue, however, clearly produces null MDRequest objects, which might also explain other crashes. Potential PR (only for the batch issue): https://github.com/ceph/ceph/pull/62630
Updated by Milind Changire 12 months ago · Edited
Okay, so this issue is in the uninline-data path during scrubbing.
The uninline code attempts a rdlock_path_pin_ref() on an inode while uninlining it ... this is essentially bad code on my part.
rdlock_path_pin_ref() attempts the locking and leads to a C_MDS_RetryRequest being queued.
But when the code path reaches Server::dispatch_client_request(), the mdr->client_request field is null.
In short, since we already know which inode we need to uninline, we should just use it directly.
Updated by Milind Changire 12 months ago
- Related to Bug #69953: mds: segmentation faults in recent QA added
Updated by Venky Shankar 12 months ago
- Category set to Correctness/Safety
- Status changed from New to Fix Under Review
- Target version set to v20.0.0
- Source set to Development
- Backport set to reef,squid
Updated by Venky Shankar 12 months ago
- Component(FS) MDS added
- Labels (FS) crash, scrub added
Updated by Venky Shankar 10 months ago
- Status changed from Fix Under Review to Pending Backport
- Target version changed from v20.0.0 to v21.0.0
- Backport changed from reef,squid to reef,squid,tentacle
Updated by Upkeep Bot 10 months ago
- Copied to Backport #71395: tentacle: qa: assertion failure on context completion of C_MDS_RetryRequest added
Updated by Upkeep Bot 10 months ago
- Copied to Backport #71396: squid: qa: assertion failure on context completion of C_MDS_RetryRequest added
Updated by Upkeep Bot 10 months ago
- Copied to Backport #71397: reef: qa: assertion failure on context completion of C_MDS_RetryRequest added
Updated by Upkeep Bot 9 months ago
- Merge Commit set to b4c65dc7fea29c33b70dfe39b154aa68b3a6ec85
- Fixed In set to v20.3.0-442-gb4c65dc7fea
- Upkeep Timestamp set to 2025-07-08T18:06:50+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v20.3.0-442-gb4c65dc7fea to v20.3.0-442-gb4c65dc7fea2
- Upkeep Timestamp changed from 2025-07-08T18:06:50+00:00 to 2025-07-14T15:21:17+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v20.3.0-442-gb4c65dc7fea2 to v20.3.0-442-gb4c65dc7fe
- Upkeep Timestamp changed from 2025-07-14T15:21:17+00:00 to 2025-07-14T20:45:50+00:00
Updated by Venky Shankar about 2 months ago
- Assignee changed from Milind Changire to Mahesh Mohan
- Tags (freeform) changed from backport_processed to temp-assign
@Milind Changire is moving away from CephFS development. Assigning this to @Mahesh Mohan in the interim.
Updated by Upkeep Bot about 2 months ago
- Tags (freeform) changed from temp-assign to temp-assign backport_processed
Updated by Upkeep Bot about 1 month ago
- Status changed from Pending Backport to Resolved
- Upkeep Timestamp changed from 2025-07-14T20:45:50+00:00 to 2026-02-13T13:02:11+00:00