Bug #70624

closed

qa: assertion failure on context completion of C_MDS_RetryRequest

Added by Milind Changire 12 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Development
Backport:
reef,squid,tentacle
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash, scrub
Pull request ID:
Tags (freeform):
temp-assign backport_processed
Fixed In:
v20.3.0-442-gb4c65dc7fe
Released In:
Upkeep Timestamp:
2026-02-13T13:02:11+00:00

Description

http://qa-proxy.ceph.com/teuthology/mchangir-2025-03-23_08:42:43-fs:functional-wip-mchangir-segfault-fixes-final-debug-distro-default-smithi/8204139/

2025-03-23T11:18:11.969 INFO:tasks.ceph.mds.b.smithi175.stderr:ceph-mds: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.0.0-658-gafdfc507/rpm/el9/BUILD/ceph-20.0.0-658-gafdfc507/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:195: T& boost::intrusive_ptr<T>::operator*() const [with T = MDRequestImpl]: Assertion `px != 0' failed.
2025-03-23T11:18:11.970 INFO:tasks.ceph.mds.b.smithi175.stderr:*** Caught signal (Aborted) **
2025-03-23T11:18:11.970 INFO:tasks.ceph.mds.b.smithi175.stderr: in thread 7fe0b17d6640 thread_name:mds-rank-fin
2025-03-23T11:18:11.970 INFO:tasks.ceph.mds.b.smithi175.stderr: ceph version 20.0.0-658-gafdfc507 (afdfc507cade689e910bdf1c4519de30e576304e) tentacle (dev)
2025-03-23T11:18:11.970 INFO:tasks.ceph.mds.b.smithi175.stderr: 1: /lib64/libc.so.6(+0x3ea60) [0x7fe0bd23ea60]
2025-03-23T11:18:11.970 INFO:tasks.ceph.mds.b.smithi175.stderr: 2: /lib64/libc.so.6(+0x8c0fc) [0x7fe0bd28c0fc]
2025-03-23T11:18:11.970 INFO:tasks.ceph.mds.b.smithi175.stderr: 3: raise()
2025-03-23T11:18:11.970 INFO:tasks.ceph.mds.b.smithi175.stderr: 4: abort()
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr: 5: /lib64/libc.so.6(+0x2875b) [0x7fe0bd22875b]
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr: 6: /lib64/libc.so.6(+0x376f6) [0x7fe0bd2376f6]
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr: 7: (C_MDS_RetryRequest::complete(int)+0x192) [0x5602d6af6470]
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr: 8: (Finisher::finisher_thread_entry()+0x4e0) [0x7fe0bdbd618a]
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr: 9: /usr/lib64/ceph/libceph-common.so.2(+0x1d6a7b) [0x7fe0bdbd6a7b]
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr: 10: (Thread::entry_wrapper()+0x33) [0x7fe0bdc2bbf5]
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr: 11: (Thread::_entry_func(void*)+0xd) [0x7fe0bdc2bc0b]
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr: 12: /lib64/libc.so.6(+0x8a3b2) [0x7fe0bd28a3b2]
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr: 13: /lib64/libc.so.6(+0x10f430) [0x7fe0bd30f430]
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr:2025-03-23T11:18:11.968+0000 7fe0b17d6640 -1 *** Caught signal (Aborted) **
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr: in thread 7fe0b17d6640 thread_name:mds-rank-fin
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr:
2025-03-23T11:18:11.971 INFO:tasks.ceph.mds.b.smithi175.stderr: ceph version 20.0.0-658-gafdfc507 (afdfc507cade689e910bdf1c4519de30e576304e) tentacle (dev)
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 1: /lib64/libc.so.6(+0x3ea60) [0x7fe0bd23ea60]
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 2: /lib64/libc.so.6(+0x8c0fc) [0x7fe0bd28c0fc]
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 3: raise()
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 4: abort()
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 5: /lib64/libc.so.6(+0x2875b) [0x7fe0bd22875b]
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 6: /lib64/libc.so.6(+0x376f6) [0x7fe0bd2376f6]
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 7: (C_MDS_RetryRequest::complete(int)+0x192) [0x5602d6af6470]
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 8: (Finisher::finisher_thread_entry()+0x4e0) [0x7fe0bdbd618a]
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 9: /usr/lib64/ceph/libceph-common.so.2(+0x1d6a7b) [0x7fe0bdbd6a7b]
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 10: (Thread::entry_wrapper()+0x33) [0x7fe0bdc2bbf5]
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 11: (Thread::_entry_func(void*)+0xd) [0x7fe0bdc2bc0b]
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 12: /lib64/libc.so.6(+0x8a3b2) [0x7fe0bd28a3b2]
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: 13: /lib64/libc.so.6(+0x10f430) [0x7fe0bd30f430]
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2025-03-23T11:18:11.972 INFO:tasks.ceph.mds.b.smithi175.stderr:

After investigating, it appears that the object referenced by the mdr is already gone by the time the C_MDS_RetryRequest context is completed.

Sometimes, this issue also manifests as an assertion failure of ceph_assert(ceph_mutex_is_locked_by_me(mds->mds_lock)); in MDSContext::complete().
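
To illustrate the first assertion above (`px != 0` in `boost::intrusive_ptr<T>::operator*()`), here is a minimal sketch of a retry context whose request pointer is null by the time it is completed. `Request` and `RetryContext` are hypothetical stand-ins for `MDRequestImpl` and `C_MDS_RetryRequest`, and `std::shared_ptr` is swapped in for `boost::intrusive_ptr` only to keep the example dependency-free. It shows the failure mode, not the fix that was merged.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>

// Hypothetical stand-in for MDRequestImpl; not the real Ceph type.
struct Request {
  std::string name;
};

// Hypothetical stand-in for C_MDS_RetryRequest. std::shared_ptr replaces
// boost::intrusive_ptr purely to avoid a Boost dependency in this sketch.
struct RetryContext {
  std::shared_ptr<Request> mdr;

  // Returns the request name if the request is still referenced, or
  // std::nullopt when the pointer is null. Dereferencing without this
  // check is the kind of access that trips the `px != 0` assertion.
  std::optional<std::string> complete() const {
    if (!mdr) {
      return std::nullopt;
    }
    return mdr->name;
  }
};
```

A context built with a live request returns its name, while one built with a null pointer returns `std::nullopt` instead of aborting.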


Related issues 4 (1 open, 3 closed)

Related to CephFS - Bug #69953: mds: segmentation faults in recent QA (Pending Backport, Mahesh Mohan)
Copied to CephFS - Backport #71395: tentacle: qa: assertion failure on context completion of C_MDS_RetryRequest (Resolved, Mahesh Mohan)
Copied to CephFS - Backport #71396: squid: qa: assertion failure on context completion of C_MDS_RetryRequest (Rejected, Milind Changire)
Copied to CephFS - Backport #71397: reef: qa: assertion failure on context completion of C_MDS_RetryRequest (Rejected, Milind Changire)
Actions #1

Updated by Milind Changire 12 months ago

  • Description updated (diff)
Actions #2

Updated by Milind Changire 12 months ago

  • Assignee set to Milind Changire
Actions #3

Updated by Abhishek Lekshmanan 12 months ago

We see a similar crash with quincy

27: FAILED ceph_assert(((mds->mds_lock).is_locked_by_me()))

 ceph version 17.2.8-4-gfdcded4f4fe (fdcded4f4fe652b62c46d2580aa45b936a5ab615) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x11c) [0x7ff4fb421919]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7ff4fb421b20]
 3: (MDSContext::complete(int)+0xd1) [0x560ae1635f31]
 4: (Finisher::finisher_thread_entry()+0x5e1) [0x7ff4fb3b0359]
 5: (Finisher::FinisherThread::entry()+0xd) [0x560ae12c696b]
 6: (Thread::entry_wrapper()+0x3f) [0x7ff4fb3f2849]
 7: (Thread::_entry_func(void*)+0x9) [0x7ff4fb3f2861]
 8: /lib64/libc.so.6(+0x897e2) [0x7ff4fa2897e2]
 9: /lib64/libc.so.6(+0x10e800) [0x7ff4fa30e800]

    -3> 2025-03-27T15:28:59.513+0100 7ff4f59f6640 10 monclient: handle_mon_command_ack 2 [{"prefix":"osd blocklist", "blocklistop":"add","addr":"188.185.19.81:0/2540936658"}]
    -2> 2025-03-27T15:28:59.513+0100 7ff4f59f6640 10 monclient: _finish_command 2 = system:0 blocklisting 188.185.19.81:0/2540936658 until 2025-03-27T16:28:58.570465+0100 (3600 sec)
    -1> 2025-03-27T15:28:59.513+0100 7ff4f61f7640 10 monclient: _send_mon_message to mon.c at v2:188.185.19.81:40165/0
     0> 2025-03-27T15:28:59.515+0100 7ff4ef9ea640 -1 *** Caught signal (Aborted) **
 in thread 7ff4ef9ea640 thread_name:MR_Finisher

 ceph version 17.2.8-4-gfdcded4f4fe (fdcded4f4fe652b62c46d2580aa45b936a5ab615) quincy (stable)
 1: /lib64/libc.so.6(+0x3e730) [0x7ff4fa23e730]
 2: /lib64/libc.so.6(+0x8b52c) [0x7ff4fa28b52c]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x263) [0x7ff4fb421a60]
 6: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7ff4fb421b20]
 7: (MDSContext::complete(int)+0xd1) [0x560ae1635f31]
 8: (Finisher::finisher_thread_entry()+0x5e1) [0x7ff4fb3b0359]
 9: (Finisher::FinisherThread::entry()+0xd) [0x560ae12c696b]
 10: (Thread::entry_wrapper()+0x3f) [0x7ff4fb3f2849] 
 11: (Thread::_entry_func(void*)+0x9) [0x7ff4fb3f2861]
 12: /lib64/libc.so.6(+0x897e2) [0x7ff4fa2897e2]
 13: /lib64/libc.so.6(+0x10e800) [0x7ff4fa30e800]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #4

Updated by Abhishek Lekshmanan 12 months ago

Abhishek Lekshmanan wrote in #note-3:

We see a similar crash with quincy

[...]

Created https://tracker.ceph.com/issues/70769

The mutex crash likely prevents us from seeing the issue I reported with BatchGetAttr. The mutex assert points to one more potential issue: MDSContext expects the MDS lock to be held, but the caller does not hold it. The BatchGetAttr issue, however, clearly produces nullptr MDRequest objects, which might also explain other crashes. Potential PR (only for the batch issue): https://github.com/ceph/ceph/pull/62630
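
As a rough illustration of the lock-ownership assertion, the sketch below models a context completed on a separate finisher thread that checks whether the calling thread holds a lock. `OwnedMutex` and `finisher_sees_lock` are hypothetical names invented for this example and are not Ceph APIs; the real check is the `is_locked_by_me()` assertion shown in the backtrace.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical owner-tracking mutex: records which thread holds it so a
// caller can ask "is this locked by me?".
class OwnedMutex {
 public:
  void lock() {
    m_.lock();
    owner_.store(std::this_thread::get_id());
  }
  void unlock() {
    owner_.store(std::thread::id{});
    m_.unlock();
  }
  bool is_locked_by_me() const {
    return owner_.load() == std::this_thread::get_id();
  }

 private:
  std::mutex m_;
  std::atomic<std::thread::id> owner_{std::thread::id{}};
};

// Runs the ownership check on a separate "finisher" thread and reports the
// result. take_lock selects whether that thread acquires the lock first.
bool finisher_sees_lock(OwnedMutex& mu, bool take_lock) {
  bool held = false;
  std::thread finisher([&] {
    if (take_lock) {
      std::lock_guard<OwnedMutex> guard(mu);
      held = mu.is_locked_by_me();
    } else {
      held = mu.is_locked_by_me();  // nobody took the lock on this thread
    }
  });
  finisher.join();
  return held;
}
```

In this model the check fails unless the finisher thread itself takes the lock before completing the context, which matches the shape of the crash reported here.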

Actions #5

Updated by Milind Changire 12 months ago · Edited

Okay, so this issue is in the uninline data path during scrubbing.
The uninline code attempts a rdlock_path_pin_ref() on an inode while attempting to uninline it; this is essentially bad code on my part.
rdlock_path_pin_ref() attempts to do the locking, which leads to a C_MDS_RetryRequest being attempted.
But when the code path reaches Server::dispatch_client_request(), the mdr->client_request field is null.

In short, since we already know which inode we need to uninline, we should just use it.
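
The reasoning above can be sketched with a toy model. Everything here is hypothetical (`Request`, `Inode`, `uninline_via_path`, `uninline_via_inode` are illustration-only names, not Ceph code), and it only captures the shape of the problem: an internal request has no client message, so a retry that goes through the client dispatch path cannot succeed, whereas using the already-known inode avoids that path.

```cpp
#include <cassert>
#include <memory>
#include <string>

// All types and functions here are hypothetical stand-ins, not Ceph code.
struct ClientRequest { std::string path; };
struct Inode { unsigned long ino; };

struct Request {
  std::shared_ptr<ClientRequest> client_request;  // null for internal requests
  Inode* in = nullptr;                            // inode the internal op already knows
};

enum class Outcome { Done, Failed };

// Path-based approach: if the lock is not immediately available, the request
// is retried through a dispatch path that assumes a client request exists.
Outcome uninline_via_path(const Request& r, bool lock_available) {
  if (lock_available) return Outcome::Done;
  return r.client_request ? Outcome::Done : Outcome::Failed;
}

// Direct approach: use the inode the caller already holds, so no path
// traversal and no retry through the client dispatch path is involved.
Outcome uninline_via_inode(const Request& r) {
  return r.in ? Outcome::Done : Outcome::Failed;
}
```

With an internal request (no `client_request`), the path-based variant fails only when a retry is needed, which would be consistent with the crash being intermittent; the inode-based variant does not depend on the retry at all.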

Actions #6

Updated by Milind Changire 12 months ago

  • Pull request ID set to 62684
Actions #7

Updated by Milind Changire 12 months ago

  • Related to Bug #69953: mds: segmentation faults in recent QA added
Actions #8

Updated by Venky Shankar 12 months ago

  • Category set to Correctness/Safety
  • Status changed from New to Fix Under Review
  • Target version set to v20.0.0
  • Source set to Development
  • Backport set to reef,squid
Actions #9

Updated by Venky Shankar 12 months ago

  • Component(FS) MDS added
  • Labels (FS) crash, scrub added
Actions #10

Updated by Venky Shankar 10 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Target version changed from v20.0.0 to v21.0.0
  • Backport changed from reef,squid to reef,squid,tentacle
Actions #11

Updated by Upkeep Bot 10 months ago

  • Copied to Backport #71395: tentacle: qa: assertion failure on context completion of C_MDS_RetryRequest added
Actions #12

Updated by Upkeep Bot 10 months ago

  • Copied to Backport #71396: squid: qa: assertion failure on context completion of C_MDS_RetryRequest added
Actions #13

Updated by Upkeep Bot 10 months ago

  • Copied to Backport #71397: reef: qa: assertion failure on context completion of C_MDS_RetryRequest added
Actions #14

Updated by Upkeep Bot 10 months ago

  • Tags (freeform) set to backport_processed
Actions #15

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to b4c65dc7fea29c33b70dfe39b154aa68b3a6ec85
  • Fixed In set to v20.3.0-442-gb4c65dc7fea
  • Upkeep Timestamp set to 2025-07-08T18:06:50+00:00
Actions #16

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v20.3.0-442-gb4c65dc7fea to v20.3.0-442-gb4c65dc7fea2
  • Upkeep Timestamp changed from 2025-07-08T18:06:50+00:00 to 2025-07-14T15:21:17+00:00
Actions #17

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v20.3.0-442-gb4c65dc7fea2 to v20.3.0-442-gb4c65dc7fe
  • Upkeep Timestamp changed from 2025-07-14T15:21:17+00:00 to 2025-07-14T20:45:50+00:00
Actions #18

Updated by Venky Shankar about 2 months ago

  • Assignee changed from Milind Changire to Mahesh Mohan
  • Tags (freeform) changed from backport_processed to temp-assign

@Milind Changire is moving away from CephFS development. Assigning this to @Mahesh Mohan in the interim.

Actions #19

Updated by Upkeep Bot about 2 months ago

  • Tags (freeform) changed from temp-assign to temp-assign backport_processed
Actions #20

Updated by Upkeep Bot about 1 month ago

  • Status changed from Pending Backport to Resolved
  • Upkeep Timestamp changed from 2025-07-14T20:45:50+00:00 to 2026-02-13T13:02:11+00:00