Bug #67270
Segmentation fault in mon during scrub op
Status: Closed
Description
/a/yuriw-2024-07-24_13:26:20-rados-wip-yuri13-testing-2024-07-23-1049-squid-distro-default-smithi/7816741
2024-07-25T02:14:38.652 INFO:tasks.workunit.client.0.smithi077.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1978: expect_failure: return 0
2024-07-25T02:14:38.652 INFO:tasks.workunit.client.0.smithi077.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/test.sh:1189: test_mon_mon: ceph mon set election_strategy disallow
2024-07-25T02:14:39.016 INFO:tasks.ceph.mon.a.smithi077.stderr:*** Caught signal (Segmentation fault) **
2024-07-25T02:14:39.016 INFO:tasks.ceph.mon.a.smithi077.stderr: in thread 7fd9b82e0640 thread_name:ms_dispatch
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: ceph version 19.0.0-3644-g0bbb8f86 (0bbb8f86319e30091fba4403956eefdcfed3c0a4) squid (rc)
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fd9c067a520]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 2: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x10ac) [0x56151c2596ec]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 3: (Monitor::scrub()+0x286) [0x56151c249676]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 4: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x567) [0x56151c24cef7]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xea4) [0x56151c251b14]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 6: (Monitor::_ms_dispatch(Message*)+0x42f) [0x56151c25250f]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 7: ceph-mon(+0x267b8e) [0x56151c207b8e]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 8: (DispatchQueue::entry()+0x58a) [0x7fd9c11d39aa]
2024-07-25T02:14:39.019 INFO:tasks.ceph.mon.a.smithi077.stderr: 9: /usr/lib/ceph/libceph-common.so.2(+0x426ec1) [0x7fd9c126dec1]
2024-07-25T02:14:39.019 INFO:tasks.ceph.mon.a.smithi077.stderr: 10: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7fd9c06ccb43]
2024-07-25T02:14:39.019 INFO:tasks.ceph.mon.a.smithi077.stderr: 11: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7fd9c075ea00]
This job produced a coredump. I wasn't able to tell much from it, but this is the bt:
[Current thread is 1 (LWP 24412)]
(gdb) bt
#0 0x00007fd9c06cea7c in __pthread_mutex_lock_full (mutex=0x5f5c) at ./nptl/pthread_mutex_lock.c:497
#1 0x00007fd9c108f342 in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::vector<entity_addrvec_t, std::allocator<entity_addrvec_t> >, std::allocator<std::vector<entity_addrvec_t, std::allocator<entity_addrvec_t> > >, (__gnu_cxx::_Lock_policy)2> >::deallocate (this=<optimized out>, __t=1, __p=<optimized out>) at /usr/include/c++/11/ext/new_allocator.h:132
#2 std::allocator<std::_Rb_tree_node<int> >::deallocate (__n=1, __p=<optimized out>, this=0x7fd9b82d8270) at /usr/include/c++/11/bits/allocator.h:199
#3 std::allocator_traits<std::allocator<std::_Rb_tree_node<int> > >::deallocate (__n=1, __p=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:496
#4 std::_Rb_tree<int, int, std::_Identity<int>, std::less<int>, std::allocator<int> >::_M_put_node (__p=<optimized out>, this=0x7fd9b82d8270) at /usr/include/c++/11/bits/stl_tree.h:565
#5 std::_Rb_tree<int, int, std::_Identity<int>, std::less<int>, std::allocator<int> >::_M_drop_node (__p=<optimized out>, this=0x7fd9b82d8270) at /usr/include/c++/11/bits/stl_tree.h:632
#6 std::_Rb_tree<int, int, std::_Identity<int>, std::less<int>, std::allocator<int> >::_M_erase (this=0x7fd9b82d8270, __x=<optimized out>) at /usr/include/c++/11/bits/stl_tree.h:1891
#7 0x00007fd9c12db319 in std::_Rb_tree<unsigned long, unsigned long, std::_Identity<unsigned long>, std::less<unsigned long>, std::allocator<unsigned long> >::~_Rb_tree (this=<optimized out>,
this=<optimized out>) at /usr/include/c++/11/bits/stl_tree.h:984
#8 std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >::~set (this=<optimized out>, this=<optimized out>) at /usr/include/c++/11/bits/stl_set.h:281
#9 ceph::logging::Log::dump_recent (this=0x56151d3e2b40) at ./src/log/Log.cc:537
#10 0x0000000000000000 in ?? ()
Updated by Laura Flores over 1 year ago
Looks similar to this crash reported in telemetry:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?var-sig_v2=b4840ac22ee48f2058747692077c96240007883d420668e71feb22149d8bd03b&orgId=1
Updated by MOHIT AGRAWAL over 1 year ago · Edited
- Assignee set to MOHIT AGRAWAL
As the stack trace below shows, *start is invalid. start points into scrub_state (it is the address of scrub_state->last_key), and it appears scrub_state was reset via bootstrap at the same time the handle_scrub operation was in progress.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=11, threadid=140573074654784) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=11, threadid=140573074654784) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140573074654784, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3 0x00007fd9c067a476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4 0x000056151c45765d in reraise_fatal (signum=11) at global/./src/global/signal_handler.cc:88
#5 handle_oneshot_fatal_signal (signum=11) at global/./src/global/signal_handler.cc:367
#6 <signal handler called>
#7 std::operator<< <char, std::char_traits<char>, std::allocator<char> > (__str=..., __os=..., __os=..., __str=...)
at /usr/include/c++/11/bits/basic_string.h:194
#8 std::operator<< <std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (v={...}, out=...) at mon/./src/include/types.h:131
#9 Monitor::_scrub (this=0x56151e200000, r=0x56151e2f1078, start=0x8, num_keys=0x7fd9b82dc2ac) at mon/./src/mon/Monitor.cc:5672
#10 0x000056151c249676 in Monitor::scrub (this=0x56151e200000) at mon/./src/mon/Monitor.cc:5591
#11 0x000056151c24cef7 in Monitor::handle_scrub (this=0x56151e200000, op=...) at mon/./src/mon/Monitor.cc:5654
#12 0x000056151c251b14 in Monitor::dispatch_op (this=0x56151e200000, op=...) at mon/./src/mon/Monitor.cc:4716
#13 0x000056151c25250f in Monitor::_ms_dispatch (this=0x56151e200000, m=0x56151ec86b40) at mon/./src/mon/Monitor.cc:4533
#14 0x000056151c207b8e in Monitor::ms_dispatch (m=0x56151ec86b40, this=0x56151e200000) at mon/./src/mon/Monitor.h:952
#15 Dispatcher::ms_dispatch2 (this=0x56151e200000, m=...) at mon/./src/msg/Dispatcher.h:130
#16 0x00007fd9c11d39aa in DispatchQueue::entry() () from /usr/lib/ceph/libceph-common.so.2
#17 0x00007fd9c126dec1 in ?? () from /usr/lib/ceph/libceph-common.so.2
#18 0x00007fd9c06ccb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#19 0x00007fd9c075ea00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) f 9
#9 Monitor::_scrub (this=0x56151e200000, r=0x56151e2f1078, start=0x8, num_keys=0x7fd9b82dc2ac) at mon/./src/mon/Monitor.cc:5672
5672 mon/./src/mon/Monitor.cc: No such file or directory.
(gdb) p start
$1 = (std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > *) 0x8
(gdb) f 10
#10 0x000056151c249676 in Monitor::scrub (this=0x56151e200000) at mon/./src/mon/Monitor.cc:5591
5591 mon/./src/mon/Monitor.cc: No such file or directory.
(gdb) p scrub_state
$2 = std::shared_ptr<Monitor::ScrubState> (empty) = {get() = 0x0}
The logs also validate this. Thread 7fd9b82e0640 reports that it is flushing a pending paxos write at line 267498:
267498 -160> 2024-07-25T02:14:39.011+0000 7fd9b82e0640 10 mon.a@0(leader) e1 wait_for_paxos_write flushing pending write
Before the flush completes, the other thread (7fd9b5adb640) calls scrub_reset at line 267584:
267584 -87> 2024-07-25T02:14:39.011+0000 7fd9b5adb640 10 mon.a@0(probing) e2 scrub_reset
Thread 7fd9b82e0640 then reports that the flush finished:
267670 -1> 2024-07-25T02:14:39.011+0000 7fd9b82e0640 10 mon.a@0(probing) e2 wait_for_paxos_write flushed pending write
and crashes while trying to access scrub_state->last_key.
I need to debug it more to find the root cause.
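For illustration, here is a minimal standalone reduction of the suspected race, using simplified hypothetical types rather than the actual Monitor code: the dispatch thread captures a pointer into scrub_state before blocking on the paxos flush, a second thread resets scrub_state while it waits, and the resumed thread dereferences the now-dangling pointer. (With a null shared_ptr, &scrub_state->last_key evaluates to the member's offset, matching the start=0x8 seen in the backtrace.)

// Hypothetical reduction of the race (not the actual Ceph code).
#include <condition_variable>
#include <iostream>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

struct ScrubState {
  std::pair<std::string, std::string> last_key;  // stands in for Monitor::ScrubState
};

std::shared_ptr<ScrubState> scrub_state = std::make_shared<ScrubState>();
std::mutex m;
std::condition_variable cv;
bool flushed = false;

// Dispatch thread: stands in for handle_scrub() -> scrub() -> _scrub().
void handle_scrub() {
  // The address of last_key is taken from scrub_state and passed down,
  // like the `start` argument of Monitor::_scrub().
  auto* start = &scrub_state->last_key;

  // Stands in for wait_for_paxos_write(): block until the write flushes.
  std::unique_lock<std::mutex> lk(m);
  cv.wait(lk, [] { return flushed; });

  // scrub_reset() may have destroyed ScrubState while we were blocked,
  // so `start` can dangle here; dereferencing it mirrors the segfault.
  std::cout << "resuming scrub after " << start->first << "\n";
}

// Bootstrap/election-triggered thread: stands in for scrub_reset().
void scrub_reset() {
  scrub_state.reset();  // drops the only reference; ScrubState is destroyed
  {
    std::lock_guard<std::mutex> lk(m);
    flushed = true;  // the pending paxos write "flushes"
  }
  cv.notify_all();
}

int main() {
  std::thread dispatch(handle_scrub);
  std::thread reset(scrub_reset);
  dispatch.join();
  reset.join();
}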
Updated by Laura Flores over 1 year ago · Edited
@MOHIT AGRAWAL can you share how you got a better backtrace than I did in gdb? I wonder if I was missing some debug info.
I installed only the ceph-mon debug package.
Updated by Radoslaw Zarzynski over 1 year ago
scrub note: work in progress, Mohit is looking into possible fixes.
Updated by MOHIT AGRAWAL over 1 year ago
- Status changed from New to Fix Under Review
- Pull request ID set to 59050
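Continuing the simplified sketch from the comment above, one common defensive pattern for this class of race is to re-validate scrub_state under the lock after the wait, instead of capturing a pointer into it beforehand. This is illustrative only, not a claim about what the pull request actually does:

// Hypothetical fixed dispatch-thread routine for the sketch above
// (assumes scrub_reset() also resets scrub_state while holding m).
void handle_scrub_fixed() {
  std::unique_lock<std::mutex> lk(m);
  cv.wait(lk, [] { return flushed; });

  // Re-check the shared state after waking: it may have been reset
  // by a bootstrap/election while this thread was blocked.
  if (!scrub_state) {
    std::cout << "scrub_state reset during wait, aborting scrub\n";
    return;
  }
  auto* start = &scrub_state->last_key;  // safe: verified non-null under m
  std::cout << "resuming scrub after " << start->first << "\n";
}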
Updated by Laura Flores 8 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Laura Flores 8 months ago
/a/yuriw-2025-06-30_19:22:49-rados-squid-release-distro-default-smithi/8359455
Updated by Upkeep Bot 8 months ago
- Copied to Backport #72095: squid: Segmentation fault in mon during scrub op added
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 2c8bfcf2094ab2af352757ac4fc2813c27c26020
- Fixed In set to v19.3.0-7008-g2c8bfcf2094
- Upkeep Timestamp set to 2025-07-11T20:39:30+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-7008-g2c8bfcf2094 to v19.3.0-7008-g2c8bfcf209
- Upkeep Timestamp changed from 2025-07-11T20:39:30+00:00 to 2025-07-14T20:44:21+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~1307
- Upkeep Timestamp changed from 2025-07-14T20:44:21+00:00 to 2025-11-01T00:59:18+00:00
Updated by MOHIT AGRAWAL 3 months ago
- Status changed from Pending Backport to Closed
The pull request has been merged.