Bug #67270

Segmentation fault in mon during scrub op

Added by Laura Flores over 1 year ago. Updated 3 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-7008-g2c8bfcf209
Released In:
v20.2.0~1307
Upkeep Timestamp:
2025-11-01T00:59:18+00:00

Description

/a/yuriw-2024-07-24_13:26:20-rados-wip-yuri13-testing-2024-07-23-1049-squid-distro-default-smithi/7816741

2024-07-25T02:14:38.652 INFO:tasks.workunit.client.0.smithi077.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1978: expect_failure:  return 0
2024-07-25T02:14:38.652 INFO:tasks.workunit.client.0.smithi077.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/test.sh:1189: test_mon_mon:  ceph mon set election_strategy disallow
2024-07-25T02:14:39.016 INFO:tasks.ceph.mon.a.smithi077.stderr:*** Caught signal (Segmentation fault) **
2024-07-25T02:14:39.016 INFO:tasks.ceph.mon.a.smithi077.stderr: in thread 7fd9b82e0640 thread_name:ms_dispatch
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: ceph version 19.0.0-3644-g0bbb8f86 (0bbb8f86319e30091fba4403956eefdcfed3c0a4) squid (rc)
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fd9c067a520]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 2: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x10ac) [0x56151c2596ec]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 3: (Monitor::scrub()+0x286) [0x56151c249676]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 4: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x567) [0x56151c24cef7]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xea4) [0x56151c251b14]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 6: (Monitor::_ms_dispatch(Message*)+0x42f) [0x56151c25250f]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 7: ceph-mon(+0x267b8e) [0x56151c207b8e]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 8: (DispatchQueue::entry()+0x58a) [0x7fd9c11d39aa]
2024-07-25T02:14:39.019 INFO:tasks.ceph.mon.a.smithi077.stderr: 9: /usr/lib/ceph/libceph-common.so.2(+0x426ec1) [0x7fd9c126dec1]
2024-07-25T02:14:39.019 INFO:tasks.ceph.mon.a.smithi077.stderr: 10: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7fd9c06ccb43]
2024-07-25T02:14:39.019 INFO:tasks.ceph.mon.a.smithi077.stderr: 11: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7fd9c075ea00]

This job produced a coredump. I wasn't able to tell much from it, but this is the backtrace:

[Current thread is 1 (LWP 24412)]
(gdb) bt
#0  0x00007fd9c06cea7c in __pthread_mutex_lock_full (mutex=0x5f5c) at ./nptl/pthread_mutex_lock.c:497
#1  0x00007fd9c108f342 in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::vector<entity_addrvec_t, std::allocator<entity_addrvec_t> >, std::allocator<std::vector<entity_addrvec_t, std::allocator<entity_addrvec_t> > >, (__gnu_cxx::_Lock_policy)2> >::deallocate (this=<optimized out>, __t=1, __p=<optimized out>) at /usr/include/c++/11/ext/new_allocator.h:132
#2  std::allocator<std::_Rb_tree_node<int> >::deallocate (__n=1, __p=<optimized out>, this=0x7fd9b82d8270) at /usr/include/c++/11/bits/allocator.h:199
#3  std::allocator_traits<std::allocator<std::_Rb_tree_node<int> > >::deallocate (__n=1, __p=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:496
#4  std::_Rb_tree<int, int, std::_Identity<int>, std::less<int>, std::allocator<int> >::_M_put_node (__p=<optimized out>, this=0x7fd9b82d8270) at /usr/include/c++/11/bits/stl_tree.h:565
#5  std::_Rb_tree<int, int, std::_Identity<int>, std::less<int>, std::allocator<int> >::_M_drop_node (__p=<optimized out>, this=0x7fd9b82d8270) at /usr/include/c++/11/bits/stl_tree.h:632
#6  std::_Rb_tree<int, int, std::_Identity<int>, std::less<int>, std::allocator<int> >::_M_erase (this=0x7fd9b82d8270, __x=<optimized out>) at /usr/include/c++/11/bits/stl_tree.h:1891
#7  0x00007fd9c12db319 in std::_Rb_tree<unsigned long, unsigned long, std::_Identity<unsigned long>, std::less<unsigned long>, std::allocator<unsigned long> >::~_Rb_tree (this=<optimized out>, 
    this=<optimized out>) at /usr/include/c++/11/bits/stl_tree.h:984
#8  std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >::~set (this=<optimized out>, this=<optimized out>) at /usr/include/c++/11/bits/stl_set.h:281
#9  ceph::logging::Log::dump_recent (this=0x56151d3e2b40) at ./src/log/Log.cc:537
#10 0x0000000000000000 in ?? ()


Related issues 1 (1 open, 0 closed)

Copied to RADOS - Backport #72095: squid: Segmentation fault in mon during scrub op (New, assignee: MOHIT AGRAWAL)
Actions #2

Updated by MOHIT AGRAWAL over 1 year ago · Edited

  • Assignee set to MOHIT AGRAWAL

As the stacktrace below shows, *start is invalid. It is part of scrub_state
(scrub_state->last_key), and it appears scrub_state was reset via bootstrap
while the handle_scrub operation was in progress.

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=140573074654784) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=140573074654784) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140573074654784, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3  0x00007fd9c067a476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  0x000056151c45765d in reraise_fatal (signum=11) at global/./src/global/signal_handler.cc:88
#5  handle_oneshot_fatal_signal (signum=11) at global/./src/global/signal_handler.cc:367
#6  <signal handler called>
#7  std::operator<< <char, std::char_traits<char>, std::allocator<char> > (__str=..., __os=..., __os=..., __str=...)
    at /usr/include/c++/11/bits/basic_string.h:194
#8  std::operator<< <std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (v={...}, out=...) at mon/./src/include/types.h:131
#9  Monitor::_scrub (this=0x56151e200000, r=0x56151e2f1078, start=0x8, num_keys=0x7fd9b82dc2ac) at mon/./src/mon/Monitor.cc:5672
#10 0x000056151c249676 in Monitor::scrub (this=0x56151e200000) at mon/./src/mon/Monitor.cc:5591
#11 0x000056151c24cef7 in Monitor::handle_scrub (this=0x56151e200000, op=...) at mon/./src/mon/Monitor.cc:5654
#12 0x000056151c251b14 in Monitor::dispatch_op (this=0x56151e200000, op=...) at mon/./src/mon/Monitor.cc:4716
#13 0x000056151c25250f in Monitor::_ms_dispatch (this=0x56151e200000, m=0x56151ec86b40) at mon/./src/mon/Monitor.cc:4533
#14 0x000056151c207b8e in Monitor::ms_dispatch (m=0x56151ec86b40, this=0x56151e200000) at mon/./src/mon/Monitor.h:952
#15 Dispatcher::ms_dispatch2 (this=0x56151e200000, m=...) at mon/./src/msg/Dispatcher.h:130
#16 0x00007fd9c11d39aa in DispatchQueue::entry() () from /usr/lib/ceph/libceph-common.so.2
#17 0x00007fd9c126dec1 in ?? () from /usr/lib/ceph/libceph-common.so.2
#18 0x00007fd9c06ccb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#19 0x00007fd9c075ea00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

(gdb) f 9
#9  Monitor::_scrub (this=0x56151e200000, r=0x56151e2f1078, start=0x8, num_keys=0x7fd9b82dc2ac) at mon/./src/mon/Monitor.cc:5672
5672    mon/./src/mon/Monitor.cc: No such file or directory.
(gdb) p start
$1 = (std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > *) 0x8
(gdb) f 10
#10 0x000056151c249676 in Monitor::scrub (this=0x56151e200000) at mon/./src/mon/Monitor.cc:5591
5591    mon/./src/mon/Monitor.cc: No such file or directory.
(gdb) p scrub_state
$2 = std::shared_ptr<Monitor::ScrubState> (empty) = {get() = 0x0}

Checking the logs validates the same sequence.

Thread 7fd9b82e0640 logs a wait_for_paxos_write "flushing pending write" message at line 267498:

267498   -160> 2024-07-25T02:14:39.011+0000 7fd9b82e0640 10 mon.a@0(leader) e1 wait_for_paxos_write flushing pending write

Before the flush completes, another thread (7fd9b5adb640) calls scrub_reset at line 267584:

 267584    -87> 2024-07-25T02:14:39.011+0000 7fd9b5adb640 10 mon.a@0(probing) e2 scrub_reset

Then thread 7fd9b82e0640 logs that the pending write has finished flushing:

  267670     -1> 2024-07-25T02:14:39.011+0000 7fd9b82e0640 10 mon.a@0(probing) e2 wait_for_paxos_write flushed pending write

and the monitor crashes while trying to access scrub_state->last_key.
I need to debug this further to find the root cause.

Actions #3

Updated by Laura Flores over 1 year ago · Edited

@MOHIT AGRAWAL can you share how you got a better backtrace than I did in gdb? I wonder if I was missing some debug info.
I installed only the ceph-mon debug package.

Actions #4

Updated by Radoslaw Zarzynski over 1 year ago

Scrub note: work in progress; Mohit is looking into possible fixes.

Actions #5

Updated by MOHIT AGRAWAL over 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 59050
Actions #6

Updated by Radoslaw Zarzynski over 1 year ago

Review in progress.

Actions #7

Updated by Laura Flores 8 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #8

Updated by Laura Flores 8 months ago

/a/yuriw-2025-06-30_19:22:49-rados-squid-release-distro-default-smithi/8359455

Actions #9

Updated by Upkeep Bot 8 months ago

  • Copied to Backport #72095: squid: Segmentation fault in mon during scrub op added
Actions #10

Updated by Upkeep Bot 8 months ago

  • Tags (freeform) set to backport_processed
Actions #11

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 2c8bfcf2094ab2af352757ac4fc2813c27c26020
  • Fixed In set to v19.3.0-7008-g2c8bfcf2094
  • Upkeep Timestamp set to 2025-07-11T20:39:30+00:00
Actions #12

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-7008-g2c8bfcf2094 to v19.3.0-7008-g2c8bfcf209
  • Upkeep Timestamp changed from 2025-07-11T20:39:30+00:00 to 2025-07-14T20:44:21+00:00
Actions #13

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~1307
  • Upkeep Timestamp changed from 2025-07-14T20:44:21+00:00 to 2025-11-01T00:59:18+00:00
Actions #14

Updated by MOHIT AGRAWAL 3 months ago

  • Status changed from Pending Backport to Closed

The pull request is merged.
