Bug #67270
Segmentation fault in mon during scrub op
Status: Closed
Description
/a/yuriw-2024-07-24_13:26:20-rados-wip-yuri13-testing-2024-07-23-1049-squid-distro-default-smithi/7816741
2024-07-25T02:14:38.652 INFO:tasks.workunit.client.0.smithi077.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/../../standalone/ceph-helpers.sh:1978: expect_failure: return 0
2024-07-25T02:14:38.652 INFO:tasks.workunit.client.0.smithi077.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/test.sh:1189: test_mon_mon: ceph mon set election_strategy disallow
2024-07-25T02:14:39.016 INFO:tasks.ceph.mon.a.smithi077.stderr:*** Caught signal (Segmentation fault) **
2024-07-25T02:14:39.016 INFO:tasks.ceph.mon.a.smithi077.stderr: in thread 7fd9b82e0640 thread_name:ms_dispatch
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: ceph version 19.0.0-3644-g0bbb8f86 (0bbb8f86319e30091fba4403956eefdcfed3c0a4) squid (rc)
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fd9c067a520]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 2: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x10ac) [0x56151c2596ec]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 3: (Monitor::scrub()+0x286) [0x56151c249676]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 4: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x567) [0x56151c24cef7]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xea4) [0x56151c251b14]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 6: (Monitor::_ms_dispatch(Message*)+0x42f) [0x56151c25250f]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 7: ceph-mon(+0x267b8e) [0x56151c207b8e]
2024-07-25T02:14:39.018 INFO:tasks.ceph.mon.a.smithi077.stderr: 8: (DispatchQueue::entry()+0x58a) [0x7fd9c11d39aa]
2024-07-25T02:14:39.019 INFO:tasks.ceph.mon.a.smithi077.stderr: 9: /usr/lib/ceph/libceph-common.so.2(+0x426ec1) [0x7fd9c126dec1]
2024-07-25T02:14:39.019 INFO:tasks.ceph.mon.a.smithi077.stderr: 10: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7fd9c06ccb43]
2024-07-25T02:14:39.019 INFO:tasks.ceph.mon.a.smithi077.stderr: 11: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7fd9c075ea00]
This job produced a coredump. I wasn't able to tell much from it, but this is the bt:
[Current thread is 1 (LWP 24412)]
(gdb) bt
#0 0x00007fd9c06cea7c in __pthread_mutex_lock_full (mutex=0x5f5c) at ./nptl/pthread_mutex_lock.c:497
#1 0x00007fd9c108f342 in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::vector<entity_addrvec_t, std::allocator<entity_addrvec_t> >, std::allocator<std::vector<entity_addrvec_t, std::allocator<entity_addrvec_t> > >, (__gnu_cxx::_Lock_policy)2> >::deallocate (this=<optimized out>, __t=1, __p=<optimized out>) at /usr/include/c++/11/ext/new_allocator.h:132
#2 std::allocator<std::_Rb_tree_node<int> >::deallocate (__n=1, __p=<optimized out>, this=0x7fd9b82d8270) at /usr/include/c++/11/bits/allocator.h:199
#3 std::allocator_traits<std::allocator<std::_Rb_tree_node<int> > >::deallocate (__n=1, __p=<optimized out>, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:496
#4 std::_Rb_tree<int, int, std::_Identity<int>, std::less<int>, std::allocator<int> >::_M_put_node (__p=<optimized out>, this=0x7fd9b82d8270) at /usr/include/c++/11/bits/stl_tree.h:565
#5 std::_Rb_tree<int, int, std::_Identity<int>, std::less<int>, std::allocator<int> >::_M_drop_node (__p=<optimized out>, this=0x7fd9b82d8270) at /usr/include/c++/11/bits/stl_tree.h:632
#6 std::_Rb_tree<int, int, std::_Identity<int>, std::less<int>, std::allocator<int> >::_M_erase (this=0x7fd9b82d8270, __x=<optimized out>) at /usr/include/c++/11/bits/stl_tree.h:1891
#7 0x00007fd9c12db319 in std::_Rb_tree<unsigned long, unsigned long, std::_Identity<unsigned long>, std::less<unsigned long>, std::allocator<unsigned long> >::~_Rb_tree (this=<optimized out>,
this=<optimized out>) at /usr/include/c++/11/bits/stl_tree.h:984
#8 std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >::~set (this=<optimized out>, this=<optimized out>) at /usr/include/c++/11/bits/stl_set.h:281
#9 ceph::logging::Log::dump_recent (this=0x56151d3e2b40) at ./src/log/Log.cc:537
#10 0x0000000000000000 in ?? ()
Updated by Laura Flores over 1 year ago
Looks similar to this crash reported in telemetry:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?var-sig_v2=b4840ac22ee48f2058747692077c96240007883d420668e71feb22149d8bd03b&orgId=1
Updated by MOHIT AGRAWAL over 1 year ago · Edited
- Assignee set to MOHIT AGRAWAL
As the stack trace below shows, *start is invalid. start points into scrub_state (it is the address of scrub_state->last_key), and it appears scrub_state was reset via bootstrap at the same time the handle_scrub operation was in progress.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=11, threadid=140573074654784) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=11, threadid=140573074654784) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140573074654784, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3 0x00007fd9c067a476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4 0x000056151c45765d in reraise_fatal (signum=11) at global/./src/global/signal_handler.cc:88
#5 handle_oneshot_fatal_signal (signum=11) at global/./src/global/signal_handler.cc:367
#6 <signal handler called>
#7 std::operator<< <char, std::char_traits<char>, std::allocator<char> > (__str=..., __os=..., __os=..., __str=...)
at /usr/include/c++/11/bits/basic_string.h:194
#8 std::operator<< <std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (v={...}, out=...) at mon/./src/include/types.h:131
#9 Monitor::_scrub (this=0x56151e200000, r=0x56151e2f1078, start=0x8, num_keys=0x7fd9b82dc2ac) at mon/./src/mon/Monitor.cc:5672
#10 0x000056151c249676 in Monitor::scrub (this=0x56151e200000) at mon/./src/mon/Monitor.cc:5591
#11 0x000056151c24cef7 in Monitor::handle_scrub (this=0x56151e200000, op=...) at mon/./src/mon/Monitor.cc:5654
#12 0x000056151c251b14 in Monitor::dispatch_op (this=0x56151e200000, op=...) at mon/./src/mon/Monitor.cc:4716
#13 0x000056151c25250f in Monitor::_ms_dispatch (this=0x56151e200000, m=0x56151ec86b40) at mon/./src/mon/Monitor.cc:4533
#14 0x000056151c207b8e in Monitor::ms_dispatch (m=0x56151ec86b40, this=0x56151e200000) at mon/./src/mon/Monitor.h:952
#15 Dispatcher::ms_dispatch2 (this=0x56151e200000, m=...) at mon/./src/msg/Dispatcher.h:130
#16 0x00007fd9c11d39aa in DispatchQueue::entry() () from /usr/lib/ceph/libceph-common.so.2
#17 0x00007fd9c126dec1 in ?? () from /usr/lib/ceph/libceph-common.so.2
#18 0x00007fd9c06ccb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#19 0x00007fd9c075ea00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) f 9
#9 Monitor::_scrub (this=0x56151e200000, r=0x56151e2f1078, start=0x8, num_keys=0x7fd9b82dc2ac) at mon/./src/mon/Monitor.cc:5672
5672 mon/./src/mon/Monitor.cc: No such file or directory.
(gdb) p start
$1 = (std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > *) 0x8
(gdb) f 10
#10 0x000056151c249676 in Monitor::scrub (this=0x56151e200000) at mon/./src/mon/Monitor.cc:5591
5591 mon/./src/mon/Monitor.cc: No such file or directory.
(gdb) p scrub_state
$2 = std::shared_ptr<Monitor::ScrubState> (empty) = {get() = 0x0}
The logs also validate this. Thread 7fd9b82e0640 reports that it is flushing a pending paxos write at line 267498:
267498 -160> 2024-07-25T02:14:39.011+0000 7fd9b82e0640 10 mon.a@0(leader) e1 wait_for_paxos_write flushing pending write
Before the flush completes, the other thread (7fd9b5adb640) calls scrub_reset at line 267584:
267584 -87> 2024-07-25T02:14:39.011+0000 7fd9b5adb640 10 mon.a@0(probing) e2 scrub_reset
Thread 7fd9b82e0640 then reports that the flush finished:
267670 -1> 2024-07-25T02:14:39.011+0000 7fd9b82e0640 10 mon.a@0(probing) e2 wait_for_paxos_write flushed pending write
and crashes while trying to access scrub_state->last_key.
I need to debug it more to find the root cause.
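For illustration, here is a minimal standalone reduction of the suspected race, using simplified hypothetical types rather than the actual Monitor code: the dispatch thread captures a pointer into scrub_state before blocking on the paxos flush, a second thread resets scrub_state while it waits, and the resumed thread dereferences the now-dangling pointer. (With a null shared_ptr, &scrub_state->last_key evaluates to the member's offset, matching the start=0x8 seen in the backtrace.)

// Hypothetical reduction of the race (not the actual Ceph code).
#include <condition_variable>
#include <iostream>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

struct ScrubState {
  std::pair<std::string, std::string> last_key;  // stands in for Monitor::ScrubState
};

std::shared_ptr<ScrubState> scrub_state = std::make_shared<ScrubState>();
std::mutex m;
std::condition_variable cv;
bool flushed = false;

// Dispatch thread: stands in for handle_scrub() -> scrub() -> _scrub().
void handle_scrub() {
  // The address of last_key is taken from scrub_state and passed down,
  // like the `start` argument of Monitor::_scrub().
  auto* start = &scrub_state->last_key;

  // Stands in for wait_for_paxos_write(): block until the write flushes.
  std::unique_lock<std::mutex> lk(m);
  cv.wait(lk, [] { return flushed; });

  // scrub_reset() may have destroyed ScrubState while we were blocked,
  // so `start` can dangle here; dereferencing it mirrors the segfault.
  std::cout << "resuming scrub after " << start->first << "\n";
}

// Bootstrap/election-triggered thread: stands in for scrub_reset().
void scrub_reset() {
  scrub_state.reset();  // drops the only reference; ScrubState is destroyed
  {
    std::lock_guard<std::mutex> lk(m);
    flushed = true;  // the pending paxos write "flushes"
  }
  cv.notify_all();
}

int main() {
  std::thread dispatch(handle_scrub);
  std::thread reset(scrub_reset);
  dispatch.join();
  reset.join();
}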
Updated by Laura Flores over 1 year ago · Edited
@MOHIT AGRAWAL can you share how you got a better backtrace than I did in gdb? I wonder if I was missing some debug info.
I installed only the ceph-mon debug package.
Updated by Radoslaw Zarzynski over 1 year ago
scrub note: work in progress, Mohit is looking into possible fixes.
Updated by MOHIT AGRAWAL over 1 year ago
- Status changed from New to Fix Under Review
- Pull request ID set to 59050
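Continuing the simplified sketch from the comment above, one common defensive pattern for this class of race is to re-validate scrub_state under the lock after the wait, instead of capturing a pointer into it beforehand. This is illustrative only, not a claim about what the pull request actually does:

// Hypothetical fixed dispatch-thread routine for the sketch above
// (assumes scrub_reset() also resets scrub_state while holding m).
void handle_scrub_fixed() {
  std::unique_lock<std::mutex> lk(m);
  cv.wait(lk, [] { return flushed; });

  // Re-check the shared state after waking: it may have been reset
  // by a bootstrap/election while this thread was blocked.
  if (!scrub_state) {
    std::cout << "scrub_state reset during wait, aborting scrub\n";
    return;
  }
  auto* start = &scrub_state->last_key;  // safe: verified non-null under m
  std::cout << "resuming scrub after " << start->first << "\n";
}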
Updated by Laura Flores 8 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Laura Flores 8 months ago
/a/yuriw-2025-06-30_19:22:49-rados-squid-release-distro-default-smithi/8359455
Updated by Upkeep Bot 8 months ago
- Copied to Backport #72095: squid: Segmentation fault in mon during scrub op added
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 2c8bfcf2094ab2af352757ac4fc2813c27c26020
- Fixed In set to v19.3.0-7008-g2c8bfcf2094
- Upkeep Timestamp set to 2025-07-11T20:39:30+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-7008-g2c8bfcf2094 to v19.3.0-7008-g2c8bfcf209
- Upkeep Timestamp changed from 2025-07-11T20:39:30+00:00 to 2025-07-14T20:44:21+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~1307
- Upkeep Timestamp changed from 2025-07-14T20:44:21+00:00 to 2025-11-01T00:59:18+00:00
Updated by MOHIT AGRAWAL 3 months ago
- Status changed from Pending Backport to Closed
The pull request has been merged.