Bug #46875
TestLibRBD.TestPendingAio: test_librbd.cc:4539: Failure or SIGSEGV
Description
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:4539: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (68 ms)
Updated by Jason Dillaman over 5 years ago
- Status changed from New to In Progress
- Assignee set to Jason Dillaman
Updated by Jason Dillaman over 5 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 36839
Updated by Mykola Golub over 5 years ago
- Status changed from Fix Under Review to Resolved
Updated by Josh Durgin over 5 years ago
- Status changed from Resolved to Pending Backport
- Backport set to octopus, nautilus
Looks like this could use a backport - seen in nautilus here:
https://jenkins.ceph.com/job/ceph-pull-requests/62883/
https://github.com/ceph/ceph/pull/37872
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:4572: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (96 ms)
Updated by Nathan Cutler over 5 years ago
- Copied to Backport #48081: nautilus: make check: run-rbd-unit-tests: TestLibRBD.TestPendingAio: test_librbd.cc:4539: Failure added
Updated by Nathan Cutler over 5 years ago
- Copied to Backport #48082: octopus: make check: run-rbd-unit-tests: TestLibRBD.TestPendingAio: test_librbd.cc:4539: Failure added
Updated by Jason Dillaman over 5 years ago
- Status changed from Pending Backport to New
- Backport deleted (octopus, nautilus)
Updated by Nathan Cutler over 5 years ago
Thanks, Jason. The "make check" console log that Josh mentioned in Comment 4 (reproduced with a small correction below) has been preserved here: https://gist.github.com/smithfarm/d52b01f1214b723b46013cd9cd8ff497
Updated by Nathan Cutler over 5 years ago
Josh Durgin wrote:
Looks like this could use a backport - seen in ~~nautilus~~ master here:
https://jenkins.ceph.com/job/ceph-pull-requests/62883/
https://github.com/ceph/ceph/pull/37872[...]
Updated by Josh Durgin almost 5 years ago
- Assignee deleted (Jason Dillaman)
Happened again here https://jenkins.ceph.com/job/ceph-pull-requests/73665/consoleFull :
[ RUN ] TestLibRBD.TestPendingAio
using new format!
creating image with stripe unit: 65536, stripe count: 16
../src/test/librbd/test_librbd.cc:4707: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (213 ms)
Updated by Deepika Upadhyay almost 5 years ago
Josh Durgin wrote:
Happened again here https://jenkins.ceph.com/job/ceph-pull-requests/73665/consoleFull :
[...]
saved log until someone can get back to it: https://gist.github.com/ideepika/843c38c088b4ca58de104005af9ba213
Updated by Sebastian Wagner over 4 years ago
- File consoleText.tgz consoleText.tgz added
https://jenkins.ceph.com/job/ceph-pull-requests/85058 (attached the log)
Updated by Ilya Dryomov over 4 years ago
- Related to Bug #49558: make check: run-rbd-unit-tests-61.sh (Failed) added
Updated by Ilya Dryomov over 2 years ago
This test occasionally crashes in Jenkins:
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/run-rbd-unit-tests.sh: line 20: 705216 Segmentation fault RBD_FEATURES=$i unittest_librbd
The following tests FAILED:
33 - run-rbd-unit-tests-109.sh (Failed)
Updated by Ilya Dryomov over 2 years ago
- Subject changed from make check: run-rbd-unit-tests: TestLibRBD.TestPendingAio: test_librbd.cc:4539: Failure to TestLibRBD.TestPendingAio: test_librbd.cc:4539: Failure or SIGSEGV
Updated by Casey Bodley almost 2 years ago
from https://jenkins.ceph.com/job/ceph-pull-requests/133893/consoleFull
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:8843: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (127 ms)
Updated by Bill Scales almost 2 years ago
from https://jenkins.ceph.com/job/ceph-pull-requests-arm64/58231/ another instance of the segfault from run-rbd-unit-tests-109.sh
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/run-rbd-unit-tests.sh: line 20: 3214412 Segmentation fault (core dumped) RBD_FEATURES=$i unittest_librbd
Updated by Bill Scales almost 2 years ago · Edited
I've managed to recreate the core dump:
cd /ceph/build
source ../src/test/detect-build-env-vars.sh
export RBD_FEATURES=109
unittest_librbd --gtest_filter=TestLibRBD.TestPendingAio --gtest_repeat=1000
It didn't reproduce at first, so I used taskset to vary the number of available cores for some scheduling variability; eventually
taskset 15 unittest_librbd --gtest_filter=TestLibRBD.TestPendingAio --gtest_repeat=1000
reproduced the segfault. Here's the stack trace:
#0 0x00000000026492be in librbd::exclusive_lock::ImageDispatch<librbd::ImageCtx>::handle_acquire_lock (this=0x7f4e24016250, r=-85) at /ceph/src/librbd/exclusive_lock/ImageDispatch.cc:287
#1 0x0000000002649b23 in librbd::util::detail::C_CallbackAdapter<librbd::exclusive_lock::ImageDispatch<librbd::ImageCtx>, &librbd::exclusive_lock::ImageDispatch<librbd::ImageCtx>::handle_acquire_lock>::finish (
this=<optimized out>, r=<optimized out>) at /ceph/src/librbd/Utils.h:63
#2 0x00000000010ec09d in Context::complete (this=0x7f4e3400c480, r=<optimized out>) at /ceph/src/include/Context.h:99
#3 0x0000000001123df1 in librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}::operator()() const (__closure=0x7f4ef0ff53f0) at /ceph/src/librbd/asio/ContextWQ.h:31
#4 boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>::operator()() (this=0x7f4ef0ff53f0) at /ceph/build/boost/include/boost/asio/detail/bind_handler.hpp:60
#5 boost::asio::asio_handler_invoke<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}> >(boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>&, ...) (function=...) at /ceph/build/boost/include/boost/asio/handler_invoke_hook.hpp:88
#6 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>, librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>(boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>&, librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}&) (context=..., function=...)
at /ceph/build/boost/include/boost/asio/detail/handler_invoke_helpers.hpp:54
#7 boost::asio::detail::asio_handler_invoke<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>, librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>(boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>&, boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>*) (this_handler=0x7f4ef0ff53f0, function=...)
at /ceph/build/boost/include/boost/asio/detail/bind_handler.hpp:111
#8 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>, boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}> >(boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>&, boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>&) (
context=..., function=...) at /ceph/build/boost/include/boost/asio/detail/handler_invoke_helpers.hpp:54
#9 boost::asio::detail::executor_op<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (owner=0x4113d30, base=<optimized out>) at /ceph/build/boost/include/boost/asio/detail/executor_op.hpp:71
#10 0x0000000001125a10 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=<optimized out>, this=<optimized out>)
at /ceph/build/boost/include/boost/asio/detail/scheduler_operation.hpp:40
#11 boost::asio::detail::strand_executor_service::run_ready_handlers (impl=std::shared_ptr<boost::asio::detail::strand_executor_service::strand_impl> (use count 2, weak count 0) = {...})
at /ceph/build/boost/include/boost/asio/detail/impl/strand_executor_service.ipp:150
#12 boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const, void>::operator() (this=0x7f4ef0ff5460)
at /ceph/build/boost/include/boost/asio/detail/impl/strand_executor_service.hpp:136
#13 boost::asio::asio_handler_invoke<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const, void> > (function=...)
at /ceph/build/boost/include/boost/asio/handler_invoke_hook.hpp:88
#14 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const, void>, boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const, void> > (context=..., function=...)
at /ceph/build/boost/include/boost/asio/detail/handler_invoke_helpers.hpp:54
#15 boost::asio::detail::executor_op<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const, void>, boost::asio::detail::recycling_allocator<void, boost::asio::detail::thread_info_base::default_tag>, boost::asio::detail::scheduler_operation>::do_complete (owner=0x7f4ecc000dc0, base=<optimized out>)
at /ceph/build/boost/include/boost/asio/detail/executor_op.hpp:71
#16 0x00000000016db528 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=0x7f4ecc000dc0, this=0x41d0590)
at /ceph/build/boost/include/boost/asio/detail/scheduler_operation.hpp:40
#17 boost::asio::detail::scheduler::do_run_one (this=this@entry=0x7f4ecc000dc0, lock=..., this_thread=..., ec=...) at /ceph/build/boost/include/boost/asio/detail/impl/scheduler.ipp:493
#18 0x00000000016dd589 in boost::asio::detail::scheduler::run (this=0x7f4ecc000dc0, ec=...) at /ceph/build/boost/include/boost/asio/detail/impl/scheduler.ipp:210
#19 0x000000000284bcab in boost::asio::io_context::run (this=<optimized out>) at /ceph/build/boost/include/boost/asio/impl/io_context.ipp:64
#20 ceph::async::io_context_pool::start(short)::{lambda()#1}::operator()() const (__closure=0x413b288) at /ceph/src/common/async/context_pool.h:69
#21 std::__invoke_impl<void, ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::__invoke_other, ceph::async::io_context_pool::start(short)::{lambda()#1}&&) (__f=...)
at /usr/include/c++/11/bits/invoke.h:61
#22 std::__invoke<ceph::async::io_context_pool::start(short)::{lambda()#1}>(ceph::async::io_context_pool::start(short)::{lambda()#1}&&) (__fn=...) at /usr/include/c++/11/bits/invoke.h:96
#23 std::invoke<ceph::async::io_context_pool::start(short)::{lambda()#1}>(ceph::async::io_context_pool::start(short)::{lambda()#1}&&) (__fn=...) at /usr/include/c++/11/functional:97
#24 _ZZ17make_named_threadIZN4ceph5async15io_context_pool5startEsEUlvE_JEESt6threadSt17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_ENKUlSA_SD_E_clIS3_JEEEDaSA_SD_ (fun=..., __closure=0x413b290)
at /ceph/src/common/Thread.h:79
#25 _ZSt13__invoke_implIvZ17make_named_threadIZN4ceph5async15io_context_pool5startEsEUlvE_JEESt6threadSt17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSB_SE_E_JS4_EESA_St14__invoke_otherOT0_DpOT1_ (__f=...)
at /usr/include/c++/11/bits/invoke.h:61
#26 _ZSt8__invokeIZ17make_named_threadIZN4ceph5async15io_context_pool5startEsEUlvE_JEESt6threadSt17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSB_SE_E_JS4_EENSt15__invoke_resultISA_JDpSC_EE4typeESB_SE_ (
__fn=...) at /usr/include/c++/11/bits/invoke.h:96
#27 _ZNSt6thread8_InvokerISt5tupleIJZ17make_named_threadIZN4ceph5async15io_context_pool5startEsEUlvE_JEES_St17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSC_SF_E_S6_EEE9_M_invokeIJLm0ELm1EEEEvSt12_Index_tupleIJXspT_EEE (this=0x413b288) at /usr/include/c++/11/bits/std_thread.h:259
#28 _ZNSt6thread8_InvokerISt5tupleIJZ17make_named_threadIZN4ceph5async15io_context_pool5startEsEUlvE_JEES_St17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSC_SF_E_S6_EEEclEv (this=0x413b288)
at /usr/include/c++/11/bits/std_thread.h:266
#29 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZ17make_named_threadIZN4ceph5async15io_context_pool5startEsEUlvE_JEES_St17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSD_SG_E_S7_EEEEE6_M_runEv (
this=0x413b280) at /usr/include/c++/11/bits/std_thread.h:211
#30 0x00007f502c489ad4 in std::execute_native_thread_routine (__p=0x413b280) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#31 0x00007f502c138c02 in start_thread (arg=<optimized out>) at pthread_create.c:443
#32 0x00007f502c1bcf34 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
The code that failed is /ceph/src/librbd/exclusive_lock/ImageDispatch.cc line 287:
template <typename I>
void ImageDispatch<I>::handle_acquire_lock(int r) {
auto cct = m_image_ctx->cct; <<< This line
Looking at variables:
(gdb) print *this
$3 = {<librbd::io::ImageDispatchInterface> = {_vptr.ImageDispatchInterface = 0x7f4e24016}, m_image_ctx = 0x7eb600763a9d9ad8, m_lock = {<ceph::mutex_debug_detail::mutex_debugging_base> = {
group = "\024b\342\300I\177\000\000ؚ\235:v\000\266~e_lock::ImageDispatch::m_lock (0x7f4e24016250)", id = 52, lockdep = true, backtrace = false, nlock = {<std::__atomic_base<int>> = {
static _S_alignment = 4, _M_i = 0}, static is_always_lock_free = true}, locked_by = {_M_thread = 0}}, rwlock = {__data = {__readers = 1, __writers = 0, __wrphase_futex = 1, __writers_futex = 0,
__pad3 = 0, __pad4 = 0, __cur_writer = 0, __shared = 0, __rwelision = 0 '\000', __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0},
__size = "\001\000\000\000\000\000\000\000\001", '\000' <repeats 46 times>, __align = 1}, track = true, nrlock = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 0},
static is_always_lock_free = true}}, m_require_lock_on_read = true, m_require_lock_on_write = true, m_on_dispatches = empty std::__cxx11::list}
Looks like m_image_ctx has been corrupted:
(gdb) x/8x *this->m_image_ctx
Cannot access memory at address 0x7eb600763a9d9ad8
Updated by Bill Scales over 1 year ago
I think I've found the bug. The signature of the segmentation fault is a use-after-free of the exclusive_lock ImageDispatch object:
1. The m_image_ctx pointer has been corrupted; this field is near the start of the object.
2. The front of the string "librbd::exclusive_lock::ImageDispatch::m_lock" has also been corrupted.
That is corruption at the front of two different objects, which is consistent with this memory having been freed back to the heap.
The code path is exclusive lock delaying an I/O while it acquires the exclusive lock. The callback has returned -ERESTART (-85), which indicates the I/O is racing with exclusive lock shutdown and will be failed; looking at ManagedLock<I>::acquire_lock, this code path immediately schedules the async callback to call handle_acquire_lock.
What stops exclusive lock from being shutdown (and its destructor running) while the async callback is queued?
Looking at Dispatcher.h, the shutdown process carefully takes m_lock and removes the dispatch from m_dispatches, then calls async_op_tracker->wait_for_ops to wait for in-flight I/Os to complete; only when this happens does it delete the dispatch.
However, looking at Dispatcher.h:send we find:
1. The code takes m_lock and finds the next dispatch to call from m_dispatches
2. Calls async_op_tracker->start_op() to register the in-flight I/O
3. Releases m_lock (the I/O is now safely tracked by the async_op_tracker)
4. Calls send_dispatch
5. Calls async_op_tracker->finish_op()
So the async_op_tracker only actually protects the synchronous call into the dispatch code. When a dispatch (like exclusive_lock) performs some async processing for an I/O (typically indicated by returning true from send_dispatch), the call to finish_op allows the shutdown to happen and the dispatch destructor to run before the async processing completes.
Dispatcher isn't going to stop the exclusive_lock shutdown function and destructor from being called. Looking at the shut_down function in exclusive_lock, we see that it tries to clean up by looking for I/Os on m_on_dispatches, but this isn't sufficient to prevent handle_acquire_lock being called later, after the destructor has run. Blocking the thread in exclusive_lock ImageDispatch<I>::shut_down to wait for handle_acquire_lock to run would risk deadlock.
I think the fix is to extend the tracking of I/Os by the async_op_tracker so that an I/O is still counted in the tracker while waiting for asynchronous callbacks like this. There are likely other similar bugs in RBD where shutdown races with other asynchronous work.
Updated by Ilya Dryomov over 1 year ago
Hi Bill,
Your analysis seems spot on to me, although I'm not sure about extending that particular op tracker. This code is really hairy -- librbd shutdown has always been problematic, with a ton of work queue items and completion chains built on top of each other.
The test's main thread is in rbd_close(), right? Could you please post its stack trace from the core for reference?
Updated by Bill Scales over 1 year ago
Here's the stack trace for the main thread - as expected, it's in rbd_close waiting for the close request to complete:
(gdb) bt
#0 __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffcf52ba630)
at futex-internal.c:57
#1 __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7ffcf52ba630, expected=expected@entry=0, clockid=clockid@entry=0,
abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2 0x00007f502c1357ff in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffcf52ba630,
expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3 0x00007f502c137fa0 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffcf52ba5e0, cond=0x7ffcf52ba608)
at pthread_cond_wait.c:504
#4 ___pthread_cond_wait (cond=0x7ffcf52ba608, mutex=0x7ffcf52ba5e0) at pthread_cond_wait.c:619
#5 0x00007f502d27c70d in ceph::condition_variable_debug::wait (this=this@entry=0x7ffcf52ba608, lock=...)
at /ceph/src/common/condition_variable_debug.cc:28
#6 0x000000000255bde9 in ceph::condition_variable_debug::wait<C_SaferCond::wait()::{lambda()#1}>(std::unique_lock<ceph::mutex_debug_detail::mutex_debug_impl<false> >&, C_SaferCond::wait()::{lambda()#1}) (pred=..., lock=..., this=0x7ffcf52ba608)
at /ceph/src/common/condition_variable_debug.h:34
#7 C_SaferCond::wait (this=0x7ffcf52ba5a0) at /ceph/src/common/Cond.h:101
#8 librbd::ImageState<librbd::ImageCtx>::close (this=0x4114990) at /ceph/src/librbd/ImageState.cc:502
#9 0x0000000002501434 in rbd_close (image=0x4140880) at /ceph/src/librbd/librbd.cc:4980
#10 0x000000000222ae6a in TestLibRBD_TestPendingAio_Test::TestBody (this=0x419bcf0) at /ceph/src/test/librbd/test_librbd.cc:9030
#11 0x00000000028c6035 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (
object=object@entry=0x419bcf0, method=<optimized out>, location=location@entry=0x2c5e1a3 "the test body")
at /ceph/src/googletest/googletest/src/gtest.cc:2605
#12 0x00000000028cce39 in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x419bcf0,
method=&virtual testing::Test::TestBody(), location=location@entry=0x2c5e1a3 "the test body")
at /ceph/src/googletest/googletest/src/gtest.cc:2641
#13 0x00000000028bf764 in testing::Test::Run (this=this@entry=0x419bcf0) at /ceph/src/googletest/googletest/src/gtest.cc:2680
#14 0x00000000028bf86a in testing::TestInfo::Run (this=0x3fbaab0) at /ceph/src/googletest/googletest/src/gtest.cc:2858
#15 0x00000000028bfe1a in testing::TestSuite::Run (this=0x3fa59b0) at /ceph/src/googletest/googletest/src/gtest.cc:3012
#16 0x00000000028c0228 in testing::internal::UnitTestImpl::RunAllTests (this=0x3f2b420)
at /ceph/src/googletest/googletest/src/gtest.cc:5723
#17 0x00000000028c6665 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
object=object@entry=0x3f2b420, method=<optimized out>,
location=location@entry=0x2c60958 "auxiliary test code (environments or event listeners)")
at /ceph/src/googletest/googletest/src/gtest.cc:2605
#18 0x00000000028cd27f in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
object=0x3f2b420,
method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x28bff80 <testing::internal::UnitTestImpl::RunAllTests()>, location=location@entry=0x2c60958 "auxiliary test code (environments or event listeners)")
at /ceph/src/googletest/googletest/src/gtest.cc:2641
#19 0x00000000028bf904 in testing::UnitTest::Run (this=0x31e0ae0 <testing::UnitTest::GetInstance()::instance>)
at /ceph/src/googletest/googletest/src/gtest.cc:5306
#20 0x00000000010e825e in RUN_ALL_TESTS () at /ceph/src/googletest/googletest/include/gtest/gtest.h:2486
#21 main (argc=<optimized out>, argv=<optimized out>) at /ceph/src/test/librbd/test_main.cc:72
I can also confirm that the close is running from image_ctx.state:
(gdb) print *this->m_image_ctx.state
$7 = {m_image_ctx = 0x4140880, m_state = librbd::ImageState<librbd::ImageCtx>::STATE_CLOSING, m_lock = {<ceph::mutex_debug_detail::mutex_debugging_base> = {group = "librbd::ImageState::m_lock (0x4114990)", id = 26, lockdep = true, backtrace = false, nlock = {<std::__atomic_base<int>> = {
static _S_alignment = 4, _M_i = 0}, static is_always_lock_free = true}, locked_by = {_M_thread = 0}}, m = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, __align = 0}}, m_actions_contexts = std::__cxx11::list = {[0] = {first = {action_type = librbd::ImageState<librbd::ImageCtx>::ACTION_TYPE_CLOSE, refresh_seq = 0, snap_id = 18446744073709551614, on_ready = 0x0},
second = std::__cxx11::list = {[0] = 0x7ffcf52ba5a0}}}, m_last_refresh = 0, m_refresh_seq = 0, m_update_watchers = 0x4114be0, m_quiesce_watchers = 0x4114e90, m_open_flags = 0}
and after a bit of a fight with gdb we can find the ImageDispatcher, which shows that the exclusive lock ImageDispatch has been removed from m_dispatches (with RBD_FEATURES=109, exclusive lock should be in the map, and the segmentation fault happened in exclusive lock code):
(gdb) print *(librbd::io::ImageDispatcher<librbd::ImageCtx> *)0x7f4c80013e90
$18 = {<librbd::io::Dispatcher<librbd::ImageCtx, librbd::io::ImageDispatcherInterface>> = {<librbd::io::ImageDispatcherInterface> = {<librbd::io::DispatcherInterface<librbd::io::ImageDispatchInterface>> = {
_vptr.DispatcherInterface = 0x31bafc0 <vtable for librbd::io::ImageDispatcher<librbd::ImageCtx>+16>}, <No data fields>}, m_image_ctx = 0x4140880, m_lock = {<ceph::mutex_debug_detail::mutex_debugging_base> = {group = "librbd::io::Dispatcher::lock (0x7f4c80013e90)", id = 30, lockdep = true,
backtrace = false, nlock = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = 0}, static is_always_lock_free = true}, locked_by = {_M_thread = 0}}, rwlock = {__data = {__readers = 0, __writers = 0, __wrphase_futex = 0, __writers_futex = 0, __pad3 = 0, __pad4 = 0, __cur_writer = 0,
__shared = 0, __rwelision = 0 '\000', __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0}, __size = '\000' <repeats 55 times>, __align = 0}, track = true, nrlock = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 0}, static is_always_lock_free = true}},
m_dispatches = std::map with 5 elements = {[librbd::io::IMAGE_DISPATCH_LAYER_QUEUE] = {dispatch = 0x4116d00, async_op_tracker = 0x40b79f0}, [librbd::io::IMAGE_DISPATCH_LAYER_QOS] = {dispatch = 0x41153b0, async_op_tracker = 0x40be4d0}, [librbd::io::IMAGE_DISPATCH_LAYER_REFRESH] = {
dispatch = 0x4116d40, async_op_tracker = 0x41402f0}, [librbd::io::IMAGE_DISPATCH_LAYER_WRITE_BLOCK] = {dispatch = 0x41182d0, async_op_tracker = 0x40be550}, [librbd::io::IMAGE_DISPATCH_LAYER_CORE] = {dispatch = 0x4116ca0, async_op_tracker = 0x4205a40}}},
m_next_tid = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 516}, static is_always_lock_free = true}, m_qos_image_dispatch = 0x41153b0, m_write_block_dispatch = 0x41182d0}
The other dispatches are still in the list, so that must mean Exclusive Lock has called Dispatcher::shut_down_dispatch to delete itself during the close processing.
Updated by Casey Bodley over 1 year ago
from https://jenkins.ceph.com/job/ceph-pull-requests/144113/
The following tests FAILED:
33 - run-rbd-unit-tests-109.sh (Failed)
34 - run-rbd-unit-tests-127.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/run-rbd-unit-tests.sh: line 20: 3898836 Segmentation fault (core dumped) RBD_FEATURES=$i unittest_librbd
[ RUN ] TestLibRBD.TestPendingAio
using new format!
creating image with stripe unit: 65536, stripe count: 16
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (304 ms)
Updated by Casey Bodley over 1 year ago
https://jenkins.ceph.com/job/ceph-pull-requests/144496/
The following tests FAILED:
31 - run-rbd-unit-tests-1.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (247 ms)
Updated by Casey Bodley over 1 year ago
from https://jenkins.ceph.com/job/ceph-pull-requests/147785/
The following tests FAILED:
33 - run-rbd-unit-tests-109.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (515 ms)
Updated by Casey Bodley over 1 year ago
https://jenkins.ceph.com/job/ceph-pull-requests/148481/
The following tests FAILED:
31 - run-rbd-unit-tests-1.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (542 ms)
Updated by Casey Bodley 12 months ago
- Priority changed from Normal to Urgent
https://jenkins.ceph.com/job/ceph-pull-requests/154533/
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (130 ms)
Please fix or consider disabling. Raising priority for visibility.
Updated by Casey Bodley 12 months ago
https://jenkins.ceph.com/job/ceph-pull-requests/155151/
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (191 ms)
Updated by Casey Bodley 10 months ago
https://jenkins.ceph.com/job/ceph-pull-requests/159202/
The following tests FAILED:
31 - run-rbd-unit-tests-1.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (95 ms)
Updated by Casey Bodley 9 months ago
https://jenkins.ceph.com/job/ceph-pull-requests-arm64/75695/
The following tests FAILED:
32 - run-rbd-unit-tests-61.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (304 ms)
Updated by Casey Bodley 6 months ago
https://jenkins.ceph.com/job/ceph-pull-requests/167525/
The following tests FAILED:
31 - run-rbd-unit-tests-1.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9012: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (247 ms)
Updated by Casey Bodley 4 months ago
https://jenkins.ceph.com/job/ceph-pull-requests-arm64/83245/
The following tests FAILED:
31 - run-rbd-unit-tests-1.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/ceph/src/test/librbd/test_librbd.cc:9013: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (182 ms)
Updated by Ilya Dryomov 25 days ago
- Related to Bug #75163: disable known flaky tests in run-rbd-unit-tests added
Updated by Ilya Dryomov 25 days ago
Disabled in "make check" context via https://github.com/ceph/ceph/pull/67525.