Bug #46875
TestLibRBD.TestPendingAio: test_librbd.cc:4539: Failure or SIGSEGV
Description
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:4539: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (68 ms)
Updated by Jason Dillaman over 5 years ago
- Status changed from New to In Progress
- Assignee set to Jason Dillaman
Updated by Jason Dillaman over 5 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 36839
Updated by Mykola Golub over 5 years ago
- Status changed from Fix Under Review to Resolved
Updated by Josh Durgin over 5 years ago
- Status changed from Resolved to Pending Backport
- Backport set to octopus, nautilus
Looks like this could use a backport - seen in nautilus here:
https://jenkins.ceph.com/job/ceph-pull-requests/62883/
https://github.com/ceph/ceph/pull/37872
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:4572: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (96 ms)
Updated by Nathan Cutler over 5 years ago
- Copied to Backport #48081: nautilus: make check: run-rbd-unit-tests: TestLibRBD.TestPendingAio: test_librbd.cc:4539: Failure added
Updated by Nathan Cutler over 5 years ago
- Copied to Backport #48082: octopus: make check: run-rbd-unit-tests: TestLibRBD.TestPendingAio: test_librbd.cc:4539: Failure added
Updated by Jason Dillaman over 5 years ago
- Status changed from Pending Backport to New
- Backport deleted (octopus, nautilus)
Updated by Nathan Cutler over 5 years ago
Thanks, Jason. The "make check" console log that Josh mentioned in Comment 4 (reproduced with a small correction below) has been preserved here: https://gist.github.com/smithfarm/d52b01f1214b723b46013cd9cd8ff497
Updated by Nathan Cutler over 5 years ago
Josh Durgin wrote:
Looks like this could use a backport - seen in ~~nautilus~~ master here:
https://jenkins.ceph.com/job/ceph-pull-requests/62883/
https://github.com/ceph/ceph/pull/37872[...]
Updated by Josh Durgin almost 5 years ago
- Assignee deleted (Jason Dillaman)
Happened again here https://jenkins.ceph.com/job/ceph-pull-requests/73665/consoleFull :
[ RUN ] TestLibRBD.TestPendingAio
using new format!
creating image with stripe unit: 65536, stripe count: 16
../src/test/librbd/test_librbd.cc:4707: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (213 ms)
Updated by Deepika Upadhyay almost 5 years ago
Josh Durgin wrote:
Happened again here https://jenkins.ceph.com/job/ceph-pull-requests/73665/consoleFull :
[...]
saved log until someone can get back to it: https://gist.github.com/ideepika/843c38c088b4ca58de104005af9ba213
Updated by Sebastian Wagner over 4 years ago
- File consoleText.tgz consoleText.tgz added
https://jenkins.ceph.com/job/ceph-pull-requests/85058 (attached the log)
Updated by Ilya Dryomov over 4 years ago
- Related to Bug #49558: make check: run-rbd-unit-tests-61.sh (Failed) added
Updated by Ilya Dryomov over 2 years ago
This test occasionally crashes in Jenkins:
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/run-rbd-unit-tests.sh: line 20: 705216 Segmentation fault RBD_FEATURES=$i unittest_librbd
The following tests FAILED:
33 - run-rbd-unit-tests-109.sh (Failed)
Updated by Ilya Dryomov over 2 years ago
- Subject changed from make check: run-rbd-unit-tests: TestLibRBD.TestPendingAio: test_librbd.cc:4539: Failure to TestLibRBD.TestPendingAio: test_librbd.cc:4539: Failure or SIGSEGV
Updated by Casey Bodley almost 2 years ago
from https://jenkins.ceph.com/job/ceph-pull-requests/133893/consoleFull
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:8843: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (127 ms)
Updated by Bill Scales almost 2 years ago
from https://jenkins.ceph.com/job/ceph-pull-requests-arm64/58231/ another instance of the segfault from run-rbd-unit-tests-109.sh
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/run-rbd-unit-tests.sh: line 20: 3214412 Segmentation fault (core dumped) RBD_FEATURES=$i unittest_librbd
Updated by Bill Scales almost 2 years ago · Edited
I've managed to recreate the core dump:
cd /ceph/build
source ../src/test/detect-build-env-vars.sh
export RBD_FEATURES=109
unittest_librbd --gtest_filter=TestLibRBD.TestPendingAio --gtest_repeat=1000
It didn't reproduce at first, so I used taskset to vary the number of available cores for some scheduling variability; eventually
taskset 15 unittest_librbd --gtest_filter=TestLibRBD.TestPendingAio --gtest_repeat=1000
reproduced the segfault. Here's the stack trace:
#0 0x00000000026492be in librbd::exclusive_lock::ImageDispatch<librbd::ImageCtx>::handle_acquire_lock (this=0x7f4e24016250, r=-85) at /ceph/src/librbd/exclusive_lock/ImageDispatch.cc:287
#1 0x0000000002649b23 in librbd::util::detail::C_CallbackAdapter<librbd::exclusive_lock::ImageDispatch<librbd::ImageCtx>, &librbd::exclusive_lock::ImageDispatch<librbd::ImageCtx>::handle_acquire_lock>::finish (
this=<optimized out>, r=<optimized out>) at /ceph/src/librbd/Utils.h:63
#2 0x00000000010ec09d in Context::complete (this=0x7f4e3400c480, r=<optimized out>) at /ceph/src/include/Context.h:99
#3 0x0000000001123df1 in librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}::operator()() const (__closure=0x7f4ef0ff53f0) at /ceph/src/librbd/asio/ContextWQ.h:31
#4 boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>::operator()() (this=0x7f4ef0ff53f0) at /ceph/build/boost/include/boost/asio/detail/bind_handler.hpp:60
#5 boost::asio::asio_handler_invoke<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}> >(boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>&, ...) (function=...) at /ceph/build/boost/include/boost/asio/handler_invoke_hook.hpp:88
#6 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>, librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>(boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>&, librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}&) (context=..., function=...)
at /ceph/build/boost/include/boost/asio/detail/handler_invoke_helpers.hpp:54
#7 boost::asio::detail::asio_handler_invoke<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>, librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>(boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>&, boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>*) (this_handler=0x7f4ef0ff53f0, function=...)
at /ceph/build/boost/include/boost/asio/detail/bind_handler.hpp:111
#8 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>, boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}> >(boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>&, boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>&) (
context=..., function=...) at /ceph/build/boost/include/boost/asio/detail/handler_invoke_helpers.hpp:54
#9 boost::asio::detail::executor_op<boost::asio::detail::binder0<librbd::asio::ContextWQ::queue(Context*, int)::{lambda()#1}>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (owner=0x4113d30, base=<optimized out>) at /ceph/build/boost/include/boost/asio/detail/executor_op.hpp:71
#10 0x0000000001125a10 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=<optimized out>, this=<optimized out>)
at /ceph/build/boost/include/boost/asio/detail/scheduler_operation.hpp:40
#11 boost::asio::detail::strand_executor_service::run_ready_handlers (impl=std::shared_ptr<boost::asio::detail::strand_executor_service::strand_impl> (use count 2, weak count 0) = {...})
at /ceph/build/boost/include/boost/asio/detail/impl/strand_executor_service.ipp:150
#12 boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const, void>::operator() (this=0x7f4ef0ff5460)
at /ceph/build/boost/include/boost/asio/detail/impl/strand_executor_service.hpp:136
#13 boost::asio::asio_handler_invoke<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const, void> > (function=...)
at /ceph/build/boost/include/boost/asio/handler_invoke_hook.hpp:88
#14 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const, void>, boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const, void> > (context=..., function=...)
at /ceph/build/boost/include/boost/asio/detail/handler_invoke_helpers.hpp:54
#15 boost::asio::detail::executor_op<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const, void>, boost::asio::detail::recycling_allocator<void, boost::asio::detail::thread_info_base::default_tag>, boost::asio::detail::scheduler_operation>::do_complete (owner=0x7f4ecc000dc0, base=<optimized out>)
at /ceph/build/boost/include/boost/asio/detail/executor_op.hpp:71
#16 0x00000000016db528 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=0x7f4ecc000dc0, this=0x41d0590)
at /ceph/build/boost/include/boost/asio/detail/scheduler_operation.hpp:40
#17 boost::asio::detail::scheduler::do_run_one (this=this@entry=0x7f4ecc000dc0, lock=..., this_thread=..., ec=...) at /ceph/build/boost/include/boost/asio/detail/impl/scheduler.ipp:493
#18 0x00000000016dd589 in boost::asio::detail::scheduler::run (this=0x7f4ecc000dc0, ec=...) at /ceph/build/boost/include/boost/asio/detail/impl/scheduler.ipp:210
#19 0x000000000284bcab in boost::asio::io_context::run (this=<optimized out>) at /ceph/build/boost/include/boost/asio/impl/io_context.ipp:64
#20 ceph::async::io_context_pool::start(short)::{lambda()#1}::operator()() const (__closure=0x413b288) at /ceph/src/common/async/context_pool.h:69
#21 std::__invoke_impl<void, ceph::async::io_context_pool::start(short)::{lambda()#1}>(std::__invoke_other, ceph::async::io_context_pool::start(short)::{lambda()#1}&&) (__f=...)
at /usr/include/c++/11/bits/invoke.h:61
#22 std::__invoke<ceph::async::io_context_pool::start(short)::{lambda()#1}>(ceph::async::io_context_pool::start(short)::{lambda()#1}&&) (__fn=...) at /usr/include/c++/11/bits/invoke.h:96
#23 std::invoke<ceph::async::io_context_pool::start(short)::{lambda()#1}>(ceph::async::io_context_pool::start(short)::{lambda()#1}&&) (__fn=...) at /usr/include/c++/11/functional:97
#24 _ZZ17make_named_threadIZN4ceph5async15io_context_pool5startEsEUlvE_JEESt6threadSt17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_ENKUlSA_SD_E_clIS3_JEEEDaSA_SD_ (fun=..., __closure=0x413b290)
at /ceph/src/common/Thread.h:79
#25 _ZSt13__invoke_implIvZ17make_named_threadIZN4ceph5async15io_context_pool5startEsEUlvE_JEESt6threadSt17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSB_SE_E_JS4_EESA_St14__invoke_otherOT0_DpOT1_ (__f=...)
at /usr/include/c++/11/bits/invoke.h:61
#26 _ZSt8__invokeIZ17make_named_threadIZN4ceph5async15io_context_pool5startEsEUlvE_JEESt6threadSt17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSB_SE_E_JS4_EENSt15__invoke_resultISA_JDpSC_EE4typeESB_SE_ (
__fn=...) at /usr/include/c++/11/bits/invoke.h:96
#27 _ZNSt6thread8_InvokerISt5tupleIJZ17make_named_threadIZN4ceph5async15io_context_pool5startEsEUlvE_JEES_St17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSC_SF_E_S6_EEE9_M_invokeIJLm0ELm1EEEEvSt12_Index_tupleIJXspT_EEE (this=0x413b288) at /usr/include/c++/11/bits/std_thread.h:259
#28 _ZNSt6thread8_InvokerISt5tupleIJZ17make_named_threadIZN4ceph5async15io_context_pool5startEsEUlvE_JEES_St17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSC_SF_E_S6_EEEclEv (this=0x413b288)
at /usr/include/c++/11/bits/std_thread.h:266
#29 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZ17make_named_threadIZN4ceph5async15io_context_pool5startEsEUlvE_JEES_St17basic_string_viewIcSt11char_traitsIcEEOT_DpOT0_EUlSD_SG_E_S7_EEEEE6_M_runEv (
this=0x413b280) at /usr/include/c++/11/bits/std_thread.h:211
#30 0x00007f502c489ad4 in std::execute_native_thread_routine (__p=0x413b280) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#31 0x00007f502c138c02 in start_thread (arg=<optimized out>) at pthread_create.c:443
#32 0x00007f502c1bcf34 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
The code that failed is /ceph/src/librbd/exclusive_lock/ImageDispatch.cc line 287:
template <typename I>
void ImageDispatch<I>::handle_acquire_lock(int r) {
auto cct = m_image_ctx->cct; <<< This line
Looking at variables:
(gdb) print *this
$3 = {<librbd::io::ImageDispatchInterface> = {_vptr.ImageDispatchInterface = 0x7f4e24016}, m_image_ctx = 0x7eb600763a9d9ad8, m_lock = {<ceph::mutex_debug_detail::mutex_debugging_base> = {
group = "\024b\342\300I\177\000\000ؚ\235:v\000\266~e_lock::ImageDispatch::m_lock (0x7f4e24016250)", id = 52, lockdep = true, backtrace = false, nlock = {<std::__atomic_base<int>> = {
static _S_alignment = 4, _M_i = 0}, static is_always_lock_free = true}, locked_by = {_M_thread = 0}}, rwlock = {__data = {__readers = 1, __writers = 0, __wrphase_futex = 1, __writers_futex = 0,
__pad3 = 0, __pad4 = 0, __cur_writer = 0, __shared = 0, __rwelision = 0 '\000', __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0},
__size = "\001\000\000\000\000\000\000\000\001", '\000' <repeats 46 times>, __align = 1}, track = true, nrlock = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 0},
static is_always_lock_free = true}}, m_require_lock_on_read = true, m_require_lock_on_write = true, m_on_dispatches = empty std::__cxx11::list}
Looks like m_image_ctx has been corrupted:
(gdb) x/8x *this->m_image_ctx
Cannot access memory at address 0x7eb600763a9d9ad8
Updated by Bill Scales over 1 year ago
I think I've found the bug. The signature of the segmentation fault is a use-after-free of the exclusive_lock ImageDispatch object:
1. The m_image_ctx pointer has been corrupted; this field is near the start of the object.
2. The front of the string "librbd::exclusive_lock::ImageDispatch::m_lock" has also been corrupted.
That is corruption at the front of two different objects, which is consistent with this memory having been freed back to the heap.
The code path is exclusive lock delaying an I/O while it acquires the exclusive lock. The callback has returned -ERESTART (-85), which indicates the I/O is racing with exclusive lock shutdown and will be failed; looking at ManagedLock<I>::acquire_lock, this code path immediately schedules the async callback to call handle_acquire_lock.
What stops exclusive lock from being shutdown (and its destructor running) while the async callback is queued?
Looking at Dispatcher.h, the shutdown process carefully takes m_lock and removes the dispatch from m_dispatches, then calls async_op_tracker->wait_for_ops to wait for in-flight I/Os to complete; only when this happens does it delete the dispatch.
However, looking at Dispatcher.h:send we find:
1. The code takes m_lock and finds the next dispatch to call from m_dispatches
2. Calls async_op_tracker->start_op() to register the in-flight I/O
3. Releases m_lock (the I/O is now safely tracked by the async_op_tracker)
4. Calls send_dispatch
5. Calls async_op_tracker->finish_op()
So the async_op_tracker only actually protects the synchronous call into the dispatch code. When a dispatch (like exclusive_lock) performs some async processing for an I/O (typically indicated by returning true from send_dispatch), the call to finish_op allows the shutdown to happen and the dispatch destructor to run before the async processing completes.
Dispatcher isn't going to stop the exclusive_lock shutdown function and destructor from being called. Looking at the shut_down function in exclusive_lock, we see that it tries to clean up by looking for I/Os on m_on_dispatches, but this isn't sufficient to prevent handle_acquire_lock being called later, after the destructor has run. Blocking the thread in exclusive_lock ImageDispatch<I>::shut_down to wait for handle_acquire_lock to run would risk deadlock.
I think the fix is to extend the tracking of I/Os by the async_op_tracker so that an I/O is still counted in the tracker while waiting for asynchronous callbacks like this. There are likely other similar bugs in RBD where shutdown races with other asynchronous work.
Updated by Ilya Dryomov over 1 year ago
Hi Bill,
Your analysis seems spot on to me, although I'm not sure about extending that particular op tracker. This code is really hairy -- librbd shutdown has always been problematic, with a ton of work queue items and completion chains built on top of each other.
The test's main thread is in rbd_close(), right? Could you please post its stack trace from the core for reference?
Updated by Bill Scales over 1 year ago
Here's the stack trace for the main thread - as expected, it's in rbd_close waiting for the close request to complete:
(gdb) bt
#0 __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffcf52ba630)
at futex-internal.c:57
#1 __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7ffcf52ba630, expected=expected@entry=0, clockid=clockid@entry=0,
abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2 0x00007f502c1357ff in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffcf52ba630,
expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3 0x00007f502c137fa0 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffcf52ba5e0, cond=0x7ffcf52ba608)
at pthread_cond_wait.c:504
#4 ___pthread_cond_wait (cond=0x7ffcf52ba608, mutex=0x7ffcf52ba5e0) at pthread_cond_wait.c:619
#5 0x00007f502d27c70d in ceph::condition_variable_debug::wait (this=this@entry=0x7ffcf52ba608, lock=...)
at /ceph/src/common/condition_variable_debug.cc:28
#6 0x000000000255bde9 in ceph::condition_variable_debug::wait<C_SaferCond::wait()::{lambda()#1}>(std::unique_lock<ceph::mutex_debug_detail::mutex_debug_impl<false> >&, C_SaferCond::wait()::{lambda()#1}) (pred=..., lock=..., this=0x7ffcf52ba608)
at /ceph/src/common/condition_variable_debug.h:34
#7 C_SaferCond::wait (this=0x7ffcf52ba5a0) at /ceph/src/common/Cond.h:101
#8 librbd::ImageState<librbd::ImageCtx>::close (this=0x4114990) at /ceph/src/librbd/ImageState.cc:502
#9 0x0000000002501434 in rbd_close (image=0x4140880) at /ceph/src/librbd/librbd.cc:4980
#10 0x000000000222ae6a in TestLibRBD_TestPendingAio_Test::TestBody (this=0x419bcf0) at /ceph/src/test/librbd/test_librbd.cc:9030
#11 0x00000000028c6035 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (
object=object@entry=0x419bcf0, method=<optimized out>, location=location@entry=0x2c5e1a3 "the test body")
at /ceph/src/googletest/googletest/src/gtest.cc:2605
#12 0x00000000028cce39 in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x419bcf0,
method=&virtual testing::Test::TestBody(), location=location@entry=0x2c5e1a3 "the test body")
at /ceph/src/googletest/googletest/src/gtest.cc:2641
#13 0x00000000028bf764 in testing::Test::Run (this=this@entry=0x419bcf0) at /ceph/src/googletest/googletest/src/gtest.cc:2680
#14 0x00000000028bf86a in testing::TestInfo::Run (this=0x3fbaab0) at /ceph/src/googletest/googletest/src/gtest.cc:2858
#15 0x00000000028bfe1a in testing::TestSuite::Run (this=0x3fa59b0) at /ceph/src/googletest/googletest/src/gtest.cc:3012
#16 0x00000000028c0228 in testing::internal::UnitTestImpl::RunAllTests (this=0x3f2b420)
at /ceph/src/googletest/googletest/src/gtest.cc:5723
#17 0x00000000028c6665 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
object=object@entry=0x3f2b420, method=<optimized out>,
location=location@entry=0x2c60958 "auxiliary test code (environments or event listeners)")
at /ceph/src/googletest/googletest/src/gtest.cc:2605
#18 0x00000000028cd27f in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
object=0x3f2b420,
method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x28bff80 <testing::internal::UnitTestImpl::RunAllTests()>, location=location@entry=0x2c60958 "auxiliary test code (environments or event listeners)")
at /ceph/src/googletest/googletest/src/gtest.cc:2641
#19 0x00000000028bf904 in testing::UnitTest::Run (this=0x31e0ae0 <testing::UnitTest::GetInstance()::instance>)
at /ceph/src/googletest/googletest/src/gtest.cc:5306
#20 0x00000000010e825e in RUN_ALL_TESTS () at /ceph/src/googletest/googletest/include/gtest/gtest.h:2486
#21 main (argc=<optimized out>, argv=<optimized out>) at /ceph/src/test/librbd/test_main.cc:72
I can also confirm that the close is running from image_ctx.state:
(gdb) print *this->m_image_ctx.state
$7 = {m_image_ctx = 0x4140880, m_state = librbd::ImageState<librbd::ImageCtx>::STATE_CLOSING, m_lock = {<ceph::mutex_debug_detail::mutex_debugging_base> = {group = "librbd::ImageState::m_lock (0x4114990)", id = 26, lockdep = true, backtrace = false, nlock = {<std::__atomic_base<int>> = {
static _S_alignment = 4, _M_i = 0}, static is_always_lock_free = true}, locked_by = {_M_thread = 0}}, m = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, __align = 0}}, m_actions_contexts = std::__cxx11::list = {[0] = {first = {action_type = librbd::ImageState<librbd::ImageCtx>::ACTION_TYPE_CLOSE, refresh_seq = 0, snap_id = 18446744073709551614, on_ready = 0x0},
second = std::__cxx11::list = {[0] = 0x7ffcf52ba5a0}}}, m_last_refresh = 0, m_refresh_seq = 0, m_update_watchers = 0x4114be0, m_quiesce_watchers = 0x4114e90, m_open_flags = 0}
and after a bit of a fight with gdb we can find the ImageDispatcher, which shows that the exclusive lock ImageDispatch has been removed from m_dispatches (with RBD_FEATURES=109, exclusive lock should be in the map, and the segmentation fault happened in exclusive lock code):
(gdb) print *(librbd::io::ImageDispatcher<librbd::ImageCtx> *)0x7f4c80013e90
$18 = {<librbd::io::Dispatcher<librbd::ImageCtx, librbd::io::ImageDispatcherInterface>> = {<librbd::io::ImageDispatcherInterface> = {<librbd::io::DispatcherInterface<librbd::io::ImageDispatchInterface>> = {
_vptr.DispatcherInterface = 0x31bafc0 <vtable for librbd::io::ImageDispatcher<librbd::ImageCtx>+16>}, <No data fields>}, m_image_ctx = 0x4140880, m_lock = {<ceph::mutex_debug_detail::mutex_debugging_base> = {group = "librbd::io::Dispatcher::lock (0x7f4c80013e90)", id = 30, lockdep = true,
backtrace = false, nlock = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = 0}, static is_always_lock_free = true}, locked_by = {_M_thread = 0}}, rwlock = {__data = {__readers = 0, __writers = 0, __wrphase_futex = 0, __writers_futex = 0, __pad3 = 0, __pad4 = 0, __cur_writer = 0,
__shared = 0, __rwelision = 0 '\000', __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0}, __size = '\000' <repeats 55 times>, __align = 0}, track = true, nrlock = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 0}, static is_always_lock_free = true}},
m_dispatches = std::map with 5 elements = {[librbd::io::IMAGE_DISPATCH_LAYER_QUEUE] = {dispatch = 0x4116d00, async_op_tracker = 0x40b79f0}, [librbd::io::IMAGE_DISPATCH_LAYER_QOS] = {dispatch = 0x41153b0, async_op_tracker = 0x40be4d0}, [librbd::io::IMAGE_DISPATCH_LAYER_REFRESH] = {
dispatch = 0x4116d40, async_op_tracker = 0x41402f0}, [librbd::io::IMAGE_DISPATCH_LAYER_WRITE_BLOCK] = {dispatch = 0x41182d0, async_op_tracker = 0x40be550}, [librbd::io::IMAGE_DISPATCH_LAYER_CORE] = {dispatch = 0x4116ca0, async_op_tracker = 0x4205a40}}},
m_next_tid = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 516}, static is_always_lock_free = true}, m_qos_image_dispatch = 0x41153b0, m_write_block_dispatch = 0x41182d0}
The other dispatches are still in the list, so that must mean Exclusive Lock has called Dispatcher::shut_down_dispatch to delete itself during the close processing.
Updated by Casey Bodley over 1 year ago
from https://jenkins.ceph.com/job/ceph-pull-requests/144113/
The following tests FAILED:
33 - run-rbd-unit-tests-109.sh (Failed)
34 - run-rbd-unit-tests-127.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/run-rbd-unit-tests.sh: line 20: 3898836 Segmentation fault (core dumped) RBD_FEATURES=$i unittest_librbd
[ RUN ] TestLibRBD.TestPendingAio
using new format!
creating image with stripe unit: 65536, stripe count: 16
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (304 ms)
Updated by Casey Bodley over 1 year ago
https://jenkins.ceph.com/job/ceph-pull-requests/144496/
The following tests FAILED:
31 - run-rbd-unit-tests-1.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (247 ms)
Updated by Casey Bodley over 1 year ago
from https://jenkins.ceph.com/job/ceph-pull-requests/147785/
The following tests FAILED:
33 - run-rbd-unit-tests-109.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (515 ms)
Updated by Casey Bodley over 1 year ago
https://jenkins.ceph.com/job/ceph-pull-requests/148481/
The following tests FAILED:
31 - run-rbd-unit-tests-1.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (542 ms)
Updated by Casey Bodley 12 months ago
- Priority changed from Normal to Urgent
https://jenkins.ceph.com/job/ceph-pull-requests/154533/
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (130 ms)
Please fix or consider disabling. Raising priority for visibility.
Updated by Casey Bodley 12 months ago
https://jenkins.ceph.com/job/ceph-pull-requests/155151/
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (191 ms)
Updated by Casey Bodley 10 months ago
https://jenkins.ceph.com/job/ceph-pull-requests/159202/
The following tests FAILED:
31 - run-rbd-unit-tests-1.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (95 ms)
Updated by Casey Bodley 9 months ago
https://jenkins.ceph.com/job/ceph-pull-requests-arm64/75695/
The following tests FAILED:
32 - run-rbd-unit-tests-61.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/librbd/test_librbd.cc:9203: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (304 ms)
Updated by Casey Bodley 6 months ago
https://jenkins.ceph.com/job/ceph-pull-requests/167525/
The following tests FAILED:
31 - run-rbd-unit-tests-1.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/librbd/test_librbd.cc:9012: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (247 ms)
Updated by Casey Bodley 4 months ago
https://jenkins.ceph.com/job/ceph-pull-requests-arm64/83245/
The following tests FAILED:
31 - run-rbd-unit-tests-1.sh (Failed)
[ RUN ] TestLibRBD.TestPendingAio
using new format!
/ceph/src/test/librbd/test_librbd.cc:9013: Failure
Expected equality of these values:
1
rbd_aio_is_complete(comps[i])
Which is: 0
[ FAILED ] TestLibRBD.TestPendingAio (182 ms)
Updated by Ilya Dryomov 25 days ago
- Related to Bug #75163: disable known flaky tests in run-rbd-unit-tests added
Updated by Ilya Dryomov 25 days ago
Disabled in "make check" context via https://github.com/ceph/ceph/pull/67525.