Bug #42884

closed

OSDMapTest.CleanPGUpmaps failure

Added by Jeff Layton over 6 years ago. Updated 3 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
quincy,reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-3759-g582e882c43
Released In:
v20.2.0~2384
Upkeep Timestamp:
2025-11-01T00:58:51+00:00

Description

During a make check build for a PR, I got this crash during OSD testing:

https://jenkins.ceph.com/job/ceph-pull-requests/38964/console

[       OK ] OSDMapTest.PrimaryAffinity (882 ms)
[ RUN      ] OSDMapTest.get_osd_crush_node_flags
[       OK ] OSDMapTest.get_osd_crush_node_flags (1 ms)
[ RUN      ] OSDMapTest.parse_osd_id_list
Expected option value to be integer, got 'foo'invalid osd id 'foo'expected numerical value, got: -12invalid osd id '-12'[       OK ] OSDMapTest.parse_osd_id_list (0 ms)
[ RUN      ] OSDMapTest.CleanPGUpmaps
pure virtual method called
terminate called without an active exception
*** Caught signal (Aborted) **
 in thread 7fc6950aa700 thread_name:clean_upmap_tp
 ceph version Development (no_version) octopus (dev)
 1: /home/jenkins-build/build/workspace/ceph-pull-requests/build/bin/unittest_osdmap() [0x62fa00]
 2: (()+0x11390) [0x7fc6a70ff390]
 3: (gsignal()+0x38) [0x7fc69bbb2428]
 4: (abort()+0x16a) [0x7fc69bbb402a]
 5: (()+0x998ae) [0x7fc69c1f88ae]
 6: (()+0xa54b6) [0x7fc69c2044b6]
 7: (()+0xa5521) [0x7fc69c204521]
 8: (()+0xa626f) [0x7fc69c20526f]
 9: (ThreadPool::WorkQueue<ParallelPGMapper::Item>::_void_dequeue()+0x23) [0x5def41]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x517) [0x7fc69de88e2b]
 11: (ThreadPool::WorkThread::entry()+0x32) [0x7fc69de8cff0]
 12: (Thread::entry_wrapper()+0x78) [0x7fc69de68bac]
 13: (Thread::_entry_func(void*)+0x18) [0x7fc69de68b2a]
 14: (()+0x76ba) [0x7fc6a70f56ba]
 15: (clone()+0x6d) [0x7fc69bc8441d]
2019-11-19T13:31:17.884+0000 7fc6950aa700 -1 *** Caught signal (Aborted) **
 in thread 7fc6950aa700 thread_name:clean_upmap_tp

Build is based on commit 05d685dd37b34f2a0, with some cephfs patches on top (nothing that should affect OSD tests).


Related issues 3 (2 open, 1 closed)

Copied to RADOS - Backport #67235: reef: OSDMapTest.CleanPGUpmaps failure (In Progress, MOHIT AGRAWAL)
Copied to RADOS - Backport #67236: quincy: OSDMapTest.CleanPGUpmaps failure (Resolved, MOHIT AGRAWAL)
Copied to RADOS - Backport #67237: squid: OSDMapTest.CleanPGUpmaps failure (In Progress, MOHIT AGRAWAL)
Actions #1

Updated by Jeff Layton over 6 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from crash workqueue code to OSDMapTest.CleanPGUpmaps failure
Actions #2

Updated by Kefu Chai over 6 years ago

not reproducible locally. i am testing 9b61479da4f89014b6d1857287102bbc9db13e6e

Actions #6

Updated by Rongqi Sun over 1 year ago

Hi @MOHIT AGRAWAL, would you mind having a look at this too?

Ref:
https://jenkins.ceph.com/job/ceph-pull-requests-arm64/58812/consoleFull#-1172629797e840cee4-f4a4-4183-81dd-42855615f2c1

[ RUN      ] OSDMapTest.CleanPGUpmaps
pure virtual method called
terminate called without an active exception
*** Caught signal (Aborted) **
 in thread ffff9ace5d60 thread_name:clean_upmap_tp
 ceph version Development (no_version) squid (dev)
 1: /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/build/bin/unittest_osdmap(+0x2e5c88) [0xaaaac36c5c88]
 2: __kernel_rt_sigreturn()
 3: /lib/aarch64-linux-gnu/libc.so.6(+0x7f200) [0xffffa184f200]
 4: raise()
 5: abort()
 6: (__gnu_cxx::__verbose_terminate_handler()+0x124) [0xffffa1afb364]
 7: /lib/aarch64-linux-gnu/libstdc++.so.6(+0xa8a0c) [0xffffa1af8a0c]
 8: /lib/aarch64-linux-gnu/libstdc++.so.6(+0xa8a70) [0xffffa1af8a70]
 9: __cxa_deleted_virtual()
 10: (ThreadPool::WorkQueue<ParallelPGMapper::Item>::_void_dequeue()+0x20) [0xaaaac36136dc]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5f8) [0xffffa3a63bfc]
 12: (ThreadPool::WorkThread::entry()+0x24) [0xffffa3a693a8]
 13: (Thread::entry_wrapper()+0xa0) [0xffffa3a3a31c]
 14: (Thread::_entry_func(void*)+0x18) [0xffffa3a3a268]
 15: /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8) [0xffffa184d5c8]
 16: /lib/aarch64-linux-gnu/libc.so.6(+0xe5edc) [0xffffa18b5edc]
2024-06-28T17:03:41.116-0400 ffff9ace5d60 -1 *** Caught signal (Aborted) **
 in thread ffff9ace5d60 thread_name:clean_upmap_tp

 ceph version Development (no_version) squid (dev)
 1: /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/build/bin/unittest_osdmap(+0x2e5c88) [0xaaaac36c5c88]
 2: __kernel_rt_sigreturn()
 3: /lib/aarch64-linux-gnu/libc.so.6(+0x7f200) [0xffffa184f200]
 4: raise()
 5: abort()
 6: (__gnu_cxx::__verbose_terminate_handler()+0x124) [0xffffa1afb364]
 7: /lib/aarch64-linux-gnu/libstdc++.so.6(+0xa8a0c) [0xffffa1af8a0c]
 8: /lib/aarch64-linux-gnu/libstdc++.so.6(+0xa8a70) [0xffffa1af8a70]
 9: __cxa_deleted_virtual()
 10: (ThreadPool::WorkQueue<ParallelPGMapper::Item>::_void_dequeue()+0x20) [0xaaaac36136dc]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5f8) [0xffffa3a63bfc]
 12: (ThreadPool::WorkThread::entry()+0x24) [0xffffa3a693a8]
 13: (Thread::entry_wrapper()+0xa0) [0xffffa3a3a31c]
 14: (Thread::_entry_func(void*)+0x18) [0xffffa3a3a268]
 15: /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8) [0xffffa184d5c8]
 16: /lib/aarch64-linux-gnu/libc.so.6(+0xe5edc) [0xffffa18b5edc]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -26> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command assert hook 0xaaaaded90070
   -25> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command abort hook 0xaaaaded90070
   -24> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command leak_some_memory hook 0xaaaaded90070
   -23> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command perfcounters_dump hook 0xaaaaded90070
   -22> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command 1 hook 0xaaaaded90070
   -21> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command perf dump hook 0xaaaaded90070
   -20> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command perfcounters_schema hook 0xaaaaded90070
   -19> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command perf histogram dump hook 0xaaaaded90070
   -18> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command 2 hook 0xaaaaded90070
   -17> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command perf schema hook 0xaaaaded90070
   -16> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command counter dump hook 0xaaaaded90070
   -15> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command counter schema hook 0xaaaaded90070
   -14> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command perf histogram schema hook 0xaaaaded90070
   -13> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command perf reset hook 0xaaaaded90070
   -12> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command config show hook 0xaaaaded90070
   -11> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command config help hook 0xaaaaded90070
   -10> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command config set hook 0xaaaaded90070
    -9> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command config unset hook 0xaaaaded90070
    -8> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command config get hook 0xaaaaded90070
    -7> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command config diff hook 0xaaaaded90070
    -6> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command config diff get hook 0xaaaaded90070
    -5> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command injectargs hook 0xaaaaded90070
    -4> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command log flush hook 0xaaaaded90070
    -3> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command log dump hook 0xaaaaded90070
    -2> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command log reopen hook 0xaaaaded90070
    -1> 2024-06-28T17:03:38.712-0400 ffffa4ae5020  5 asok(0xaaaadee11660) register_command dump_mempools hook 0xaaaadeec0a88
     0> 2024-06-28T17:03:41.116-0400 ffff9ace5d60 -1 *** Caught signal (Aborted) **
 in thread ffff9ace5d60 thread_name:clean_upmap_tp

 ceph version Development (no_version) squid (dev)
 1: /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/build/bin/unittest_osdmap(+0x2e5c88) [0xaaaac36c5c88]
 2: __kernel_rt_sigreturn()
 3: /lib/aarch64-linux-gnu/libc.so.6(+0x7f200) [0xffffa184f200]
 4: raise()
 5: abort()
 6: (__gnu_cxx::__verbose_terminate_handler()+0x124) [0xffffa1afb364]
 7: /lib/aarch64-linux-gnu/libstdc++.so.6(+0xa8a0c) [0xffffa1af8a0c]
 8: /lib/aarch64-linux-gnu/libstdc++.so.6(+0xa8a70) [0xffffa1af8a70]
 9: __cxa_deleted_virtual()
 10: (ThreadPool::WorkQueue<ParallelPGMapper::Item>::_void_dequeue()+0x20) [0xaaaac36136dc]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5f8) [0xffffa3a63bfc]
 12: (ThreadPool::WorkThread::entry()+0x24) [0xffffa3a693a8]
 13: (Thread::entry_wrapper()+0xa0) [0xffffa3a3a31c]
 14: (Thread::_entry_func(void*)+0x18) [0xffffa3a3a268]
 15: /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8) [0xffffa184d5c8]
 16: /lib/aarch64-linux-gnu/libc.so.6(+0xe5edc) [0xffffa18b5edc]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #7

Updated by Laura Flores over 1 year ago

  • Assignee set to MOHIT AGRAWAL

@MOHIT AGRAWAL this looks similar to the other issue you were working on; maybe you have an idea.

Actions #8

Updated by MOHIT AGRAWAL over 1 year ago

The test case crashes while calling clean_pg_upmap. The function's workflow is as follows:

1) Create a threadpool object
2) Start the thread pool
3) Create a ParallelPGMapper object (mapper), passing a reference to the threadpool
4) Create a CleanUpmapJob object
5) Call mapper.queue() to insert the job
6) Call job.wait()
7) Stop the thread pool

Inspecting the process core shows the following; no object appears to be corrupted.

    Thread 2 (Thread 0x7f7848256200 (LWP 24355)):
#0  0x00007f7847eab470 in __lll_lock_wait () from /lib64/libc.so.6
#1  0x00007f7847eb1e61 in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/libc.so.6
#2  0x00007f7849201913 in ceph::mutex_debug_detail::mutex_debug_impl<false>::lock_impl (this=this@entry=0x7ffdbce66660) at /nvme0/ceph/src/common/mutex_debug.h:122
#3  0x00007f7849201bf1 in ceph::mutex_debug_detail::mutex_debug_impl<false>::lock (this=0x7ffdbce66660, no_lockdep=no_lockdep@entry=false) at /nvme0/ceph/src/common/mutex_debug.h:188
#4  0x00007f78495e171c in ThreadPool::WorkQueue<ParallelPGMapper::Item>::queue (this=this@entry=0x7ffdbce66458, item=item@entry=0x565057f74dd0) at /nvme0/ceph/src/common/WorkQueue.h:251
#5  0x00007f78495e0dd8 in ParallelPGMapper::queue (this=this@entry=0x7ffdbce66400, job=job@entry=0x7ffdbce664a0, pgs_per_item=pgs_per_item@entry=256, input_pgs=std::vector of length 3, capacity 3 = {...}) at /nvme0/ceph/src/osd/OSDMapMapping.cc:189
#6  0x0000565057913f8f in OSDMapTest::clean_pg_upmaps (this=this@entry=0x565057f66740, cct=0x565057e1b080, om=..., pending_inc=...) at /nvme0/ceph/src/test/osd/TestOSDMap.cc:319
#7  0x00005650578e7453 in OSDMapTest_BUG_51842_Test::TestBody (this=0x565057f66740) at /nvme0/ceph/src/test/osd/TestOSDMap.cc:2295
#8  0x0000565057931cad in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x565057f66740, method=<optimized out>, location=location@entry=0x565057957748 "the test body") at /nvme0/ceph/src/googletest/googletest/src/gtest.cc:2605
#9  0x000056505793a080 in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x565057f66740, method=&virtual testing::Test::TestBody(), location=location@entry=0x565057957748 "the test body") at /nvme0/ceph/src/googletest/googletest/src/gtest.cc:2641
--Type <RET> for more, q to quit, c to continue without paging--
#10 0x000056505792af9a in testing::Test::Run (this=this@entry=0x565057f66740) at /nvme0/ceph/src/googletest/googletest/src/gtest.cc:2680
#11 0x000056505792b0a0 in testing::TestInfo::Run (this=0x565057f5cc80) at /nvme0/ceph/src/googletest/googletest/src/gtest.cc:2858
#12 0x000056505792b154 in testing::TestSuite::Run (this=0x565057dcf040) at /nvme0/ceph/src/googletest/googletest/src/gtest.cc:3012
#13 0x000056505792c80b in testing::internal::UnitTestImpl::RunAllTests (this=0x565057e142a0) at /nvme0/ceph/src/googletest/googletest/src/gtest.cc:5723
#14 0x0000565057931f59 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=object@entry=0x565057e142a0, method=<optimized out>, location=location@entry=0x565057958898 "auxiliary test code (environments or event listeners)") at /nvme0/ceph/src/googletest/googletest/src/gtest.cc:2605
#15 0x000056505793a5dd in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x565057e142a0, method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x56505792c4a0 <testing::internal::UnitTestImpl::RunAllTests()>, location=location@entry=0x565057958898 "auxiliary test code (environments or event listeners)") at /nvme0/ceph/src/googletest/googletest/src/gtest.cc:2641
#16 0x000056505792b26f in testing::UnitTest::Run (this=0x565057999c00 <testing::UnitTest::GetInstance()::instance>) at /nvme0/ceph/src/googletest/googletest/src/gtest.cc:5306
#17 0x00005650578ea2b8 in RUN_ALL_TESTS () at /nvme0/ceph/src/googletest/googletest/include/gtest/gtest.h:2486
#18 0x00005650578c3d84 in main (argc=<optimized out>, argv=0x7ffdbce68f68) at /nvme0/ceph/src/test/osd/TestOSDMap.cc:32

Thread 1 (Thread 0x7f783eb676c0 (LWP 24477)):
#0  0x00007f7847eb0884 in _pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007f7847e5fafe in raise () from /lib64/libc.so.6
#2  0x00005650579486ee in reraise_fatal (signum=signum@entry=6) at /nvme0/ceph/src/global/signal_handler.cc:88
#3  0x0000565057949a3c in handle_oneshot_fatal_signal (signum=6) at /nvme0/ceph/src/global/signal_handler.cc:367
#4  <signal handler called>
#5  0x00007f7847eb0884 in __pthread_kill_implementation () from /lib64/libc.so.6
#6  0x00007f7847e5fafe in raise () from /lib64/libc.so.6
#7  0x00007f7847e4887f in abort () from /lib64/libc.so.6
#8  0x00007f78480a4d39 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
#9  0x00007f78480b4f6c in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
#10 0x00007f78480b4fd7 in std::terminate() () from /lib64/libstdc++.so.6
#11 0x00007f78480b5d15 in __cxa_pure_virtual () from /lib64/libstdc++.so.6
#12 0x00005650578e952e in ThreadPool::WorkQueue<ParallelPGMapper::Item>::_void_dequeue (this=<optimized out>) at /nvme0/ceph/src/common/WorkQueue.h:226
#13 0x00007f7849278c81 in ThreadPool::worker (this=0x7ffdbce665f0, wt=<optimized out>) at /nvme0/ceph/src/common/WorkQueue.cc:111
#14 0x00007f784927b835 in ThreadPool::WorkThread::entry (this=<optimized out>) at /nvme0/ceph/src/common/WorkQueue.h:401
#15 0x00007f7849267e51 in Thread::entry_wrapper (this=0x565057f74d50) at /nvme0/ceph/src/common/Thread.cc:87
#16 0x00007f7849267e69 in Thread::_entry_func (arg=<optimized out>) at /nvme0/ceph/src/common/Thread.cc:74
#17 0x00007f7847eae947 in start_thread () from /lib64/libc.so.6
#18 0x00007f7847f34860 in clone3 () from /lib64/libc.so.6

(gdb) f 13
#13 0x00007f7849278c81 in ThreadPool::worker (this=0x7ffdbce665f0, wt=<optimized out>) at /nvme0/ceph/src/common/WorkQueue.cc:111
111       void *item = wq->_void_dequeue();
(gdb) p wq
$1 = (ThreadPool::WorkQueue_ *) 0x7ffdbce66458
(gdb) p *wq
$2 = {_vptr.WorkQueue_ = 0x565057993b10 <vtable for ParallelPGMapper::WQ+16>, name = "ParallelPGMapper::WQ",
  timeout_interval = std::atomic<std::chrono::duration<unsigned long, std::ratio<1, 1000000000> >> = { std::chrono::duration = { 60000000000ns } },
  suicide_interval = std::atomic<std::chrono::duration<unsigned long, std::ratio<1, 1000000000> >> = { std::chrono::duration = { 0ns } }}
(gdb) p ((ParallelPGMapper *)0x7ffdbce66400)
$8 = (ParallelPGMapper *) 0x7ffdbce66400
(gdb) p ((ParallelPGMapper *)0x7ffdbce66400)->wq
$9 = {<ThreadPool::WorkQueue<ParallelPGMapper::Item>> = {<ThreadPool::WorkQueue_> = {_vptr.WorkQueue_ = 0x565057993b10 <vtable for ParallelPGMapper::WQ+16>,
      name = "ParallelPGMapper::WQ",
      timeout_interval = std::atomic<std::chrono::duration<unsigned long, std::ratio<1, 1000000000> >> = { std::chrono::duration = { 60000000000ns } },
      suicide_interval = std::atomic<std::chrono::duration<unsigned long, std::ratio<1, 1000000000> >> = { std::chrono::duration = { 0ns } }},
    pool = 0x7ffdbce665f0}, m = 0x7ffdbce66400}
(gdb) p &((ParallelPGMapper*)0x7ffdbce66400)->wq
$10 = (ParallelPGMapper::WQ *) 0x7ffdbce66458
(gdb) p *((ParallelPGMapper *)0x7ffdbce66400)->wq.pool
$13 = {<ceph::md_config_obs_impl<ceph::common::ConfigProxy>> = {_vptr.md_config_obs_impl = 0x7f7849a99928 <vtable for ThreadPool+16>}, cct = 0x565057e1b080,
  name = "BUG_40104::clean_upmap_tp", thread_name = "clean_upmap_tp", lockname = "BUG_40104::clean_upmap_tp::lock",
  lock = {<ceph::mutex_debug_detail::mutex_debugging_base> = {group = "BUG_40104::clean_upmap_tp::lock", id = -1, lockdep = true, backtrace = false,
      nlock = std::atomic<int> = { 1 }, locked_by = {_M_thread = 140154424948416}}, m = {__data = {__lock = 2, __count = 0, __owner = 24477, __nusers = 6,
        __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
      __size = "\002\000\000\000\000\000\000\000\235_\000\000\006\000\000\000\002", '\000' <repeats 22 times>, __align = 2}}, _cond = {cond = {__data = {
        __wseq = {__value64 = 10, __value32 = {__low = 10, __high = 0}}, __g1_start = {__value64 = 0, __value32 = {__low = 0, __high = 0}}, __g_refs = {10,
          0}, __g_size = {0, 0}, __g1_orig_size = 0, __wrefs = 40, __g_signals = {0, 0}},
      __size = "\n", '\000' <repeats 15 times>, "\n", '\000' <repeats 19 times>, "(\000\000\000\000\000\000\000\000\000\000", __align = 10},
    waiter_mutex = 0x7ffdbce66660}, _stop = false, _pause = 0, _draining = 0, _wait_cond = {cond = {__data = {__wseq = {__value64 = 0, __value32 = {
            __low = 0, __high = 0}}, __g1_start = {__value64 = 0, __value32 = {__low = 0, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0},
        __g1_orig_size = 0, __wrefs = 0, __g_signals = {0, 0}}, __size = '\000' <repeats 47 times>, __align = 0}, waiter_mutex = 0x0}, _num_threads = 8,
  _thread_num_option = "", _conf_keys = 0x565058024cc0, work_queues = std::vector of length 1, capacity 1 = {0x7ffdbce66458}, next_work_queue = 1,
  _threads = std::set with 8 elements = {[0] = 0x565057f663d0, [1] = 0x565057f74d50, [2] = 0x565057f83500, [3] = 0x565057f83550, [4] = 0x565057f835a0,
    [5] = 0x565057f835f0, [6] = 0x565057f83670, [7] = 0x5650580252b0}, _old_threads = empty std::__cxx11::list, processing = 0}

There are two threads. Thread 2 is blocked in mapper.queue() because the lock is already held by thread 1 (a worker thread), and thread 1 is the one that crashes while calling the dequeue function.

The main issue is that we start the thread pool before creating the mapper object. Starting the thread pool eventually calls ThreadPool::worker, which fetches a workqueue object and tries to dequeue a job. The workqueue is registered with the pool during ParallelPGMapper construction (the WorkQueue base constructor calls pool->add_work_queue), so at that moment the object is not yet fully constructed. If a worker thread fetches the workqueue (wq) at this point and calls the virtual _void_dequeue(), the vtable slot is still the pure-virtual stub and the call aborts with __cxa_pure_virtual.

There are multiple ways to avoid this, but I think the best option is to construct the ParallelPGMapper object before starting the thread pool, so that by the time a worker thread fetches the workqueue the object is fully constructed. I tested a patch by running the test case 100 times in a loop and saw no crashes; without the patch the test case crashed about 15 times.

Actions #9

Updated by MOHIT AGRAWAL over 1 year ago

  • Pull request ID set to 58406
Actions #10

Updated by MOHIT AGRAWAL over 1 year ago

  • Status changed from New to Fix Under Review
Actions #11

Updated by Radoslaw Zarzynski over 1 year ago

scrub note: approved.

Actions #12

Updated by Radoslaw Zarzynski over 1 year ago

scrub note: awaits QA.

Actions #13

Updated by Laura Flores over 1 year ago

QA testing ongoing here: https://tracker.ceph.com/issues/66955

Actions #15

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to quincy,reef,squid
Actions #16

Updated by Upkeep Bot over 1 year ago

Actions #17

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #67236: quincy: OSDMapTest.CleanPGUpmaps failure added
Actions #18

Updated by Upkeep Bot over 1 year ago

Actions #19

Updated by Upkeep Bot over 1 year ago

  • Tags (freeform) set to backport_processed
Actions #20

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to 582e882c439e9f7acadd4caf4996089cefd12e07
  • Fixed In set to v19.3.0-3759-g582e882c439
  • Upkeep Timestamp set to 2025-07-09T18:56:48+00:00
Actions #21

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-3759-g582e882c439 to v19.3.0-3759-g582e882c43
  • Upkeep Timestamp changed from 2025-07-09T18:56:48+00:00 to 2025-07-14T18:12:48+00:00
Actions #22

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2384
  • Upkeep Timestamp changed from 2025-07-14T18:12:48+00:00 to 2025-11-01T00:58:51+00:00
Actions #23

Updated by MOHIT AGRAWAL 3 months ago

  • Status changed from Pending Backport to Closed

The pull request is merged.
