Merged

rzarzynski approved these changes on Aug 16, 2019.

rzarzynski (Owner) left a comment:
Thanks for the commit, @tchaikov! I will alter the commit title manually and rebase.
rzarzynski pushed a commit that referenced this pull request on Jan 16, 2020:
otherwise ODR is violated:

```
==449025==ERROR: AddressSanitizer: odr-violation (0x000000f03700):
  [1] size=8 'g_ceph_context' ../src/global/global_context.cc:24:14
  [2] size=8 'g_ceph_context' ../src/global/global_context.cc:24:14
These globals were registered at these points:
  [1]:
    #0 0x4779bd in __asan_register_globals (/var/ssd/ceph/clang-build/bin/ceph-conf+0x4779bd)
    #1 0x56e9cb in asan.module_ctor (/var/ssd/ceph/clang-build/bin/ceph-conf+0x56e9cb)

  [2]:
    #0 0x4779bd in __asan_register_globals (/var/ssd/ceph/clang-build/bin/ceph-conf+0x4779bd)
    #1 0x7fe5fed12aeb in asan.module_ctor (/var/ssd/ceph/clang-build/lib/libceph-common.so.2+0x2f34aeb)

==449025==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0
```

Signed-off-by: Kefu Chai <kchai@redhat.com>
rzarzynski added a commit that referenced this pull request on Feb 12, 2020:
According to cppreference.com [1]: "If multiple threads of execution access the same std::shared_ptr object without synchronization and any of those accesses uses a non-const member function of shared_ptr then a data race will occur (...)"

[1]: https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic

One of the coredumps showed the `shared_ptr`-typed `OSD::osdmap` with healthy-looking content but a damaged control block:

```
[Current thread is 1 (Thread 0x7f7dcaf73700 (LWP 205295))]
(gdb) bt
#0  0x0000559cb81c3ea0 in ?? ()
#1  0x0000559c97675b27 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#2  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#3  0x0000559c975ef8aa in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#4  std::__shared_ptr<OSDMap const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#5  std::shared_ptr<OSDMap const>::~shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr.h:103
#6  OSD::create_context (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9053
#7  0x0000559c97655571 in OSD::dequeue_peering_evt (this=0x559ca22ac000, sdata=0x559ca2ef2900, pg=0x559cb4aa3400, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9665
#8  0x0000559c97886db6 in ceph::osd::scheduler::PGPeeringItem::run (this=<optimized out>, osd=<optimized out>, sdata=<optimized out>, pg=..., handle=...) at /usr/include/c++/8/ext/atomicity.h:96
#9  0x0000559c9764862f in ceph::osd::scheduler::OpSchedulerItem::run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7f7dcaf703f0) at /usr/include/c++/8/bits/unique_ptr.h:342
#10 OSD::ShardedOpWQ::_process (this=<optimized out>, thread_index=<optimized out>, hb=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:10677
#11 0x0000559c97c76094 in ShardedThreadPool::shardedthreadpool_worker (this=0x559ca22aca28, thread_index=14) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.cc:311
#12 0x0000559c97c78cf4 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.h:706
#13 0x00007f7df17852de in start_thread () from /lib64/libpthread.so.0
#14 0x00007f7df052f133 in __libc_ifunc_impl_list () from /lib64/libc.so.6
#15 0x0000000000000000 in ?? ()
(gdb) frame 7
#7  0x0000559c97655571 in OSD::dequeue_peering_evt (this=0x559ca22ac000, sdata=0x559ca2ef2900, pg=0x559cb4aa3400, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9665
9665    in /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc
(gdb) print osdmap
$24 = std::shared_ptr<const OSDMap> (expired, weak count 0) = {get() = 0x559cba028000}
(gdb) print *osdmap
# pretty sane OSDMap
(gdb) print sizeof(osdmap)
$26 = 16
(gdb) x/2a &osdmap
0x559ca22acef0: 0x559cba028000  0x559cba0ec900
(gdb) frame 2
#2  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
148     /usr/include/c++/8/bits/shared_ptr_base.h: No such file or directory.
(gdb) disassemble
Dump of assembler code for function std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release():
...
   0x0000559c97675b1e <+62>:    mov    (%rdi),%rax
   0x0000559c97675b21 <+65>:    mov    %rdi,%rbx
   0x0000559c97675b24 <+68>:    callq  *0x10(%rax)
=> 0x0000559c97675b27 <+71>:    test   %rbp,%rbp
...
End of assembler dump.
(gdb) info registers rdi rbx rax
rdi            0x559cba0ec900      94131624790272
rbx            0x559cba0ec900      94131624790272
rax            0x559cba0ec8a0      94131624790176
(gdb) x/a 0x559cba0ec8a0 + 0x10
0x559cba0ec8b0: 0x559cb81c3ea0
(gdb) bt
#0  0x0000559cb81c3ea0 in ?? ()
...
(gdb) p $_siginfo._sifields._sigfault.si_addr
$27 = (void *) 0x559cb81c3ea0
```

Helgrind seems to agree:

```
==00:00:02:54.519 510301== Possible data race during write of size 8 at 0xF123930 by thread #90
==00:00:02:54.519 510301== Locks held: 2, at addresses 0xF122A58 0xF1239A8
==00:00:02:54.519 510301==    at 0x7218DD: operator= (shared_ptr_base.h:1078)
==00:00:02:54.519 510301==    by 0x7218DD: operator= (shared_ptr.h:103)
==00:00:02:54.519 510301==    by 0x7218DD: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8116)
==00:00:02:54.519 510301==    by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:02:54.519 510301==    by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:02:54.519 510301==    by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:02:54.519 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:02:54.519 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:02:54.519 510301==    by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:02:54.519 510301==
==00:00:02:54.519 510301== This conflicts with a previous read of size 8 by thread #117
==00:00:02:54.519 510301== Locks held: 1, at address 0x2123E9A0
==00:00:02:54.519 510301==    at 0x6B5842: __shared_ptr (shared_ptr_base.h:1165)
==00:00:02:54.519 510301==    by 0x6B5842: shared_ptr (shared_ptr.h:129)
==00:00:02:54.519 510301==    by 0x6B5842: get_osdmap (OSD.h:1700)
==00:00:02:54.519 510301==    by 0x6B5842: OSD::create_context() (OSD.cc:9053)
==00:00:02:54.519 510301==    by 0x71B570: OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) (OSD.cc:9665)
==00:00:02:54.519 510301==    by 0x71B997: OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&) (OSD.cc:9701)
==00:00:02:54.519 510301==    by 0x70E62E: run (OpSchedulerItem.h:148)
==00:00:02:54.519 510301==    by 0x70E62E: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (OSD.cc:10677)
==00:00:02:54.519 510301==    by 0xD3C093: ShardedThreadPool::shardedthreadpool_worker(unsigned int) (WorkQueue.cc:311)
==00:00:02:54.519 510301==    by 0xD3ECF3: ShardedThreadPool::WorkThreadSharded::entry() (WorkQueue.h:706)
==00:00:02:54.519 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:02:54.519 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:02:54.519 510301== Address 0xf123930 is 3,824 bytes inside a block of size 10,296 alloc'd
==00:00:02:54.519 510301==    at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
==00:00:02:54.519 510301==    by 0x66F766: main (ceph_osd.cc:688)
==00:00:02:54.519 510301== Block was alloc'd by thread #1
```

Actually there are plenty of similar issues reported, like:

```
==00:00:05:04.903 510301== Possible data race during read of size 8 at 0x1E3E0588 by thread #119
==00:00:05:04.903 510301== Locks held: 1, at address 0x1EAD41D0
==00:00:05:04.903 510301==    at 0x753165: clear (hashtable.h:2051)
==00:00:05:04.903 510301==    by 0x753165: std::_Hashtable<entity_addr_t, std::pair<entity_addr_t const, utime_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<entity_addr_t const, utime_t> >, std::__detail::_Select1st, std::equal_to<entity_addr_t>, std::hash<entity_addr_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() (hashtable.h:1369)
==00:00:05:04.903 510301==    by 0x75331C: ~unordered_map (unordered_map.h:102)
==00:00:05:04.903 510301==    by 0x75331C: OSDMap::~OSDMap() (OSDMap.h:350)
==00:00:05:04.903 510301==    by 0x753606: operator() (shared_cache.hpp:100)
==00:00:05:04.903 510301==    by 0x753606: std::_Sp_counted_deleter<OSDMap const*, SharedLRU<unsigned int, OSDMap const>::Cleanup, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:471)
==00:00:05:04.903 510301==    by 0x73BB26: _M_release (shared_ptr_base.h:155)
==00:00:05:04.903 510301==    by 0x73BB26: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:148)
==00:00:05:04.903 510301==    by 0x6B58A9: ~__shared_count (shared_ptr_base.h:728)
==00:00:05:04.903 510301==    by 0x6B58A9: ~__shared_ptr (shared_ptr_base.h:1167)
==00:00:05:04.903 510301==    by 0x6B58A9: ~shared_ptr (shared_ptr.h:103)
==00:00:05:04.903 510301==    by 0x6B58A9: OSD::create_context() (OSD.cc:9053)
==00:00:05:04.903 510301==    by 0x71B570: OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) (OSD.cc:9665)
==00:00:05:04.903 510301==    by 0x71B997: OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&) (OSD.cc:9701)
==00:00:05:04.903 510301==    by 0x70E62E: run (OpSchedulerItem.h:148)
==00:00:05:04.903 510301==    by 0x70E62E: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (OSD.cc:10677)
==00:00:05:04.903 510301==    by 0xD3C093: ShardedThreadPool::shardedthreadpool_worker(unsigned int) (WorkQueue.cc:311)
==00:00:05:04.903 510301==    by 0xD3ECF3: ShardedThreadPool::WorkThreadSharded::entry() (WorkQueue.h:706)
==00:00:05:04.903 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:05:04.903 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:05:04.903 510301==    by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:05:04.903 510301==
==00:00:05:04.903 510301== This conflicts with a previous write of size 8 by thread #90
==00:00:05:04.903 510301== Locks held: 2, at addresses 0xF122A58 0xF1239A8
==00:00:05:04.903 510301==    at 0x7531E1: clear (hashtable.h:2054)
==00:00:05:04.903 510301==    by 0x7531E1: std::_Hashtable<entity_addr_t, std::pair<entity_addr_t const, utime_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<entity_addr_t const, utime_t> >, std::__detail::_Select1st, std::equal_to<entity_addr_t>, std::hash<entity_addr_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() (hashtable.h:1369)
==00:00:05:04.903 510301==    by 0x75331C: ~unordered_map (unordered_map.h:102)
==00:00:05:04.903 510301==    by 0x75331C: OSDMap::~OSDMap() (OSDMap.h:350)
==00:00:05:04.903 510301==    by 0x753606: operator() (shared_cache.hpp:100)
==00:00:05:04.903 510301==    by 0x753606: std::_Sp_counted_deleter<OSDMap const*, SharedLRU<unsigned int, OSDMap const>::Cleanup, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:471)
==00:00:05:04.903 510301==    by 0x73BB26: _M_release (shared_ptr_base.h:155)
==00:00:05:04.903 510301==    by 0x73BB26: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:148)
==00:00:05:04.903 510301==    by 0x72191E: operator= (shared_ptr_base.h:747)
==00:00:05:04.903 510301==    by 0x72191E: operator= (shared_ptr_base.h:1078)
==00:00:05:04.903 510301==    by 0x72191E: operator= (shared_ptr.h:103)
==00:00:05:04.903 510301==    by 0x72191E: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8116)
==00:00:05:04.903 510301==    by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:05:04.903 510301==    by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:05:04.903 510301==    by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:05:04.903 510301== Address 0x1e3e0588 is 872 bytes inside a block of size 1,208 alloc'd
==00:00:05:04.903 510301==    at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
==00:00:05:04.903 510301==    by 0x6C7C0C: OSDService::try_get_map(unsigned int) (OSD.cc:1606)
==00:00:05:04.903 510301==    by 0x7213BD: get_map (OSD.h:699)
==00:00:05:04.903 510301==    by 0x7213BD: get_map (OSD.h:1732)
==00:00:05:04.903 510301==    by 0x7213BD: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8076)
==00:00:05:04.903 510301==    by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:05:04.903 510301==    by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:05:04.903 510301==    by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:05:04.903 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:05:04.903 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:05:04.903 510301==    by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on Feb 12, 2020:
Helgrind complains about:

```
==00:00:04:55.495 510301== Possible data race during read of size 8 at 0x1EF1B0B0 by thread #96
==00:00:04:55.495 510301== Locks held: 3, at addresses 0xEE8D650 0xEE8D748 0x1FFEFFEE08
==00:00:04:55.495 510301==    at 0x107F5C5: ~unique_ptr (unique_ptr.h:273)
==00:00:04:55.495 510301==    by 0x107F5C5: ~AuthConnectionMeta (Auth.h:163)
==00:00:04:55.495 510301==    by 0x107F5C5: std::_Sp_counted_ptr<AuthConnectionMeta*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:377)
==00:00:04:55.495 510301==    by 0x10C19EE: _M_release (shared_ptr_base.h:155)
==00:00:04:55.495 510301==    by 0x10C19EE: _M_release (shared_ptr_base.h:148)
==00:00:04:55.495 510301==    by 0x10C19EE: ~__shared_count (shared_ptr_base.h:728)
==00:00:04:55.495 510301==    by 0x10C19EE: ~__shared_ptr (shared_ptr_base.h:1167)
==00:00:04:55.495 510301==    by 0x10C19EE: ~shared_ptr (shared_ptr.h:103)
==00:00:04:55.495 510301==    by 0x10C19EE: Protocol::~Protocol() (Protocol.cc:14)
==00:00:04:55.495 510301==    by 0x1081A5C: ProtocolV2::~ProtocolV2() (ProtocolV2.cc:100)
==00:00:04:55.495 510301==    by 0x105AA48: operator() (unique_ptr.h:81)
==00:00:04:55.495 510301==    by 0x105AA48: ~unique_ptr (unique_ptr.h:274)
==00:00:04:55.495 510301==    by 0x105AA48: AsyncConnection::~AsyncConnection() (AsyncConnection.cc:149)
==00:00:04:55.495 510301==    by 0x105ABAC: AsyncConnection::~AsyncConnection() (AsyncConnection.cc:154)
==00:00:04:55.495 510301==    by 0xD2BAA0: RefCountedObject::put() const (RefCountedObj.cc:27)
==00:00:04:55.495 510301==    by 0xEB1E58: intrusive_ptr_release (RefCountedObj.h:193)
==00:00:04:55.495 510301==    by 0xEB1E58: ~intrusive_ptr (intrusive_ptr.hpp:98)
==00:00:04:55.495 510301==    by 0xEB1E58: ~pair (stl_pair.h:208)
==00:00:04:55.495 510301==    by 0xEB1E58: destroy<std::pair<const entity_addrvec_t, boost::intrusive_ptr<AsyncConnection> > > (new_allocator.h:140)
==00:00:04:55.495 510301==    by 0xEB1E58: destroy<std::pair<const entity_addrvec_t, boost::intrusive_ptr<AsyncConnection> > > (alloc_traits.h:487)
==00:00:04:55.495 510301==    by 0xEB1E58: _M_deallocate_node (hashtable_policy.h:2100)
==00:00:04:55.495 510301==    by 0xEB1E58: _M_erase (hashtable.h:1909)
==00:00:04:55.495 510301==    by 0xEB1E58: std::_Hashtable<entity_addrvec_t, std::pair<entity_addrvec_t const, boost::intrusive_ptr<AsyncConnection> >, std::allocator<std::pair<entity_addrvec_t const, boost::intrusive_ptr<AsyncConnection> > >, std::__detail::_Select1st, std::equal_to<entity_addrvec_t>, std::hash<entity_addrvec_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::erase(std::__detail::_Node_const_iterator<std::pair<entity_addrvec_t const, boost::intrusive_ptr<AsyncConnection> >, false, true>) (hashtable.h:1884)
==00:00:04:55.495 510301==    by 0xEB24EA: erase (hashtable.h:762)
==00:00:04:55.495 510301==    by 0xEB24EA: erase (unordered_map.h:798)
==00:00:04:55.495 510301==    by 0xEB24EA: AsyncMessenger::_lookup_conn(entity_addrvec_t const&) (AsyncMessenger.h:312)
==00:00:04:55.495 510301==    by 0xEAA08A: AsyncMessenger::connect_to(int, entity_addrvec_t const&, bool, bool) (AsyncMessenger.cc:701)
==00:00:04:55.495 510301==    by 0xEF7477: connect_to_mon (Messenger.h:530)
==00:00:04:55.495 510301==    by 0xEF7477: MonClient::_add_conn(unsigned int, unsigned long) (MonClient.cc:716)
==00:00:04:55.495 510301==    by 0xEF7C30: MonClient::_add_conns(unsigned long) (MonClient.cc:779)
==00:00:04:55.495 510301==    by 0xEFDED6: MonClient::_reopen_session(int) (MonClient.cc:691)
==00:00:04:55.495 510301==    by 0xF025F2: MonClient::tick() (MonClient.cc:925)
==00:00:04:55.495 510301==    by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:04:55.495 510301==    by 0xD35276: SafeTimer::timer_thread() (Timer.cc:96)
==00:00:04:55.495 510301==    by 0xD36850: SafeTimerThread::entry() (Timer.cc:30)
==00:00:04:55.495 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:04:55.495 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:04:55.495 510301==    by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:04:55.495 510301== Address 0x1ef1b0b0 is 128 bytes inside a block of size 136 alloc'd
==00:00:04:55.495 510301==    at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
==00:00:04:55.495 510301==    by 0x1082094: ProtocolV2::reset_recv_state()::{lambda()#2}::operator()() const [clone .isra.916] (ProtocolV2.cc:240)
==00:00:04:55.495 510301==    by 0x108244B: EventCenter::C_submit_event<ProtocolV2::reset_recv_state()::{lambda()#2}>::do_request(unsigned long) (Event.h:228)
==00:00:04:55.495 510301==    by 0xEB7465: EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (Event.cc:433)
==00:00:04:55.495 510301==    by 0xEBC1BB: operator() (Stack.cc:53)
==00:00:04:55.495 510301==    by 0xEBC1BB: std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (std_function.h:297)
==00:00:04:55.495 510301==    by 0xCF4AA32: ??? (in /usr/lib64/libstdc++.so.6.0.25)
==00:00:04:55.495 510301==    by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:04:55.495 510301==    by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:04:55.495 510301==    by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:04:55.495 510301== Block was alloc'd by thread #3
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Feb 12, 2020
Helgrind complains about:
```
==00:00:04:55.495 510301== Possible data race during read of size 8 at 0x1EF1B0B0 by thread #96
==00:00:04:55.495 510301== Locks held: 3, at addresses 0xEE8D650 0xEE8D748 0x1FFEFFEE08
==00:00:04:55.495 510301== at 0x107F5C5: ~unique_ptr (unique_ptr.h:273)
==00:00:04:55.495 510301== by 0x107F5C5: ~AuthConnectionMeta (Auth.h:163)
==00:00:04:55.495 510301== by 0x107F5C5: std::_Sp_counted_ptr<AuthConnectionMeta*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:377)
==00:00:04:55.495 510301== by 0x10C19EE: _M_release (shared_ptr_base.h:155)
==00:00:04:55.495 510301== by 0x10C19EE: _M_release (shared_ptr_base.h:148)
==00:00:04:55.495 510301== by 0x10C19EE: ~__shared_count (shared_ptr_base.h:728)
==00:00:04:55.495 510301== by 0x10C19EE: ~__shared_ptr (shared_ptr_base.h:1167)
==00:00:04:55.495 510301== by 0x10C19EE: ~shared_ptr (shared_ptr.h:103)
==00:00:04:55.495 510301== by 0x10C19EE: Protocol::~Protocol() (Protocol.cc:14)
==00:00:04:55.495 510301== by 0x1081A5C: ProtocolV2::~ProtocolV2() (ProtocolV2.cc:100)
==00:00:04:55.495 510301== by 0x105AA48: operator() (unique_ptr.h:81)
==00:00:04:55.495 510301== by 0x105AA48: ~unique_ptr (unique_ptr.h:274)
==00:00:04:55.495 510301== by 0x105AA48: AsyncConnection::~AsyncConnection() (AsyncConnection.cc:149)
==00:00:04:55.495 510301== by 0x105ABAC: AsyncConnection::~AsyncConnection() (AsyncConnection.cc:154)
==00:00:04:55.495 510301== by 0xD2BAA0: RefCountedObject::put() const (RefCountedObj.cc:27)
==00:00:04:55.495 510301== by 0xEB1E58: intrusive_ptr_release (RefCountedObj.h:193)
==00:00:04:55.495 510301== by 0xEB1E58: ~intrusive_ptr (intrusive_ptr.hpp:98)
==00:00:04:55.495 510301== by 0xEB1E58: ~pair (stl_pair.h:208)
==00:00:04:55.495 510301== by 0xEB1E58: destroy<std::pair<const entity_addrvec_t, boost::intrusive_ptr<AsyncConnection> > > (new_allocator.h:140)
==00:00:04:55.495 510301== by 0xEB1E58: destroy<std::pair<const entity_addrvec_t, boost::intrusive_ptr<AsyncConnection> > > (alloc_traits.h:487)
==00:00:04:55.495 510301== by 0xEB1E58: _M_deallocate_node (hashtable_policy.h:2100)
==00:00:04:55.495 510301== by 0xEB1E58: _M_erase (hashtable.h:1909)
==00:00:04:55.495 510301== by 0xEB1E58: std::_Hashtable<entity_addrvec_t, std::pair<entity_addrvec_t const, boost::intrusive_ptr<AsyncConnection> >, std::allocator<std::pair<entity_addrvec_t const, boost::intrusive_ptr<AsyncConnection> > >, std::__detail::_Select1st, std::equal_to<entity_addrvec_t>, std::hash<entity_addrvec_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::erase(std::__detail::_Node_const_iterator<std::pair<entity_addrvec_t const, boost::intrusive_ptr<AsyncConnection> >, false, true>) (hashtable.h:1884)
==00:00:04:55.495 510301== by 0xEB24EA: erase (hashtable.h:762)
==00:00:04:55.495 510301== by 0xEB24EA: erase (unordered_map.h:798)
==00:00:04:55.495 510301== by 0xEB24EA: AsyncMessenger::_lookup_conn(entity_addrvec_t const&) (AsyncMessenger.h:312)
==00:00:04:55.495 510301== by 0xEAA08A: AsyncMessenger::connect_to(int, entity_addrvec_t const&, bool, bool) (AsyncMessenger.cc:701)
==00:00:04:55.495 510301== by 0xEF7477: connect_to_mon (Messenger.h:530)
==00:00:04:55.495 510301== by 0xEF7477: MonClient::_add_conn(unsigned int, unsigned long) (MonClient.cc:716)
==00:00:04:55.495 510301== by 0xEF7C30: MonClient::_add_conns(unsigned long) (MonClient.cc:779)
==00:00:04:55.495 510301== by 0xEFDED6: MonClient::_reopen_session(int) (MonClient.cc:691)
==00:00:04:55.495 510301== by 0xF025F2: MonClient::tick() (MonClient.cc:925)
==00:00:04:55.495 510301== by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:04:55.495 510301== by 0xD35276: SafeTimer::timer_thread() (Timer.cc:96)
==00:00:04:55.495 510301== by 0xD36850: SafeTimerThread::entry() (Timer.cc:30)
==00:00:04:55.495 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:04:55.495 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:04:55.495 510301== by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:04:55.495 510301== Address 0x1ef1b0b0 is 128 bytes inside a block of size 136 alloc'd
==00:00:04:55.495 510301== at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
==00:00:04:55.495 510301== by 0x1082094: ProtocolV2::reset_recv_state()::{lambda()#2}::operator()() const [clone .isra.916] (ProtocolV2.cc:240)
==00:00:04:55.495 510301== by 0x108244B: EventCenter::C_submit_event<ProtocolV2::reset_recv_state()::{lambda()#2}>::do_request(unsigned long) (Event.h:228)
==00:00:04:55.495 510301== by 0xEB7465: EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (Event.cc:433)
==00:00:04:55.495 510301== by 0xEBC1BB: operator() (Stack.cc:53)
==00:00:04:55.495 510301== by 0xEBC1BB: std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (std_function.h:297)
==00:00:04:55.495 510301== by 0xCF4AA32: ??? (in /usr/lib64/libstdc++.so.6.0.25)
==00:00:04:55.495 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:04:55.495 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:04:55.495 510301== by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:04:55.495 510301== Block was alloc'd by thread #3
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Feb 14, 2020
According to cppreference.com [1]: "If multiple threads of execution access the same std::shared_ptr object without synchronization and any of those accesses uses a non-const member function of shared_ptr then a data race will occur (...)"
[1]: https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic
One of the coredumps showed the `shared_ptr`-typed `OSD::osdmap` with healthy looking content but damaged control block:
```
[Current thread is 1 (Thread 0x7f7dcaf73700 (LWP 205295))]
(gdb) bt
#0  0x0000559cb81c3ea0 in ?? ()
#1  0x0000559c97675b27 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#2  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#3  0x0000559c975ef8aa in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#4  std::__shared_ptr<OSDMap const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#5  std::shared_ptr<OSDMap const>::~shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr.h:103
#6  OSD::create_context (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9053
#7  0x0000559c97655571 in OSD::dequeue_peering_evt (this=0x559ca22ac000, sdata=0x559ca2ef2900, pg=0x559cb4aa3400, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9665
#8  0x0000559c97886db6 in ceph::osd::scheduler::PGPeeringItem::run (this=<optimized out>, osd=<optimized out>, sdata=<optimized out>, pg=..., handle=...) at /usr/include/c++/8/ext/atomicity.h:96
#9  0x0000559c9764862f in ceph::osd::scheduler::OpSchedulerItem::run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7f7dcaf703f0) at /usr/include/c++/8/bits/unique_ptr.h:342
#10 OSD::ShardedOpWQ::_process (this=<optimized out>, thread_index=<optimized out>, hb=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:10677
#11 0x0000559c97c76094 in ShardedThreadPool::shardedthreadpool_worker (this=0x559ca22aca28, thread_index=14) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.cc:311
#12 0x0000559c97c78cf4 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.h:706
#13 0x00007f7df17852de in start_thread () from /lib64/libpthread.so.0
#14 0x00007f7df052f133 in __libc_ifunc_impl_list () from /lib64/libc.so.6
#15 0x0000000000000000 in ?? ()
(gdb) frame 7
#7  0x0000559c97655571 in OSD::dequeue_peering_evt (this=0x559ca22ac000, sdata=0x559ca2ef2900, pg=0x559cb4aa3400, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9665
9665    in /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc
(gdb) print osdmap
$24 = std::shared_ptr<const OSDMap> (expired, weak count 0) = {get() = 0x559cba028000}
(gdb) print *osdmap
# pretty sane OSDMap
(gdb) print sizeof(osdmap)
$26 = 16
(gdb) x/2a &osdmap
0x559ca22acef0: 0x559cba028000  0x559cba0ec900
(gdb) frame 2
#2  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
148     /usr/include/c++/8/bits/shared_ptr_base.h: No such file or directory.
(gdb) disassemble
Dump of assembler code for function std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release():
...
   0x0000559c97675b1e <+62>:  mov    (%rdi),%rax
   0x0000559c97675b21 <+65>:  mov    %rdi,%rbx
   0x0000559c97675b24 <+68>:  callq  *0x10(%rax)
=> 0x0000559c97675b27 <+71>:  test   %rbp,%rbp
...
End of assembler dump.
(gdb) info registers rdi rbx rax
rdi            0x559cba0ec900      94131624790272
rbx            0x559cba0ec900      94131624790272
rax            0x559cba0ec8a0      94131624790176
(gdb) x/a 0x559cba0ec8a0 + 0x10
0x559cba0ec8b0: 0x559cb81c3ea0
(gdb) bt
#0  0x0000559cb81c3ea0 in ?? ()
...
(gdb) p $_siginfo._sifields._sigfault.si_addr
$27 = (void *) 0x559cb81c3ea0
```
Helgrind seems to agree:
```
==00:00:02:54.519 510301== Possible data race during write of size 8 at 0xF123930 by thread #90
==00:00:02:54.519 510301== Locks held: 2, at addresses 0xF122A58 0xF1239A8
==00:00:02:54.519 510301== at 0x7218DD: operator= (shared_ptr_base.h:1078)
==00:00:02:54.519 510301== by 0x7218DD: operator= (shared_ptr.h:103)
==00:00:02:54.519 510301== by 0x7218DD: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8116)
==00:00:02:54.519 510301== by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:02:54.519 510301== by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:02:54.519 510301== by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:02:54.519 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:02:54.519 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:02:54.519 510301== by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:02:54.519 510301==
==00:00:02:54.519 510301== This conflicts with a previous read of size 8 by thread #117
==00:00:02:54.519 510301== Locks held: 1, at address 0x2123E9A0
==00:00:02:54.519 510301== at 0x6B5842: __shared_ptr (shared_ptr_base.h:1165)
==00:00:02:54.519 510301== by 0x6B5842: shared_ptr (shared_ptr.h:129)
==00:00:02:54.519 510301== by 0x6B5842: get_osdmap (OSD.h:1700)
==00:00:02:54.519 510301== by 0x6B5842: OSD::create_context() (OSD.cc:9053)
==00:00:02:54.519 510301== by 0x71B570: OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) (OSD.cc:9665)
==00:00:02:54.519 510301== by 0x71B997: OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&) (OSD.cc:9701)
==00:00:02:54.519 510301== by 0x70E62E: run (OpSchedulerItem.h:148)
==00:00:02:54.519 510301== by 0x70E62E: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (OSD.cc:10677)
==00:00:02:54.519 510301== by 0xD3C093: ShardedThreadPool::shardedthreadpool_worker(unsigned int) (WorkQueue.cc:311)
==00:00:02:54.519 510301== by 0xD3ECF3: ShardedThreadPool::WorkThreadSharded::entry() (WorkQueue.h:706)
==00:00:02:54.519 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:02:54.519 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:02:54.519 510301== Address 0xf123930 is 3,824 bytes inside a block of size 10,296 alloc'd
==00:00:02:54.519 510301== at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
==00:00:02:54.519 510301== by 0x66F766: main (ceph_osd.cc:688)
==00:00:02:54.519 510301== Block was alloc'd by thread #1
```
Actually there are plenty of similar issues reported, like:
```
==00:00:05:04.903 510301== Possible data race during read of size 8 at 0x1E3E0588 by thread #119
==00:00:05:04.903 510301== Locks held: 1, at address 0x1EAD41D0
==00:00:05:04.903 510301== at 0x753165: clear (hashtable.h:2051)
==00:00:05:04.903 510301== by 0x753165: std::_Hashtable<entity_addr_t, std::pair<entity_addr_t const, utime_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<entity_addr_t const, utime_t> >, std::__detail::_Select1st, std::equal_to<entity_addr_t>, std::hash<entity_addr_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() (hashtable.h:1369)
==00:00:05:04.903 510301== by 0x75331C: ~unordered_map (unordered_map.h:102)
==00:00:05:04.903 510301== by 0x75331C: OSDMap::~OSDMap() (OSDMap.h:350)
==00:00:05:04.903 510301== by 0x753606: operator() (shared_cache.hpp:100)
==00:00:05:04.903 510301== by 0x753606: std::_Sp_counted_deleter<OSDMap const*, SharedLRU<unsigned int, OSDMap const>::Cleanup, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:471)
==00:00:05:04.903 510301== by 0x73BB26: _M_release (shared_ptr_base.h:155)
==00:00:05:04.903 510301== by 0x73BB26: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:148)
==00:00:05:04.903 510301== by 0x6B58A9: ~__shared_count (shared_ptr_base.h:728)
==00:00:05:04.903 510301== by 0x6B58A9: ~__shared_ptr (shared_ptr_base.h:1167)
==00:00:05:04.903 510301== by 0x6B58A9: ~shared_ptr (shared_ptr.h:103)
==00:00:05:04.903 510301== by 0x6B58A9: OSD::create_context() (OSD.cc:9053)
==00:00:05:04.903 510301== by 0x71B570: OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) (OSD.cc:9665)
==00:00:05:04.903 510301== by 0x71B997: OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&) (OSD.cc:9701)
==00:00:05:04.903 510301== by 0x70E62E: run (OpSchedulerItem.h:148)
==00:00:05:04.903 510301== by 0x70E62E: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (OSD.cc:10677)
==00:00:05:04.903 510301== by 0xD3C093: ShardedThreadPool::shardedthreadpool_worker(unsigned int) (WorkQueue.cc:311)
==00:00:05:04.903 510301== by 0xD3ECF3: ShardedThreadPool::WorkThreadSharded::entry() (WorkQueue.h:706)
==00:00:05:04.903 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:05:04.903 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:05:04.903 510301== by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:05:04.903 510301==
==00:00:05:04.903 510301== This conflicts with a previous write of size 8 by thread #90
==00:00:05:04.903 510301== Locks held: 2, at addresses 0xF122A58 0xF1239A8
==00:00:05:04.903 510301== at 0x7531E1: clear (hashtable.h:2054)
==00:00:05:04.903 510301== by 0x7531E1: std::_Hashtable<entity_addr_t, std::pair<entity_addr_t const, utime_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<entity_addr_t const, utime_t> >, std::__detail::_Select1st, std::equal_to<entity_addr_t>, std::hash<entity_addr_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() (hashtable.h:1369)
==00:00:05:04.903 510301== by 0x75331C: ~unordered_map (unordered_map.h:102)
==00:00:05:04.903 510301== by 0x75331C: OSDMap::~OSDMap() (OSDMap.h:350)
==00:00:05:04.903 510301== by 0x753606: operator() (shared_cache.hpp:100)
==00:00:05:04.903 510301== by 0x753606: std::_Sp_counted_deleter<OSDMap const*, SharedLRU<unsigned int, OSDMap const>::Cleanup, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:471)
==00:00:05:04.903 510301== by 0x73BB26: _M_release (shared_ptr_base.h:155)
==00:00:05:04.903 510301== by 0x73BB26: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:148)
==00:00:05:04.903 510301== by 0x72191E: operator= (shared_ptr_base.h:747)
==00:00:05:04.903 510301== by 0x72191E: operator= (shared_ptr_base.h:1078)
==00:00:05:04.903 510301== by 0x72191E: operator= (shared_ptr.h:103)
==00:00:05:04.903 510301== by 0x72191E: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8116)
==00:00:05:04.903 510301== by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:05:04.903 510301== by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:05:04.903 510301== by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:05:04.903 510301== Address 0x1e3e0588 is 872 bytes inside a block of size 1,208 alloc'd
==00:00:05:04.903 510301== at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
==00:00:05:04.903 510301== by 0x6C7C0C: OSDService::try_get_map(unsigned int) (OSD.cc:1606)
==00:00:05:04.903 510301== by 0x7213BD: get_map (OSD.h:699)
==00:00:05:04.903 510301== by 0x7213BD: get_map (OSD.h:1732)
==00:00:05:04.903 510301== by 0x7213BD: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8076)
==00:00:05:04.903 510301== by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:05:04.903 510301== by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:05:04.903 510301== by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:05:04.903 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:05:04.903 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:05:04.903 510301== by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
pushed a commit
that referenced
this pull request
Mar 27, 2020
* no need to discard_result(), as `output_stream::close()` already returns
an empty future<>
* free the connected socket after the background task finishes: we should
not free the connected socket before the promise referencing it is
fulfilled, otherwise ASan reports errors like
==287182==ERROR: AddressSanitizer: heap-use-after-free on address 0x611000019aa0 at pc 0x55e2ae2de882 bp 0x7fff7e2bf080 sp 0x7fff7e2bf078
READ of size 8 at 0x611000019aa0 thread T0
#0 0x55e2ae2de881 in seastar::reactor_backend_aio::await_events(int, __sigset_t const*) ../src/seastar/src/core/reactor_backend.cc:396
#1 0x55e2ae2dfb59 in seastar::reactor_backend_aio::reap_kernel_completions() ../src/seastar/src/core/reactor_backend.cc:428
#2 0x55e2adbea397 in seastar::reactor::reap_kernel_completions_pollfn::poll() (/var/ssd/ceph/build/bin/crimson-osd+0x155e9397)
#3 0x55e2adaec6d0 in seastar::reactor::poll_once() ../src/seastar/src/core/reactor.cc:2789
#4 0x55e2adae7cf7 in operator() ../src/seastar/src/core/reactor.cc:2687
#5 0x55e2adb7c595 in __invoke_impl<bool, seastar::reactor::run()::<lambda()>&> /usr/include/c++/10/bits/invoke.h:60
#6 0x55e2adb699b0 in __invoke_r<bool, seastar::reactor::run()::<lambda()>&> /usr/include/c++/10/bits/invoke.h:113
#7 0x55e2adb50222 in _M_invoke /usr/include/c++/10/bits/std_function.h:291
#8 0x55e2adc2ba00 in std::function<bool ()>::operator()() const /usr/include/c++/10/bits/std_function.h:622
#9 0x55e2adaea491 in seastar::reactor::run() ../src/seastar/src/core/reactor.cc:2713
#10 0x55e2ad98f1c7 in seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) ../src/seastar/src/core/app-template.cc:199
#11 0x55e2a9e57538 in main ../src/crimson/osd/main.cc:148
#12 0x7fae7f20de0a in __libc_start_main ../csu/libc-start.c:308
#13 0x55e2a9d431e9 in _start (/var/ssd/ceph/build/bin/crimson-osd+0x117421e9)
0x611000019aa0 is located 96 bytes inside of 240-byte region [0x611000019a40,0x611000019b30)
freed by thread T0 here:
#0 0x7fae80a4e487 in operator delete(void*, unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.6+0xac487)
#1 0x55e2ae302a0a in seastar::aio_pollable_fd_state::~aio_pollable_fd_state() ../src/seastar/src/core/reactor_backend.cc:458
#2 0x55e2ae2e1059 in seastar::reactor_backend_aio::forget(seastar::pollable_fd_state&) ../src/seastar/src/core/reactor_backend.cc:524
#3 0x55e2adab9b9a in seastar::pollable_fd_state::forget() ../src/seastar/src/core/reactor.cc:1396
#4 0x55e2adab9d05 in seastar::intrusive_ptr_release(seastar::pollable_fd_state*) ../src/seastar/src/core/reactor.cc:1401
#5 0x55e2ace1b72b in boost::intrusive_ptr<seastar::pollable_fd_state>::~intrusive_ptr() /opt/ceph/include/boost/smart_ptr/intrusive_ptr.hpp:98
#6 0x55e2ace115a5 in seastar::pollable_fd::~pollable_fd() ../src/seastar/include/seastar/core/internal/pollable_fd.hh:109
#7 0x55e2ae0ed35c in seastar::net::posix_server_socket_impl::~posix_server_socket_impl() ../src/seastar/include/seastar/net/posix-stack.hh:161
#8 0x55e2ae0ed3cf in seastar::net::posix_server_socket_impl::~posix_server_socket_impl() ../src/seastar/include/seastar/net/posix-stack.hh:161
#9 0x55e2ae0ed943 in std::default_delete<seastar::net::api_v2::server_socket_impl>::operator()(seastar::net::api_v2::server_socket_impl*) const /usr/include/c++/10/bits/unique_ptr.h:81
#10 0x55e2ae0db357 in std::unique_ptr<seastar::net::api_v2::server_socket_impl, std::default_delete<seastar::net::api_v2::server_socket_impl> >::~unique_ptr() /usr/include/c++/10/bits/unique_ptr.h:357
#11 0x55e2ae1438b7 in seastar::api_v2::server_socket::~server_socket() ../src/seastar/src/net/stack.cc:195
#12 0x55e2aa1c7656 in std::_Optional_payload_base<seastar::api_v2::server_socket>::_M_destroy() /usr/include/c++/10/optional:260
#13 0x55e2aa16c84b in std::_Optional_payload_base<seastar::api_v2::server_socket>::_M_reset() /usr/include/c++/10/optional:280
#14 0x55e2ac24b2b7 in std::_Optional_base_impl<seastar::api_v2::server_socket, std::_Optional_base<seastar::api_v2::server_socket, false, false> >::_M_reset() /usr/include/c++/10/optional:432
#15 0x55e2ac23f37b in std::optional<seastar::api_v2::server_socket>::reset() /usr/include/c++/10/optional:975
#16 0x55e2ac21a2e7 in crimson::admin::AdminSocket::stop() ../src/crimson/admin/admin_socket.cc:265
#17 0x55e2aa099825 in operator() ../src/crimson/osd/osd.cc:450
#18 0x55e2aa0d4e3e in apply ../src/seastar/include/seastar/core/apply.hh:36
Signed-off-by: Kefu Chai <kchai@redhat.com>
rzarzynski
pushed a commit
that referenced
this pull request
Apr 28, 2020
This fixes the following selinux errors seen when using ceph-iscsi's
rbd-target-api daemon (rbd-target-gw has the same issue). They are
a result of rtslib, a Python library which the daemons use.
Additional Information:
Source Context system_u:system_r:ceph_t:s0
Target Context system_u:object_r:configfs_t:s0
Target Objects                /sys/kernel/config/target/iscsi/iqn.2003-01.com.redhat:ceph-iscsi/tpgt_1/attrib/authentication [ file ]
Source rbd-target-api
Source Path /usr/libexec/platform-python3.6
Port <Unknown>
Host ans8
Source RPM Packages platform-python-3.6.8-15.1.el8.x86_64
Target RPM Packages
Policy RPM selinux-policy-3.14.3-20.el8.noarch
Selinux Enabled True
Policy Type targeted
Enforcing Mode Enforcing
Host Name ans8
Platform                      Linux ans8 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64
Alert Count 1
First Seen 2020-01-08 18:39:47 EST
Last Seen 2020-01-08 18:39:47 EST
Local ID 6f8c3415-7a50-4dc8-b3d2-2621e1d00ca3
Raw Audit Messages
type=AVC msg=audit(1578526787.577:68): avc: denied { ioctl } for
pid=995 comm="rbd-target-api"
path="/sys/kernel/config/target/iscsi/iqn.2003-01.com.redhat:ceph-iscsi/tpgt_1/attrib/authentication"
dev="configfs" ino=25703 ioctlcmd=0x5401
scontext=system_u:system_r:ceph_t:s0
tcontext=system_u:object_r:configfs_t:s0 tclass=file permissive=1
type=SYSCALL msg=audit(1578526787.577:68): arch=x86_64 syscall=ioctl
success=no exit=ENOTTY a0=34 a1=5401 a2=7ffd4f8f1f60 a3=3052cd2d95839b96
items=0 ppid=1 pid=995 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0
egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm=rbd-target-api
exe=/usr/libexec/platform-python3.6 subj=system_u:system_r:ceph_t:s0
key=(null)
Hash: rbd-target-api,ceph_t,configfs_t,file,ioctl
Signed-off-by: Mike Christie <mchristi@redhat.com>
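Denials like the AVC above are typically resolved by extending the policy so the daemon's domain (`ceph_t`) may act on the target class/context (`configfs_t:file`). A hedged sketch of what such a local policy module looks like; the module name and the exact permission set here are illustrative assumptions, not the actual ceph-selinux change:

```
# hypothetical local module sketch (audit2allow-style), not the real fix
module ceph_iscsi_local 1.0;

require {
    type ceph_t;
    type configfs_t;
    class file { read write open getattr ioctl };
}

# let rbd-target-api/rbd-target-gw (running as ceph_t) touch configfs
# attributes such as .../tpgt_1/attrib/authentication
allow ceph_t configfs_t:file { read write open getattr ioctl };
```

Such a module would be compiled with `checkmodule`/`semodule_package` and loaded with `semodule -i`; the upstream fix lands the equivalent rules in ceph's own policy instead.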
rzarzynski
pushed a commit
that referenced
this pull request
Apr 28, 2020
This fixes the selinux errors like this for /etc/target ----------------------------------- Additional Information: Source Context system_u:system_r:ceph_t:s0 Target Context system_u:object_r:targetd_etc_rw_t:s0 Target Objects target [ dir ] Source rbd-target-api Source Path rbd-target-api Port <Unknown> Host ans8 Source RPM Packages Target RPM Packages Policy RPM selinux-policy-3.14.3-20.el8.noarch Selinux Enabled True Policy Type targeted Enforcing Mode Enforcing Host Name ans8 Platform Linux ans8 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64 Alert Count 1 First Seen 2020-01-08 18:39:48 EST Last Seen 2020-01-08 18:39:48 EST Local ID 9a13ee18-eaf2-4f2a-872f-2809ee4928f6 Raw Audit Messages type=AVC msg=audit(1578526788.148:69): avc: denied { search } for pid=995 comm="rbd-target-api" name="target" dev="sda1" ino=52198 scontext=system_u:system_r:ceph_t:s0 tcontext=system_u:object_r:targetd_etc_rw_t:s0 tclass=dir permissive=1 Hash: rbd-target-api,ceph_t,targetd_etc_rw_t,dir,search which are a result of the rtslib library the ceph-iscsi daemons use accessing /etc/target to read/write a file which stores meta data the target uses. Signed-off-by: Mike Christie <mchristi@redhat.com>
rzarzynski
pushed a commit
that referenced
this pull request
Sep 11, 2020
Changes addressing comments in PR - commit to be squashed prior to merge Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
rzarzynski
pushed a commit
that referenced
this pull request
May 7, 2021
Otherwise, if we assert, we'll hang here:
```
Thread 1 (Thread 0x7f74eba79580 (LWP 1688617)):
#0  0x00007f74eb2aa529 in futex_wait (private=<optimized out>, expected=132, futex_word=0x7ffd642b4b54) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=132, futex_word=0x7ffd642b4b54) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_cond_destroy (cond=0x7ffd642b4b30) at pthread_cond_destroy.c:54
#3  0x0000563ff2e5a891 in LibRadosService_StatusFormat_Test::TestBody (this=<optimized out>) at /usr/include/c++/7/bits/unique_ptr.h:78
#4  0x0000563ff2e9dc3a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x563ff2ea72e4 "the test body", method=<optimized out>, object=0x563ff422a6d0) at ./src/googletest/googletest/src/gtest.cc:2605
#5  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x563ff422a6d0, method=<optimized out>, location=location@entry=0x563ff2ea72e4 "the test body") at ./src/googletest/googletest/src/gtest.cc:2641
#6  0x0000563ff2e908c3 in testing::Test::Run (this=0x563ff422a6d0) at ./src/googletest/googletest/src/gtest.cc:2680
#7  0x0000563ff2e90a25 in testing::TestInfo::Run (this=0x563ff41a3b70) at ./src/googletest/googletest/src/gtest.cc:2858
#8  0x0000563ff2e90ec1 in testing::TestSuite::Run (this=0x563ff41b6230) at ./src/googletest/googletest/src/gtest.cc:3012
#9  0x0000563ff2e92bdc in testing::internal::UnitTestImpl::RunAllTests (this=<optimized out>) at ./src/googletest/googletest/src/gtest.cc:5723
#10 0x0000563ff2e9e14a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0x563ff2ea8728 "auxiliary test code (environments or event listeners)", method=<optimized out>, object=0x563ff41a2d10) at ./src/googletest/googletest/src/gtest.cc:2605
#11 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x563ff41a2d10, method=<optimized out>, location=location@entry=0x563ff2ea8728 "auxiliary test code (environments or event listeners)") at ./src/googletest/googletest/src/gtest.cc:2641
#12 0x0000563ff2e90ae8 in testing::UnitTest::Run (this=0x563ff30c0660 <testing::UnitTest::GetInstance()::instance>) at ./src/googletest/googletest/src/gtest.cc:5306
```
Signed-off-by: Sage Weil <sage@newdream.net>
rzarzynski
added a commit
that referenced
this pull request
May 7, 2021
An attempt to `Connection::do_auth()` may finish in one of three states:
_success_, _failure_ and _cancellation_. Unfortunately, its callers were
missing the third, treating cancellation like a failure. This was the root
cause of the following failure at Sepia:
```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-06_22:08:43-rados-master-distro-basic-smithi/6102605$ less ./remote/smithi204/log/ceph-osd.3.log.gz
...
WARN 2021-05-06 22:35:40,464 [shard 0] osd - ms_handle_reset
...
INFO 2021-05-06 22:35:40,465 [shard 0] monc - do_auth_single: connection closed
INFO 2021-05-06 22:35:40,465 [shard 0] ms - [osd.3(client) v2:172.21.15.204:6808/31418@57568 >> mon.? v2:172.21.15.204:3300/0] execute_connecting(): protocol aborted at CLOSING -- std::system_error (error crimson::net:6, protocol aborted)
...
ERROR 2021-05-06 22:35:40,465 [shard 0] osd - mon.osd.3 dispatch() ms_handle_reset caught exception: std::system_error (error crimson::net:3, negotiation failure)
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3909-g81233a18/rpm/el8/BUILD/ceph-17.0.0-3909-g81233a18/src/crimson/common/gated.h:36: crimson::common::Gated::dispatch(const char*, T&, Func&&) [with Func = crimson::mon::Client::ms_handle_reset(crimson::net::ConnectionRef, bool)::<lambda()>&; T = crimson::mon::Client]::<lambda(std::__exception_ptr::exception_ptr)>: Assertion `*eptr.__cxa_exception_type() == typeid(seastar::gate_closed_exception)' failed.
Aborting on shard 0.
Backtrace:
0# 0x00005618C973932F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F7BB592EB20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007F7BB3F29B09 in /lib64/libc.so.6
7# 0x00007F7BB3F37DE6 in /lib64/libc.so.6
8# 0x00005618C9FF295C in ceph-osd
9# 0x00005618C3907313 in ceph-osd
10# 0x00005618CCA2F84F in ceph-osd
11# 0x00005618CCA34D90 in ceph-osd
12# 0x00005618CCBEC9BB in ceph-osd
13# 0x00005618CC744E9A in ceph-osd
14# main in ceph-osd
15# __libc_start_main in /lib64/libc.so.6
16# _start in ceph-osd
daemon-helper: command crashed with signal 6
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
May 7, 2021
An attempt to `Connection::do_auth()` may finish in one of three states:
_success_, _failure_ and _cancellation_. Unfortunately, its callers were
missing the third, treating cancellation like a failure. This was the root
cause of the following failure at Sepia:
```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-06_22:08:43-rados-master-distro-basic-smithi/6102605$ less ./remote/smithi204/log/ceph-osd.3.log.gz
...
WARN 2021-05-06 22:35:40,464 [shard 0] osd - ms_handle_reset
...
INFO 2021-05-06 22:35:40,465 [shard 0] monc - do_auth_single: connection closed
INFO 2021-05-06 22:35:40,465 [shard 0] ms - [osd.3(client) v2:172.21.15.204:6808/31418@57568 >> mon.? v2:172.21.15.204:3300/0] execute_connecting(): protocol aborted at CLOSING -- std::system_error (error crimson::net:6, protocol aborted)
...
ERROR 2021-05-06 22:35:40,465 [shard 0] osd - mon.osd.3 dispatch() ms_handle_reset caught exception: std::system_error (error crimson::net:3, negotiation failure)
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3909-g81233a18/rpm/el8/BUILD/ceph-17.0.0-3909-g81233a18/src/crimson/common/gated.h:36: crimson::common::Gated::dispatch(const char*, T&, Func&&) [with Func = crimson::mon::Client::ms_handle_reset(crimson::net::ConnectionRef, bool)::<lambda()>&; T = crimson::mon::Client]::<lambda(std::__exception_ptr::exception_ptr)>: Assertion `*eptr.__cxa_exception_type() == typeid(seastar::gate_closed_exception)' failed.
Aborting on shard 0.
Backtrace:
0# 0x00005618C973932F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F7BB592EB20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007F7BB3F29B09 in /lib64/libc.so.6
7# 0x00007F7BB3F37DE6 in /lib64/libc.so.6
8# 0x00005618C9FF295C in ceph-osd
9# 0x00005618C3907313 in ceph-osd
10# 0x00005618CCA2F84F in ceph-osd
11# 0x00005618CCA34D90 in ceph-osd
12# 0x00005618CCBEC9BB in ceph-osd
13# 0x00005618CC744E9A in ceph-osd
14# main in ceph-osd
15# __libc_start_main in /lib64/libc.so.6
16# _start in ceph-osd
daemon-helper: command crashed with signal 6
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on May 7, 2021
…dling.
In `crimson/osd/main.cc` we instruct Seastar to handle `SIGHUP`.
```
// just ignore SIGHUP, we don't reread settings
seastar::engine().handle_signal(SIGHUP, [] {})
```
This happens using Seastar's signal handling infrastructure,
which is incompatible with the alien world.
```
void
reactor::signals::handle_signal(int signo, noncopyable_function<void ()>&& handler) {
  // ...
  struct sigaction sa;
  sa.sa_sigaction = [](int sig, siginfo_t *info, void *p) {
    engine()._backend->signal_received(sig, info, p);
  };
  // ...
}
```
```
extern __thread reactor* local_engine;
extern __thread size_t task_quota;

inline reactor& engine() {
  return *local_engine;
}
```
The low-level signal handler above assumes `local_engine->_backend`
is not null, which stays true only for threads from Seastar's world.
Unfortunately, as we don't block `SIGHUP` for alien threads, the
kernel is perfectly authorized to pick one of them to run the
handler, leading to weirdly looking segfaults like this one:
```
INFO 2021-04-23 07:06:57,807 [shard 0] bluestore - stat
DEBUG 2021-04-23 07:06:58,753 [shard 0] ms - [osd.1(client) v2:172.21.15.100:6802/30478@51064 >> mgr.4105 v2:172.21.15.109:6800/29891] --> #7 === pg_stats(0 pgs seq 55834574872 v 0) v2 (87)
...
INFO 2021-04-23 07:06:58,813 [shard 0] bluestore - stat
DEBUG 2021-04-23 07:06:59,753 [shard 0] osd - AdminSocket::handle_client: incoming asok string: {"prefix": "get_command_descriptions"}
INFO 2021-04-23 07:06:59,753 [shard 0] osd - asok response length: 2947
INFO 2021-04-23 07:06:59,817 [shard 0] bluestore - stat
DEBUG 2021-04-23 07:06:59,865 [shard 0] osd - AdminSocket::handle_client: incoming asok string: {"prefix": "get_command_descriptions"}
INFO 2021-04-23 07:06:59,866 [shard 0] osd - asok response length: 2947
DEBUG 2021-04-23 07:07:00,020 [shard 0] osd - AdminSocket::handle_client: incoming asok string: {"prefix": "get_command_descriptions"}
INFO 2021-04-23 07:07:00,020 [shard 0] osd - asok response length: 2947
INFO 2021-04-23 07:07:00,820 [shard 0] bluestore - stat
...
Backtrace:
0# 0x00005600CD0D6AAF in ceph-osd
1# FatalSignal::signaled(int) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F5877C7EB20 in /lib64/libpthread.so.0
4# 0x00005600CD830B81 in ceph-osd
5# 0x00007F5877C7EB20 in /lib64/libpthread.so.0
6# pthread_cond_timedwait in /lib64/libpthread.so.0
7# crimson::os::ThreadPool::loop(std::chrono::duration<long, std::ratio<1l, 1000l> >, unsigned long) in ceph-osd
8# 0x00007F5877999BA3 in /lib64/libstdc++.so.6
9# 0x00007F5877C7414A in /lib64/libpthread.so.0
10# clone in /lib64/libc.so.6
daemon-helper: command crashed with signal 11
```
Ultimately, it turned out the thread had come out of a syscall
(`futex`) and started executing the `SIGHUP` handler's code, in
which a nullptr dereference happened.
This patch blocks `SIGHUP` for all threads spawned by `AlienStore`.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on May 7, 2021
An attempt to `Connection::do_auth()` may finish in one of three
states: _success_, _failure_, and _cancellation_. Unfortunately, its
callers were missing the third one, treating cancellation like a
failure. This was the root cause of the following failure at Sepia:
```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-06_22:08:43-rados-master-distro-basic-smithi/6102605$ less ./remote/smithi204/log/ceph-osd.3.log.gz
...
WARN 2021-05-06 22:35:40,464 [shard 0] osd - ms_handle_reset
...
INFO 2021-05-06 22:35:40,465 [shard 0] monc - do_auth_single: connection closed
INFO 2021-05-06 22:35:40,465 [shard 0] ms - [osd.3(client) v2:172.21.15.204:6808/31418@57568 >> mon.? v2:172.21.15.204:3300/0] execute_connecting(): protocol aborted at CLOSING -- std::system_error (error crimson::net:6, protocol aborted)
...
ERROR 2021-05-06 22:35:40,465 [shard 0] osd - mon.osd.3 dispatch() ms_handle_reset caught exception: std::system_error (error crimson::net:3, negotiation failure)
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3909-g81233a18/rpm/el8/BUILD/ceph-17.0.0-3909-g81233a18/src/crimson/common/gated.h:36: crimson::common::Gated::dispatch(const char*, T&, Func&&) [with Func = crimson::mon::Client::ms_handle_reset(crimson::net::ConnectionRef, bool)::<lambda()>&; T = crimson::mon::Client]::<lambda(std::__exception_ptr::exception_ptr)>: Assertion `*eptr.__cxa_exception_type() == typeid(seastar::gate_closed_exception)' failed.
Aborting on shard 0.
Backtrace:
0# 0x00005618C973932F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F7BB592EB20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007F7BB3F29B09 in /lib64/libc.so.6
7# 0x00007F7BB3F37DE6 in /lib64/libc.so.6
8# 0x00005618C9FF295C in ceph-osd
9# 0x00005618C3907313 in ceph-osd
10# 0x00005618CCA2F84F in ceph-osd
11# 0x00005618CCA34D90 in ceph-osd
12# 0x00005618CCBEC9BB in ceph-osd
13# 0x00005618CC744E9A in ceph-osd
14# main in ceph-osd
15# __libc_start_main in /lib64/libc.so.6
16# _start in ceph-osd
daemon-helper: command crashed with signal 6
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on May 17, 2021
The `send_message()` method is a high-level facility for
communicating with a monitor. If there is an active conn
available, it sends the message immediately; otherwise
the message is queued. This method assumes the queue is
already drained if the connection is available.
`active_con` is managed by `reopen_session()`, where it's
first cleared and then reset after finding a new alive mon.
This is followed by draining the `pending_messages` queue
which happens in `on_session_opened()` after the `MAuth`
exchange is finished.
Unfortunately, the path from the `active_con` clearing
to draining the queue is long and divided into multiple
continuations, which results in a lack of atomicity. When
e.g. `run_command()` interleaves the stages, the following
crash happens:
```
INFO 2021-05-07 08:13:43,914 [shard 0] monc - do_auth_single: mon v2:172.21.15.82:6805/34166 => v2:172.21.15.82:3300/0 returns auth_reply(proto 2 0 (0) Success) v1: 0
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3910-g1b18e076/rpm/el8/BUILD/ceph-17.0.0-3910-g1b18e076/src/crimson/mon/MonClient.cc:1034: seastar::future<> crimson::mon::Client::send_message(MessageRef): Assertion `pending_messages.empty()' failed.
Aborting on shard 0.
Backtrace:
0# 0x000055CDE6DB532F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007FC1BF20BB20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007FC1BD806B09 in /lib64/libc.so.6
7# 0x00007FC1BD814DE6 in /lib64/libc.so.6
8# crimson::mon::Client::send_message(boost::intrusive_ptr<Message>) in ceph-osd
9# crimson::mon::Client::renew_subs() in ceph-osd
10# 0x000055CDE764FB0B in ceph-osd
11# 0x000055CDE10457F0 in ceph-osd
12# 0x000055CDEA0AB88F in ceph-osd
13# 0x000055CDEA0B0DD0 in ceph-osd
14# 0x000055CDEA2689FB in ceph-osd
15# 0x000055CDE9DC0EDA in ceph-osd
16# main in ceph-osd
17# __libc_start_main in /lib64/libc.so.6
18# _start in ceph-osd
```
The problem caused following failure at Sepia:
http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-07_07:41:02-rados-master-distro-basic-smithi/6104549
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on May 17, 2021
The `send_message()` method is a high-level facility for
communicating with a monitor. If there is an active conn
available, it sends the message immediately; otherwise
the message is queued. This method assumes the queue is
already drained if the connection is available.
`active_con` is managed by `reopen_session()`, where it's
first cleared and then reset after finding a new alive mon.
This is followed by draining the `pending_messages` queue
which happens in `on_session_opened()` after the `MAuth`
exchange is finished.
Unfortunately, the path from the `active_con` clearing
to draining the queue is long and divided into multiple
continuations, which results in a lack of atomicity. When
e.g. `run_command()` interleaves the stages, the following
crash happens:
```
INFO 2021-05-07 08:13:43,914 [shard 0] monc - do_auth_single: mon v2:172.21.15.82:6805/34166 => v2:172.21.15.82:3300/0 returns auth_reply(proto 2 0 (0) Success) v1: 0
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3910-g1b18e076/rpm/el8/BUILD/ceph-17.0.0-3910-g1b18e076/src/crimson/mon/MonClient.cc:1034: seastar::future<> crimson::mon::Client::send_message(MessageRef): Assertion `pending_messages.empty()' failed.
Aborting on shard 0.
Backtrace:
0# 0x000055CDE6DB532F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007FC1BF20BB20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007FC1BD806B09 in /lib64/libc.so.6
7# 0x00007FC1BD814DE6 in /lib64/libc.so.6
8# crimson::mon::Client::send_message(boost::intrusive_ptr<Message>) in ceph-osd
9# crimson::mon::Client::renew_subs() in ceph-osd
10# 0x000055CDE764FB0B in ceph-osd
11# 0x000055CDE10457F0 in ceph-osd
12# 0x000055CDEA0AB88F in ceph-osd
13# 0x000055CDEA0B0DD0 in ceph-osd
14# 0x000055CDEA2689FB in ceph-osd
15# 0x000055CDE9DC0EDA in ceph-osd
16# main in ceph-osd
17# __libc_start_main in /lib64/libc.so.6
18# _start in ceph-osd
```
The problem caused following failure at Sepia:
http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-07_07:41:02-rados-master-distro-basic-smithi/6104549
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on May 24, 2021
`OpSequencer` assumes that the ID of a previous client request
is always lower than the ID of the current one. This is reflected
by the assertion in `OpSequencer::start_op()`. It triggered
the following failure [1] in Teuthology:
```
DEBUG 2021-05-07 08:01:41,227 [shard 0] osd - client_request(id=1, detail=osd_op(client.4171.0:1 2.2 2.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e29) v8) same_interval_since: 31
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3910-g1b18e076/rpm/el8/BUILD/ceph-17.0.0-3910-g1b18e076/src/crimson/osd/osd_operation_sequencer.h:38: seastar::futurize_t<Result> crimson::osd::OpSequencer::start_op(HandleT&, uint64_t, uint64_t, FuncT&&) [with HandleT = crimson::PipelineHandle; FuncT = crimson::interruptible::interruptor<InterruptCond>::wrap_function(Func&&) [with Func = crimson::osd::ClientRequest::start()::<lambda()> mutable::<lambda(Ref<crimson::osd::PG>)> mutable::<lambda()> mutable::<lambda()>; InterruptCond = crimson::osd::IOInterruptCondition]::<lambda()>; Result = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<> >; seastar::futurize_t<Result> = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<> >; uint64_t = long unsigned int]: Assertion `prev_op < this_op' failed.
Aborting on shard 0.
Backtrace:
Segmentation fault.
Backtrace:
0# 0x00005592B028932F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F57B72E7B20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007F57B58E2B09 in /lib64/libc.so.6
7# 0x00007F57B58F0DE6 in /lib64/libc.so.6
8# 0x00005592ABB8484D in ceph-osd
9# 0x00005592ABB8ACB3 in ceph-osd
10# seastar::continuation<seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >, seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<boost::intrusive_ptr<crimson::osd::PG> >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > >(seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >&&, seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&)#1}, boost::intrusive_ptr<crimson::osd::PG> >::run_and_dispose() in ceph-osd
11# 0x00005592B357F88F in ceph-osd
12# 0x00005592B3584DD0 in ceph-osd
```
[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-07_07:41:02-rados-master-distro-basic-smithi/6104530
Crash analysis resulted in two observations:
1. during the request's execution the acting set got
changed, the request was interrupted and an attempt
to re-execute it emerged;
2. the interrupted request was the very first client
request the OSD had ever seen.
Code analysis showed a problem in how `ClientRequest`
establishes `prev_op_id`: although this is supposed to happen
only once per request, it can get executed twice, but only
for the very first request `OpSequencer` saw.
```cpp
void ClientRequest::may_set_prev_op()
{
  // set prev_op_id if it's not set yet
  if (__builtin_expect(prev_op_id == 0, true)) {
    prev_op_id = sequencer.get_last_issued();
  }
}
```
Unfortunately, `0` isn't a reserved value that cannot
be returned by `get_last_issued()`:
```cpp
class OpSequencer {
  // ...
  uint64_t get_last_issued() const {
    return last_issued;
  }
  // ...
  // the id of last op which is issued
  uint64_t last_issued = 0;
};
```
As a result, on the second call `OpSequencer` returned
a new value (actually `this_op`), violating the assertion.
The commit fixes the problem by switching from a designated
value to `std::optional`.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on May 24, 2021
`OpSequencer` assumes that the ID of a previous client request
is always lower than the ID of the current one. This is reflected
by the assertion in `OpSequencer::start_op()`. It triggered
the following failure [1] in Teuthology:
```
DEBUG 2021-05-07 08:01:41,227 [shard 0] osd - client_request(id=1, detail=osd_op(client.4171.0:1 2.2 2.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e29) v8) same_interval_since: 31
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3910-g1b18e076/rpm/el8/BUILD/ceph-17.0.0-3910-g1b18e076/src/crimson/osd/osd_operation_sequencer.h:38: seastar::futurize_t<Result> crimson::osd::OpSequencer::start_op(HandleT&, uint64_t, uint64_t, FuncT&&) [with HandleT = crimson::PipelineHandle; FuncT = crimson::interruptible::interruptor<InterruptCond>::wrap_function(Func&&) [with Func = crimson::osd::ClientRequest::start()::<lambda()> mutable::<lambda(Ref<crimson::osd::PG>)> mutable::<lambda()> mutable::<lambda()>; InterruptCond = crimson::osd::IOInterruptCondition]::<lambda()>; Result = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<> >; seastar::futurize_t<Result> = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<> >; uint64_t = long unsigned int]: Assertion `prev_op < this_op' failed.
Aborting on shard 0.
Backtrace:
Segmentation fault.
Backtrace:
0# 0x00005592B028932F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F57B72E7B20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007F57B58E2B09 in /lib64/libc.so.6
7# 0x00007F57B58F0DE6 in /lib64/libc.so.6
8# 0x00005592ABB8484D in ceph-osd
9# 0x00005592ABB8ACB3 in ceph-osd
10# seastar::continuation<seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >, seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<boost::intrusive_ptr<crimson::osd::PG> >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > >(seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >&&, seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&)#1}, boost::intrusive_ptr<crimson::osd::PG> >::run_and_dispose() in ceph-osd
11# 0x00005592B357F88F in ceph-osd
12# 0x00005592B3584DD0 in ceph-osd
```
[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-07_07:41:02-rados-master-distro-basic-smithi/6104530
Crash analysis resulted in two observations:
1. during the request's execution the acting set got
changed, the request was interrupted and an attempt
to re-execute it emerged;
2. the interrupted request was the very first client
request the OSD had ever seen.
Code analysis showed a problem in how `ClientRequest`
establishes `prev_op_id`: although this is supposed to happen
only once per request, it can get executed twice, but only
for the very first request `OpSequencer` saw.
```cpp
void ClientRequest::may_set_prev_op()
{
  // set prev_op_id if it's not set yet
  if (__builtin_expect(prev_op_id == 0, true)) {
    prev_op_id = sequencer.get_last_issued();
  }
}
```
Unfortunately, `0` isn't a reserved value that cannot
be returned by `get_last_issued()`:
```cpp
class OpSequencer {
  // ...
  uint64_t get_last_issued() const {
    return last_issued;
  }
  // ...
  // the id of last op which is issued
  uint64_t last_issued = 0;
};
```
As a result, on the second call `OpSequencer` returned
a new value (actually `this_op`), violating the assertion.
The commit fixes the problem by switching from a designated
value to `std::optional`.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on May 24, 2021
`OpSequencer` assumes that the ID of a previous client request
is always lower than the ID of the current one. This is reflected
by the assertion in `OpSequencer::start_op()`. It triggered
the following failure [1] in Teuthology:
```
DEBUG 2021-05-07 08:01:41,227 [shard 0] osd - client_request(id=1, detail=osd_op(client.4171.0:1 2.2 2.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e29) v8) same_interval_since: 31
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3910-g1b18e076/rpm/el8/BUILD/ceph-17.0.0-3910-g1b18e076/src/crimson/osd/osd_operation_sequencer.h:38: seastar::futurize_t<Result> crimson::osd::OpSequencer::start_op(HandleT&, uint64_t, uint64_t, FuncT&&) [with HandleT = crimson::PipelineHandle; FuncT = crimson::interruptible::interruptor<InterruptCond>::wrap_function(Func&&) [with Func = crimson::osd::ClientRequest::start()::<lambda()> mutable::<lambda(Ref<crimson::osd::PG>)> mutable::<lambda()> mutable::<lambda()>; InterruptCond = crimson::osd::IOInterruptCondition]::<lambda()>; Result = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<> >; seastar::futurize_t<Result> = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<> >; uint64_t = long unsigned int]: Assertion `prev_op < this_op' failed.
Aborting on shard 0.
Backtrace:
Segmentation fault.
Backtrace:
0# 0x00005592B028932F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F57B72E7B20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007F57B58E2B09 in /lib64/libc.so.6
7# 0x00007F57B58F0DE6 in /lib64/libc.so.6
8# 0x00005592ABB8484D in ceph-osd
9# 0x00005592ABB8ACB3 in ceph-osd
10# seastar::continuation<seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >, seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<boost::intrusive_ptr<crimson::osd::PG> >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > >(seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >&&, seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&)#1}, boost::intrusive_ptr<crimson::osd::PG> >::run_and_dispose() in ceph-osd
11# 0x00005592B357F88F in ceph-osd
12# 0x00005592B3584DD0 in ceph-osd
```
[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-07_07:41:02-rados-master-distro-basic-smithi/6104530
Crash analysis resulted in two observations:
1. during the request's execution the acting set got
changed, the request was interrupted and an attempt
to re-execute it emerged;
2. the interrupted request was the very first client
request the OSD had ever seen.
Code analysis showed a problem in how `ClientRequest`
establishes `prev_op_id`: although this is supposed to happen
only once per request, it can get executed twice, but only
for the very first request `OpSequencer` saw.
```cpp
void ClientRequest::may_set_prev_op()
{
  // set prev_op_id if it's not set yet
  if (__builtin_expect(prev_op_id == 0, true)) {
    prev_op_id = sequencer.get_last_issued();
  }
}
```
Unfortunately, `0` isn't a reserved value that cannot
be returned by `get_last_issued()`:
```cpp
class OpSequencer {
  // ...
  uint64_t get_last_issued() const {
    return last_issued;
  }
  // ...
  // the id of last op which is issued
  uint64_t last_issued = 0;
};
```
As a result, on the second call `OpSequencer` returned
a new value (actually `this_op`), violating the assertion.
The commit fixes the problem by switching from a designated
value to `std::optional`.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on May 25, 2021
f7181ab has optimized the client parallelism. To achieve that,
`PG::do_osd_ops()` was converted to return, basically, a future
of a pair of futures. Unfortunately, the lifetime management of
`OpsExecuter` was kept intact. As a result, the object was valid
only until fulfilling the outer future while, due to the
`rollbacker` instances, it should stay available until
`all_completed` becomes available. This issue can explain the
following problem that has been observed in a Teuthology job [1].
```
DEBUG 2021-05-20 08:03:22,617 [shard 0] osd - do_op_call: method returned ret=-17, outdata.length()=0 while num_read=1, num_write=0
DEBUG 2021-05-20 08:03:22,617 [shard 0] osd - rollback_obc_if_modified: object 19:e17d4708:test-rados-api-smithi095-38404-2::foo:head got error generic:17, need_rollback=false
=================================================================
==33626==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d0000b9320 at pc 0x560f486b8222 bp 0x7fffc467a1e0 sp 0x7fffc467a1d0
READ of size 4 at 0x60d0000b9320 thread T0
    #0 0x560f486b8221 (/usr/bin/ceph-osd+0x2c610221)
    #1 0x560f4880c6b1 in seastar::continuation<seastar::internal::promise_base_with_type<boost::intrusive_ptr<MOSDOpReply> >, seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>, seastar::future<void>::then_impl_nrvo<seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>, crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > >(seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>&&)::{lambda(seastar::internal::promise_base_with_type<boost::intrusive_ptr<MOSDOpReply> >&&, seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() (/usr/bin/ceph-osd+0x2c7646b1)
    #2 0x560f5352c3ae (/usr/bin/ceph-osd+0x374843ae)
    #3 0x560f535318ef (/usr/bin/ceph-osd+0x374898ef)
    #4 0x560f536e395a (/usr/bin/ceph-osd+0x3763b95a)
    #5 0x560f532413d9 (/usr/bin/ceph-osd+0x371993d9)
    #6 0x560f476af95a in main (/usr/bin/ceph-osd+0x2b60795a)
    #7 0x7f7aa0af97b2 in __libc_start_main (/lib64/libc.so.6+0x237b2)
    #8 0x560f477d2e8d in _start (/usr/bin/ceph-osd+0x2b72ae8d)
```
[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-20_07:28:16-rados-master-distro-basic-smithi/6124735/
The commit deals with the problem by repacking the outer future.
An alternative could be switching from `std::unique_ptr` to
`seastar::shared_ptr` for managing `OpsExecuter`.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on May 25, 2021
f7181ab has optimized the client parallelism. To achieve that,
`PG::do_osd_ops()` was converted to return, basically, a future
of a pair of futures. Unfortunately, the lifetime management of
`OpsExecuter` was kept intact. As a result, the object was valid
only until fulfilling the outer future while, due to the
`rollbacker` instances, it should stay available until
`all_completed` becomes available. This issue can explain the
following problem that has been observed in a Teuthology job [1].
```
DEBUG 2021-05-20 08:03:22,617 [shard 0] osd - do_op_call: method returned ret=-17, outdata.length()=0 while num_read=1, num_write=0
DEBUG 2021-05-20 08:03:22,617 [shard 0] osd - rollback_obc_if_modified: object 19:e17d4708:test-rados-api-smithi095-38404-2::foo:head got error generic:17, need_rollback=false
=================================================================
==33626==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d0000b9320 at pc 0x560f486b8222 bp 0x7fffc467a1e0 sp 0x7fffc467a1d0
READ of size 4 at 0x60d0000b9320 thread T0
    #0 0x560f486b8221 (/usr/bin/ceph-osd+0x2c610221)
    #1 0x560f4880c6b1 in seastar::continuation<seastar::internal::promise_base_with_type<boost::intrusive_ptr<MOSDOpReply> >, seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>, seastar::future<void>::then_impl_nrvo<seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>, crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > >(seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>&&)::{lambda(seastar::internal::promise_base_with_type<boost::intrusive_ptr<MOSDOpReply> >&&, seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() (/usr/bin/ceph-osd+0x2c7646b1)
    #2 0x560f5352c3ae (/usr/bin/ceph-osd+0x374843ae)
    #3 0x560f535318ef (/usr/bin/ceph-osd+0x374898ef)
    #4 0x560f536e395a (/usr/bin/ceph-osd+0x3763b95a)
    #5 0x560f532413d9 (/usr/bin/ceph-osd+0x371993d9)
    #6 0x560f476af95a in main (/usr/bin/ceph-osd+0x2b60795a)
    #7 0x7f7aa0af97b2 in __libc_start_main (/lib64/libc.so.6+0x237b2)
    #8 0x560f477d2e8d in _start (/usr/bin/ceph-osd+0x2b72ae8d)
```
[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-20_07:28:16-rados-master-distro-basic-smithi/6124735/
The commit deals with the problem by repacking the outer future.
An alternative could be switching from `std::unique_ptr` to
`seastar::shared_ptr` for managing `OpsExecuter`.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on May 27, 2021
The following crash occurred at Sepia [1]:
```
INFO 2021-05-26 20:16:32,872 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] ProtocolV2::start_accept(): target_addr=172.21.15.119:55220/0
DEBUG 2021-05-26 20:16:32,872 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] TRIGGER ACCEPTING, was NONE
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] SEND(26) banner: len_payload=16, supported=1, required=0, banner="ceph v2
"
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] RECV(10) banner: "ceph v2
"
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] GOT banner: payload_len=16
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] RECV(16) banner features: supported=1 required=0
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] WRITE HelloFrame: my_type=osd, peer_addr=172.21.15.119:55220/0
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] GOT HelloFrame: my_type=client peer_addr=v2:172.21.15.119:6803/31733
INFO 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> client.? -@55220] UPDATE: peer_type=client, policy(lossy=true server=true standby=false resetcheck=false)
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> client.? -@55220] GOT AuthRequestFrame: method=2, preferred_modes={1, 2}, payload_len=174
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-4622-gaa1dc559/rpm/el8/BUILD/ceph-17.0.0-4622-gaa1dc559/src/crimson/mon/MonClient.cc:399:10: runtime error: member access within null pointer of type 'struct Connection'
Segmentation fault on shard 0.
Backtrace:
0# 0x000055E84CF44C1F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F2BC88C0B20 in /lib64/libpthread.so.0
4# crimson::mon::Connection::get_conn() in ceph-osd
5# crimson::mon::Client::handle_auth_request(seastar::shared_ptr<crimson::net::Connection>, seastar::lw_shared_ptr<AuthConnectionMeta>, bool, unsigned int, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*) in ceph-osd
6# crimson::net::ProtocolV2::_handle_auth_request(ceph::buffer::v15_2_0::list&, bool) in ceph-osd
7# 0x000055E84DF67669 in ceph-osd
8# 0x000055E84DF68775 in ceph-osd
9# 0x000055E846F47F60 in ceph-osd
10# 0x000055E85296770F in ceph-osd
11# 0x000055E85296CC50 in ceph-osd
12# 0x000055E852B1ECBB in ceph-osd
13# 0x000055E85267C73A in ceph-osd
14# main in ceph-osd
15# __libc_start_main in /lib64/libc.so.6
16# _start in ceph-osd
Fault at location: 0x98
```
[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136907
When `handle_auth_request()` happens, there is no guarantee that
`active_con` is available. This is reflected in the classical
implementation:
```cpp
int MonClient::handle_auth_request(
Connection *con,
// ...
ceph::buffer::list *reply)
{
// ...
bool isvalid = ah->verify_authorizer(
cct,
*rotating_secrets,
payload,
auth_meta->get_connection_secret_length(),
reply,
&con->peer_name,
&con->peer_global_id,
&con->peer_caps_info,
&auth_meta->session_key,
&auth_meta->connection_secret,
ac);
```
The patch transplants the same logic to crimson.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on May 31, 2021
The `FuturizedStore` interface mandates that `get_attr()`
takes the `name` parameter as `std::string_view`, and
thus burdens implementations with extending the lifetime
of the data the instance refers to.
Unfortunately, `AlienStore` is unaware that prolonging
the life of a `std::string_view` instance doesn't prolong
the memory it points to. This problem has manifested
in the following use-after-free detected at Sepia:
```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136929$ less ./remote/smithi194/log/ceph-osd.7.log.gz
...
DEBUG 2021-05-26 20:24:54,077 [shard 0] osd - do_osd_ops_execute: object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head - handling op
call
DEBUG 2021-05-26 20:24:54,077 [shard 0] osd - handling op call on object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - calling method lock.lock, num_read=0, num_write=0
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - handling op getxattr on object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - getxattr on obj=14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head for attr=_lock.TestLockPP1
DEBUG 2021-05-26 20:24:54,078 [shard 0] bluestore - get_attr
=================================================================
==34068==ERROR: AddressSanitizer: heap-use-after-free on address 0x6030001851d0 at pc 0x7f824d6a5b27 bp 0x7f822b4201c0 sp 0x7f822b41f968
READ of size 17 at 0x6030001851d0 thread T28 (alien-store-tp)
...
#0 0x7f824d6a5b26 (/lib64/libasan.so.5+0x40b26)
#1 0x55e2cbb2e00b (/usr/bin/ceph-osd+0x2b6dc00b)
#2 0x55e2d31f086e (/usr/bin/ceph-osd+0x32d9e86e)
#3 0x55e2d3467607 in crimson::os::ThreadPool::loop(std::chrono::duration<long, std::ratio<1l, 1000l> >, unsigned long) (/usr/bin/ceph-osd+0x33015607)
#4 0x55e2d346b14a (/usr/bin/ceph-osd+0x3301914a)
#5 0x7f8249d32ba2 (/lib64/libstdc++.so.6+0xc2ba2)
#6 0x7f824a00d149 in start_thread (/lib64/libpthread.so.0+0x8149)
#7 0x7f82486edf22 in clone (/lib64/libc.so.6+0xfcf22)
0x6030001851d0 is located 0 bytes inside of 31-byte region [0x6030001851d0,0x6030001851ef)
freed by thread T0 here:
#0 0x7f824d757688 in operator delete(void*) (/lib64/libasan.so.5+0xf2688)
previously allocated by thread T0 here:
#0 0x7f824d7567b0 in operator new(unsigned long) (/lib64/libasan.so.5+0xf17b0)
Thread T28 (alien-store-tp) created by T0 here:
#0 0x7f824d6b7ea3 in __interceptor_pthread_create (/lib64/libasan.so.5+0x52ea3)
SUMMARY: AddressSanitizer: heap-use-after-free (/lib64/libasan.so.5+0x40b26)
Shadow bytes around the buggy address:
0x0c06800289e0: fd fd fd fa fa fa fd fd fd fa fa fa 00 00 00 fa
0x0c06800289f0: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd
0x0c0680028a00: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa
0x0c0680028a10: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa
0x0c0680028a20: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd
=>0x0c0680028a30: fd fd fa fa fd fd fd fd fa fa[fd]fd fd fd fa fa
0x0c0680028a40: fd fd fd fd fa fa fd fd fd fd fa fa 00 00 00 07
0x0c0680028a50: fa fa 00 00 00 fa fa fa 00 00 00 fa fa fa fd fd
0x0c0680028a60: fd fd fa fa fd fd fd fd fa fa fd fd fd fd fa fa
0x0c0680028a70: 00 00 00 00 fa fa fd fd fd fd fa fa fd fd fd fd
0x0c0680028a80: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==34068==ABORTING
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on Jun 28, 2021
`FuturizedStore` and `ObjectStore` use different memory layouts for
conveying object attributes: a map of `bufferlist`s and a map of `bptr`s
respectively. Unfortunately, `AlienStore` was trying to paper over this
mismatch with just a `reinterpret_cast`.
Very likely this problem was the root cause behind the observed
crashes in `PGBackend::load_metadata` like the following one:
```
2021-06-15T09:25:07.511 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: DEBUG 2021-06-15 09:24:19,199 [shard 0] osd - peering_event(id=412, detail=PeeringEvent(from=7 pgid=5.14 sent=49 requested=49 evt=epoch_sent: 49 epoch_requested: 49 MInfoRec from 7 info: 5.14( v 45'2 (0'0,45'2] local-lis/les=48/49 n=0 ec=44/44 lis/c=48/44 les/c/f=49/45/0 sis=48) pg_lease_ack(ruub 19.176788330s))): complete
2021-06-15T09:25:07.511 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: Segmentation fault on shard 0.
2021-06-15T09:25:07.511 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: Backtrace:
2021-06-15T09:25:07.511 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 0# 0x000055C99757FFBF in /usr/bin/ceph-osd
2021-06-15T09:25:07.511 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 1# FatalSignal::signaled(int, siginfo_t const*) in /usr/bin/ceph-osd
2021-06-15T09:25:07.511 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in /usr/bin/ceph-osd
2021-06-15T09:25:07.512 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 3# 0x00007F34BB632B20 in /lib64/libpthread.so.0
2021-06-15T09:25:07.512 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 4# 0x000055C99263D4D2 in /usr/bin/ceph-osd
2021-06-15T09:25:07.512 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 5# 0x000055C992740E47 in /usr/bin/ceph-osd
2021-06-15T09:25:07.512 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 6# seastar::continuation<seastar::internal::promise_base_with_type<std::unique_ptr<PGBackend::loaded_object_md_t, std::default_delete<PGBackend::loaded_object_md_t> > >, seastar::noncopyable_function<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)84> > >::_future<crimson::errorated_future_marker<std::unique_ptr<PGBackend::loaded_object_md_t, std::default_delete<PGBackend::loaded_object_md_t> > > > (seastar::future<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >&&)>, seastar::future<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >::then_wrapped_nrvo<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)84> > >::_future<crimson::errorated_future_marker<std::unique_ptr<PGBackend::loaded_object_md_t, std::default_delete<PGBackend::loaded_object_md_t> > > >, seastar::noncopyable_function<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)84> > >::_future<crimson::errorated_future_marker<std::unique_ptr<PGBackend::loaded_object_md_t, std::default_delete<PGBackend::loaded_object_md_t> > > > (seastar::future<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 
ceph::buffer::v15_2_0::list> > > >&&)> >(seastar::noncopyable_function<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)84> > >::_future<crimson::errorated_future_marker<std::unique_ptr<PGBackend::loaded_object_md_t, std::default_delete<PGBackend::loaded_object_md_t> > > > (seastar::future<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >&&)>&&)::{lambda(seastar::internal::promise_base_with_type<std::unique_ptr<PGBackend::loaded_object_md_t, std::default_delete<PGBackend::loaded_object_md_t> > >&&, seastar::noncopyable_function<crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)84> > >::_future<crimson::errorated_future_marker<std::unique_ptr<PGBackend::loaded_object_md_t, std::default_delete<PGBackend::loaded_object_md_t> > > > (seastar::future<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >&&)>&, seastar::future_state<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >&&)#1}, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >::run_and_dispose() in 
/usr/bin/ceph-osd
2021-06-15T09:25:07.512 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 7# 0x000055C99CFD195F in /usr/bin/ceph-osd
2021-06-15T09:25:07.513 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 8# 0x000055C99CFD6EA0 in /usr/bin/ceph-osd
2021-06-15T09:25:07.513 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 9# 0x000055C99D188F0B in /usr/bin/ceph-osd
2021-06-15T09:25:07.513 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 10# 0x000055C99CCE698A in /usr/bin/ceph-osd
2021-06-15T09:25:07.513 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 11# 0x000055C99CCF0AAE in /usr/bin/ceph-osd
2021-06-15T09:25:07.513 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 12# main in /usr/bin/ceph-osd
2021-06-15T09:25:07.513 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 13# __libc_start_main in /lib64/libc.so.6
2021-06-15T09:25:07.514 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: 14# _start in /usr/bin/ceph-osd
2021-06-15T09:25:07.514 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:19 smithi100 conmon[54917]: Fault at location: 0x31dfff8000
2021-06-15T09:25:07.514 INFO:journalctl@ceph.osd.3.smithi100.stdout:Jun 15 09:24:20 smithi100 podman[55356]: 2021-06-15 09:24:20.230341885 +0000 UTC m=+0.072958807 container died a3ea2a1d0a176286b93b8f5b94458982b9038e70d09128fb55f53b92976f0c42 (image=quay.ceph.io/ceph-ci/ceph@sha256:13ae953e3f83ee011d784d6eb9126fdc692f5bb688fe7d918be61ca7a7282b3c, name=ceph-43579b90-cdba-11eb-8c13-001a4aab830c-osd.3)
```
The fix deals with the issue by wrapping the `bptrs` in `bufferlists`.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit that referenced this pull request on Sep 2, 2021
It's perfectly legal for a client to reconnect to a particular `Watch`
using a different socket / `Connection` than the original one. This must
include proper handling of the watch timer, which is currently broken:
when reconnecting, we don't cancel the timer. This led to the
following crash at Sepia:
```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-09-02_07:44:51-rados-master-distro-basic-smithi/6372357$ less ./remote/smithi183/log/ceph-osd.4.log.gz
...
DEBUG 2021-09-02 08:10:45,462 [shard 0] osd - client_request(id=12, detail=m=[osd_op(client.5087.0:93 7.1e 7:7c7084bd:::repobj:head {watch reconnect cookie 94478891024832 gen 1} snapc 0={} ondisk+write+know
n_if_redirected e40) v8]): got obc lock
...
DEBUG 2021-09-02 08:10:45,462 [shard 0] osd - do_op_watch
INFO 2021-09-02 08:10:45,462 [shard 0] osd - found existing watch by client.5087
DEBUG 2021-09-02 08:10:45,462 [shard 0] osd - do_op_watch_subop_watch
INFO 2021-09-02 08:10:45,462 [shard 0] osd - found existing watch watch(cookie 94478891024832 30s 172.21.15.150:0/3544196211) by client.5087
...
INFO 2021-09-02 08:10:45,462 [shard 0] osd - op_effect: found existing watcher: 94478891024832,client.5087
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7406-g9d30203c/rpm/el8/BUILD/ceph-
17.0.0-7406-g9d30203c/src/seastar/include/seastar/core/timer.hh:95: void seastar::timer<Clock>::arm_state(seastar::timer<Clock>::time_point, std::optional<typename Clock::duration>) [with Clock = seastar::l
owres_clock; seastar::timer<Clock>::time_point = std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long int, std::ratio<1, 1000> > >; typename Clock::duration = std::chrono::duration<long
int, std::ratio<1, 1000> >]: Assertion `!_armed' failed.
Aborting on shard 0.
Backtrace:
0# 0x000055CC052CF0B6 in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const&) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007FA58349FB20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007FA581A98C89 in /lib64/libc.so.6
7# 0x00007FA581AA6A76 in /lib64/libc.so.6
8# 0x000055CC0BEEE9DD in ceph-osd
9# crimson::osd::Watch::connect(seastar::shared_ptr<crimson::net::Connection>, bool) in ceph-osd
10# 0x000055CC00B1D246 in ceph-osd
11# 0x000055CBFFEF01AE in ceph-osd
...
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Sep 2, 2021
It's perfectly legal for a client to reconnect to a particular `Watch`
using a different socket / `Connection` than the original one. This shall
include proper handling of the watch timer, which is currently broken:
when reconnecting, we don't cancel the timer. This led to the
following crash at Sepia:
```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-09-02_07:44:51-rados-master-distro-basic-smithi/6372357$ less ./remote/smithi183/log/ceph-osd.4.log.gz
...
DEBUG 2021-09-02 08:10:45,462 [shard 0] osd - client_request(id=12, detail=m=[osd_op(client.5087.0:93 7.1e 7:7c7084bd:::repobj:head {watch reconnect cookie 94478891024832 gen 1} snapc 0={} ondisk+write+know
n_if_redirected e40) v8]): got obc lock
...
DEBUG 2021-09-02 08:10:45,462 [shard 0] osd - do_op_watch
INFO 2021-09-02 08:10:45,462 [shard 0] osd - found existing watch by client.5087
DEBUG 2021-09-02 08:10:45,462 [shard 0] osd - do_op_watch_subop_watch
INFO 2021-09-02 08:10:45,462 [shard 0] osd - found existing watch watch(cookie 94478891024832 30s 172.21.15.150:0/3544196211) by client.5087
...
INFO 2021-09-02 08:10:45,462 [shard 0] osd - op_effect: found existing watcher: 94478891024832,client.5087
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-7406-g9d30203c/rpm/el8/BUILD/ceph-
17.0.0-7406-g9d30203c/src/seastar/include/seastar/core/timer.hh:95: void seastar::timer<Clock>::arm_state(seastar::timer<Clock>::time_point, std::optional<typename Clock::duration>) [with Clock = seastar::l
owres_clock; seastar::timer<Clock>::time_point = std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long int, std::ratio<1, 1000> > >; typename Clock::duration = std::chrono::duration<long
int, std::ratio<1, 1000> >]: Assertion `!_armed' failed.
Aborting on shard 0.
Backtrace:
0# 0x000055CC052CF0B6 in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const&) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007FA58349FB20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007FA581A98C89 in /lib64/libc.so.6
7# 0x00007FA581AA6A76 in /lib64/libc.so.6
8# 0x000055CC0BEEE9DD in ceph-osd
9# crimson::osd::Watch::connect(seastar::shared_ptr<crimson::net::Connection>, bool) in ceph-osd
10# 0x000055CC00B1D246 in ceph-osd
11# 0x000055CBFFEF01AE in ceph-osd
...
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Sep 22, 2021
`seastar::engine()` is available only for Seastar's threads;
it shouldn't be called outside of a reactor thread.
Unfortunately, this assumption is violated in `AlienStore`
where `OnCommit::finish()`, executed from a finisher thread
of `BlueStore`, calls `alien()` on `seastar::engine()`.
The net effect is crashes like the following one:
```
INFO 2021-09-22 14:26:33,214 [shard 0] osd - operator() writing superblock cluster_fsid 1d8f7908-2ebf-4a91-ae70-f445668c126b osd_fsid 4da9fe9a-1da5-4ea9-aa79-a1178165ede5 [381/1839]
Segmentation fault.
Backtrace:
0# print_backtrace(std::basic_string_view<char, std::char_traits<char> >) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:80
1# FatalSignal::signaled(int, siginfo_t const&) at /opt/rh/gcc-toolset-9/root/usr/include/c++/9/ostream:570
2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:
62
3# 0x00007F16BBA13B30 in /lib64/libpthread.so.0
4# (anonymous namespace)::OnCommit::finish(int) at /home/rzarzynski/ceph1/build/../src/crimson/os/alienstore/alien_store.cc:53
5# Context::complete(int) at /home/rzarzynski/ceph1/build/../src/include/Context.h:100
6# Finisher::finisher_thread_entry() at /home/rzarzynski/ceph1/build/../src/common/Finisher.cc:65
7# 0x00007F16BBA0915A in /lib64/libpthread.so.0
8# clone in /lib64/libc.so.6
Dump of siginfo:
...
si_addr: 0x10
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Sep 23, 2021
It's all about these 4 items:
```
0# print_backtrace(std::basic_string_view<char, std::char_traits<char> >) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:80
1# FatalSignal::signaled(int, siginfo_t const&) at /opt/rh/gcc-toolset-9/root/usr/include/c++/9/ostream:570
2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:
62
3# 0x00007F16BBA13B30 in /lib64/libpthread.so.0
```
They are part of our backtrace handling and typically developers
are not interested in them. Let's be consistent with the classical
OSD and hide them.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Sep 23, 2021
It's all about these items:
```
0# print_backtrace(std::basic_string_view<char, std::char_traits<char> >) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:80
1# FatalSignal::signaled(int, siginfo_t const&) at /opt/rh/gcc-toolset-9/root/usr/include/c++/9/ostream:570
2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/rzarzynski/ceph1/build/../src/crimson/common/fatal_signal.cc:
62
3# 0x00007F16BBA13B30 in /lib64/libpthread.so.0
```
They are part of our backtrace handling and typically developers
are not interested in them. Let's be consistent with the classical
OSD and hide them.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Sep 30, 2021
`PG::request_{local,remote}_recovery_reservation()` dynamically allocates
up to 2 instances of `LambdaContext<T>` and transfers their ownership to
the `AsyncReserver<T, F>`. This is expressed in the raw-pointer (`new` and
`delete`) idiom. Further analysis shows that the only place where `delete`
for these objects is called is the `AsyncReserver::cancel_reservation()`.
In contrast to the classical OSD, crimson doesn't invoke the method when
stopping a PG during the shutdown sequence. This would explain the
following ASan issue observed at Sepia:
```
Direct leak of 576 byte(s) in 24 object(s) allocated from:
#0 0x7fa108fc57b0 in operator new(unsigned long) (/lib64/libasan.so.5+0xf17b0)
#1 0x55723d8b0b56 in non-virtual thunk to crimson::osd::PG::request_local_background_io_reservation(unsigned int, std::unique_ptr<PGPeeringEvent, std::default_delete<PGPeeringEvent> >, std::unique_ptr<PGPeeringEvent, std::default_delete<PGPeeringEvent> >) (/usr/bin/ceph-osd+0x24d95b56)
#2 0x55723f1f66ef in PeeringState::WaitDeleteReserved::WaitDeleteReserved(boost::statechart::state<PeeringState::WaitDeleteReserved, PeeringState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context) (/usr/bin/ceph-osd+0x266db6ef)
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
pushed a commit
that referenced
this pull request
Oct 14, 2021
Otherwise, if we assert, we'll hang here:
```
Thread 1 (Thread 0x7f74eba79580 (LWP 1688617)):
#0 0x00007f74eb2aa529 in futex_wait (private=<optimized out>, expected=132, futex_word=0x7ffd642b4b54) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1 futex_wait_simple (private=<optimized out>, expected=132, futex_word=0x7ffd642b4b54) at ../sysdeps/nptl/futex-internal.h:135
#2 __pthread_cond_destroy (cond=0x7ffd642b4b30) at pthread_cond_destroy.c:54
#3 0x0000563ff2e5a891 in LibRadosService_StatusFormat_Test::TestBody (this=<optimized out>) at /usr/include/c++/7/bits/unique_ptr.h:78
#4 0x0000563ff2e9dc3a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x563ff2ea72e4 "the test body", method=<optimized out>, object=0x563ff422a6d0) at ./src/googletest/googletest/src/gtest.cc:2605
#5 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x563ff422a6d0, method=<optimized out>, location=location@entry=0x563ff2ea72e4 "the test body") at ./src/googletest/googletest/src/gtest.cc:2641
#6 0x0000563ff2e908c3 in testing::Test::Run (this=0x563ff422a6d0) at ./src/googletest/googletest/src/gtest.cc:2680
#7 0x0000563ff2e90a25 in testing::TestInfo::Run (this=0x563ff41a3b70) at ./src/googletest/googletest/src/gtest.cc:2858
#8 0x0000563ff2e90ec1 in testing::TestSuite::Run (this=0x563ff41b6230) at ./src/googletest/googletest/src/gtest.cc:3012
#9 0x0000563ff2e92bdc in testing::internal::UnitTestImpl::RunAllTests (this=<optimized out>) at ./src/googletest/googletest/src/gtest.cc:5723
#10 0x0000563ff2e9e14a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0x563ff2ea8728 "auxiliary test code (environments or event listeners)", method=<optimized out>, object=0x563ff41a2d10) at ./src/googletest/googletest/src/gtest.cc:2605
#11 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x563ff41a2d10, method=<optimized out>, location=location@entry=0x563ff2ea8728 "auxiliary test code (environments or event listeners)") at ./src/googletest/googletest/src/gtest.cc:2641
#12 0x0000563ff2e90ae8 in testing::UnitTest::Run (this=0x563ff30c0660 <testing::UnitTest::GetInstance()::instance>) at ./src/googletest/googletest/src/gtest.cc:5306
```
Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit ee5a0c9)
rzarzynski
pushed a commit
that referenced
this pull request
Oct 19, 2021
exit() calls pthread_cond_destroy() attempting to destroy dpdk::eal::cond;
destroying a condition variable upon which other threads are currently
blocked results in undefined behavior. Tests linking different libc
versions show that libc-2.17 can exit, while libc-2.27 deadlocks with the
following call stack:
Thread 3 (Thread 0xffff7e5749f0 (LWP 62213)):
#0 0x0000ffff7f3c422c in futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0xaaaadc0e30f4 <dpdk::eal::cond+44>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0xaaaadc0e30f8 <dpdk::eal::lock>, cond=0xaaaadc0e30c8 <dpdk::eal::cond>)
at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0xaaaadc0e30c8 <dpdk::eal::cond>, mutex=0xaaaadc0e30f8 <dpdk::eal::lock>)
at pthread_cond_wait.c:655
#3 0x0000ffff7f1f1f80 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#4 0x0000aaaad37f5078 in dpdk::eal::<lambda()>::operator()(void) const (__closure=<optimized out>, __closure=<optimized out>)
at ./src/msg/async/dpdk/dpdk_rte.cc:136
#5 0x0000ffff7f1f7ed4 in ?? () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#6 0x0000ffff7f3be088 in start_thread (arg=0xffffe73e197f) at pthread_create.c:463
#7 0x0000ffff7efc74ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
Thread 1 (Thread 0xffff7ee3b010 (LWP 62200)):
#0 0x0000ffff7f3c3c38 in futex_wait (private=<optimized out>, expected=12, futex_word=0xaaaadc0e30ec <dpdk::eal::cond+36>)
at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1 futex_wait_simple (private=<optimized out>, expected=12, futex_word=0xaaaadc0e30ec <dpdk::eal::cond+36>)
at ../sysdeps/nptl/futex-internal.h:135
#2 __pthread_cond_destroy (cond=0xaaaadc0e30c8 <dpdk::eal::cond>) at pthread_cond_destroy.c:54
#3 0x0000ffff7ef2be34 in __run_exit_handlers (status=-6, listp=0xffff7f04a5a0 <__exit_funcs>, run_list_atexit=255,
run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#4 0x0000ffff7ef2bf6c in __GI_exit (status=<optimized out>) at exit.c:139
#5 0x0000ffff7ef176e4 in __libc_start_main (main=0x0, argc=0, argv=0x0, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:344
#6 0x0000aaaad2939db0 in _start () at ./src/include/buffer.h:642
Fixes: https://tracker.ceph.com/issues/42890
Signed-off-by: Chunsong Feng <fengchunsong@huawei.com>
Signed-off-by: luo rixin <luorixin@huawei.com>
rzarzynski
added a commit
that referenced
this pull request
Nov 18, 2021
`InterruptibleOperation::interruptor` uses `IOInterruptCondition`
underneath which in turn requires a PG to operate:
```
IOInterruptCondition::IOInterruptCondition(Ref<PG>& pg)
: pg(pg), e(pg->get_osdmap_epoch()) {}
```
Providing empty smart pointer leads to crashes like the following one:
```
DEBUG 2021-11-16 13:05:13,536 [shard 0] osd - peering_event(id=3, detail=PeeringEvent(from=7 pgid=2.5 sent=15 requested=15 evt=epoch_sent: 15 epoch_requested: 15 MQuery 2.5 from 7 query_epoch 15 query: query(inf
o 0'0 epoch_sent 15))): got map 15
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-8895-ga6b615de/rpm/el8/BUILD/ceph-17.0.
0-8895-ga6b615de/x86_64-redhat-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:199: T* boost::intrusive_ptr<T>::operator->() const [with T = crimson::osd::PG]: Assertion `px != 0' failed.
Aborting on shard 0.
Backtrace:
Reactor stalled for 2493 ms on shard 0. Backtrace: 0xb14ab 0x46e4d6f8 0x46bba7dd 0x46bd668d 0x46bd6a52 0x46bd6c16 0x46bd6ec6 0x12b1f 0xc8e3b 0x3fdcdab2 0x3fdd11c8 0x3fdd44be 0x3fdd4b83 0x3fdca1fb 0x3fdca712 0x3fdcaf0a 0x12b1f 0x3737e 0x21db4 0x21c88 0x2fa75 0x3c5ca510 0x3bae05b7 0x3bae1026 0x3a752413 0x3a7764a2 0x3a777185 0x46b8c541 0x46bd47ea 0x46d5ebeb 0x46d60bc0 0x46810aa2 0x4681530b 0x39fc9922 0x23492 0x39b6f00d
0# gsignal in /lib64/libc.so.6
1# abort in /lib64/libc.so.6
2# 0x00007FDD53414C89 in /lib64/libc.so.6
3# 0x00007FDD53422A76 in /lib64/libc.so.6
4# crimson::osd::IOInterruptCondition::IOInterruptCondition(boost::intrusive_ptr<crimson::osd::PG>&) in ceph-osd
5# 0x0000559C7F1FB5B8 in ceph-osd
6# 0x0000559C7F1FC027 in ceph-osd
7# auto seastar::internal::future_invoke<seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&, boost::intrusive_ptr<crimson::osd::PG> >(seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&, boost::intrusive_ptr<crimson::osd::PG>&&) in ceph-osd
8# void seastar::futurize<seastar::future<void> >::satisfy_with_result_of<seastar::future<boost::intrusive_ptr<crimson::osd::PG> >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<void> >(seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&)#1}::operator()(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&) const::{lambda()#1}>(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&&) in ceph-osd
9# seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<boost::intrusive_ptr<crimson::osd::PG> >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<void> >(seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&)#1}, boost::intrusive_ptr<crimson::osd::PG> >::run_and_dispose() in ceph-osd
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Nov 18, 2021
The patch is supposed to fix the following problems (extra
debugs onboard):
```
NFO 2021-11-16 01:18:38,713 [shard 0] osd - ~OSD: OSD dtor called
INFO 2021-11-16 01:18:38,713 [shard 0] osd - Heartbeat::Peer: osd.6 removed
INFO 2021-11-16 01:18:38,714 [shard 0] osd - Heartbeat::Peer: osd.5 removed
INFO 2021-11-16 01:18:38,714 [shard 0] osd - Heartbeat::Peer: osd.2 removed
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ShardServices: ShardServices dtor called
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: ShardServices dtor called; unref_size=3, size=3
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: unreferenced p=0x619000115380
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: unreferenced p=0x619000114980
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: unreferenced p=0x619000112680
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: set p=0x619000114980
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: set p=0x619000115380
INFO 2021-11-16 01:18:38,714 [shard 0] osd - ~ObjectContextRegistry: set p=0x619000112680
INFO 2021-11-16 01:18:38,738 [shard 0] osd - crimson shutdown complete
=================================================================
==33351==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 2808 byte(s) in 3 object(s) allocated from:
#0 0x7fe10c0327b0 in operator new(unsigned long) (/lib64/libasan.so.5+0xf17b0)
#1 0x55accbe8ffc4 in ceph::common::intrusive_lru<ceph::common::intrusive_lru_config<hobject_t, crimson::osd::ObjectContext, crimson::osd::obc_to_hoid<crimson::osd::ObjectContext> > >::get_or_create(hobject_t const&) (/usr/bin/ceph-osd+0x3b000fc4)
Objects leaked above:
0x619000112680 (936 bytes)
0x619000114980 (936 bytes)
0x619000115380 (936 bytes)
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Nov 23, 2021
Initially, we were assuming that no pointer obtained from SharedLRU
can outlive the LRU itself. However, since adopting the interruption
concept for handling shutdowns, this is no longer valid.
The patch is supposed to deal with crashes like the following one:
```
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-8898-ge57ad63c/rpm/el8/BUILD/ceph-17.0.
0-8898-ge57ad63c/src/crimson/common/shared_lru.h:46: SharedLRU<K, V>::~SharedLRU() [with K = unsigned int; V = OSDMap]: Assertion `weak_refs.empty()' failed.
Aborting on shard 0.
Backtrace:
Reactor stalled for 1162 ms on shard 0. Backtrace: 0xb14ab 0x46e57428 0x46bc450d 0x46be03bd 0x46be0782 0x46be0946 0x46be0bf6 0x12b1f 0xc8e3b 0x3fdd77e2 0x3fddccdb 0x3fdde1ee 0x3fdde8b3 0x3fdd3f2b 0x3fdd4442 0x3f
dd4c3a 0x12b1f 0x3737e 0x21db4 0x21c88 0x2fa75 0x3a5ae1b9 0x3a38c5e2 0x3a0c823d 0x3a1771f1 0x3a1796f5 0x46ff92c9 0x46ff9525 0x46ff9e93 0x46ff8eae 0x46ff8bd9 0x3a160e67 0x39f50c83 0x39f51cd0 0x46b96271 0x46bde51a
0x46d6891b 0x46d6a8f0 0x4681a7d2 0x4681f03b 0x39fd50f2 0x23492 0x39b7a7dd
0# gsignal in /lib64/libc.so.6
1# abort in /lib64/libc.so.6
2# 0x00007F9535E04C89 in /lib64/libc.so.6
3# 0x00007F9535E12A76 in /lib64/libc.so.6
4# crimson::osd::OSD::~OSD() in ceph-osd
5# seastar::shared_ptr_count_for<crimson::osd::OSD>::~shared_ptr_count_for() in ceph-osd
6# seastar::shared_ptr<crimson::osd::OSD>::~shared_ptr() in ceph-osd
7# seastar::futurize<std::result_of<seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) co
nst::{lambda()#1} ()>::type>::type seastar::smp::submit_to<seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::opera
tor()(unsigned int) const::{lambda()#1}>(unsigned int, seastar::smp_submit_to_options, seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{la
mbda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&&) in ceph-osd
8# std::_Function_handler<seastar::future<void> (unsigned int), seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}>
::_M_invoke(std::_Any_data const&, unsigned int&&) in ceph-osd
9# 0x0000562DA18162CA in ceph-osd
10# 0x0000562DA1816526 in ceph-osd
11# 0x0000562DA1816E94 in ceph-osd
12# 0x0000562DA1815EAF in ceph-osd
13# 0x0000562DA1815BDA in ceph-osd
14# seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>::direct_vtable_for<seastar::future<void>::then_wrapped_maybe_erase<true, seastar::future<void>, seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}>(seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}&&)::{lambda(seastar::future<void>&&)#1}>::call(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)> const*, seastar::future<void>&&) in ceph-osd
15# 0x0000562D9476DC84 in ceph-osd
16# 0x0000562D9476ECD1 in ceph-osd
17# 0x0000562DA13B3272 in ceph-osd
18# 0x0000562DA13FB51B in ceph-osd
19# 0x0000562DA158591C in ceph-osd
20# 0x0000562DA15878F1 in ceph-osd
21# 0x0000562DA10377D3 in ceph-osd
22# 0x0000562DA103C03C in ceph-osd
23# main in ceph-osd
24# __libc_start_main in /lib64/libc.so.6
25# _start in ceph-osd
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Dec 1, 2021
`InterruptibleOperation::interruptor` uses `IOInterruptCondition`
underneath which in turn requires a PG to operate:
```
IOInterruptCondition::IOInterruptCondition(Ref<PG>& pg)
: pg(pg), e(pg->get_osdmap_epoch()) {}
```
Providing empty smart pointer leads to crashes like the following one:
```
DEBUG 2021-11-16 13:05:13,536 [shard 0] osd - peering_event(id=3, detail=PeeringEvent(from=7 pgid=2.5 sent=15 requested=15 evt=epoch_sent: 15 epoch_requested: 15 MQuery 2.5 from 7 query_epoch 15 query: query(inf
o 0'0 epoch_sent 15))): got map 15
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-8895-ga6b615de/rpm/el8/BUILD/ceph-17.0.
0-8895-ga6b615de/x86_64-redhat-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:199: T* boost::intrusive_ptr<T>::operator->() const [with T = crimson::osd::PG]: Assertion `px != 0' failed.
Aborting on shard 0.
Backtrace:
Reactor stalled for 2493 ms on shard 0. Backtrace: 0xb14ab 0x46e4d6f8 0x46bba7dd 0x46bd668d 0x46bd6a52 0x46bd6c16 0x46bd6ec6 0x12b1f 0xc8e3b 0x3fdcdab2 0x3fdd11c8 0x3fdd44be 0x3fdd4b83 0x3fdca1fb 0x3fdca712 0x3fdcaf0a 0x12b1f 0x3737e 0x21db4 0x21c88 0x2fa75 0x3c5ca510 0x3bae05b7 0x3bae1026 0x3a752413 0x3a7764a2 0x3a777185 0x46b8c541 0x46bd47ea 0x46d5ebeb 0x46d60bc0 0x46810aa2 0x4681530b 0x39fc9922 0x23492 0x39b6f00d
0# gsignal in /lib64/libc.so.6
1# abort in /lib64/libc.so.6
2# 0x00007FDD53414C89 in /lib64/libc.so.6
3# 0x00007FDD53422A76 in /lib64/libc.so.6
4# crimson::osd::IOInterruptCondition::IOInterruptCondition(boost::intrusive_ptr<crimson::osd::PG>&) in ceph-osd
5# 0x0000559C7F1FB5B8 in ceph-osd
6# 0x0000559C7F1FC027 in ceph-osd
7# auto seastar::internal::future_invoke<seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&, boost::intrusive_ptr<crimson::osd::PG> >(seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&, boost::intrusive_ptr<crimson::osd::PG>&&) in ceph-osd
8# void seastar::futurize<seastar::future<void> >::satisfy_with_result_of<seastar::future<boost::intrusive_ptr<crimson::osd::PG> >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<void> >(seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&)#1}::operator()(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&) const::{lambda()#1}>(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&&) in ceph-osd
9# seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<boost::intrusive_ptr<crimson::osd::PG> >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<void> >(seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (boost::intrusive_ptr<crimson::osd::PG>&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&)#1}, boost::intrusive_ptr<crimson::osd::PG> >::run_and_dispose() in ceph-osd
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Dec 1, 2021
Initially, we were assuming that no pointer obtained from SharedLRU
can outlive the LRU itself. However, since adopting the interruption
concept for handling shutdowns, this is no longer valid.
The patch is supposed to deal with crashes like the following one:
```
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-8898-ge57ad63c/rpm/el8/BUILD/ceph-17.0.
0-8898-ge57ad63c/src/crimson/common/shared_lru.h:46: SharedLRU<K, V>::~SharedLRU() [with K = unsigned int; V = OSDMap]: Assertion `weak_refs.empty()' failed.
Aborting on shard 0.
Backtrace:
Reactor stalled for 1162 ms on shard 0. Backtrace: 0xb14ab 0x46e57428 0x46bc450d 0x46be03bd 0x46be0782 0x46be0946 0x46be0bf6 0x12b1f 0xc8e3b 0x3fdd77e2 0x3fddccdb 0x3fdde1ee 0x3fdde8b3 0x3fdd3f2b 0x3fdd4442 0x3f
dd4c3a 0x12b1f 0x3737e 0x21db4 0x21c88 0x2fa75 0x3a5ae1b9 0x3a38c5e2 0x3a0c823d 0x3a1771f1 0x3a1796f5 0x46ff92c9 0x46ff9525 0x46ff9e93 0x46ff8eae 0x46ff8bd9 0x3a160e67 0x39f50c83 0x39f51cd0 0x46b96271 0x46bde51a
0x46d6891b 0x46d6a8f0 0x4681a7d2 0x4681f03b 0x39fd50f2 0x23492 0x39b7a7dd
0# gsignal in /lib64/libc.so.6
1# abort in /lib64/libc.so.6
2# 0x00007F9535E04C89 in /lib64/libc.so.6
3# 0x00007F9535E12A76 in /lib64/libc.so.6
4# crimson::osd::OSD::~OSD() in ceph-osd
5# seastar::shared_ptr_count_for<crimson::osd::OSD>::~shared_ptr_count_for() in ceph-osd
6# seastar::shared_ptr<crimson::osd::OSD>::~shared_ptr() in ceph-osd
7# seastar::futurize<std::result_of<seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) co
nst::{lambda()#1} ()>::type>::type seastar::smp::submit_to<seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::opera
tor()(unsigned int) const::{lambda()#1}>(unsigned int, seastar::smp_submit_to_options, seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{la
mbda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&&) in ceph-osd
8# std::_Function_handler<seastar::future<void> (unsigned int), seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}>
::_M_invoke(std::_Any_data const&, unsigned int&&) in ceph-osd
9# 0x0000562DA18162CA in ceph-osd
10# 0x0000562DA1816526 in ceph-osd
11# 0x0000562DA1816E94 in ceph-osd
12# 0x0000562DA1815EAF in ceph-osd
13# 0x0000562DA1815BDA in ceph-osd
14# seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>::direct_vtable_for<seastar::future<void>::then_wrapped_maybe_erase<true, seastar::future<void>, seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}>(seastar::sharded<crimson::osd::OSD>::stop()::{lambda(seastar::future<void>)#2}&&)::{lambda(seastar::future<void>&&)#1}>::call(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)> const*, seastar::future<void>&&) in ceph-osd
15# 0x0000562D9476DC84 in ceph-osd
16# 0x0000562D9476ECD1 in ceph-osd
17# 0x0000562DA13B3272 in ceph-osd
18# 0x0000562DA13FB51B in ceph-osd
19# 0x0000562DA158591C in ceph-osd
20# 0x0000562DA15878F1 in ceph-osd
21# 0x0000562DA10377D3 in ceph-osd
22# 0x0000562DA103C03C in ceph-osd
23# main in ceph-osd
24# __libc_start_main in /lib64/libc.so.6
25# _start in ceph-osd
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Mar 7, 2022
```
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - calling method rbd.create, num_read=0, num_write=0
DEBUG 2022-03-07 13:50:40,027 [shard 0] objclass - <cls> ../src/cls/rbd/cls_rbd.cc:787: create object_prefix=parent_id size=2097152 order=0 features=1
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - handling op omap-get-vals-by-keys on object 1:144d5af5:::parent_id:head
=================================================================
==2109764==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f6de5176e70 at pc 0x7f6dfd2a7157 bp 0x7f6de5176e30 sp 0x7f6de51765d8
WRITE of size 24 at 0x7f6de5176e70 thread T0
#0 0x7f6dfd2a7156 in __interceptor_sigaltstack.part.0 (/lib64/libasan.so.6+0x54156)
#1 0x7f6dfd30d5b3 in __asan::PlatformUnpoisonStacks() (/lib64/libasan.so.6+0xba5b3)
#2 0x7f6dfd31314c in __asan_handle_no_return (/lib64/libasan.so.6+0xc014c)
Reactor stalled for 275 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd3383c1 0x7f6dfd339b18 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd33b089 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
ceph#3 0x1881f22 in fmt::v6::internal::arg_map<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~arg_map() /usr/include/fmt/core.h:1170
ceph#4 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::~basic_format_context() /usr/include/fmt/core.h:1265
ceph#5 0x1881f22 in fmt::v6::format_handler<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~format_handler() /usr/include/fmt/format.h:3143
ceph#6 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::iterator fmt::v6::vformat_to<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >(fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >::range, fmt::v6::basic_string_view<char>, fmt::v6::basic_format_args<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >, fmt::v6::internal::locale_ref) /usr/include/fmt/format.h:3206
ceph#7 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::vformat_to<fmt::v6::basic_string_view<char>, seastar::internal::log_buf::inserter_iterator, , 0>(seastar::internal::log_buf::inserter_iterator, fmt::v6::basic_string_view<char> const&, fmt::v6::basic_format_args<fmt::v6::basic_format_context<fmt::v6::type_identity<seastar::internal::log_buf::inserter_iterator>::type, fmt::v6::internal::char_t_impl<fmt::v6::basic_string_view<char>, void>::type> >) /usr/include/fmt/format.h:3395
ceph#8 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::format_to<seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> >, hobject_t const&, 0>(seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> > const&, hobject_t const&) /usr/include/fmt/format.h:3418
ceph#9 0x188344a in seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}::operator()(seastar::internal::log_buf::inserter_iterator) const ../src/seastar/include/seastar/util/log.hh:227
ceph#10 0x188344a in seastar::logger::lambda_log_writer<seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}>::operator()(seastar::internal::log_buf::inserter_iterator) ../src/seastar/include/seastar/util/log.hh:106
ceph#11 0xe8b439d in operator() ../src/seastar/src/util/log.cc:268
ceph#12 0xe8b58f2 in seastar::logger::do_log(seastar::log_level, seastar::logger::log_writer&) ../src/seastar/src/util/log.cc:280
ceph#13 0x2521d5a in void seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:230
ceph#14 0x2a2ee12 in void seastar::logger::debug<hobject_t const&>(seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:373
ceph#15 0x2a2ee12 in PGBackend::omap_get_vals_by_keys(ObjectState const&, OSDOp&, object_stat_sum_t&) const ../src/crimson/osd/pg_backend.cc:1220
ceph#16 0x2c76349 in operator()<PGBackend, ObjectState> ../src/crimson/osd/ops_executer.cc:577
ceph#17 0x2c76349 in do_const_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.cc:449
ceph#18 0x2e04ce9 in do_read_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.h:216
ceph#19 0x2e04ce9 in crimson::osd::OpsExecuter::execute_op(OSDOp&) ../src/crimson/osd/ops_executer.cc:576
Reactor stalled for 762 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd33ae85 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
ceph#20 0x3c70c55 in execute_osd_op ../src/crimson/osd/objclass.cc:35
ceph#21 0x3cb8aa8 in cls_cxx_map_get_val(void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*) ../src/crimson/osd/objclass.cc:372
ceph#22 0x7f6de558de39 (/home/rzarzynski/ceph1/build/lib/libcls_rbd.so.1.0.0+0x28e39)
0x7f6de5176e70 is located 249456 bytes inside of 262144-byte region [0x7f6de513a000,0x7f6de517a000)
allocated by thread T0 here:
#0 0x7f6dfd3084a7 in aligned_alloc (/lib64/libasan.so.6+0xb54a7)
#1 0xdd414fc in seastar::thread_context::make_stack(unsigned long) ../src/seastar/src/core/thread.cc:196
#2 0x7fff3214bc4f ([stack]+0xa5c4f)
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Mar 7, 2022
The problem is:
```
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - calling method rbd.create, num_read=0, num_write=0
DEBUG 2022-03-07 13:50:40,027 [shard 0] objclass - <cls> ../src/cls/rbd/cls_rbd.cc:787: create object_prefix=parent_id size=2097152 order=0 features=1
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - handling op omap-get-vals-by-keys on object 1:144d5af5:::parent_id:head
=================================================================
==2109764==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f6de5176e70 at pc 0x7f6dfd2a7157 bp 0x7f6de5176e30 sp 0x7f6de51765d8
WRITE of size 24 at 0x7f6de5176e70 thread T0
#0 0x7f6dfd2a7156 in __interceptor_sigaltstack.part.0 (/lib64/libasan.so.6+0x54156)
#1 0x7f6dfd30d5b3 in __asan::PlatformUnpoisonStacks() (/lib64/libasan.so.6+0xba5b3)
#2 0x7f6dfd31314c in __asan_handle_no_return (/lib64/libasan.so.6+0xc014c)
Reactor stalled for 275 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd3383c1 0x7f6dfd339b18 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd33b089 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
ceph#3 0x1881f22 in fmt::v6::internal::arg_map<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~arg_map() /usr/include/fmt/core.h:1170
ceph#4 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::~basic_format_context() /usr/include/fmt/core.h:1265
ceph#5 0x1881f22 in fmt::v6::format_handler<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~format_handler() /usr/include/fmt/format.h:3143
ceph#6 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::iterator fmt::v6::vformat_to<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >(fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >::range, fmt::v6::basic_string_view<char>, fmt::v6::basic_format_args<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >, fmt::v6::internal::locale_ref) /usr/include/fmt/format.h:3206
ceph#7 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::vformat_to<fmt::v6::basic_string_view<char>, seastar::internal::log_buf::inserter_iterator, , 0>(seastar::internal::log_buf::inserter_iterator, fmt::v6::basic_string_view<char> const&, fmt::v6::basic_format_args<fmt::v6::basic_format_context<fmt::v6::type_identity<seastar::internal::log_buf::inserter_iterator>::type, fmt::v6::internal::char_t_impl<fmt::v6::basic_string_view<char>, void>::type> >) /usr/include/fmt/format.h:3395
ceph#8 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::format_to<seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> >, hobject_t const&, 0>(seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> > const&, hobject_t const&) /usr/include/fmt/format.h:3418
ceph#9 0x188344a in seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}::operator()(seastar::internal::log_buf::inserter_iterator) const ../src/seastar/include/seastar/util/log.hh:227
ceph#10 0x188344a in seastar::logger::lambda_log_writer<seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}>::operator()(seastar::internal::log_buf::inserter_iterator) ../src/seastar/include/seastar/util/log.hh:106
ceph#11 0xe8b439d in operator() ../src/seastar/src/util/log.cc:268
ceph#12 0xe8b58f2 in seastar::logger::do_log(seastar::log_level, seastar::logger::log_writer&) ../src/seastar/src/util/log.cc:280
ceph#13 0x2521d5a in void seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:230
ceph#14 0x2a2ee12 in void seastar::logger::debug<hobject_t const&>(seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:373
ceph#15 0x2a2ee12 in PGBackend::omap_get_vals_by_keys(ObjectState const&, OSDOp&, object_stat_sum_t&) const ../src/crimson/osd/pg_backend.cc:1220
ceph#16 0x2c76349 in operator()<PGBackend, ObjectState> ../src/crimson/osd/ops_executer.cc:577
ceph#17 0x2c76349 in do_const_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.cc:449
ceph#18 0x2e04ce9 in do_read_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.h:216
ceph#19 0x2e04ce9 in crimson::osd::OpsExecuter::execute_op(OSDOp&) ../src/crimson/osd/ops_executer.cc:576
Reactor stalled for 762 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd33ae85 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
ceph#20 0x3c70c55 in execute_osd_op ../src/crimson/osd/objclass.cc:35
ceph#21 0x3cb8aa8 in cls_cxx_map_get_val(void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*) ../src/crimson/osd/objclass.cc:372
ceph#22 0x7f6de558de39 (/home/rzarzynski/ceph1/build/lib/libcls_rbd.so.1.0.0+0x28e39)
0x7f6de5176e70 is located 249456 bytes inside of 262144-byte region [0x7f6de513a000,0x7f6de517a000)
allocated by thread T0 here:
#0 0x7f6dfd3084a7 in aligned_alloc (/lib64/libasan.so.6+0xb54a7)
#1 0xdd414fc in seastar::thread_context::make_stack(unsigned long) ../src/seastar/src/core/thread.cc:196
#2 0x7fff3214bc4f ([stack]+0xa5c4f)
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Jul 5, 2022
Before the patch there was a possibility that `OSDConnectionPriv`
gets destructed before a `PipelineHandle` instance that was using
it. The reason is that our remote-handling operations store `conn` directly,
while `handle` is defined in a parent class. Due to the language rules,
the former gets deinitialized earlier.
```
==756032==ERROR: AddressSanitizer: heap-use-after-free on address 0x615000039684 at pc 0x0000020bdfa2 bp 0x7ffd3abfa370 sp 0x7ffd3abfa360
READ of size 1 at 0x615000039684 thread T0
Reactor stalled for 261 ms on shard 0. Backtrace: 0x45d9d 0xe90f6d1 0xe6b8a1d 0xe6d1205 0xe6d16a8 0xe6d1938 0xe6d1c03 0x12cdf 0xccebf 0x7f6447161b1e 0x7f644714aee8 0x7f644714eed6 0x7f644714fb36 0x7f64471420b5 0x
7f6447143f3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0xbdc1a 0x20bdfa1 0x20c184e 0x352eb7f 0x352fa28 0x20b04a5 0x1be30e5 0xe694bc4 0xe6ebb8a 0xe843a11 0xe845a22 0xe29f497 0xe2a3ccd 0x1ab1841 0x3aca2 0x175698d
#0 0x20bdfa1 in seastar::shared_mutex::unlock() ../src/seastar/include/seastar/core/shared_mutex.hh:122
#1 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::exit() ../src/crimson/common/operation.h:548
#2 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::ExitBarrier::exit() ../src/crimson/common/operation.h:533
ceph#3 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::ExitBarrier::cancel() ../src/crimson/common/operation.h:539
ceph#4 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::ExitBarrier::~ExitBarrier() ../src/crimson/common/operation.h:543
ceph#5 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::ExitBarrier::~ExitBarrier() ../src/crimson/common/operation.h:544
ceph#6 0x352eb7f in std::default_delete<crimson::PipelineExitBarrierI>::operator()(crimson::PipelineExitBarrierI*) const /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_ptr.h:85
ceph#7 0x352eb7f in std::unique_ptr<crimson::PipelineExitBarrierI, std::default_delete<crimson::PipelineExitBarrierI> >::~unique_ptr() /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_ptr.h:361
ceph#8 0x352eb7f in crimson::PipelineHandle::~PipelineHandle() ../src/crimson/common/operation.h:457
ceph#9 0x352eb7f in crimson::osd::PhasedOperationT<crimson::osd::ClientRequest>::~PhasedOperationT() ../src/crimson/osd/osd_operation.h:152
ceph#10 0x352eb7f in crimson::osd::ClientRequest::~ClientRequest() ../src/crimson/osd/osd_operations/client_request.cc:64
ceph#11 ...
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
pushed a commit
that referenced
this pull request
Jul 22, 2022
otherwise we have a segfault when running "npm ci", like:
```
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f77f89099ed in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
[Current thread is 1 (Thread 0x7f77f8496740 (LWP 4046307))]
(gdb) bt
#0 0x00007f77f89099ed in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x00000000008c3127 in node::Environment::Environment(node::IsolateData*, v8::Local<v8::Context>, node::tracing::AgentWriterHandle*) ()
#2 0x00000000008e4d4b in node::Start(v8::Isolate*, node::IsolateData*, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&) ()
ceph#3 0x00000000008e34a2 in node::Start(int, char**) ()
ceph#4 0x00007f77f84c00b3 in __libc_start_main (main=0x89dc10 <main>, argc=3, argv=0x7ffd1dc8e8a8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd1dc8e898) at ../csu/libc-start.c:308
ceph#5 0x000000000089dd45 in _start ()
```
This change is not cherry-picked from master, because 7f7f8a4, the change introducing 10.16.0, is way too large and touches lots of places in the dashboard, while we just need to get the dashboard frontend npm packages ready with a minimal change. Signed-off-by: Kefu Chai <tchaikov@gmail.com>
rzarzynski
added a commit
that referenced
this pull request
Aug 5, 2022
`Transaction` aggregates multiple contexts (event handlers) of
three different genres:
* on applied,
* on commit,
* on sync.
Although not expressed through language means, a transaction holds
ownership of these contexts, and thus should free them when
it gets destructed. Usually, this isn't an issue as those objects
are taken from it by `collect_contexts()`, `get_on_*` or similar.
However, there are some `Transaction` instances that can get
destructed with contexts on board. Such a situation already
happened in crimson:
```
INFO 2022-08-01 11:31:29,176 [shard 0] osd - crimson shutdown complete
=================================================================
==101541==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x7fbaadd30307 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6307)
#1 0x55922ad3d81b in non-virtual thunk to crimson::osd::PG::start_flush_on_transaction(ceph::os::Transaction&) (/usr/bin/ceph-osd+0x32fda81b)
#2 0x60800015019f (<unknown module>)
SUMMARY: AddressSanitizer: 16 byte(s) leaked in 1 allocation(s).
daemon-helper: command failed with exit status 1
```
I'm afraid the same can affect the classical world, but we may be
missing it, as the valgrind coverage is narrower than the one that
comes from ASan in crimson (every single run hunts for memleaks).
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Aug 10, 2022
This commit switches `ceph::osd::Transaction` to hold `Context`
instances with `std::unique_ptr` instead of raw pointers which
addresses the following memory leak observed in crimson:
```
INFO 2022-08-01 11:31:29,176 [shard 0] osd - crimson shutdown complete
=================================================================
==101541==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x7fbaadd30307 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6307)
#1 0x55922ad3d81b in non-virtual thunk to crimson::osd::PG::start_flush_on_transaction(ceph::os::Transaction&) (/usr/bin/ceph-osd+0x32fda81b)
#2 0x60800015019f (<unknown module>)
SUMMARY: AddressSanitizer: 16 byte(s) leaked in 1 allocation(s).
daemon-helper: command failed with exit status 1
```
It's likely a similar problem affects the classical OSD as well.
TODO: refactor usages of `register_on_*` and switch them
to `std::make_unique`. This will free us from *assuming*
we always get ownership (the compiler will validate it).
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Aug 10, 2022
This commit switches `ceph::osd::Transaction` to hold `Context`
instances with `std::unique_ptr` instead of raw pointers which
addresses the following memory leak observed in crimson:
```
INFO 2022-08-01 11:31:29,176 [shard 0] osd - crimson shutdown complete
=================================================================
==101541==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x7fbaadd30307 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6307)
#1 0x55922ad3d81b in non-virtual thunk to crimson::osd::PG::start_flush_on_transaction(ceph::os::Transaction&) (/usr/bin/ceph-osd+0x32fda81b)
#2 0x60800015019f (<unknown module>)
SUMMARY: AddressSanitizer: 16 byte(s) leaked in 1 allocation(s).
daemon-helper: command failed with exit status 1
```
It's likely a similar problem affects the classical OSD as well.
TODO: refactor usages of `register_on_*` and switch them
to `std::make_unique`. This will free us from *assuming*
we always get ownership (the compiler will validate it).
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Aug 10, 2022
This commit switches `ceph::osd::Transaction` to hold `Context`
instances with `std::unique_ptr` instead of raw pointers which
addresses the following memory leak observed in crimson:
```
INFO 2022-08-01 11:31:29,176 [shard 0] osd - crimson shutdown complete
=================================================================
==101541==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x7fbaadd30307 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6307)
#1 0x55922ad3d81b in non-virtual thunk to crimson::osd::PG::start_flush_on_transaction(ceph::os::Transaction&) (/usr/bin/ceph-osd+0x32fda81b)
#2 0x60800015019f (<unknown module>)
SUMMARY: AddressSanitizer: 16 byte(s) leaked in 1 allocation(s).
daemon-helper: command failed with exit status 1
```
It's likely a similar problem affects the classical OSD as well.
TODO: refactor usages of `register_on_*` and switch them
to `std::make_unique`. This will free us from *assuming*
we always get ownership (the compiler will validate it).
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Aug 23, 2022
`Transaction` aggregates multiple contexts (event handlers) of
three different genres:
* on applied,
* on commit,
* on sync.
Although not expressed through language means, a transaction holds
ownership of these contexts, and thus should free them when
it gets destructed. Usually, this isn't an issue as those objects
are taken from it by `collect_contexts()`, `get_on_*` or similar.
However, there are some `Transaction` instances that can get
destructed with contexts on board. Such a situation already
happened in crimson:
```
INFO 2022-08-01 11:31:29,176 [shard 0] osd - crimson shutdown complete
=================================================================
==101541==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x7fbaadd30307 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6307)
#1 0x55922ad3d81b in non-virtual thunk to crimson::osd::PG::start_flush_on_transaction(ceph::os::Transaction&) (/usr/bin/ceph-osd+0x32fda81b)
#2 0x60800015019f (<unknown module>)
SUMMARY: AddressSanitizer: 16 byte(s) leaked in 1 allocation(s).
daemon-helper: command failed with exit status 1
```
I'm afraid the same can affect the classical world, but we may be
missing it, as the valgrind coverage is narrower than the one that
comes from ASan in crimson (every single run hunts for memleaks).
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
added a commit
that referenced
this pull request
Aug 23, 2022
This commit switches `ceph::osd::Transaction` to hold `Context`
instances with `std::unique_ptr` instead of raw pointers which
addresses the following memory leak observed in crimson:
```
INFO 2022-08-01 11:31:29,176 [shard 0] osd - crimson shutdown complete
=================================================================
==101541==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x7fbaadd30307 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6307)
#1 0x55922ad3d81b in non-virtual thunk to crimson::osd::PG::start_flush_on_transaction(ceph::os::Transaction&) (/usr/bin/ceph-osd+0x32fda81b)
#2 0x60800015019f (<unknown module>)
SUMMARY: AddressSanitizer: 16 byte(s) leaked in 1 allocation(s).
daemon-helper: command failed with exit status 1
```
It's likely a similar problem affects the classical OSD as well.
TODO: refactor usages of `register_on_*` and switch them
to `std::make_unique`. This will free us from *assuming*
we always get ownership (the compiler will validate it).
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski
pushed a commit
that referenced
this pull request
Apr 27, 2023
Improvement #1:
CapTester.write_test_files() not only creates the test file but also does the following for every mount object it receives in parameters -
* carefully produces the path for the test file as per the parameters received
* generates the unique data for each test file on a CephFS mount
* creates a data structure -- a list of lists -- that holds all this information along with the mount object itself for each mount object, so that tests can be conducted at a later point

Untangle this mess of code by splitting this method into 3 separate methods -
1. To produce the path for the test file (as per the user's need).
2. To generate the data that will be written into the test file.
3. To actually create the test file on CephFS.

Improvement #2:
Remove the internal data structure used for testing -- self.test_set -- and use separate class attributes to store all the data required for testing instead of a tuple. This serves two purposes - One, it makes it easy to manipulate all this data from helper methods and during a debugging session, especially while using a PDB session. And two, it makes it impossible to have multiple mounts/multiple "test sets" within the same CapTester instance, for the sake of simplicity. Users can instead create two CapTester instances if needed.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
rzarzynski
pushed a commit
that referenced
this pull request
Aug 28, 2023
```
=================================================================
==80592==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 8 byte(s) in 1 object(s) allocated from:
#0 0x7f5c76eb6367 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6367)
#1 0x7f5c76a2fb81 in MallocExtension::Register(MallocExtension*) (/lib64/libtcmalloc.so.4+0x2fb81)
SUMMARY: AddressSanitizer: 8 byte(s) leaked in 1 allocation(s)
```
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
rzarzynski
pushed a commit
that referenced
this pull request
Dec 4, 2023
Sanitized backtrace:
```
DEBUG 2023-11-14 15:23:50,871 [shard 0] osd - snaptrim_event(id=10610, detail=SnapTrimEvent(pgid=16.1a snapid=a needs_pause=0)): interrupted crimson::common::actingset_changed (acting set changed)
#0 0x5653c613c071 in seastar::shared_mutex::unlock() (/usr/bin/ceph-osd+0x1ed27071)
#1 0x5653c8670acf in auto seastar::futurize_invoke<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::ExitBarrier<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::BlockingEvent::Trigger<crimson::osd::SnapTrimEvent> >::exit()::{lambda()#1}&>(crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::ExitBarrier<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::BlockingEvent::Trigger<crimson::osd::SnapTrimEvent> >::exit()::{lambda()#1}&) (/usr/bin/ceph-osd+0x2125bacf)
#2 0x5653c8670e22 in _ZN7seastar20noncopyable_functionIFNS_6futureIvEEvEE17direct_vtable_forIZNS2_4thenIZN7crimson23OrderedConcurrentPhaseTINS7_3osd13SnapTrimEvent9WaitSubopEE11ExitBarrierINSC_13BlockingEvent7TriggerISA_EEE4exitEvEUlvE_S2_EET0_OT_EUlDpOT_E_E4callEPKS4_ (/usr/bin/ceph-osd+0x2125be22)
freed by thread T1 here:
#0 0x7f10628b73cf in operator delete(void*, unsigned long) (/lib64/libasan.so.6+0xb73cf)
#1 0x5653c8794bff in crimson::osd::SnapTrimEvent::~SnapTrimEvent() (/usr/bin/ceph-osd+0x2137fbff)
previously allocated by thread T1 here:
#0 0x7f10628b6367 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6367)
SUMMARY: AddressSanitizer: heap-use-after-free (/usr/bin/ceph-osd+0x1ed27071) in seastar::shared_mutex::unlock()
```
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
rzarzynski
pushed a commit
that referenced
this pull request
Dec 4, 2023
```
// SnapTrimEvent is a background operation,
// it's lifetime is not guarnteed since the caller
// returned future is being ignored. We should capture
// a self reference thourhgout the entire execution
// progress (not only on finally() continuations).
// See: PG::on_active_actmap()
```
Sanitized backtrace:
```
DEBUG 2023-11-16 08:42:48,441 [shard 0] osd - snaptrim_event(id=21122, detail=SnapTrimEvent(pgid=3.1 snapid=3cb needs_pause=1)): interrupted crimson::common::actingset_changed (acting set changed
kernel callstack:
#0 0x55e310e0ace7 in seastar::shared_mutex::unlock() (/usr/bin/ceph-osd+0x1edd0ce7)
#1 0x55e313325d9c in auto seastar::futurize_invoke<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::ExitBarrier<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::BlockingEvent::Trigger<crimson::osd::SnapTrimEvent> >::exit()::{lambda()#1}&>(crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::ExitBarrier<crimson::OrderedConcurrentPhaseT<crimson::osd::SnapTrimEvent::WaitSubop>::BlockingEvent::Trigger<crimson::osd::SnapTrimEvent> >::exit()::{lambda()#1}&) (/usr/bin/ceph-osd+0x212ebd9c)
#2 0x55e3133260ef in _ZN7seastar20noncopyable_functionIFNS_6futureIvEEvEE17direct_vtable_forIZNS2_4thenIZN7crimson23OrderedConcurrentPhaseTINS7_3osd13SnapTrimEvent9WaitSubopEE11ExitBarrierINSC_13BlockingEvent7TriggerISA_EEE4exitEvEUlvE_S2_EET0_OT_EUlDpOT_E_E4callEPKS4_ (/usr/bin/ceph-osd+0x212ec0ef)
0x61500013365c is located 92 bytes inside of 472-byte region [0x615000133600,0x6150001337d8)
freed by thread T2 here:
#0 0x7fb345ab73cf in operator delete(void*, unsigned long) (/lib64/libasan.so.6+0xb73cf)
#1 0x55e313474863 in crimson::osd::SnapTrimEvent::~SnapTrimEvent() (/usr/bin/ceph-osd+0x2143a863)
previously allocated by thread T2 here:
#0 0x7fb345ab6367 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6367)
#1 0x55e31183ac18 in auto crimson::OperationRegistryI::create_operation<crimson::osd::SnapTrimEvent, crimson::osd::PG*, SnapMapper&, snapid_t const&, bool const&>(crimson::osd::PG*&&, SnapMapper&, snapid_t const&, bool const&) (/usr/bin/ceph-osd+0x1f800c18)
SUMMARY: AddressSanitizer: heap-use-after-free (/usr/bin/ceph-osd+0x1edd0ce7) in seastar::shared_mutex::unlock()
```
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
rzarzynski
pushed a commit
that referenced
this pull request
Jan 11, 2024
```
=================================================================
==80592==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 8 byte(s) in 1 object(s) allocated from:
#0 0x7f5c76eb6367 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6367)
#1 0x7f5c76a2fb81 in MallocExtension::Register(MallocExtension*) (/lib64/libtcmalloc.so.4+0x2fb81)
SUMMARY: AddressSanitizer: 8 byte(s) leaked in 1 allocation(s)
```
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
(cherry picked from commit bc19097)
Signed-off-by: Kefu Chai kchai@redhat.com