Skip to content

Wip obsync swift#6

Merged
liewegas merged 3 commits intoceph:masterfrom
kylemarsh:wip-obsync-swift
Dec 19, 2011
Merged

Wip obsync swift#6
liewegas merged 3 commits intoceph:masterfrom
kylemarsh:wip-obsync-swift

Conversation

@kylemarsh
Copy link

Obsync changes to make it talk swift. Changes CLI significantly, so existing things using obsync will break.

Kyle Marsh added 3 commits December 9, 2011 01:34
A single "url" doesn't make sense for a swift object store the way it does
for an S3 store or local file, so this commit explodes the interface from a
single url argument to three arguments: host, bucket, and object_prefix.

It also adds a SwiftStore and related classes.  Syncing ACLs with swift is
not yet implemented.
Commit ebe5fc6 removed the definition of
vvprint without removint all the places that used it.
s3 connections require the hostname and swift connections require the
authurl.  obsync treats these as equivalent internally, but breaks them
apart on the command line interface for clarity for the users.
liewegas added a commit that referenced this pull request Dec 19, 2011
@liewegas liewegas merged commit 29e6d6c into ceph:master Dec 19, 2011
liewegas pushed a commit that referenced this pull request Nov 18, 2012
Before the mon, and lockdep, in particular.

#0  __pthread_mutex_lock (mutex=0x30) at pthread_mutex_lock.c:50
#1  0x0000000000816092 in ceph::log::Log::submit_entry (this=0x0, e=0x2f4a270) at log/Log.cc:138
#2  0x00000000007ee0f8 in handle_fatal_signal (signum=11) at global/signal_handler.cc:100
#3  <signal handler called>
#4  0x00000000008e1300 in lockdep_will_lock (name=0x959aa7 "SignalHandler::lock", id=17) at common/lockdep.cc:163
#5  0x00000000008867fc in Mutex::_will_lock (this=0x2f20428) at ./common/Mutex.h:56
#6  0x0000000000886605 in Mutex::Lock (this=0x2f20428, no_lockdep=false) at common/Mutex.cc:81
#7  0x00000000007eeb95 in SignalHandler::entry (this=0x2f20300) at global/signal_handler.cc:198
#8  0x00000000008b0bd1 in Thread::_entry_func (arg=0x2f20300) at common/Thread.cc:43
#9  0x00007f36fefd6b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#10 0x00007f36fd80b6dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#11 0x0000000000000000 in ?? ()

#0  0x00007f36fefd7e75 in pthread_join (threadid=139874129766144, thread_return=0x0) at pthread_join.c:89
#1  0x00000000008b11ec in Thread::join (this=0x2f20300, prval=0x0) at common/Thread.cc:130
#2  0x00000000007eeae7 in SignalHandler::shutdown (this=0x2f20300) at global/signal_handler.cc:186
#3  0x00000000007ee9cf in SignalHandler::~SignalHandler (this=0x2f20300, __in_chrg=<optimized out>) at global/signal_handler.cc:175
#4  0x00000000007eea58 in SignalHandler::~SignalHandler (this=0x2f20300, __in_chrg=<optimized out>) at global/signal_handler.cc:176
#5  0x00000000007ee643 in shutdown_async_signal_handler () at global/signal_handler.cc:324
#6  0x00000000006de9d2 in main (argc=7, argv=0x7fffbfb8a1e8) at ceph_mon.cc:439

Signed-off-by: Sage Weil <sage@inktank.com>
liewegas added a commit that referenced this pull request Sep 15, 2014
The C_CancelOp path assumes op->session != NULL.  Cancel that op before
we clear it.  This fixes a crash like

#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:39
#1  0x00007fc82690a4b1 in RWLock::get_write (this=0x18, lockdep=<optimized out>) at ./common/RWLock.h:88
#2  0x00007fc8268f4d79 in Objecter::op_cancel (this=0x1f61830, s=0x0, tid=0, r=-110) at osdc/Objecter.cc:1850
#3  0x00007fc8268ba449 in Context::complete (this=0x1f68c20, r=<optimized out>) at ./include/Context.h:64
#4  0x00007fc8269769aa in RWTimer::timer_thread (this=0x1f61950) at common/Timer.cc:268
#5  0x00007fc82697a85d in RWTimerThread::entry (this=<optimized out>) at common/Timer.cc:200
#6  0x00007fc82651ce9a in start_thread (arg=0x7fc7e3fff700) at pthread_create.c:308

Signed-off-by: Sage Weil <sage@redhat.com>
chamdoo pushed a commit to chamdoo/ceph that referenced this pull request Nov 13, 2015
…ocks.

Summary:
SizeBeingCompacted was called without any lock protection. This causes
crashes, especially when running db_bench with value_size=128K.
The fix is to compute SizeUnderCompaction while holding the mutex and
passing in these values into the call to Finalize.

(gdb) where
ceph#4  leveldb::VersionSet::SizeBeingCompacted (this=this@entry=0x7f0b490931c0, level=level@entry=4) at db/version_set.cc:1827
ceph#5  0x000000000043a3c8 in leveldb::VersionSet::Finalize (this=this@entry=0x7f0b490931c0, v=v@entry=0x7f0b3b86b480) at db/version_set.cc:1420
ceph#6  0x00000000004418d1 in leveldb::VersionSet::LogAndApply (this=0x7f0b490931c0, edit=0x7f0b3dc8c200, mu=0x7f0b490835b0, new_descriptor_log=<optimized out>) at db/version_set.cc:1016
ceph#7  0x00000000004222b2 in leveldb::DBImpl::InstallCompactionResults (this=this@entry=0x7f0b49083400, compact=compact@entry=0x7f0b2b8330f0) at db/db_impl.cc:1473
ceph#8  0x0000000000426027 in leveldb::DBImpl::DoCompactionWork (this=this@entry=0x7f0b49083400, compact=compact@entry=0x7f0b2b8330f0) at db/db_impl.cc:1757
ceph#9  0x0000000000426690 in leveldb::DBImpl::BackgroundCompaction (this=this@entry=0x7f0b49083400, madeProgress=madeProgress@entry=0x7f0b41bf2d1e, deletion_state=...) at db/db_impl.cc:1268
ceph#10 0x0000000000428f42 in leveldb::DBImpl::BackgroundCall (this=0x7f0b49083400) at db/db_impl.cc:1170
ceph#11 0x000000000045348e in BGThread (this=0x7f0b49023100) at util/env_posix.cc:941
ceph#12 leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper (arg=0x7f0b49023100) at util/env_posix.cc:874
ceph#13 0x00007f0b4a7cf10d in start_thread (arg=0x7f0b41bf3700) at pthread_create.c:301
ceph#14 0x00007f0b49b4b11d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Test Plan:
make check

I am running db_bench with a value size of 128K to see if the segfault is fixed.

Reviewers: MarkCallaghan, sheki, emayanke

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9279
chamdoo pushed a commit to chamdoo/ceph that referenced this pull request Nov 13, 2015
Summary:
Now this gives us the real deal stack trace:

    Assertion failed: (false), function GetProperty, file db/db_impl.cc, line 4072.
    Received signal 6 (Abort trap: 6)
    #0   0x7fff57ce39b9
    ceph#1   abort (in libsystem_c.dylib) + 125
    ceph#2   basename (in libsystem_c.dylib) + 0
    ceph#3   rocksdb::DBImpl::GetProperty(rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*) (in db_test) (db_impl.cc:4072)
    ceph#4   rocksdb::_Test_Empty::_Run() (in db_test) (testharness.h:68)
    ceph#5   rocksdb::_Test_Empty::_RunIt() (in db_test) (db_test.cc:1005)
    ceph#6   rocksdb::test::RunAllTests() (in db_test) (testharness.cc:60)
    ceph#7   main (in db_test) (db_test.cc:6697)
    ceph#8   start (in libdyld.dylib) + 1

Test Plan: added artificial assert, saw great stack trace

Reviewers: haobo, dhruba, ljin

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18309
liewegas referenced this pull request in liewegas/ceph Dec 8, 2015
…sdrestart to master

* commit 'eb162652b954909445432d0326dbe9e762e70a5a':
  Fix Segmentation Fault bug when restarting OSD daemon
mathslinux pushed a commit to mathslinux/ceph that referenced this pull request Jan 19, 2016
…-bad-argumentment

rgw dont abort for bad argumentment
liewegas added a commit that referenced this pull request Jan 27, 2016
UT for interval_set implementations
abhidixit pushed a commit to abhidixit/ceph that referenced this pull request Feb 23, 2016
…ogic

Console and IAM nonmock1 API integration
runsisi pushed a commit to runsisi/ceph that referenced this pull request Oct 24, 2016
…er instance

the caller needs to check the nullity of the parameter before calling
PK11_FreeSymKey or PK11_FreeSlot, otherwise if CryptoAESKeyHandler::init
failed, we will hit a segfault as follows:
  #0  0x00007f76844f5a95 in PK11_FreeSymKey () from /lib64/libnss3.so
  ceph#1  0x00007f76586b6e49 in CryptoAESKeyHandler::~CryptoAESKeyHandler() () from /lib64/librados.so.2
  ceph#2  0x00007f76586b5eea in CryptoAES::get_key_handler(ceph::buffer::ptr const&, std::string&) () from /lib64/librados.so.2
  ceph#3  0x00007f76586b4b9c in CryptoKey::_set_secret(int, ceph::buffer::ptr const&) () from /lib64/librados.so.2
  ceph#4  0x00007f76586b4e95 in CryptoKey::decode(ceph::buffer::list::iterator&) () from /lib64/librados.so.2
  ceph#5  0x00007f76586b7ee6 in KeyRing::set_modifier(char const*, char const*, EntityName&, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >&) () from /lib64/librados.so.2
  ceph#6  0x00007f76586b8882 in KeyRing::decode_plaintext(ceph::buffer::list::iterator&) () from /lib64/librados.so.2
  ceph#7  0x00007f76586b9803 in KeyRing::decode(ceph::buffer::list::iterator&) () from /lib64/librados.so.2
  ceph#8  0x00007f76586b9a1f in KeyRing::load(CephContext*, std::string const&) () from /lib64/librados.so.2
  ceph#9  0x00007f76586ba04b in KeyRing::from_ceph_context(CephContext*) () from /lib64/librados.so.2
  ceph#10 0x00007f765852d0cd in MonClient::init() () from /lib64/librados.so.2
  ceph#11 0x00007f76583c15f5 in librados::RadosClient::connect() () from /lib64/librados.so.2
  ceph#12 0x00007f765838cb1c in rados_connect () from /lib64/librados.so.2
  ...

Signed-off-by: runsisi <runsisi@zte.com.cn>
runsisi pushed a commit to runsisi/ceph that referenced this pull request Oct 26, 2016
we have to shutdown the hunting timer and reset cur_con at the same place,
or the hunting timer may set cur_con before it get shutdown, which results
segfault as follows:

  #0  0x00007fb09ffca989 in raise () from /lib64/libc.so.6
  ceph#1  0x00007fb09ffcc098 in abort () from /lib64/libc.so.6
  ceph#2  0x00007fb08ea52677 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () from /lib64/librados.so.2
  ceph#3  0x00007fb08e93144c in ceph::log::SubsystemMap::should_gather(unsigned int, int) [clone .part.120] () from /lib64/librados.so.2
  ceph#4  0x00007fb08e97eb15 in RefCountedObject::put() () from /lib64/librados.so.2
  ceph#5  0x00007fb08eae3f9e in MonClient::~MonClient() () from /lib64/librados.so.2
  ceph#6  0x00007fb08e97c2d5 in librados::RadosClient::~RadosClient() () from /lib64/librados.so.2
  ceph#7  0x00007fb08e97c319 in librados::RadosClient::~RadosClient() () from /lib64/librados.so.2
  ceph#8  0x00007fb08e94684a in rados_shutdown () from /lib64/librados.so.2
  ceph#9  0x00007fb098074210 in __pyx_pw_5rados_5Rados_7shutdown () from /usr/lib64/python2.7/site-packages/rados.so
  ...

Signed-off-by: runsisi <runsisi@zte.com.cn>
tchaikov pushed a commit that referenced this pull request Oct 28, 2016
…er instance

the caller needs to check the nullity of the parameter before calling
PK11_FreeSymKey or PK11_FreeSlot, otherwise if CryptoAESKeyHandler::init
failed, we will hit a segfault as follows:
  #0  0x00007f76844f5a95 in PK11_FreeSymKey () from /lib64/libnss3.so
  #1  0x00007f76586b6e49 in CryptoAESKeyHandler::~CryptoAESKeyHandler() () from /lib64/librados.so.2
  #2  0x00007f76586b5eea in CryptoAES::get_key_handler(ceph::buffer::ptr const&, std::string&) () from /lib64/librados.so.2
  #3  0x00007f76586b4b9c in CryptoKey::_set_secret(int, ceph::buffer::ptr const&) () from /lib64/librados.so.2
  #4  0x00007f76586b4e95 in CryptoKey::decode(ceph::buffer::list::iterator&) () from /lib64/librados.so.2
  #5  0x00007f76586b7ee6 in KeyRing::set_modifier(char const*, char const*, EntityName&, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >&) () from /lib64/librados.so.2
  #6  0x00007f76586b8882 in KeyRing::decode_plaintext(ceph::buffer::list::iterator&) () from /lib64/librados.so.2
  #7  0x00007f76586b9803 in KeyRing::decode(ceph::buffer::list::iterator&) () from /lib64/librados.so.2
  #8  0x00007f76586b9a1f in KeyRing::load(CephContext*, std::string const&) () from /lib64/librados.so.2
  #9  0x00007f76586ba04b in KeyRing::from_ceph_context(CephContext*) () from /lib64/librados.so.2
  #10 0x00007f765852d0cd in MonClient::init() () from /lib64/librados.so.2
  #11 0x00007f76583c15f5 in librados::RadosClient::connect() () from /lib64/librados.so.2
  #12 0x00007f765838cb1c in rados_connect () from /lib64/librados.so.2
  ...

Signed-off-by: runsisi <runsisi@zte.com.cn>
liewegas pushed a commit that referenced this pull request Dec 14, 2016
fs: Fix bad "mds set inline_data" syntax
chardan pushed a commit to chardan/ceph that referenced this pull request Jun 6, 2017
idryomov added a commit to idryomov/ceph that referenced this pull request Jul 27, 2017
I'm seeing sporadic single thread deedlocks on fio stat_mutex during krbd
thrash runs:

  (gdb) info threads
    Id   Target Id         Frame
  * 1    Thread 0x7f89ee730740 (LWP 15604) 0x00007f89ed9f41bd in __lll_lock_wait () from /lib64/libpthread.so.0
  (gdb) bt
  #0  0x00007f89ed9f41bd in __lll_lock_wait () from /lib64/libpthread.so.0
  ceph#1  0x00007f89ed9f17b2 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  ceph#2  0x00000000004429b9 in fio_mutex_down (mutex=0x7f89ee72d000) at mutex.c:170
  ceph#3  0x0000000000459704 in thread_main (data=<optimized out>) at backend.c:1639
  ceph#4  0x000000000045b013 in fork_main (offset=0, shmid=<optimized out>, sk_out=0x0) at backend.c:1778
  ceph#5  run_threads (sk_out=sk_out@entry=0x0) at backend.c:2195
  ceph#6  0x000000000045b47f in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400
  ceph#7  0x000000000040cb0c in main (argc=2, argv=0x7fffad3e3888, envp=<optimized out>) at fio.c:63
  (gdb) up 2
  170                     pthread_cond_wait(&mutex->cond, &mutex->lock);
  (gdb) p mutex.lock.__data.__owner
  $1 = 15604

Upgrading to 2.21 seems to make these go away.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
idryomov added a commit to idryomov/ceph that referenced this pull request Jul 27, 2017
I'm seeing sporadic single thread deadlocks on fio stat_mutex during krbd
thrash runs:

  (gdb) info threads
    Id   Target Id         Frame
  * 1    Thread 0x7f89ee730740 (LWP 15604) 0x00007f89ed9f41bd in __lll_lock_wait () from /lib64/libpthread.so.0
  (gdb) bt
  #0  0x00007f89ed9f41bd in __lll_lock_wait () from /lib64/libpthread.so.0
  ceph#1  0x00007f89ed9f17b2 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  ceph#2  0x00000000004429b9 in fio_mutex_down (mutex=0x7f89ee72d000) at mutex.c:170
  ceph#3  0x0000000000459704 in thread_main (data=<optimized out>) at backend.c:1639
  ceph#4  0x000000000045b013 in fork_main (offset=0, shmid=<optimized out>, sk_out=0x0) at backend.c:1778
  ceph#5  run_threads (sk_out=sk_out@entry=0x0) at backend.c:2195
  ceph#6  0x000000000045b47f in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400
  ceph#7  0x000000000040cb0c in main (argc=2, argv=0x7fffad3e3888, envp=<optimized out>) at fio.c:63
  (gdb) up 2
  170                     pthread_cond_wait(&mutex->cond, &mutex->lock);
  (gdb) p mutex.lock.__data.__owner
  $1 = 15604

Upgrading to 2.21 seems to make these go away.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
liewegas added a commit that referenced this pull request May 5, 2021
Otherwise, if we assert, we'll hang here:

Thread 1 (Thread 0x7f74eba79580 (LWP 1688617)):
#0  0x00007f74eb2aa529 in futex_wait (private=<optimized out>, expected=132, futex_word=0x7ffd642b4b54) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=132, futex_word=0x7ffd642b4b54) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_cond_destroy (cond=0x7ffd642b4b30) at pthread_cond_destroy.c:54

#3  0x0000563ff2e5a891 in LibRadosService_StatusFormat_Test::TestBody (this=<optimized out>) at /usr/include/c++/7/bits/unique_ptr.h:78
#4  0x0000563ff2e9dc3a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x563ff2ea72e4 "the test body", method=<optimized out>, object=0x563ff422a6d0)
    at ./src/googletest/googletest/src/gtest.cc:2605
#5  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x563ff422a6d0, method=<optimized out>, location=location@entry=0x563ff2ea72e4 "the test body")
    at ./src/googletest/googletest/src/gtest.cc:2641
#6  0x0000563ff2e908c3 in testing::Test::Run (this=0x563ff422a6d0) at ./src/googletest/googletest/src/gtest.cc:2680
#7  0x0000563ff2e90a25 in testing::TestInfo::Run (this=0x563ff41a3b70) at ./src/googletest/googletest/src/gtest.cc:2858
#8  0x0000563ff2e90ec1 in testing::TestSuite::Run (this=0x563ff41b6230) at ./src/googletest/googletest/src/gtest.cc:3012
#9  0x0000563ff2e92bdc in testing::internal::UnitTestImpl::RunAllTests (this=<optimized out>) at ./src/googletest/googletest/src/gtest.cc:5723
#10 0x0000563ff2e9e14a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0x563ff2ea8728 "auxiliary test code (environments or event listeners)",
    method=<optimized out>, object=0x563ff41a2d10) at ./src/googletest/googletest/src/gtest.cc:2605
#11 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x563ff41a2d10, method=<optimized out>,
    location=location@entry=0x563ff2ea8728 "auxiliary test code (environments or event listeners)") at ./src/googletest/googletest/src/gtest.cc:2641
#12 0x0000563ff2e90ae8 in testing::UnitTest::Run (this=0x563ff30c0660 <testing::UnitTest::GetInstance()::instance>) at ./src/googletest/googletest/src/gtest.cc:5306

Signed-off-by: Sage Weil <sage@newdream.net>
liewegas added a commit that referenced this pull request May 12, 2021
Otherwise, if we assert, we'll hang here:

Thread 1 (Thread 0x7f74eba79580 (LWP 1688617)):
#0  0x00007f74eb2aa529 in futex_wait (private=<optimized out>, expected=132, futex_word=0x7ffd642b4b54) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=132, futex_word=0x7ffd642b4b54) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_cond_destroy (cond=0x7ffd642b4b30) at pthread_cond_destroy.c:54

#3  0x0000563ff2e5a891 in LibRadosService_StatusFormat_Test::TestBody (this=<optimized out>) at /usr/include/c++/7/bits/unique_ptr.h:78
#4  0x0000563ff2e9dc3a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x563ff2ea72e4 "the test body", method=<optimized out>, object=0x563ff422a6d0)
    at ./src/googletest/googletest/src/gtest.cc:2605
#5  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x563ff422a6d0, method=<optimized out>, location=location@entry=0x563ff2ea72e4 "the test body")
    at ./src/googletest/googletest/src/gtest.cc:2641
#6  0x0000563ff2e908c3 in testing::Test::Run (this=0x563ff422a6d0) at ./src/googletest/googletest/src/gtest.cc:2680
#7  0x0000563ff2e90a25 in testing::TestInfo::Run (this=0x563ff41a3b70) at ./src/googletest/googletest/src/gtest.cc:2858
#8  0x0000563ff2e90ec1 in testing::TestSuite::Run (this=0x563ff41b6230) at ./src/googletest/googletest/src/gtest.cc:3012
#9  0x0000563ff2e92bdc in testing::internal::UnitTestImpl::RunAllTests (this=<optimized out>) at ./src/googletest/googletest/src/gtest.cc:5723
#10 0x0000563ff2e9e14a in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0x563ff2ea8728 "auxiliary test code (environments or event listeners)",
    method=<optimized out>, object=0x563ff41a2d10) at ./src/googletest/googletest/src/gtest.cc:2605
#11 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x563ff41a2d10, method=<optimized out>,
    location=location@entry=0x563ff2ea8728 "auxiliary test code (environments or event listeners)") at ./src/googletest/googletest/src/gtest.cc:2641
#12 0x0000563ff2e90ae8 in testing::UnitTest::Run (this=0x563ff30c0660 <testing::UnitTest::GetInstance()::instance>) at ./src/googletest/googletest/src/gtest.cc:5306

Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit ee5a0c9)
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request May 25, 2021
f7181ab has optimized the client
parallelism. To achieve that `PG::do_osd_ops()` were converted to
return basically future pair of futures. Unfortunately, the life-
time management of `OpsExecuter` was kept intact. In the result,
the object was valid only till fullfying the outer future while,
due to the `rollbacker` instances, it should be available till
`all_completed` becomes available.

This issue can explain the following problem has been observed
in a Teuthology job [1].

```
DEBUG 2021-05-20 08:03:22,617 [shard 0] osd - do_op_call: method returned ret=-17, outdata.length()=0 while num_read=1, num_write=0
DEBUG 2021-05-20 08:03:22,617 [shard 0] osd - rollback_obc_if_modified: object 19:e17d4708:test-rados-api-smithi095-38404-2::foo:head got erro
r generic:17, need_rollback=false
=================================================================
==33626==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d0000b9320 at pc 0x560f486b8222 bp 0x7fffc467a1e0 sp 0x7fffc467a1d0
READ of size 4 at 0x60d0000b9320 thread T0
    #0 0x560f486b8221  (/usr/bin/ceph-osd+0x2c610221)
    #1 0x560f4880c6b1 in seastar::continuation<seastar::internal::promise_base_with_type<boost::intrusive_ptr<MOSDOpReply> >, seastar::noncopy
able_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_
wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > >
 > ()>, seastar::future<void>::then_impl_nrvo<seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::
IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson:
:errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>, crimson::interruptible::interruptible_future_detail<crimson::osd::IOInte
rruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::error
ated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > >(seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail
<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_
future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>&&)::{lambda(seastar::internal::promise_base_with_type<boos
t::intrusive_ptr<MOSDOpReply> >&&, seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() (/usr/bin/ceph-osd+0x2c7646b1)
    #2 0x560f5352c3ae  (/usr/bin/ceph-osd+0x374843ae)
    ceph#3 0x560f535318ef  (/usr/bin/ceph-osd+0x374898ef)
    ceph#4 0x560f536e395a  (/usr/bin/ceph-osd+0x3763b95a)
    ceph#5 0x560f532413d9  (/usr/bin/ceph-osd+0x371993d9)
    ceph#6 0x560f476af95a in main (/usr/bin/ceph-osd+0x2b60795a)
    ceph#7 0x7f7aa0af97b2 in __libc_start_main (/lib64/libc.so.6+0x237b2)
    ceph#8 0x560f477d2e8d in _start (/usr/bin/ceph-osd+0x2b72ae8d)

```

[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-20_07:28:16-rados-master-distro-basic-smithi/6124735/

The commit deals with the problem by repacking the outer future.
An alternative could be in switching from `std::unique_ptr` to
`seastar::shared_ptr` for managing `OpsExecuter`.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request May 25, 2021
f7181ab has optimized the client
parallelism. To achieve that `PG::do_osd_ops()` were converted to
return basically future pair of futures. Unfortunately, the life-
time management of `OpsExecuter` was kept intact. In the result,
the object was valid only till fullfying the outer future while,
due to the `rollbacker` instances, it should be available till
`all_completed` becomes available.

This issue can explain the following problem has been observed
in a Teuthology job [1].

```
DEBUG 2021-05-20 08:03:22,617 [shard 0] osd - do_op_call: method returned ret=-17, outdata.length()=0 while num_read=1, num_write=0
DEBUG 2021-05-20 08:03:22,617 [shard 0] osd - rollback_obc_if_modified: object 19:e17d4708:test-rados-api-smithi095-38404-2::foo:head got erro
r generic:17, need_rollback=false
=================================================================
==33626==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d0000b9320 at pc 0x560f486b8222 bp 0x7fffc467a1e0 sp 0x7fffc467a1d0
READ of size 4 at 0x60d0000b9320 thread T0
    #0 0x560f486b8221  (/usr/bin/ceph-osd+0x2c610221)
    #1 0x560f4880c6b1 in seastar::continuation<seastar::internal::promise_base_with_type<boost::intrusive_ptr<MOSDOpReply> >, seastar::noncopy
able_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_
wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > >
 > ()>, seastar::future<void>::then_impl_nrvo<seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::
IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson:
:errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>, crimson::interruptible::interruptible_future_detail<crimson::osd::IOInte
rruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::error
ated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > >(seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail
<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_
future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>&&)::{lambda(seastar::internal::promise_base_with_type<boos
t::intrusive_ptr<MOSDOpReply> >&&, seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() (/usr/bin/ceph-osd+0x2c7646b1)
    #2 0x560f5352c3ae  (/usr/bin/ceph-osd+0x374843ae)
    ceph#3 0x560f535318ef  (/usr/bin/ceph-osd+0x374898ef)
    ceph#4 0x560f536e395a  (/usr/bin/ceph-osd+0x3763b95a)
    ceph#5 0x560f532413d9  (/usr/bin/ceph-osd+0x371993d9)
    ceph#6 0x560f476af95a in main (/usr/bin/ceph-osd+0x2b60795a)
    ceph#7 0x7f7aa0af97b2 in __libc_start_main (/lib64/libc.so.6+0x237b2)
    ceph#8 0x560f477d2e8d in _start (/usr/bin/ceph-osd+0x2b72ae8d)

```

[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-20_07:28:16-rados-master-distro-basic-smithi/6124735/

The commit deals with the problem by repacking the outer future.
An alternative could be in switching from `std::unique_ptr` to
`seastar::shared_ptr` for managing `OpsExecuter`.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request May 31, 2021
The `FuturizedStore` interface imposes the `get_attr()`
takes the `name` parameter as `std::string_view`, and
thus burdens implementations with extending the life-
time of the data the instance refers to.

Unfortunately, `AlienStore` is unaware that prolonging
the life of a `std::string_view` instance doesn't prolong
the data memory it points to. This problem has manifested
in the following use-after-free detected at Sepia:

```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136929$ less ./remote/smithi194/log/ceph-osd.7.log.gz
...
DEBUG 2021-05-26 20:24:54,077 [shard 0] osd - do_osd_ops_execute: object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head - handling op
call
DEBUG 2021-05-26 20:24:54,077 [shard 0] osd - handling op call on object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - calling method lock.lock, num_read=0, num_write=0
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - handling op getxattr on object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - getxattr on obj=14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head for attr=_lock.TestLockPP1
DEBUG 2021-05-26 20:24:54,078 [shard 0] bluestore - get_attr
=================================================================
==34068==ERROR: AddressSanitizer: heap-use-after-free on address 0x6030001851d0 at pc 0x7f824d6a5b27 bp 0x7f822b4201c0 sp 0x7f822b41f968
READ of size 17 at 0x6030001851d0 thread T28 (alien-store-tp)
...
    #0 0x7f824d6a5b26  (/lib64/libasan.so.5+0x40b26)
    #1 0x55e2cbb2e00b  (/usr/bin/ceph-osd+0x2b6dc00b)
    #2 0x55e2d31f086e  (/usr/bin/ceph-osd+0x32d9e86e)
    ceph#3 0x55e2d3467607 in crimson::os::ThreadPool::loop(std::chrono::duration<long, std::ratio<1l, 1000l> >, unsigned long) (/usr/bin/ceph-osd+0x33015607)
    ceph#4 0x55e2d346b14a  (/usr/bin/ceph-osd+0x3301914a)
    ceph#5 0x7f8249d32ba2  (/lib64/libstdc++.so.6+0xc2ba2)
    ceph#6 0x7f824a00d149 in start_thread (/lib64/libpthread.so.0+0x8149)
    ceph#7 0x7f82486edf22 in clone (/lib64/libc.so.6+0xfcf22)

0x6030001851d0 is located 0 bytes inside of 31-byte region [0x6030001851d0,0x6030001851ef)
freed by thread T0 here:
    #0 0x7f824d757688 in operator delete(void*) (/lib64/libasan.so.5+0xf2688)

previously allocated by thread T0 here:
    #0 0x7f824d7567b0 in operator new(unsigned long) (/lib64/libasan.so.5+0xf17b0)

Thread T28 (alien-store-tp) created by T0 here:
    #0 0x7f824d6b7ea3 in __interceptor_pthread_create (/lib64/libasan.so.5+0x52ea3)

SUMMARY: AddressSanitizer: heap-use-after-free (/lib64/libasan.so.5+0x40b26)
Shadow bytes around the buggy address:
  0x0c06800289e0: fd fd fd fa fa fa fd fd fd fa fa fa 00 00 00 fa
  0x0c06800289f0: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd
  0x0c0680028a00: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa
  0x0c0680028a10: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa
  0x0c0680028a20: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd
=>0x0c0680028a30: fd fd fa fa fd fd fd fd fa fa[fd]fd fd fd fa fa
  0x0c0680028a40: fd fd fd fd fa fa fd fd fd fd fa fa 00 00 00 07
  0x0c0680028a50: fa fa 00 00 00 fa fa fa 00 00 00 fa fa fa fd fd
  0x0c0680028a60: fd fd fa fa fd fd fd fd fa fa fd fd fd fd fa fa
  0x0c0680028a70: 00 00 00 00 fa fa fd fd fd fd fa fa fd fd fd fd
  0x0c0680028a80: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==34068==ABORTING
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request May 31, 2021
The `FuturizedStore` interface imposes the `get_attr()`
takes the `name` parameter as `std::string_view`, and
thus burdens implementations with extending the life-
time of the data the instance refers to.

Unfortunately, `AlienStore` is unaware that prolonging
the life of a `std::string_view` instance doesn't prolong
the data memory it points to. This problem has manifested
in the following use-after-free detected at Sepia:

```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136929$ less ./remote/smithi194/log/ceph-osd.7.log.gz
...
DEBUG 2021-05-26 20:24:54,077 [shard 0] osd - do_osd_ops_execute: object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head - handling op
call
DEBUG 2021-05-26 20:24:54,077 [shard 0] osd - handling op call on object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - calling method lock.lock, num_read=0, num_write=0
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - handling op getxattr on object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - getxattr on obj=14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head for attr=_lock.TestLockPP1
DEBUG 2021-05-26 20:24:54,078 [shard 0] bluestore - get_attr
=================================================================
==34068==ERROR: AddressSanitizer: heap-use-after-free on address 0x6030001851d0 at pc 0x7f824d6a5b27 bp 0x7f822b4201c0 sp 0x7f822b41f968
READ of size 17 at 0x6030001851d0 thread T28 (alien-store-tp)
...
    #0 0x7f824d6a5b26  (/lib64/libasan.so.5+0x40b26)
    #1 0x55e2cbb2e00b  (/usr/bin/ceph-osd+0x2b6dc00b)
    #2 0x55e2d31f086e  (/usr/bin/ceph-osd+0x32d9e86e)
    ceph#3 0x55e2d3467607 in crimson::os::ThreadPool::loop(std::chrono::duration<long, std::ratio<1l, 1000l> >, unsigned long) (/usr/bin/ceph-osd+0x33015607)
    ceph#4 0x55e2d346b14a  (/usr/bin/ceph-osd+0x3301914a)
    ceph#5 0x7f8249d32ba2  (/lib64/libstdc++.so.6+0xc2ba2)
    ceph#6 0x7f824a00d149 in start_thread (/lib64/libpthread.so.0+0x8149)
    ceph#7 0x7f82486edf22 in clone (/lib64/libc.so.6+0xfcf22)

0x6030001851d0 is located 0 bytes inside of 31-byte region [0x6030001851d0,0x6030001851ef)
freed by thread T0 here:
    #0 0x7f824d757688 in operator delete(void*) (/lib64/libasan.so.5+0xf2688)

previously allocated by thread T0 here:
    #0 0x7f824d7567b0 in operator new(unsigned long) (/lib64/libasan.so.5+0xf17b0)

Thread T28 (alien-store-tp) created by T0 here:
    #0 0x7f824d6b7ea3 in __interceptor_pthread_create (/lib64/libasan.so.5+0x52ea3)

SUMMARY: AddressSanitizer: heap-use-after-free (/lib64/libasan.so.5+0x40b26)
Shadow bytes around the buggy address:
  0x0c06800289e0: fd fd fd fa fa fa fd fd fd fa fa fa 00 00 00 fa
  0x0c06800289f0: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd
  0x0c0680028a00: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa
  0x0c0680028a10: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa
  0x0c0680028a20: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd
=>0x0c0680028a30: fd fd fa fa fd fd fd fd fa fa[fd]fd fd fd fa fa
  0x0c0680028a40: fd fd fd fd fa fa fd fd fd fd fa fa 00 00 00 07
  0x0c0680028a50: fa fa 00 00 00 fa fa fa 00 00 00 fa fa fa fd fd
  0x0c0680028a60: fd fd fa fa fd fd fd fd fa fa fd fd fd fd fa fa
  0x0c0680028a70: 00 00 00 00 fa fa fd fd fd fd fa fa fd fd fd fd
  0x0c0680028a80: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==34068==ABORTING
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Jun 1, 2021
The `FuturizedStore` interface imposes the `get_attr()`
takes the `name` parameter as `std::string_view`, and
thus burdens implementations with extending the life-
time of the data the instance refers to.

Unfortunately, `AlienStore` is unaware that prolonging
the life of a `std::string_view` instance doesn't prolong
the data memory it points to. This problem has manifested
in the following use-after-free detected at Sepia:

```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136929$ less ./remote/smithi194/log/ceph-osd.7.log.gz
...
DEBUG 2021-05-26 20:24:54,077 [shard 0] osd - do_osd_ops_execute: object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head - handling op
call
DEBUG 2021-05-26 20:24:54,077 [shard 0] osd - handling op call on object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - calling method lock.lock, num_read=0, num_write=0
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - handling op getxattr on object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - getxattr on obj=14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head for attr=_lock.TestLockPP1
DEBUG 2021-05-26 20:24:54,078 [shard 0] bluestore - get_attr
=================================================================
==34068==ERROR: AddressSanitizer: heap-use-after-free on address 0x6030001851d0 at pc 0x7f824d6a5b27 bp 0x7f822b4201c0 sp 0x7f822b41f968
READ of size 17 at 0x6030001851d0 thread T28 (alien-store-tp)
...
    #0 0x7f824d6a5b26  (/lib64/libasan.so.5+0x40b26)
    #1 0x55e2cbb2e00b  (/usr/bin/ceph-osd+0x2b6dc00b)
    #2 0x55e2d31f086e  (/usr/bin/ceph-osd+0x32d9e86e)
    ceph#3 0x55e2d3467607 in crimson::os::ThreadPool::loop(std::chrono::duration<long, std::ratio<1l, 1000l> >, unsigned long) (/usr/bin/ceph-osd+0x33015607)
    ceph#4 0x55e2d346b14a  (/usr/bin/ceph-osd+0x3301914a)
    ceph#5 0x7f8249d32ba2  (/lib64/libstdc++.so.6+0xc2ba2)
    ceph#6 0x7f824a00d149 in start_thread (/lib64/libpthread.so.0+0x8149)
    ceph#7 0x7f82486edf22 in clone (/lib64/libc.so.6+0xfcf22)

0x6030001851d0 is located 0 bytes inside of 31-byte region [0x6030001851d0,0x6030001851ef)
freed by thread T0 here:
    #0 0x7f824d757688 in operator delete(void*) (/lib64/libasan.so.5+0xf2688)

previously allocated by thread T0 here:
    #0 0x7f824d7567b0 in operator new(unsigned long) (/lib64/libasan.so.5+0xf17b0)

Thread T28 (alien-store-tp) created by T0 here:
    #0 0x7f824d6b7ea3 in __interceptor_pthread_create (/lib64/libasan.so.5+0x52ea3)

SUMMARY: AddressSanitizer: heap-use-after-free (/lib64/libasan.so.5+0x40b26)
Shadow bytes around the buggy address:
  0x0c06800289e0: fd fd fd fa fa fa fd fd fd fa fa fa 00 00 00 fa
  0x0c06800289f0: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd
  0x0c0680028a00: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa
  0x0c0680028a10: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa
  0x0c0680028a20: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd
=>0x0c0680028a30: fd fd fa fa fd fd fd fd fa fa[fd]fd fd fd fa fa
  0x0c0680028a40: fd fd fd fd fa fa fd fd fd fd fa fa 00 00 00 07
  0x0c0680028a50: fa fa 00 00 00 fa fa fa 00 00 00 fa fa fa fd fd
  0x0c0680028a60: fd fd fa fa fd fd fd fd fa fa fd fd fd fd fa fa
  0x0c0680028a70: 00 00 00 00 fa fa fd fd fd fd fa fa fd fd fd fd
  0x0c0680028a80: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==34068==ABORTING
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Jun 2, 2021
There is a scenario when the `active_con` is properly
chosen but isn't marked as `ready_to_send`.
If `renew_subs()` is called during the `on_session_opened()`,
the flag will be turned on after the subscriptions are
renewed which cannot happen as it requires the flag to be
alread set. In other words: there is a circular data depedency.

The net result is stalling the subscription machinery,
particularly the OSDMap subs. This caused a nasty peering
issue at Sepia [1] where PG 2.7 got stuck in the `GetInfo`
state.

rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136908$ less ./remote/smithi039/log/ceph-osd.1.log.gz
...
DEBUG 2021-05-26 20:19:48,134 [shard 0] osd -  pg_epoch 14 pg[2.7( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0) [] r=
-1 lpr=0 crt=0'0 mlcod 0'0 unknown enter Initial
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0]
r=0 lpr=0 crt=0'0 mlcod 0'0 unknown enter Reset
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 unknown enter Started
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 unknown enter Start
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 unknown enter Started/Primary
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating enter Started/Primary/Peering
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering enter Started/Primary/Peering/GetInfo
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering build_prior all_probe
DEBUG 2021-05-26 20:19:48,139 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering build_prior final: probe 0,1 down  blocked_by {}
DEBUG 2021-05-26 20:19:48,139 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering up_thru 0 < same_since 14, must notify monitor
DEBUG 2021-05-26 20:19:48,139 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>:  no prior_set down osds, clearing prior_readable_until_ub
DEBUG 2021-05-26 20:19:48,139 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>:  querying info from osd.0
...
DEBUG 2021-05-26 20:19:48,237 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering  got osd.0 2.7( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0)
DEBUG 2021-05-26 20:19:48,237 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>: Adding osd: 0 peer features: 3f01cfbb7ffdffff
DEBUG 2021-05-26 20:19:48,237 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>: Common peer features: 3f01cfbb7ffdffff
DEBUG 2021-05-26 20:19:48,237 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>: Common acting features: 3f01cfbb7ffdffff
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>: Common upacting features: 3f01cfbb7ffdffff
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering exit Started/Primary/Peering/GetInfo 0.099480 4 2021-05-26T20:19:48.146172+0000
...
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering enter Started/Primary/Peering/GetLog
...
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering enter Started/Primary/Peering/GetMissing
...
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering enter Started/Primary/Peering/WaitUpThru
...
DEBUG 2021-05-26 20:19:49,139 [shard 0] osd -  pg_epoch 15 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating enter Started/Primary/Active
...
DEBUG 2021-05-26 20:19:49,142 [shard 0] osd -  pg_epoch 15 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+activating enter Started/Primary/Active/Activating
...
DEBUG 2021-05-26 20:19:49,204 [shard 0] osd -  pg_epoch 15 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/0 les/c/f=15/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 active enter Started/Primary/Active/Recovered
...
DEBUG 2021-05-26 20:19:49,204 [shard 0] osd -  pg_epoch 15 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/0 les/c/f=15/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 active enter Started/Primary/Active/Clean
...
DEBUG 2021-05-26 20:22:31,223 [shard 0] osd -  pg_epoch 86 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 active enter Reset
...
<a lot of flipping>
...
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown activate_map
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown exit Reset 0.035744 1 2021-05-26T20:24:07.817331+0000
INFO  2021-05-26 20:24:07,851 [shard 0] osd - Exiting state: Reset, entered at 1622060647.8158188, 1622060647.8173316 spent on 1 events
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown enter Started
INFO  2021-05-26 20:24:07,851 [shard 0] osd - Entering state: Started
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown enter Start
INFO  2021-05-26 20:24:07,851 [shard 0] osd - Entering state: Start
INFO  2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown state<Start>: transitioning to Primary
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown exit Start 0.000041 0 0.000000
INFO  2021-05-26 20:24:07,851 [shard 0] osd - Exiting state: Start, entered at 1622060647.8516333, 0.0 spent on 0 events
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown enter Started/Primary
INFO  2021-05-26 20:24:07,852 [shard 0] osd - Entering state: Started/Primary
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown enter Started/Primary/Peering
INFO  2021-05-26 20:24:07,852 [shard 0] osd - Entering state: Started/Primary/Peering
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering enter Started/Primary/Peering/GetInfo
INFO  2021-05-26 20:24:07,852 [shard 0] osd - Entering state: Started/Primary/Peering/GetInfo
...
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering build_prior all_probe 0,1,4
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering build_prior maybe_rw interval:139, acting: 0
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering build_prior final: probe 0,1,4 down  blocked_by {}
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering up_thru 125 < same_since 163, must notify monitor
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering state<Started/Primary/Peering/GetInfo>:  no prior_set down osds, clearing prior_readable_until_ub
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering state<Started/Primary/Peering/GetInfo>:  querying info from osd.0
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering state<Started/Primary/Peering/GetInfo>:  querying info from osd.4
...
DEBUG 2021-05-26 20:24:07,924 [shard 0] ms - [osd.1(cluster) v2:172.21.15.39:6803/34727@61064 >> osd.4 v2:172.21.15.62:6802/34686] connect to existing
DEBUG 2021-05-26 20:24:07,924 [shard 0] ms - [osd.1(cluster) v2:172.21.15.39:6803/34727@61064 >> osd.4 v2:172.21.15.62:6802/34686] --> ceph#62 === pg_query2(2.7 2.7 query(info 0'0 epoch_sent 163) e163/163) v1 (131)
...
DEBUG 2021-05-26 20:24:07,942 [shard 0] ms - [osd.1(cluster) v2:172.21.15.39:6803/34727@61064 >> osd.4 v2:172.21.15.62:6802/34686] GOT AckFrame: seq=62
...
<plenty of osd_ping messanging but no reply to the pg_query for 2.7>
...
DEBUG 2021-05-26 20:58:19,829 [shard 0] ms - [osd.1(hb_front) v2:172.21.15.39:6807/34727 >> osd.4 v2:172.21.15.62:6807/34686@54816] <== ceph#772 =
== osd_ping(ping e17 up_from 10 ping_stamp 2021-05-26T20:58:19.825573+0000/2319.780029297s send_stamp 2319.780029297s) v5 (70)
DEBUG 2021-05-26 20:58:19,829 [shard 0] ms - [osd.1(hb_front) v2:172.21.15.39:6807/34727 >> osd.4 v2:172.21.15.62:6807/34686@54816] --> ceph#772 === osd_ping(ping_reply e249 up_from 10 ping_stamp 2021-05-26T20:58:19.825573+0000/2319.780029297s send_stamp 2320.039062500s) v5 (70
```

The peering request got stuck due to awaiting for `OSDMap`.

```
DEBUG 2021-05-26 20:24:07,930 [shard 0] ms - [osd.4(cluster) v2:172.21.15.62:6802/34686 >> osd.1 v2:172.21.15.39:6803/34727@61064] <== ceph#62 === pg_query2(2.7 2.7 query(info 0'0 epoch_sent 163) e163/163) v1 (131)
DEBUG 2021-05-26 20:24:07,930 [shard 0] osd - handle_peering_op on 2.7 from 1
DEBUG 2021-05-26 20:24:07,930 [shard 0] osd - peering_event(id=517, detail=PeeringEvent(from=1 pgid=2.7 sent=163 requested=163 evt=epoch_sent: 163 epoch_requested: 163 MQuery 2.7 from 1 query_epoch 163 query: query(info 0'0 epoch_sent 163))): star
```

```
INFO  2021-05-26 20:19:49,127 [shard 0] osd - evt epoch is 15, i have 14, will wait
INFO  2021-05-26 20:19:49,128 [shard 0] osd - osdmap_subscribe(14)
DEBUG 2021-05-26 20:19:49,128 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@63208 >> mon.1 v2:172.21.15.62:3300/0] --> ceph#9 === mon_s
ubscribe({osdmap=14}) v3 (15)
...
INFO  2021-05-26 20:19:49,131 [shard 0] osd - handle_osd_map osd_map(14..15 src has 1..15) v4
INFO  2021-05-26 20:19:49,131 [shard 0] osd - handle_osd_map epochs [14..15], i have 15, src has [1..15]
...
INFO  2021-05-26 20:19:49,138 [shard 0] osd - handle_osd_map osd_map(14..15 src has 1..15) v4
INFO  2021-05-26 20:19:49,138 [shard 0] osd - handle_osd_map epochs [14..15], i have 15, src has [1..15]
...
INFO  2021-05-26 20:19:49,139 [shard 0] osd - evt epoch is 15, i have 14, will wait
INFO  2021-05-26 20:19:49,141 [shard 0] osd - osdmap_subscribe(14)
WARN  2021-05-26 20:19:49,141 [shard 0] monc - renew_subs - empty
...
INFO  2021-05-26 20:19:50,140 [shard 0] osd - handle_osd_map osd_map(15..16 src has 1..16) v4
INFO  2021-05-26 20:19:50,140 [shard 0] osd - handle_osd_map epochs [15..16], i have 15, src has [1..16]
DEBUG 2021-05-26 20:19:50,141 [shard 0] bluestore - do_transaction
INFO  2021-05-26 20:19:50,145 [shard 0] osd - osd.4: committed_osd_maps(16, 16)
...
INFO  2021-05-26 20:20:42,881 [shard 0] osd - handle_osd_map epochs [16..17], i have 16, src has [1..17]
DEBUG 2021-05-26 20:20:42,882 [shard 0] bluestore - do_transaction
INFO  2021-05-26 20:20:42,886 [shard 0] osd - osd.4: committed_osd_maps(17, 17)
...
INFO  2021-05-26 20:20:43,941 [shard 0] osd - evt epoch is 18, i have 17, will wait
INFO  2021-05-26 20:20:43,941 [shard 0] osd - osdmap_subscribe(17)
...
INFO  2021-05-26 20:20:43,957 [shard 0] osd - evt epoch is 18, i have 17, will wait
INFO  2021-05-26 20:20:43,957 [shard 0] osd - osdmap_subscribe(17)
...
INFO  2021-05-26 20:20:43,969 [shard 0] osd - evt epoch is 18, i have 17, will wait
INFO  2021-05-26 20:20:43,969 [shard 0] osd - osdmap_subscribe(17)
...
DEBUG 2021-05-26 20:20:46,930 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@57288 >> mon.2 v2:172.21.15.39:3301/0] <== ceph#4 === osd_m
ap(20..21 src has 1..21) v4 (41)
INFO  2021-05-26 20:20:46,930 [shard 0] osd - handle_osd_map osd_map(20..21 src has 1..21) v4
INFO  2021-05-26 20:20:46,930 [shard 0] osd - handle_osd_map epochs [20..21], i have 17, src has [1..21]
INFO  2021-05-26 20:20:46,930 [shard 0] osd - handle_osd_map message skips epochs 18..19
INFO  2021-05-26 20:20:46,930 [shard 0] osd - osdmap_subscribe(18)
...
DEBUG 2021-05-26 20:20:47,936 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@57288 >> mon.2 v2:172.21.15.39:3301/0] <== ceph#5 === osd_m
ap(21..22 src has 1..22) v4 (41)
INFO  2021-05-26 20:20:47,936 [shard 0] osd - handle_osd_map osd_map(21..22 src has 1..22) v4
INFO  2021-05-26 20:20:47,936 [shard 0] osd - handle_osd_map epochs [21..22], i have 17, src has [1..22]
INFO  2021-05-26 20:20:47,936 [shard 0] osd - handle_osd_map message skips epochs 18..20
INFO  2021-05-26 20:20:47,936 [shard 0] osd - osdmap_subscribe(18)
...
<osdmap_subscribe(18) over and over>
```

```
2021-05-26T20:19:42.048+0000 7f4712ffd700  1 -- [v2:172.21.15.62:3300/0,v1:172.21.15.62:6789/0] <== osd.4 v2:172.21.15.62:6801/34686 4 ==== mon_subscribe({mgrmap=0+,osd_pg_creates=0+,osdmap=0+}) v3 ==== 82+0+0 (secure 0 0 0) 0x7f46fc04e150 con 0x7f470401c480
2021-05-26T20:19:42.048+0000 7f4712ffd700 20 mon.b@1(peon) e1 _ms_dispatch existing session 0x7f46fc02f500 for osd.4
2021-05-26T20:19:42.048+0000 7f4712ffd700 20 mon.b@1(peon) e1  entity_name osd.4 global_id 4168 (new_ok) caps allow *
2021-05-26T20:19:42.048+0000 7f4712ffd700 10 mon.b@1(peon) e1 handle_subscribe mon_subscribe({mgrmap=0+,osd_pg_creates=0+,osdmap=0+}) v3
...
2021-05-26T20:19:49.129+0000 7f4712ffd700  1 -- [v2:172.21.15.62:3300/0,v1:172.21.15.62:6789/0] <== osd.4 v2:172.21.15.62:6801/34686 9 ==== mo
n_subscribe({osdmap=14}) v3 ==== 36+0+0 (secure 0 0 0) 0x7f46e8556210 con 0x7f470401c480
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 mon.b@1(peon) e1 _ms_dispatch existing session 0x7f46fc02f500 for osd.4
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 mon.b@1(peon) e1  entity_name osd.4 global_id 4168 (new_ok) caps allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 10 mon.b@1(peon) e1 handle_subscribe mon_subscribe({osdmap=14}) v3
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 is_capable service=mon command= read addr v2:172.21.15.62:6801/34686 on cap allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 20  allow so far , doing grant allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 20  allow all
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 is_capable service=osd command= read addr v2:172.21.15.62:6801/34686 on cap allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 20  allow so far , doing grant allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 20  allow all
2021-05-26T20:19:49.129+0000 7f4712ffd700 10 mon.b@1(peon).osd e15 check_osdmap_sub 0x7f46e84f0150 next 14 (onetime)
2021-05-26T20:19:49.129+0000 7f4712ffd700  5 mon.b@1(peon).osd e15 send_incremental [14..15] to osd.4
2021-05-26T20:19:49.129+0000 7f4712ffd700 10 mon.b@1(peon).osd e15 build_incremental [14..15] with features 3f01cfbb7ffdffff
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 mon.b@1(peon).osd e15 build_incremental    inc 15 622 bytes
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 mon.b@1(peon).osd e15 build_incremental    inc 14 578 bytes
2021-05-26T20:19:49.129+0000 7f4712ffd700  1 -- [v2:172.21.15.62:3300/0,v1:172.21.15.62:6789/0] --> v2:172.21.15.62:6801/34686 -- osd_map(14..
15 src has 1..15) v4 -- 0x7f46e856a100 con 0x7f470401c480
```

```
seastar::future<> Client::renew_subs()
{
  if (!sub.have_new()) {
    logger().warn("{} - empty", __func__);
    return seastar::now();
  }
  logger().trace("{}", __func__);

  auto m = crimson::make_message<MMonSubscribe>();
  m->what = sub.get_subs();
  m->hostname = ceph_get_short_hostname();
  return send_message(std::move(m)).then([this] {
    sub.renewed();
  });
}
```

```
INFO  2021-05-26 20:19:42,081 [shard 0] osd - osdmap_subscribe(1)
DEBUG 2021-05-26 20:19:42,081 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@63208 >> mon.1 v2:172.21.15.62:3300/0] --> ceph#6 === mon_s
ubscribe({osdmap=1}) v3 (15)
...
INFO  2021-05-26 20:19:49,128 [shard 0] osd - osdmap_subscribe(14)
DEBUG 2021-05-26 20:19:49,128 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@63208 >> mon.1 v2:172.21.15.62:3300/0] --> ceph#9 === mon_subscribe({osdmap=14}) v3 (15)
...
INFO  2021-05-26 20:19:49,141 [shard 0] osd - osdmap_subscribe(14)
WARN  2021-05-26 20:19:49,141 [shard 0] monc - renew_subs - empty
<no MMonSubcribe>
...
INFO  2021-05-26 20:20:43,941 [shard 0] osd - evt epoch is 18, i have 17, will wait
INFO  2021-05-26 20:20:43,941 [shard 0] osd - osdmap_subscribe(17)
<no MMonSubcribe>
...
INFO  2021-05-26 20:20:46,930 [shard 0] osd - handle_osd_map message skips epochs 18..19
INFO  2021-05-26 20:20:46,930 [shard 0] osd - osdmap_subscribe(18)
<no MMonSubcribe>
```

[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136908

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Jun 2, 2021
There is a scenario when the `active_con` is properly
chosen but isn't marked as `ready_to_send`.
If `renew_subs()` is called during the `on_session_opened()`,
the flag will be turned on after the subscriptions are
renewed which cannot happen as it requires the flag to be
already set. In other words: there is a circular data dependency.

The net result is stalling the subscription machinery,
particularly the `OSDMap` subs. This caused a nasty peering
issue at Sepia [1] where PG 2.7 got stuck in the `GetInfo`
state.

```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136908$ less ./remote/smithi039/log/ceph-osd.1.log.gz
...
DEBUG 2021-05-26 20:19:48,134 [shard 0] osd -  pg_epoch 14 pg[2.7( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0) [] r=
-1 lpr=0 crt=0'0 mlcod 0'0 unknown enter Initial
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0]
r=0 lpr=0 crt=0'0 mlcod 0'0 unknown enter Reset
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 unknown enter Started
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 unknown enter Start
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 unknown enter Started/Primary
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating enter Started/Primary/Peering
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering enter Started/Primary/Peering/GetInfo
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering build_prior all_probe
DEBUG 2021-05-26 20:19:48,139 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering build_prior final: probe 0,1 down  blocked_by {}
DEBUG 2021-05-26 20:19:48,139 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering up_thru 0 < same_since 14, must notify monitor
DEBUG 2021-05-26 20:19:48,139 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>:  no prior_set down osds, clearing prior_readable_until_ub
DEBUG 2021-05-26 20:19:48,139 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>:  querying info from osd.0
...
DEBUG 2021-05-26 20:19:48,237 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering  got osd.0 2.7( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0)
DEBUG 2021-05-26 20:19:48,237 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>: Adding osd: 0 peer features: 3f01cfbb7ffdffff
DEBUG 2021-05-26 20:19:48,237 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>: Common peer features: 3f01cfbb7ffdffff
DEBUG 2021-05-26 20:19:48,237 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>: Common acting features: 3f01cfbb7ffdffff
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>: Common upacting features: 3f01cfbb7ffdffff
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering exit Started/Primary/Peering/GetInfo 0.099480 4 2021-05-26T20:19:48.146172+0000
...
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering enter Started/Primary/Peering/GetLog
...
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering enter Started/Primary/Peering/GetMissing
...
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering enter Started/Primary/Peering/WaitUpThru
...
DEBUG 2021-05-26 20:19:49,139 [shard 0] osd -  pg_epoch 15 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating enter Started/Primary/Active
...
DEBUG 2021-05-26 20:19:49,142 [shard 0] osd -  pg_epoch 15 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+activating enter Started/Primary/Active/Activating
...
DEBUG 2021-05-26 20:19:49,204 [shard 0] osd -  pg_epoch 15 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/0 les/c/f=15/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 active enter Started/Primary/Active/Recovered
...
DEBUG 2021-05-26 20:19:49,204 [shard 0] osd -  pg_epoch 15 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/0 les/c/f=15/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 active enter Started/Primary/Active/Clean
...
DEBUG 2021-05-26 20:22:31,223 [shard 0] osd -  pg_epoch 86 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 active enter Reset
...
<a lot of flipping>
...
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown activate_map
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown exit Reset 0.035744 1 2021-05-26T20:24:07.817331+0000
INFO  2021-05-26 20:24:07,851 [shard 0] osd - Exiting state: Reset, entered at 1622060647.8158188, 1622060647.8173316 spent on 1 events
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown enter Started
INFO  2021-05-26 20:24:07,851 [shard 0] osd - Entering state: Started
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown enter Start
INFO  2021-05-26 20:24:07,851 [shard 0] osd - Entering state: Start
INFO  2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown state<Start>: transitioning to Primary
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown exit Start 0.000041 0 0.000000
INFO  2021-05-26 20:24:07,851 [shard 0] osd - Exiting state: Start, entered at 1622060647.8516333, 0.0 spent on 0 events
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown enter Started/Primary
INFO  2021-05-26 20:24:07,852 [shard 0] osd - Entering state: Started/Primary
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163
) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown enter Started/Primary/Peering
INFO  2021-05-26 20:24:07,852 [shard 0] osd - Entering state: Started/Primary/Peering
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering enter Started/Primary/Peering/GetInfo
INFO  2021-05-26 20:24:07,852 [shard 0] osd - Entering state: Started/Primary/Peering/GetInfo
...
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering build_prior all_probe 0,1,4
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering build_prior maybe_rw interval:139, acting: 0
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering build_prior final: probe 0,1,4 down  blocked_by {}
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering up_thru 125 < same_since 163, must notify monitor
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering state<Started/Primary/Peering/GetInfo>:  no prior_set down osds, clearing prior_readable_until_ub
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering state<Started/Primary/Peering/GetInfo>:  querying info from osd.0
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering state<Started/Primary/Peering/GetInfo>:  querying info from osd.4
...
DEBUG 2021-05-26 20:24:07,924 [shard 0] ms - [osd.1(cluster) v2:172.21.15.39:6803/34727@61064 >> osd.4 v2:172.21.15.62:6802/34686] connect to existing
DEBUG 2021-05-26 20:24:07,924 [shard 0] ms - [osd.1(cluster) v2:172.21.15.39:6803/34727@61064 >> osd.4 v2:172.21.15.62:6802/34686] --> ceph#62 === pg_query2(2.7 2.7 query(info 0'0 epoch_sent 163) e163/163) v1 (131)
...
DEBUG 2021-05-26 20:24:07,942 [shard 0] ms - [osd.1(cluster) v2:172.21.15.39:6803/34727@61064 >> osd.4 v2:172.21.15.62:6802/34686] GOT AckFrame: seq=62
...
<plenty of osd_ping messanging but no reply to the pg_query for 2.7>
...
DEBUG 2021-05-26 20:58:19,829 [shard 0] ms - [osd.1(hb_front) v2:172.21.15.39:6807/34727 >> osd.4 v2:172.21.15.62:6807/34686@54816] <== ceph#772 =
== osd_ping(ping e17 up_from 10 ping_stamp 2021-05-26T20:58:19.825573+0000/2319.780029297s send_stamp 2319.780029297s) v5 (70)
DEBUG 2021-05-26 20:58:19,829 [shard 0] ms - [osd.1(hb_front) v2:172.21.15.39:6807/34727 >> osd.4 v2:172.21.15.62:6807/34686@54816] --> ceph#772 === osd_ping(ping_reply e249 up_from 10 ping_stamp 2021-05-26T20:58:19.825573+0000/2319.780029297s send_stamp 2320.039062500s) v5 (70
```

The peering request got stuck due to awaiting for `OSDMap`.

```
DEBUG 2021-05-26 20:24:07,930 [shard 0] ms - [osd.4(cluster) v2:172.21.15.62:6802/34686 >> osd.1 v2:172.21.15.39:6803/34727@61064] <== ceph#62 === pg_query2(2.7 2.7 query(info 0'0 epoch_sent 163) e163/163) v1 (131)
DEBUG 2021-05-26 20:24:07,930 [shard 0] osd - handle_peering_op on 2.7 from 1
DEBUG 2021-05-26 20:24:07,930 [shard 0] osd - peering_event(id=517, detail=PeeringEvent(from=1 pgid=2.7 sent=163 requested=163 evt=epoch_sent: 163 epoch_requested: 163 MQuery 2.7 from 1 query_epoch 163 query: query(info 0'0 epoch_sent 163))): star
```

```
INFO  2021-05-26 20:19:49,127 [shard 0] osd - evt epoch is 15, i have 14, will wait
INFO  2021-05-26 20:19:49,128 [shard 0] osd - osdmap_subscribe(14)
DEBUG 2021-05-26 20:19:49,128 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@63208 >> mon.1 v2:172.21.15.62:3300/0] --> ceph#9 === mon_s
ubscribe({osdmap=14}) v3 (15)
...
INFO  2021-05-26 20:19:49,131 [shard 0] osd - handle_osd_map osd_map(14..15 src has 1..15) v4
INFO  2021-05-26 20:19:49,131 [shard 0] osd - handle_osd_map epochs [14..15], i have 15, src has [1..15]
...
INFO  2021-05-26 20:19:49,138 [shard 0] osd - handle_osd_map osd_map(14..15 src has 1..15) v4
INFO  2021-05-26 20:19:49,138 [shard 0] osd - handle_osd_map epochs [14..15], i have 15, src has [1..15]
...
INFO  2021-05-26 20:19:49,139 [shard 0] osd - evt epoch is 15, i have 14, will wait
INFO  2021-05-26 20:19:49,141 [shard 0] osd - osdmap_subscribe(14)
WARN  2021-05-26 20:19:49,141 [shard 0] monc - renew_subs - empty
...
INFO  2021-05-26 20:19:50,140 [shard 0] osd - handle_osd_map osd_map(15..16 src has 1..16) v4
INFO  2021-05-26 20:19:50,140 [shard 0] osd - handle_osd_map epochs [15..16], i have 15, src has [1..16]
DEBUG 2021-05-26 20:19:50,141 [shard 0] bluestore - do_transaction
INFO  2021-05-26 20:19:50,145 [shard 0] osd - osd.4: committed_osd_maps(16, 16)
...
INFO  2021-05-26 20:20:42,881 [shard 0] osd - handle_osd_map epochs [16..17], i have 16, src has [1..17]
DEBUG 2021-05-26 20:20:42,882 [shard 0] bluestore - do_transaction
INFO  2021-05-26 20:20:42,886 [shard 0] osd - osd.4: committed_osd_maps(17, 17)
...
INFO  2021-05-26 20:20:43,941 [shard 0] osd - evt epoch is 18, i have 17, will wait
INFO  2021-05-26 20:20:43,941 [shard 0] osd - osdmap_subscribe(17)
...
INFO  2021-05-26 20:20:43,957 [shard 0] osd - evt epoch is 18, i have 17, will wait
INFO  2021-05-26 20:20:43,957 [shard 0] osd - osdmap_subscribe(17)
...
INFO  2021-05-26 20:20:43,969 [shard 0] osd - evt epoch is 18, i have 17, will wait
INFO  2021-05-26 20:20:43,969 [shard 0] osd - osdmap_subscribe(17)
...
DEBUG 2021-05-26 20:20:46,930 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@57288 >> mon.2 v2:172.21.15.39:3301/0] <== ceph#4 === osd_m
ap(20..21 src has 1..21) v4 (41)
INFO  2021-05-26 20:20:46,930 [shard 0] osd - handle_osd_map osd_map(20..21 src has 1..21) v4
INFO  2021-05-26 20:20:46,930 [shard 0] osd - handle_osd_map epochs [20..21], i have 17, src has [1..21]
INFO  2021-05-26 20:20:46,930 [shard 0] osd - handle_osd_map message skips epochs 18..19
INFO  2021-05-26 20:20:46,930 [shard 0] osd - osdmap_subscribe(18)
...
DEBUG 2021-05-26 20:20:47,936 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@57288 >> mon.2 v2:172.21.15.39:3301/0] <== ceph#5 === osd_m
ap(21..22 src has 1..22) v4 (41)
INFO  2021-05-26 20:20:47,936 [shard 0] osd - handle_osd_map osd_map(21..22 src has 1..22) v4
INFO  2021-05-26 20:20:47,936 [shard 0] osd - handle_osd_map epochs [21..22], i have 17, src has [1..22]
INFO  2021-05-26 20:20:47,936 [shard 0] osd - handle_osd_map message skips epochs 18..20
INFO  2021-05-26 20:20:47,936 [shard 0] osd - osdmap_subscribe(18)
...
<osdmap_subscribe(18) over and over>
```

```
2021-05-26T20:19:42.048+0000 7f4712ffd700  1 -- [v2:172.21.15.62:3300/0,v1:172.21.15.62:6789/0] <== osd.4 v2:172.21.15.62:6801/34686 4 ==== mon_subscribe({mgrmap=0+,osd_pg_creates=0+,osdmap=0+}) v3 ==== 82+0+0 (secure 0 0 0) 0x7f46fc04e150 con 0x7f470401c480
2021-05-26T20:19:42.048+0000 7f4712ffd700 20 mon.b@1(peon) e1 _ms_dispatch existing session 0x7f46fc02f500 for osd.4
2021-05-26T20:19:42.048+0000 7f4712ffd700 20 mon.b@1(peon) e1  entity_name osd.4 global_id 4168 (new_ok) caps allow *
2021-05-26T20:19:42.048+0000 7f4712ffd700 10 mon.b@1(peon) e1 handle_subscribe mon_subscribe({mgrmap=0+,osd_pg_creates=0+,osdmap=0+}) v3
...
2021-05-26T20:19:49.129+0000 7f4712ffd700  1 -- [v2:172.21.15.62:3300/0,v1:172.21.15.62:6789/0] <== osd.4 v2:172.21.15.62:6801/34686 9 ==== mo
n_subscribe({osdmap=14}) v3 ==== 36+0+0 (secure 0 0 0) 0x7f46e8556210 con 0x7f470401c480
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 mon.b@1(peon) e1 _ms_dispatch existing session 0x7f46fc02f500 for osd.4
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 mon.b@1(peon) e1  entity_name osd.4 global_id 4168 (new_ok) caps allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 10 mon.b@1(peon) e1 handle_subscribe mon_subscribe({osdmap=14}) v3
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 is_capable service=mon command= read addr v2:172.21.15.62:6801/34686 on cap allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 20  allow so far , doing grant allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 20  allow all
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 is_capable service=osd command= read addr v2:172.21.15.62:6801/34686 on cap allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 20  allow so far , doing grant allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 20  allow all
2021-05-26T20:19:49.129+0000 7f4712ffd700 10 mon.b@1(peon).osd e15 check_osdmap_sub 0x7f46e84f0150 next 14 (onetime)
2021-05-26T20:19:49.129+0000 7f4712ffd700  5 mon.b@1(peon).osd e15 send_incremental [14..15] to osd.4
2021-05-26T20:19:49.129+0000 7f4712ffd700 10 mon.b@1(peon).osd e15 build_incremental [14..15] with features 3f01cfbb7ffdffff
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 mon.b@1(peon).osd e15 build_incremental    inc 15 622 bytes
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 mon.b@1(peon).osd e15 build_incremental    inc 14 578 bytes
2021-05-26T20:19:49.129+0000 7f4712ffd700  1 -- [v2:172.21.15.62:3300/0,v1:172.21.15.62:6789/0] --> v2:172.21.15.62:6801/34686 -- osd_map(14..
15 src has 1..15) v4 -- 0x7f46e856a100 con 0x7f470401c480
```

```
seastar::future<> Client::renew_subs()
{
  if (!sub.have_new()) {
    logger().warn("{} - empty", __func__);
    return seastar::now();
  }
  logger().trace("{}", __func__);

  auto m = crimson::make_message<MMonSubscribe>();
  m->what = sub.get_subs();
  m->hostname = ceph_get_short_hostname();
  return send_message(std::move(m)).then([this] {
    sub.renewed();
  });
}
```

```
INFO  2021-05-26 20:19:42,081 [shard 0] osd - osdmap_subscribe(1)
DEBUG 2021-05-26 20:19:42,081 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@63208 >> mon.1 v2:172.21.15.62:3300/0] --> ceph#6 === mon_s
ubscribe({osdmap=1}) v3 (15)
...
INFO  2021-05-26 20:19:49,128 [shard 0] osd - osdmap_subscribe(14)
DEBUG 2021-05-26 20:19:49,128 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@63208 >> mon.1 v2:172.21.15.62:3300/0] --> ceph#9 === mon_subscribe({osdmap=14}) v3 (15)
...
INFO  2021-05-26 20:19:49,141 [shard 0] osd - osdmap_subscribe(14)
WARN  2021-05-26 20:19:49,141 [shard 0] monc - renew_subs - empty
<no MMonSubcribe>
...
INFO  2021-05-26 20:20:43,941 [shard 0] osd - evt epoch is 18, i have 17, will wait
INFO  2021-05-26 20:20:43,941 [shard 0] osd - osdmap_subscribe(17)
<no MMonSubcribe>
...
INFO  2021-05-26 20:20:46,930 [shard 0] osd - handle_osd_map message skips epochs 18..19
INFO  2021-05-26 20:20:46,930 [shard 0] osd - osdmap_subscribe(18)
<no MMonSubcribe>
```

[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136908

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
wangxuw pushed a commit to wangxuw/ceph that referenced this pull request Jun 15, 2021
f7181ab has optimized the client
parallelism. To achieve that `PG::do_osd_ops()` were converted to
return basically future pair of futures. Unfortunately, the life-
time management of `OpsExecuter` was kept intact. In the result,
the object was valid only till fullfying the outer future while,
due to the `rollbacker` instances, it should be available till
`all_completed` becomes available.

This issue can explain the following problem has been observed
in a Teuthology job [1].

```
DEBUG 2021-05-20 08:03:22,617 [shard 0] osd - do_op_call: method returned ret=-17, outdata.length()=0 while num_read=1, num_write=0
DEBUG 2021-05-20 08:03:22,617 [shard 0] osd - rollback_obc_if_modified: object 19:e17d4708:test-rados-api-smithi095-38404-2::foo:head got erro
r generic:17, need_rollback=false
=================================================================
==33626==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d0000b9320 at pc 0x560f486b8222 bp 0x7fffc467a1e0 sp 0x7fffc467a1d0
READ of size 4 at 0x60d0000b9320 thread T0
    #0 0x560f486b8221  (/usr/bin/ceph-osd+0x2c610221)
    #1 0x560f4880c6b1 in seastar::continuation<seastar::internal::promise_base_with_type<boost::intrusive_ptr<MOSDOpReply> >, seastar::noncopy
able_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_
wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > >
 > ()>, seastar::future<void>::then_impl_nrvo<seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::
IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson:
:errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>, crimson::interruptible::interruptible_future_detail<crimson::osd::IOInte
rruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::error
ated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > >(seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail
<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_
future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>&&)::{lambda(seastar::internal::promise_base_with_type<boos
t::intrusive_ptr<MOSDOpReply> >&&, seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() (/usr/bin/ceph-osd+0x2c7646b1)
    #2 0x560f5352c3ae  (/usr/bin/ceph-osd+0x374843ae)
    ceph#3 0x560f535318ef  (/usr/bin/ceph-osd+0x374898ef)
    ceph#4 0x560f536e395a  (/usr/bin/ceph-osd+0x3763b95a)
    ceph#5 0x560f532413d9  (/usr/bin/ceph-osd+0x371993d9)
    ceph#6 0x560f476af95a in main (/usr/bin/ceph-osd+0x2b60795a)
    ceph#7 0x7f7aa0af97b2 in __libc_start_main (/lib64/libc.so.6+0x237b2)
    ceph#8 0x560f477d2e8d in _start (/usr/bin/ceph-osd+0x2b72ae8d)

```

[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-20_07:28:16-rados-master-distro-basic-smithi/6124735/

The commit deals with the problem by repacking the outer future.
An alternative could be in switching from `std::unique_ptr` to
`seastar::shared_ptr` for managing `OpsExecuter`.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
tchaikov pushed a commit that referenced this pull request Oct 15, 2021
exit() will call pthread_cond_destroy attempting to destroy dpdk::eal::cond
upon which other threads are currently blocked results in undefine
behavior. Link different libc version test, libc-2.17 can exit,
libc-2.27 will deadlock, the call stack is as follows:

Thread 3 (Thread 0xffff7e5749f0 (LWP 62213)):
 #0  0x0000ffff7f3c422c in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0xaaaadc0e30f4 <dpdk::eal::cond+44>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
 #1  __pthread_cond_wait_common (abstime=0x0, mutex=0xaaaadc0e30f8 <dpdk::eal::lock>, cond=0xaaaadc0e30c8 <dpdk::eal::cond>)
    at pthread_cond_wait.c:502
 #2  __pthread_cond_wait (cond=0xaaaadc0e30c8 <dpdk::eal::cond>, mutex=0xaaaadc0e30f8 <dpdk::eal::lock>)
    at pthread_cond_wait.c:655
 #3  0x0000ffff7f1f1f80 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
   from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
 #4  0x0000aaaad37f5078 in dpdk::eal::<lambda()>::operator()(void) const (__closure=<optimized out>, __closure=<optimized out>)
    at ./src/msg/async/dpdk/dpdk_rte.cc:136
 #5  0x0000ffff7f1f7ed4 in ?? () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
 #6  0x0000ffff7f3be088 in start_thread (arg=0xffffe73e197f) at pthread_create.c:463
 #7  0x0000ffff7efc74ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 1 (Thread 0xffff7ee3b010 (LWP 62200)):
 #0  0x0000ffff7f3c3c38 in futex_wait (private=<optimized out>, expected=12, futex_word=0xaaaadc0e30ec <dpdk::eal::cond+36>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:61
 #1  futex_wait_simple (private=<optimized out>, expected=12, futex_word=0xaaaadc0e30ec <dpdk::eal::cond+36>)
    at ../sysdeps/nptl/futex-internal.h:135
 #2  __pthread_cond_destroy (cond=0xaaaadc0e30c8 <dpdk::eal::cond>) at pthread_cond_destroy.c:54
 #3  0x0000ffff7ef2be34 in __run_exit_handlers (status=-6, listp=0xffff7f04a5a0 <__exit_funcs>, run_list_atexit=255,
    run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
 #4  0x0000ffff7ef2bf6c in __GI_exit (status=<optimized out>) at exit.c:139
 #5  0x0000ffff7ef176e4 in __libc_start_main (main=0x0, argc=0, argv=0x0, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:344
 #6  0x0000aaaad2939db0 in _start () at ./src/include/buffer.h:642

Fixes: https://tracker.ceph.com/issues/42890
Signed-off-by: Chunsong Feng <fengchunsong@huawei.com>
Signed-off-by: luo rixin <luorixin@huawei.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Mar 7, 2022
```
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - calling method rbd.create, num_read=0, num_write=0
DEBUG 2022-03-07 13:50:40,027 [shard 0] objclass - <cls> ../src/cls/rbd/cls_rbd.cc:787: create object_prefix=parent_id size=2097152 order=0 features=1
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - handling op omap-get-vals-by-keys on object 1:144d5af5:::parent_id:head
=================================================================
==2109764==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f6de5176e70 at pc 0x7f6dfd2a7157 bp 0x7f6de5176e30 sp 0x7f6de51765d8
WRITE of size 24 at 0x7f6de5176e70 thread T0
    #0 0x7f6dfd2a7156 in __interceptor_sigaltstack.part.0 (/lib64/libasan.so.6+0x54156)
    #1 0x7f6dfd30d5b3 in __asan::PlatformUnpoisonStacks() (/lib64/libasan.so.6+0xba5b3)
    #2 0x7f6dfd31314c in __asan_handle_no_return (/lib64/libasan.so.6+0xc014c)
Reactor stalled for 275 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd3383c1 0x7f6dfd339b18 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd33b089 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
    ceph#3 0x1881f22 in fmt::v6::internal::arg_map<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~arg_map() /usr/include/fmt/core.h:1170
    ceph#4 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::~basic_format_context() /usr/include/fmt/core.h:1265
    ceph#5 0x1881f22 in fmt::v6::format_handler<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~format_handler() /usr/include/fmt/format.h:3143
    ceph#6 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::iterator fmt::v6::vformat_to<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >(fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >::range, fmt::v6::basic_string_view<char>, fmt::v6::basic_format_args<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >, fmt::v6::internal::locale_ref) /usr/include/fmt/format.h:3206
    ceph#7 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::vformat_to<fmt::v6::basic_string_view<char>, seastar::internal::log_buf::inserter_iterator, , 0>(seastar::internal::log_buf::inserter_iterator, fmt::v6::basic_string_view<char> const&, fmt::v6::basic_format_args<fmt::v6::basic_format_context<fmt::v6::type_identity<seastar::internal::log_buf::inserter_iterator>::type, fmt::v6::internal::char_t_impl<fmt::v6::basic_string_view<char>, void>::type> >) /usr/include/fmt/format.h:3395
    ceph#8 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::format_to<seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> >, hobject_t const&, 0>(seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> > const&, hobject_t const&) /usr/include/fmt/format.h:3418
    ceph#9 0x188344a in seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}::operator()(seastar::internal::log_buf::inserter_iterator) const ../src/seastar/include/seastar/util/log.hh:227
    ceph#10 0x188344a in seastar::logger::lambda_log_writer<seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}>::operator()(seastar::internal::log_buf::inserter_iterator) ../src/seastar/include/seastar/util/log.hh:106
    ceph#11 0xe8b439d in operator() ../src/seastar/src/util/log.cc:268
    ceph#12 0xe8b58f2 in seastar::logger::do_log(seastar::log_level, seastar::logger::log_writer&) ../src/seastar/src/util/log.cc:280
    ceph#13 0x2521d5a in void seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:230
    ceph#14 0x2a2ee12 in void seastar::logger::debug<hobject_t const&>(seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:373
    ceph#15 0x2a2ee12 in PGBackend::omap_get_vals_by_keys(ObjectState const&, OSDOp&, object_stat_sum_t&) const ../src/crimson/osd/pg_backend.cc:1220
    ceph#16 0x2c76349 in operator()<PGBackend, ObjectState> ../src/crimson/osd/ops_executer.cc:577
    ceph#17 0x2c76349 in do_const_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.cc:449
    ceph#18 0x2e04ce9 in do_read_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.h:216
    ceph#19 0x2e04ce9 in crimson::osd::OpsExecuter::execute_op(OSDOp&) ../src/crimson/osd/ops_executer.cc:576
Reactor stalled for 762 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd33ae85 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
    ceph#20 0x3c70c55 in execute_osd_op ../src/crimson/osd/objclass.cc:35
    ceph#21 0x3cb8aa8 in cls_cxx_map_get_val(void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*) ../src/crimson/osd/objclass.cc:372
    ceph#22 0x7f6de558de39  (/home/rzarzynski/ceph1/build/lib/libcls_rbd.so.1.0.0+0x28e39)

0x7f6de5176e70 is located 249456 bytes inside of 262144-byte region [0x7f6de513a000,0x7f6de517a000)
allocated by thread T0 here:
    #0 0x7f6dfd3084a7 in aligned_alloc (/lib64/libasan.so.6+0xb54a7)
    #1 0xdd414fc in seastar::thread_context::make_stack(unsigned long) ../src/seastar/src/core/thread.cc:196
    #2 0x7fff3214bc4f  ([stack]+0xa5c4f)
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Mar 7, 2022
The problem is:

```
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - calling method rbd.create, num_read=0, num_write=0
DEBUG 2022-03-07 13:50:40,027 [shard 0] objclass - <cls> ../src/cls/rbd/cls_rbd.cc:787: create object_prefix=parent_id size=2097152 order=0 features=1
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - handling op omap-get-vals-by-keys on object 1:144d5af5:::parent_id:head
=================================================================
==2109764==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f6de5176e70 at pc 0x7f6dfd2a7157 bp 0x7f6de5176e30 sp 0x7f6de51765d8
WRITE of size 24 at 0x7f6de5176e70 thread T0
    #0 0x7f6dfd2a7156 in __interceptor_sigaltstack.part.0 (/lib64/libasan.so.6+0x54156)
    #1 0x7f6dfd30d5b3 in __asan::PlatformUnpoisonStacks() (/lib64/libasan.so.6+0xba5b3)
    #2 0x7f6dfd31314c in __asan_handle_no_return (/lib64/libasan.so.6+0xc014c)
Reactor stalled for 275 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd3383c1 0x7f6dfd339b18 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd33b089 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
    ceph#3 0x1881f22 in fmt::v6::internal::arg_map<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~arg_map() /usr/include/fmt/core.h:1170
    ceph#4 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::~basic_format_context() /usr/include/fmt/core.h:1265
    ceph#5 0x1881f22 in fmt::v6::format_handler<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~format_handler() /usr/include/fmt/format.h:3143
    ceph#6 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::iterator fmt::v6::vformat_to<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >(fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >::range, fmt::v6::basic_string_view<char>, fmt::v6::basic_format_args<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >, fmt::v6::internal::locale_ref) /usr/include/fmt/format.h:3206
    ceph#7 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::vformat_to<fmt::v6::basic_string_view<char>, seastar::internal::log_buf::inserter_iterator, , 0>(seastar::internal::log_buf::inserter_iterator, fmt::v6::basic_string_view<char> const&, fmt::v6::basic_format_args<fmt::v6::basic_format_context<fmt::v6::type_identity<seastar::internal::log_buf::inserter_iterator>::type, fmt::v6::internal::char_t_impl<fmt::v6::basic_string_view<char>, void>::type> >) /usr/include/fmt/format.h:3395
    ceph#8 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::format_to<seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> >, hobject_t const&, 0>(seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> > const&, hobject_t const&) /usr/include/fmt/format.h:3418
    ceph#9 0x188344a in seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}::operator()(seastar::internal::log_buf::inserter_iterator) const ../src/seastar/include/seastar/util/log.hh:227
    ceph#10 0x188344a in seastar::logger::lambda_log_writer<seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}>::operator()(seastar::internal::log_buf::inserter_iterator) ../src/seastar/include/seastar/util/log.hh:106
    ceph#11 0xe8b439d in operator() ../src/seastar/src/util/log.cc:268
    ceph#12 0xe8b58f2 in seastar::logger::do_log(seastar::log_level, seastar::logger::log_writer&) ../src/seastar/src/util/log.cc:280
    ceph#13 0x2521d5a in void seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:230
    ceph#14 0x2a2ee12 in void seastar::logger::debug<hobject_t const&>(seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:373
    ceph#15 0x2a2ee12 in PGBackend::omap_get_vals_by_keys(ObjectState const&, OSDOp&, object_stat_sum_t&) const ../src/crimson/osd/pg_backend.cc:1220
    ceph#16 0x2c76349 in operator()<PGBackend, ObjectState> ../src/crimson/osd/ops_executer.cc:577
    ceph#17 0x2c76349 in do_const_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.cc:449
    ceph#18 0x2e04ce9 in do_read_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.h:216
    ceph#19 0x2e04ce9 in crimson::osd::OpsExecuter::execute_op(OSDOp&) ../src/crimson/osd/ops_executer.cc:576
Reactor stalled for 762 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd33ae85 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
    ceph#20 0x3c70c55 in execute_osd_op ../src/crimson/osd/objclass.cc:35
    ceph#21 0x3cb8aa8 in cls_cxx_map_get_val(void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*) ../src/crimson/osd/objclass.cc:372
    ceph#22 0x7f6de558de39  (/home/rzarzynski/ceph1/build/lib/libcls_rbd.so.1.0.0+0x28e39)

0x7f6de5176e70 is located 249456 bytes inside of 262144-byte region [0x7f6de513a000,0x7f6de517a000)
allocated by thread T0 here:
    #0 0x7f6dfd3084a7 in aligned_alloc (/lib64/libasan.so.6+0xb54a7)
    #1 0xdd414fc in seastar::thread_context::make_stack(unsigned long) ../src/seastar/src/core/thread.cc:196
    #2 0x7fff3214bc4f  ([stack]+0xa5c4f)
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
zhscn referenced this pull request in zhscn/ceph Mar 14, 2022
The problem is:

```
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - calling method rbd.create, num_read=0, num_write=0
DEBUG 2022-03-07 13:50:40,027 [shard 0] objclass - <cls> ../src/cls/rbd/cls_rbd.cc:787: create object_prefix=parent_id size=2097152 order=0 features=1
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - handling op omap-get-vals-by-keys on object 1:144d5af5:::parent_id:head
=================================================================
==2109764==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f6de5176e70 at pc 0x7f6dfd2a7157 bp 0x7f6de5176e30 sp 0x7f6de51765d8
WRITE of size 24 at 0x7f6de5176e70 thread T0
    #0 0x7f6dfd2a7156 in __interceptor_sigaltstack.part.0 (/lib64/libasan.so.6+0x54156)
    #1 0x7f6dfd30d5b3 in __asan::PlatformUnpoisonStacks() (/lib64/libasan.so.6+0xba5b3)
    #2 0x7f6dfd31314c in __asan_handle_no_return (/lib64/libasan.so.6+0xc014c)
Reactor stalled for 275 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd3383c1 0x7f6dfd339b18 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd33b089 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
    #3 0x1881f22 in fmt::v6::internal::arg_map<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~arg_map() /usr/include/fmt/core.h:1170
    #4 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::~basic_format_context() /usr/include/fmt/core.h:1265
    #5 0x1881f22 in fmt::v6::format_handler<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~format_handler() /usr/include/fmt/format.h:3143
    #6 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::iterator fmt::v6::vformat_to<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >(fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >::range, fmt::v6::basic_string_view<char>, fmt::v6::basic_format_args<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >, fmt::v6::internal::locale_ref) /usr/include/fmt/format.h:3206
    #7 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::vformat_to<fmt::v6::basic_string_view<char>, seastar::internal::log_buf::inserter_iterator, , 0>(seastar::internal::log_buf::inserter_iterator, fmt::v6::basic_string_view<char> const&, fmt::v6::basic_format_args<fmt::v6::basic_format_context<fmt::v6::type_identity<seastar::internal::log_buf::inserter_iterator>::type, fmt::v6::internal::char_t_impl<fmt::v6::basic_string_view<char>, void>::type> >) /usr/include/fmt/format.h:3395
    #8 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::format_to<seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> >, hobject_t const&, 0>(seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> > const&, hobject_t const&) /usr/include/fmt/format.h:3418
    #9 0x188344a in seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}::operator()(seastar::internal::log_buf::inserter_iterator) const ../src/seastar/include/seastar/util/log.hh:227
    #10 0x188344a in seastar::logger::lambda_log_writer<seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)#1}>::operator()(seastar::internal::log_buf::inserter_iterator) ../src/seastar/include/seastar/util/log.hh:106
    #11 0xe8b439d in operator() ../src/seastar/src/util/log.cc:268
    ceph#12 0xe8b58f2 in seastar::logger::do_log(seastar::log_level, seastar::logger::log_writer&) ../src/seastar/src/util/log.cc:280
    ceph#13 0x2521d5a in void seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:230
    ceph#14 0x2a2ee12 in void seastar::logger::debug<hobject_t const&>(seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:373
    ceph#15 0x2a2ee12 in PGBackend::omap_get_vals_by_keys(ObjectState const&, OSDOp&, object_stat_sum_t&) const ../src/crimson/osd/pg_backend.cc:1220
    ceph#16 0x2c76349 in operator()<PGBackend, ObjectState> ../src/crimson/osd/ops_executer.cc:577
    ceph#17 0x2c76349 in do_const_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.cc:449
    ceph#18 0x2e04ce9 in do_read_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.h:216
    ceph#19 0x2e04ce9 in crimson::osd::OpsExecuter::execute_op(OSDOp&) ../src/crimson/osd/ops_executer.cc:576
Reactor stalled for 762 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd33ae85 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
    ceph#20 0x3c70c55 in execute_osd_op ../src/crimson/osd/objclass.cc:35
    ceph#21 0x3cb8aa8 in cls_cxx_map_get_val(void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*) ../src/crimson/osd/objclass.cc:372
    ceph#22 0x7f6de558de39  (/home/rzarzynski/ceph1/build/lib/libcls_rbd.so.1.0.0+0x28e39)

0x7f6de5176e70 is located 249456 bytes inside of 262144-byte region [0x7f6de513a000,0x7f6de517a000)
allocated by thread T0 here:
    #0 0x7f6dfd3084a7 in aligned_alloc (/lib64/libasan.so.6+0xb54a7)
    #1 0xdd414fc in seastar::thread_context::make_stack(unsigned long) ../src/seastar/src/core/thread.cc:196
    #2 0x7fff3214bc4f  ([stack]+0xa5c4f)
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
dpaganel pushed a commit to dpaganel/ceph that referenced this pull request May 17, 2022
The problem is:

```
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - calling method rbd.create, num_read=0, num_write=0
DEBUG 2022-03-07 13:50:40,027 [shard 0] objclass - <cls> ../src/cls/rbd/cls_rbd.cc:787: create object_prefix=parent_id size=2097152 order=0 features=1
DEBUG 2022-03-07 13:50:40,027 [shard 0] osd - handling op omap-get-vals-by-keys on object 1:144d5af5:::parent_id:head
=================================================================
==2109764==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f6de5176e70 at pc 0x7f6dfd2a7157 bp 0x7f6de5176e30 sp 0x7f6de51765d8
WRITE of size 24 at 0x7f6de5176e70 thread T0
    #0 0x7f6dfd2a7156 in __interceptor_sigaltstack.part.0 (/lib64/libasan.so.6+0x54156)
    ceph#1 0x7f6dfd30d5b3 in __asan::PlatformUnpoisonStacks() (/lib64/libasan.so.6+0xba5b3)
    ceph#2 0x7f6dfd31314c in __asan_handle_no_return (/lib64/libasan.so.6+0xc014c)
Reactor stalled for 275 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd3383c1 0x7f6dfd339b18 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd339bd4 0x7f6dfd33b089 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
    ceph#3 0x1881f22 in fmt::v6::internal::arg_map<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~arg_map() /usr/include/fmt/core.h:1170
    ceph#4 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::~basic_format_context() /usr/include/fmt/core.h:1265
    ceph#5 0x1881f22 in fmt::v6::format_handler<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >::~format_handler() /usr/include/fmt/format.h:3143
    ceph#6 0x1881f22 in fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char>::iterator fmt::v6::vformat_to<fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >, char, fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >(fmt::v6::arg_formatter<fmt::v6::internal::output_range<seastar::internal::log_buf::inserter_iterator, char> >::range, fmt::v6::basic_string_view<char>, fmt::v6::basic_format_args<fmt::v6::basic_format_context<seastar::internal::log_buf::inserter_iterator, char> >, fmt::v6::internal::locale_ref) /usr/include/fmt/format.h:3206
    ceph#7 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::vformat_to<fmt::v6::basic_string_view<char>, seastar::internal::log_buf::inserter_iterator, , 0>(seastar::internal::log_buf::inserter_iterator, fmt::v6::basic_string_view<char> const&, fmt::v6::basic_format_args<fmt::v6::basic_format_context<fmt::v6::type_identity<seastar::internal::log_buf::inserter_iterator>::type, fmt::v6::internal::char_t_impl<fmt::v6::basic_string_view<char>, void>::type> >) /usr/include/fmt/format.h:3395
    ceph#8 0x188344a in seastar::internal::log_buf::inserter_iterator fmt::v6::format_to<seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> >, hobject_t const&, 0>(seastar::internal::log_buf::inserter_iterator, std::basic_string_view<char, std::char_traits<char> > const&, hobject_t const&) /usr/include/fmt/format.h:3418
    ceph#9 0x188344a in seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)ceph#1}::operator()(seastar::internal::log_buf::inserter_iterator) const ../src/seastar/include/seastar/util/log.hh:227
    ceph#10 0x188344a in seastar::logger::lambda_log_writer<seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&)::{lambda(seastar::internal::log_buf::inserter_iterator)ceph#1}>::operator()(seastar::internal::log_buf::inserter_iterator) ../src/seastar/include/seastar/util/log.hh:106
    ceph#11 0xe8b439d in operator() ../src/seastar/src/util/log.cc:268
    ceph#12 0xe8b58f2 in seastar::logger::do_log(seastar::log_level, seastar::logger::log_writer&) ../src/seastar/src/util/log.cc:280
    ceph#13 0x2521d5a in void seastar::logger::log<hobject_t const&>(seastar::log_level, seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:230
    ceph#14 0x2a2ee12 in void seastar::logger::debug<hobject_t const&>(seastar::logger::format_info, hobject_t const&) ../src/seastar/include/seastar/util/log.hh:373
    ceph#15 0x2a2ee12 in PGBackend::omap_get_vals_by_keys(ObjectState const&, OSDOp&, object_stat_sum_t&) const ../src/crimson/osd/pg_backend.cc:1220
    ceph#16 0x2c76349 in operator()<PGBackend, ObjectState> ../src/crimson/osd/ops_executer.cc:577
    ceph#17 0x2c76349 in do_const_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.cc:449
    ceph#18 0x2e04ce9 in do_read_op<crimson::osd::OpsExecuter::execute_op(OSDOp&)::<lambda(auto:167&, const auto:168&)> > ../src/crimson/osd/ops_executer.h:216
    ceph#19 0x2e04ce9 in crimson::osd::OpsExecuter::execute_op(OSDOp&) ../src/crimson/osd/ops_executer.cc:576
Reactor stalled for 762 ms on shard 0. Backtrace: 0x45d9d 0xda72bd3 0xd801f73 0xd81f6f9 0xd81fb9c 0xd81fe2c 0xd8200f7 0x12b2f 0x7f6dfd33ae85 0x7f6dfd33bb36 0x7f6dfd32e0b5 0x7f6dfd32ff3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0x54178 0xba5b3 0xc014c 0x1881f22 0x188344a 0xe8b439d 0xe8b58f2 0x2521d5a 0x2a2ee12 0x2c76349 0x2e04ce9 0x3c70c55 0x3cb8aa8 0x7f6de558de39
    ceph#20 0x3c70c55 in execute_osd_op ../src/crimson/osd/objclass.cc:35
    ceph#21 0x3cb8aa8 in cls_cxx_map_get_val(void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*) ../src/crimson/osd/objclass.cc:372
    ceph#22 0x7f6de558de39  (/home/rzarzynski/ceph1/build/lib/libcls_rbd.so.1.0.0+0x28e39)

0x7f6de5176e70 is located 249456 bytes inside of 262144-byte region [0x7f6de513a000,0x7f6de517a000)
allocated by thread T0 here:
    #0 0x7f6dfd3084a7 in aligned_alloc (/lib64/libasan.so.6+0xb54a7)
    ceph#1 0xdd414fc in seastar::thread_context::make_stack(unsigned long) ../src/seastar/src/core/thread.cc:196
    ceph#2 0x7fff3214bc4f  ([stack]+0xa5c4f)
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
m-ildefons pushed a commit to m-ildefons/ceph that referenced this pull request Jun 9, 2022
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Jul 5, 2022
Before the patch there was a possibility that `OSDConnectionPriv`
gets destructed before a `PipelineHandle` instance that was using
it. The reason is our remote-handling operations store `conn` directly
while `handle` is defined in a parent class. Due to the language rules
the former gets deinitialized earlier.

```
==756032==ERROR: AddressSanitizer: heap-use-after-free on address 0x615000039684 at pc 0x0000020bdfa2 bp 0x7ffd3abfa370 sp 0x7ffd3abfa360
READ of size 1 at 0x615000039684 thread T0
Reactor stalled for 261 ms on shard 0. Backtrace: 0x45d9d 0xe90f6d1 0xe6b8a1d 0xe6d1205 0xe6d16a8 0xe6d1938 0xe6d1c03 0x12cdf 0xccebf 0x7f6447161b1e 0x7f644714aee8 0x7f644714eed6 0x7f644714fb36 0x7f64471420b5 0x
7f6447143f3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0xbdc1a 0x20bdfa1 0x20c184e 0x352eb7f 0x352fa28 0x20b04a5 0x1be30e5 0xe694bc4 0xe6ebb8a 0xe843a11 0xe845a22 0xe29f497 0xe2a3ccd 0x1ab1841 0x3aca2 0x175698d
    #0 0x20bdfa1 in seastar::shared_mutex::unlock() ../src/seastar/include/seastar/core/shared_mutex.hh:122
    #1 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::exit() ../src/crimson/common/operation.h:548
    #2 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::ExitBarrier::exit() ../src/crimson/common/operation.h:533
    ceph#3 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::ExitBarrier::cancel() ../src/crimson/common/operation.h:539
    ceph#4 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::ExitBarrier::~ExitBarrier() ../src/crimson/common/operation.h:543
    ceph#5 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::ExitBarrier::~ExitBarrier() ../src/crimson/common/operation.h:544
    ceph#6 0x352eb7f in std::default_delete<crimson::PipelineExitBarrierI>::operator()(crimson::PipelineExitBarrierI*) const /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_ptr.h:85
    ceph#7 0x352eb7f in std::unique_ptr<crimson::PipelineExitBarrierI, std::default_delete<crimson::PipelineExitBarrierI> >::~unique_ptr() /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_ptr.h:361
    ceph#8 0x352eb7f in crimson::PipelineHandle::~PipelineHandle() ../src/crimson/common/operation.h:457
    ceph#9 0x352eb7f in crimson::osd::PhasedOperationT<crimson::osd::ClientRequest>::~PhasedOperationT() ../src/crimson/osd/osd_operation.h:152
    ceph#10 0x352eb7f in crimson::osd::ClientRequest::~ClientRequest() ../src/crimson/osd/osd_operations/client_request.cc:64
    ceph#11 ...
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Pegonzal pushed a commit that referenced this pull request Oct 13, 2022
Before the patch there was a possibility that `OSDConnectionPriv`
gets destructed before a `PipelineHandle` instance that was using
it. The reason is our remote-handling operations store `conn` directly
while `handle` is defined in a parent class. Due to the language rules
the former gets deinitialized earlier.

```
==756032==ERROR: AddressSanitizer: heap-use-after-free on address 0x615000039684 at pc 0x0000020bdfa2 bp 0x7ffd3abfa370 sp 0x7ffd3abfa360
READ of size 1 at 0x615000039684 thread T0
Reactor stalled for 261 ms on shard 0. Backtrace: 0x45d9d 0xe90f6d1 0xe6b8a1d 0xe6d1205 0xe6d16a8 0xe6d1938 0xe6d1c03 0x12cdf 0xccebf 0x7f6447161b1e 0x7f644714aee8 0x7f644714eed6 0x7f644714fb36 0x7f64471420b5 0x
7f6447143f3a 0xd61d0 0x32412 0xbd8a7 0xbd134 0xbdc1a 0x20bdfa1 0x20c184e 0x352eb7f 0x352fa28 0x20b04a5 0x1be30e5 0xe694bc4 0xe6ebb8a 0xe843a11 0xe845a22 0xe29f497 0xe2a3ccd 0x1ab1841 0x3aca2 0x175698d
    #0 0x20bdfa1 in seastar::shared_mutex::unlock() ../src/seastar/include/seastar/core/shared_mutex.hh:122
    #1 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::exit() ../src/crimson/common/operation.h:548
    #2 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::ExitBarrier::exit() ../src/crimson/common/operation.h:533
    #3 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::ExitBarrier::cancel() ../src/crimson/common/operation.h:539
    #4 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::ExitBarrier::~ExitBarrier() ../src/crimson/common/operation.h:543
    #5 0x20c184e in crimson::OrderedExclusivePhaseT<crimson::osd::ConnectionPipeline::GetPG>::ExitBarrier::~ExitBarrier() ../src/crimson/common/operation.h:544
    #6 0x352eb7f in std::default_delete<crimson::PipelineExitBarrierI>::operator()(crimson::PipelineExitBarrierI*) const /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_ptr.h:85
    #7 0x352eb7f in std::unique_ptr<crimson::PipelineExitBarrierI, std::default_delete<crimson::PipelineExitBarrierI> >::~unique_ptr() /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/unique_ptr.h:361
    #8 0x352eb7f in crimson::PipelineHandle::~PipelineHandle() ../src/crimson/common/operation.h:457
    #9 0x352eb7f in crimson::osd::PhasedOperationT<crimson::osd::ClientRequest>::~PhasedOperationT() ../src/crimson/osd/osd_operation.h:152
    #10 0x352eb7f in crimson::osd::ClientRequest::~ClientRequest() ../src/crimson/osd/osd_operations/client_request.cc:64
    #11 ...
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
icepic added a commit to icepic/ceph that referenced this pull request Feb 3, 2023
The actual user-id in this article is very vague. It starts with having you create "a user" that can ssh and sudo, then creates lots of stuff as root, then starts ceph-mon as .. some random uid which will not necessarily be able to read root-owned files, and is not certain to be "ceph" either.

While this manual perhaps gets you something that runs, you would probably be using systemctl or whatever upstart handler your OS comes with at step ceph#6, and not have the mon hang off your login-as-some-user ssh session. At least systemd unit files will set user and group to "ceph" for the ceph-mon, so if you mkfs'd and ran ceph-mon as something else before, then this is when you will notice if perms are bad.
@dang dang mentioned this pull request Jun 2, 2023
14 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants