
log: avoid heap allocations for most log entries#23721

Merged
tchaikov merged 3 commits into ceph:master from batrick:log-entry-stack
Sep 19, 2018

Conversation

@batrick
Member

@batrick batrick commented Aug 23, 2018

Please see commit message for more information and trivial experimental results.

This is my first time using this flame graph tool. Its results are not always reliably repeatable. I think it's reasonable to suggest this new version is about 50% faster, in addition to obviating nearly 100% of heap allocations for logging.
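The approach under discussion can be sketched as a buffer that writes into a fixed-size stack array first and spills to a heap vector only on overflow. This is a simplified standalone sketch with hypothetical names (`StackBuf`, `push`, `strv`), not the PR's actual `StackStringBuf`:

```cpp
#include <array>
#include <cstring>
#include <string_view>
#include <vector>

// Simplified sketch: the fast path appends into the stack array; the slow
// path spills everything into a heap-allocated vector once the array is full.
template<std::size_t SIZE>
class StackBuf {
  std::array<char, SIZE> arr;   // fixed storage, lives on the stack
  std::size_t arr_used = 0;
  std::vector<char> vec;        // heap fallback, used only after overflow
public:
  void push(std::string_view sv) {
    if (!vec.empty()) {
      // already spilled: keep appending to the vector
      vec.insert(vec.end(), sv.begin(), sv.end());
    } else if (arr_used + sv.size() < SIZE) {
      // fast path: no heap allocation at all
      std::memcpy(arr.data() + arr_used, sv.data(), sv.size());
      arr_used += sv.size();
    } else {
      // overflow: copy the stack contents plus the new data to the heap
      vec.reserve(arr_used + sv.size());
      vec.insert(vec.end(), arr.begin(), arr.begin() + arr_used);
      vec.insert(vec.end(), sv.begin(), sv.end());
    }
  }
  std::string_view strv() const {
    return vec.empty() ? std::string_view(arr.data(), arr_used)
                       : std::string_view(vec.data(), vec.size());
  }
};
```

Most log lines fit in the array, so the common case performs zero heap allocations.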

@ukernel
Contributor

ukernel commented Aug 24, 2018

LGTM

@tchaikov tchaikov requested a review from aclamk August 24, 2018 07:08
Contributor

@tchaikov tchaikov left a comment


in this change, in addition to replacing the cached streambuf on the heap with a non-cached streambuf on the stack, we trade the overhead of allocating Entry on the heap for memcpy()ing the entries on the stack into m_new on the heap.

i don't think this adds up to a win in performance.

before this change

[ RUN      ] Log.Speed_gather
[       OK ] Log.Speed_gather (1883 ms)
[ RUN      ] Log.Speed_nogather
[       OK ] Log.Speed_nogather (137 ms)

after this change

[ RUN      ] Log.Speed_gather
[       OK ] Log.Speed_gather (2809 ms)
[ RUN      ] Log.Speed_nogather
[       OK ] Log.Speed_nogather (140 ms)

#include <string_view>
#include <vector>

template<std::size_t SIZE>
Contributor

could just s/SIZE/stack_size/, and remove the member variable of stack_size.

Member Author

My understanding was exposing the template parameters in the class is generally good practice?

Contributor

i am not sure though. if size is not part of the interface, there is no need to expose it, IMHO.

#include <vector>

template<std::size_t SIZE>
class StackStringBuf : public std::streambuf
Contributor

could mark this class final if it is not supposed to be inherited.

StackStringBuf& operator=(const StackStringBuf&) = delete;
StackStringBuf(StackStringBuf&& o)
:
vec(std::move(o.vec)),
Contributor

might want to move this line to the previous one.

Member Author

I don't understand?

Contributor

like:

StackStringBuf(StackStringBuf&& o)
: vec(std::move(o.vec)),

Member Author

ah ok

~StackStringBuf() override {}
StackStringBuf(const StackStringBuf&) = delete;
StackStringBuf& operator=(const StackStringBuf&) = delete;
StackStringBuf(StackStringBuf&& o)
Contributor

the default implementation would suffice, as it performs member-wise move/copy of its members. so we can just put

StackStringBuf(StackStringBuf&&) = default;
StackStringBuf& operator=(StackStringBuf&&) = default;

Member Author

As @cbodley noted, I'm only moving the part of the array used.

Contributor

I guess small_vector could have a nice side effect here: ... = default;-ability. :-)

Member Author

Yup, I'm looking at that too. I was trying to find something like that in boost but missed it! Thanks for pointing it out!

Contributor

I'm only moving the part of the array used.

i knew, just wanted to remind that default move constructor is good enough in our case.


void push(std::string_view sv)
{
if (vec.size()) {
Contributor

might want to use empty() if we don't care about the "size".

Member Author

ACK

vec.insert(vec.end(), sv.begin(), sv.end());
} else {
if (sv.size()+arr_used < stack_size) {
std::memcpy(arr.data()+arr_used, sv.data(), sv.size());
Contributor

please add spaces around +.

Member Author

ACK

} else {
vec.reserve(arr_used+sv.size());
std::string_view stack_sv(arr.data(), arr_used);
vec.insert(vec.end(), stack_sv.begin(), stack_sv.end());
Contributor

could just put

vec.insert(vec.end(), arr.begin(), arr.begin() + arr_used);

so we can avoid creating a temporary object.

Member Author

ACK

src/log/Entry.h Outdated
delete(this);
}
private:
StackStringStream os;
Contributor

@tchaikov tchaikov Aug 24, 2018

before this change, the streambuf used by log entries is cached on a per-thread basis, and each thread has its own streambuf, so log entries created on a given thread do not need to create and destroy streambuf for stringifying the message to print.

after this change, the streambuf is created and destroyed on the stack for every single log Entry.

basic_string::_M_mutate is also another allocation via PrebufferedStreambuf.

i think what you meant is CachedPrebufferedStreambuf. PrebufferedStreambuf is not used anymore. as i explained above, the creation and destruction of CachedPrebufferedStreambuf is a one-shot operation on each thread, so i think the overhead introduced by it is fine, and it will be offset by the reuse of the cached streambuf.

for more details on the cached streambuf, see #17938
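The per-thread caching being described can be sketched with a `thread_local`. This is a simplified illustration with hypothetical names (a plain `std::ostringstream` stands in; the PR's real cache is `CachedPrebufferedStreambuf`):

```cpp
#include <sstream>
#include <string>

// Sketch of a per-thread cached stream: each thread constructs the stream
// (and its locale state) exactly once, then reuses it for every log entry.
inline std::ostringstream& thread_cached_stream() {
  thread_local std::ostringstream os;  // constructed once per thread
  os.str("");   // reset contents, keeping the cached object alive
  os.clear();   // reset error flags
  return os;
}
```

Every call on the same thread returns the same object, so per-entry construction and destruction costs disappear.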

@cbodley
Contributor

cbodley commented Aug 24, 2018

i think the main benefit from @aclamk's work in #17938 was to avoid the cost of re-constructing std::locale (which requires synchronization on global locale state), both in streambuf and ostream

it would be great if we could preserve those improvements, while also avoiding the allocations - maybe we could find a way to apply these changes to prebuffered_data in common/CachedPrebufferedStreambuf.h, which is the per-log-entry state that holds the actual buffers

vec(std::move(o.vec)),
arr_used(o.arr_used)
{
std::memcpy(arr.data(), o.arr.data(), o.arr_used);
Contributor

I guess restricting the memcpy to o.arr_used bytes is the reason for having a non-default move constructor and assignment operator. Might be worth a comment.

Member Author

ACK

if (sv.size()+arr_used < stack_size) {
std::memcpy(arr.data()+arr_used, sv.data(), sv.size());
arr_used += sv.size();
} else {
Contributor

@rzarzynski rzarzynski Aug 24, 2018

What I see: if the embedded buffer isn't big enough to hold sv, then switch entirely to the dynamically allocated vector, shuffling everything that was inside, as strv is supposed to provide contiguous memory.

Is there any reason for not using Boost's small_vector?

Member Author

I'll give that a try, thanks for pointing it out!

StackStringStream(StackStringStream&& o) : std::ostream(&ssb), ssb(std::move(o.ssb)) {}
StackStringStream& operator=(StackStringStream&& o)
{
ssb = std::move(o.ssb);
Contributor

What about the base class subobject?

Member Author

Yes, I shamefully missed that. I'll add it in the next version of this patchset...
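The fix being agreed on here is to move the `std::ostream` base subobject explicitly, not just the member streambuf. A minimal compilable sketch (a plain `std::stringbuf` stands in for the PR's `StackStringBuf`; the class name is hypothetical):

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

class MiniStream : public std::ostream {
  std::stringbuf ssb;
public:
  MiniStream() : std::ostream(&ssb) {}
  MiniStream(MiniStream&& o)
    : std::ostream(std::move(o)), ssb(std::move(o.ssb)) {
    set_rdbuf(&ssb);  // the base's move does not carry the rdbuf pointer
  }
  MiniStream& operator=(MiniStream&& o) {
    std::ostream::operator=(std::move(o));  // move the base subobject too
    ssb = std::move(o.ssb);
    return *this;
  }
  std::string str() const { return ssb.str(); }
};
```

Without the `std::ostream::operator=` call, format flags, width, and error state of the base would be left behind on move assignment.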

m_queue_mutex_holder = pthread_self();

if (m_inject_segv)
if (unlikely(m_inject_segv))
Contributor

:-) I bet a CPU had a lot of fun while encountering the assignment due to misprediction.
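For reference, `unlikely()` is conventionally defined on GCC/Clang via `__builtin_expect`. A sketch of the usual definition (not copied from the Ceph tree):

```cpp
// Common branch-prediction hint macros (GCC/Clang builtins).
#define likely(x)   __builtin_expect(!!(x), 0 == 0)
#define unlikely(x) __builtin_expect(!!(x), 0)

// The hint tells the compiler to lay out code assuming the branch is
// almost never taken, e.g. for a debug-only fault-injection check:
int handle(bool m_inject_segv) {
  if (unlikely(m_inject_segv)) {
    return -1;  // cold path
  }
  return 0;     // hot path
}
```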


@batrick batrick force-pushed the log-entry-stack branch 2 times, most recently from c7a9a4a to 394e716 on August 27, 2018 05:57
@batrick
Member Author

batrick commented Aug 27, 2018

New version:

[ RUN      ] Log.Speed_gather
[       OK ] Log.Speed_gather (1308 ms)

Need to do some cleanup but this is mostly done I think.


void push(std::string_view sv)
{
vec.reserve(sv.size());
Contributor

vec.size() + sv.size() right?

Member Author

yes thanks!
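The agreed fix, as a standalone sketch (free function for illustration): `reserve()` takes the total desired capacity, not an increment, so reserving only `sv.size()` can end up requesting less than what the vector already holds.

```cpp
#include <string_view>
#include <vector>

// Reserve the final size (current contents plus the new chunk) before
// appending, so the insert below never reallocates mid-copy.
void push(std::vector<char>& vec, std::string_view sv) {
  vec.reserve(vec.size() + sv.size());
  vec.insert(vec.end(), sv.begin(), sv.end());
}
```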

src/log/Entry.h Outdated
auto strv = e.strv();
str.clear();
str.reserve(strv.size());
str.insert(str.end(), strv.begin(), strv.end());
Contributor

can use str.assign(strv.begin(), strv.end()); instead of clear() + insert()

Member Author

thanks

@batrick
Member Author

batrick commented Aug 27, 2018

So this looks nearly done except there is still a mysterious allocation being done:

            $ google-pprof --alloc_objects --text bin/ceph-mds out/mds.a.profile.0001.heap
            Total: 1052127 objects
          384957  36.6%  36.6%   384957  36.6% __gnu_cxx::new_allocator::allocate (inline)
          274720  26.1%  62.7%   274720  26.1% __gnu_cxx::__aligned_membuf::_M_addr (inline)
           68496   6.5%  69.2%    68496   6.5% std::__cxx11::basic_string::_M_mutate

compare to allocations before this patch:

        $ google-pprof --alloc_objects --text bin/ceph-mds out/mds.a.profile.0001.heap
        Total: 1105048 objects
          433329  39.2%  39.2%   433329  39.2% ceph::logging::Log::create_entry
          209311  18.9%  58.2%   209311  18.9% __gnu_cxx::__aligned_membuf::_M_addr (inline)
          192963  17.5%  75.6%   192963  17.5% __gnu_cxx::new_allocator::allocate (inline)
           61774   5.6%  81.2%    61774   5.6% std::__cxx11::basic_string::_M_mutate
           37689   3.4%  84.6%    37689   3.4% ceph::buffer::raw_combined::create (inline)
           22773   2.1%  86.7%    22773   2.1% mempool::pool_allocator::allocate (inline)

@batrick
Member Author

batrick commented Aug 27, 2018

allocations.svg.txt

Uploaded svg of the allocations. (Remove .txt extension and view locally.)

@batrick batrick force-pushed the log-entry-stack branch 2 times, most recently from 71d0710 to 7af4e61 on August 27, 2018 23:15
}

void Log::submit_entry(Entry *e)
void Log::submit_entry(Entry&& e)
Contributor

can we templatize this method, if we know the actual type of Entry at compile time? so we don't need to resolve the address of virtual method at run-time.
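The suggestion can be sketched like this: a templated submit_entry binds the concrete entry type at compile time, so copying/moving it into the queue needs no virtual dispatch (names here are hypothetical stand-ins, not the PR's actual classes):

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the queued entry type.
struct ConcreteEntry {
  std::string msg;
};

class Log {
  std::mutex m_queue_mutex;
  std::vector<ConcreteEntry> m_new;
public:
  // The entry's static type is known here, so emplace_back resolves
  // the copy/move at compile time -- no virtual call required.
  template<typename E>
  void submit_entry(E&& e) {
    std::scoped_lock lock(m_queue_mutex);
    m_new.emplace_back(std::forward<E>(e));
  }
  std::size_t queued() {
    std::scoped_lock lock(m_queue_mutex);
    return m_new.size();
  }
};
```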

Member Author

ACK

src/log/Log.cc Outdated

///
void Log::set_coarse_timestamps(bool coarse) {
Locker lock(m_flush_mutex);
Contributor

i think std::scoped_lock is a better fit in this case. as it more light-weighted, and we are not using the extra features provided by std::unique_lock.

Member Author

You mean at this location or everywhere? We are using the unlock/lock methods of Locker in other locations. I don't really see a need to optimize for these methods which are called only once at startup?

Member Author

Also, the condition variables only take a unique_lock.

Contributor

@tchaikov tchaikov Aug 29, 2018

You mean at this location or everywhere?

i mean wherever std::scoped_lock applies.

We are using the unlock/lock methods of Locker in other locations.

yes, i knew. but we are not required to use the same lock type in Log, are we?

Also, the condition variables only take a unique_lock.

yeah, condition_variable takes unique_lock. in addition to an optimisation, i'd take this as an improvement in readability. when one reads a function that accepts std::string instead of const std::string& or std::string_view, he/she would wonder why. the same applies to unique_lock: when i was reading the code, i asked myself, why do we want to use unique_lock here?
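The convention being proposed can be sketched as: `std::scoped_lock` for plain critical sections, `std::unique_lock` only where a condition variable (or a manual unlock) actually requires it (illustrative names, not the PR's code):

```cpp
#include <condition_variable>
#include <mutex>

struct Flusher {
  std::mutex m;
  std::condition_variable cv;
  bool ready = false;

  void set_ready() {
    std::scoped_lock lock(m);   // nothing fancy needed: scoped_lock
    ready = true;
    cv.notify_all();
  }
  void wait_ready() {
    std::unique_lock lock(m);   // cv.wait() requires a unique_lock
    cv.wait(lock, [this] { return ready; });
  }
};
```

Seeing `unique_lock` then signals to the reader that the lock will be released and reacquired, or handed to a condition variable.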

src/log/Log.cc Outdated
m_flush.swap(m_new);
m_cond_loggers.notify_all();
m_queue_mutex_holder = 0;
lock2.unlock();
Contributor

nit, if we want to unlock a lock first, why not just use the RAII style locking? like

std::scoped_lock flush_lock{m_flush_mutex};
{
  std::scoped_lock queue_lock{m_queue_mutex};
  // update m_flush* and m_queue*
}
_flush(m_flush, true, false);
m_flush_mutex_holder = 0;

Member Author

ACK

src/log/Log.cc Outdated

line_used += e->snprintf(line + line_used, line_size - line_used - 1);
ceph_assert(line_used < line_size - 1);
used += (std::size_t)append_time(stamp, pos+used, allocated-used);
Contributor

nit, space before and after +.

Member Author

ACK

@batrick batrick force-pushed the log-entry-stack branch 2 times, most recently from 891c845 to 93a7082 on August 28, 2018 19:53

Entry() = delete;
Entry(short pr, short sub) :
m_stamp(clock().now()),
Contributor

now() is a static function, so log_clock::now() would work without a helper function

Member Author

I tried that but it doesn't work:

In file included from /home/pdonnell/ceph/src/log/Log.h:19:0,
                 from /home/pdonnell/ceph/src/common/dout.h:26,
                 from /home/pdonnell/ceph/src/common/debug.h:18,
                 from /home/pdonnell/ceph/src/erasure-code/shec/ErasureCodePluginShec.cc:22:
/home/pdonnell/ceph/src/log/Entry.h: In constructor ‘ceph::logging::Entry::Entry(short int, short int)’:
/home/pdonnell/ceph/src/log/Entry.h:26:28: error: cannot call member function ‘ceph::logging::log_clock::time_point ceph::logging::log_clock::now()’ without object
     m_stamp(log_clock::now()),
                            ^

The method is a trick to have a static log_clock without requiring a definition in a corresponding Entry.cc file. Do you think it's not worth doing?

Contributor

sneaky - works for me :) i had assumed it was just one of the clocks from ceph_time.h
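The trick under discussion: a function-local static gives a header-only shared log_clock with no out-of-line definition in an Entry.cc. A sketch, where `log_clock` is a hypothetical stand-in wrapping `std::chrono` (the real class lives in the Ceph tree):

```cpp
#include <chrono>

// Hypothetical stand-in for ceph::logging::log_clock: now() is a
// non-static member, which is why a shared instance is needed at all.
class log_clock {
public:
  using time_point = std::chrono::steady_clock::time_point;
  time_point now() const { return std::chrono::steady_clock::now(); }
};

struct Entry {
  // Function-local static: one shared clock per process, defined
  // entirely in the header -- no corresponding .cc definition required.
  static log_clock& clock() {
    static log_clock c;
    return c;
  }
  Entry() : m_stamp(clock().now()) {}
  log_clock::time_point m_stamp;
};
```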

virtual ~Entry() = default;

virtual std::string_view strv() const = 0;
virtual std::size_t size() const = 0;
Contributor

as @tchaikov mentioned in #23721 (comment), there is still some overhead from these virtual functions. i think we could remove its vtable entirely if Log::submit_entry() took MutableEntry&& directly, and ConcreteEntry's converting constructor took const MutableEntry& instead of const Entry&. think that's worth doing?

Member Author

I just templatized submit_entry. Does it look okay now?

Contributor

it looks like m_new.emplace_back(std::move(e)) still calls ConcreteEntry(const Entry&), which relies on the virtual Entry::strv() to perform the copy. if it was ConcreteEntry(const MutableEntry&), we wouldn't need any virtuals in Entry

Member Author

Jeeze, is this all really worth it? The cost to code readability seems really high.

Contributor

think that's worth doing?

apparently not, and that's fine - i just wanted to clarify that the change to submit_entry didn't address @tchaikov's point about virtual function overhead

Member Author

Right, thanks for pointing it out! As you saw, I undid the templated submit_entry method. I don't think it's worth the trouble.

@batrick
Member Author

batrick commented Aug 28, 2018

@cbodley @tchaikov

do you see what's causing these allocations?

Total: 1009203 objects
  374497  37.1%  37.1%   374497  37.1% __gnu_cxx::new_allocator::allocate (inline)
  258008  25.6%  62.7%   258008  25.6% __gnu_cxx::__aligned_membuf::_M_addr (inline)
   65407   6.5%  69.2%    65407   6.5% std::__cxx11::basic_string::_M_mutate
   42715   4.2%  73.4%    49886   4.9% std::pair::pair (inline)

Zheng and I are at a loss for the cause. Looking at the dot graph didn't help shed any light on the cause.

@aclamk
Contributor

aclamk commented Sep 7, 2018

@batrick I created an additional test to evaluate the logger in a multithreaded environment: #23982.
https://gist.github.com/aclamk/aeb0faf22c271fddabc81c5efe9fa1e5 lists the impact of the size of ConcreteEntry when cores are doing intense memory operations.
Actual execution times do not provide a clear picture, but the number of page faults is clearly higher when ConcreteEntry is larger.

Unfortunately, I did not run your improvements on an actual test environment.

@tchaikov
Contributor

2018-09-07T08:10:25.425 INFO:tasks.ceph.osd.2.mira037.stderr:*** Caught signal (Segmentation fault) **
2018-09-07T08:10:25.425 INFO:tasks.ceph.osd.2.mira037.stderr: in thread 7f21aafbb8c0 thread_name:ceph-osd
2018-09-07T08:10:25.430 INFO:tasks.ceph.osd.2.mira037.stderr: ceph version 14.0.0-2936-g2411df7 (2411df774954edfd5976b11d992b843d2fe42e85) nautilus (dev)
2018-09-07T08:10:25.430 INFO:tasks.ceph.osd.2.mira037.stderr: 1: (()+0x12890) [0x7f21a87ea890]
2018-09-07T08:10:25.430 INFO:tasks.ceph.osd.2.mira037.stderr: 2: (std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)+0x157) [0x7f21a7f17777]
2018-09-07T08:10:25.431 INFO:tasks.ceph.osd.2.mira037.stderr: 3: (BlueFS::open_for_read(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, BlueFS::FileReader**, bool)+0x4f2) [0x561596f2d862]
2018-09-07T08:10:25.431 INFO:tasks.ceph.osd.2.mira037.stderr: 4: (BlueRocksEnv::NewSequentialFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<rocksdb::SequentialFile, std::default_delete<rocksdb::SequentialFile> >*, rocksdb::EnvOptions const&)+0x224) [0x561596f507e4]
2018-09-07T08:10:25.431 INFO:tasks.ceph.osd.2.mira037.stderr: 5: (rocksdb::VersionSet::ListColumnFamilies(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*)+0x28b) [0x56159749148b]
2018-09-07T08:10:25.431 INFO:tasks.ceph.osd.2.mira037.stderr: 6: (rocksdb::DB::ListColumnFamilies(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)+0x2d) [0x5615973b58dd]
2018-09-07T08:10:25.431 INFO:tasks.ceph.osd.2.mira037.stderr: 7: (RocksDBStore::do_open(std::ostream&, bool, std::vector<KeyValueDB::ColumnFamily, std::allocator<KeyValueDB::ColumnFamily> > const*)+0x248) [0x561596ebe5b8]
2018-09-07T08:10:25.431 INFO:tasks.ceph.osd.2.mira037.stderr: 8: (BlueStore::_open_db(bool, bool)+0xd06) [0x561596e59286]
2018-09-07T08:10:25.431 INFO:tasks.ceph.osd.2.mira037.stderr: 9: (BlueStore::_mount(bool, bool)+0x53a) [0x561596e6f93a]
2018-09-07T08:10:25.431 INFO:tasks.ceph.osd.2.mira037.stderr: 10: (OSD::init()+0x2e6) [0x5615969af716]
2018-09-07T08:10:25.431 INFO:tasks.ceph.osd.2.mira037.stderr: 11: (main()+0x3a4e) [0x56159688b51e]

kchai-2018-09-07_07:49:32-rados-wip-kefu-testing-2018-09-07-1352-distro-basic-mira/2988359

2018-09-07T08:56:38.046 INFO:tasks.workunit.client.0.mira012.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/misc/rados-striper.sh:42: run:  rados --pool rbd --striper put toyfile td/rados-striper/toyfile
2018-09-07T08:56:38.163 INFO:tasks.workunit.client.0.mira012.stderr:*** Caught signal (Segmentation fault) **
2018-09-07T08:56:38.163 INFO:tasks.workunit.client.0.mira012.stderr: in thread 7f87a17b8c00 thread_name:rados
2018-09-07T08:56:38.168 INFO:tasks.workunit.client.0.mira012.stderr: ceph version 14.0.0-2936-g2411df7 (2411df774954edfd5976b11d992b843d2fe42e85) nautilus (dev)
2018-09-07T08:56:38.168 INFO:tasks.workunit.client.0.mira012.stderr: 1: (()+0x12890) [0x7f8797fbc890]
2018-09-07T08:56:38.168 INFO:tasks.workunit.client.0.mira012.stderr: 2: (CachedStackStringStream::CachedStackStringStream()+0x1c7) [0x7f8798424b57]
2018-09-07T08:56:38.168 INFO:tasks.workunit.client.0.mira012.stderr: 3: (AdminSocket::unregister_command(std::basic_string_view<char, std::char_traits<char> >)+0x2cb) [0x7f87984e300b]
2018-09-07T08:56:38.168 INFO:tasks.workunit.client.0.mira012.stderr: 4: (()+0xd76e3) [0x7f87a13206e3]
2018-09-07T08:56:38.168 INFO:tasks.workunit.client.0.mira012.stderr: 5: (librados::RadosClient::shutdown()+0xab) [0x7f87a12f865b]
2018-09-07T08:56:38.168 INFO:tasks.workunit.client.0.mira012.stderr: 6: (librados::Rados::shutdown()+0x28) [0x7f87a12a4248]
2018-09-07T08:56:38.168 INFO:tasks.workunit.client.0.mira012.stderr: 7: (libradosstriper::RadosStriper::~RadosStriper()+0x64) [0x7f87a0f89b54]
2018-09-07T08:56:38.168 INFO:tasks.workunit.client.0.mira012.stderr: 8: (()+0x43041) [0x7f8796eac041]
2018-09-07T08:56:38.169 INFO:tasks.workunit.client.0.mira012.stderr: 9: (()+0x4313a) [0x7f8796eac13a]
2018-09-07T08:56:38.169 INFO:tasks.workunit.client.0.mira012.stderr: 10: (__libc_start_main()+0xee) [0x7f8796e8ab9e]
2018-09-07T08:56:38.169 INFO:tasks.workunit.client.0.mira012.stderr: 11: (_start()+0x2a) [0x557ce4535fca]

kchai-2018-09-07_07:49:32-rados-wip-kefu-testing-2018-09-07-1352-distro-basic-mira/2988364

@batrick
Member Author

batrick commented Sep 10, 2018

Seems like a repeatable issue:

(gdb) bt
#0  0x00007f39b643159b in raise () from /lib64/libpthread.so.0
#1  0x0000564e299a962d in reraise_fatal (signum=11) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/global/signal_handler.cc:80
#2  handle_fatal_signal (signum=11) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/global/signal_handler.cc:293
#3  <signal handler called>
#4  0x0000564e293ae308 in clear (this=0x564e33e00800) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/common/StackStringStream.h:107
#5  CachedStackStringStream::CachedStackStringStream (this=0x7f39ad91d7a8) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/common/StackStringStream.h:137
#6  0x0000564e2993ea51 in MutableEntry (sub=48, pr=10, this=0x7f39ad91d780) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/log/Entry.h:57
#7  BlueFS::_read_random (this=0x564e33ec9200, h=0x564e33e8ac00, off=386, len=45, out=out@entry=0x564e33edf860 "0\370\355\063NV") at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/os/bluestore/BlueFS.cc:1022
#8  0x0000564e29964e00 in read_random (out=0x564e33edf860 "0\370\355\063NV", len=<optimized out>, offset=<optimized out>, h=<optimized out>, this=<optimized out>) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/os/bluestore/BlueFS.h:438
#9  BlueRocksRandomAccessFile::Read (this=<optimized out>, offset=<optimized out>, n=<optimized out>, result=0x7f39ad91d9d0, scratch=0x564e33edf860 "0\370\355\063NV") at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/os/bluestore/BlueRocksEnv.cc:98
#10 0x0000564e29f415b8 in rocksdb::RandomAccessFileReader::Read (this=0x564e33e87580, offset=386, n=45, result=result@entry=0x7f39ad91ddd8, scratch=0x564e33edf860 "0\370\355\063NV") at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/rocksdb/util/file_reader_writer.cc:127
#11 0x0000564e29efdd5c in rocksdb::BlockFetcher::ReadBlockContents (this=this@entry=0x7f39ad91dd20) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/rocksdb/table/block_fetcher.cc:196
#12 0x0000564e29eee044 in rocksdb::BlockBasedTable::ReadFilter (this=this@entry=0x564e330407a0, prefetch_buffer=prefetch_buffer@entry=0x0, filter_handle=..., is_a_filter_partition=is_a_filter_partition@entry=false, prefix_extractor=prefix_extractor@entry=0x0) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/rocksdb/table/block_based_table_reader.cc:1266
#13 0x0000564e29eee55d in rocksdb::BlockBasedTable::GetFilter (this=0x564e330407a0, prefetch_buffer=0x0, filter_blk_handle=..., is_a_filter_partition=<optimized out>, no_io=<optimized out>, get_context=0x0, prefix_extractor=0x0) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/rocksdb/table/block_based_table_reader.cc:1365
#14 0x0000564e29eee819 in rocksdb::BlockBasedTable::GetFilter (this=this@entry=0x564e330407a0, prefix_extractor=prefix_extractor@entry=0x0, prefetch_buffer=prefetch_buffer@entry=0x0, no_io=no_io@entry=false, get_context=get_context@entry=0x0) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/rocksdb/table/block_based_table_reader.cc:1319
#15 0x0000564e29ef7620 in rocksdb::BlockBasedTable::Open(rocksdb::ImmutableCFOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::unique_ptr<rocksdb::RandomAccessFileReader, std::default_delete<rocksdb::RandomAccessFileReader> >&&, unsigned long, std::unique_ptr<rocksdb::TableReader, std::default_delete<rocksdb::TableReader> >*, rocksdb::SliceTransform const*, bool, bool, int) (ioptions=..., env_options=..., table_options=..., internal_comparator=..., file=<optimized out>, file_size=1484, table_reader=0x7f39ad920c38, prefix_extractor=0x0, prefetch_index_and_filter_in_cache=false, skip_filters=false, level=0) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/rocksdb/table/block_based_table_reader.cc:913
#16 0x0000564e29ee7e70 in rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::TableReaderOptions const&, std::unique_ptr<rocksdb::RandomAccessFileReader, std::default_delete<rocksdb::RandomAccessFileReader> >&&, unsigned long, std::unique_ptr<rocksdb::TableReader, std::default_delete<rocksdb::TableReader> >*, bool) const (this=<optimized out>, table_reader_options=..., file=<optimized out>, 
    file_size=<optimized out>, table_reader=<optimized out>, prefetch_index_and_filter_in_cache=false) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/rocksdb/table/block_based_table_factory.cc:74
#17 0x0000564e29e7a051 in rocksdb::TableCache::GetTableReader (this=this@entry=0x564e33ff2a60, env_options=..., internal_comparator=..., fd=..., sequential_mode=sequential_mode@entry=false, readahead=0, record_read_stats=true, file_read_hist=0x564e33eb2000, table_reader=0x7f39ad920c38, prefix_extractor=0x0, skip_filters=false, level=0, prefetch_index_and_filter_in_cache=false, for_compaction=false)
    at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/rocksdb/db/table_cache.cc:117
#18 0x0000564e29e7a692 in rocksdb::TableCache::FindTable (this=0x564e33ff2a60, env_options=..., internal_comparator=..., fd=..., handle=handle@entry=0x564e33e8b6b8, prefix_extractor=0x0, no_io=false, record_read_stats=true, file_read_hist=0x564e33eb2000, skip_filters=false, level=0, prefetch_index_and_filter_in_cache=false) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/rocksdb/db/table_cache.cc:155
#19 0x0000564e29fa7952 in operator() (__closure=0x564e33e0b170) at /usr/src/debug/ceph-14.0.0-2936-g2411df7/src/rocksdb/db/version_builder.cc:394
#20 std::_Function_handler<void (), rocksdb::VersionBuilder::Rep::LoadTableHandlers(rocksdb::InternalStats*, int, bool, rocksdb::SliceTransform const*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:316
#21 0x0000564e2a052cef in execute_native_thread_routine ()
#22 0x00007f39b6429e25 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f39b5507bad in clone () from /lib64/libc.so.6

From: /ceph/teuthology-archive/kchai-2018-09-07_07:49:32-rados-wip-kefu-testing-2018-09-07-1352-distro-basic-mira/2988565/remote/mira010/coredump/1536416245.27817.core

and

(gdb) bt
#0  raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x0000557ce4572c93 in reraise_fatal (signum=11) at ./src/global/signal_handler.cc:80
#2  handle_fatal_signal (signum=11) at ./src/global/signal_handler.cc:293
#3  <signal handler called>
#4  0x00007f8798424b57 in StackStringStream::clear (this=0x0) at ./src/common/StackStringStream.h:107
#5  CachedStackStringStream::CachedStackStringStream (this=0x7ffdf530c4d8) at ./src/common/StackStringStream.h:137
#6  0x00007f87984e300b in ceph::logging::MutableEntry::MutableEntry (sub=42, pr=5, this=0x7ffdf530c4b0) at ./src/log/Entry.h:57
#7  AdminSocket::unregister_command (this=this@entry=0x557ce54f69b0, command="objecter_requests") at ./src/common/admin_socket.cc:475
#8  0x00007f87a13206e3 in Objecter::shutdown (this=0x557ce558ca10) at ./src/osdc/Objecter.cc:520
#9  0x00007f87a12f865b in librados::RadosClient::shutdown (this=0x557ce5585e50) at ./src/librados/RadosClient.cc:381
#10 0x00007f87a12a4248 in librados::Rados::shutdown (this=0x557ce5572248) at ./src/librados/librados.cc:2345
#11 0x00007f87a0f89b54 in libradosstriper::RadosStriper::~RadosStriper() () from /usr/lib/libradosstriper.so.1
#12 0x00007f8796eac041 in __run_exit_handlers (status=0, listp=0x7f8797254718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#13 0x00007f8796eac13a in __GI_exit (status=<optimized out>) at exit.c:139
#14 0x00007f8796e8ab9e in __libc_start_main (main=0x557ce45205b0 <main(int, char const**)>, argc=7, argv=0x7ffdf530c7b8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7ffdf530c7a8) at ../csu/libc-start.c:344
#15 0x0000557ce4535fca in _start ()

from: /ceph/teuthology-archive/kchai-2018-09-07_07:49:32-rados-wip-kefu-testing-2018-09-07-1352-distro-basic-mira/2988364/remote/mira012/coredump/1536310598.11264.core

Nothing pops out at me as being obviously wrong to cause this.

@cbodley
Contributor

cbodley commented Sep 11, 2018

@aclamk does this look like the same issue you encountered in common/CachedPrebufferedStreambuf.cc?

// std::unique_ptr cannot be used here, as deletion will not clear value,
// but thread variable will still exist, causing free-memory-read
struct cached_os_t

@batrick
Member Author

batrick commented Sep 11, 2018

@aclamk does this look like the same issue you encountered in common/CachedPrebufferedStreambuf.cc?

That seems to be the issue. The destructors for the thread_local variables were called before the librados destructors. I've got a fix for this I'll push soon.

Thanks for thinking of that @cbodley !
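
The destruction-order hazard can be sketched in isolation: a `thread_local std::unique_ptr` would destroy the cached object during teardown, while later `atexit` handlers (here, the librados destructors) may still try to log through it. The sketch below is illustrative only — `Stream` and `get_cached_stream` are made-up names, not Ceph's actual cache:

```cpp
#include <cassert>

// Hypothetical stand-in for a cached per-thread log stream.
struct Stream { int uses = 0; };

// Cache via a raw pointer that is intentionally never deleted: the relative
// destruction order of thread_local objects and atexit handlers is what bit
// this PR, so the cached object must simply outlive everything that logs.
static Stream* get_cached_stream() {
  thread_local Stream* cached = new Stream;  // leaked on purpose
  ++cached->uses;
  return cached;
}
```

Leaking a handful of per-thread buffers is the usual trade-off here; a `std::unique_ptr` would reclaim them but reintroduces the free-memory-read at exit.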

@batrick
Member Author

batrick commented Sep 11, 2018

@cbodley @tchaikov this passes my testing. I think it's ready for another round of QA.

@tchaikov
Contributor

2018-09-13T14:43:52.247 INFO:teuthology.orchestra.run.mira071.stderr:*** Caught signal (Segmentation fault) **
2018-09-13T14:43:52.247 INFO:teuthology.orchestra.run.mira071.stderr: in thread 7f992e7b7980 thread_name:ceph-osd
2018-09-13T14:43:52.256 INFO:teuthology.orchestra.run.mira071.stderr: ceph version 14.0.0-3263-g179797e (179797eecd0e59330f27082b54d943f6224fece0) nautilus (dev)
2018-09-13T14:43:52.256 INFO:teuthology.orchestra.run.mira071.stderr: 1: (()+0x12890) [0x7f992bfe6890]
2018-09-13T14:43:52.256 INFO:teuthology.orchestra.run.mira071.stderr: 2: (CachedStackStringStream::CachedStackStringStream()+0x132) [0x562094aa9072]
2018-09-13T14:43:52.256 INFO:teuthology.orchestra.run.mira071.stderr: 3: (CephRocksdbLogger::Logv(rocksdb::InfoLogLevel, char const*, __va_list_tag*)+0xb6) [0x562095016e26]
2018-09-13T14:43:52.256 INFO:teuthology.orchestra.run.mira071.stderr: 4: (rocksdb::Log(rocksdb::InfoLogLevel, std::shared_ptr<rocksdb::Logger> const&, char const*, ...)+0x96) [0x5620955fe9d6]
2018-09-13T14:43:52.256 INFO:teuthology.orchestra.run.mira071.stderr: 5: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x1153) [0x5620955e1153]
2018-09-13T14:43:52.256 INFO:teuthology.orchestra.run.mira071.stderr: 6: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0xe2c) [0x56209553fa5c]
2018-09-13T14:43:52.256 INFO:teuthology.orchestra.run.mira071.stderr: 7: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool)+0x696) [0x5620955404b6]
2018-09-13T14:43:52.256 INFO:teuthology.orchestra.run.mira071.stderr: 8: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x22) [0x562095541db2]
2018-09-13T14:43:52.257 INFO:teuthology.orchestra.run.mira071.stderr: 9: (RocksDBStore::do_open(std::ostream&, bool, std::vector<KeyValueDB::ColumnFamily, std::allocator<KeyValueDB::ColumnFamily> > const*)+0x1247) [0x56209500c467]
2018-09-13T14:43:52.257 INFO:teuthology.orchestra.run.mira071.stderr: 10: (BlueStore::_open_db(bool, bool)+0xd06) [0x562094f60686]
2018-09-13T14:43:52.257 INFO:teuthology.orchestra.run.mira071.stderr: 11: (BlueStore::_fsck(bool, bool)+0xe59) [0x562094fb1169]
2018-09-13T14:43:52.257 INFO:teuthology.orchestra.run.mira071.stderr: 12: (BlueStore::_mount(bool, bool)+0x207) [0x562094fbdb27]
2018-09-13T14:43:52.257 INFO:teuthology.orchestra.run.mira071.stderr: 13: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x1d1) [0x562094ac1a11]
2018-09-13T14:43:52.257 INFO:teuthology.orchestra.run.mira071.stderr: 14: (main()+0x1684) [0x5620949ca634]
2018-09-13T14:43:52.257 INFO:teuthology.orchestra.run.mira071.stderr: 15: (__libc_start_main()+0xe7) [0x7f992ac76b97]
2018-09-13T14:43:52.257 INFO:teuthology.orchestra.run.mira071.stderr: 16: (_start()+0x2a) [0x562094aa30fa]
2018-09-13T14:43:52.329 INFO:teuthology.orchestra.run.mira071:Running: 'sudo MALLOC_CHECK_=3 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph-osd --cluster ceph --mkfs --mkkey -i 1 --monmap /home/ubuntu/cephtest/ceph.monmap'

see http://pulpito.ceph.com/kchai-2018-09-13_14:27:22-rados-wip-kefu-testing-2018-09-13-1956-distro-basic-mira/3016307/

@batrick
Member Author

batrick commented Sep 14, 2018

I don't see the issue from the stacktrace:

(gdb) bt
#0  0x00007f992deef74d in tc_newarray () from /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
#1  0x0000562094b2951a in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::reserve (this=0x7ffc320cedc0, __res=<optimized out>) at /usr/include/c++/7/bits/basic_string.tcc:293
#2  0x00007f992b718bf8 in std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::overflow(int) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f992b72359b in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f992b713774 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x000056209510b468 in std::operator<< <char, std::char_traits<char>, std::allocator<char> > (
    __str="{\n    \"crash_id\": \"2018-09-13_14:43:51.997093Z_7316ff21-71cf-49cb-887c-bb8c2d39b978\",\n    \"timestamp\": \"2018-09-13 14:43:51.997093Z\",\n    \"entity_name\": \"osd.1\",\n    \"ceph_version\": \"14.0.0-3263-g1797"..., __os=...) at /usr/include/c++/7/bits/basic_string.h:6277
#6  ceph::JSONFormatter::flush (this=this@entry=0x7ffc320cf960, os=...) at ./src/common/Formatter.cc:129
#7  0x00005620950e056a in handle_fatal_signal (signum=11) at ./src/global/signal_handler.cc:258
#8  <signal handler called>
#9  0x0000562094aa9072 in StackStringStream<4096ul>::clear (this=0x56209f7dc000) at ./src/common/StackStringStream.h:90
#10 CachedStackStringStream::CachedStackStringStream (this=0x7ffc320d55b8) at ./src/common/StackStringStream.h:119
#11 0x0000562095016e26 in ceph::logging::MutableEntry::MutableEntry (sub=51, pr=<optimized out>, this=0x7ffc320d5590) at ./src/log/Entry.h:57
#12 CephRocksdbLogger::Logv (this=0x56209f80f220, log_level=<optimized out>,
    format=0x562095933ef8 "[/build/ceph-14.0.0-3263-g179797e/src/rocksdb/db/version_set.cc:3374] Recovered from manifest file:%s succeeded,manifest_file_number is %lu, next_file_number is %lu, last_sequence is %lu, log_number i"..., ap=0x7ffc320e5610) at ./src/kv/RocksDBStore.cc:183
#13 0x00005620955fe9d6 in rocksdb::Log (log_level=log_level@entry=rocksdb::INFO_LEVEL, info_log=std::shared_ptr<rocksdb::Logger> (use count 4, weak count 0) = {...},
    format=format@entry=0x562095933ef8 "[/build/ceph-14.0.0-3263-g179797e/src/rocksdb/db/version_set.cc:3374] Recovered from manifest file:%s succeeded,manifest_file_number is %lu, next_file_number is %lu, last_sequence is %lu, log_number i"...) at ./src/rocksdb/env/env.cc:253
#14 0x00005620955e1153 in rocksdb::VersionSet::Recover (this=<optimized out>, column_families=std::vector of length 1, capacity 1 = {...}, read_only=read_only@entry=false) at ./src/rocksdb/db/version_set.cc:3363
#15 0x000056209553fa5c in rocksdb::DBImpl::Recover (this=this@entry=0x56209f883000, column_families=std::vector of length 1, capacity 1 = {...}, read_only=read_only@entry=false,
    error_if_log_file_exist=error_if_log_file_exist@entry=false, error_if_data_exists_in_logs=error_if_data_exists_in_logs@entry=false) at ./src/rocksdb/db/db_impl_open.cc:379
#16 0x00005620955404b6 in rocksdb::DBImpl::Open (db_options=..., dbname="db", column_families=std::vector of length 1, capacity 1 = {...}, handles=handles@entry=0x7ffc320e67b0, dbptr=dbptr@entry=0x56209ea1ace8,
    seq_per_batch=false) at ./src/rocksdb/db/db_impl_open.cc:1067
#17 0x0000562095541db2 in rocksdb::DB::Open (db_options=..., dbname="db", column_families=std::vector of length 1, capacity 1 = {...}, handles=handles@entry=0x7ffc320e67b0, dbptr=dbptr@entry=0x56209ea1ace8)
    at ./src/rocksdb/db/db_impl_open.cc:1008
#18 0x000056209500c467 in RocksDBStore::do_open (this=0x56209ea1ac00, out=..., create_if_missing=false, cfs=0x7ffc320e7100) at ./src/kv/RocksDBStore.cc:557
#19 0x0000562094f60686 in BlueStore::_open_db (this=this@entry=0x56209f852000, create=create@entry=false, to_repair_db=to_repair_db@entry=false) at ./src/os/bluestore/BlueStore.cc:5203
#20 0x0000562094fb1169 in BlueStore::_fsck (this=this@entry=0x56209f852000, deep=true, repair=<optimized out>, repair@entry=false) at ./src/os/bluestore/BlueStore.cc:6238
#21 0x0000562094fbdb27 in BlueStore::fsck (deep=<optimized out>, this=0x56209f852000) at ./src/os/bluestore/BlueStore.h:2365
#22 BlueStore::_mount (this=0x56209f852000, kv_only=<optimized out>, open_db=<optimized out>) at ./src/os/bluestore/BlueStore.cc:5833
#23 0x0000562094ac1a11 in OSD::mkfs (cct=0x56209ed00000, store=0x56209f852000, dev=..., fsid=..., whoami=1) at ./src/osd/OSD.cc:1807
#24 0x00005620949ca634 in main (argc=<optimized out>, argv=<optimized out>) at ./src/ceph_osd.cc:336
(gdb) frame 9            
#9  0x0000562094aa9072 in StackStringStream<4096ul>::clear (this=0x56209f7dc000) at ./src/common/StackStringStream.h:90
90      in ./src/common/StackStringStream.h
(gdb) print *this
$9 = {<std::basic_ostream<char, std::char_traits<char> >> = {<No data fields>}, ssb = {<std::basic_streambuf<char, std::char_traits<char> >> = {<No data fields>}, 
    vec = {<boost::container::small_vector_base<char, boost::container::new_allocator<char> >> = {<boost::container::vector<char, boost::container::small_vector_allocator<boost::container::new_allocator<char> >, void>> = {m_holder = {<boost::container::small_vector_allocator<boost::container::new_allocator<char> >> = {<boost::container::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
            m_start = 0xffffffffffffffff <error: Cannot access memory at address 0xffffffffffffffff>, m_size = 18446744073709551615, m_capacity = 18446744073709551615}}, m_storage_start = {aligner = {
            data = "\377"}, data = "\377"}}, <boost::container::small_vector_storage<boost::move_detail::aligned_struct_wrapper<1, 1>, 4088>> = {m_rest_of_storage = {{aligner = {data = "\377"}, 
            data = "\377"} <repeats 4088 times>}}, static needed_extra_storages = <optimized out>, static needed_bytes = <optimized out>, static header_bytes = <optimized out>, static s_start = <optimized out>, 
      static static_capacity = <optimized out>}}}

Looks almost like the small_vector was destructed, but I don't see how. I'll keep thinking about this, although hopefully someone else will spot the issue.

Each log Entry now exists on the stack and uses a large (4k) buffer for its log
stream. This Entry is std::move'd to the queues (std::vector and
boost::circular_buffer) in the Log, involving only memory copies in the general
case. There are two memory copies (std::move) for any given Entry: once in
Log::submit_entry and again in Log::_flush.

In practice, this eliminates 100% of heap allocations for logging outside of
startup.
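
The stack-buffer idea can be sketched with standard C++ only. `StackEntry` below is illustrative, not the real `ceph::logging::Entry` (which builds on `boost::container::small_vector`): the text lives in a fixed-size inline buffer, so queueing the entry is a buffer copy rather than a heap allocation.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Illustrative sketch of a stack-resident log entry.
struct StackEntry {
  std::array<char, 4096> buf{};  // inline storage, no new[]
  std::size_t len = 0;

  void append(std::string_view s) {
    // Truncate rather than spill to the heap in this simplified sketch.
    std::size_t n = std::min(s.size(), buf.size() - len);
    std::memcpy(buf.data() + len, s.data(), n);
    len += n;
  }
  std::string_view view() const { return {buf.data(), len}; }
};
```

Moving a `StackEntry` into a `std::vector` (the role `m_new` plays in `Log`) copies the inline buffer; that is the memcpy cost the commit message refers to, traded against eliminating a per-entry allocation.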

I've run a simple experiment with the MDS that copies /usr/bin to CephFS. I got
measurements for the number of allocations from the heap profiler and the
profile of CPU usage in the MDS.

** Before this patch **

    == Heap profile: ==

    $ google-pprof --alloc_objects --text bin/ceph-mds out/mds.a.profile.0001.heap
    Total: 1105048 objects
      433329  39.2%  39.2%   433329  39.2% ceph::logging::Log::create_entry
      209311  18.9%  58.2%   209311  18.9% __gnu_cxx::__aligned_membuf::_M_addr (inline)
      192963  17.5%  75.6%   192963  17.5% __gnu_cxx::new_allocator::allocate (inline)
       61774   5.6%  81.2%    61774   5.6% std::__cxx11::basic_string::_M_mutate
       37689   3.4%  84.6%    37689   3.4% ceph::buffer::raw_combined::create (inline)
       22773   2.1%  86.7%    22773   2.1% mempool::pool_allocator::allocate (inline)
       17761   1.6%  88.3%    20523   1.9% std::pair::pair (inline)
       15795   1.4%  89.7%    15797   1.4% std::swap (inline)
       11011   1.0%  90.7%   130061  11.8% std::__cxx11::list::_M_insert (inline)
       10822   1.0%  91.7%    10822   1.0% std::__cxx11::basic_string::reserve
        9108   0.8%  92.5%    32721   3.0% __gnu_cxx::new_allocator::construct (inline)
        8608   0.8%  93.3%     8610   0.8% std::_Deque_base::_M_initialize_map (inline)
        7694   0.7%  94.0%     7694   0.7% std::__cxx11::basic_string::_M_capacity (inline)
        6160   0.6%  94.5%     6160   0.6% Journaler::wrap_finisher
        6084   0.6%  95.1%    70892   6.4% std::map::operator[] (inline)
        5347   0.5%  95.6%     5347   0.5% MutationImpl::add_projected_fnode
        4381   0.4%  96.0%     7706   0.7% mempool::pool_allocator::construct (inline)
        3588   0.3%  96.3%   182966  16.6% Locker::_do_cap_update
        3049   0.3%  96.6%     5280   0.5% std::__shared_count::__shared_count (inline)
        3043   0.3%  96.9%     3043   0.3% MDSLogContextBase::MDSLogContextBase (inline)
        3038   0.3%  97.1%    14763   1.3% std::__uninitialized_copy::__uninit_copy (inline)

   So approximately 430k heap allocations were made for Entry alone! The
   basic_string::_M_mutate calls are further allocations made via PrebufferedStreambuf.

	== Profile data ==

	Selecting interesting functions

Samples: 798K of event 'cycles:pp', Event count (approx.)
  Children      Self  Command  Shared Object        Symbol
+    1.04%     1.04%  log  libceph-common.so.0  [.] ceph::logging::Log::_flush
+    0.05%     0.05%  log  libceph-common.so.0  [.] ceph::logging::Log::flush
+    0.00%     0.00%  log  libceph-common.so.0  [.] ceph::logging::Log::_log_safe_write
+    0.00%     0.00%  log  libceph-common.so.0  [.] ceph::logging::Log::entry
+    0.00%     0.00%  log  libceph-common.so.0  [.] ceph::logging::Log::_flush_logbuf
...

  Children      Self  Command         Shared Object        Symbol
+    3.69%     0.00%  safe_timer      libceph-common.so.0  [.] CachedPrebufferedStreambuf::~CachedPrebufferedStreambuf
+    0.53%     0.00%  ms_dispatch     libceph-common.so.0  [.] CachedPrebufferedStreambuf::~CachedPrebufferedStreambuf
+    0.13%     0.00%  fn_anonymous    libceph-common.so.0  [.] CachedPrebufferedStreambuf::~CachedPrebufferedStreambuf
+    0.00%     0.00%  ms_dispatch     libceph-common.so.0  [.] CachedPrebufferedStreambuf::create
+    0.00%     0.00%  fn_anonymous    libceph-common.so.0  [.] CachedPrebufferedStreambuf::create

  Children      Self  Command         Shared Object        Symbol
+    0.07%     0.07%  fn_anonymous    libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  ms_dispatch     libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  ms_dispatch     ceph-mds             [.] _ZN4ceph7logging3Log12create_entryEiiPm@plt
+    0.00%     0.00%  md_submit       libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  fn_anonymous    ceph-mds             [.] _ZN4ceph7logging3Log12create_entryEiiPm@plt
+    0.00%     0.00%  safe_timer      libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  mds_rank_progr  libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  mds_rank_progr  ceph-mds             [.] _ZN4ceph7logging3Log12create_entryEiiPm@plt
+    0.00%     0.00%  msgr-worker-0   libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  msgr-worker-2   libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  md_submit       ceph-mds             [.] _ZN4ceph7logging3Log12create_entryEiiPm@plt
+    0.00%     0.00%  msgr-worker-1   libceph-common.so.0  [.] ceph::logging::Log::create_entry

  Children      Self  Command         Shared Object        Symbol
+    8.29%     0.00%  ms_dispatch     libstdc++.so.6.0.24  [.] virtual thunk to std::basic_ostream<char, std::char_traits<char> >::~basic_ostream
+    7.55%     1.46%  ms_dispatch     libstdc++.so.6.0.24  [.] std::ostream::_M_insert<long>
+    3.87%     0.00%  fn_anonymous    libstdc++.so.6.0.24  [.] std::basic_ostream<char, std::char_traits<char> >::~basic_ostream
+    2.92%     0.00%  md_submit       libstdc++.so.6.0.24  [.] virtual thunk to std::basic_ostream<char, std::char_traits<char> >::~basic_ostream
+    2.38%     0.00%  fn_anonymous    libstdc++.so.6.0.24  [.] virtual thunk to std::basic_ostream<char, std::char_traits<char> >::~basic_ostream
+    2.22%     2.22%  fn_anonymous    libstdc++.so.6.0.24  [.] std::ostream::sentry::sentry
+    1.89%     0.13%  fn_anonymous    libstdc++.so.6.0.24  [.] std::__ostream_insert<char, std::char_traits<char> >
+    0.71%     0.00%  ms_dispatch     libstdc++.so.6.0.24  [.] std::basic_ostream<char, std::char_traits<char> >::~basic_ostream
+    0.39%     0.06%  fn_anonymous    libstdc++.so.6.0.24  [.] std::ostream::_M_insert<long>
+    0.29%     0.21%  ms_dispatch     libstdc++.so.6.0.24  [.] std::__ostream_insert<char, std::char_traits<char> >
+    0.27%     0.27%  ms_dispatch     libstdc++.so.6.0.24  [.] std::ostream::sentry::~sentry
+    0.27%     0.27%  fn_anonymous    libstdc++.so.6.0.24  [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>
+    0.22%     0.22%  ms_dispatch     libstdc++.so.6.0.24  [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn
+    0.20%     0.20%  ms_dispatch     libstdc++.so.6.0.24  [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>
+    0.15%     0.15%  fn_anonymous    libstdc++.so.6.0.24  [.] std::ostream::sentry::~sentry
+    0.14%     0.14%  ms_dispatch     libstdc++.so.6.0.24  [.] std::ostream::sentry::sentry
+    0.13%     0.00%  ms_dispatch     libstdc++.so.6.0.24  [.] std::ostream::_M_insert<unsigned long>
+    0.13%     0.13%  fn_anonymous    libstdc++.so.6.0.24  [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn
+    0.00%     0.00%  fn_anonymous    libstdc++.so.6.0.24  [.] std::ostream::_M_insert<unsigned long>
+    0.00%     0.00%  ms_dispatch     libstdc++.so.6.0.24  [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<unsigned long>

    And the unittest_log time:

    $ bin/unittest_log
    [==========] Running 15 tests from 1 test case
    [----------] Global test environment set-up
    [----------] 15 tests from Log
    [ RUN      ] Log.Simple
    [       OK ] Log.Simple (0 ms)
    [ RUN      ] Log.ReuseBad
    [       OK ] Log.ReuseBad (1 ms)
    [ RUN      ] Log.ManyNoGather
    [       OK ] Log.ManyNoGather (0 ms)
    [ RUN      ] Log.ManyGatherLog
    [       OK ] Log.ManyGatherLog (12 ms)
    [ RUN      ] Log.ManyGatherLogStringAssign
    [       OK ] Log.ManyGatherLogStringAssign (27 ms)
    [ RUN      ] Log.ManyGatherLogStringAssignWithReserve
    [       OK ] Log.ManyGatherLogStringAssignWithReserve (27 ms)
    [ RUN      ] Log.ManyGatherLogPrebuf
    [       OK ] Log.ManyGatherLogPrebuf (15 ms)
    [ RUN      ] Log.ManyGatherLogPrebufOverflow
    [       OK ] Log.ManyGatherLogPrebufOverflow (15 ms)
    [ RUN      ] Log.ManyGather
    [       OK ] Log.ManyGather (8 ms)
    [ RUN      ] Log.InternalSegv

    [WARNING] /home/pdonnell/cephfs-shell/src/googletest/googletest/src/gtest-death-test.cc:836:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 3 threads
    [       OK ] Log.InternalSegv (8 ms)
    [ RUN      ] Log.LargeLog
    [       OK ] Log.LargeLog (43 ms)
    [ RUN      ] Log.TimeSwitch
    [       OK ] Log.TimeSwitch (1 ms)
    [ RUN      ] Log.TimeFormat
    [       OK ] Log.TimeFormat (0 ms)
    [ RUN      ] Log.Speed_gather
    [       OK ] Log.Speed_gather (1779 ms)
    [ RUN      ] Log.Speed_nogather
    [       OK ] Log.Speed_nogather (64 ms)
    [----------] 15 tests from Log (2000 ms total)

    [----------] Global test environment tear-down
    [==========] 15 tests from 1 test case ran. (2000 ms total)
    [  PASSED  ] 15 tests

** After Patch **

The StackStringBuf uses 4k for its default buffer. This appears to be more than
reasonable for preventing allocations for logging.

    == Heap profile: ==

	$ google-pprof --alloc_objects --text bin/ceph-mds out/mds.a.profile.0001.heap
	Total: 1052127 objects
      384957  36.6%  36.6%   384957  36.6% __gnu_cxx::new_allocator::allocate (inline)
      274720  26.1%  62.7%   274720  26.1% __gnu_cxx::__aligned_membuf::_M_addr (inline)
       68496   6.5%  69.2%    68496   6.5% std::__cxx11::basic_string::_M_mutate
       44140   4.2%  73.4%    51828   4.9% std::pair::pair (inline)
       43091   4.1%  77.5%    43091   4.1% ceph::buffer::raw_combined::create (inline)
       27706   2.6%  80.1%   236407  22.5% std::__cxx11::list::_M_insert (inline)
       25699   2.4%  82.6%    25699   2.4% std::__cxx11::basic_string::reserve
       23183   2.2%  84.8%    23183   2.2% mempool::pool_allocator::allocate (inline)
       19466   1.9%  86.6%    81716   7.8% __gnu_cxx::new_allocator::construct (inline)
       17606   1.7%  88.3%    17606   1.7% std::__cxx11::basic_string::_M_capacity (inline)
       16879   1.6%  89.9%    16881   1.6% std::swap (inline)
        8572   0.8%  90.7%     8574   0.8% std::_Deque_base::_M_initialize_map (inline)
        8477   0.8%  91.5%    11808   1.1% mempool::pool_allocator::construct (inline)
        6166   0.6%  92.1%     6166   0.6% Journaler::wrap_finisher
        6080   0.6%  92.7%   134975  12.8% std::map::operator[] (inline)
        6079   0.6%  93.3%     6079   0.6% MutationImpl::add_projected_fnode

	== Profile data ==

    Samples: 62K of event 'cycles:u', Event count (approx.)
      Overhead  Command         Shared Object         Symbol
    +    5.91%  log             libc-2.23.so          [.] vfprintf
    +    5.75%  ms_dispatch     libstdc++.so.6.0.24   [.] std::__ostream_insert<char, std::char_traits<char> >
    +    4.59%  ms_dispatch     ceph-mds              [.] StackStringBuf<4096ul>::xsputn
    +    4.26%  ms_dispatch     libc-2.23.so          [.] __memmove_ssse3_back
    +    4.07%  log             libceph-common.so.0   [.] ceph::logging::Log::_flush
    +    2.48%  ms_dispatch     libstdc++.so.6.0.24   [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<l
    +    2.13%  fn_anonymous    libstdc++.so.6.0.24   [.] std::__ostream_insert<char, std::char_traits<char> >
    +    2.09%  ms_dispatch     ceph-mds              [.] CDir::check_rstats
    +    2.06%  ms_dispatch     libstdc++.so.6.0.24   [.] std::ostream::sentry::sentry
    +    1.98%  ms_dispatch     libstdc++.so.6.0.24   [.] std::ostream::sentry::~sentry
    +    1.87%  log             libc-2.23.so          [.] __strcpy_sse2_unaligned
    +    1.60%  fn_anonymous    ceph-mds              [.] StackStringBuf<4096ul>::xsputn
    +    1.46%  log             libc-2.23.so          [.] _IO_default_xsputn
    +    1.45%  log             libc-2.23.so          [.] _itoa_word
    +    1.43%  fn_anonymous    libc-2.23.so          [.] __memmove_ssse3_back
    +    1.40%  ms_dispatch     libstdc++.so.6.0.24   [.] std::ostream::_M_insert<long>
    +    0.98%  fn_anonymous    libstdc++.so.6.0.24   [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<l
    +    0.89%  ms_dispatch     libstdc++.so.6.0.24   [.] 0x
    +    0.88%  ms_dispatch     libstdc++.so.6.0.24   [.] std::_Rb_tree_increment

    And the unittest_log time:

    $ bin/unittest_log
    [==========] Running 13 tests from 1 test case.
    [----------] Global test environment set-up.
    [----------] 13 tests from Log
    [ RUN      ] Log.Simple
    [       OK ] Log.Simple (1 ms)
    [ RUN      ] Log.ReuseBad
    [       OK ] Log.ReuseBad (0 ms)
    [ RUN      ] Log.ManyNoGather
    [       OK ] Log.ManyNoGather (0 ms)
    [ RUN      ] Log.ManyGatherLog
    [       OK ] Log.ManyGatherLog (83 ms)
    [ RUN      ] Log.ManyGatherLogStringAssign
    [       OK ] Log.ManyGatherLogStringAssign (79 ms)
    [ RUN      ] Log.ManyGatherLogStackSpillover
    [       OK ] Log.ManyGatherLogStackSpillover (81 ms)
    [ RUN      ] Log.ManyGather
    [       OK ] Log.ManyGather (80 ms)
    [ RUN      ] Log.InternalSegv

    [WARNING] /home/pdonnell/ceph/src/googletest/googletest/src/gtest-death-test.cc:836:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 3 threads.
    [       OK ] Log.InternalSegv (7 ms)
    [ RUN      ] Log.LargeLog
    [       OK ] Log.LargeLog (55 ms)
    [ RUN      ] Log.TimeSwitch
    [       OK ] Log.TimeSwitch (4 ms)
    [ RUN      ] Log.TimeFormat
    [       OK ] Log.TimeFormat (1 ms)
    [ RUN      ] Log.Speed_gather
    [       OK ] Log.Speed_gather (1441 ms)
    [ RUN      ] Log.Speed_nogather
    [       OK ] Log.Speed_nogather (63 ms)
    [----------] 13 tests from Log (1895 ms total)

    [----------] Global test environment tear-down
    [==========] 13 tests from 1 test case ran. (1895 ms total)
    [  PASSED  ] 13 tests.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
In favor of the simpler StackStringBuf.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This brings down the static size of the memory used by the logging infrastructure:

    If we used 1024, we'd have 1088*10000 = 10880000 ≈ 10MB in use by the ring
    buffer and 2*1088*100 = 2*108800 ≈ 2*106KB for the m_new and m_flush
    vectors.

In my testing, 1024 covers most log entries.
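
The arithmetic above checks out; the 1088-byte per-Entry figure is taken from the commit message (the breakdown into a 1024-byte inline buffer plus bookkeeping is my assumption):

```cpp
#include <cassert>
#include <cstddef>

// Figures from the commit message: 1088 bytes per Entry, a 10000-entry
// ring buffer (m_recent), and two 100-entry vectors (m_new and m_flush).
constexpr std::size_t entry_bytes  = 1088;
constexpr std::size_t ring_entries = 10000;
constexpr std::size_t vec_entries  = 100;

constexpr std::size_t ring_bytes = entry_bytes * ring_entries;     // 10880000, ~10 MB
constexpr std::size_t vec_bytes  = 2 * entry_bytes * vec_entries;  // 217600, 2 x ~106 KB
```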

Note: I've kept 4096 for the StackStringStream via MutableEntry, as these are
already allocated on the heap and cached in a thread-local vector. Generally
there should only be about a dozen of these allocated, so it's worth keeping a
larger buffer.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick
Member Author

batrick commented Sep 15, 2018

valgrind to the rescue, I think I found the issue (an errant strcpy with a non-NUL-terminated std::string_view source).
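
The bug class is easy to reproduce: a `std::string_view` carries a pointer and a length with no NUL-terminator guarantee, so `strcpy(dst, sv.data())` reads past the end of the view. A bounded copy avoids it — `copy_view` below is an illustrative helper, not the actual fix in this PR:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Copy a string_view into a fixed buffer using its explicit length;
// never assume sv.data() is NUL-terminated.
void copy_view(char* dst, std::size_t cap, std::string_view sv) {
  std::size_t n = std::min(sv.size(), cap - 1);
  std::memcpy(dst, sv.data(), n);  // bounded copy
  dst[n] = '\0';                   // terminate ourselves
}
```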

@batrick
Member Author

batrick commented Sep 15, 2018

I believe I've fixed the issue. It passes tests for an fs run: http://pulpito.ceph.com/pdonnell-2018-09-15_04:37:26-fs-wip-pdonnell-testing-20180915.024633-testing-basic-smithi/

The valgrind failure et al. are unrelated.

@tchaikov
Contributor

@tchaikov tchaikov merged commit 1c5f2cc into ceph:master Sep 19, 2018
@batrick batrick deleted the log-entry-stack branch September 19, 2018 22:17
batrick added a commit to batrick/ceph that referenced this pull request Oct 25, 2018
This reverts commit 1c5f2cc, reversing
changes made to 58a08c6.

Conflicts:
	src/log/Log.cc