
[RFC] test/objectstore/store_test: First pass at an OmapBench test#39976

Closed
markhpc wants to merge 1 commit into ceph:main from markhpc:wip-omap-bench-test

Conversation

@markhpc
Member

@markhpc markhpc commented Mar 9, 2021

This is a quick and dirty first pass at a new gtest microbenchmark for omap that heavily borrows from the generic OmapSimple test but could be extended further. It creates a configurable number of random kv pairs with configurable (if you modify the test) key and value lengths. Arguably we may want something a little more sophisticated for kv pairs with common prefixes or other cases that this isn't covering.
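For reference, the kind of random kv generation described above can be sketched roughly like this. This is a minimal standalone sketch; `random_bytes` and `make_random_kvs` are illustrative names, not the PR's actual code, which builds bufferlists and submits them via ObjectStore transactions:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <string>

// Generate a random printable string of the requested length.
std::string random_bytes(std::mt19937& rng, std::size_t len) {
  static const char charset[] =
      "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
  std::uniform_int_distribution<std::size_t> pick(0, sizeof(charset) - 2);
  std::string s;
  s.reserve(len);
  for (std::size_t i = 0; i < len; ++i) {
    s.push_back(charset[pick(rng)]);
  }
  return s;
}

// Build `count` unique random kv pairs with fixed key/value lengths.
// A fixed seed keeps repeated benchmark runs comparable.
std::map<std::string, std::string> make_random_kvs(std::size_t count,
                                                   std::size_t key_len,
                                                   std::size_t val_len) {
  std::mt19937 rng(12345);
  std::map<std::string, std::string> kvs;
  while (kvs.size() < count) {
    kvs[random_bytes(rng, key_len)] = random_bytes(rng, val_len);
  }
  return kvs;
}
```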

For now, here's a quick set of tests from Officinalis using key/value sizes roughly based on what we saw when investigating for the trocksdb folks:

https://docs.google.com/spreadsheets/d/1fNFI8U-JRkU5uaRJzgg5rNxqhgRJFlDB4TsTAVsuYkk/edit?usp=sharing

1M 64 byte keys with 256 byte values:

| Test | memstore | filestore | kstore | bluestore (buffered off) | bluestore (buffered on) |
|---|---|---|---|---|---|
| omap_setkeys | 1.138 | 1.141 | 1.140 | 1.171 | 1.164 |
| omap_get | 0.253 | 2.40 | 1.974 | 1.352 | 2.052 |
| iteration (seek_to_first) | 0.489 | 0.518 | 0.872 | 0.650 | 0.711 |
| iteration (lower_bound) | 0.462 | 0.772 | 0.695 | 2.32 | 1.06 |
| remove | 0.000 | 0.000 | 0.359 | 0.302 | 0.405 |

Edit: A couple of quick passes make these results look relatively repeatable, though so far I have literally just populated the OmapSimple tests with more random data and timed them; I haven't thought about the specifics much yet.

If anyone would like to give it a try, you can run something like:

bin/ceph_test_objectstore --gtest_filter="*StoreTest.OmapBench/2" --bluefs_buffered_io=false

Signed-off-by: Mark Nelson mnelson@redhat.com

Checklist

  • [ ] References tracker ticket
  • [ ] Updates documentation if necessary
  • [ ] Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@markhpc
Member Author

markhpc commented Mar 9, 2021

Additional thoughts: Create multiple collections? Create/Remove lots of kv pairs between repeated iteration tests?

@markhpc
Member Author

markhpc commented Mar 10, 2021

Quick wallclock profile of thread 1 in both bluefs_buffered_io cases repeatedly doing the lower_bound iteration test over and over:

bluefs_buffered_io=false

Thread: 1 (ceph_test_objec) - 1000 samples 

+ 100.00% main
  + 100.00% RUN_ALL_TESTS
    + 100.00% testing::UnitTest::Run
      + 100.00% testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>
        + 100.00% testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>
          + 100.00% testing::internal::UnitTestImpl::RunAllTests
            + 100.00% testing::TestSuite::Run
              + 100.00% testing::TestSuite::Run
                + 100.00% testing::TestInfo::Run
                  + 100.00% testing::TestInfo::Run
                    + 100.00% testing::Test::Run
                      + 100.00% testing::Test::Run
                        + 100.00% testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>
                          + 100.00% testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>
                            + 100.00% StoreTest_OmapBench_Test::TestBody
                              + 55.90% BlueStore::OmapIteratorImpl::next
                              | + 55.90% ShardMergeIteratorImpl::next
                              |   + 55.90% rocksdb::DBIter::Next
                              |     + 34.00% rocksdb::DBIter::FindNextUserEntry
                              |     | + 34.00% rocksdb::DBIter::FindNextUserEntryInternal
                              |     |   + 34.00% rocksdb::UserComparatorWrapper::Compare
                              |     + 21.90% rocksdb::IteratorWrapperBase<rocksdb::Slice>::Next
                              |       + 21.90% rocksdb::MergingIterator::NextAndGetResult
                              |         + 21.90% rocksdb::MergingIterator::Next
                              |           + 21.90% rocksdb::IteratorWrapperBase<rocksdb::Slice>::Next
                              |             + 21.90% rocksdb::(anonymous namespace)::LevelIterator::NextAndGetResult
                              |               + 21.90% rocksdb::(anonymous namespace)::LevelIterator::NextImpl
                              |                 + 21.90% rocksdb::IteratorWrapperBase<rocksdb::Slice>::Next
                              |                   + 21.90% rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::NextAndGetResult
                              |                     + 21.90% rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::Next
                              |                       + 21.90% rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindKeyForward
                              |                         + 21.90% rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindBlockForward
                              |                           + 21.90% rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock
                              |                             + 21.90% rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>
                              |                               + 21.90% rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::Block>
                              |                                 + 21.90% rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>
                              |                                   + 11.60% rocksdb::BlockFetcher::ReadBlockContents
                              |                                   | + 11.60% rocksdb::BlockFetcher::TryGetFromPrefetchBuffer
                              |                                   |   + 11.60% rocksdb::FilePrefetchBuffer::TryReadFromCache
                              |                                   |     + 11.60% rocksdb::FilePrefetchBuffer::Prefetch
                              |                                   |       + 11.60% rocksdb::RandomAccessFileReader::Read
                              |                                   |         + 11.60% rocksdb::LegacyRandomAccessFileWrapper::Read
                              |                                   |           + 11.60% BlueRocksRandomAccessFile::Read
                              |                                   |             + 11.60% BlueFS::read_random
                              |                                   |               + 11.60% BlueFS::_read_random
                              |                                   |                 + 11.60% KernelDevice::read_random
                              |                                   |                   + 11.60% pread64
                              |                                   + 10.30% rocksdb::BlockBasedTable::GetDataBlockFromCache<rocksdb::Block>
                              |                                     + 10.30% rocksdb::BlockBasedTable::GetEntryFromCache
                              |                                       + 10.30% rocksdb_cache::ShardedCache::Lookup
                              |                                         + 10.30% rocksdb_cache::ShardedCache::HashSlice
                              |                                           + 10.30% ceph_str_hash
                              |                                             + 10.30% ceph_str_hash_rjenkins
                              + 33.90% std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::operator[]
                              | + 33.90% std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::piecewise_construct_t const&, std::tuple<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&>&&, std::tuple<>&&)
                              |   + 33.90% std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_insert_node
                              |     + 33.90% std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)
                              + 5.20% BlueStore::OmapIteratorImpl::valid
                              | + 5.20% ShardMergeIteratorImpl::raw_key[abi:cxx11]()
                              |   + 5.20% ShardMergeIteratorImpl::key[abi:cxx11]()
                              |     + 5.20% rocksdb::Slice::ToString[abi:cxx11](bool) const
                              |       + 5.20% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign
                              |         + 5.20% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace
                              + 5.00% BlueStore::OmapIteratorImpl::value
                                + 5.00% ShardMergeIteratorImpl::value
                                  + 5.00% to_bufferlist
                                    + 5.00% ceph::buffer::v15_2_0::ptr::ptr
                                      + 5.00% ceph::buffer::v15_2_0::copy
                                        + 5.00% ceph::buffer::v15_2_0::create_aligned
                                          + 5.00% ceph::buffer::v15_2_0::create_aligned_in_mempool
                                            + 5.00% ceph::buffer::v15_2_0::raw_combined::create
                                              + 5.00% tc_posix_memalign
                                                + 5.00% tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long))
                                                  + 5.00% tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)
                                                    + 5.00% tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)
                                                      + 5.00% tcmalloc::CentralFreeList::Populate()

bluefs_buffered_io=true

Thread: 1 (ceph_test_objec) - 1000 samples 

+ 100.00% main
  + 100.00% RUN_ALL_TESTS
    + 100.00% testing::UnitTest::Run
      + 100.00% testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>
        + 100.00% testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>
          + 100.00% testing::internal::UnitTestImpl::RunAllTests
            + 100.00% testing::TestSuite::Run
              + 100.00% testing::TestSuite::Run
                + 100.00% testing::TestInfo::Run
                  + 100.00% testing::TestInfo::Run
                    + 100.00% testing::Test::Run
                      + 100.00% testing::Test::Run
                        + 100.00% testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>
                          + 100.00% testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>
                            + 100.00% StoreTest_OmapBench_Test::TestBody
                              + 44.50% BlueStore::OmapIteratorImpl::next
                              | + 21.30% ceph::mono_clock::now
                              | | + 21.30% clock_gettime@GLIBC_2.2.5
                              | |   + 21.30% clock_gettime
                              | + 15.10% ShardMergeIteratorImpl::next
                              | | + 15.10% rocksdb::DBIter::Next
                              | |   + 15.10% rocksdb::IteratorWrapperBase<rocksdb::Slice>::Next
                              | |     + 15.10% rocksdb::MergingIterator::NextAndGetResult
                              | |       + 15.10% rocksdb::MergingIterator::Next
                              | |         + 15.10% rocksdb::IteratorWrapperBase<rocksdb::Slice>::Next
                              | |           + 15.10% rocksdb::(anonymous namespace)::LevelIterator::NextAndGetResult
                              | |             + 15.10% rocksdb::(anonymous namespace)::LevelIterator::NextImpl
                              | |               + 14.90% rocksdb::IteratorWrapperBase<rocksdb::Slice>::Next
                              | |               | + 14.90% rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::NextAndGetResult
                              | |               |   + 14.90% rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::Next
                              | |               |     + 14.90% rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindKeyForward
                              | |               |       + 14.90% rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindBlockForward
                              | |               |         + 14.90% rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock
                              | |               |           + 9.30% rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>
                              | |               |           | + 9.30% rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::Block>
                              | |               |           |   + 7.40% rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>
                              | |               |           |   | + 4.10% rocksdb::BlockFetcher::ReadBlockContents
                              | |               |           |   | | + 4.10% rocksdb::RandomAccessFileReader::Read
                              | |               |           |   | |   + 4.10% rocksdb::LegacyRandomAccessFileWrapper::Read
                              | |               |           |   | |     + 4.10% BlueRocksRandomAccessFile::Read
                              | |               |           |   | |       + 4.10% BlueFS::read_random
                              | |               |           |   | |         + 4.10% BlueFS::_read_random
                              | |               |           |   | |           + 4.10% ceph::buffer::v15_2_0::list::iterator_impl<false>::copy
                              | |               |           |   | |             + 4.10% __memmove_avx_unaligned_erms
                              | |               |           |   | + 3.20% rocksdb::BlockBasedTable::PutDataBlockToCache<rocksdb::Block>
                              | |               |           |   |   + 3.20% rocksdb_cache::ShardedCache::Insert
                              | |               |           |   |     + 3.20% rocksdb_cache::BinnedLRUCacheShard::Insert
                              | |               |           |   |       + 3.20% rocksdb_cache::BinnedLRUHandle::Free
                              | |               |           |   |         + 3.20% rocksdb::(anonymous namespace)::DeleteCachedEntry<rocksdb::Block>
                              | |               |           |   |           + 3.20% rocksdb::(anonymous namespace)::DeleteCachedEntry<rocksdb::Block>
                              | |               |           |   |             + 3.20% tc_free
                              | |               |           |   + 1.90% rocksdb::Status::operator=
                              | |               |           + 5.60% rocksdb::RandomAccessFileReader::Prefetch
                              | |               |             + 5.60% rocksdb::LegacyRandomAccessFileWrapper::Prefetch
                              | |               |               + 5.60% BlueRocksRandomAccessFile::Prefetch
                              | |               |                 + 5.60% BlueFS::read
                              | |               |                   + 5.60% BlueFS::_read
                              | |               |                     + 5.60% KernelDevice::read
                              | |               |                       + 5.60% pread64
                              | |               + 0.20% rocksdb::(anonymous namespace)::LevelIterator::SkipEmptyFileForward
                              | + 8.10% BlueStore::log_latency
                              |   + 8.10% ceph::common::PerfCounters::tinc
                              |     + 0.20% std::__atomic_base<unsigned long>::operator++
                              + 39.00% BlueStore::OmapIteratorImpl::value
                              | + 37.60% ShardMergeIteratorImpl::value
                              |   + 37.60% to_bufferlist
                              |     + 31.50% ceph::buffer::v15_2_0::ptr::ptr
                              |     | + 31.50% ceph::buffer::v15_2_0::copy
                              |     |   + 31.50% ceph::buffer::v15_2_0::create_aligned
                              |     |     + 31.50% ceph::buffer::v15_2_0::create_aligned_in_mempool
                              |     |       + 31.50% ceph::buffer::v15_2_0::raw_combined::create
                              |     |         + 31.50% tc_posix_memalign
                              |     |           + 29.40% tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long))
                              |     |           | + 29.40% tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)
                              |     |           |   + 29.40% tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)
                              |     |           |     + 29.40% tcmalloc::CentralFreeList::Populate()
                              |     |           + 2.10% tc_memalign
                              |     + 6.10% ceph::buffer::v15_2_0::list::append
                              |       + 6.10% ceph::buffer::v15_2_0::list::push_back
                              |         + 6.10% ceph::buffer::v15_2_0::ptr_node::create<ceph::buffer::v15_2_0::ptr>
                              |           + 6.10% tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long))
                              |             + 6.10% tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)
                              |               + 6.10% tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)
                              |                 + 6.10% tcmalloc::CentralFreeList::Populate()
                              + 8.40% std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::operator[]
                              | + 8.00% std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::lower_bound
                              | | + 8.00% std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::lower_bound
                              | |   + 8.00% std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_lower_bound
                              | |     + 8.00% std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::operator()
                              | |       + 8.00% std::operator< <char, std::char_traits<char>, std::allocator<char> >
                              | |         + 8.00% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::compare
                              | |           + 8.00% std::char_traits<char>::compare
                              | |             + 8.00% __memcmp_avx2_movbe
                              | + 0.40% std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list>, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> >, std::piecewise_construct_t const&, std::tuple<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&>&&, std::tuple<>&&)
                              + 4.90% BlueStore::OmapIteratorImpl::valid
                              | + 4.30% ShardMergeIteratorImpl::raw_key[abi:cxx11]()
                              |   + 4.30% ShardMergeIteratorImpl::key[abi:cxx11]()
                              |     + 3.10% rocksdb::Slice::ToString[abi:cxx11](bool) const
                              |       + 3.10% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign
                              |         + 3.10% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace
                              |           + 3.10% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_mutate
                              |             + 3.10% tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long))
                              |               + 3.10% tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)
                              |                 + 3.10% tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)
                              |                   + 3.10% tcmalloc::CentralFreeList::Populate()
                              + 3.20% BlueStore::OmapIteratorImpl::key[abi:cxx11]()
                                + 2.20% BlueStore::Onode::decode_omap_key
                                | + 2.20% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::substr
                                |   + 2.20% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string
                                |     + 2.20% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*>
                                |       + 2.20% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct_aux<char const*>
                                |         + 2.20% std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*>
                                |           + 2.20% tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long))
                                |             + 2.20% tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)
                                |               + 2.20% tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)
                                |                 + 2.20% tcmalloc::CentralFreeList::Populate()
                                + 1.00% ShardMergeIteratorImpl::raw_key[abi:cxx11]()

@markhpc
Member Author

markhpc commented Mar 10, 2021

Going to retry the profile using the detach/attach method in gdbpmp, which is much slower but may provide better sample quality.

@mattbenjamin
Contributor

+ 29.40% tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long))
  + 29.40% tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)
    + 29.40% tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)
      + 29.40% tcmalloc::CentralFreeList::Populate()

@markhpc
Member Author

markhpc commented Mar 10, 2021

@mattbenjamin I probably should run with:

env -i TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728

which will probably help (and is, I think/hope, how we are still running OSDs in production). This is why I keep harping on trying to recycle memory (or objects!) for short-lived stuff, though. What a pain.
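As an aside, the sort of recycling I mean could look roughly like this trivial free-list sketch. This is illustrative only; `StringPool` is a made-up name, not anything in ceph, and a real implementation would need to think about thread safety and bounded pool size:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Minimal free-list sketch: reuse short-lived string buffers instead of
// round-tripping every allocation through the allocator's central free list
// (the tcmalloc CentralFreeList path visible in the profiles above).
class StringPool {
  std::vector<std::string> free_;

 public:
  // Hand out a cleared buffer with at least `reserve_len` capacity,
  // reusing a previously returned one when available.
  std::string get(std::size_t reserve_len) {
    if (!free_.empty()) {
      std::string s = std::move(free_.back());
      free_.pop_back();
      s.clear();
      s.reserve(reserve_len);
      return s;
    }
    std::string s;
    s.reserve(reserve_len);
    return s;
  }

  // Return a buffer to the pool instead of letting it be destroyed.
  void put(std::string&& s) { free_.push_back(std::move(s)); }

  std::size_t pooled() const { return free_.size(); }
};
```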

Having said that, the thing I'm especially noticing is that we are indeed following the different prefetch paths in rocksdb depending on whether buffered IO is used. With bluefs_buffered_io = false rocksdb does prefetch in the FilePrefetchBuffer via the BlockFetcher, but in the bluefs_buffered_io=true case, it does it directly in the NewDataBlockIterator. This was what we thought happened based on the analysis in #38044 (comment) so it's nice to see we were right.

Also I'm noticing that in the direct case we are spending a ton of time in ShardedCache::Lookup, but never in ShardedCache::Insert like in the buffered case! I'm not sure what that means exactly yet, but clearly we're seeing very different behavior in rocksdb when buffered IO is enabled.

@markhpc markhpc requested a review from mattbenjamin March 10, 2021 00:53
@markhpc
Member Author

markhpc commented Mar 10, 2021

Ok, final update for tonight. I manually disabled Prefetch in bluerocksenv and saw no change at all in performance, but I haven't profiled yet. What I have noticed is that in the direct IO case, the first lower_bound iteration test is slow, then subsequent ones are fast for a little while, before becoming slow again for good:

lower_bound - 1000000 random 64 byte keys, 256 byte values in 2.307001
lower_bound - 1000000 random 64 byte keys, 256 byte values in 0.723705
lower_bound - 1000000 random 64 byte keys, 256 byte values in 0.661072
lower_bound - 1000000 random 64 byte keys, 256 byte values in 0.683104
lower_bound - 1000000 random 64 byte keys, 256 byte values in 0.810730
lower_bound - 1000000 random 64 byte keys, 256 byte values in 0.782861
lower_bound - 1000000 random 64 byte keys, 256 byte values in 0.733968
lower_bound - 1000000 random 64 byte keys, 256 byte values in 0.705092
lower_bound - 1000000 random 64 byte keys, 256 byte values in 0.673418
lower_bound - 1000000 random 64 byte keys, 256 byte values in 0.769384
lower_bound - 1000000 random 64 byte keys, 256 byte values in 0.870825
lower_bound - 1000000 random 64 byte keys, 256 byte values in 2.334309
lower_bound - 1000000 random 64 byte keys, 256 byte values in 2.299980
lower_bound - 1000000 random 64 byte keys, 256 byte values in 2.338020
lower_bound - 1000000 random 64 byte keys, 256 byte values in 2.303266
lower_bound - 1000000 random 64 byte keys, 256 byte values in 2.315797

This may be key to understanding what's going on.

@markhpc
Member Author

markhpc commented Mar 10, 2021

I think the drop in performance is likely due to hitting the 4GB memory limit defined by the osd_memory_target. Going to rewrite the benchmark to submit data in batches.

@markhpc markhpc force-pushed the wip-omap-bench-test branch from 41e44a0 to 7bf8ebe Compare March 10, 2021 23:49
@markhpc
Member Author

markhpc commented Mar 10, 2021

Ok, the slowdown in the previous results was due to in-memory data structures lying around, causing the priority cache manager to shrink the in-memory caches since it wants to keep process memory below the osd_memory_target. Now we can optionally do omap_setkeys in batches to stop that from happening. The benchmark can also now span OMAP across multiple objects, which further reduces the submission size.
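The batching idea itself is simple; splitting the submissions could be sketched like this (illustrative only, not the PR's actual transaction code; `batch_sizes` is a made-up helper):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Split `total` kv insertions into submissions of at most `batch_size`,
// so no single transaction holds the whole dataset in memory at once.
// Assumes batch_size > 0.
std::vector<std::size_t> batch_sizes(std::size_t total,
                                     std::size_t batch_size) {
  std::vector<std::size_t> batches;
  for (std::size_t done = 0; done < total; done += batch_size) {
    batches.push_back(std::min(batch_size, total - done));
  }
  return batches;
}
```

Each batch would then be built, submitted as its own omap_setkeys transaction, and released before the next one is constructed, keeping peak benchmark memory roughly proportional to the batch size rather than the dataset size.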

Current test results for 100K objects with 100 keys each:

| Test | memstore | filestore | kstore | bluestore (buffered off) | bluestore (buffered on) |
|---|---|---|---|---|---|
| omap_setkeys | 59.575 | 15.756 | 7.998 | 64.254 | 64.143 |
| omap_get | 57.137 | 12.453 | 177.809 | 128.341 | 63.937 |
| iteration (seek_to_first) | 3.421 | 7.069 | 9.164 | 49.293 | 8.744 |
| iteration (lower_bound) | 3.299 | 6.946 | 8.978 | 50.309 | 8.770 |
| remove | 88.393 | 28.828 | 56.890 | 128.338 | 92.671 |

These results look pretty odd, with some suspiciously good numbers for filestore (and for kstore on setkeys). The speed advantage for bluestore with buffered IO does seem apparent in the iteration tests, though.

Edit: I should add that filestore has access to a lot (384GB!) of pagecache on this node. I probably need to figure out if I can use a cgroup to limit pagecache or something.

@markhpc
Member Author

markhpc commented Mar 11, 2021

Ok, I ran some tests with different cgroup memory limits. I'm still wary of these filestore numbers. We should also keep in mind that this doesn't exactly represent how an OSD would behave in real situations (much less parallelism!). Also remember that filestore completely ignores the osd_memory_target and uses RocksDB much less aggressively (relying on the filesystem for more).

In any event, one thing I notice in the iteration tests is that bluestore can do very well with direct IO when there is enough memory to keep all omap blocks in the block cache. I'm back to strongly suspecting that the block cache is being polluted by something else when iteration gets slow. Since this is master it should no longer be onodes, but we might still be thrashing the cache with other data (allocation metadata seems like a strong contender). Need to investigate more.

omap_setkeys

| cgroup limit | osd_memory_target | filestore | bluestore (buffered) | bluestore (direct) |
|---|---|---|---|---|
| 4GB | 4GB | 15.782 | 64.990 | 64.948 |
| 8GB | 4GB | -- | 64.296 | 65.248 |
| 8GB | 8GB | 15.609 | 65.106 | 64.738 |
| 16GB | 4GB | -- | 64.720 | 64.775 |
| 16GB | 8GB | -- | 64.624 | 64.710 |
| 16GB | 16GB | 15.403 | 64.818 | 64.763 |

omap_get

| cgroup limit | osd_memory_target | filestore | bluestore (buffered) | bluestore (direct) |
|---|---|---|---|---|
| 4GB | 4GB | 12.315 | 114.196 | 129.34 |
| 8GB | 4GB | -- | 94.645 | 131.341 |
| 8GB | 8GB | 12.406 | 94.043 | 130.289 |
| 16GB | 4GB | -- | 64.611 | 133.593 |
| 16GB | 8GB | -- | 64.606 | 132.085 |
| 16GB | 16GB | 12.490 | 64.741 | 135.179 |

seek_to_first iteration

| cgroup limit | osd_memory_target | filestore | bluestore (buffered) | bluestore (direct) |
|---|---|---|---|---|
| 4GB | 4GB | 7.057 | 41.895 | 49.188 |
| 8GB | 4GB | -- | 13.373 | 48.902 |
| 8GB | 8GB | 6.965 | 5.757 | 5.907 |
| 16GB | 4GB | -- | 8.933 | 49.448 |
| 16GB | 8GB | -- | 5.632 | 6.018 |
| 16GB | 16GB | 7.029 | 5.903 | 6.099 |

lower_bound iteration

| cgroup limit | osd_memory_target | filestore | bluestore (buffered) | bluestore (direct) |
|---|---|---|---|---|
| 4GB | 4GB | 6.890 | 43.269 | 50.368 |
| 8GB | 4GB | -- | 8.865 | 50.187 |
| 8GB | 8GB | 6.921 | 6.028 | 5.741 |
| 16GB | 4GB | -- | 8.851 | 50.731 |
| 16GB | 8GB | -- | 6.228 | 5.653 |
| 16GB | 16GB | 7.042 | 5.971 | 5.771 |

remove

| cgroup limit | osd_memory_target | filestore | bluestore (buffered) | bluestore (direct) |
|---|---|---|---|---|
| 4GB | 4GB | 32.930 | 171.825 | 128.401 |
| 8GB | 4GB | -- | 93.665 | 131.681 |
| 8GB | 8GB | 29.363 | 93.512 | 105.095 |
| 16GB | 4GB | -- | 94.279 | 128.096 |
| 16GB | 8GB | -- | 93.005 | 105.833 |
| 16GB | 16GB | 29.277 | 93.272 | 105.529 |

@markhpc
Member Author

markhpc commented Mar 11, 2021

I'm trusting the filestore results a little more now. We do see some slowdown at much lower cgroup memory limits (but it's still surprisingly good):

Filestore cgroup memory.limit_in_bytes tests:

| test | 1GB | 2GB | 4GB | 8GB | 16GB |
|---|---|---|---|---|---|
| omap_setkeys | 14.739 | 14.918 | 15.782 | 15.609 | 15.403 |
| omap_get | 38.723 | 37.615 | 12.315 | 12.406 | 12.490 |
| iteration (seek_to_first) | 24.981 | 21.962 | 7.057 | 6.965 | 5.903 |
| iteration (lower_bound) | 23.549 | 21.329 | 6.890 | 6.922 | 5.971 |
| remove | 63.016 | 58.143 | 32.930 | 29.363 | 29.277 |

Filestore really does appear to be faster in these tests, likely due to significantly lower RocksDB usage and thus a greater ability to keep things cached with less memory (though I should say that bluestore is still far faster in many non-omap tests!).

Contributor

@aclamk aclamk left a comment


Nice, simple and very functional bench test.

@markhpc markhpc force-pushed the wip-omap-bench-test branch from 7bf8ebe to 701a943 Compare March 24, 2021 19:59
@markhpc
Member Author

markhpc commented Mar 24, 2021

@ifed01 @aclamk I've updated the version here to mirror what I've been testing with locally. Please take a look!

@markhpc
Member Author

markhpc commented Mar 24, 2021

Contributor

@aclamk aclamk left a comment


Good OMAP test.
Only cosmetic changes required.

Signed-off-by: Mark Nelson <mnelson@redhat.com>
@markhpc markhpc force-pushed the wip-omap-bench-test branch from 701a943 to 6ed95b8 Compare March 29, 2021 17:22
@markhpc
Member Author

markhpc commented Mar 29, 2021

Before we merge this, we probably want to figure out whether we want to allow user-supplied options, and what the defaults should be. This test can take an extremely long time (1+ hours) to run in some configurations (low memory, direct IO, unsorted).

@ronen-fr
Contributor

> Before we merge this, we probably want to figure out if we want to be able to pass in user-supplied options and what the default options should be. This test can take an extremely long time (1 hour+) to run in some configurations (low memory, direct IO, unsorted).

At least: controlling the number of objects (to enable short verification tests)

@stale

stale bot commented Jun 11, 2021

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jun 11, 2021
@neha-ojha
Member

@markhpc ping

@stale stale bot removed the stale label Jun 15, 2021
@markhpc
Member Author

markhpc commented Jun 25, 2021

Probably the next step for this is to figure out how to turn it into an independent benchmark. It talks to the objectstore directly, so it might need some rethinking if we want to run it (or something like it) against existing clusters. It wouldn't be impossible, though, to just rip the gtest part out and replace it with some kind of CLI wrapper.

@stale

stale bot commented Jan 9, 2022


@stale stale bot added the stale label Jan 9, 2022
@neha-ojha
Member

we still want this!

@stale stale bot removed the stale label Jan 10, 2022
@rzarzynski
Contributor

As Neha already said:

> we still want this!

@djgalloway djgalloway changed the base branch from master to main July 6, 2022 00:00
@github-actions

github-actions bot commented Sep 4, 2022


@github-actions github-actions bot added the stale label Sep 4, 2022
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Oct 28, 2022
@markhpc markhpc mentioned this pull request Oct 31, 2024