Skip to content

cls/rgw: [WIP] Make rgw_bucket_list encode bls rather than dir entries#29980

Closed
markhpc wants to merge 1 commit intoceph:masterfrom
markhpc:wip-rgw-cls-list-encode
Closed

cls/rgw: [WIP] Make rgw_bucket_list encode bls rather than dir entries#29980
markhpc wants to merge 1 commit intoceph:masterfrom
markhpc:wip-rgw-cls-list-encode

Conversation

@markhpc
Copy link
Member

@markhpc markhpc commented Aug 28, 2019

This PR is a first pass attempt to examine if we can gain performance by having rgw_bucket_list skip encoding dir entries it just decoded from the KeyValueDB and instead just directly encode the BL and pass that to RGW instead. In performance tests this appeared to result in roughly a 50% performance gain when listing a 6.4M object bucket using hsbench. Note this PR does yet implement any kind of version checking or backward compatibility support. Additional gains can be realized by removing some of the debug 20 CLS_LOG calls and potentially by restructuring the rgw_bucket_dir_entry so we can only decode filters in the OSD rather the entire thing.

Before After
Time 1103.27 722.46
Pages/s (1000 entries) 5.8 8.86
Min Lat (ms) 41.00 40.65
Avg Lat (ms) 172.41 112.90
99% Lat (ms) 184.49 128.13
Max Lat (ms) 1548.24 673.33

A wallclock profile of the tp_osd_tp thread before:

Thread: 8 (tp_osd_tp) - 1000 samples 

+ 100.00% clone
  + 100.00% start_thread
    + 100.00% ShardedThreadPool::WorkThreadSharded::entry
      + 100.00% ShardedThreadPool::shardedthreadpool_worker
        + 100.00% OSD::ShardedOpWQ::_process
          + 48.50% std::condition_variable::wait(std::unique_lock<std::mutex>&)
          | + 48.50% pthread_cond_wait@@GLIBC_2.3.2
          + 48.20% run
          | + 48.20% PGOpItem::run
          |   + 48.20% OSD::dequeue_op
          |     + 48.20% PrimaryLogPG::do_request
          |       + 48.20% PrimaryLogPG::do_op
          |         + 48.00% PrimaryLogPG::execute_ctx
          |         | + 47.60% PrimaryLogPG::prepare_transaction
          |         | | + 47.60% PrimaryLogPG::do_osd_ops
          |         | |   + 47.50% ClassHandler::ClassMethod::exec
          |         | |   | + 47.50% visit<ClassHandler::ClassMethod::exec(cls_method_context_t, ceph::bufferlist&, ceph::bufferlist&)::<lambda(auto:27)>, std::variant<int (*)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14_2_0::list*), int (*)(void*, char*, int, char**, int*)>&>
          |         | |   |   + 47.50% std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<void (*)(ClassHandler::ClassMethod::exec(cls_method_context_t, ceph::bufferlist&, ceph::bufferlist&)::<lambda(auto:27)>&&, std::variant<int (*)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14_2_0::list*), int (*)(void*, char*, int, char**, int*)>&)>, std::tuple<std::variant<int (*)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14_2_0::list*), int (*)(void*, char*, int, char**, int*)>&>, std::integer_sequence<long unsigned int, 0> >::__visit_invoke(<unknown type in /usr/local/bin/ceph-osd, CU 0x109ee4f, DIE 0x11395de>, std::variant<int (*)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14_2_0::list*), int (*)(void*, char*, int, char**, int*)> &)
          |         | |   |     + 47.50% __invoke<ClassHandler::ClassMethod::exec(cls_method_context_t, ceph::bufferlist&, ceph::bufferlist&)::<lambda(auto:27)>, int (*&)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14_2_0::list*)>
          |         | |   |       + 47.50% __invoke_impl<void, ClassHandler::ClassMethod::exec(cls_method_context_t, ceph::bufferlist&, ceph::bufferlist&)::<lambda(auto:27)>, int (*&)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14_2_0::list*)>
          |         | |   |         + 47.50% operator()<int (*)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14_2_0::list*)>
          |         | |   |           + 47.50% rgw_bucket_list
          |         | |   |             + 22.90% get_obj_vals
          |         | |   |             | + 22.90% cls_cxx_map_get_vals
          |         | |   |             |   + 16.50% PrimaryLogPG::do_osd_ops
          |         | |   |             |   | + 5.00% BlueStore::OmapIteratorImpl::next
          |         | |   |             |   | | + 3.20% next
          |         | |   |             |   | | + 1.10% BlueStore::log_latency
          |         | |   |             |   | | + 0.40% now
          |         | |   |             |   | | + 0.20% ~shared_lock
          |         | |   |             |   | + 3.10% BlueStore::OmapIteratorImpl::value
          |         | |   |             |   | + 2.70% BlueStore::OmapIteratorImpl::valid
          |         | |   |             |   | + 2.70% BlueStore::OmapIteratorImpl::key
          |         | |   |             |   | + 0.70% ceph::buffer::v14_2_0::list::claim_append
          |         | |   |             |   | + 0.60% ceph::encode
          |         | |   |             |   | + 0.40% encode
          |         | |   |             |   | + 0.40% BlueStore::get_omap_iterator
          |         | |   |             |   | + 0.30% ~list
          |         | |   |             |   | + 0.30% ~basic_string
          |         | |   |             |   | + 0.10% ~shared_ptr
          |         | |   |             |   | + 0.10% std::string::substr
          |         | |   |             |   | + 0.10% BlueStore::OmapIteratorImpl::upper_bound
          |         | |   |             |   + 4.10% ceph::decode<std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >, denc_traits<std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >, void> >
          |         | |   |             |   + 2.30% ~vector
          |         | |   |             + 5.90% cls_log
          |         | |   |             | + 5.80% vsnprintf
          |         | |   |             |   + 5.80% __vsnprintf_chk
          |         | |   |             |     + 5.50% vfprintf
          |         | |   |             |     | + 0.90% __GI__IO_default_xsputn
          |         | |   |             |     | + 0.30% strchrnul
          |         | |   |             |     | + 0.30% _itoa_word
          |         | |   |             |     | + 0.10% (anonymous namespace)::free_null_or_invalid
          |         | |   |             |     + 0.10% _IO_str_init_static_internal
          |         | |   |             |     + 0.10% _IO_no_init
          |         | |   |             + 5.10% decode
          |         | |   |             | + 5.10% rgw_bucket_dir_entry::decode
          |         | |   |             |   + 4.10% decode
          |         | |   |             |   + 0.90% ceph::decode
          |         | |   |             |   + 0.10% decode_packed_val<unsigned long>
          |         | |   |             + 4.60% encode
          |         | |   |             + 2.80% ~rgw_cls_list_ret
          |         | |   |             + 2.50% operator[]
          |         | |   |             + 1.50% ~map
          |         | |   |             + 0.70% operator=
          |         | |   |             + 0.50% operator++
          |         | |   |             + 0.40% cbegin
          |         | |   |             + 0.20% read_bucket_header
          |         | |   |             + 0.10% rgw_bucket_dir_entry::~rgw_bucket_dir_entry
          |         | |   + 0.10% ceph::buffer::v14_2_0::list::claim_append
          |         | + 0.30% PrimaryLogPG::complete_read_ctx
          |         | + 0.10% MOSDOpReply::MOSDOpReply
          |         + 0.10% PrimaryLogPG::maybe_await_blocked_head
          |         + 0.10% OSD::init_op_flags
          + 3.20% PG::lock
          + 0.10% unlock

And after:

Thread: 15 (tp_osd_tp) - 1000 samples 

+ 100.00% clone
  + 100.00% start_thread
    + 100.00% ShardedThreadPool::WorkThreadSharded::entry
      + 100.00% ShardedThreadPool::shardedthreadpool_worker
        + 100.00% OSD::ShardedOpWQ::_process
          + 50.00% std::condition_variable::wait(std::unique_lock<std::mutex>&)
          | + 50.00% pthread_cond_wait@@GLIBC_2.3.2
          + 49.60% run
          | + 49.60% PGOpItem::run
          |   + 49.60% OSD::dequeue_op
          |     + 49.60% PrimaryLogPG::do_request
          |       + 49.50% PrimaryLogPG::do_op
          |         + 49.40% PrimaryLogPG::execute_ctx
          |         | + 46.00% PrimaryLogPG::prepare_transaction
          |         | | + 46.00% PrimaryLogPG::do_osd_ops
          |         | |   + 45.30% ClassHandler::ClassMethod::exec
          |         | |   | + 45.30% visit<ClassHandler::ClassMethod::exec(cls_method_context_t, ceph::bufferlist&, ceph::bufferlist&)::<lambda(auto:27)>, std::variant<int (*)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14_2_0::li
st*), int (*)(void*, char*, int, char**, int*)>&>
          |         | |   |   + 45.30% std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<void (*)(ClassHandler::ClassMethod::exec(cls_method_context_t, ceph::bufferlist&, ceph::bufferlist&)::<lambda(auto:2
7)>&&, std::variant<int (*)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14_2_0::list*), int (*)(void*, char*, int, char**, int*)>&)>, std::tuple<std::variant<int (*)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14_2_0::list
*), int (*)(void*, char*, int, char**, int*)>&>, std::integer_sequence<long unsigned int, 0> >::__visit_invoke(<unknown type in /usr/local/bin/ceph-osd, CU 0x109ee4f, DIE 0x11395de>, std::variant<int (*)(void*, ceph::buffer::v14_2_0::list
*, ceph::buffer::v14_2_0::list*), int (*)(void*, char*, int, char**, int*)> &)
          |         | |   |     + 45.30% __invoke<ClassHandler::ClassMethod::exec(cls_method_context_t, ceph::bufferlist&, ceph::bufferlist&)::<lambda(auto:27)>, int (*&)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14_2_0::list*)>
          |         | |   |       + 45.30% __invoke_impl<void, ClassHandler::ClassMethod::exec(cls_method_context_t, ceph::bufferlist&, ceph::bufferlist&)::<lambda(auto:27)>, int (*&)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14
_2_0::list*)>
          |         | |   |         + 45.30% operator()<int (*)(void*, ceph::buffer::v14_2_0::list*, ceph::buffer::v14_2_0::list*)>
          |         | |   |           + 45.30% rgw_bucket_list
          |         | |   |             + 25.70% get_obj_vals
          |         | |   |             | + 25.70% cls_cxx_map_get_vals
          |         | |   |             |   + 17.50% PrimaryLogPG::do_osd_ops
          |         | |   |             |   | + 7.00% BlueStore::OmapIteratorImpl::next
          |         | |   |             |   | | + 4.30% next
          |         | |   |             |   | | + 1.50% BlueStore::log_latency
          |         | |   |             |   | | + 0.80% now
          |         | |   |             |   | | + 0.20% shared_lock
          |         | |   |             |   | + 2.60% BlueStore::OmapIteratorImpl::value
          |         | |   |             |   | + 2.50% BlueStore::OmapIteratorImpl::key
          |         | |   |             |   | + 1.60% BlueStore::OmapIteratorImpl::valid
          |         | |   |             |   | + 1.00% encode
          |         | |   |             |   | + 0.90% ceph::encode
          |         | |   |             |   | + 0.80% ceph::buffer::v14_2_0::list::claim_append
          |         | |   |             |   | + 0.40% ~list
          |         | |   |             |   | + 0.30% BlueStore::get_omap_iterator
          |         | |   |             |   | + 0.20% BlueStore::OmapIteratorImpl::upper_bound
          |         | |   |             |   | + 0.10% ~basic_string
          |         | |   |             |   | + 0.10% operator->
          |         | |   |             |   + 6.70% ceph::decode<std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >, denc_traits<std::map
<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >, void> >
          |         | |   |             |   | + 3.90% decode
          |         | |   |             |   | + 1.60% ceph::buffer::v14_2_0::list::iterator_impl<true>::copy_shallow
          |         | |   |             |   | + 1.20% ceph::buffer::v14_2_0::list::iterator_impl<true>::advance
          |         | |   |             |   + 1.50% ~vector
          |         | |   |             + 4.90% decode
          |         | |   |             + 4.50% cls_log
          |         | |   |             + 2.90% encode
          |         | |   |             + 2.40% operator[]
          |         | |   |             + 1.40% ~map
          |         | |   |             + 0.90% operator=
          |         | |   |             + 0.80% operator++
          |         | |   |             + 0.50% ~rgw_cls_list_ret
          |         | |   |             + 0.40% rgw_bucket_dir_entry::~rgw_bucket_dir_entry
          |         | |   |             + 0.20% cbegin
          |         | |   |             + 0.10% read_bucket_header
          |         | |   + 0.60% ceph::buffer::v14_2_0::list::claim_append
          |         | |   + 0.10% ~basic_string
          |         | + 1.90% PrimaryLogPG::complete_read_ctx
          |         | + 1.50% MOSDOpReply::MOSDOpReply
          |         + 0.10% MOSDOp::finish_decode
          + 0.30% PG::lock
          + 0.10% operator[]

Signed-off-by: Mark Nelson mnelson@redhat.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard backend
  • jenkins test docs
  • jenkins render docs

Signed-off-by: Mark Nelson <mnelson@redhat.com>
@mattbenjamin
Copy link
Contributor

@markhpc I don't love this, though I'd like the perf. I'd like to go over some of the comments in the description--why should we be concerned about cls_log(20) statements--are we not getting this nearly for free as hoped?

@markhpc
Copy link
Member Author

markhpc commented Aug 29, 2019

@mattbenjamin Not super in love with this either to be honest. I wonder if we can do something clever to construct the BL without the directory entries but then leave the structure and decode alone so it's backwards compatible and RGW can stay the same. Ultimately if we can get encode/decode mostly out of the way in the CLS it's going to be a big help imho.

Regarding performance: I'm going to run some additional tests tomorrow. I realized tonight that I was running with dynamic bucket resharding enabled which can skew the results if the resharding happens at different start offsets.

Regarding cls logging: see the 5.9% and 4.5% cls_log lines in the callgraphs. We're definitely spending some time there.

The real win imho is going to be either your idea regarding the hashing scheme changes or some kind of cache (perhaps even OSD side) to essentially do readahead.

@stale
Copy link

stale bot commented Oct 28, 2019

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Oct 28, 2019
@dang
Copy link
Contributor

dang commented Dec 19, 2019

jenkins retest this please

@stale stale bot removed the stale label Dec 19, 2019
@cbodley
Copy link
Contributor

cbodley commented Dec 19, 2019

unstale, still interested in this

@stale
Copy link

stale bot commented Feb 17, 2020

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Feb 17, 2020
@stale
Copy link

stale bot commented May 17, 2020

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@stale stale bot closed this May 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants