os, osd: bring the lightweight OMAP iteration #60278
Conversation
```cpp
if (key.substr(0, filter_prefix.size()) != filter_prefix) {
  return ObjectStore::omap_iter_ret_t::STOP;
}
if (num >= max_return || bl.length() >= max_bytes) {
```
The limit is known a priori. Perhaps the interface could be extended to accommodate that and allow readahead / minimize the size of the critical section under the Collection::lock. Yet RocksDB only iterates one-by-one (no batching, AFAIK), so rather not for now.
```cpp
  return "";
}
virtual ceph::buffer::list value() = 0;
virtual std::string_view value_as_sv() = 0;
```
Would be nice to ultimately drop the old variants. Please note that doing the memcpy here results in more fragmentation of the bufferlist at the Message layer.
Keys should undergo exactly the same expansion procedure.
There is a problem with string_view lifetime. It should be clearly stated for how long the value's existence is guaranteed.
For RocksDB that would be until the next seek/bound/next call, I am reasonably sure.
For MemStore and KStore it should be checked.
MemStore stores OMAP simply as map<string, bl>, so it should be good until modification. Ideally we would hide the new variants from users and only let ObjectStores use them.
@rzarzynski: I have not debugged it yet, but with this PR |
Force-pushed d0f953b to 04bd4e6.
Apologies, @mkogan1! I also added two new commits that bring the latency logging we have in |
src/os/bluestore/BlueStore.cc
Outdated
```cpp
o->get_omap_tail(&tail);
while (it->valid()) {
  std::string user_key;
  if (const auto& db_key = it->raw_key().second; db_key >= tail) {
```
rocksdb::Slice::ToString() sounds like a promising target. Perhaps it's another place std::string_view could help to pacify.
```
- 9.46% _ZN14CFIteratorImpl7raw_keyB5cxx11Ev
  - 7.32% _ZNK7rocksdb5Slice8ToStringB5cxx11Eb
    - _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE10_M_replaceEmmPKcm
      - _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE9_M_mutateEmmPKcm
        2.69% _Znwm@plt
```
I think this is the leaf of our vast abstraction over the rocksdb::Iterator:

```cpp
string key() override {
  return dbiter->key().ToString();
}
std::pair<std::string, std::string> raw_key() override {
  return make_pair(prefix, key());
}
```

rocksdb::Slice already offers ToStringView() :-). Switching raw_key() (and perhaps key()) seems worth considering.
Force-pushed 04bd4e6 to f992a31.
```cpp
  truncated = true;
  return ObjectStore::omap_iter_ret_t::STOP;
}
encode(key, bl);
```
IDEA FOR FOLLOW-UP: Maybe the bl should be initialized with a bigger append buffer from the very beginning.
| } | ||
| std::string_view value_as_sv() override { | ||
| rocksdb::Slice val = iters[0]->value(); | ||
| return std::string_view{val.data(), val.size()}; |
There was a problem hiding this comment.
| return std::string_view{val.data(), val.size()}; | |
| return val.ToStringView(); |
| } | ||
| std::string_view value_as_sv() override { | ||
| rocksdb::Slice val = dbiter->value(); | ||
| return std::string_view{val.data(), val.size()}; |
There was a problem hiding this comment.
| return std::string_view{val.data(), val.size()}; | |
| return val.ToStringView(); |
```cpp
rocksdb::Slice val = dbiter->value();
return std::string_view{val.data(), val.size()};
```

Suggested change:

```diff
-  rocksdb::Slice val = dbiter->value();
-  return std::string_view{val.data(), val.size()};
+  return dbiter->value().ToStringView();
```
src/os/bluestore/BlueStore.cc
Outdated
```cpp
if (const auto& db_key = it->raw_key().second; db_key >= tail) {
  break;
} else {
  o->decode_omap_key(db_key, &user_key);
```
This could easily work with std::string_view while being worth around 4% of cycles:

```cpp
void BlueStore::Onode::decode_omap_key(const string& key, string *user_key)
{
  size_t pos = sizeof(uint64_t) + 1;
  if (!onode.is_pgmeta_omap()) {
    if (onode.is_perpg_omap()) {
      pos += sizeof(uint64_t) + sizeof(uint32_t);
    } else if (onode.is_perpool_omap()) {
      pos += sizeof(uint64_t);
    }
  }
  *user_key = key.substr(pos);
}
```

```
- 4.53% _ZN9BlueStore5Onode15decode_omap_keyERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPS6_
  - _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag
    1.40% _Znwm@plt
    0.52% memcpy@plt
```
mkogan1 left a comment

Following the same methodology used to test PR #60000 (S3 LIST of 50M objects), the measured performance improvement is unequivocal: 7 minutes faster with this PR. In percentage terms, the 50-million-object listing duration was 17.26% shorter.
```
time (nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t $(( $(numactl -N 0 -- nproc) / 1 )) -b 1 -n 50000000 -m l -bp b01b- -op 'folder01/stage01_')

# Before PR: 47:20.61 total
2024/10/09 08:06:42 Running Loop 0 BUCKET LIST TEST
2024/10/09 08:54:02 Loop: 0, Int: TOTAL, Dur(s): 2840.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 43.5, avg: 56.8, 99%: 76.6, max: 120.5 ], Slowdowns: 0
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s -u ) 2163.04s user 85.44s system 79% cpu 47:20.61 total

# After PR: 40:22.58 total
2024/10/14 11:42:08 Running Loop 0 BUCKET LIST TEST
2024/10/14 12:22:31 Loop: 0, Int: TOTAL, Dur(s): 2422.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 21, Lat(ms): [ min: 36.8, avg: 48.5, 99%: 68.5, max: 117.8 ], Slowdowns: 0
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s -u ) 2140.62s user 76.28s system 91% cpu 40:22.58 total
```
@mkogan1: do you have any information ( In yesterday's testing I'm seeing around 400% more IOPS, but: |
src/os/ObjectStore.h
Outdated
```cpp
omap_iter_seek_t start_from, ///< [in] where the iterator should point to at the beginning
std::function<omap_iter_ret_t(std::string_view, std::string_view)> f
) {
  return -EOPNOTSUPP;
```
Ultimately, we need some implementation here; otherwise KStore- and MemStore-based OSDs would need to handle this alternatively.
Of course, this is definitely a TODO before turning the draft into a full-blown PR.
```cpp
  STOP,
  NEXT
};
virtual int omap_iterate(
```
I like the concept of omap iteration as a black box.
It would be so much better if we did not expose RocksDB iterator behaviours to the OSD.
Yes, I really propose to delete the OmapIterator class altogether.
I'm on board with this! omap_iter_ret_t already provides room for e.g. PREV; similarly, SEEK_TO_FIRST could be added to omap_iter_seek_t.
Another interface-layer cleanup would be dropping (some of) the std::string-centric methods of KeyValueDB::IteratorImpl (then hopefully used only by providers of ObjectStore).
All of this as follow-ups, the reason being backportability – personally I'm fine even with short-lived duplication of the code if it helps minimize intrusiveness.
sure, here are the |
@mkogan1: OK, so the listing of a constant number (50M) of objects took |
Oh, report from benchmarking & profiling ec6ed26 (squeezing |
Attached flame graph with this PR (without the 2 commits from yesterday yet): |
updated flame graph with PR (+ the 2 commits from yesterday), and fixed unknown stacks (the methodology was changed slightly): |
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
```
- 63.07% _ZN12PrimaryLogPG19prepare_transactionEPNS_9OpContextE
  - 63.06% _ZN12PrimaryLogPG10do_osd_opsEPNS_9OpContextERSt6vectorI5OSDOpSaIS3_EE
    - 20.19% _ZN9BlueStore16OmapIteratorImpl4nextEv
      - 12.21% _ZN14CFIteratorImpl4nextEv
        + 10.56% _ZN7rocksdb6DBIter4NextEv
          1.02% _ZN7rocksdb18ArenaWrappedDBIter4NextEv
      + 3.11% clock_gettime@@GLIBC_2.17
      + 2.44% _ZN9BlueStore11log_latencyEPKciRKNSt6chrono8durationImSt5ratioILl1ELl1000000000EEEEdS1_i
        0.78% pthread_rwlock_rdlock@plt
        0.69% pthread_rwlock_unlock@plt
    - 14.28% _ZN9BlueStore16OmapIteratorImpl5valueEv
      - 11.60% _ZN14CFIteratorImpl5valueEv
        - 11.41% _ZL13to_bufferlistN7rocksdb5SliceE
          - 10.50% _ZN4ceph6buffer7v15_2_03ptrC1EPKcj
            - _ZN4ceph6buffer7v15_2_04copyEPKcj
              - 10.01% _ZN4ceph6buffer7v15_2_014create_alignedEjj
                - _ZN4ceph6buffer7v15_2_025create_aligned_in_mempoolEjji
                  5.27% _ZN7mempool6pool_t12adjust_countEll
                  + 3.72% tc_posix_memalign
          0.54% _ZN4ceph6buffer7v15_2_04list6appendEONS1_3ptrE
        1.25% pthread_rwlock_rdlock@plt
        0.90% pthread_rwlock_unlock@plt
```
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
…lation Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Force-pushed 049c321 to 203ad55.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
…rate() Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Commit 19392d4, being a response to a review comment, will be pushed in the follow-up PR. |
Force-pushed 203ad55 to cbc771a.
Pushed a refactored (without any important / invasive changes) version. Ready for QA. |
I'm dropping the |
It's known that the `md_config_t::get_val<>()` method template is costly and should be avoided on hot paths. Recent profiling [1] by Mark Kogan has shown that, on RGW's bucket listing, an OSD had burnt 2.87% of CPU cycles on `get_val<long>()` in `PG::prepare_stats_for_publish()`. [1]: ceph#60278 (comment) Fixes: https://tracker.ceph.com/issues/69657 Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Special thanks and credits to Mark Kogan for profiling ceph-osd under the RGW list workload. In this particular workload the change is worth around 4.2% of CPU cycles [1]. However, it's not restricted to RGW's bucket listing nor `cls_rgw`; I think it affects every single cls plugin in the system. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com> [1]: ceph/ceph#60278 (comment)
