
os, osd: bring the lightweight OMAP iteration#60278

Merged
yuriw merged 11 commits into ceph:main from rzarzynski:wip-os-fastomapiter
Jan 13, 2025

Conversation

@rzarzynski
Contributor

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

```cpp
if (key.substr(0, filter_prefix.size()) != filter_prefix) {
  return ObjectStore::omap_iter_ret_t::STOP;
}
if (num >= max_return || bl.length() >= max_bytes) {
```
Contributor Author

The limit is known a priori. Perhaps the interface could be extended to accommodate that and allow readahead / minimize the size of the critical section under the Collection::lock. Yet RocksDB only iterates one-by-one (no batching, AFAIK), so probably not for now.

```cpp
  return "";
}
virtual ceph::buffer::list value() = 0;
virtual std::string_view value_as_sv() = 0;
```
Contributor Author

@rzarzynski rzarzynski Oct 12, 2024

It would be nice to drop the old variants eventually. Please note that doing the memcpy here results in more fragmentation of the bufferlist at the Message layer.

Contributor

Keys should undergo exactly the same expansion procedure.

There is a problem with string_view lifetime. It should be clearly stated how long the value is guaranteed to exist.
For RocksDB that would be until the next seek/bound/next call, I am reasonably sure.
For MemStore and KStore it should be checked.
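To make the lifetime rule concrete, here is a minimal stand-in (generic C++, not Ceph code) of an iterator whose `value_as_sv()` returns a view into its current entry; as the review suggests for RocksDB, such a view is only valid until the next `next()`/`seek()` call, so consumers must copy before advancing. `FakeOmapIter` is a hypothetical name for illustration.

```cpp
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Illustrative stand-in: value_as_sv() aliases the current entry, so any
// returned view is invalidated (conceptually) by the next next()/seek().
class FakeOmapIter {
  std::vector<std::pair<std::string, std::string>> kvs_;
  std::size_t pos_ = 0;
public:
  explicit FakeOmapIter(std::vector<std::pair<std::string, std::string>> kvs)
    : kvs_(std::move(kvs)) {}
  bool valid() const { return pos_ < kvs_.size(); }
  void next() { ++pos_; }  // invalidates any previously returned view
  std::string_view value_as_sv() const { return kvs_[pos_].second; }
};
```

A caller that needs values beyond the current position copies them into owned strings before calling `next()`.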

Contributor Author

MemStore stores OMAP simply as map<string, bl>, so it should be good until modification. Ideally we would hide the new variants from users and only let ObjectStores use them.

@mkogan1
Contributor

mkogan1 commented Oct 13, 2024

@rzarzynski have not debugged yet, but with this PR s3cmd ls ... always returns an empty LIST
(the LIST op is successful (200 OK) but no items are returned)
[image]

@rzarzynski
Contributor Author

Apologies, @mkogan1! The `rados -p test listomapkeys benchmark_data_o08_3272042_object0` command I used in manual testing exercises a different flow than RGW / rados bench. Fixed the problem and repushed.

I also added two new commits that bring the latency logging we have in get_omap_iterator() / next(), to make this an apples-to-apples comparison. Although perf shows it's pretty costly (~10%), the overall gain looks promising even with those extras.

```cpp
o->get_omap_tail(&tail);
while (it->valid()) {
  std::string user_key;
  if (const auto& db_key = it->raw_key().second; db_key >= tail) {
```
Contributor Author

rocksdb::Slice::ToString() sounds promising. Perhaps it's another place where std::string_view could help.

```
- 9.46% _ZN14CFIteratorImpl7raw_keyB5cxx11Ev
   - 7.32% _ZNK7rocksdb5Slice8ToStringB5cxx11Eb
      - _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE10_M_replaceEmmPKcm
         - _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE9_M_mutateEmmPKcm
              2.69% _Znwm@plt
```

Contributor Author

@rzarzynski rzarzynski Oct 13, 2024

I think this is the leaf of our vast abstraction over the rocksdb::Iterator:

```cpp
string key() override {
  return dbiter->key().ToString();
}
std::pair<std::string, std::string> raw_key() override {
  return make_pair(prefix, key());
}
```

rocksdb::Slice already offers ToStringView() :-). Switching raw_key() (and perhaps key()) seems worth considering.
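To sketch the direction suggested here: a view-returning variant of raw_key() that skips the per-key heap allocation. `FakeSlice` and `IterSketch` are hypothetical stand-ins for rocksdb::Slice and the iterator wrapper, used only to illustrate the shape of the change.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Stand-in for rocksdb::Slice, which offers the same ToStringView().
struct FakeSlice {
  const char* data_;
  std::size_t size_;
  std::string_view ToStringView() const { return {data_, size_}; }
};

// Sketch of a raw_key() variant returning views instead of owned strings.
struct IterSketch {
  std::string prefix;   // column-family prefix, owned by the iterator
  FakeSlice current;    // current key, owned by the underlying iterator
  std::pair<std::string_view, std::string_view> raw_key_sv() const {
    return {prefix, current.ToStringView()};  // no copy, no allocation
  }
};
```

The returned views are only valid while the iterator (and its current position) stay alive, which is exactly the lifetime caveat raised above.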

```cpp
  truncated = true;
  return ObjectStore::omap_iter_ret_t::STOP;
}
encode(key, bl);
```
Contributor Author

@rzarzynski rzarzynski Oct 13, 2024

IDEA FOR FOLLOW-UP: Maybe the bl should be initialized with a bigger append buffer from the very beginning.
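The idea can be illustrated generically (std::string here, not ceph::bufferlist, and `encode_keys` is a hypothetical helper): when the total encoded size is roughly known up front, one big reservation avoids the repeated reallocations that growing an append buffer key-by-key would incur.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Generic sketch: pre-size the output buffer once, then append many keys.
std::string encode_keys(const std::vector<std::string>& keys,
                        std::size_t expected_total) {
  std::string out;
  out.reserve(expected_total);  // one allocation instead of many regrows
  for (const auto& k : keys) {
    out.append(k);
    out.push_back('\0');  // stand-in for the real length-prefixed encoding
  }
  return out;
}
```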


```cpp
}
std::string_view value_as_sv() override {
  rocksdb::Slice val = iters[0]->value();
  return std::string_view{val.data(), val.size()};
```
Contributor Author

Suggested change:

```diff
- return std::string_view{val.data(), val.size()};
+ return val.ToStringView();
```

```cpp
}
std::string_view value_as_sv() override {
  rocksdb::Slice val = dbiter->value();
  return std::string_view{val.data(), val.size()};
```
Contributor Author

Suggested change:

```diff
- return std::string_view{val.data(), val.size()};
+ return val.ToStringView();
```

Comment on lines +2272 to +2308
```cpp
rocksdb::Slice val = dbiter->value();
return std::string_view{val.data(), val.size()};
```
Contributor Author

Suggested change:

```diff
- rocksdb::Slice val = dbiter->value();
- return std::string_view{val.data(), val.size()};
+ return dbiter->value().ToStringView();
```

```cpp
if (const auto& db_key = it->raw_key().second; db_key >= tail) {
  break;
} else {
  o->decode_omap_key(db_key, &user_key);
```
Contributor Author

This could easily work with std::string_view; the function is worth around 4% of cycles:

```cpp
void BlueStore::Onode::decode_omap_key(const string& key, string *user_key)
{
  size_t pos = sizeof(uint64_t) + 1;
  if (!onode.is_pgmeta_omap()) {
    if (onode.is_perpg_omap()) {
      pos += sizeof(uint64_t) + sizeof(uint32_t);
    } else if (onode.is_perpool_omap()) {
      pos += sizeof(uint64_t);
    }
  }
  *user_key = key.substr(pos);
}
```
```
- 4.53% _ZN9BlueStore5Onode15decode_omap_keyERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPS6_
   - _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag
        1.40% _Znwm@plt
        0.52% memcpy@plt
```
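A hedged sketch of the string_view variant suggested here: same offset arithmetic, but returning a view into the caller's key instead of allocating via substr(). The bool parameters are stand-ins for the Onode's is_pgmeta_omap()/is_perpg_omap()/is_perpool_omap() state, and `decode_omap_key_sv` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Allocation-free variant: returns a view into the caller's key buffer.
std::string_view decode_omap_key_sv(std::string_view key,
                                    bool pgmeta, bool perpg, bool perpool) {
  std::size_t pos = sizeof(uint64_t) + 1;
  if (!pgmeta) {
    if (perpg) {
      pos += sizeof(uint64_t) + sizeof(uint32_t);
    } else if (perpool) {
      pos += sizeof(uint64_t);
    }
  }
  return key.substr(pos);  // no copy; valid only while the key buffer lives
}
```

The caveat is the same as for value_as_sv(): the returned view dangles once the underlying key storage is reused.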

@mkogan1 mkogan1 self-requested a review October 14, 2024 13:39
Contributor

@mkogan1 mkogan1 left a comment

Following the same methodology used to test PR #60000 (S3 LIST of 50M objects), the measured performance improvement is unequivocal:
7 minutes faster with this PR; as a percentage difference, the 50-million-object listing duration was 17.26% faster.

```
time (nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t $(( $(numactl -N 0 -- nproc) / 1 )) -b 1 -n 50000000 -m l -bp b01b- -op 'folder01/stage01_')

# Before PR: 47:20.61 total
2024/10/09 08:06:42 Running Loop 0 BUCKET LIST TEST
2024/10/09 08:54:02 Loop: 0, Int: TOTAL, Dur(s): 2840.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 43.5, avg: 56.8, 99%: 76.6, max: 120.5 ], Slowdowns: 0
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s  -u  )  2163.04s user 85.44s system 79% cpu 47:20.61 total

# After PR: 40:22.58 total
2024/10/14 11:42:08 Running Loop 0 BUCKET LIST TEST
2024/10/14 12:22:31 Loop: 0, Int: TOTAL, Dur(s): 2422.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 21, Lat(ms): [ min: 36.8, avg: 48.5, 99%: 68.5, max: 117.8 ], Slowdowns: 0
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s  -u  )  2140.62s user 76.28s system 91% cpu 40:22.58 total
```

@rzarzynski
Contributor Author

@mkogan1: do you have information (perf stat, preferably) on how many cycles the OSDs burnt to serve those S3 requests?

In yesterday's testing I'm seeing around 4x the IOPS:

```
>>> 2702 / 649
4.163328197226503
```

but:

  • the environment is controlled for saturation (to ensure any improvement in a tp_osd_tp thread directly translates into more IOPS, not just lower resource utilization);
  • the testing happens at the RADOS layer, without RGW. I think that is the biggest difference: what is visible at the S3 layer is the reduction in latency of OSD + RGW combined (in other words, RGW increases the denominator).

```cpp
  omap_iter_seek_t start_from, ///< [in] where the iterator should point to at the beginning
  std::function<omap_iter_ret_t(std::string_view, std::string_view)> f
) {
  return -EOPNOTSUPP;
```
Contributor

Ultimately, we need some implementation here; otherwise KStore- and MemStore-based OSDs would need to handle this differently.

Contributor Author

@rzarzynski rzarzynski Oct 14, 2024

Of course, this is definitely a TODO before turning the draft into a full-blown PR.
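For a sorted-map-backed store (MemStore keeps OMAP as map<string, bl>, per the discussion above), the callback-style iteration maps directly onto lower_bound() plus an in-order walk. A minimal sketch with simplified stand-in types (`omap_iterate_sketch` is a hypothetical name, and std::string replaces the real value type):

```cpp
#include <functional>
#include <map>
#include <string>
#include <string_view>

enum class omap_iter_ret_t { STOP, NEXT };

// Sketch of a map-backed implementation of callback-style omap iteration:
// seek to the lower bound, then walk in order until the callback says STOP.
int omap_iterate_sketch(
    const std::map<std::string, std::string>& omap,
    std::string_view lower_bound,
    const std::function<omap_iter_ret_t(std::string_view,
                                        std::string_view)>& f)
{
  for (auto it = omap.lower_bound(std::string(lower_bound));
       it != omap.end(); ++it) {
    if (f(it->first, it->second) == omap_iter_ret_t::STOP) {
      break;
    }
  }
  return 0;  // the real interface returns an error code
}
```

Because the callback receives views into the map's own entries, no per-key copies are needed, which is the point of the lightweight iteration.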

```cpp
  STOP,
  NEXT
};
virtual int omap_iterate(
```
Contributor

I like the concept of omap iteration as a black box.
It would be so much better if we did not expose RocksDB iterator behaviours to the OSD.
Yes, I really propose to delete the OmapIterator class altogether.

Contributor Author

I'm on board with this! omap_iter_ret_t already provides room for e.g. PREV; similarly, SEEK_TO_FIRST could be added to omap_iter_seek_t.

Another interface-layer cleanup would be dropping (some of) the std::string-centric methods of KeyValueDB::IteratorImpl (which would then hopefully be used only by providers of ObjectStore).

All of this as follow-ups, the reason being backportability: personally I'm fine even with short-lived duplication of code if it helps minimize intrusiveness.


@mkogan1
Contributor

mkogan1 commented Oct 15, 2024

@mkogan1: do you have an information (perf stat preferably) on how many cycles the OSDs burnt to serve those S3 requests?

Sure, here are the perf stats of the OSD process with vs. without the PR:

```
## WITH PR:
2024/10/14 18:29:27 Running Loop 0 BUCKET LIST TEST
2024/10/14 19:09:40 Loop: 0, Int: TOTAL, Dur(s): 2413.3, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 21, Lat(ms): [ min: 38.4, avg: 48.3, 99%: 68.7, max: 111.9 ], Slowdowns: 0
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s  -u  )  2129.53s user 78.24s system 91% cpu 40:13.34 total

❯ sudo perf stat --pid=$(pidof ceph-osd)
 Performance counter stats for process id '1589143':
      4,706,988.29 msec task-clock                #    1.947 CPUs utilized
       144,002,203      context-switches          #   30.593 K/sec
        11,167,865      cpu-migrations            #    2.373 K/sec
                19      page-faults               #    0.004 /sec
13,565,614,722,638      cycles                    #    2.882 GHz
10,236,793,941,447      instructions              #    0.75  insn per cycle
 1,795,068,051,351      branches                  #  381.362 M/sec
    74,769,237,191      branch-misses             #    4.17% of all branches
    2418.152146173 seconds time elapsed
```


```
## WITHOUT PR:
2024/10/14 22:25:48 Running Loop 0 BUCKET LIST TEST
2024/10/14 23:11:27 Loop: 0, Int: TOTAL, Dur(s): 2739.3, Mode: LIST, Ops: 50001, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 26.4, avg: 54.8, 99%: 75.6, max: 119.3 ], Slowdowns: 0
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s  -u  )  2159.08s user 86.43s system 81% cpu 45:39.36 total

❯ sudo perf stat --pid=$(pidof ceph-osd)
 Performance counter stats for process id '2822674':
      6,651,889.34 msec task-clock                #    2.424 CPUs utilized
       198,727,437      context-switches          #   29.875 K/sec
        14,434,725      cpu-migrations            #    2.170 K/sec
         1,053,205      page-faults               #  158.332 /sec
19,146,385,188,311      cycles                    #    2.878 GHz
15,286,933,348,285      instructions              #    0.80  insn per cycle
 2,682,786,191,052      branches                  #  403.312 M/sec
   107,658,608,688      branch-misses             #    4.01% of all branches
    2744.733827871 seconds time elapsed
```

@rzarzynski
Contributor Author

@mkogan1: OK, so the listing of constant number (50M) of objects took 2418 / 2744 ~= 0.88 of time while burning 13565614722638 / 19146385188311 ~= 0.70 cycles. Not so bad, I would say :-). Yet, maybe we can squeeze more. Any chance for flamegraphs?

@rzarzynski
Contributor Author

rzarzynski commented Oct 15, 2024

Oh, report from benchmarking & profiling ec6ed26 (squeezing memcpy around raw_key()) is available here: https://gist.github.com/rzarzynski/454e5d31ce0e84333b79bfe88362efdb?permalink_comment_id=5234618#gistcomment-5234618.

@mkogan1
Contributor

mkogan1 commented Oct 16, 2024

@mkogan1: OK, so the listing of constant number (50M) of objects took ... Any chance for flamegraphs?

Attached flame graph with this PR (without the 2 commits from yesterday yet).
perf was collected as follows: `sudo perf record --call-graph 'lbr' -m 8M --aio -z --pid=$(pidof ceph-osd) -- sleep 60`, while the hsbench LIST workload was running.
There are many unknown stacks; I'll see if that can be improved.
[flame graph: ceph-osd_583018_lbr_60]

Attached flame graph without this PR:
[flame graph: ceph-osd_3634166_lbr_60]

@mkogan1
Contributor

mkogan1 commented Oct 16, 2024

Updated flame graph with the PR (+ the 2 commits from yesterday), with the unknown stacks fixed (the methodology was changed slightly):

```
sudo perf record -e 'cycles' --switch-events --sample-cpu -g --call-graph 'lbr' -m 8M --aio -z --pid=$(pidof ceph-osd) -- sleep 30
time (sudo perf script | stackcollapse-perf.pl --kernel | flamegraph.pl --width 1600 --bgcolors grey --cp --hash > ceph-osd_$(pidof ceph-osd).svg)
```

[flame graph: ceph-osd_1653738]

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
```
- 63.07% _ZN12PrimaryLogPG19prepare_transactionEPNS_9OpContextE
   - 63.06% _ZN12PrimaryLogPG10do_osd_opsEPNS_9OpContextERSt6vectorI5OSDOpSaIS3_EE
      - 20.19% _ZN9BlueStore16OmapIteratorImpl4nextEv
         - 12.21% _ZN14CFIteratorImpl4nextEv
            + 10.56% _ZN7rocksdb6DBIter4NextEv
              1.02% _ZN7rocksdb18ArenaWrappedDBIter4NextEv
         + 3.11% clock_gettime@@GLIBC_2.17
         + 2.44% _ZN9BlueStore11log_latencyEPKciRKNSt6chrono8durationImSt5ratioILl1ELl1000000000EEEEdS1_i
           0.78% pthread_rwlock_rdlock@plt
           0.69% pthread_rwlock_unlock@plt
      - 14.28% _ZN9BlueStore16OmapIteratorImpl5valueEv
         - 11.60% _ZN14CFIteratorImpl5valueEv
            - 11.41% _ZL13to_bufferlistN7rocksdb5SliceE
               - 10.50% _ZN4ceph6buffer7v15_2_03ptrC1EPKcj
                  - _ZN4ceph6buffer7v15_2_04copyEPKcj
                     - 10.01% _ZN4ceph6buffer7v15_2_014create_alignedEjj
                        - _ZN4ceph6buffer7v15_2_025create_aligned_in_mempoolEjji
                             5.27% _ZN7mempool6pool_t12adjust_countEll
                           + 3.72% tc_posix_memalign
                 0.54% _ZN4ceph6buffer7v15_2_04list6appendEONS1_3ptrE
           1.25% pthread_rwlock_rdlock@plt
           0.90% pthread_rwlock_unlock@plt
```

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
…lation

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
…rate()

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@rzarzynski
Contributor Author

Commit 19392d4, being a response to a review comment, will be pushed in the follow-up PR.

@rzarzynski
Contributor Author

Pushed a refactored (without any important / invasive changes) version. Ready for QA.

@rzarzynski
Contributor Author

I'm dropping the rgw label, as the RGW commits have been split out into the follow-up PR.

@ljflores
Member

@yuriw yuriw merged commit 89f42f1 into ceph:main Jan 13, 2025
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Jan 25, 2025
It's known that the `md_config_t::get_val<>()` method template
is costly and should be avoided on hot paths.

Recent profiling [1] by Mark Kogan has shown that, on RGW's bucket
listing, an OSD had burnt 2.87% of CPU cycles on `get_val<long>()`
in `PG::prepare_stats_for_publish()`.

[1]: ceph#60278 (comment)

Fixes: https://tracker.ceph.com/issues/69657
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
dnyanee1997 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Feb 22, 2025
dnyanee1997 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Feb 23, 2025
dnyanee1997 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Feb 24, 2025
harriscr pushed a commit to ceph/ceph-ci that referenced this pull request May 15, 2025
Special thanks and credits to Mark Kogan for profiling
ceph-osd under the RGW list workload.

In this particular workload the change is worth around 4.2%
of CPU cycles [1]. However, it's not restricted to RGW's
bucket listing or `cls_rgw`; I think it affects every single
cls plugin in the system.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>

[1]: ceph/ceph#60278 (comment)
harriscr pushed a commit to ceph/ceph-ci that referenced this pull request May 15, 2025
rzarzynski added a commit to rzarzynski/ceph that referenced this pull request Jul 8, 2025

(cherry picked from commit 99c4041)
