Skip to content

osd: avoid costly md_config_t::get_val<>() when preparing stats#60610

Merged
ljflores merged 1 commit intoceph:mainfrom
rzarzynski:wip-no-getval-on-osd-stats
Feb 19, 2025
Merged

osd: avoid costly md_config_t::get_val<>() when preparing stats#60610
ljflores merged 1 commit intoceph:mainfrom
rzarzynski:wip-no-getval-on-osd-stats

Conversation

@rzarzynski
Copy link
Contributor

@rzarzynski rzarzynski commented Nov 4, 2024

It's know that the md_config_t::get_val<>() method template is costly and should be avoided on hot paths.

Recent profiling1 by Mark Kogan has shown that, on RGW's bucket listing, an OSD had burnt 2,87% of CPU cycles on get_val<long>() in PG::prepare_stats_for_publish().

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

are collected.
default: 500
with_legacy: false
with_legacy: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Officially, we do not add new legacy options:

// This macro block defines C members of the md_config_t struct
// corresponding to the definitions in legacy_config_opts.h.
// These C members are consumed by code that was written before
// the new options.cc infrastructure: all newer code should
// be consume options via explicit get() rather than C members.

You can load these via OSD::handle_conf_change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch, moved to md_config_cacher_t.

@rzarzynski rzarzynski force-pushed the wip-no-getval-on-osd-stats branch from 8be7014 to f1621c1 Compare November 4, 2024 17:38
@rzarzynski rzarzynski requested a review from batrick November 4, 2024 17:39
Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know about md_config_cacher_t, nice! That only works for primitive types, right? (i.e. not std::string)

@rzarzynski
Copy link
Contributor Author

This is a question about std::atomic for such types. I guess it would compile but the lock would be mutex-based.

epoch_t cutoff_epoch = info.stats.reported_epoch;
cutoff_epoch +=
cct->_conf.get_val<int64_t>("osd_pg_stat_report_interval_max_epochs");
static_cast<int64_t>(osd_pg_stat_report_interval_max_epochs);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an error in the Shaman build related to this change:
https://shaman.ceph.com/builds/ceph/wip-yuri4-testing-2025-01-13-1147/

19.3.0-6882-gb9fe3ef7/src/osd/PeeringState.cc
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.3.0-6882-gb9fe3ef7/rpm/el9/BUILD/ceph-19.3.0-6882-gb9fe3ef7/src/osd/PeeringState.cc: In member function ‘std::optional<pg_stat_t> PeeringState::prepare_stats_for_publish(const std::optional<pg_stat_t>&, const object_stat_collection_t&)’:
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.3.0-6882-gb9fe3ef7/rpm/el9/BUILD/ceph-19.3.0-6882-gb9fe3ef7/src/osd/PeeringState.cc:3934:7: error: invalid ‘static_cast’ from type ‘md_config_cacher_t<long int>’ to type ‘int64_t’ {aka ‘long int’}
 3934 |       static_cast<int64_t>(osd_pg_stat_report_interval_max_seconds);
      |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.3.0-6882-gb9fe3ef7/rpm/el9/BUILD/ceph-19.3.0-6882-gb9fe3ef7/src/osd/PeeringState.cc:3943:7: error: invalid ‘static_cast’ from type ‘md_config_cacher_t<long int>’ to type ‘int64_t’ {aka ‘long int’}
 3943 |       static_cast<int64_t>(osd_pg_stat_report_interval_max_epochs);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing the '*' (should be *osd_pg...)

Copy link
Contributor

@ronen-fr ronen-fr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing the '*' (should be *osd_pg...)

@rzarzynski
Copy link
Contributor Author

Ah, there was the cacher's API change. I bet it merged earlier resulting in undetected merge conflict. I will fix.

It's know that the `md_config_t::get_val<>()` method template
is costly and should be avoided on hot paths.

Recent profiling[1] by Mark Kogani has shown that, on RGW's bucket
listing, an OSD had burnt 2,87% of CPU cycles on `get_val<long>()`
in `PG::prepare_stats_for_publish()`.

[1]: ceph#60278 (comment)

Fixes: https://tracker.ceph.com/issues/69657
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@rzarzynski rzarzynski force-pushed the wip-no-getval-on-osd-stats branch from f1621c1 to 39a12b5 Compare January 25, 2025 16:51
@rzarzynski
Copy link
Contributor Author

Rebased to fresh main with the cacher's interface reworked.
Created https://tracker.ceph.com/issues/69657 to track backports.
Hopefully it's done.

@ronen-fr
Copy link
Contributor

jenkins test api

@ronen-fr
Copy link
Contributor

jenkins test make check arm64

@ljflores
Copy link
Member

@ljflores ljflores merged commit af9e68e into ceph:main Feb 19, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants