
osd: Run osd bench test to override default max osd capacity for mclock#41308

Merged
neha-ojha merged 5 commits into ceph:master from sseshasa:wip-osd-benchmark-for-mclock
Jun 3, 2021

Conversation

sseshasa (Contributor) commented May 12, 2021:

If the mclock scheduler is enabled, run the osd bench test as part of the osd
initialization sequence in order to determine the max osd capacity. The
iops determined by the test are used to override the default
osd_mclock_max_capacity_iops_[hdd,ssd].

The test performs random writes of 100 objects of 4MiB size using
4KiB blocksize. The existing test which was a part of asok_command() is
factored out into a separate method called run_osd_bench_test() so that it
can be used for both purposes.
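
The reduction of the bench measurements to an iops figure is simple arithmetic: bytes written per second, expressed in blocksize-sized operations. A minimal sketch of that derivation (the struct and function names here are hypothetical, not Ceph's actual code):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: reduce raw bench output to bandwidth and iops.
// In the OSD the measurement itself lives in run_osd_bench_test().
struct BenchResult {
  double bytes_per_sec;  // raw write bandwidth
  double iops;           // bandwidth expressed in blocksize-sized ops
};

BenchResult derive_bench_result(int64_t bytes_written,
                                double elapsed_sec,
                                int64_t blocksize) {
  // rate is bytes/sec; dividing by the blocksize yields iops
  const double rate = static_cast<double>(bytes_written) / elapsed_sec;
  return BenchResult{rate, rate / static_cast<double>(blocksize)};
}
```

For example, ~110 MiB/s at a 4 KiB blocksize works out to roughly 28k iops, consistent with the SSD log lines quoted later in this thread.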

A new method called update_configuration() is introduced in the OpScheduler
base class to facilitate propagation of changes to a config option
that is not user initiated. This method helps apply changes and
update any internal variables associated with a config option, as
long as the option is tracked. In this case, the change to the max osd
capacity is propagated to each op shard using this method. In the
future, the method can be used to propagate changes to advanced
config option(s) that the user is not expected to modify.
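
The shape of that hook can be sketched as follows. This is a simplified, assumption-laden sketch: apart from the OpScheduler/update_configuration() names taken from the PR, the config store and shard-scheduler class here are hypothetical stand-ins for the real Ceph types:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the OSD's config store.
using ConfStore = std::map<std::string, double>;

class OpScheduler {
public:
  virtual ~OpScheduler() = default;
  // Re-apply tracked config options; called when a change is made
  // internally (i.e. not user initiated), such as the benched capacity.
  virtual void update_configuration() = 0;
};

class MClockShardScheduler : public OpScheduler {
  const ConfStore& conf;
  double max_osd_capacity;  // cached value derived from config
public:
  explicit MClockShardScheduler(const ConfStore& c)
      : conf(c),
        max_osd_capacity(c.at("osd_mclock_max_capacity_iops_ssd")) {}
  void update_configuration() override {
    // Refresh the cached value from the (tracked) option.
    max_osd_capacity = conf.at("osd_mclock_max_capacity_iops_ssd");
  }
  double capacity() const { return max_osd_capacity; }
};
```

After the bench overrides osd_mclock_max_capacity_iops_[hdd,ssd], the OSD would invoke update_configuration() on each op shard so every shard's cached capacity reflects the measured value.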

Update mclock-config-ref to reflect automated OSD benchmarking.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

sseshasa (Author) commented:

Logs messages from testing:

Underlying device: SSD

2021-05-12T11:38:57.153+0000 7f91ecde00c0  1 osd.0 0 maybe_override_max_osd_capacity_for_qos osd bench test - max_bytes_count: 12288000 blocksize: 4096 max_object_count: 200
2021-05-12T11:38:57.706+0000 7f91ecde00c0  1 osd.0 0 maybe_override_max_osd_capacity_for_qos osd bench result - bandwidth (MiB/sec): 110.115 iops: 28189.365 elapsed_sec: 0.106

Underlying device: HDD

2021-05-12T11:16:04.905+0000 7fa938e280c0  1 osd.0 0 maybe_override_max_osd_capacity_for_qos osd bench test - max_bytes_count: 12288000 blocksize: 4096 max_object_count: 200
2021-05-12T11:16:21.478+0000 7fa938e280c0  1 osd.0 0 maybe_override_max_osd_capacity_for_qos osd bench result - bandwidth (MiB/sec): 1.236 iops: 316.420 elapsed_sec: 9.481

@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from def4c08 to 4219c37 (May 12, 2021 17:41)
  } else {
    double rate = count / elapsed;
    double iops = rate / bsize;
    dout(1) << __func__
Member:

I am not sure we need to log all the above lines at level 1 in this function. Only adding the following summary line to the osd log at level 1 and maybe cluster log as well, when we have successfully overridden the defaults, should be enough.

Member:

'ceph config show' displaying it is good enough - no need to add it to the cluster log. Agree about the osd log levels.

Contributor:

Seconded; I'd suggest not putting this in clog.

sseshasa (Author):

I have used derr for the error case. As suggested, I just added a log at level 1 when the test succeeds.


@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from 4219c37 to f87b7f7 (May 14, 2021 12:18)
@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch 4 times, most recently from 61c587f to de2eb9a (May 14, 2021 14:39)
neha-ojha (Member) left a comment:
few nits on the documentation, code changes look fine

your Ceph configuration file).
1. Bring up your Ceph cluster and log in to the Ceph node hosting the OSDs that
you wish to benchmark.
2. Ensure that the bluestore throttle options (i.e.
Member:

do we need this? these will be set to defaults as per the setup procedure provided so far

sseshasa (Author):

Sure, I can remove this since defaults will be set.

On another note, it may be worthwhile to come up with a script that performs these steps and determines the bluestore throttle values that work. For example, we could set a tolerance of some percentage of the baseline throughput below which the script stops and displays the throttle values. This obviously cannot be done during OSD init as it consumes time. What do you think?

@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch 3 times, most recently from e5e747c to ecd2c7b (May 17, 2021 08:13)

@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from ecd2c7b to b4811c6 (May 20, 2021 14:32)
@github-actions github-actions bot added the tests label May 20, 2021
sseshasa (Author) commented May 20, 2021:

Added commit 379698b to fix the TestProgress failure seen in the teuthology run (JobID: 6118355). Ran the above test again multiple times. See:
https://pulpito.ceph.com/sseshasa-2021-05-20_13:44:35-rados:mgr-wip-sseshasa-testing-progress-distro-basic-smithi/

All the tests passed in all the runs. Some tests, e.g. 6124846/49/50/53, did encounter the scenario and needed a few more seconds for the recovery to complete.

The failures pertaining to OSD_DOWN health check warnings were caused by heartbeat timeouts: the worker thread of an op shard did not get an item to process immediately, and the scheduler kept returning a status indicating when the next (future) request could be dequeued. When this continued for more than 15 secs, heartbeat timeouts started to appear, resulting in the concerned osd being marked down. The fix is to disable heartbeats when there are no immediate requests to process and re-enable them once a work item is ready for processing. See commit a543550.
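
The suspend/re-arm pattern described above can be sketched like this. This is a hypothetical, self-contained model of the idea, not the OSD's actual heartbeat map; a zeroed deadline stands in for "no timeout expected":

```cpp
#include <cassert>
#include <chrono>

// Illustrative model of a per-thread heartbeat watchdog: suspend the
// deadline before a potentially long wait for a future work item, and
// re-arm it once an item is ready, so the wait is not misread as a hang.
struct ThreadHeartbeat {
  using clock = std::chrono::steady_clock;
  clock::time_point deadline{};  // default (zero) => suspended

  void suspend() { deadline = clock::time_point{}; }
  void arm(clock::duration grace) { deadline = clock::now() + grace; }
  bool timed_out(clock::time_point now) const {
    // A suspended watchdog never reports a timeout.
    return deadline != clock::time_point{} && now > deadline;
  }
};
```

In this model the _process() loop would call suspend() before waiting on the scheduler's "future request" time and arm() again when it resumes dequeueing, mirroring the fix in commit a543550.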

@neha-ojha @jdurgin @athanatos @tchaikov Please review the above commit for correctness.

Test Results with Above Commit:
https://pulpito.ceph.com/sseshasa-2021-05-25_16:41:19-rados:mgr-wip-sseshasa-testing-prog-1-distro-basic-smithi/
https://pulpito.ceph.com/sseshasa-2021-05-25_16:45:58-rados:mgr-wip-sseshasa-testing-prog-1-distro-basic-smithi/

Will re-run the failed jobs again for completeness.

sseshasa (Author) commented:

jenkins retest this please

@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from b4811c6 to 0e51e97 (May 20, 2021 16:45)
@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch 4 times, most recently from 662e0cc to be97e72 (May 25, 2021 18:48)
github-actions bot:

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from be97e72 to 9bc3178 (May 26, 2021 05:18)
@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from 9bc3178 to b0c6e4f (May 26, 2021 12:04)
sseshasa (Author) commented:

jenkins test make check

sseshasa (Author) commented May 27, 2021:

Latest Run:
https://pulpito.ceph.com/sseshasa-2021-05-26_07:34:51-rados-wip-sseshasa-testing-2021-05-26-1127-distro-basic-smithi/

The progress module test got fixed in the above run. However, one job failed with ceph_objectstore_test (see https://tracker.ceph.com/issues/50903) due to the delay in osd initialization caused by the duration of the 'osd bench' test. The 'osd bench' duration was improved, and the following runs were done after the fix:

ceph_objectstore_test:
https://pulpito.ceph.com/sseshasa-2021-06-01_08:16:40-rados:objectstore-wip-sseshasa-testing-objs-test-2-distro-basic-smithi/

Re-run failed and dead jobs from the latest run:
https://pulpito.ceph.com/sseshasa-2021-06-01_08:27:04-rados-wip-sseshasa-testing-objs-test-2-distro-basic-smithi/

@neha-ojha I think this PR is good to be merged now. Please take a look and merge this if you concur.

@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from b0c6e4f to 3585833 (June 1, 2021 08:54)
sseshasa added 4 commits June 2, 2021 14:19
Remove the generic "osd_mclock_max_capacity_iops" option and use the
"osd_mclock_max_capacity_iops_[hdd,ssd]" options. It is better to have a
clear indication about the type of underlying device. This helps in
avoiding confusion when trying to read or override the options.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
If the mclock scheduler is enabled, run the osd bench test as part of the osd
initialization sequence in order to determine the max osd capacity. The
iops determined by the test are used to override the default
osd_mclock_max_capacity_iops_[hdd,ssd] option depending on the
underlying device type.

The test performs random writes of 100 objects of 4MiB size using
4KiB blocksize. The existing test which was a part of asok_command() is
factored out into a separate method called run_osd_bench_test() so that it
can be used for both purposes. If the test fails, the default values
for the above mentioned options are used.

A new method called update_configuration() is introduced in the OpScheduler
base class to facilitate propagation of changes to a config option
that is not user initiated. This method helps apply changes and
update any internal variables associated with a config option, as
long as the option is tracked. In this case, the change to the max osd
capacity is propagated to each op shard using this method. In the
future, the method can be used to propagate changes to advanced
config option(s) that the user is not expected to modify.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…cessed

There could be rare instances when employing the mclock scheduler where a
worker thread for a shard may not get an immediate work item to process.
Such items are designated as future work items. In such cases, the
_process() loop waits until the time indicated by the scheduler before
attempting to dequeue from the scheduler queue again. If there are
multiple threads per shard, a thread may not get an immediate item for
a long time. This time could exceed the heartbeat timeout for the thread
and result in heartbeat timeouts reported for the osd in question. To
prevent this, the heartbeat timeout for the thread is disabled before
waiting for an item and enabled once the wait period is over.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
With the mclock scheduler enabled, the recovery throughput is throttled based
on factors like the type of mclock profile enabled and the OSD capacity,
among others. Due to this, the recovery times may vary, and therefore the
existing timeout of 120 secs may not be sufficient.

To address the above, a new method called _is_inprogress_or_complete() is
introduced in the TestProgress Class that checks if the event with the
specified 'id' is in progress by checking the 'progress' key of the
progress command response. This method also handles the corner case where
the event completes just before it's called.

The existing wait_until_true() method in the CephTestCase class is
modified to accept another function argument called "check_fn". In the
"test_turn_off_module" test, which has been observed to fail for the
reasons described above, this is set to the _is_inprogress_or_complete()
function described earlier. A retry mechanism with a maximum of 5
attempts is introduced after the first timeout is hit. This means that
the wait can extend up to a maximum of 600 secs (120 secs * 5) as long as
there is recovery progress reported by the 'ceph progress' command result.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
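
The retry-while-progressing wait described in that commit can be sketched as a small loop. This is a hypothetical model (the real teuthology helper is Python and sleeps between polls; here one inner iteration stands in for one poll interval, and all names are illustrative):

```cpp
#include <cassert>
#include <functional>

// Illustrative model: poll pred() for up to polls_per_window iterations
// (one timeout window); after each window that expires, retry only while
// check_fn() still reports progress, up to max_windows windows in total
// (e.g. 5 windows of 120s each => 600s maximum wait).
bool wait_until_true_with_retry(const std::function<bool()>& pred,
                                const std::function<bool()>& check_fn,
                                int polls_per_window,
                                int max_windows) {
  for (int window = 0; window < max_windows; ++window) {
    for (int i = 0; i < polls_per_window; ++i) {
      if (pred()) {
        return true;  // condition met within this window
      }
      // the real helper would sleep(interval) here
    }
    if (!check_fn()) {
      return false;  // no progress reported: give up early
    }
  }
  return false;  // exhausted all windows despite ongoing progress
}
```

In the test described above, pred() would query the progress module for completion and check_fn() would be _is_inprogress_or_complete(), so a slow-but-progressing recovery keeps extending the wait while a stalled one fails fast.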
@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from 3585833 to a438090 (June 2, 2021 09:02)
sseshasa (Author) commented Jun 2, 2021:

jenkins test docs

neha-ojha (Member) commented:

> [quoting sseshasa's May 27 comment above]

I noticed some dead jobs timing out in wait_for_*, but I can't say much without logs. Let's merge after creating trackers for those and for the osd-rep-recov-eio.sh failure.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from a438090 to 76420f9 (June 3, 2021 09:16)
sseshasa (Author) commented Jun 3, 2021:

> [quoting the exchange above]

Tracker for the failed job:
6145022 - osd-rep-recov-eio.sh: https://tracker.ceph.com/issues/51074

Trackers for the dead jobs:
6145021 - wait_for_recovery timeout: https://tracker.ceph.com/issues/51076
6145024 - Updated existing tracker: https://tracker.ceph.com/issues/50806 since the same assertion failure is hit.
