
osd: Run osd bench test to override default max osd capacity for mclock#41308

Merged
neha-ojha merged 5 commits into ceph:master from sseshasa:wip-osd-benchmark-for-mclock
Jun 3, 2021

Conversation

sseshasa (Contributor) commented May 12, 2021:

If the mclock scheduler is enabled, run the osd bench test as part of the osd
initialization sequence in order to determine the max osd capacity. The
iops determined by the test are used to override the default
osd_mclock_max_capacity_iops_[hdd,ssd].

The test performs random writes of 100 objects of 4MiB size using
4KiB blocksize. The existing test which was a part of asok_command() is
factored out into a separate method called run_osd_bench_test() so that it
can be used for both purposes.
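
The reduction of the bench measurements to an iops figure is simple arithmetic: bytes written per second, expressed in blocksize-sized operations. A minimal sketch of that derivation (the struct and function names here are hypothetical, not Ceph's actual code):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: reduce raw bench output to bandwidth and iops.
// In the OSD the measurement itself lives in run_osd_bench_test().
struct BenchResult {
  double bytes_per_sec;  // raw write bandwidth
  double iops;           // bandwidth expressed in blocksize-sized ops
};

BenchResult derive_bench_result(int64_t bytes_written,
                                double elapsed_sec,
                                int64_t blocksize) {
  // rate is bytes/sec; dividing by the blocksize yields iops
  const double rate = static_cast<double>(bytes_written) / elapsed_sec;
  return BenchResult{rate, rate / static_cast<double>(blocksize)};
}
```

For example, ~110 MiB/s at a 4 KiB blocksize works out to roughly 28k iops, consistent with the SSD log lines quoted later in this thread.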

A new method called update_configuration() is introduced in the OpScheduler
base class to facilitate propagation of changes to a config option
that is not user initiated. This method helps apply changes and
update any internal variables associated with a config option, as
long as the option is tracked. In this case, the change to the max osd
capacity is propagated to each op shard using this method. In the
future, the method can be used to propagate changes to advanced
config option(s) that the user is not expected to modify.
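
The shape of that hook can be sketched as follows. This is a simplified, assumption-laden sketch: apart from the OpScheduler/update_configuration() names taken from the PR, the config store and shard-scheduler class here are hypothetical stand-ins for the real Ceph types:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the OSD's config store.
using ConfStore = std::map<std::string, double>;

class OpScheduler {
public:
  virtual ~OpScheduler() = default;
  // Re-apply tracked config options; called when a change is made
  // internally (i.e. not user initiated), such as the benched capacity.
  virtual void update_configuration() = 0;
};

class MClockShardScheduler : public OpScheduler {
  const ConfStore& conf;
  double max_osd_capacity;  // cached value derived from config
public:
  explicit MClockShardScheduler(const ConfStore& c)
      : conf(c),
        max_osd_capacity(c.at("osd_mclock_max_capacity_iops_ssd")) {}
  void update_configuration() override {
    // Refresh the cached value from the (tracked) option.
    max_osd_capacity = conf.at("osd_mclock_max_capacity_iops_ssd");
  }
  double capacity() const { return max_osd_capacity; }
};
```

After the bench overrides osd_mclock_max_capacity_iops_[hdd,ssd], the OSD would invoke update_configuration() on each op shard so every shard's cached capacity reflects the measured value.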

Update mclock-config-ref to reflect automated OSD benchmarking.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

sseshasa (Author) commented:

Logs messages from testing:

Underlying device: SSD

2021-05-12T11:38:57.153+0000 7f91ecde00c0  1 osd.0 0 maybe_override_max_osd_capacity_for_qos osd bench test - max_bytes_count: 12288000 blocksize: 4096 max_object_count: 200
2021-05-12T11:38:57.706+0000 7f91ecde00c0  1 osd.0 0 maybe_override_max_osd_capacity_for_qos osd bench result - bandwidth (MiB/sec): 110.115 iops: 28189.365 elapsed_sec: 0.106

Underlying device: HDD

2021-05-12T11:16:04.905+0000 7fa938e280c0  1 osd.0 0 maybe_override_max_osd_capacity_for_qos osd bench test - max_bytes_count: 12288000 blocksize: 4096 max_object_count: 200
2021-05-12T11:16:21.478+0000 7fa938e280c0  1 osd.0 0 maybe_override_max_osd_capacity_for_qos osd bench result - bandwidth (MiB/sec): 1.236 iops: 316.420 elapsed_sec: 9.481

@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from def4c08 to 4219c37 (May 12, 2021 17:41)
  } else {
    double rate = count / elapsed;
    double iops = rate / bsize;
    dout(1) << __func__
Member:

I am not sure we need to log all the above lines at level 1 in this function. Only adding the following summary line to the osd log at level 1 and maybe cluster log as well, when we have successfully overridden the defaults, should be enough.

Member:

'ceph config show' displaying it is good enough - no need to add it to the cluster log. Agree about the osd log levels.

Contributor:

Seconded; I'd suggest not putting this in clog.

sseshasa (Author):

I have used derr for the error case. As suggested, I just added a log at level 1 when the test succeeds.


@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from 4219c37 to f87b7f7 (May 14, 2021 12:18)
@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch 4 times, most recently from 61c587f to de2eb9a (May 14, 2021 14:39)
neha-ojha (Member) left a comment:
few nits on the documentation, code changes look fine

your Ceph configuration file).
1. Bring up your Ceph cluster and log in to the Ceph node hosting the OSDs that
you wish to benchmark.
2. Ensure that the bluestore throttle options (i.e.
Member:

do we need this? these will be set to defaults as per the setup procedure provided so far

sseshasa (Author):

Sure, I can remove this since defaults will be set.

On another note, it may be worthwhile to come up with a script that performs these steps and determines the bluestore throttle values that work. For example, we could set a tolerance of some percentage of the baseline throughput below which the script stops and displays the throttle values. This obviously cannot be done during OSD init as it consumes time. What do you think?

@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch 3 times, most recently from e5e747c to ecd2c7b (May 17, 2021 08:13)

@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from ecd2c7b to b4811c6 (May 20, 2021 14:32)
@github-actions github-actions bot added the tests label May 20, 2021
sseshasa (Author) commented May 20, 2021:

Added commit 379698b to fix the TestProgress failure seen in the teuthology run (JobID: 6118355). Ran the above test again multiple times. See:
https://pulpito.ceph.com/sseshasa-2021-05-20_13:44:35-rados:mgr-wip-sseshasa-testing-progress-distro-basic-smithi/

All the tests passed in all the runs. Some tests, e.g. 6124846/49/50/53, did encounter the scenario and needed a few more seconds for the recovery to complete.

The failures pertaining to OSD_DOWN health check warnings were caused by heartbeat timeouts: the worker thread of an op shard did not get an item to process immediately, and the scheduler kept returning a status indicating when the next (future) request could be dequeued. When this continued for more than 15 secs, heartbeat timeouts started to appear, resulting in the concerned osd being marked down. The fix is to disable heartbeats when there are no immediate requests to process and re-enable them once a work item is ready for processing. See commit a543550.
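
The suspend/re-arm pattern described above can be sketched like this. This is a hypothetical, self-contained model of the idea, not the OSD's actual heartbeat map; a zeroed deadline stands in for "no timeout expected":

```cpp
#include <cassert>
#include <chrono>

// Illustrative model of a per-thread heartbeat watchdog: suspend the
// deadline before a potentially long wait for a future work item, and
// re-arm it once an item is ready, so the wait is not misread as a hang.
struct ThreadHeartbeat {
  using clock = std::chrono::steady_clock;
  clock::time_point deadline{};  // default (zero) => suspended

  void suspend() { deadline = clock::time_point{}; }
  void arm(clock::duration grace) { deadline = clock::now() + grace; }
  bool timed_out(clock::time_point now) const {
    // A suspended watchdog never reports a timeout.
    return deadline != clock::time_point{} && now > deadline;
  }
};
```

In this model the _process() loop would call suspend() before waiting on the scheduler's "future request" time and arm() again when it resumes dequeueing, mirroring the fix in commit a543550.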

@neha-ojha @jdurgin @athanatos @tchaikov Please review the above commit for correctness.

Test Results with Above Commit:
https://pulpito.ceph.com/sseshasa-2021-05-25_16:41:19-rados:mgr-wip-sseshasa-testing-prog-1-distro-basic-smithi/
https://pulpito.ceph.com/sseshasa-2021-05-25_16:45:58-rados:mgr-wip-sseshasa-testing-prog-1-distro-basic-smithi/

Will re-run the failed jobs again for completeness.

sseshasa (Author) commented:

jenkins retest this please

@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from b4811c6 to 0e51e97 (May 20, 2021 16:45)
@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch 4 times, most recently from 662e0cc to be97e72 (May 25, 2021 18:48)
github-actions bot:

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from be97e72 to 9bc3178 (May 26, 2021 05:18)
@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from 9bc3178 to b0c6e4f (May 26, 2021 12:04)
sseshasa (Author) commented:

jenkins test make check

sseshasa (Author) commented May 27, 2021:

Latest Run:
https://pulpito.ceph.com/sseshasa-2021-05-26_07:34:51-rados-wip-sseshasa-testing-2021-05-26-1127-distro-basic-smithi/

The progress module test got fixed in the above run. However, one job failed with ceph_objectstore_test (see https://tracker.ceph.com/issues/50903) due to the delay in osd initialization caused by the duration of the 'osd bench' test. The 'osd bench' duration was improved, and the following runs were done after the fix:

ceph_objectstore_test:
https://pulpito.ceph.com/sseshasa-2021-06-01_08:16:40-rados:objectstore-wip-sseshasa-testing-objs-test-2-distro-basic-smithi/

Re-run failed and dead jobs from the latest run:
https://pulpito.ceph.com/sseshasa-2021-06-01_08:27:04-rados-wip-sseshasa-testing-objs-test-2-distro-basic-smithi/

@neha-ojha I think this PR is good to be merged now. Please take a look and merge this if you concur.

@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from b0c6e4f to 3585833 (June 1, 2021 08:54)
sseshasa added 4 commits June 2, 2021 14:19
Remove the generic "osd_mclock_max_capacity_iops" option and use the
"osd_mclock_max_capacity_iops_[hdd,ssd]" options. It is better to have a
clear indication about the type of underlying device. This helps in
avoiding confusion when trying to read or override the options.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
If the mclock scheduler is enabled, run the osd bench test as part of the osd
initialization sequence in order to determine the max osd capacity. The
iops determined by the test are used to override the default
osd_mclock_max_capacity_iops_[hdd,ssd] option depending on the
underlying device type.

The test performs random writes of 100 objects of 4MiB size using
4KiB blocksize. The existing test which was a part of asok_command() is
factored out into a separate method called run_osd_bench_test() so that it
can be used for both purposes. If the test fails, the default values
for the above mentioned options are used.

A new method called update_configuration() is introduced in the OpScheduler
base class to facilitate propagation of changes to a config option
that is not user initiated. This method helps apply changes and
update any internal variables associated with a config option, as
long as the option is tracked. In this case, the change to the max osd
capacity is propagated to each op shard using this method. In the
future, the method can be used to propagate changes to advanced
config option(s) that the user is not expected to modify.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…cessed

There could be rare instances when employing the mclock scheduler where a
worker thread for a shard may not get an immediate work item to process.
Such items are designated as future work items. In such cases, the
_process() loop waits until the time indicated by the scheduler before
attempting to dequeue from the scheduler queue again. If there are
multiple threads per shard, a thread may not get an immediate item for
a long time. This time could exceed the heartbeat timeout for the thread
and result in heartbeat timeouts reported for the osd in question. To
prevent this, the heartbeat timeout for the thread is disabled before
waiting for an item and enabled once the wait period is over.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
With the mclock scheduler enabled, the recovery throughput is throttled based
on factors like the type of mclock profile enabled and the OSD capacity,
among others. Due to this, the recovery times may vary, and therefore the
existing timeout of 120 secs may not be sufficient.

To address the above, a new method called _is_inprogress_or_complete() is
introduced in the TestProgress Class that checks if the event with the
specified 'id' is in progress by checking the 'progress' key of the
progress command response. This method also handles the corner case where
the event completes just before it's called.

The existing wait_until_true() method in the CephTestCase class is
modified to accept another function argument called "check_fn". In the
"test_turn_off_module" test, which has been observed to fail for the
reasons described above, this is set to the _is_inprogress_or_complete()
function described earlier. A retry mechanism with a maximum of 5
attempts is introduced after the first timeout is hit. This means that
the wait can extend up to a maximum of 600 secs (120 secs * 5) as long as
there is recovery progress reported by the 'ceph progress' command result.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
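
The retry-while-progressing wait described in that commit can be sketched as a small loop. This is a hypothetical model (the real teuthology helper is Python and sleeps between polls; here one inner iteration stands in for one poll interval, and all names are illustrative):

```cpp
#include <cassert>
#include <functional>

// Illustrative model: poll pred() for up to polls_per_window iterations
// (one timeout window); after each window that expires, retry only while
// check_fn() still reports progress, up to max_windows windows in total
// (e.g. 5 windows of 120s each => 600s maximum wait).
bool wait_until_true_with_retry(const std::function<bool()>& pred,
                                const std::function<bool()>& check_fn,
                                int polls_per_window,
                                int max_windows) {
  for (int window = 0; window < max_windows; ++window) {
    for (int i = 0; i < polls_per_window; ++i) {
      if (pred()) {
        return true;  // condition met within this window
      }
      // the real helper would sleep(interval) here
    }
    if (!check_fn()) {
      return false;  // no progress reported: give up early
    }
  }
  return false;  // exhausted all windows despite ongoing progress
}
```

In the test described above, pred() would query the progress module for completion and check_fn() would be _is_inprogress_or_complete(), so a slow-but-progressing recovery keeps extending the wait while a stalled one fails fast.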
@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from 3585833 to a438090 (June 2, 2021 09:02)
sseshasa (Author) commented Jun 2, 2021:

jenkins test docs

neha-ojha (Member) commented:

> [quoting sseshasa's May 27 comment above]

I noticed some dead jobs timing out in wait_for_*, but I can't say much without logs. Let's merge after creating trackers for those and for the osd-rep-recov-eio.sh failure.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
@sseshasa force-pushed the wip-osd-benchmark-for-mclock branch from a438090 to 76420f9 (June 3, 2021 09:16)
sseshasa (Author) commented Jun 3, 2021:

> [quoting the exchange above]

Tracker for the failed job:
6145022 - osd-rep-recov-eio.sh: https://tracker.ceph.com/issues/51074

Trackers for the dead jobs:
6145021 - wait_for_recovery timeout: https://tracker.ceph.com/issues/51076
6145024 - Updated existing tracker: https://tracker.ceph.com/issues/50806 since the same assertion failure is hit.
