pacific: osd: Run osd bench test to override default max osd capacity for mclock by sseshasa · Pull Request #41731 · ceph/ceph

sseshasa · 2021-06-07T11:37:52Z

backport tracker: https://tracker.ceph.com/issues/51117

backport of #41308
parent tracker: https://tracker.ceph.com/issues/51116

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/master/src/script/ceph-backport.sh

Note: The initial backport was created using the backport script. The latest commits (i.e. from commit 0a5f87c) added to the PR were cherry-picked.

tchaikov

why do we need to backport a feature to pacific? also i'd suggest put this backport on hold because of the regression of https://tracker.ceph.com/issues/51074.

jdurgin · 2021-06-09T07:09:10Z

feature vs bug is not black and white. in this case, imo it's a usability issue - asking a user to compile and run fio with a librbd backend to be able to try out qos in pacific is too much effort.

I do think we should hold off on merging this backport until the subsequent changes around this area are completed, namely:

fixing the tests
persisting the benchmark result in the config
not running the benchmark every time - if it's already set, don't rerun, unless we set an option (or some other way of indicating to update it tbd in master)

neha-ojha · 2021-06-17T20:42:32Z

Needs to include #41782

Remove the generic "osd_mclock_max_capacity_iops" option and use the "osd_mclock_max_capacity_iops_[hdd,ssd]" options. It is better to have a clear indication about the type of underlying device. This helps in avoiding confusion when trying to read or override the options. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 6ad38a2) Conflicts: src/common/options/osd.yaml.in - Removed non-existent file: src/common/options/osd.yaml.in since the switch to yaml for config options is not available in pacific yet. - Removed config option "osd_mclock_max_capacity_iops" from options.cc.

If mclock scheduler is enabled, run the osd bench test as part of osd initialization sequence in order to determine the max osd capacity. The iops determined as part of the test is used to override the default osd_mclock_max_capacity_iops_[hdd,ssd] option depending on the underlying device type. The test performs random writes of 100 objects of 4MiB size using 4KiB blocksize. The existing test which was a part of asok_command() is factored out into a separate method called run_osd_bench_test() so that it can be used for both purposes. If the test fails, the default values for the above mentioned options are used. A new method called update_configuration() in introduced in OpScheduler base class to facilitate propagation of changes to a config option that is not user initiated. This method helps in applying changes and update any internal variable associated with a config option as long as it is tracked. In this case, the change to the max osd capacity is propagated to each op shard using the mentioned method. In the future this method can be useful to propagate changes to advanced config option(s) that the user is not expected to modify. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit db6c995) Conflicts: src/osd/OSD.cc - Retained use of cmd_getval() since cmd_getval_or() is not available for parsing the 'osd bench' command options.

…cessed There could be rare instances when employing the mclock scheduler where a worker thread for a shard may not get an immediate work item to process. Such items are designated as future work items. In such cases, the _process() loop waits until the time indicated by the scheduler to attempt a dequeue from the scheduler queue again. It may so happen that if there are multiple threads per shard, a thread may not get an immediate item for a long time. This time could exceed the heartbeat timeout for the thread and result in hearbeat timeouts reported for the osd in question. To prevent this, the heartbeat timeouts for the thread is disabled before waiting for an item and enabled once the wait period is over. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 9a95492)

With mclock scheduler enabled, the recovery throughput is throttled based on factors like the type of mclock profile enabled, the OSD capacity among others. Due to this the recovery times may vary and therefore the existing timeout of 120 secs may not be sufficient. To address the above, a new method called _is_inprogress_or_complete() is introduced in the TestProgress Class that checks if the event with the specified 'id' is in progress by checking the 'progress' key of the progress command response. This method also handles the corner case where the event completes just before it's called. The existing wait_until_true() method in the CephTestCase Class is modified to accept another function argument called "check_fn". This is set to the _is_inprogress_or_complete() function described earlier in the "test_turn_off_module" test that has been observed to fail due to the reasons already described above. A retry mechanism of a maximum of 5 attempts is introduced after the first timeout is hit. This means that the wait can extend up to a maximum of 600 secs (120 secs * 5) as long as there is recovery progress reported by the 'ceph progress' command result. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 328271d)

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 76420f9) Conflicts: doc/rados/configuration/mclock-config-ref.rst - Removed "confval" directives as it is not yet available in pacific.

Add method mon_cmd_set_config() to save config option key and value to the MON store. The ConfigMonitor command, 'config set' is used to achieve this. A corresponding get method is unnecessary since any config option found on the MON store is loaded during OSD boot-up and set using the md_config_t::set_mon_vals() method. Therefore, the existing versions of ConfigProxy::get_val() method are sufficient to get the latest value for the config option. Fixes: https://tracker.ceph.com/issues/51464 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 1fca4bd)

…tion Add wrapper method "get_val_default()" to the ConfigProxy class that takes the config option key to search. This method in-turn calls another method with the same name added to md_config_t class that does the actual work of searching for the config option. If the option is valid, _get_val_default() is used to get the default value. Otherwise, the wrapper method returns std::nullopt. Fixes: https://tracker.ceph.com/issues/51464 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 9438e5a)

Use "mon_cmd_set_config()" to store the OSD's max iops capacity to the MON store during the first bring-up. Don't run the OSD benchmark test on subsequent boot-ups if a previously persisted iops capacity is available on the MON store and is different from the default iops capacity. Add the 'force_run_benchmark' flag to force a run of the benchmark in case the default iops capacity cannot be determined. Fixes: https://tracker.ceph.com/issues/51464 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 10f8b79)

The new config option "osd_mclock_force_run_benchmark_on_init" is introduced to allow a user to force run the OSD benchmark test on every OSD boot-up even if the historical data about the OSD's iops capacity is available on the MON config store. The 'force_run_benchmark' flag is set to the value indicated by the new config option. By default this new config option is set to false. The utility of this option is to help refresh the OSD iops capacity when the underlying device's performance characteristics have changed significantly. In such cases, the OSD can be restarted with this option enabled temporarily. Once the new iops capacity is updated to the MON store, this option can be removed from the OSD's start-up config. Fixes: https://tracker.ceph.com/issues/51464 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 8725a10) Conflicts: src/common/options/osd.yaml.in - Removed non-existent file: src/common/options/osd.yaml.in since the switch to yaml for config options is not available in pacific yet. - Added new config option "osd_mclock_osd_mclock_force_run_benchmark_on_init" to options.cc.

sseshasa · 2021-08-03T06:28:33Z

Summary of master trackers and associated PRs that are a part of this back-port PR:

Original master tracker: https://tracker.ceph.com/issues/51116
Original PR: osd: Run osd bench test to override default max osd capacity for mclock #41308
Added a subset of the commits from PR: osd: Add mechanism to avoid running OSD bench on every OSD init when mclock_scheduler is enabled #42133 associated with tracker: https://tracker.ceph.com/issues/51464. A couple of things to note:
- The commits address the issues mentioned in the comment in this PR:
  pacific: osd: Run osd bench test to override default max osd capacity for mclock #41731 (comment)
- Since the default scheduler will not be changed to mclock_scheduler, the commits
  related to qa/standalone test scripts are not back-ported in this PR.
Added a subset of the commits from PR: osd: Add config option to skip running the osd benchmark during init and update documentation. #42604 associated with tracker: https://tracker.ceph.com/issues/52025.
Add commits from PR: mon/MonCap: Update osd profile to allow cmd to set iops capacity on mon db #42853 associated with tracker https://tracker.ceph.com/issues/52329

neha-ojha · 2021-08-03T18:56:59Z

Needs to include #41782

My bad, we don't need this because we are not changing the default to mclock_scheduler in pacific, let's include three things in this backport PR:

1 . mechanism to automate benchmarking for qos
2. ability to persist osd_mclock_max_capacity if the benchmark
3. any test changes needed for (1) and (2)

sseshasa · 2021-08-04T11:24:44Z

why do we need to backport a feature to pacific? also i'd suggest put this backport on hold because of the regression of https://tracker.ceph.com/issues/51074.

@tchaikov The changes in https://tracker.ceph.com/issues/51074 will not be backported to pacific as we don't plan on changing the default scheduler. Therefore, all commits related to qa/standalone script are now removed. Can you please unblock this review?

neha-ojha · 2021-08-10T21:30:00Z

https://pulpito.ceph.com/sseshasa-2021-08-06_04:49:51-rados-wip-sseshasa2-testing-2021-08-04-1847-pacific-distro-basic-smithi/

The failed jobs are unrelated, and are tracked in the following tickets.

https://tracker.ceph.com/issues/52116
https://tracker.ceph.com/issues/51728
https://tracker.ceph.com/issues/52078 - needs backport PR
https://tracker.ceph.com/issues/52129

@sseshasa could you please rerun the dead job that failed due to an infra issue? This PR can be merged if that rerun looks good.

sseshasa · 2021-08-11T05:25:45Z

https://pulpito.ceph.com/sseshasa-2021-08-06_04:49:51-rados-wip-sseshasa2-testing-2021-08-04-1847-pacific-distro-basic-smithi/

The failed jobs are unrelated, and are tracked in the following tickets.

https://tracker.ceph.com/issues/52116
https://tracker.ceph.com/issues/51728
https://tracker.ceph.com/issues/52078 - needs backport PR
https://tracker.ceph.com/issues/52129

@sseshasa could you please rerun the dead job that failed due to an infra issue? This PR can be merged if that rerun looks good.

Thanks for looking into the failures. I couldn't get to them the last couple of days. I initiated a re-run of the dead job and it passed:

https://pulpito.ceph.com/sseshasa-2021-08-11_05:19:22-rados-wip-sseshasa2-testing-2021-08-04-1847-pacific-distro-basic-smithi/

sseshasa · 2021-08-11T05:59:24Z

why do we need to backport a feature to pacific? also i'd suggest put this backport on hold because of the regression of https://tracker.ceph.com/issues/51074.

@tchaikov The changes in https://tracker.ceph.com/issues/51074 will not be backported to pacific as we don't plan on changing the default scheduler. Therefore, all commits related to qa/standalone script are now removed. Can you please unblock this review?

@tchaikov Can you please unblock the PR based on my comments above? Thanks!

sseshasa · 2021-08-16T12:06:35Z

why do we need to backport a feature to pacific? also i'd suggest put this backport on hold because of the regression of https://tracker.ceph.com/issues/51074.

@tchaikov The changes in https://tracker.ceph.com/issues/51074 will not be backported to pacific as we don't plan on changing the default scheduler. Therefore, all commits related to qa/standalone script are now removed. Can you please unblock this review?

@tchaikov Can you please unblock the PR based on my comments above? Thanks!

@tchaikov Can you take a look at my comments above and unblock? Thanks.

dismissed.

tchaikov · 2021-08-17T06:52:32Z

@neha-ojha sorry for dismissing your approval by accident. could you kindly redo the approve?

neha-ojha · 2021-08-17T16:46:16Z

@neha-ojha sorry for dismissing your approval by accident. could you kindly redo the approve?

@tchaikov no worries, done.

sseshasa · 2021-08-20T06:44:03Z

TBD: Backport #42853 as well.

Introduce a new dev config option "osd_mclock_skip_benchmark" that when set skips running the OSD benchmark on start-up. By default this option is disabled. This is useful in the following scenarios: - Dev/CI testing, - Configurations that don't need QoS. If the option is enabled, the default OSD iops capacity is read from osd_mclock_max_capacity_iops_[hdd,ssd]. Fixes: https://tracker.ceph.com/issues/52025 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 6ca32bd) Conflicts: src/common/options/osd.yaml.in - Removed non-existent file: src/common/options/osd.yaml.in since the switch to yaml for config options is not available in pacific yet.

…city. Update the steps in the mclock config reference document to manually override an OSDs max IOPS capacity. Provide information on the alternative ways to override the osd_mclock_max_capacity_iops_[hdd,ssd] options for an OSD. Fixes: https://tracker.ceph.com/issues/52025 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 7c25511)

…on db The default mon caps for osds is set to "allow profile osd", which allows only "rw" capability. Osds with mclock scheduler enabled store their max iops capacity on the mon config store. This can be achieved by executing the "config set" command. However, since the osd(s) by default do not have the execute permission, the command fails with "Permission denied" error. Therefore, modify the default osd profile to allow running the "config set" command with restriction to only set keys with name matching either (regex) "osd_mclock_max_capacity_iops_hdd" or "osd_mclock_max_capacity_iops_ssd" so that the osd has the permission to update the mon config store with the desired information. Fixes: https://tracker.ceph.com/issues/52329 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 2cdbe81)

Assign the default caps for osds to be the same as what the AuthMonitor sets for a new osd. See AuthMonitor::validate_osd_new() which sets the following caps for a new osd: mon='allow profile osd' mgr='allow profile osd' osd=''allow *' When an actual real world cluster is deployed, the above caps are applied. Unless the user modifies the defaults, a cluster will operate with the above caps. Therefore, it makes sense to use the defaults when testing Ceph so that issues if any due to the default settings may be caught and fixed. Therefore, the caps for the 'osd' type is reset to the default in generate_caps(). The caps for 'mgr' already reflects the system defaults. The caps for 'mds' type is not changed in this commit and will be investigated and changed if necessary later. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 4b0dba2)

sseshasa · 2021-09-13T11:45:42Z

@neha-ojha @jdurgin The latest pushes include the following commits related to:

Skip OSD benchmark based on config option.
Update 'osd profile' caps to allow running the 'config set' command to set the max OSD capacity.

As already decided, I have not included any qa related changes. If the changes look good, then we can remove the DNM label and run this through teuthology.

sseshasa · 2021-09-24T07:06:13Z

Pulpito Run Info:
https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/

Failures (Unrelated):

https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402688
Tracker issue: https://tracker.ceph.com/issues/52116
https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402710
Tracker issue not found. Probably a known issue. Error seen during setting up of the environment and is unrelated to
the PR.
https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402717
Tracker Issue: https://tracker.ceph.com/issues/51728
https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402805
Tracker issue: https://tracker.ceph.com/issues/52116
https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402894
Same as 2 above
https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402901
Same as 3 above.
https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402919
Same as 4 above.

Dead Jobs (Unrelated):

https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402669
Job actually passed. POST to paddles failed with errors.
https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402760
Tracker issue: https://tracker.ceph.com/issues/51076
https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402802
Tracker issue: https://tracker.ceph.com/issues/49888
https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402896
Test Failed Due to Socket Error during the unwinding phase after the upgrade test. This occurred due an infra issue
and is unrelated to the PR.

sseshasa added this to the pacific milestone Jun 7, 2021

sseshasa added the core label Jun 7, 2021

github-actions bot added common documentation tests labels Jun 7, 2021

sseshasa requested review from jdurgin and neha-ojha June 7, 2021 11:38

neha-ojha approved these changes Jun 7, 2021

View reviewed changes

neha-ojha added needs-qa pacific-batch-1 labels Jun 7, 2021

tchaikov previously requested changes Jun 8, 2021

View reviewed changes

neha-ojha added DNM and removed needs-qa pacific-batch-1 labels Jun 8, 2021

sseshasa added 9 commits August 3, 2021 11:24

sseshasa force-pushed the wip-51117-pacific branch from 2bfacba to 5bee321 Compare August 3, 2021 06:27

sseshasa force-pushed the wip-51117-pacific branch from 5bee321 to b48d709 Compare August 4, 2021 10:42

yuriw added wip-yuri8-testing and removed wip-yuri2-testing wip-yuri8-testing labels Aug 9, 2021

tchaikov requested a review from neha-ojha August 17, 2021 06:50

neha-ojha approved these changes Aug 17, 2021

View reviewed changes

yuriw added wip-yuri4-testing and removed wip-yuri4-testing labels Aug 17, 2021

neha-ojha added the DNM label Aug 20, 2021

sseshasa removed the wip-sseshasa2-testing label Sep 1, 2021

sseshasa added 4 commits September 13, 2021 16:34

github-actions bot added the mon label Sep 13, 2021

neha-ojha added needs-qa and removed DNM labels Sep 15, 2021

yuriw added the wip-yuri8-testing label Sep 21, 2021

yuriw merged commit bbc9f0b into ceph:pacific Sep 24, 2021

sseshasa deleted the wip-51117-pacific branch September 27, 2021 05:07

Conversation

sseshasa commented Jun 7, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tchaikov left a comment

Choose a reason for hiding this comment

Uh oh!

jdurgin commented Jun 9, 2021

Uh oh!

neha-ojha commented Jun 17, 2021

Uh oh!

sseshasa commented Aug 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

neha-ojha commented Aug 3, 2021

Uh oh!

sseshasa commented Aug 4, 2021

Uh oh!

neha-ojha commented Aug 10, 2021

Uh oh!

sseshasa commented Aug 11, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sseshasa commented Aug 11, 2021

Uh oh!

sseshasa commented Aug 16, 2021

Uh oh!

tchaikov commented Aug 17, 2021

Uh oh!

neha-ojha commented Aug 17, 2021

Uh oh!

sseshasa commented Aug 20, 2021

Uh oh!

sseshasa commented Sep 13, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sseshasa commented Sep 24, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

sseshasa commented Jun 7, 2021 •

edited

Loading

sseshasa commented Aug 3, 2021 •

edited

Loading

sseshasa commented Aug 11, 2021 •

edited

Loading

sseshasa commented Sep 13, 2021 •

edited

Loading

sseshasa commented Sep 24, 2021 •

edited

Loading