pacific: osd: Run osd bench test to override default max osd capacity for mclock#41731
pacific: osd: Run osd bench test to override default max osd capacity for mclock#41731yuriw merged 13 commits intoceph:pacificfrom
Conversation
tchaikov
left a comment
There was a problem hiding this comment.
why do we need to backport a feature to pacific? also i'd suggest put this backport on hold because of the regression of https://tracker.ceph.com/issues/51074.
|
feature vs bug is not black and white. in this case, imo it's a usability issue - asking a user to compile and run fio with a librbd backend to be able to try out qos in pacific is too much effort. I do think we should hold off on merging this backport until the subsequent changes around this area are completed, namely:
|
|
Needs to include #41782 |
Remove the generic "osd_mclock_max_capacity_iops" option and use the "osd_mclock_max_capacity_iops_[hdd,ssd]" options. It is better to have a clear indication about the type of underlying device. This helps in avoiding confusion when trying to read or override the options. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 6ad38a2) Conflicts: src/common/options/osd.yaml.in - Removed non-existent file: src/common/options/osd.yaml.in since the switch to yaml for config options is not available in pacific yet. - Removed config option "osd_mclock_max_capacity_iops" from options.cc.
If mclock scheduler is enabled, run the osd bench test as part of osd initialization sequence in order to determine the max osd capacity. The iops determined as part of the test is used to override the default osd_mclock_max_capacity_iops_[hdd,ssd] option depending on the underlying device type. The test performs random writes of 100 objects of 4MiB size using 4KiB blocksize. The existing test which was a part of asok_command() is factored out into a separate method called run_osd_bench_test() so that it can be used for both purposes. If the test fails, the default values for the above mentioned options are used. A new method called update_configuration() in introduced in OpScheduler base class to facilitate propagation of changes to a config option that is not user initiated. This method helps in applying changes and update any internal variable associated with a config option as long as it is tracked. In this case, the change to the max osd capacity is propagated to each op shard using the mentioned method. In the future this method can be useful to propagate changes to advanced config option(s) that the user is not expected to modify. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit db6c995) Conflicts: src/osd/OSD.cc - Retained use of cmd_getval() since cmd_getval_or() is not available for parsing the 'osd bench' command options.
…cessed There could be rare instances when employing the mclock scheduler where a worker thread for a shard may not get an immediate work item to process. Such items are designated as future work items. In such cases, the _process() loop waits until the time indicated by the scheduler to attempt a dequeue from the scheduler queue again. It may so happen that if there are multiple threads per shard, a thread may not get an immediate item for a long time. This time could exceed the heartbeat timeout for the thread and result in hearbeat timeouts reported for the osd in question. To prevent this, the heartbeat timeouts for the thread is disabled before waiting for an item and enabled once the wait period is over. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 9a95492)
With mclock scheduler enabled, the recovery throughput is throttled based on factors like the type of mclock profile enabled, the OSD capacity among others. Due to this the recovery times may vary and therefore the existing timeout of 120 secs may not be sufficient. To address the above, a new method called _is_inprogress_or_complete() is introduced in the TestProgress Class that checks if the event with the specified 'id' is in progress by checking the 'progress' key of the progress command response. This method also handles the corner case where the event completes just before it's called. The existing wait_until_true() method in the CephTestCase Class is modified to accept another function argument called "check_fn". This is set to the _is_inprogress_or_complete() function described earlier in the "test_turn_off_module" test that has been observed to fail due to the reasons already described above. A retry mechanism of a maximum of 5 attempts is introduced after the first timeout is hit. This means that the wait can extend up to a maximum of 600 secs (120 secs * 5) as long as there is recovery progress reported by the 'ceph progress' command result. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 328271d)
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 76420f9) Conflicts: doc/rados/configuration/mclock-config-ref.rst - Removed "confval" directives as it is not yet available in pacific.
Add method mon_cmd_set_config() to save config option key and value to the MON store. The ConfigMonitor command, 'config set' is used to achieve this. A corresponding get method is unnecessary since any config option found on the MON store is loaded during OSD boot-up and set using the md_config_t::set_mon_vals() method. Therefore, the existing versions of ConfigProxy::get_val() method are sufficient to get the latest value for the config option. Fixes: https://tracker.ceph.com/issues/51464 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 1fca4bd)
…tion Add wrapper method "get_val_default()" to the ConfigProxy class that takes the config option key to search. This method in-turn calls another method with the same name added to md_config_t class that does the actual work of searching for the config option. If the option is valid, _get_val_default() is used to get the default value. Otherwise, the wrapper method returns std::nullopt. Fixes: https://tracker.ceph.com/issues/51464 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 9438e5a)
Use "mon_cmd_set_config()" to store the OSD's max iops capacity to the MON store during the first bring-up. Don't run the OSD benchmark test on subsequent boot-ups if a previously persisted iops capacity is available on the MON store and is different from the default iops capacity. Add the 'force_run_benchmark' flag to force a run of the benchmark in case the default iops capacity cannot be determined. Fixes: https://tracker.ceph.com/issues/51464 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 10f8b79)
The new config option "osd_mclock_force_run_benchmark_on_init" is introduced to allow a user to force run the OSD benchmark test on every OSD boot-up even if the historical data about the OSD's iops capacity is available on the MON config store. The 'force_run_benchmark' flag is set to the value indicated by the new config option. By default this new config option is set to false. The utility of this option is to help refresh the OSD iops capacity when the underlying device's performance characteristics have changed significantly. In such cases, the OSD can be restarted with this option enabled temporarily. Once the new iops capacity is updated to the MON store, this option can be removed from the OSD's start-up config. Fixes: https://tracker.ceph.com/issues/51464 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 8725a10) Conflicts: src/common/options/osd.yaml.in - Removed non-existent file: src/common/options/osd.yaml.in since the switch to yaml for config options is not available in pacific yet. - Added new config option "osd_mclock_osd_mclock_force_run_benchmark_on_init" to options.cc.
2bfacba to
5bee321
Compare
My bad, we don't need this because we are not changing the default to mclock_scheduler in pacific, let's include three things in this backport PR: 1 . mechanism to automate benchmarking for qos |
5bee321 to
b48d709
Compare
@tchaikov The changes in https://tracker.ceph.com/issues/51074 will not be backported to pacific as we don't plan on changing the default scheduler. Therefore, all commits related to qa/standalone script are now removed. Can you please unblock this review? |
|
The failed jobs are unrelated, and are tracked in the following tickets. https://tracker.ceph.com/issues/52116 @sseshasa could you please rerun the dead job that failed due to an infra issue? This PR can be merged if that rerun looks good. |
Thanks for looking into the failures. I couldn't get to them the last couple of days. I initiated a re-run of the dead job and it passed: |
@tchaikov Can you please unblock the PR based on my comments above? Thanks! |
@tchaikov Can you take a look at my comments above and unblock? Thanks. |
|
@neha-ojha sorry for dismissing your approval by accident. could you kindly redo the approve? |
@tchaikov no worries, done. |
|
TBD: Backport #42853 as well. |
Introduce a new dev config option "osd_mclock_skip_benchmark" that when set skips running the OSD benchmark on start-up. By default this option is disabled. This is useful in the following scenarios: - Dev/CI testing, - Configurations that don't need QoS. If the option is enabled, the default OSD iops capacity is read from osd_mclock_max_capacity_iops_[hdd,ssd]. Fixes: https://tracker.ceph.com/issues/52025 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 6ca32bd) Conflicts: src/common/options/osd.yaml.in - Removed non-existent file: src/common/options/osd.yaml.in since the switch to yaml for config options is not available in pacific yet.
…city. Update the steps in the mclock config reference document to manually override an OSDs max IOPS capacity. Provide information on the alternative ways to override the osd_mclock_max_capacity_iops_[hdd,ssd] options for an OSD. Fixes: https://tracker.ceph.com/issues/52025 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 7c25511)
…on db The default mon caps for osds is set to "allow profile osd", which allows only "rw" capability. Osds with mclock scheduler enabled store their max iops capacity on the mon config store. This can be achieved by executing the "config set" command. However, since the osd(s) by default do not have the execute permission, the command fails with "Permission denied" error. Therefore, modify the default osd profile to allow running the "config set" command with restriction to only set keys with name matching either (regex) "osd_mclock_max_capacity_iops_hdd" or "osd_mclock_max_capacity_iops_ssd" so that the osd has the permission to update the mon config store with the desired information. Fixes: https://tracker.ceph.com/issues/52329 Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 2cdbe81)
Assign the default caps for osds to be the same as what the AuthMonitor sets for a new osd. See AuthMonitor::validate_osd_new() which sets the following caps for a new osd: mon='allow profile osd' mgr='allow profile osd' osd=''allow *' When an actual real world cluster is deployed, the above caps are applied. Unless the user modifies the defaults, a cluster will operate with the above caps. Therefore, it makes sense to use the defaults when testing Ceph so that issues if any due to the default settings may be caught and fixed. Therefore, the caps for the 'osd' type is reset to the default in generate_caps(). The caps for 'mgr' already reflects the system defaults. The caps for 'mds' type is not changed in this commit and will be investigated and changed if necessary later. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com> (cherry picked from commit 4b0dba2)
|
@neha-ojha @jdurgin The latest pushes include the following commits related to:
As already decided, I have not included any qa related changes. If the changes look good, then we can remove the DNM label and run this through teuthology. |
backport tracker: https://tracker.ceph.com/issues/51117
backport of #41308
parent tracker: https://tracker.ceph.com/issues/51116
this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/master/src/script/ceph-backport.sh
Note: The initial backport was created using the backport script. The latest commits (i.e. from commit 0a5f87c) added to the PR were cherry-picked.