Skip to content

pacific: osd: Run osd bench test to override default max osd capacity for mclock#41731

Merged
yuriw merged 13 commits intoceph:pacificfrom
sseshasa:wip-51117-pacific
Sep 24, 2021
Merged

pacific: osd: Run osd bench test to override default max osd capacity for mclock#41731
yuriw merged 13 commits intoceph:pacificfrom
sseshasa:wip-51117-pacific

Conversation

@sseshasa
Copy link
Contributor

@sseshasa sseshasa commented Jun 7, 2021

backport tracker: https://tracker.ceph.com/issues/51117


backport of #41308
parent tracker: https://tracker.ceph.com/issues/51116

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/master/src/script/ceph-backport.sh

Note: The initial backport was created using the backport script. The latest commits (i.e. from commit 0a5f87c) added to the PR were cherry-picked.

Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to backport a feature to pacific? also i'd suggest put this backport on hold because of the regression of https://tracker.ceph.com/issues/51074.

@jdurgin
Copy link
Member

jdurgin commented Jun 9, 2021

feature vs bug is not black and white. in this case, imo it's a usability issue - asking a user to compile and run fio with a librbd backend to be able to try out qos in pacific is too much effort.

I do think we should hold off on merging this backport until the subsequent changes around this area are completed, namely:

  1. fixing the tests
  2. persisting the benchmark result in the config
  3. not running the benchmark every time - if it's already set, don't rerun, unless we set an option (or some other way of indicating to update it tbd in master)

@neha-ojha
Copy link
Member

Needs to include #41782

Remove the generic "osd_mclock_max_capacity_iops" option and use the
"osd_mclock_max_capacity_iops_[hdd,ssd]" options. It is better to have a
clear indication about the type of underlying device. This helps in
avoiding confusion when trying to read or override the options.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 6ad38a2)

Conflicts:
    src/common/options/osd.yaml.in
- Removed non-existent file: src/common/options/osd.yaml.in since the
  switch to yaml for config options is not available in pacific yet.
- Removed config option "osd_mclock_max_capacity_iops" from options.cc.
If mclock scheduler is enabled, run the osd bench test as part of osd
initialization sequence in order to determine the max osd capacity. The
iops determined as part of the test is used to override the default
osd_mclock_max_capacity_iops_[hdd,ssd] option depending on the
underlying device type.

The test performs random writes of 100 objects of 4MiB size using
4KiB blocksize. The existing test which was a part of asok_command() is
factored out into a separate method called run_osd_bench_test() so that it
can be used for both purposes. If the test fails, the default values
for the above mentioned options are used.

A new method called update_configuration() in introduced in OpScheduler
base class to facilitate propagation of changes to a config option
that is not user initiated. This method helps in applying changes and
update any internal variable associated with a config option as
long as it is tracked. In this case, the change to the max osd capacity
is propagated to each op shard using the mentioned method. In the
future this method can be useful to propagate changes to advanced
config option(s) that the user is not expected to modify.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit db6c995)

Conflicts:
    src/osd/OSD.cc
- Retained use of cmd_getval() since cmd_getval_or() is not available
  for parsing the 'osd bench' command options.
…cessed

There could be rare instances when employing the mclock scheduler where a
worker thread for a shard may not get an immediate work item to process.
Such items are designated as future work items. In such cases, the
_process() loop waits until the time indicated by the scheduler to attempt
a dequeue from the scheduler queue again. It may so happen that if there
are multiple threads per shard, a thread may not get an immediate item for
a long time. This time could exceed the heartbeat timeout for the thread
and result in hearbeat timeouts reported for the osd in question. To
prevent this, the heartbeat timeouts for the thread is disabled before
waiting for an item and enabled once the wait period is over.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 9a95492)
With mclock scheduler enabled, the recovery throughput is throttled based
on factors like the type of mclock profile enabled, the OSD capacity among
others. Due to this the recovery times may vary and therefore the existing
timeout of 120 secs may not be sufficient.

To address the above, a new method called _is_inprogress_or_complete() is
introduced in the TestProgress Class that checks if the event with the
specified 'id' is in progress by checking the 'progress' key of the
progress command response. This method also handles the corner case where
the event completes just before it's called.

The existing wait_until_true() method in the CephTestCase Class is
modified to accept another function argument called "check_fn". This is
set to the _is_inprogress_or_complete() function described earlier in the
"test_turn_off_module" test that has been observed to fail due to the
reasons already described above. A retry mechanism of a maximum of 5
attempts is introduced after the first timeout is hit. This means that
the wait can extend up to a maximum of 600 secs (120 secs * 5) as long as
there is recovery progress reported by the 'ceph progress' command result.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 328271d)
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 76420f9)

Conflicts:
    doc/rados/configuration/mclock-config-ref.rst
- Removed "confval" directives as it is not yet available in pacific.
Add method mon_cmd_set_config() to save config option key and
value to the MON store. The ConfigMonitor command, 'config set' is
used to achieve this.

A corresponding get method is unnecessary since any config option
found on the MON store is loaded during OSD boot-up and set using
the md_config_t::set_mon_vals() method. Therefore, the existing
versions of ConfigProxy::get_val() method are sufficient to get
the latest value for the config option.

Fixes: https://tracker.ceph.com/issues/51464
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 1fca4bd)
…tion

Add wrapper method "get_val_default()" to the ConfigProxy class that takes
the config option key to search. This method in-turn calls another method
with the same name added to md_config_t class that does the actual work of
searching for the config option. If the option is valid, _get_val_default()
is used to get the default value. Otherwise, the wrapper method returns
std::nullopt.

Fixes: https://tracker.ceph.com/issues/51464
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 9438e5a)
Use "mon_cmd_set_config()" to store the OSD's max iops capacity to
the MON store during the first bring-up. Don't run the OSD benchmark
test on subsequent boot-ups if a previously persisted iops capacity is
available on the MON store and is different from the default iops
capacity.

Add the 'force_run_benchmark' flag to force a run of the benchmark
in case the default iops capacity cannot be determined.

Fixes: https://tracker.ceph.com/issues/51464
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 10f8b79)
The new config option "osd_mclock_force_run_benchmark_on_init" is
introduced to allow a user to force run the OSD benchmark test on every
OSD boot-up even if the historical data about the OSD's iops capacity is
available on the MON config store. The 'force_run_benchmark' flag is set
to the value indicated by the new config option.

By default this new config option is set to false.

The utility of this option is to help refresh the OSD iops capacity
when the underlying device's performance characteristics have changed
significantly. In such cases, the OSD can be restarted with this option
enabled temporarily. Once the new iops capacity is updated to the MON
store, this option can be removed from the OSD's start-up config.

Fixes: https://tracker.ceph.com/issues/51464
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 8725a10)

Conflicts:
    src/common/options/osd.yaml.in
- Removed non-existent file: src/common/options/osd.yaml.in since the
  switch to yaml for config options is not available in pacific yet.
- Added new config option "osd_mclock_osd_mclock_force_run_benchmark_on_init"
  to options.cc.
@sseshasa sseshasa force-pushed the wip-51117-pacific branch from 2bfacba to 5bee321 Compare August 3, 2021 06:27
@sseshasa
Copy link
Contributor Author

sseshasa commented Aug 3, 2021

Summary of master trackers and associated PRs that are a part of this back-port PR:

  1. Original master tracker: https://tracker.ceph.com/issues/51116
    Original PR: osd: Run osd bench test to override default max osd capacity for mclock #41308

  2. Added a subset of the commits from PR: osd: Add mechanism to avoid running OSD bench on every OSD init when mclock_scheduler is enabled #42133 associated with tracker: https://tracker.ceph.com/issues/51464. A couple of things to note:

  3. Added a subset of the commits from PR: osd: Add config option to skip running the osd benchmark during init and update documentation. #42604 associated with tracker: https://tracker.ceph.com/issues/52025.

  4. Add commits from PR: mon/MonCap: Update osd profile to allow cmd to set iops capacity on mon db #42853 associated with tracker https://tracker.ceph.com/issues/52329

@neha-ojha
Copy link
Member

Needs to include #41782

My bad, we don't need this because we are not changing the default to mclock_scheduler in pacific, let's include three things in this backport PR:

1 . mechanism to automate benchmarking for qos
2. ability to persist osd_mclock_max_capacity if the benchmark
3. any test changes needed for (1) and (2)

@sseshasa sseshasa force-pushed the wip-51117-pacific branch from 5bee321 to b48d709 Compare August 4, 2021 10:42
@sseshasa
Copy link
Contributor Author

sseshasa commented Aug 4, 2021

why do we need to backport a feature to pacific? also i'd suggest put this backport on hold because of the regression of https://tracker.ceph.com/issues/51074.

@tchaikov The changes in https://tracker.ceph.com/issues/51074 will not be backported to pacific as we don't plan on changing the default scheduler. Therefore, all commits related to qa/standalone script are now removed. Can you please unblock this review?

@neha-ojha
Copy link
Member

https://pulpito.ceph.com/sseshasa-2021-08-06_04:49:51-rados-wip-sseshasa2-testing-2021-08-04-1847-pacific-distro-basic-smithi/

The failed jobs are unrelated, and are tracked in the following tickets.

https://tracker.ceph.com/issues/52116
https://tracker.ceph.com/issues/51728
https://tracker.ceph.com/issues/52078 - needs backport PR
https://tracker.ceph.com/issues/52129

@sseshasa could you please rerun the dead job that failed due to an infra issue? This PR can be merged if that rerun looks good.

@sseshasa
Copy link
Contributor Author

sseshasa commented Aug 11, 2021

https://pulpito.ceph.com/sseshasa-2021-08-06_04:49:51-rados-wip-sseshasa2-testing-2021-08-04-1847-pacific-distro-basic-smithi/

The failed jobs are unrelated, and are tracked in the following tickets.

https://tracker.ceph.com/issues/52116
https://tracker.ceph.com/issues/51728
https://tracker.ceph.com/issues/52078 - needs backport PR
https://tracker.ceph.com/issues/52129

@sseshasa could you please rerun the dead job that failed due to an infra issue? This PR can be merged if that rerun looks good.

Thanks for looking into the failures. I couldn't get to them the last couple of days. I initiated a re-run of the dead job and it passed:

https://pulpito.ceph.com/sseshasa-2021-08-11_05:19:22-rados-wip-sseshasa2-testing-2021-08-04-1847-pacific-distro-basic-smithi/

@sseshasa
Copy link
Contributor Author

why do we need to backport a feature to pacific? also i'd suggest put this backport on hold because of the regression of https://tracker.ceph.com/issues/51074.

@tchaikov The changes in https://tracker.ceph.com/issues/51074 will not be backported to pacific as we don't plan on changing the default scheduler. Therefore, all commits related to qa/standalone script are now removed. Can you please unblock this review?

@tchaikov Can you please unblock the PR based on my comments above? Thanks!

@sseshasa
Copy link
Contributor Author

why do we need to backport a feature to pacific? also i'd suggest put this backport on hold because of the regression of https://tracker.ceph.com/issues/51074.

@tchaikov The changes in https://tracker.ceph.com/issues/51074 will not be backported to pacific as we don't plan on changing the default scheduler. Therefore, all commits related to qa/standalone script are now removed. Can you please unblock this review?

@tchaikov Can you please unblock the PR based on my comments above? Thanks!

@tchaikov Can you take a look at my comments above and unblock? Thanks.

@tchaikov tchaikov dismissed stale reviews from neha-ojha and themself August 17, 2021 06:49

dismissed.

@tchaikov tchaikov requested a review from neha-ojha August 17, 2021 06:50
@tchaikov
Copy link
Contributor

@neha-ojha sorry for dismissing your approval by accident. could you kindly redo the approve?

@neha-ojha
Copy link
Member

@neha-ojha sorry for dismissing your approval by accident. could you kindly redo the approve?

@tchaikov no worries, done.

@sseshasa
Copy link
Contributor Author

TBD: Backport #42853 as well.

Introduce a new dev config option "osd_mclock_skip_benchmark" that
when set skips running the OSD benchmark on start-up. By default
this option is disabled. This is useful in the following scenarios:

 - Dev/CI testing,
 - Configurations that don't need QoS.

If the option is enabled, the default OSD iops capacity is read from
osd_mclock_max_capacity_iops_[hdd,ssd].

Fixes: https://tracker.ceph.com/issues/52025
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 6ca32bd)

 Conflicts:
	src/common/options/osd.yaml.in
 - Removed non-existent file: src/common/options/osd.yaml.in since the
   switch to yaml for config options is not available in pacific yet.
…city.

Update the steps in the mclock config reference document to manually
override an OSDs max IOPS capacity. Provide information on the alternative
ways to override the osd_mclock_max_capacity_iops_[hdd,ssd] options for
an OSD.

Fixes: https://tracker.ceph.com/issues/52025
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 7c25511)
…on db

The default mon caps for osds is set to "allow profile osd", which allows
only "rw" capability. Osds with mclock scheduler enabled store their max
iops capacity on the mon config store. This can be achieved by executing
the "config set" command. However, since the osd(s) by default do not have
the execute permission, the command fails with "Permission denied" error.

Therefore, modify the default osd profile to allow running the "config set"
command with restriction to only set keys with name matching either (regex)
"osd_mclock_max_capacity_iops_hdd" or "osd_mclock_max_capacity_iops_ssd"
so that the osd has the permission to update the mon config store with the
desired information.

Fixes: https://tracker.ceph.com/issues/52329
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 2cdbe81)
Assign the default caps for osds to be the same as what the AuthMonitor
sets for a new osd. See AuthMonitor::validate_osd_new() which sets the
following caps for a new osd:

 mon='allow profile osd'
 mgr='allow profile osd'
 osd=''allow *'

When an actual real world cluster is deployed, the above caps are applied.
Unless the user modifies the defaults, a cluster will operate with the
above caps. Therefore, it makes sense to use the defaults when testing
Ceph so that issues if any due to the default settings may be caught and
fixed.

Therefore, the caps for the 'osd' type is reset to the default in
generate_caps(). The caps for 'mgr' already reflects the system defaults.
The caps for 'mds' type is not changed in this commit and will be
investigated and changed if necessary later.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 4b0dba2)
@github-actions github-actions bot added the mon label Sep 13, 2021
@sseshasa
Copy link
Contributor Author

sseshasa commented Sep 13, 2021

@neha-ojha @jdurgin The latest pushes include the following commits related to:

  • Skip OSD benchmark based on config option.
  • Update 'osd profile' caps to allow running the 'config set' command to set the max OSD capacity.

As already decided, I have not included any qa related changes. If the changes look good, then we can remove the DNM label and run this through teuthology.

@sseshasa
Copy link
Contributor Author

sseshasa commented Sep 24, 2021

Pulpito Run Info:
https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/

Failures (Unrelated):

  1. https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402688
    Tracker issue: https://tracker.ceph.com/issues/52116

  2. https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402710
    Tracker issue not found. Probably a known issue. Error seen during setting up of the environment and is unrelated to
    the PR.

  3. https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402717
    Tracker Issue: https://tracker.ceph.com/issues/51728

  4. https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402805
    Tracker issue: https://tracker.ceph.com/issues/52116

  5. https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402894
    Same as 2 above

  6. https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402901
    Same as 3 above.

  7. https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402919
    Same as 4 above.

Dead Jobs (Unrelated):

  1. https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402669
    Job actually passed. POST to paddles failed with errors.

  2. https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402760
    Tracker issue: https://tracker.ceph.com/issues/51076

  3. https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402802
    Tracker issue: https://tracker.ceph.com/issues/49888

  4. https://pulpito.ceph.com/yuriw-2021-09-22_14:54:43-rados-wip-yuri8-testing-2021-09-21-0415-pacific-distro-basic-smithi/6402896
    Test Failed Due to Socket Error during the unwinding phase after the upgrade test. This occurred due an infra issue
    and is unrelated to the PR.

@yuriw yuriw merged commit bbc9f0b into ceph:pacific Sep 24, 2021
@sseshasa sseshasa deleted the wip-51117-pacific branch September 27, 2021 05:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants