osd: Run osd bench test to override default max osd capacity for mclock #41308

neha-ojha merged 5 commits into ceph:master
Conversation
Log messages from testing:
- Underlying device: SSD
- Underlying device: HDD
Force-pushed from def4c08 to 4219c37
    } else {
      double rate = count / elapsed;
      double iops = rate / bsize;
      dout(1) << __func__
I am not sure we need to log all the above lines at level 1 in this function. Only adding the following summary line to the osd log at level 1 and maybe cluster log as well, when we have successfully overridden the defaults, should be enough.
'ceph config show' displaying it is good enough - no need to add it to the cluster log. agree about the osd log levels
Second, I'd suggest not putting this in clog.
I have used derr for the error case. As suggested, I just added a log at level 1 when the test succeeds.
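For reference, the override discussed here boils down to the small computation in the snippet above; a minimal Python sketch of the same arithmetic (the actual code is C++ inside the OSD, and the function name here is illustrative):

```python
def osd_bench_iops(count_bytes: int, elapsed_secs: float, bsize_bytes: int) -> float:
    """Mirrors the snippet above: rate is bytes/sec, iops = rate / block size."""
    rate = count_bytes / elapsed_secs
    return rate / bsize_bytes

# 100 objects of 4 MiB written in 10 s with a 4 KiB block size
print(osd_bench_iops(100 * 4 * 1024 * 1024, 10.0, 4096))  # -> 10240.0
```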
Force-pushed from 4219c37 to f87b7f7
Force-pushed from 61c587f to de2eb9a
neha-ojha left a comment:
A few nits on the documentation; the code changes look fine.
       your Ceph configuration file).
    1. Bring up your Ceph cluster and login to the Ceph node hosting the OSDs that
       you wish to benchmark.
    2. Ensure that the bluestore throttle options (i.e.
do we need this? these will be set to defaults as per the setup procedure provided so far
Sure, I can remove this since defaults will be set.
On another note, it may be worthwhile to come up with a script that performs these steps and determines the bluestore throttle values that work. We could, for example, set a tolerance of some percentage of the baseline throughput below which the script stops and displays the throttle values. This obviously cannot be done during OSD init since it consumes time. What do you think?
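A sketch of the control loop such a script could use, assuming a hypothetical run_bench(throttle) helper that applies a bluestore throttle value, runs 'osd bench', and returns the measured IOPS (Python here purely for illustration):

```python
def tune_throttle(run_bench, baseline_iops, throttles, tolerance=0.10):
    """Walk candidate throttle values; stop at the first one whose measured
    throughput drops more than `tolerance` below the baseline, and report
    the last acceptable value (or None if none was acceptable)."""
    best = None
    for t in throttles:
        iops = run_bench(t)  # hypothetical helper: run 'osd bench' with throttle t
        if iops < baseline_iops * (1.0 - tolerance):
            break
        best = t
    return best

# Example with a fake benchmark: throughput degrades as the throttle shrinks.
fake = {65536: 1000.0, 32768: 950.0, 16384: 800.0}
print(tune_throttle(lambda t: fake[t], 1000.0, [65536, 32768, 16384]))  # -> 32768
```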
Force-pushed from e5e747c to ecd2c7b
Force-pushed from ecd2c7b to b4811c6
Added commit 379698b to fix the TestProgress failure seen in the teuthology run (JobID: 6118355). Ran the above test again multiple times; all the tests passed in all the runs. Some tests, e.g. 6124846/49/50/53, did encounter the scenario that needed a few more seconds for the recovery to complete.

The failures pertaining to OSD_DOWN health check warnings were caused by heartbeat timeouts: the worker thread of an op shard did not get an item to process immediately, and the scheduler kept returning a status indicating when the next request could be dequeued (a future request). This went on continuously for more than 15 secs, after which the heartbeat timeouts started to appear, resulting in the concerned osd being marked down. The fix is to disable heartbeats when there are no immediate requests to process and re-enable them once a work item is ready for processing. See commit a543550.

@neha-ojha @jdurgin @athanatos @tchaikov Please review the above commit for correctness.

Test results with the above commit:

Will re-run the failed jobs again for completeness.
jenkins retest this please
Force-pushed from b4811c6 to 0e51e97
Force-pushed from 662e0cc to be97e72
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
Force-pushed from be97e72 to 9bc3178
Force-pushed from 9bc3178 to b0c6e4f
jenkins test make check
The progress module test got fixed in the above run. However, there was a failure of a job with ceph_objectstore_test (see https://tracker.ceph.com/issues/50903) due to the delay in osd initialization caused by the duration of the 'osd bench' test. The 'osd bench' duration was improved, and the following are the runs post the fix:

ceph_objectstore_test:

Re-run of failed and dead jobs from the latest run:

@neha-ojha I think this PR is good to be merged now. Please take a look and merge it if you concur.
Force-pushed from b0c6e4f to 3585833
Remove the generic "osd_mclock_max_capacity_iops" option and use the "osd_mclock_max_capacity_iops_[hdd,ssd]" options instead. It is better to have a clear indication of the type of the underlying device; this helps avoid confusion when trying to read or override the options.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
If the mclock scheduler is enabled, run the osd bench test as part of the osd initialization sequence in order to determine the max osd capacity. The iops determined by the test are used to override the default osd_mclock_max_capacity_iops_[hdd,ssd] option depending on the underlying device type. The test performs random writes of 100 objects of 4 MiB size using a 4 KiB block size. The existing test, which was part of asok_command(), is factored out into a separate method called run_osd_bench_test() so that it can be used for both purposes. If the test fails, the default values for the above mentioned options are used.

A new method called update_configuration() is introduced in the OpScheduler base class to facilitate propagation of changes to a config option that is not user initiated. This method helps in applying changes and updating any internal variables associated with a config option as long as it is tracked. In this case, the change to the max osd capacity is propagated to each op shard using the mentioned method. In the future this method can be useful to propagate changes to advanced config option(s) that the user is not expected to modify.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
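A hedged sketch of the propagation path this commit describes (Python stand-in; the real code is C++, OpScheduler and update_configuration() are names from the source, while the shard plumbing and the numeric values here are illustrative):

```python
class OpScheduler:
    """Base class: schedulers that track a config key react to changes here."""
    def update_configuration(self, key: str, value: str) -> None:
        pass  # default: ignore untracked options

class MClockScheduler(OpScheduler):
    def __init__(self):
        self.max_capacity_iops = 315.0  # illustrative default

    def update_configuration(self, key: str, value: str) -> None:
        # React only to the option this scheduler tracks.
        if key == "osd_mclock_max_capacity_iops_hdd":
            self.max_capacity_iops = float(value)

# After a successful bench the OSD pushes the override to every op shard:
shards = [MClockScheduler() for _ in range(3)]
for s in shards:
    s.update_configuration("osd_mclock_max_capacity_iops_hdd", "450.0")
print([s.max_capacity_iops for s in shards])  # -> [450.0, 450.0, 450.0]
```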
…cessed

There could be rare instances when employing the mclock scheduler where a worker thread for a shard does not get an immediate work item to process. Such items are designated as future work items. In such cases, the _process() loop waits until the time indicated by the scheduler before attempting a dequeue from the scheduler queue again. If there are multiple threads per shard, a thread may not get an immediate item for a long time. This time could exceed the heartbeat timeout for the thread and result in heartbeat timeouts being reported for the osd in question. To prevent this, the heartbeat timeouts for the thread are disabled before waiting for an item and enabled once the wait period is over.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
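The fix follows a suspend/wait/resume pattern around the idle wait; a simplified Python sketch (the heartbeat handle and method names are illustrative stand-ins for Ceph's C++ heartbeat machinery):

```python
import time

class HeartbeatHandle:
    """Stand-in: a timeout of 0 means heartbeat checks are suspended."""
    def __init__(self, timeout: float):
        self.timeout = timeout

    def reset_timeout(self, timeout: float) -> None:
        self.timeout = timeout

def wait_for_future_item(hb: HeartbeatHandle, wait_secs: float,
                         default_timeout: float = 15.0) -> None:
    # No immediate work item: suspend heartbeat checks so the idle wait
    # is not misreported as a stuck worker thread.
    hb.reset_timeout(0)
    time.sleep(wait_secs)              # wait until the scheduler's indicated time
    hb.reset_timeout(default_timeout)  # re-enable before processing the item

hb = HeartbeatHandle(15.0)
wait_for_future_item(hb, 0.01)
print(hb.timeout)  # -> 15.0
```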
With the mclock scheduler enabled, the recovery throughput is throttled based on factors like the type of mclock profile enabled and the OSD capacity, among others. Due to this the recovery times may vary, and therefore the existing timeout of 120 secs may not be sufficient.

To address the above, a new method called _is_inprogress_or_complete() is introduced in the TestProgress class that checks if the event with the specified 'id' is in progress by checking the 'progress' key of the progress command response. This method also handles the corner case where the event completes just before it is called.

The existing wait_until_true() method in the CephTestCase class is modified to accept another function argument called "check_fn". This is set to the _is_inprogress_or_complete() function described earlier in the "test_turn_off_module" test, which has been observed to fail for the reasons already described above.

A retry mechanism of a maximum of 5 attempts is introduced after the first timeout is hit. This means that the wait can extend up to a maximum of 600 secs (120 secs * 5) as long as there is recovery progress reported by the 'ceph progress' command result.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
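The retry logic can be sketched as follows (Python; wait_until_true and check_fn are the names used in the commit, everything else, including the shortened timeouts, is illustrative):

```python
import time

def wait_until_true(predicate, timeout, period=0.01, check_fn=None, retries=5):
    """Wait for predicate(); after each timeout, retry (up to `retries` windows)
    only while check_fn reports that progress is still being made."""
    for _ in range(retries):
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if predicate():
                return True
            time.sleep(period)
        if check_fn is None or not check_fn():
            break  # no progress observed: give up instead of retrying
    return False

# Example: the condition becomes true on the third check, while check_fn
# keeps reporting progress, so the wait retries instead of failing.
state = {"calls": 0}
def pred():
    state["calls"] += 1
    return state["calls"] >= 3
print(wait_until_true(pred, timeout=0.05, check_fn=lambda: True))  # -> True
```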
Force-pushed from 3585833 to a438090
jenkins test docs
I noticed some dead jobs timing out in wait_for_*, but can't say much without logs. Let's merge after creating trackers for those and for the osd-rep-recov-eio.sh failure.
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Force-pushed from a438090 to 76420f9
Tracker for the failed job:

Trackers for the dead jobs:
If the mclock scheduler is enabled, run the osd bench test as part of the osd
initialization sequence in order to determine the max osd capacity. The
iops determined as part of the test are used to override the default
osd_mclock_max_capacity_iops_[hdd,ssd].
The test performs random writes of 100 objects of 4MiB size using
4KiB blocksize. The existing test which was a part of asok_command() is
factored out into a separate method called run_osd_bench_test() so that it
can be used for both purposes.
A new method called update_configuration() is introduced in the OpScheduler
base class to facilitate propagation of changes to a config option
that is not user initiated. This method helps in applying changes and
updating any internal variables associated with a config option as
long as it is tracked. In this case, the change to the max osd capacity
is propagated to each op shard using the mentioned method. In the
future this method can be useful to propagate changes to advanced
config option(s) that the user is not expected to modify.
Update mclock-config-ref to reflect automated OSD benchmarking.
Signed-off-by: Sridhar Seshasayee sseshasa@redhat.com
Checklist
Show available Jenkins commands:
- jenkins retest this please
- jenkins test classic perf
- jenkins test crimson perf
- jenkins test signed
- jenkins test make check
- jenkins test make check arm64
- jenkins test submodules
- jenkins test dashboard
- jenkins test api
- jenkins test docs
- jenkins render docs
- jenkins test ceph-volume all
- jenkins test ceph-volume tox