qa: Disable OSD benchmark from running for tests #67058
Conversation
Disable OSD bench from benchmarking the OSDs for teuthology tests. This helps prevent a cluster warning from being raised when the measured IOPS value does not lie within the expected threshold range. The tests can rely on the built-in static values defined by osd_mclock_max_capacity_iops_[ssd|hdd], which should be good enough.
Fixes: https://tracker.ceph.com/issues/74501
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
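For reference, relying on the static values could be made explicit with a teuthology-style override (a sketch only: the values shown are the documented defaults for osd_mclock_max_capacity_iops_[ssd|hdd], and the exact stanza used by this PR may differ):

```yaml
overrides:
  ceph:
    conf:
      osd:
        # Documented default fallback capacities; pinned here only to
        # make the static values explicit (illustrative, not the PR's diff).
        osd_mclock_max_capacity_iops_hdd: 315
        osd_mclock_max_capacity_iops_ssd: 21500
```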
Teuthology Runs:
- Rados Suite: https://pulpito.ceph.com/sseshasa-2026-01-23_05:28:15-rados-main-distro-default-trial/
- orch/cephadm Suite:
Both runs do not show the OSD bench related cluster warnings.
Is this really desirable as a global config, @rzarzynski @ljflores? Shouldn't there be test coverage for the benchmark somewhere?
Why are we just disabling tests instead of fixing them? What does the warning mean? Have storage devices simply progressed enough that the max limit needs to be raised?
@batrick: I think @sseshasa has applied a global change for a global issue. The benchmark was run on every OSD in every job to determine a fundamental constant for mClock; it's not a test per se.
@djgalloway: the reason is simple: an issue which is likely just minor (the expected boundaries need tuning) generates so much noise ("Jobs: see 125 failed; Logs: https://pulpito.ceph.com/yuriw-2026-01-21_19:35:54-orch-reef-release-distro-default-trial/") that it overloads the reviewing of QA runs and potentially hides other problems. I'm fine with the merge as a makeshift workaround to let Sridhar analyze the problem and come up with a fix. I agree we should ultimately revert this commit.
I fear this will be forgotten. I'm not sure a revert is necessary, but the mclock QA tests should have this turned on explicitly. (Anywhere else too?)
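Turning the benchmark back on for the mClock-specific suites could look something like the override below. This is a sketch under an assumption: osd_mclock_force_run_benchmark_on_init appears in the mClock configuration reference, but whether that flag alone undoes what this PR disables is not confirmed by this thread.

```yaml
overrides:
  ceph:
    conf:
      osd:
        # Force the capacity benchmark to run at OSD boot in mclock QA
        # jobs only (assumption: this flag re-enables what the PR skips).
        osd_mclock_force_run_benchmark_on_init: true
```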
@batrick As @rzarzynski mentioned, the bench test was triggered as part of every test when the OSD(s) are brought up; it's not associated with any particular teuthology job. The bench test is performed to get an idea of the IOPS capacity of an OSD from the objectstore's perspective. The mClock scheduler eventually consumes this to allocate a specific quantum to the different services on the OSD. OSD bench itself is an existing tool with its own standalone test, and it is leveraged here for mClock's purpose.

For teuthology tests this is not a 'must have', since we don't use it to test the scheduling aspect of mClock; a generic static value is sufficient for the teuthology tests to run. There are deterministic tests for mClock scheduling that we run outside of teuthology, on machines whose environment is known and can be controlled. We use CBT for this purpose.

But there are cases where the bench test throws up unrealistic IOPS measurements, and to catch these a threshold range is defined based on the underlying device type. On the trial machines this threshold was breached (>80K IOPS), leading to the cluster warning. It's apparent that the devices on these machines are significantly faster than what was present on the smithi machines. In addition to bumping up the threshold values, which seems reasonable in this case, I am looking into the OSD bench tool, and possibly the fio_objectstore_tool, to improve the consistency and predictability of the results. The important thing to note is that we need a tool that closely mimics IOs with the objectstore in place in order to get a reasonably good estimate of the IOPS capacity at the OSD layer. We can use https://tracker.ceph.com/issues/74567 to track the progress.
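The fallback behaviour described above can be sketched as follows. This is illustrative only, not Ceph's actual code: the defaults and thresholds (315/500 IOPS for HDD, 21500/80000 IOPS for SSD) are taken from the mClock configuration docs and may differ across releases.

```python
# Sketch of the capacity sanity check described above: if the OSD bench
# measurement is outside a realistic range for the device type, fall
# back to the static default capacity and flag a cluster warning.
# NOT Ceph's implementation; option names/defaults are assumptions.

DEFAULTS = {
    "hdd": {"max_capacity_iops": 315.0, "threshold_iops": 500.0},
    "ssd": {"max_capacity_iops": 21500.0, "threshold_iops": 80000.0},
}

def effective_osd_capacity(measured_iops: float, device_type: str):
    """Return (capacity mClock should use, warn_flag)."""
    cfg = DEFAULTS[device_type]
    if measured_iops <= 0 or measured_iops > cfg["threshold_iops"]:
        # Unrealistic measurement: use the static default and warn.
        return cfg["max_capacity_iops"], True
    return measured_iops, False

# A fast NVMe device on a trial machine: >80K IOPS trips the warning.
print(effective_osd_capacity(95000.0, "ssd"))  # -> (21500.0, True)
print(effective_osd_capacity(15000.0, "ssd"))  # -> (15000.0, False)
```

Skipping the benchmark in teuthology, as this PR does, sidesteps the warn path entirely and leaves only the static defaults in play.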
As https://tracker.ceph.com/issues/74501 is "consumed" by this PR, its contents were copied into https://tracker.ceph.com/issues/74567 yesterday so that we don't forget.
I updated my comment above to mention the correct tracker. Thanks for pointing it out. |
I guess my point was: even if it's not related to a test, if we're hitting this condition, users/customers could be too, and it's best not to bury our heads in the sand. I'm happy to see a separate tracker opened to investigate alternatives to ignoring the warning.
Agreed. I just saw this and am slightly amazed this was already merged.