
qa: Disable OSD benchmark from running for tests. #67058

Merged
sseshasa merged 1 commit into ceph:main from sseshasa:wip-fix-iops-threshold-warning-74501
Jan 23, 2026

@sseshasa
Contributor

@sseshasa sseshasa commented Jan 23, 2026

Disable OSD bench from benchmarking the OSDs for teuthology tests. This is to help prevent a cluster warning pertaining to the IOPS value not lying within a typical threshold range from being raised.

The tests can rely on the built-in static values as defined by osd_mclock_max_capacity_iops_[ssd|hdd] which should be good enough.

Fixes: https://tracker.ceph.com/issues/74501
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
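
For illustration only, here is a minimal Python sketch of the kind of override described above, rendered as a ceph.conf-style `[osd]` fragment. The conversation does not show the actual change, so the option name `osd_mclock_skip_benchmark` is an assumption about which knob is involved, and the helper `render_osd_overrides` is hypothetical.

```python
def render_osd_overrides(overrides):
    """Render a dict of OSD options as a ceph.conf-style [osd] section."""
    lines = ["[osd]"]
    for key, value in sorted(overrides.items()):
        # Write booleans in the lowercase true/false form used in ceph.conf.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"    {key} = {value}")
    return "\n".join(lines)


# Assumed option name; the PR conversation does not confirm it.
QA_OSD_OVERRIDES = {"osd_mclock_skip_benchmark": True}

if __name__ == "__main__":
    print(render_osd_overrides(QA_OSD_OVERRIDES))
```

With the benchmark skipped, the OSDs would use the static osd_mclock_max_capacity_iops_[ssd|hdd] values mentioned in the description.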

@sseshasa
Contributor Author

Teuthology Runs:

Rados Suite:
Below is a re-run of failed tests from https://pulpito.ceph.com/lflores-2026-01-21_20:56:39-rados-main-distro-default-trial/
with this PR included:

https://pulpito.ceph.com/sseshasa-2026-01-23_05:28:15-rados-main-distro-default-trial/

orch/cephadm Suite:
https://pulpito.ceph.com/sseshasa-2026-01-23_08:23:46-orch:cephadm-main-distro-default-trial/

Neither run shows the OSD bench-related cluster warnings.

@ljflores ljflores requested a review from aclamk January 23, 2026 16:25
@sseshasa sseshasa merged commit 84d5b44 into ceph:main Jan 23, 2026
22 of 26 checks passed
@batrick
Member

batrick commented Jan 23, 2026

Is this really desirable as a global config @rzarzynski @ljflores ? Shouldn't there be test coverage for the benchmark somewhere?

@djgalloway
Contributor

Why are we just disabling tests instead of fixing them? What does the warning mean? Have storage devices just progressed enough that that max limit needs to be increased?

@rzarzynski
Contributor

@batrick: I think @sseshasa has applied a global change for a global issue. The benchmark was run on every OSD in every job to determine a fundamental constant for mClock. It's not a test per se.

@djgalloway: the reason is simple: an issue that is likely just minor (tuning the expected boundaries) generates so much noise ("Jobs: see 125 failed; Logs: https://pulpito.ceph.com/yuriw-2026-01-21_19:35:54-orch-reef-release-distro-default-trial/") that it overloads the review of QA runs and potentially hides other problems.

I'm fine with the merge as a makeshift workaround to let Sridhar analyze the problem and come up with a fix. I agree we should ultimately revert this commit.

@batrick
Member

batrick commented Jan 27, 2026

I fear this will be forgotten. I'm not sure a revert is necessary but the mclock QA tests should have this turned on explicitly. (Anywhere else too?)

@sseshasa
Contributor Author

sseshasa commented Jan 27, 2026

@batrick As @rzarzynski mentioned, the bench test was triggered as part of every test whenever OSD(s) are brought up. It's not associated with the teuthology job. The bench test is performed to get an idea of the IOPS capacity of an OSD from the objectstore's perspective. The mClock scheduler eventually consumes this to allocate a specific quantum to different services on the OSD. The OSD bench test itself is an existing tool which has its own standalone test and is leveraged for mClock's purpose.

For teuthology tests, this is not a 'must have' as we don't use it to test the scheduling aspect of mClock. A generic static value is sufficient for teuthology tests to run. There are deterministic tests for mClock scheduling that we run outside of teuthology on machines whose environment is known and that we can control. We use CBT for this purpose.

But there are cases where the bench test throws up unrealistic IOPS measurements, and to catch this a threshold range is defined based on the underlying device type. On the trial machines, this threshold was breached (>80K IOPS), leading to the cluster warning. It's apparent that the devices on these machines are significantly faster than what was present on the smithi machines.
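
The threshold check described above can be sketched roughly as follows. This is a simplified Python illustration, not the OSD's actual code. The function name, the lower bound, and the static fallback value are hypothetical. Falling back to a static value when the measurement is out of range is also an assumption. Only the 80K upper bound comes from this discussion.

```python
def pick_iops_capacity(measured_iops, low, high, static_default):
    """Return (capacity, warn); trust the measurement only if low <= it <= high."""
    if low <= measured_iops <= high:
        return measured_iops, False
    # An out-of-range result is treated as unrealistic: use the static
    # default instead (assumed behaviour) and signal that a warning is due.
    return static_default, True


if __name__ == "__main__":
    # Hypothetical numbers, except the 80K upper bound quoted in the thread.
    LOW, HIGH, STATIC_DEFAULT = 1_000, 80_000, 20_000
    for measured in (45_000, 95_000):
        capacity, warn = pick_iops_capacity(measured, LOW, HIGH, STATIC_DEFAULT)
        print(measured, capacity, warn)
```

Under these assumptions, a faster device that legitimately exceeds the upper bound would trip the warning just as a bogus measurement would, which matches the behaviour reported on the trial machines.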

In addition to bumping up the threshold values, which seems reasonable in this case, I am looking into the OSD bench tool and possibly the fio_objectstore_tool to improve the consistency and predictability of results. The important thing to note here is that we need a tool that closely mimics IOs with the objectstore in place in order to get a reasonably good estimate of the IOPS capacity at the OSD layer. We can use https://tracker.ceph.com/issues/74567 to track the progress.

@rzarzynski
Contributor

As https://tracker.ceph.com/issues/74501 is "consumed" by this PR, it has been copied into https://tracker.ceph.com/issues/74567 yesterday – we shouldn't forget.

@sseshasa
Contributor Author

> As https://tracker.ceph.com/issues/74501 is "consumed" by this PR, it has been copied into https://tracker.ceph.com/issues/74567 yesterday – we shouldn't forget.

I updated my comment above to mention the correct tracker. Thanks for pointing it out.

@djgalloway
Contributor

I guess my point was: even if it's not related to a test, if we're hitting this condition, users/customers could be too, and it's best not to bury our heads in the sand. I'm happy to see a separate tracker opened to investigate alternatives to ignoring the warning.

@markhpc
Member

markhpc commented Jan 29, 2026

Agreed, I just saw this and am slightly amazed this was already merged.
