qa: test_pool_min_size should kill osds first then mark them down #65074

SrinivasaBharath merged 1 commit into ceph:main from
Conversation
The objective of test_pool_min_size is to inject up to M failures in a K+M pool to prove that it has enough redundancy to stay active. It was selecting OSDs and then killing and marking them out one at a time. Testing with wide erasure codes (high values of K and M, for example 8+4) found that this test sometimes failed with a PG becoming dead. Debugging showed that after one OSD had been killed and marked out, rebalancing and async recovery could start, which further reduced the redundancy of the PG; when the remaining error injects happened, the PG correctly became dead.

In practice OSDs are not normally killed and marked out one after another in quick succession. The more common scenario is that one or more OSDs fail at about the same time (let's say over a couple of minutes) and then, after mon_osd_down_out_interval (10 mins), the mon will mark them out. Killing the OSDs first and then marking them out prevents additional async recovery from starting. If OSDs do fail over a long period of time, such that the mon marks each OSD out, then hopefully there is enough time for async recovery to run between the failures.

This commit changes the error inject to kill all the selected OSDs first and then to mark them out.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
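The change in failure-injection order can be sketched roughly like this (a minimal illustration only; `kill` and `mark_out` are hypothetical callbacks, not the actual Thrasher API):

```python
def inject_failures_old(osds, kill, mark_out):
    # Old behaviour: kill and mark out one OSD at a time. Between
    # iterations the cluster can start rebalancing and async recovery,
    # further reducing the PG's redundancy before the next inject.
    for osd in osds:
        kill(osd)
        mark_out(osd)

def inject_failures_new(osds, kill, mark_out):
    # New behaviour: kill every selected OSD first, then mark them all
    # out, so no additional async recovery can start in between.
    for osd in osds:
        kill(osd)
    for osd in osds:
        mark_out(osd)
```

With the new ordering, all M failures land before any rebalancing can begin, which matches the common real-world scenario of near-simultaneous failures followed by the mon marking the OSDs out later.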
jenkins test make check

jenkins test make check arm64

@bill-scales thanks for making the change, I will review this soon.

jenkins test api
kamoltat left a comment:

Thank you for the explanation, this makes sense.
@bill-scales since this is a test file change and not a C++ code change, if you have not done so already, you can run a teuthology test using

I've already run a few teuthology runs with this change, but let me try some targeted tests while it runs through QA as well. I've seen the issue more frequently with larger EC profiles such as 8+4, so I'll make sure to get these included in the mix.
Three runs with teuthology using PRs 65074, 65063 and 65067, focusing on testing this change:

Run 1 - thrash-erasure-code
Run 2 - thrash-erasure-code-overwrites
Run 3 - thrash-erasure-code only with 8+6

There is 1 failure in run 1 and 2 failures in run 3.
Problem with run https://pulpito.ceph.com/billscales-2025-08-28_12:34:14-rados:thrash-erasure-code-main-distro-default-smithi/8469734/ is that the backfill_toofull error inject (implemented inside the OSD by a conf setting) stopped the chosen PG from recovering from the prior error inject. PG 1.6 was in state active+remapped+backfill_toofull with an acting set [12,15,8,1,13,8,10,6,4,9,2,11,5,12] (note the duplication of OSD 8) prior to the error inject starting. The error inject then tried to take OSD 8 offline twice.
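One way the selection could guard against this (a hypothetical sketch, not the actual test code) is to deduplicate the acting set before choosing which OSDs to take offline:

```python
def select_osds_to_kill(acting_set, m):
    # A remapped acting set can contain the same OSD more than once
    # (e.g. OSD 8 appearing twice in [12,15,8,1,13,8,...]). Deduplicate
    # while preserving order so the inject never tries to take the same
    # OSD offline twice, then pick up to M failures.
    unique = list(dict.fromkeys(acting_set))
    return unique[:m]
```

`dict.fromkeys` keeps the first occurrence of each OSD id in order, which is a common idiom for order-preserving deduplication in Python.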
Problem with run https://pulpito.ceph.com/billscales-2025-08-28_16:31:51-rados:thrash-erasure-code-main-distro-default-smithi/8470070/ is a test cleanup problem: we need to fix get_rand_pg_acting_set so it doesn't raise an exception if the thrasher is being stopped. These fixes need to be added to PR #65063.
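A guard along these lines could avoid the exception (hypothetical sketch; the real get_rand_pg_acting_set lives in the thrasher code and has a different shape, and FakeThrasher is an illustrative stand-in):

```python
import random

class FakeThrasher:
    # Minimal stand-in for the thrasher state this sketch needs.
    def __init__(self, pg_acting_sets, stopping=False):
        self.pg_acting_sets = pg_acting_sets  # {pgid: acting set}
        self.stopping = stopping

def get_rand_pg_acting_set(thrasher):
    # Return None instead of raising when the thrasher is shutting
    # down (or when there are no PGs left to pick from); callers can
    # then bail out quietly during teardown.
    if thrasher.stopping or not thrasher.pg_acting_sets:
        return None
    pgid = random.choice(sorted(thrasher.pg_acting_sets))
    return thrasher.pg_acting_sets[pgid]
```

Returning a sentinel rather than raising keeps a racing shutdown from turning into a spurious test failure that masks the real result.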
Problem with https://pulpito.ceph.com/billscales-2025-08-28_16:31:51-rados:thrash-erasure-code-main-distro-default-smithi/8470073/ is that the backfill_toofull error inject (implemented inside the OSD by a conf setting) stopped a different PG from recovering from the prior error inject. The error inject works on the chosen PG, but this other PG, which hasn't recovered from the previous error inject because of the backfill_toofull error inject, becomes dead. The test fails because of the dead PG. While this PR has fixed one of the causes of PGs going dead during this error inject, it hasn't fixed all causes.

It also has similar problems to 8470070: after the thrasher times out recovery, the watchdog barks and stops the test (we want this to happen to stop rados.py), but then in the cleanup at the end of thrashosds.py we time out waiting for recovery because all the OSDs have been killed. We need thrashosds.py to be smarter and not try to clean up if the watchdog has fired.
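The smarter cleanup could look roughly like this (hypothetical sketch; thrashosds.py's real teardown is more involved, and the names here are illustrative):

```python
def cleanup(watchdog_fired, wait_for_recovery):
    # If the watchdog has already fired, the OSDs are dead and waiting
    # for recovery would only time out and overwrite the real failure
    # reason -- skip straight to reporting instead.
    if watchdog_fired:
        return "skipped"
    wait_for_recovery()
    return "clean"
```

The point of the guard is to preserve the original failure reason from the watchdog rather than replacing it with a secondary recovery timeout.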
More runs with this fix targeting this specific test case: https://pulpito.ceph.com/billscales-2025-08-29_12:56:56-rados:thrash-erasure-code-main-distro-default-smithi/

I'm happy from the testing that this PR does fix the issue it was trying to address. The testing has shown there is still more work to do to make these tests reliable. I've updated #65063 because of these tests to fix one of the failures noted above.

I've got a fix for the backfill_toofull causing dead PG problems seen above, but haven't managed to get a test run showing this working yet. I'll deliver that as a separate PR once testing shows it works. I've also got a fix for thrashosds to stop its cleanup causing exceptions that overwrite the failure reason caused by the watchdog barking (its cleanup doesn't go well when the watchdog kills the OSDs), but that will also be another PR.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
Apologies for the delay on approval; I found an issue with a separate PR in the batch. We're rerunning tests and should have fresh results soon.