qa: test_pool_min_size should kill osds first then mark them down#65074

Merged
SrinivasaBharath merged 1 commit intoceph:mainfrom
bill-scales:test_pool_min_size
Feb 5, 2026

Conversation

@bill-scales
Contributor

@bill-scales bill-scales commented Aug 17, 2025

The objective of test_pool_min_size is to inject up to M failures in a K+M pool to prove that it has enough redundancy to stay active.

It was selecting OSDs and then killing and marking them out one at a time. Testing with wide erasure codes (high values of K and M, for example 8+4) found that this test sometimes failed with a PG becoming dead. Debugging showed that after one OSD had been killed and marked out, rebalancing and async recovery could start, further reducing the redundancy of the PG; when the remaining error injects happened, the PG correctly became dead.

In practice OSDs are not normally killed and marked out one after another in quick succession. The more common scenario is that one or more OSDs fail at about the same time (let's say over a couple of minutes) and then, after mon_osd_down_out_interval (10 mins), the mon will mark them out. Killing the OSDs first and then marking them out prevents additional async recovery from starting.

If OSDs do fail over a long period of time such that the mon marks each OSD out then hopefully there is enough time for async recovery to run between the
failures.

This commit changes the error inject to kill all the selected OSDs first and then to mark them out.
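The reordering can be sketched roughly like this (a minimal illustration with hypothetical helper names, not the actual qa/tasks thrasher code):

```python
import random

def inject_failures(osds, max_failures, kill_osd, mark_osd_out):
    """Kill up to max_failures OSDs first, then mark them out.

    Stopping every chosen OSD before any is marked out avoids the
    rebalancing/async recovery that a mark-out between individual
    failures would trigger, which could otherwise reduce the PG's
    redundancy mid-test.
    """
    chosen = random.sample(osds, max_failures)
    # Phase 1: stop every chosen OSD while none is marked out yet.
    for osd in chosen:
        kill_osd(osd)
    # Phase 2: only now mark them all out.
    for osd in chosen:
        mark_osd_out(osd)
    return chosen
```

The previous behaviour interleaved the two phases per OSD; the fix simply splits them into two passes.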

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
@github-actions github-actions bot added the tests label Aug 17, 2025
@bill-scales bill-scales requested a review from kamoltat August 17, 2025 15:56
@bill-scales
Contributor Author

jenkins test make check

@bill-scales
Contributor Author

jenkins test make check arm64

@kamoltat
Member

@bill-scales thanks for making the change, I will review this soon.

@bill-scales
Contributor Author

jenkins test api

Member

@kamoltat kamoltat left a comment


Thank you for the explanation, this makes sense.

@kamoltat
Member

kamoltat commented Aug 27, 2025

@bill-scales since this is a test file change and not a C++ code change, if you have not done so already, you can run a teuthology test using test_pool_min_size as the suite branch. As you are already aware, there are two RADOS sub-suites that use test_pool_min_size: 1. thrash-erasure-code, and 2. thrash-erasure-code-overwrites. I think intentionally targeting these two suites, rather than relying on the batched qa-runs that Yuri and Bharath schedule, might cover more combinations.

@bill-scales
Contributor Author

I've already run a few teuthology runs with this change, but let me try some targeted tests while it runs through QA as well. I've seen the issue more frequently with larger EC profiles such as 8+4, so I'll make sure to get these included in the mix.

@bill-scales
Contributor Author

bill-scales commented Aug 28, 2025

Three runs with teuthology using PRs 65074, 65063 and 65067, but focusing on testing this change:

Run 1 - thrash-erasure-code

teuthology-suite -m smithi -s rados/thrash-erasure-code --suite-branch ec_teuth --suite-repo https://github.com/bill-scales/ceph.git -S ded026476ccb5c8e680746e180992ce9e75b0e25 --priority 80 --subset 1/15 --filter "thrashers/minsize_recovery" -e ####

https://pulpito.ceph.com/billscales-2025-08-28_12:34:14-rados:thrash-erasure-code-main-distro-default-smithi/


Run 2 - thrash-erasure-code-overwrites

teuthology-suite -m smithi -s rados/thrash-erasure-code-overwrites --suite-branch ec_teuth --suite-repo https://github.com/bill-scales/ceph.git -S ded026476ccb5c8e680746e180992ce9e75b0e25 --priority 80 --subset 1/6 --filter "thrashers/minsize_recovery" -e ####


https://pulpito.ceph.com/billscales-2025-08-28_12:36:47-rados:thrash-erasure-code-overwrites-main-distro-default-smithi/

Run 3 - thrash-erasure-code only with 8+6

teuthology-suite -m smithi -s rados/thrash-erasure-code --suite-branch ec_teuth --suite-repo https://github.com/bill-scales/ceph.git -S ded026476ccb5c8e680746e180992ce9e75b0e25 --priority 80 --subset 1/2 --filter-all "thrashers/minsize_recovery,workloads/ec-rados-plugin=jerasure-k=8-m=6-crush" -e ####


https://pulpito.ceph.com/billscales-2025-08-28_16:31:51-rados:thrash-erasure-code-main-distro-default-smithi/

There is 1 failure in run 1 and 2 failures in run 3.

@bill-scales
Contributor Author

Problem with run https://pulpito.ceph.com/billscales-2025-08-28_12:34:14-rados:thrash-erasure-code-main-distro-default-smithi/8469734/ is that the backfill_toofull error inject (implemented inside the OSD by a conf setting) stopped the chosen PG from recovering from the prior error inject.

PG 1.6 was in state active+remapped+backfill_toofull with an acting set [12,15,8,1,13,8,10,6,4,9,2,11,5,12] (note the duplication of OSD 8) prior to the error inject starting. The error inject then tried to take OSD 8 offline twice. The error inject was allowed to start because wait_for_recovery() passes so long as no PG is backfilling or recovering - it is not checking for backfill_toofull.
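The gap described above can be illustrated with a standalone sketch (hypothetical helper, not the actual wait_for_recovery() implementation): the recovery check must also treat backfill_toofull as "not yet recovered".

```python
def all_pgs_recovered(pg_states):
    """Return True only when no PG is still short of redundancy.

    Checking only for 'backfilling'/'recovering' misses PGs parked in
    'backfill_toofull', which also have not finished recovery and must
    block the next error inject.
    """
    blocking = ('backfilling', 'recovering', 'backfill_toofull')
    return not any(any(b in state for b in blocking) for state in pg_states)
```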

@bill-scales
Contributor Author

Problem with run https://pulpito.ceph.com/billscales-2025-08-28_16:31:51-rados:thrash-erasure-code-main-distro-default-smithi/8470070/ is a test cleanup problem:

  1. rados.py finishes its I/O test and cleans up, deleting the pool
  2. thrashosds starts trying to stop the thrasher: "INFO:tasks.thrashosds:joining thrashosds"
  3. get_rand_pg_acting_set raises a "MaxWhileTries(error_msg)" exception because it can't find a PG in the deleted pool, so the thrasher exits with an exception
  4. The watchdog barks and kills all the osds because the thrasher died: "daemonwatchdog.daemon_watchdog:OSDThrasher failed"
  5. We time out waiting for recovery in the cleanup at the end of thrashosds.py because the osds have been killed

We need to fix get_rand_pg_acting_set so that it doesn't raise an exception if the thrasher is being stopped. We should also stop the thrasher from telling the watchdog that it has died when it is in the process of being stopped.

These fixes need to be added to PR #65063
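The first fix can be sketched like this (a simplified standalone version with hypothetical callables, not the actual get_rand_pg_acting_set signature): when no PGs exist and the thrasher is shutting down, return nothing instead of raising.

```python
import random

def get_rand_pg_acting_set(list_pgs, stopping):
    """Pick a random PG and its acting set, tolerating shutdown races.

    If the pool has already been deleted while the thrasher is being
    stopped, there are no PGs left; returning None lets the thrasher
    exit cleanly instead of raising and tripping the watchdog.
    """
    pgs = list_pgs()  # mapping of pgid -> acting set
    if not pgs:
        if stopping():
            return None  # shutting down: nothing to thrash, not an error
        raise RuntimeError('no PGs found while thrasher still running')
    pgid = random.choice(sorted(pgs))
    return pgid, pgs[pgid]
```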

@bill-scales
Contributor Author

Problem with https://pulpito.ceph.com/billscales-2025-08-28_16:31:51-rados:thrash-erasure-code-main-distro-default-smithi/8470073/ is that the backfill_toofull error inject (implemented inside the OSD by a conf setting) stopped a different PG from recovering from the prior error inject.

The error inject works on the chosen PG but this other PG that hasn't recovered from the previous error inject because of the backfill_toofull error inject becomes dead. The test fails because of the dead PG.

While this PR has fixed one of the causes of PGs going dead during this error inject, it hasn't fixed all causes.

It also has problems similar to 8470070: after the thrasher times out waiting for recovery, the watchdog barks and stops the test (we want this to happen, to stop rados.py), but then the cleanup at the end of thrashosds.py times out waiting for recovery because all the OSDs have been killed.

We need thrashosds.py to be smarter and not try to clean up if the watchdog has fired.

@bill-scales
Contributor Author

More runs with this fix targeting this specific test case:

https://pulpito.ceph.com/billscales-2025-08-29_12:56:56-rados:thrash-erasure-code-main-distro-default-smithi/
https://pulpito.ceph.com/billscales-2025-08-29_13:50:22-rados:thrash-erasure-code-overwrites-main-distro-default-smithi/
https://pulpito.ceph.com/billscales-2025-08-29_14:07:56-rados:thrash-erasure-code-main-distro-default-smithi/
https://pulpito.ceph.com/billscales-2025-08-29_15:07:51-rados:thrash-erasure-code-main-distro-default-smithi/

I'm happy from the testing that this PR does fix the issue it was trying to address. The testing has shown there is still more work to do to make these tests reliable. I've updated #65063 because of these tests to fix one of the failures noted above.

I've got a fix for the backfill_toofull causing dead PG problems seen above, but haven't managed to get a test run showing this working yet. I'll deliver that as a separate PR once testing shows it works.

I've got a fix for thrashosds to stop its cleanup causing exceptions that overwrite the failure reason caused by the watchdog barking (its cleanup doesn't go well when the watchdog kills the OSDs), but that will also be another PR.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@ljflores
Member

ljflores commented Dec 2, 2025

Apologies for the delay on approval; I found an issue with a separate PR in the batch. We're rerunning tests and should have fresh results soon.

@github-actions

github-actions bot commented Feb 1, 2026

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Feb 1, 2026
@JonBailey1993
Contributor

@github-actions github-actions bot removed the stale label Feb 5, 2026
@SrinivasaBharath SrinivasaBharath merged commit 68fc42a into ceph:main Feb 5, 2026
26 checks passed