qa: test_pool_min_size should kill osds first then mark them down#65074

Merged
SrinivasaBharath merged 1 commit intoceph:mainfrom
bill-scales:test_pool_min_size
Feb 5, 2026

Conversation

@bill-scales
Contributor

@bill-scales bill-scales commented Aug 17, 2025

The objective of test_pool_min_size is to inject up to M failures in a K+M pool to prove that it has enough redundancy to stay active.

It was selecting OSDs and then killing and marking them out one at a time. Testing with wide erasure codes (high values of K and M, for example 8+4) found that this test sometimes failed with a PG becoming dead. Debugging showed that after one OSD had been killed and marked out, rebalancing and async recovery could start, further reducing the redundancy of the PG; when the remaining error injects happened, the PG correctly became dead.

In practice OSDs are not normally killed and marked out one after another in quick succession. The more common scenario is that one or more OSDs fail at about the same time (let's say over a couple of minutes) and then, after mon_osd_down_out_interval (10 mins), the mon will mark them out. Killing the OSDs first and then marking them out prevents additional async recovery from starting.

If OSDs do fail over a long period of time such that the mon marks each OSD out then hopefully there is enough time for async recovery to run between the
failures.

This commit changes the error inject to kill all the selected OSDs first and then to mark them out.
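The reordering can be sketched roughly like this (a minimal illustration with hypothetical helper names, not the actual qa/tasks thrasher code):

```python
import random

def inject_failures(osds, max_failures, kill_osd, mark_osd_out):
    """Kill up to max_failures OSDs first, then mark them out.

    Stopping every chosen OSD before any is marked out avoids the
    rebalancing/async recovery that a mark-out between individual
    failures would trigger, which could otherwise reduce the PG's
    redundancy mid-test.
    """
    chosen = random.sample(osds, max_failures)
    # Phase 1: stop every chosen OSD while none is marked out yet.
    for osd in chosen:
        kill_osd(osd)
    # Phase 2: only now mark them all out.
    for osd in chosen:
        mark_osd_out(osd)
    return chosen
```

The previous behaviour interleaved the two phases per OSD; the fix simply splits them into two passes.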

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
@github-actions github-actions bot added the tests label Aug 17, 2025
@bill-scales bill-scales requested a review from kamoltat August 17, 2025 15:56
@bill-scales
Contributor Author

jenkins test make check

@bill-scales
Contributor Author

jenkins test make check arm64

@kamoltat
Member

@bill-scales thanks for making the change, I will review this soon.

@bill-scales
Contributor Author

jenkins test api

Member

@kamoltat kamoltat left a comment


Thank you for the explanation, this makes sense.

@kamoltat
Member

kamoltat commented Aug 27, 2025

@bill-scales since this is a test file change and not a C++ code change, if you have not done so already, you can run a teuthology test using test_pool_min_size as the suite branch. As you are already aware, there are two RADOS sub-suites that use test_pool_min_size: 1. thrash-erasure-code, and 2. thrash-erasure-code-overwrites. I think intentionally targeting these two suites, rather than relying on the batched qa-runs that Yuri and Bharath schedule, might cover more combinations.

@bill-scales
Contributor Author

I've already run a few teuthology runs with this change, but let me try some targeted tests while it runs through QA as well. I've seen the issue more frequently with larger EC profiles such as 8+4, so I'll make sure to get these included in the mix.

@bill-scales
Contributor Author

bill-scales commented Aug 28, 2025

Three runs with teuthology using PRs 65074, 65063 and 65067, but focusing on testing this change:

Run 1 - thrash-erasure-code

teuthology-suite -m smithi -s rados/thrash-erasure-code --suite-branch ec_teuth --suite-repo https://github.com/bill-scales/ceph.git -S ded026476ccb5c8e680746e180992ce9e75b0e25 --priority 80 --subset 1/15 --filter "thrashers/minsize_recovery" -e ####

https://pulpito.ceph.com/billscales-2025-08-28_12:34:14-rados:thrash-erasure-code-main-distro-default-smithi/


Run 2 - thrash-erasure-code-overwrites

teuthology-suite -m smithi -s rados/thrash-erasure-code-overwrites --suite-branch ec_teuth --suite-repo https://github.com/bill-scales/ceph.git -S ded026476ccb5c8e680746e180992ce9e75b0e25 --priority 80 --subset 1/6 --filter "thrashers/minsize_recovery" -e ####


https://pulpito.ceph.com/billscales-2025-08-28_12:36:47-rados:thrash-erasure-code-overwrites-main-distro-default-smithi/

Run 3 - thrash-erasure-code only with 8+6

teuthology-suite -m smithi -s rados/thrash-erasure-code --suite-branch ec_teuth --suite-repo https://github.com/bill-scales/ceph.git -S ded026476ccb5c8e680746e180992ce9e75b0e25 --priority 80 --subset 1/2 --filter-all "thrashers/minsize_recovery,workloads/ec-rados-plugin=jerasure-k=8-m=6-crush" -e ####


https://pulpito.ceph.com/billscales-2025-08-28_16:31:51-rados:thrash-erasure-code-main-distro-default-smithi/

There is 1 failure in run 1 and 2 failures in run 3.

@bill-scales
Contributor Author

Problem with run https://pulpito.ceph.com/billscales-2025-08-28_12:34:14-rados:thrash-erasure-code-main-distro-default-smithi/8469734/ is that the backfill_toofull error inject (implemented inside the OSD by a conf setting) stopped the chosen PG from recovering from the prior error inject.

PG 1.6 was in state active+remapped+backfill_toofull with an acting set [12,15,8,1,13,8,10,6,4,9,2,11,5,12] (note the duplication of OSD 8) prior to the error inject starting. The error inject then tried to take OSD 8 offline twice. The error inject was allowed to start because wait_for_recovery() passes so long as no PG is backfilling or recovering - it is not checking for backfill_toofull.
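The gap described above can be illustrated with a standalone sketch (hypothetical helper, not the actual wait_for_recovery() implementation): the recovery check must also treat backfill_toofull as "not yet recovered".

```python
def all_pgs_recovered(pg_states):
    """Return True only when no PG is still short of redundancy.

    Checking only for 'backfilling'/'recovering' misses PGs parked in
    'backfill_toofull', which also have not finished recovery and must
    block the next error inject.
    """
    blocking = ('backfilling', 'recovering', 'backfill_toofull')
    return not any(any(b in state for b in blocking) for state in pg_states)
```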

@bill-scales
Contributor Author

Problem with run https://pulpito.ceph.com/billscales-2025-08-28_16:31:51-rados:thrash-erasure-code-main-distro-default-smithi/8470070/ is a test cleanup problem:

  1. rados.py finishes its I/O test and cleans up, deleting the pool
  2. thrashosds starts trying to stop the thrasher: "INFO:tasks.thrashosds:joining thrashosds"
  3. get_rand_pg_acting_set raises a "MaxWhileTries(error_msg)" exception because it can't find a PG in the deleted pool, so the thrasher exits with an exception
  4. The watchdog barks and kills all the osds because the thrasher died: "daemonwatchdog.daemon_watchdog:OSDThrasher failed"
  5. We time out waiting for recovery in the cleanup at the end of thrashosds.py because the osds have been killed

We need to fix get_rand_pg_acting_set so that it doesn't raise an exception if the thrasher is being stopped. We should also stop the thrasher from telling the watchdog that it has died when it is in the process of being stopped.

These fixes need to be added to PR #65063
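The first fix can be sketched like this (a simplified standalone version with hypothetical callables, not the actual get_rand_pg_acting_set signature): when no PGs exist and the thrasher is shutting down, return nothing instead of raising.

```python
import random

def get_rand_pg_acting_set(list_pgs, stopping):
    """Pick a random PG and its acting set, tolerating shutdown races.

    If the pool has already been deleted while the thrasher is being
    stopped, there are no PGs left; returning None lets the thrasher
    exit cleanly instead of raising and tripping the watchdog.
    """
    pgs = list_pgs()  # mapping of pgid -> acting set
    if not pgs:
        if stopping():
            return None  # shutting down: nothing to thrash, not an error
        raise RuntimeError('no PGs found while thrasher still running')
    pgid = random.choice(sorted(pgs))
    return pgid, pgs[pgid]
```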

@bill-scales
Contributor Author

Problem with https://pulpito.ceph.com/billscales-2025-08-28_16:31:51-rados:thrash-erasure-code-main-distro-default-smithi/8470073/ is that the backfill_toofull error inject (implemented inside the OSD by a conf setting) stopped a different PG from recovering from the prior error inject.

The error inject works on the chosen PG but this other PG that hasn't recovered from the previous error inject because of the backfill_toofull error inject becomes dead. The test fails because of the dead PG.

While this PR has fixed one of the causes of PGs going dead during this error inject, it hasn't fixed all causes.

It also has problems similar to 8470070: after the thrasher times out waiting for recovery, the watchdog barks and stops the test (we want this to happen, to stop rados.py), but then the cleanup at the end of thrashosds.py times out waiting for recovery because all the OSDs have been killed.

We need thrashosds.py to be smarter and not try to clean up if the watchdog has fired.

@bill-scales
Contributor Author

More runs with this fix targeting this specific test case:

https://pulpito.ceph.com/billscales-2025-08-29_12:56:56-rados:thrash-erasure-code-main-distro-default-smithi/
https://pulpito.ceph.com/billscales-2025-08-29_13:50:22-rados:thrash-erasure-code-overwrites-main-distro-default-smithi/
https://pulpito.ceph.com/billscales-2025-08-29_14:07:56-rados:thrash-erasure-code-main-distro-default-smithi/
https://pulpito.ceph.com/billscales-2025-08-29_15:07:51-rados:thrash-erasure-code-main-distro-default-smithi/

I'm happy from the testing that this PR does fix the issue it was trying to address. The testing has shown there is still more work to do to make these tests reliable. I've updated #65063 because of these tests to fix one of the failures noted above.

I've got a fix for the backfill_toofull causing dead PG problems seen above, but haven't managed to get a test run showing this working yet. I'll deliver that as a separate PR once testing shows it works.

I've got a fix for thrashosds to stop its cleanup causing exceptions that overwrite the failure reason caused by the watchdog barking (its cleanup doesn't go well when the watchdog kills the OSDs), but that will also be another PR.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@ljflores
Member

ljflores commented Dec 2, 2025

Apologies for the delay on approval; I found an issue with a separate PR in the batch. We're rerunning tests and should have fresh results soon.

@github-actions

github-actions bot commented Feb 1, 2026

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Feb 1, 2026
@JonBailey1993
Contributor

@github-actions github-actions bot removed the stale label Feb 5, 2026
@SrinivasaBharath SrinivasaBharath merged commit 68fc42a into ceph:main Feb 5, 2026
26 checks passed