
qa: Fix OSD thrasher bugs during test clean up#65063

Open
bill-scales wants to merge 1 commit into ceph:main from bill-scales:issue71917

@bill-scales
Contributor

@bill-scales bill-scales commented Aug 15, 2025

The OSD thrasher can get stuck for 30 minutes searching for a pool after all the pools have been deleted. It should exit the search loop as soon as the thrasher is told to stop.

The OSD thrasher can raise an exception if a pool is deleted between querying the list of pools and choosing a PG from that pool. If the PG list is empty, the thrasher should look for another pool.

Fixes: https://tracker.ceph.com/issues/71917
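
The two behaviours described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `pick_random_pg`, `list_pools`, `list_pgs` and `is_stopping` are hypothetical names standing in for the thrasher's cluster queries and its stop flag, and the 1800-second default simply mirrors the 30 minutes mentioned above.

```python
import random
import time


def pick_random_pg(list_pools, list_pgs, is_stopping, timeout=1800.0, poll=1.0):
    """Return (pool, pg), or None if told to stop or the timeout expires.

    list_pools, list_pgs and is_stopping are hypothetical callables; they
    are assumptions made for this sketch, not the thrasher's real API.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_stopping():
            return None  # clean-up has begun: abandon the search at once
        pools = list_pools()
        if pools:
            pool = random.choice(pools)
            # The pool may have been deleted since list_pools() returned.
            pgs = list_pgs(pool)
            if pgs:
                return pool, random.choice(pgs)
            # Empty PG list: fall through and try another pool next time.
        time.sleep(poll)
    return None
```

Returning `None` rather than raising lets the caller treat "no PG available" as a reason to skip the current inject instead of failing the test.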


@github-actions github-actions bot added the tests label Aug 15, 2025
@bill-scales bill-scales requested a review from kamoltat August 15, 2025 08:49
@bill-scales
Contributor Author

jenkins test make check

@bill-scales
Contributor Author

jenkins test make check arm64

@bill-scales
Contributor Author

jenkins test make check

@bill-scales
Contributor Author

jenkins test make check arm64

@bill-scales
Contributor Author

jenkins test make check

@bill-scales
Contributor Author

jenkins test make check arm64

@bill-scales
Contributor Author

jenkins test make check

@bill-scales
Contributor Author

jenkins test make check arm64

@bill-scales
Contributor Author

jenkins test make check

The OSD thrasher can get stuck for 30 minutes
searching for a pool when the pools have all been deleted.
It should terminate the loop if the thrasher is told to stop.

The OSD thrasher can cause an exception if a pool is deleted
between querying the list of pools and choosing a PG from the
pool.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
@bill-scales
Contributor Author

Updated after many teuthology runs to test this change. The problems occur when the thrasher looks for a PG after the pool has been deleted. By this point the thrasher has already been told to stop but has not yet noticed. We therefore just need to add some checks for stopping and give up on the current error inject.

These runs all include this fix:
https://pulpito.ceph.com/billscales-2025-08-29_12:56:56-rados:thrash-erasure-code-main-distro-default-smithi/
https://pulpito.ceph.com/billscales-2025-08-29_13:50:22-rados:thrash-erasure-code-overwrites-main-distro-default-smithi/
https://pulpito.ceph.com/billscales-2025-08-29_14:07:56-rados:thrash-erasure-code-main-distro-default-smithi/
https://pulpito.ceph.com/billscales-2025-08-29_15:07:51-rados:thrash-erasure-code-main-distro-default-smithi/
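
To illustrate the race described in this comment, here is a toy model rather than the actual change. It assumes the stop request is visible as a flag, and every name in it (`ErrorInjector`, `inject_once`, `clean_up`) is hypothetical. An empty PG list is treated as expected once the stop flag is set, and as a real error otherwise.

```python
import threading


class ErrorInjector:
    """Toy model of a thrasher thread. Every name here is hypothetical."""

    def __init__(self, pools):
        self.pools = dict(pools)           # pool name -> list of PG ids
        self.lock = threading.Lock()
        self.stopping = threading.Event()  # set by the test during clean-up
        self.injected = []

    def delete_pools(self):
        with self.lock:
            self.pools.clear()

    def inject_once(self):
        """Inject into one PG; return False if the inject was abandoned."""
        with self.lock:
            pgs = [pg for pool_pgs in self.pools.values() for pg in pool_pgs]
        if not pgs:
            if self.stopping.is_set():
                return False  # pools removed by clean-up: give up quietly
            raise RuntimeError("no PG available while the thrasher is running")
        self.injected.append(pgs[0])
        return True

    def run(self, interval=0.01):
        while not self.stopping.is_set():
            self.inject_once()
            self.stopping.wait(interval)  # sleeps, but wakes early on stop


def clean_up(injector, thread):
    injector.stopping.set()   # tell the thrasher to stop first...
    injector.delete_pools()   # ...then delete the pools
    thread.join(timeout=5)
    return not thread.is_alive()
```

In this model an inject that is already in flight when `clean_up` runs either completes against the PG list it read before the deletion, or sees the empty list together with the stop flag and returns `False`.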

@bill-scales
Contributor Author

jenkins test make check

@bill-scales
Contributor Author

jenkins test api

@github-actions

github-actions bot commented Nov 7, 2025

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@bill-scales
Contributor Author

@kamoltat can you review this change? It's a simple reliability improvement to the OSD thrasher: it makes the thrasher check whether it is being stopped and, if so, exit the error inject. Currently, pools are deleted at the end of tests, and this can cause the thrasher to fail the test because it can't find any pool or PG in which to inject an error. This change stops those failures from happening.

@github-actions github-actions bot added the stale label Jan 27, 2026
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Feb 26, 2026
@bill-scales bill-scales reopened this Feb 26, 2026
@github-actions github-actions bot removed the stale label Feb 26, 2026