
test/ceph-helpers: Pass timeout and add timeout for commands in test_pg_scrub#66457

Merged
NitzanMordhai merged 1 commit into ceph:main from NitzanMordhai:wip-nitzan-pg-scrub-standalone-test-hang
Jan 11, 2026

Conversation

@NitzanMordhai
Contributor

@NitzanMordhai NitzanMordhai commented Dec 1, 2025

In test_pg_scrub, after killing an OSD, subsequent pg_scrub checks and calls to flush_pg_stats can hang or time out at the default timeout because the OSD is no longer running. This was causing test failures.

This fix addresses two issues:

  1. test_pg_scrub: Explicitly pass the WAIT_FOR_CLEAN_TIMEOUT and TIMEOUT variables (both set to 2) to the pg_scrub call after the OSD is killed. This prevents a hang in the wait_for_clean check within pg_scrub.
  2. flush_pg_stats: Add an explicit timeout to the ceph tell osd.$osd flush_pg_stats command, allowing it to fail quickly when an OSD is unresponsive.

Fixes: https://tracker.ceph.com/issues/74004
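The second change above can be sketched as follows. This is a hedged illustration, not the actual ceph-helpers.sh patch: `sleep 10` stands in for `ceph tell osd.$osd flush_pg_stats` against a dead OSD, and the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: bound a potentially-hanging command with timeout(1).
# `sleep 10` stands in for `ceph tell osd.$osd flush_pg_stats`
# against an OSD that was just killed and will never answer.
flush_pg_stats_sketch() {
    local t=${TIMEOUT:-2}
    local seq
    # timeout(1) exits 124 when the command outlives the limit
    seq=$(timeout "$t" sleep 10) || return 1
}

if ! TIMEOUT=1 flush_pg_stats_sketch; then
    echo "flush timed out quickly instead of hanging"
fi
```

With the explicit `timeout`, the helper fails within seconds instead of blocking the whole standalone test run.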

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@NitzanMordhai NitzanMordhai requested a review from a team as a code owner December 1, 2025 09:40
Contributor

@ronen-fr ronen-fr left a comment


LGTM otherwise

pg_scrub 1.0 || return 1
kill_daemons $dir KILL osd || return 1
! TIMEOUT=2 pg_scrub 1.0 || return 1
! WAIT_FOR_CLEAN_TIMEOUT=2 TIMEOUT=2 pg_scrub 1.0 || return 1
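The snippet above relies on two shell idioms worth spelling out: `VAR=val cmd` sets the variable for that one invocation only, and a leading `!` asserts that the command fails. A minimal sketch, where `probe` is a stand-in for pg_scrub that honors TIMEOUT the way the real helper does:

```shell
#!/usr/bin/env bash
# `probe` stands in for pg_scrub: it honors TIMEOUT and would
# otherwise run for 5 seconds (like a hung wait on a dead OSD).
probe() { timeout "${TIMEOUT:-90}" sleep 5; }

# TIMEOUT=1 applies only to this invocation; the leading `!`
# inverts the status, so the line succeeds when probe times out.
! TIMEOUT=1 probe || exit 1
echo "probe failed fast, as the test expects after kill_daemons"
```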
Contributor


The default for WAIT_FOR_CLEAN_TIMEOUT is 90, and the regular OSD-to-MON update cycle is 5 seconds.
Isn't '2' a bit short?

Contributor Author


It's after kill_daemons; I thought we shouldn't wait too long.

Contributor


I'd use something larger than 5, anyway

@kamoltat
Member

kamoltat commented Dec 2, 2025

I know this PR is getting tested; however, I think we might also have to look into how over-thrashing causes stale PGs and stale PGs cause PG query to fail, per the current discussion in https://tracker.ceph.com/issues/74004

@NitzanMordhai
Contributor Author

I know this PR is getting tested; however, I think we might also have to look into how over-thrashing causes stale PGs and stale PGs cause PG query to fail, per the current discussion in https://tracker.ceph.com/issues/74004

Thanks @kamoltat, I added my analysis to the tracker. By the way, I don't think we do any OSD thrashing during standalone tests (though I may be wrong); we only kill an OSD during that specific test.

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pg-scrub-standalone-test-hang branch from 81e8df0 to b36c66d Compare December 3, 2025 06:24
Member

@ljflores ljflores left a comment


@NitzanMordhai can you fix the Signed-off-by check?

@sseshasa
Contributor

sseshasa commented Dec 11, 2025

@NitzanMordhai I see multiple related failures in standalone tests.

See https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrackercephcomissues74059 and the related tracker of this PR. This needs to be addressed before merging the PR.

@NitzanMordhai
Contributor Author

@NitzanMordhai I see multiple related failures in standalone tests in the following run.

See https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrackercephcomissues74059 and the related tracker of this PR. This needs to be addressed before merging this PR.

@sseshasa thanks! The `return 1` that I added caused that:
seq=$(timeout $timeout ceph tell osd.$osd flush_pg_stats) || return 1
I need to fix it.
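One possible shape of such a fix is sketched below; the merged patch may differ. The idea is that after kill_daemons a timed-out flush for a dead OSD is expected, so the helper should skip that OSD rather than abort with `return 1`. The function name is illustrative, and `false` stands in for a `ceph tell` that fails.

```shell
#!/usr/bin/env bash
# Sketch: tolerate a failed/timed-out flush for a killed OSD instead
# of failing the whole helper. `false` stands in for
# `ceph tell osd.$osd flush_pg_stats` against a dead OSD.
flush_pg_stats_tolerant() {
    local t=${TIMEOUT:-2} osd seq
    for osd in "$@"; do
        if ! seq=$(timeout "$t" false); then
            echo "osd.$osd unresponsive, skipping" >&2
            continue    # do not abort: dead OSDs are expected here
        fi
    done
}

flush_pg_stats_tolerant 0 1 2 2>/dev/null
echo "rc=$?"
```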

…pg_scrub

In test_pg_scrub, after killing an OSD, subsequent pg_scrub checks and calls to flush_pg_stats
can hang or timeout with the default time because the OSD is no longer running.
This was causing test failures.

This fix addresses two issues:
1.  test_pg_scrub: Explicitly pass the WAIT_FOR_CLEAN_TIMEOUT and TIMEOUT variables (both set to 2)
    to the pg_scrub call after the OSD is killed. This prevents a hang in the wait_for_clean
    check within pg_scrub.
2.  flush_pg_stats: Add an explicit timeout to the ceph tell osd.$osd flush_pg_stats command,
    allowing it to fail quickly when an OSD is unresponsive.

Fixes: https://tracker.ceph.com/issues/74004
Signed-off-by: Nitzan Mordechai <nmordec@ibm.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pg-scrub-standalone-test-hang branch from b36c66d to 118c5be Compare December 11, 2025 14:06

@ljflores
Member

@NitzanMordhai please see Sridhar's comment about regression from testing: https://tracker.ceph.com/issues/74004#note-11

@NitzanMordhai
Contributor Author

@NitzanMordhai please see Sridhar's comment about regression from testing: https://tracker.ceph.com/issues/74004#note-11

please see #66457 (comment)
and: #66457 (comment)

The PR was fixed, and I reran the failing jobs.

Contributor

@rzarzynski rzarzynski left a comment


LGTM assuming it passes the QA.

@ljflores
Member

ljflores commented Jan 8, 2026

@NitzanMordhai please see Sridhar's comment about regression from testing: https://tracker.ceph.com/issues/74004#note-11

please see #66457 (comment) and: #66457 (comment)

The PR was fixed, and I reran the failing jobs.

@NitzanMordhai as this change is only for the QA suite, I think this can be approved/merged based on the tests you scheduled. WDYT? Can you summarize the results for the tests you scheduled?

@NitzanMordhai
Contributor Author

Based on testing this branch in these runs: https://pulpito.ceph.com/nmordech-2025-12-11_14:38:40-rados:standalone-wip-bharath5-testing-2025-12-02-1511-distro-default-smithi/

we can merge this fix. The results show that we no longer time out and fail on tests where daemons were killed before checking pg_scrub.

@NitzanMordhai NitzanMordhai merged commit bd20e52 into ceph:main Jan 11, 2026
13 checks passed
@NitzanMordhai NitzanMordhai deleted the wip-nitzan-pg-scrub-standalone-test-hang branch January 11, 2026 12:40
@rzarzynski rzarzynski removed the DNM label Jan 12, 2026
@rzarzynski
Contributor

The DNM has lost its rationale because of #66457 (comment).

@kshtsk
Contributor

kshtsk commented Feb 3, 2026

I tried this patch on my tentacle branch and still see a similar issue:

2026-02-03T16:22:33.371 INFO:tasks.workunit.client.0.vm06.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1715: wait_for_pg_clean:  is_pg_clean 1.0
2026-02-03T16:22:33.371 INFO:tasks.workunit.client.0.vm06.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1577: is_pg_clean:  local pgid=1.0
2026-02-03T16:22:33.372 INFO:tasks.workunit.client.0.vm06.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1578: is_pg_clean:  local pg_state
2026-02-03T16:22:33.372 INFO:tasks.workunit.client.0.vm06.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1579: is_pg_clean:  ceph pg 1.0 query
2026-02-03T16:22:33.372 INFO:tasks.workunit.client.0.vm06.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1579: is_pg_clean:  jq -r '.state '
