
test/ceph-helpers: Pass timeout and add timeout for commands in test_pg_scrub#66457

Merged
NitzanMordhai merged 1 commit into ceph:main from NitzanMordhai:wip-nitzan-pg-scrub-standalone-test-hang
Jan 11, 2026

Conversation

@NitzanMordhai
Contributor

@NitzanMordhai NitzanMordhai commented Dec 1, 2025

In test_pg_scrub, after killing an OSD, subsequent pg_scrub checks and calls to flush_pg_stats can hang or time out at the default timeout because the OSD is no longer running. This was causing test failures.

This fix addresses two issues:

  1. test_pg_scrub: Explicitly pass the WAIT_FOR_CLEAN_TIMEOUT and TIMEOUT variables (both set to 2) to the pg_scrub call after the OSD is killed. This prevents a hang in the wait_for_clean check within pg_scrub.
  2. flush_pg_stats: Add an explicit timeout to the ceph tell osd.$osd flush_pg_stats command, allowing it to fail quickly when an OSD is unresponsive.

Fixes: https://tracker.ceph.com/issues/74004
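The second change above can be sketched as follows. This is a hedged illustration, not the actual ceph-helpers.sh patch: `sleep 10` stands in for `ceph tell osd.$osd flush_pg_stats` against a dead OSD, and the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: bound a potentially-hanging command with timeout(1).
# `sleep 10` stands in for `ceph tell osd.$osd flush_pg_stats`
# against an OSD that was just killed and will never answer.
flush_pg_stats_sketch() {
    local t=${TIMEOUT:-2}
    local seq
    # timeout(1) exits 124 when the command outlives the limit
    seq=$(timeout "$t" sleep 10) || return 1
}

if ! TIMEOUT=1 flush_pg_stats_sketch; then
    echo "flush timed out quickly instead of hanging"
fi
```

With the explicit `timeout`, the helper fails within seconds instead of blocking the whole standalone test run.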

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@NitzanMordhai NitzanMordhai requested a review from a team as a code owner December 1, 2025 09:40
Contributor

@ronen-fr ronen-fr left a comment


LGTM otherwise

pg_scrub 1.0 || return 1
kill_daemons $dir KILL osd || return 1
! TIMEOUT=2 pg_scrub 1.0 || return 1
! WAIT_FOR_CLEAN_TIMEOUT=2 TIMEOUT=2 pg_scrub 1.0 || return 1
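The snippet above relies on two shell idioms worth spelling out: `VAR=val cmd` sets the variable for that one invocation only, and a leading `!` asserts that the command fails. A minimal sketch, where `probe` is a stand-in for pg_scrub that honors TIMEOUT the way the real helper does:

```shell
#!/usr/bin/env bash
# `probe` stands in for pg_scrub: it honors TIMEOUT and would
# otherwise run for 5 seconds (like a hung wait on a dead OSD).
probe() { timeout "${TIMEOUT:-90}" sleep 5; }

# TIMEOUT=1 applies only to this invocation; the leading `!`
# inverts the status, so the line succeeds when probe times out.
! TIMEOUT=1 probe || exit 1
echo "probe failed fast, as the test expects after kill_daemons"
```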
Contributor


The default for WAIT_FOR_CLEAN_TIMEOUT is 90, and the regular OSD-to-MON update cycle is 5 seconds.
Isn't '2' a bit short?

Contributor Author


It's after kill_daemons; I thought we shouldn't wait too long.

Contributor


I'd use something larger than 5, anyway

@kamoltat
Member

kamoltat commented Dec 2, 2025

I know this PR is getting tested; however, I think we might also have to look into how over-thrashing causes stale PGs and stale PGs cause PG query to fail, per the current discussion in https://tracker.ceph.com/issues/74004

@NitzanMordhai
Contributor Author

I know this PR is getting tested; however, I think we might also have to look into how over-thrashing causes stale PGs and stale PGs cause PG query to fail, per the current discussion in https://tracker.ceph.com/issues/74004

Thanks @kamoltat, I added my analysis to the tracker. By the way, I don't think we do any OSD thrashing during standalone tests (though I may be wrong); we only kill an OSD during that specific test.

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pg-scrub-standalone-test-hang branch from 81e8df0 to b36c66d Compare December 3, 2025 06:24
Member

@ljflores ljflores left a comment


@NitzanMordhai can you fix the Signed-off-by check?

@sseshasa
Contributor

sseshasa commented Dec 11, 2025

@NitzanMordhai I see multiple related failures in standalone tests.

See https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrackercephcomissues74059 and the related tracker of this PR. This needs to be addressed before merging the PR.

@NitzanMordhai
Contributor Author

@NitzanMordhai I see multiple related failures in standalone tests in the following run.

See https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrackercephcomissues74059 and the related tracker of this PR. This needs to be addressed before merging this PR.

@sseshasa thanks! The `return 1` that I added caused that:
seq=$(timeout $timeout ceph tell osd.$osd flush_pg_stats) || return 1
I need to fix it.
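One possible shape of such a fix is sketched below; the merged patch may differ. The idea is that after kill_daemons a timed-out flush for a dead OSD is expected, so the helper should skip that OSD rather than abort with `return 1`. The function name is illustrative, and `false` stands in for a `ceph tell` that fails.

```shell
#!/usr/bin/env bash
# Sketch: tolerate a failed/timed-out flush for a killed OSD instead
# of failing the whole helper. `false` stands in for
# `ceph tell osd.$osd flush_pg_stats` against a dead OSD.
flush_pg_stats_tolerant() {
    local t=${TIMEOUT:-2} osd seq
    for osd in "$@"; do
        if ! seq=$(timeout "$t" false); then
            echo "osd.$osd unresponsive, skipping" >&2
            continue    # do not abort: dead OSDs are expected here
        fi
    done
}

flush_pg_stats_tolerant 0 1 2 2>/dev/null
echo "rc=$?"
```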

…pg_scrub

In test_pg_scrub, after killing an OSD, subsequent pg_scrub checks and calls to flush_pg_stats
can hang or timeout with the default time because the OSD is no longer running.
This was causing test failures.

This fix addresses two issues:
1.  test_pg_scrub: Explicitly pass the WAIT_FOR_CLEAN_TIMEOUT and TIMEOUT variables (both set to 2)
    to the pg_scrub call after the OSD is killed. This prevents a hang in the wait_for_clean
    check within pg_scrub.
2.  flush_pg_stats: Add an explicit timeout to the ceph tell osd.$osd flush_pg_stats command,
    allowing it to fail quickly when an OSD is unresponsive.

Fixes: https://tracker.ceph.com/issues/74004
Signed-off-by: Nitzan Mordechai <nmordec@ibm.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-pg-scrub-standalone-test-hang branch from b36c66d to 118c5be Compare December 11, 2025 14:06

@ljflores
Member

@NitzanMordhai please see Sridhar's comment about regression from testing: https://tracker.ceph.com/issues/74004#note-11

@NitzanMordhai
Contributor Author

@NitzanMordhai please see Sridhar's comment about regression from testing: https://tracker.ceph.com/issues/74004#note-11

please see #66457 (comment)
and: #66457 (comment)

The PR was fixed, and I reran the failing jobs.

Contributor

@rzarzynski rzarzynski left a comment


LGTM assuming it passes the QA.

@ljflores
Member

ljflores commented Jan 8, 2026

@NitzanMordhai please see Sridhar's comment about regression from testing: https://tracker.ceph.com/issues/74004#note-11

please see #66457 (comment) and: #66457 (comment)

The PR was fixed, and I reran the failing jobs.

@NitzanMordhai as this change is only for the QA suite, I think this can be approved/merged based on the tests you scheduled. WDYT? Can you summarize the results for the tests you scheduled?

@NitzanMordhai
Contributor Author

Based on testing this branch in these runs: https://pulpito.ceph.com/nmordech-2025-12-11_14:38:40-rados:standalone-wip-bharath5-testing-2025-12-02-1511-distro-default-smithi/

we can merge this fix. The results show that we no longer time out and fail on tests where daemons were killed before checking pg_scrub.

@NitzanMordhai NitzanMordhai merged commit bd20e52 into ceph:main Jan 11, 2026
13 checks passed
@NitzanMordhai NitzanMordhai deleted the wip-nitzan-pg-scrub-standalone-test-hang branch January 11, 2026 12:40
@rzarzynski rzarzynski removed the DNM label Jan 12, 2026
@rzarzynski
Contributor

The DNM has lost its rationale because of #66457 (comment).

@kshtsk
Contributor

kshtsk commented Feb 3, 2026

I tried this patch on my tentacle branch and still see a similar issue:

2026-02-03T16:22:33.371 INFO:tasks.workunit.client.0.vm06.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1715: wait_for_pg_clean:  is_pg_clean 1.0
2026-02-03T16:22:33.371 INFO:tasks.workunit.client.0.vm06.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1577: is_pg_clean:  local pgid=1.0
2026-02-03T16:22:33.372 INFO:tasks.workunit.client.0.vm06.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1578: is_pg_clean:  local pg_state
2026-02-03T16:22:33.372 INFO:tasks.workunit.client.0.vm06.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1579: is_pg_clean:  ceph pg 1.0 query
2026-02-03T16:22:33.372 INFO:tasks.workunit.client.0.vm06.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1579: is_pg_clean:  jq -r '.state '
