test/ceph-helpers: Pass timeout and add timeout for commands in test_pg_scrub#66457
Conversation
qa/standalone/ceph-helpers.sh
Outdated
| pg_scrub 1.0 || return 1 | ||
| kill_daemons $dir KILL osd || return 1 | ||
| ! TIMEOUT=2 pg_scrub 1.0 || return 1 | ||
| ! WAIT_FOR_CLEAN_TIMEOUT=2 TIMEOUT=2 pg_scrub 1.0 || return 1 |
There was a problem hiding this comment.
The default for WAIT_FOR_CLEAN_TIMEOUT is 90. And the regular OSD to MON update cycle is 5 sec.
Isn't '2' a bit short?
There was a problem hiding this comment.
Its after kill_daemons, i thought we shouldn't wait too much.
There was a problem hiding this comment.
I'd use something larger than 5, anyway
|
I know this PR is getting tested, however, I think we might also have to look into how over-thrashing causes stale PGs and stale PGs causes PG query to fail per current discussion in https://tracker.ceph.com/issues/74004 |
Thanks @kamoltat i added my analysis to the tracker. by the way, i don't think we are doing any osd thrashing during standalone tests, but i may be wrong, we only kill osd during that specific test. |
81e8df0 to
b36c66d
Compare
ljflores
left a comment
There was a problem hiding this comment.
@NitzanMordhai can you fix the Signed-off-by check?
|
@NitzanMordhai I see multiple related failures in standalone tests. See https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrackercephcomissues74059 and the related tracker of this PR. This needs to be addressed before merging the PR. |
@sseshasa thanks! the return 1 that i added caused that.. |
…pg_scrub
In test_pg_scrub, after killing an OSD, subsequent pg_scrub checks and calls to flush_pg_stats
can hang or timeout with the default time because the OSD is no longer running.
This was causing test failures.
This fix addresses two issues:
1. test_pg_scrub: Explicitly pass the WAIT_FOR_CLEAN_TIMEOUT and TIMEOUT variables (both set to 2)
to the pg_scrub call after the OSD is killed. This prevents a hang in the wait_for_clean
check within pg_scrub.
2. flush_pg_stats: Add an explicit timeout to the ceph tell osd.$osd flush_pg_stats command,
allowing it to fail quickly when an OSD is unresponsive.
Fixes: https://tracker.ceph.com/issues/74004
Signed-off-by: Nitzan Mordechai <nmordec@ibm.com>
b36c66d to
118c5be
Compare
|
Made the change and waiting for the results: https://pulpito.ceph.com/nmordech-2025-12-11_14:38:40-rados:standalone-wip-bharath5-testing-2025-12-02-1511-distro-default-smithi/ |
|
@NitzanMordhai please see Sridhar's comment about regression from testing: https://tracker.ceph.com/issues/74004#note-11 |
please see #66457 (comment) The PR was fixed, and i reran the failing jobs |
rzarzynski
left a comment
There was a problem hiding this comment.
LGTM assuming it passes the QA.
@NitzanMordhai as this change is only for the QA suite, I think this can be approved/merged based on the tests you scheduled. WDYT? Can you summarize the results for the tests you scheduled? |
|
Based on test of this branch in those tests: https://pulpito.ceph.com/nmordech-2025-12-11_14:38:40-rados:standalone-wip-bharath5-testing-2025-12-02-1511-distro-default-smithi/ we can merge that fix. The results show that we no longer timeout and fail on tests that daemons were killed before checking pg_scrub |
|
The DNM has lost its rationale b/c of #66457 (comment). |
|
I tried this patch on my tentacle, still have similar issue: |
In test_pg_scrub, after killing an OSD, subsequent pg_scrub checks and calls to flush_pg_stats can hang or timeout with the default time because the OSD is no longer running. This was causing test failures.
This fix addresses two issues:
Fixes: https://tracker.ceph.com/issues/74004
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins test classic perfJenkins Job | Jenkins Job Definitionjenkins test crimson perfJenkins Job | Jenkins Job Definitionjenkins test signedJenkins Job | Jenkins Job Definitionjenkins test make checkJenkins Job | Jenkins Job Definitionjenkins test make check arm64Jenkins Job | Jenkins Job Definitionjenkins test submodulesJenkins Job | Jenkins Job Definitionjenkins test dashboardJenkins Job | Jenkins Job Definitionjenkins test dashboard cephadmJenkins Job | Jenkins Job Definitionjenkins test apiJenkins Job | Jenkins Job Definitionjenkins test docsReadTheDocs | Github Workflow Definitionjenkins test ceph-volume allJenkins Jobs | Jenkins Jobs Definitionjenkins test windowsJenkins Job | Jenkins Job Definitionjenkins test rook e2eJenkins Job | Jenkins Job DefinitionYou must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.