qa/workunits/rados/test.sh: add timeout mechanism and more info to workloads with parallel tests#64219
Conversation
ljflores
left a comment
There was a problem hiding this comment.
Looks good for a WIP! Let me know when it's ready to be fully reviewed and I can help with that.
Do you know a good way to verify the script locally? I would imagine it involves something like:
ninja ceph_test_rados
../qa/workunits/rados/test.sh
@ljflores good question, didn't think about that, I scheduled some teuthology jobs, I guess we'll see if the script works first, but yeah doing it locally first would have been better ... whoops |
|
With the new test change we are able to reproduce 3 failures and failed them within 25mins and pin point which pid/test causes the failure: ksirivad-2025-06-26_20:28:15-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8352030/teuthology.log 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: unknown file: Failure 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: C++ exception with description "(110) Connection timed out [osd:110]" thrown in the test body. 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ FAILED ] NeoRadosWatchNotifyPoll.WatchNotify (50468 ms) 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WatchNotifyTimeout 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ OK ] NeoRadosWatchNotifyPoll.WatchNotifyTimeout (3691 ms) 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WrongWatchType 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ OK ] NeoRadosWatchNotifyPoll.WrongWatchType (3035 ms) 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WatchNotifyCancel 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ OK ] NeoRadosWatchNotifyPoll.WatchNotifyCancel (8172 ms) 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [----------] 4 tests from NeoRadosWatchNotifyPoll (65368 ms total) 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [----------] Global test environment tear-down 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [==========] 6 tests from 2 test suites ran. (76902 ms total) 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ PASSED ] 5 tests. 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ FAILED ] 1 test, listed below: 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ FAILED ] NeoRadosWatchNotifyPoll.WatchNotify 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: 1 FAILED TEST . . > 2025-06-26T21:29:58.681 INFO:tasks.workunit.client.0.smithi103.stderr:+ echo 'Waiting for test watch_notify (PID 16474)...' 2025-06-26T21:29:58.681 INFO:tasks.workunit.client.0.smithi103.stdout:Waiting for test watch_notify (PID 16474)... 2025-06-26T21:29:58.682 INFO:tasks.workunit.client.0.smithi103.stderr:+ kill -0 16474 2025-06-26T21:29:58.682 INFO:tasks.workunit.client.0.smithi103.stderr:+ wait 16474 2025-06-26T21:29:58.682 INFO:tasks.workunit.client.0.smithi103.stdout:Test watch_notify (PID 16474) failed with non-zero exit status 2025-06-26T21:29:58.683 INFO:tasks.workunit.client.0.smithi103.stderr:+ echo 'Test watch_notify (PID 16474) failed with non-zero exit status' 2025-06-26T21:29:58.683 INFO:tasks.workunit.client.0.smithi103.stderr:+ ret=1 ksirivad-2025-06-26_20:28:15-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8352012/teuthology.log 2025-06-26T20:49:55.694 INFO:tasks.workunit.client.0.smithi089.stdout: api_io: [ SKIPPED ] LibRadosIoEC.RmXattr (0 ms) 2025-06-26T20:49:55.695 INFO:tasks.workunit.client.0.smithi089.stdout: api_io: [ RUN ] LibRadosIoEC.XattrIter 2025-06-26T20:49:55.695 INFO:tasks.workunit.client.0.smithi089.stdout: api_io: ./src/test/librados/io.cc:422: Skipped 2025-06-26T20:49:55.695 INFO:tasks.workunit.client.0.smithi089.stdout: api_io: 2025-06-26T20:49:55.695 INFO:tasks.workunit.client.0.smithi089.stdout: api_io: [ SKIPPED ] LibRadosIoEC.XattrIter (0 ms) 2025-06-26T20:49:55.724 INFO:tasks.workunit.client.0.smithi089.stderr:bash: line 1: 14730 Segmentation fault (core dumped) ceph_test_rados_api_io --gtest_output=xml:/home/ubuntu/cephtest/archive/unit_test_xml_report/api_io.xml 2>&1 2025-06-26T20:49:55.724 INFO:tasks.workunit.client.0.smithi089.stderr: 14731 Done | tee ceph_test_rados_api_io.log 2025-06-26T20:49:55.724 INFO:tasks.workunit.client.0.smithi089.stderr: 14732 Done | sed "s/^/ api_io: /" . . . 2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stdout:Waiting for test api_io (PID 14725)... 2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stderr:+ echo 'Waiting for test api_io (PID 14725)...' 2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stderr:+ kill -0 14725 2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stderr:+ wait 14725 2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stdout:Test api_io (PID 14725) failed with non-zero exit status 2025-06-26T21:07:09.755 INFO:tasks.workunit.client.0.smithi089.stderr:+ echo 'Test api_io (PID 14725) failed with non-zero exit status' 2025-06-26T21:07:09.755 INFO:tasks.workunit.client.0.smithi089.stderr:+ ret=1 ksirivad-2025-06-26_20:28:15-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8352036/teuthology.log 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ OK ] NeoRadosWatchNotify.WatchNotifyTimeout (7384 ms) 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [----------] 2 tests from NeoRadosWatchNotify (10809 ms total) 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [----------] 4 tests from NeoRadosWatchNotifyPoll 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WatchNotify 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: unknown file: Failure 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: C++ exception with description "(110) Connection timed out [osd:110]" thrown in the test body. 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ FAILED ] NeoRadosWatchNotifyPoll.WatchNotify (50285 ms) 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WatchNotifyTimeout 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ OK ] NeoRadosWatchNotifyPoll.WatchNotifyTimeout (9241 ms) 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WrongWatchType 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ OK ] NeoRadosWatchNotifyPoll.WrongWatchType (3948 ms) 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WatchNotifyCancel 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ OK ] NeoRadosWatchNotifyPoll.WatchNotifyCancel (3013 ms) 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [----------] 4 tests from NeoRadosWatchNotifyPoll (66488 ms total) 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [----------] Global test environment tear-down 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [==========] 6 tests from 2 test suites ran. (77298 ms total) 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ PASSED ] 5 tests. 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ FAILED ] 1 test, listed below: 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ FAILED ] NeoRadosWatchNotifyPoll.WatchNotify 2025-06-26T21:15:43.357 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: 2025-06-26T21:15:43.357 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: 1 FAILED TEST . . 2025-06-26T21:31:36.342 INFO:tasks.workunit.client.0.smithi181.stdout:Waiting for test watch_notify (PID 16785)... 2025-06-26T21:31:36.342 INFO:tasks.workunit.client.0.smithi181.stderr:+ pid=16785 2025-06-26T21:31:36.342 INFO:tasks.workunit.client.0.smithi181.stderr:+ echo 'Waiting for test watch_notify (PID 16785)...' 2025-06-26T21:31:36.343 INFO:tasks.workunit.client.0.smithi181.stderr:+ kill -0 16785 2025-06-26T21:31:36.343 INFO:tasks.workunit.client.0.smithi181.stderr:+ wait 16785 2025-06-26T21:31:36.343 INFO:tasks.workunit.client.0.smithi181.stdout:Test watch_notify (PID 16785) failed with non-zero exit status 2025-06-26T21:31:36.343 INFO:tasks.workunit.client.0.smithi181.stderr:+ echo 'Test watch_notify (PID 16785) failed with non-zero exit status' 2025-06-26T21:31:36.344 INFO:tasks.workunit.client.0.smithi181.stderr:+ ret=1 |
|
Currently we do not perform xml output for neo-rados suite, the commit 3a20d0b adds this. |
|
By enabling xml output for neo-rados suite, we are able to get teuthology to fail descriptively: https://pulpito.ceph.com/ksirivad-2025-06-27_18:50:34-rados:monthrash-main-distro-default-smithi/8353691/ |
3a20d0b to
94c9f1c
Compare
627ac95 to
1af98a2
Compare
|
Failure due to timeout plus unit-tests are more descriptive now:
|
1af98a2 to
87e49e3
Compare
There was a problem hiding this comment.
Looks good so far!
Beside comment below regarding timeout per-test / total timeout - it looks like "serial + timeout" case might be missed here.
The added timeout logic bit is only run under if [ $parallel -eq 1 ] meaning test.sh --serial --timeout 60 is not supported.
My suggestion would be to define --timeout as total timeout (for all tests), then we could also apply it to "serial" by decreasing the waited time from total time each time a test passes.
qa/workunits/rados/test.sh
Outdated
| echo "ERROR: Test $t ($pid) - TIMED OUT after $max_wait seconds" | ||
| kill -9 $pid 2>/dev/null || true | ||
| ret=1 | ||
| break |
| check_interval=10 | ||
| pid=${pids[$t]} | ||
| echo "Checking Test $t (PID $pid)..." | ||
| # Check in a loop with timeout |
There was a problem hiding this comment.
IIUC, after each test passes, timeout should be updated to
timeout = timeout - waited
Otherwise according to timeout=1800 # 30 minutes default value we would get 30 minutes for each test. A timeout for each test is also probably ok but should be noted in the comment above that this is the intent.
There was a problem hiding this comment.
Yes, well now after Bill's suggestion, we increased the timeout to 5400 (90 mins) for each test.
Will add more descriptive comments at the top so user is well aware that it is for each test
I can cherry-pick it and resolve the conflict. I'll review the rest as well |
f0c4d83 to
7c2e90d
Compare
…ests This commit adds a 90-minute timeout (configurable) for each test in the RADOS test suite when running in parallel mode. Previously, if a test hung, the entire test script would hang indefinitely, causing teutholgoy jobs to time out without useful diagnostic information. The implementation: Adds a timeout check in the process monitoring loop Forcibly terminates tests that run longer than 90 minutes (or whatever is configured) Ensures wait is only called after a process has ended or been killed Improves error reporting to identify which specific test caused the issue This change prevents indefinite hanging of the test script and provides better visibility into which test is problematic when timeouts occur. Fixes: https://tracker.ceph.com/issues/70772 Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
This commit adds XML output support for the neorados test suite, ensuring test results are properly captured in XML format. This change aligns the neorados tests with the existing behavior of the standard RADOS tests, which were already generating XML output files and utilize teuthology scanner to capture failures if any. Fixes: https://tracker.ceph.com/issues/70772 Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
Currently the failure description in teuthology doesn't give us much detail rather than COMMAND_FAILURE when the test timedout, this is because when we timeout and kill the process there are no XML files generated, therefore, this commit will create an XML file report when we decide to timeout and kill the long running test process. Fixes: https://tracker.ceph.com/issues/70772 Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
This commit addes a `--vstart` flag that developers can pass to run this script locally. The user must be in the `ceph/build` directory, which is explained as a comment in the script. The script prepares the test to be run locally by doing the following: 1. Compiling all the necessry targets 2. Setting up a test cluster 3. Prefixing `./bin/` to all executable commands to make them vstart-friendly 4. Tearing down the test cluster at the end. Fixes: https://tracker.ceph.com/issues/70772 Signed-off-by: Laura Flores <lflores@ibm.com>
- For --serial we use timeout function and --timeout value to to reutrn exit 1 if test exceeds --timeout. - Update documentation for command-line arguments such that user understands the order of argument matters. - Use test arrays for better organization of test definitions - Add test type tracking to properly handle both rados and neorados tests when falling back on XML - Change timeout default value to 90 minutes Fixes: https://tracker.ceph.com/issues/70772 Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
7c2e90d to
bfa6edb
Compare
|
Latest test running at: https://pulpito.ceph.com/ksirivad-2025-07-02_21:05:49-rados:monthrash-main-distro-default-smithi/ (Edit: All failures seems to be descriptive as intended, involving the correct timeout) |
|
Also surprisingly, I could not find a single test in |
|
jenkins test api |
|
jenkins test make check arm64 |
|
@kamoltat this is on my to do list to review! |
Matan-B
left a comment
There was a problem hiding this comment.
lgtm!
Also surprisingly, I could not find a single test in
suites/radosthat usesrados/test.sh --serial. I've seenrados/test.sh --crimsonbut notrados/test.sh --serial
I've introduced that specifically for Crimson (if --crimson is used parallel=0). This was useful since this test was still flaky from time to time with Crimson and it was hard to spot which test failed since other kept running.
I'll run this through the crimson suite as well once this is ready so we could test the --serial case!
| # ./test.sh --timeout 3600 # Parallel mode, 60-min timeout | ||
| # ./test.sh --serial --timeout 60 # Serial mode, 1-min timeout | ||
| # ./test.sh --crimson --timeout 0 # Crimson mode, no timeout | ||
| # ../qa/workunits/rados/test.sh --vstart # Run tests locally from `ceph/build` dir |

This commit adds a 90-minute timeout (configurable) for each test in the RADOS test suite
when running in parallel mode. Previously, if a test hung, the entire test
script would hang indefinitely,
causing teutholgoy jobs to time out without useful diagnostic information.
The implementation:
Adds a timeout check in the process monitoring loop
Forcibly terminates tests that run longer than 90 minutes (or whatever
is configured)
Ensures wait is only called after a process has ended or been killed
Improves error reporting to identify which specific test caused the issue
This change prevents indefinite hanging of the test script and provides
better visibility into which test is problematic when timeouts occur.
Also, we added XML output support for the
neoradostest suite,ensuring test results are properly captured in XML format. This change
aligns the neorados tests with the existing behavior of the standard
RADOS tests, which were already generating XML output files and utilize
teuthology scanner to capture failures if any.
Fixes: https://tracker.ceph.com/issues/70772
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins test classic perfJenkins Job | Jenkins Job Definitionjenkins test crimson perfJenkins Job | Jenkins Job Definitionjenkins test signedJenkins Job | Jenkins Job Definitionjenkins test make checkJenkins Job | Jenkins Job Definitionjenkins test make check arm64Jenkins Job | Jenkins Job Definitionjenkins test submodulesJenkins Job | Jenkins Job Definitionjenkins test dashboardJenkins Job | Jenkins Job Definitionjenkins test dashboard cephadmJenkins Job | Jenkins Job Definitionjenkins test apiJenkins Job | Jenkins Job Definitionjenkins test docsReadTheDocs | Github Workflow Definitionjenkins test ceph-volume allJenkins Jobs | Jenkins Jobs Definitionjenkins test windowsJenkins Job | Jenkins Job Definitionjenkins test rook e2eJenkins Job | Jenkins Job Definition