Bug #70772
openrados/verify: rados/test.sh hangs indefinitely after completing all api tests successfully.
0%
Description
/a/skanta-2025-03-27_08:02:07-rados-wip-bharath10-testing-2025-03-27-0430-distro-default-smithi/8212858
After completing all the api tests successfully, the script hung indefinitely until being timed out 6 hrs later.
The timeline is shown below:
2025-03-28T07:21:38.357 INFO:tasks.workunit:Running workunit rados/test.sh...
2025-03-28T07:21:38.358 DEBUG:teuthology.orchestra.run.smithi171:workunit test rados/test.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=bb974f864d212eb3e8e6f2885c2e87e25791786c TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 ALLOW_TIMEOUTS=1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh
...
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout: api_aio_pp: [ RUN ] LibRadosAio/Cancel.AfterComplete/cancel
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout: api_aio_pp: [ OK ] LibRadosAio/Cancel.AfterComplete/cancel (15 ms)
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout: api_aio_pp: [----------] 6 tests from LibRadosAio/Cancel (295 ms total)
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout: api_aio_pp:
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout: api_aio_pp: [----------] Global test environment tear-down
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout: api_aio_pp: [==========] 63 tests from 5 test suites ran. (489334 ms total)
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout: api_aio_pp: [ PASSED ] 63 tests.
2025-03-28T07:29:51.841 INFO:tasks.ceph.mon.a.smithi063.stderr:==00:00:10:41.393 39084== Warning: unimplemented fcntl command: 1036
2025-03-28T07:29:51.844 INFO:tasks.ceph.mon.a.smithi063.stderr:==00:00:10:41.396 39084== Warning: unimplemented fcntl command: 1036
... 6 hrs later
2025-03-28T13:21:38.415 INFO:tasks.workunit.client.0.smithi171.stderr:++ cleanup
2025-03-28T13:21:38.415 INFO:tasks.workunit.client.0.smithi171.stderr:++ pkill -P 42784
2025-03-28T13:21:38.441 DEBUG:teuthology.orchestra.run:got remote process result: 124
2025-03-28T13:21:38.442 INFO:tasks.workunit.client.0.smithi171.stderr:++ true
2025-03-28T13:21:38.442 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2025-03-28T13:21:38.442 DEBUG:teuthology.orchestra.run.smithi171:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2025-03-28T13:21:38.750 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/run_tasks.py", line 105, in run_tasks
manager = run_one_task(taskname, ctx=ctx, config=config)
File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/run_tasks.py", line 83, in run_one_task
return task(**kwargs)
File "/home/teuthworker/src/github.com_ceph_ceph-c_bb974f864d212eb3e8e6f2885c2e87e25791786c/qa/tasks/workunit.py", line 126, in task
File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/parallel.py", line 84, in __exit__
for result in self:
File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/parallel.py", line 98, in __next__
resurrect_traceback(result)
File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/parallel.py", line 30, in resurrect_traceback
raise exc.exc_info[1]
File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/parallel.py", line 23, in capture_traceback
return func(*args, **kwargs)
File "/home/teuthworker/src/github.com_ceph_ceph-c_bb974f864d212eb3e8e6f2885c2e87e25791786c/qa/tasks/workunit.py", line 434, in _run_tests
File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/orchestra/remote.py", line 535, in run
r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/orchestra/run.py", line 461, in run
r.wait()
File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/orchestra/run.py", line 161, in wait
self._raise_for_status()
File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/orchestra/run.py", line 181, in _raise_for_status
raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed (workunit test rados/test.sh) on smithi171 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=bb974f864d212eb3e8e6f2885c2e87e25791786c TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 ALLOW_TIMEOUTS=1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'
Categorizing this as a test issue since no crashes or api test failures were observed.
Updated by Laura Flores 11 months ago
/a/lflores-2025-04-14_22:14:25-rados-wip-lflores-testing-4-2025-04-14-1457-distro-default-gibba/8240914
Updated by Laura Flores 11 months ago
- Assignee set to Laura Flores
pkill -P 42784
Need to see where this pid is coming from. I'll have a look.
Updated by Shraddha Agrawal 9 months ago
- Backport set to tentacle
/a/yuriw-2025-05-20_14:56:40-rados-wip-yuri3-testing-2025-05-12-0801-tentacle-distro-default-smithi/8290636
Updated by Shraddha Agrawal 9 months ago
/a/skanta-2025-06-18_11:22:46-rados-wip-yuri-testing-2025-06-12-0731-distro-default-smithi/8334513
Updated by Sridhar Seshasayee 9 months ago
/a/skanta-2025-06-17_15:42:09-rados-wip-bharath13-testing-2025-06-17-1621-distro-default-smithi/8332769
Updated by Shraddha Agrawal 9 months ago
/a/yuriw-2025-06-19_19:29:56-rados-wip-yuri11-testing-2025-06-19-0935-distro-default-smithi/8337308
Updated by Radoslaw Zarzynski 9 months ago
- Status changed from New to In Progress
Updated by Shraddha Agrawal 9 months ago
/a/skanta-2025-06-25_05:53:02-rados-wip-bharath4-testing-2025-06-24-0841-distro-default-smithi/8347694
Updated by Kamoltat (Junior) Sirivadhna 9 months ago
/a/ksirivad-2025-06-25_02:26:36-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8347672/
Updated by Kamoltat (Junior) Sirivadhna 9 months ago
Filed a PR which hopefully give us more insights to which and why the rados test hangs
https://github.com/ceph/ceph/pull/64219
Updated by Kamoltat (Junior) Sirivadhna 9 months ago · Edited
With the new test change we are able to reproduce 3 failures and failed them within 25mins and pin point which pid/test causes the failure:
ksirivad-2025-06-26_20:28:15-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8352030/teuthology.log
2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: unknown file: Failure 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: C++ exception with description "(110) Connection timed out [osd:110]" thrown in the test body. 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ FAILED ] NeoRadosWatchNotifyPoll.WatchNotify (50468 ms) 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WatchNotifyTimeout 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ OK ] NeoRadosWatchNotifyPoll.WatchNotifyTimeout (3691 ms) 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WrongWatchType 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ OK ] NeoRadosWatchNotifyPoll.WrongWatchType (3035 ms) 2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WatchNotifyCancel 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ OK ] NeoRadosWatchNotifyPoll.WatchNotifyCancel (8172 ms) 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [----------] 4 tests from NeoRadosWatchNotifyPoll (65368 ms total) 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [----------] Global test environment tear-down 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [==========] 6 tests from 2 test suites ran. (76902 ms total) 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ PASSED ] 5 tests. 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ FAILED ] 1 test, listed below: 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: [ FAILED ] NeoRadosWatchNotifyPoll.WatchNotify 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: 2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout: watch_notify: 1 FAILED TEST . . > 2025-06-26T21:29:58.681 INFO:tasks.workunit.client.0.smithi103.stderr:+ echo 'Waiting for test watch_notify (PID 16474)...' 2025-06-26T21:29:58.681 INFO:tasks.workunit.client.0.smithi103.stdout:Waiting for test watch_notify (PID 16474)... 2025-06-26T21:29:58.682 INFO:tasks.workunit.client.0.smithi103.stderr:+ kill -0 16474 2025-06-26T21:29:58.682 INFO:tasks.workunit.client.0.smithi103.stderr:+ wait 16474 2025-06-26T21:29:58.682 INFO:tasks.workunit.client.0.smithi103.stdout:Test watch_notify (PID 16474) failed with non-zero exit status 2025-06-26T21:29:58.683 INFO:tasks.workunit.client.0.smithi103.stderr:+ echo 'Test watch_notify (PID 16474) failed with non-zero exit status' 2025-06-26T21:29:58.683 INFO:tasks.workunit.client.0.smithi103.stderr:+ ret=1
ksirivad-2025-06-26_20:28:15-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8352030/remote/smithi103/unit_test_xml_report/api_watch_notify.xml
Here how come the XML files didn't pick this up ....¶
<?xml version="1.0" encoding="UTF-8"?>
<testsuites tests="11" failures="0" disabled="0" errors="0" time="24.598" timestamp="2025-06-26T21:12:08.038" name="AllTests">
<testsuite name="LibRadosWatchNotify" tests="10" failures="0" disabled="0" skipped="0" errors="0" time="11.506" timestamp="2025-06-26T21:12:10.026">
<testcase name="WatchNotify" file="./src/test/librados/watch_notify.cc" line="135" status="run" result="completed" time="0.366" timestamp="2025-06-26T21:12:10.026" classname="LibRadosWatchNotify" />
<testcase name="Watch2Delete" file="./src/test/librados/watch_notify.cc" line="193" status="run" result="completed" time="1.021" timestamp="2025-06-26T21:12:10.392" classname="LibRadosWatchNotify" />
<testcase name="AioWatchDelete" file="./src/test/librados/watch_notify.cc" line="222" status="run" result="completed" time="1.01" timestamp="2025-06-26T21:12:11.414" classname="LibRadosWatchNotify" />
<testcase name="WatchNotify2" file="./src/test/librados/watch_notify.cc" line="261" status="run" result="completed" time="0.018" timestamp="2025-06-26T21:12:12.424" classname="LibRadosWatchNotify" />
<testcase name="AioWatchNotify2" file="./src/test/librados/watch_notify.cc" line="308" status="run" result="completed" time="0.014" timestamp="2025-06-26T21:12:12.442" classname="LibRadosWatchNotify" />
<testcase name="AioNotify" file="./src/test/librados/watch_notify.cc" line="362" status="run" result="completed" time="0.009" timestamp="2025-06-26T21:12:12.457" classname="LibRadosWatchNotify" />
<testcase name="WatchNotify2Multi" file="./src/test/librados/watch_notify.cc" line="417" status="run" result="completed" time="0.016" timestamp="2025-06-26T21:12:12.466" classname="LibRadosWatchNotify" />
<testcase name="WatchNotify2Timeout" file="./src/test/librados/watch_notify.cc" line="465" status="run" result="completed" time="3.01" timestamp="2025-06-26T21:12:12.482" classname="LibRadosWatchNotify" />
<testcase name="Watch3Timeout" file="./src/test/librados/watch_notify.cc" line="519" status="run" result="completed" time="5.023" timestamp="2025-06-26T21:12:15.493" classname="LibRadosWatchNotify" />
<testcase name="AioWatchDelete2" file="./src/test/librados/watch_notify.cc" line="621" status="run" result="completed" time="1.016" timestamp="2025-06-26T21:12:20.516" classname="LibRadosWatchNotify" />
</testsuite>
<testsuite name="LibRadosWatchNotifyEC" tests="1" failures="0" disabled="0" skipped="0" errors="0" time="1.127" timestamp="2025-06-26T21:12:28.198">
<testcase name="WatchNotify" file="./src/test/librados/watch_notify.cc" line="163" status="run" result="completed" time="1.127" timestamp="2025-06-26T21:12:28.198" classname="LibRadosWatchNotifyEC" />
</testsuite>
</testsuites>
ksirivad-2025-06-26_20:28:15-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8352012/teuthology.log
2025-06-26T20:49:55.694 INFO:tasks.workunit.client.0.smithi089.stdout: api_io: [ SKIPPED ] LibRadosIoEC.RmXattr (0 ms) 2025-06-26T20:49:55.695 INFO:tasks.workunit.client.0.smithi089.stdout: api_io: [ RUN ] LibRadosIoEC.XattrIter 2025-06-26T20:49:55.695 INFO:tasks.workunit.client.0.smithi089.stdout: api_io: ./src/test/librados/io.cc:422: Skipped 2025-06-26T20:49:55.695 INFO:tasks.workunit.client.0.smithi089.stdout: api_io: 2025-06-26T20:49:55.695 INFO:tasks.workunit.client.0.smithi089.stdout: api_io: [ SKIPPED ] LibRadosIoEC.XattrIter (0 ms) 2025-06-26T20:49:55.724 INFO:tasks.workunit.client.0.smithi089.stderr:bash: line 1: 14730 Segmentation fault (core dumped) ceph_test_rados_api_io --gtest_output=xml:/home/ubuntu/cephtest/archive/unit_test_xml_report/api_io.xml 2>&1 2025-06-26T20:49:55.724 INFO:tasks.workunit.client.0.smithi089.stderr: 14731 Done | tee ceph_test_rados_api_io.log 2025-06-26T20:49:55.724 INFO:tasks.workunit.client.0.smithi089.stderr: 14732 Done | sed "s/^/ api_io: /" . . . 2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stdout:Waiting for test api_io (PID 14725)... 2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stderr:+ echo 'Waiting for test api_io (PID 14725)...' 2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stderr:+ kill -0 14725 2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stderr:+ wait 14725 2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stdout:Test api_io (PID 14725) failed with non-zero exit status 2025-06-26T21:07:09.755 INFO:tasks.workunit.client.0.smithi089.stderr:+ echo 'Test api_io (PID 14725) failed with non-zero exit status' 2025-06-26T21:07:09.755 INFO:tasks.workunit.client.0.smithi089.stderr:+ ret=1
ksirivad-2025-06-26_20:28:15-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8352036/teuthology.log
2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ OK ] NeoRadosWatchNotify.WatchNotifyTimeout (7384 ms) 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [----------] 2 tests from NeoRadosWatchNotify (10809 ms total) 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [----------] 4 tests from NeoRadosWatchNotifyPoll 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WatchNotify 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: unknown file: Failure 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: C++ exception with description "(110) Connection timed out [osd:110]" thrown in the test body. 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ FAILED ] NeoRadosWatchNotifyPoll.WatchNotify (50285 ms) 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WatchNotifyTimeout 2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ OK ] NeoRadosWatchNotifyPoll.WatchNotifyTimeout (9241 ms) 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WrongWatchType 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ OK ] NeoRadosWatchNotifyPoll.WrongWatchType (3948 ms) 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ RUN ] NeoRadosWatchNotifyPoll.WatchNotifyCancel 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ OK ] NeoRadosWatchNotifyPoll.WatchNotifyCancel (3013 ms) 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [----------] 4 tests from NeoRadosWatchNotifyPoll (66488 ms total) 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [----------] Global test environment tear-down 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [==========] 6 tests from 2 test suites ran. (77298 ms total) 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ PASSED ] 5 tests. 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ FAILED ] 1 test, listed below: 2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: [ FAILED ] NeoRadosWatchNotifyPoll.WatchNotify 2025-06-26T21:15:43.357 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: 2025-06-26T21:15:43.357 INFO:tasks.workunit.client.0.smithi181.stdout: watch_notify: 1 FAILED TEST . . 2025-06-26T21:31:36.342 INFO:tasks.workunit.client.0.smithi181.stdout:Waiting for test watch_notify (PID 16785)... 2025-06-26T21:31:36.342 INFO:tasks.workunit.client.0.smithi181.stderr:+ pid=16785 2025-06-26T21:31:36.342 INFO:tasks.workunit.client.0.smithi181.stderr:+ echo 'Waiting for test watch_notify (PID 16785)...' 2025-06-26T21:31:36.343 INFO:tasks.workunit.client.0.smithi181.stderr:+ kill -0 16785 2025-06-26T21:31:36.343 INFO:tasks.workunit.client.0.smithi181.stderr:+ wait 16785 2025-06-26T21:31:36.343 INFO:tasks.workunit.client.0.smithi181.stdout:Test watch_notify (PID 16785) failed with non-zero exit status 2025-06-26T21:31:36.343 INFO:tasks.workunit.client.0.smithi181.stderr:+ echo 'Test watch_notify (PID 16785) failed with non-zero exit status' 2025-06-26T21:31:36.344 INFO:tasks.workunit.client.0.smithi181.stderr:+ ret=1
I'm surprised teuthology XML GTEST failure scanner did not catch this, I wonder why the failure reason still says command failed with rados/test.sh timeout 3h????
Updated by Shraddha Agrawal 9 months ago
/a/skanta-2025-06-26_23:33:10-rados-wip-bharath6-testing-2025-06-26-1840-tentacle-distro-default-smithi/8352363
Updated by Laura Flores 9 months ago
I edited Junior's script a bit so it can be run locally on a vstart cluster:
https://github.com/ceph/ceph/commit/e432e35081405e1966022b513ff960ed2fe6037f
Updated by Shraddha Agrawal 9 months ago
/a/skanta-2025-07-04_23:32:34-rados-wip-bharath13-testing-2025-07-04-0559-distro-default-smithi/8370615
Updated by Laura Flores 9 months ago
Junior's accompanying PR is under review: https://github.com/ceph/ceph/pull/64219
Updated by Shraddha Agrawal 9 months ago
/a/skanta-2025-07-03_10:29:59-rados-wip-bharath5-testing-2025-06-30-2106-distro-default-smithi/8368522
Updated by Laura Flores 8 months ago
Laura Flores wrote in #note-23:
Junior's accompanying PR is under review: https://github.com/ceph/ceph/pull/64219
This PR is in testing and will hopefully uncover more debug info.
Updated by Shraddha Agrawal 8 months ago
/a/skanta-2025-07-13_23:08:24-rados-wip-bharath4-testing-2025-07-13-0539-distro-default-smithi/8384537
Updated by Shraddha Agrawal 8 months ago
/a/skanta-2025-07-26_22:27:26-rados-wip-bharath7-testing-2025-07-26-0611-tentacle-distro-default-smithi/8409770
Updated by Laura Flores 8 months ago
Will re-look at this since Junior's PR has merged.
Updated by Laura Flores 8 months ago
Shraddha Agrawal wrote in #note-26:
/a/skanta-2025-07-13_23:08:24-rados-wip-bharath4-testing-2025-07-13-0539-distro-default-smithi/8384537
This one is actually https://tracker.ceph.com/issues/71930:
2025-07-14T00:00:44.495 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: *** Caught signal (Segmentation fault) ** 2025-07-14T00:00:44.495 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: in thread 7f8049de7640 thread_name:msgr-worker-0 2025-07-14T00:00:44.496 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: ceph version 20.3.0-1576-g5903ec9f (5903ec9fe3549e1eca582d2465a3c233125bca78) tentacle (dev - RelWithDebInfo) 2025-07-14T00:00:44.496 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 1: /lib64/libc.so.6(+0x3ea60) [0x7f804e03ea60] 2025-07-14T00:00:44.496 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 2: /lib64/librados.so.2(+0xcafd8) [0x7f804fa9bfd8] 2025-07-14T00:00:44.496 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 3: /lib64/librados.so.2(+0xa426a) [0x7f804fa7526a] 2025-07-14T00:00:44.496 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 4: /lib64/librados.so.2(+0x458fd) [0x7f804fa168fd] 2025-07-14T00:00:44.497 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 5: /usr/lib64/ceph/libceph-common.so.2(+0x7844f5) [0x7f804f5844f5] 2025-07-14T00:00:44.497 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 6: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x114c) [0x7f804f412bbc] 2025-07-14T00:00:44.497 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 7: (Objecter::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x17b) [0x7f804f3fe01b] 2025-07-14T00:00:44.497 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 8: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0xfd) [0x7f804f1bb0ad] 2025-07-14T00:00:44.497 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 9: /usr/lib64/ceph/libceph-common.so.2(+0x44e375) [0x7f804f24e375] 2025-07-14T00:00:44.498 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 10: (ProtocolV1::handle_message_footer(char*, int)+0xfdb) [0x7f804f26ddab] 2025-07-14T00:00:44.498 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 11: /usr/lib64/ceph/libceph-common.so.2(+0x471d5a) [0x7f804f271d5a] 2025-07-14T00:00:44.498 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 12: (AsyncConnection::process()+0x66b) [0x7f804f25a2ab] 2025-07-14T00:00:44.498 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 13: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1d1) [0x7f804f29eaf1] 2025-07-14T00:00:44.498 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 14: /usr/lib64/ceph/libceph-common.so.2(+0x49f656) [0x7f804f29f656] 2025-07-14T00:00:44.499 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 15: /lib64/libstdc++.so.6(+0xdbae4) [0x7f804e4dbae4] 2025-07-14T00:00:44.499 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 16: /lib64/libc.so.6(+0x8a3b2) [0x7f804e08a3b2] 2025-07-14T00:00:44.499 INFO:tasks.workunit.client.0.smithi134.stdout: api_tier_pp: 17: /lib64/libc.so.6(+0x10f430) [0x7f804e10f430]
Updated by Kamoltat (Junior) Sirivadhna 7 months ago
- Copied to Backport #72597: tentacle: rados/verify: rados/test.sh hangs indefinitely after completing all api tests successfully. added
Updated by Kamoltat (Junior) Sirivadhna 7 months ago
- Copied to Backport #72598: squid: rados/verify: rados/test.sh hangs indefinitely after completing all api tests successfully. added
Updated by Laura Flores 7 months ago
Need to check for valid re-occurrences with the new debug PR patch included.
Updated by Sridhar Seshasayee 7 months ago
skanta-2025-08-24_23:24:05-rados-wip-bharath9-testing-2025-08-24-1258-tentacle-distro-default-smithi/
[8461781, 8461788]
Updated by Jaya Prakash 6 months ago
/a/skanta-2025-08-28_03:20:37-rados-wip-bharath1-testing-2025-08-26-1433-distro-default-smithi/8467898
Updated by Laura Flores 6 months ago
Laura Flores wrote in #note-32:
Need to check for valid re-occurrences with the new debug PR patch included.
TODO: Schedule tests to reproduce valid occurrences.
Updated by Laura Flores 6 months ago
- Has duplicate Bug #72785: api test TIMED OUT after 5400 seconds added
Updated by Kamoltat (Junior) Sirivadhna 4 months ago
skanta-2025-11-01_02:37:10-rados-wip-bharath4-testing-2025-10-31-1459-distro-default-smithi/8578630/
2025-11-01T03:29:26.121 INFO:tasks.workunit.client.0.smithi188.stderr:+ pid=44312 2025-11-01T03:29:26.122 INFO:tasks.workunit.client.0.smithi188.stdout:Checking Test watch_notify (PID 44312)... 2025-11-01T03:29:26.122 INFO:tasks.workunit.client.0.smithi188.stderr:+ echo 'Checking Test watch_notify (PID 44312)...' 2025-11-01T03:29:26.122 INFO:tasks.workunit.client.0.smithi188.stderr:+ kill -0 44312 2025-11-01T03:29:26.122 INFO:tasks.workunit.client.0.smithi188.stderr:+ wait 44312 2025-11-01T03:29:26.122 INFO:tasks.workunit.client.0.smithi188.stdout:ERROR: Test watch_notify (PID 44312) failed with non-zero exit status 2025-11-01T03:29:26.123 INFO:tasks.workunit.client.0.smithi188.stderr:+ echo 'ERROR: Test watch_notify (PID 44312) failed with non-zero exit status' 2025-11-01T03:29:26.123 INFO:tasks.workunit.client.0.smithi188.stdout:Check the logs for failures in watch_notify 2025-11-01T03:29:26.123 INFO:tasks.workunit.client.0.smithi188.stderr:+ echo 'Check the logs for failures in watch_notify' 2025-11-01T03:29:26.123 INFO:tasks.workunit.client.0.smithi188.stderr:+ ret=1
Updated by Laura Flores 4 months ago
Checking with @Nitzan Mordechai on how best to approach this.
Updated by Laura Flores 4 months ago · Edited
/a/skanta-2025-11-22_04:28:18-rados-wip-lflores-testing-4-2025-11-21-1738-distro-default-smithi/8619618
2025-11-22T06:00:22.966 INFO:tasks.workunit.client.0.smithi088.stdout: watch_notify: terminate called without an active exception 2025-11-22T06:00:23.090 INFO:tasks.workunit.client.0.smithi088.stderr:bash: line 1: 45344 Aborted (core dumped) ceph_test_neorados_watch_notify --gtest_output=xml:/home/ubuntu/cephtest/archive/unit_test_xml_report/neorados_watch_notify.xml 2>&1 2025-11-22T06:00:23.090 INFO:tasks.workunit.client.0.smithi088.stderr: 45345 Done | tee ceph_test_neorados_watch_notify.log 2025-11-22T06:00:23.090 INFO:tasks.workunit.client.0.smithi088.stderr: 45346 Done | sed "s/^/ watch_notify: /" 2025-11-22T06:00:29.806 INFO:tasks.workunit.client.0.smithi088.stdout: api_io: Running main() from gmock_main.cc
Coredump available at: /a/skanta-2025-11-22_04:28:18-rados-wip-lflores-testing-4-2025-11-21-1738-distro-default-smithi/8619618/remote/smithi088/coredump
Updated by Laura Flores 4 months ago
Plan is to analyze the coredump now that we know which test it's coming from.
Updated by Laura Flores 2 months ago
Need to talk with neorados people about the coredump generated here.
Updated by Sridhar Seshasayee 20 days ago
/a/sseshasa-2026-02-26_14:56:45-rados-wip-sseshasa-testing-2026-02-26-1772100687-distro-default-trial/72333