Project

General

Profile

Actions

Bug #70772

open

rados/verify: rados/test.sh hangs indefinitely after completing all api tests successfully.

Added by Sridhar Seshasayee 12 months ago. Updated 12 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Tests
Target version:
-
% Done:

0%

Source:
Backport:
tentacle
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

/a/skanta-2025-03-27_08:02:07-rados-wip-bharath10-testing-2025-03-27-0430-distro-default-smithi/8212858

After completing all the api tests successfully, the script hung indefinitely until being timed out 6 hrs later.

The timeline is shown below:

2025-03-28T07:21:38.357 INFO:tasks.workunit:Running workunit rados/test.sh...
2025-03-28T07:21:38.358 DEBUG:teuthology.orchestra.run.smithi171:workunit test rados/test.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=bb974f864d212eb3e8e6f2885c2e87e25791786c TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 ALLOW_TIMEOUTS=1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh

...

2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout:               api_aio_pp: [ RUN      ] LibRadosAio/Cancel.AfterComplete/cancel
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout:               api_aio_pp: [       OK ] LibRadosAio/Cancel.AfterComplete/cancel (15 ms)
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout:               api_aio_pp: [----------] 6 tests from LibRadosAio/Cancel (295 ms total)
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout:               api_aio_pp:
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout:               api_aio_pp: [----------] Global test environment tear-down
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout:               api_aio_pp: [==========] 63 tests from 5 test suites ran. (489334 ms total)
2025-03-28T07:29:47.798 INFO:tasks.workunit.client.0.smithi171.stdout:               api_aio_pp: [  PASSED  ] 63 tests.
2025-03-28T07:29:51.841 INFO:tasks.ceph.mon.a.smithi063.stderr:==00:00:10:41.393 39084== Warning: unimplemented fcntl command: 1036
2025-03-28T07:29:51.844 INFO:tasks.ceph.mon.a.smithi063.stderr:==00:00:10:41.396 39084== Warning: unimplemented fcntl command: 1036

... 6 hrs later

2025-03-28T13:21:38.415 INFO:tasks.workunit.client.0.smithi171.stderr:++ cleanup
2025-03-28T13:21:38.415 INFO:tasks.workunit.client.0.smithi171.stderr:++ pkill -P 42784
2025-03-28T13:21:38.441 DEBUG:teuthology.orchestra.run:got remote process result: 124
2025-03-28T13:21:38.442 INFO:tasks.workunit.client.0.smithi171.stderr:++ true
2025-03-28T13:21:38.442 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2025-03-28T13:21:38.442 DEBUG:teuthology.orchestra.run.smithi171:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2025-03-28T13:21:38.750 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_bb974f864d212eb3e8e6f2885c2e87e25791786c/qa/tasks/workunit.py", line 126, in task
  File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_bb974f864d212eb3e8e6f2885c2e87e25791786c/qa/tasks/workunit.py", line 434, in _run_tests
  File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/orchestra/remote.py", line 535, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/orchestra/run.py", line 461, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_984923e827011a0695000fc899d82bc4a3ea3e35/teuthology/orchestra/run.py", line 181, in _raise_for_status
    raise CommandFailedError(
teuthology.exceptions.CommandFailedError: Command failed (workunit test rados/test.sh) on smithi171 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=bb974f864d212eb3e8e6f2885c2e87e25791786c TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 ALLOW_TIMEOUTS=1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'

Categorizing this as a test issue since no crashes or api test failures were observed.


Related issues 3 (2 open1 closed)

Has duplicate RADOS - Bug #72785: api test TIMED OUT after 5400 secondsDuplicate

Actions
Copied to RADOS - Backport #72597: tentacle: rados/verify: rados/test.sh hangs indefinitely after completing all api tests successfully.In ProgressKamoltat (Junior) SirivadhnaActions
Copied to RADOS - Backport #72598: squid: rados/verify: rados/test.sh hangs indefinitely after completing all api tests successfully.In ProgressKamoltat (Junior) SirivadhnaActions
Actions #1

Updated by Sridhar Seshasayee 12 months ago

  • Description updated (diff)
Actions #2

Updated by Laura Flores 12 months ago

Let's observe...

Actions #3

Updated by Laura Flores 11 months ago

/a/lflores-2025-04-14_22:14:25-rados-wip-lflores-testing-4-2025-04-14-1457-distro-default-gibba/8240914

Actions #4

Updated by Laura Flores 11 months ago

  • Assignee set to Laura Flores

pkill -P 42784

Need to see where this pid is coming from. I'll have a look.

Actions #5

Updated by Laura Flores 11 months ago

Bump up

Actions #6

Updated by Radoslaw Zarzynski 10 months ago

scrub note: bump up.

Actions #7

Updated by Laura Flores 10 months ago

Bump up

Actions #8

Updated by Shraddha Agrawal 9 months ago

  • Backport set to tentacle

/a/yuriw-2025-05-20_14:56:40-rados-wip-yuri3-testing-2025-05-12-0801-tentacle-distro-default-smithi/8290636

Actions #9

Updated by Laura Flores 9 months ago

Bump up

Actions #10

Updated by Shraddha Agrawal 9 months ago

/a/skanta-2025-06-18_11:22:46-rados-wip-yuri-testing-2025-06-12-0731-distro-default-smithi/8334513

Actions #11

Updated by Sridhar Seshasayee 9 months ago

/a/skanta-2025-06-17_15:42:09-rados-wip-bharath13-testing-2025-06-17-1621-distro-default-smithi/8332769

Actions #12

Updated by Shraddha Agrawal 9 months ago

/a/yuriw-2025-06-19_19:29:56-rados-wip-yuri11-testing-2025-06-19-0935-distro-default-smithi/8337308

Actions #13

Updated by Laura Flores 9 months ago

Adding to my TODO list to investigate.

Actions #14

Updated by Radoslaw Zarzynski 9 months ago

  • Status changed from New to In Progress
Actions #15

Updated by Shraddha Agrawal 9 months ago

/a/skanta-2025-06-25_05:53:02-rados-wip-bharath4-testing-2025-06-24-0841-distro-default-smithi/8347694

Actions #16

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

/a/ksirivad-2025-06-25_02:26:36-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8347672/

Actions #17

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

Filed a PR which hopefully give us more insights to which and why the rados test hangs
https://github.com/ceph/ceph/pull/64219

Actions #18

Updated by Kamoltat (Junior) Sirivadhna 9 months ago · Edited

With the new test change we are able to reproduce 3 failures and failed them within 25mins and pin point which pid/test causes the failure:

ksirivad-2025-06-26_20:28:15-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8352030/teuthology.log

2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: unknown file: Failure
2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: C++ exception with description "(110) Connection timed out [osd:110]" thrown in the test body.
2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify:
2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [  FAILED  ] NeoRadosWatchNotifyPoll.WatchNotify (50468 ms)
2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [ RUN      ] NeoRadosWatchNotifyPoll.WatchNotifyTimeout
2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [       OK ] NeoRadosWatchNotifyPoll.WatchNotifyTimeout (3691 ms)
2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [ RUN      ] NeoRadosWatchNotifyPoll.WrongWatchType
2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [       OK ] NeoRadosWatchNotifyPoll.WrongWatchType (3035 ms)
2025-06-26T21:13:25.179 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [ RUN      ] NeoRadosWatchNotifyPoll.WatchNotifyCancel
2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [       OK ] NeoRadosWatchNotifyPoll.WatchNotifyCancel (8172 ms)
2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [----------] 4 tests from NeoRadosWatchNotifyPoll (65368 ms total)
2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify:
2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [----------] Global test environment tear-down
2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [==========] 6 tests from 2 test suites ran. (76902 ms total)
2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [  PASSED  ] 5 tests.
2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [  FAILED  ] 1 test, listed below:
2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify: [  FAILED  ] NeoRadosWatchNotifyPoll.WatchNotify
2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify:
2025-06-26T21:13:25.180 INFO:tasks.workunit.client.0.smithi103.stdout:             watch_notify:  1 FAILED TEST
.
.
>
2025-06-26T21:29:58.681 INFO:tasks.workunit.client.0.smithi103.stderr:+ echo 'Waiting for test watch_notify (PID 16474)...'
2025-06-26T21:29:58.681 INFO:tasks.workunit.client.0.smithi103.stdout:Waiting for test watch_notify (PID 16474)...
2025-06-26T21:29:58.682 INFO:tasks.workunit.client.0.smithi103.stderr:+ kill -0 16474
2025-06-26T21:29:58.682 INFO:tasks.workunit.client.0.smithi103.stderr:+ wait 16474
2025-06-26T21:29:58.682 INFO:tasks.workunit.client.0.smithi103.stdout:Test watch_notify (PID 16474) failed with non-zero exit status
2025-06-26T21:29:58.683 INFO:tasks.workunit.client.0.smithi103.stderr:+ echo 'Test watch_notify (PID 16474) failed with non-zero exit status'
2025-06-26T21:29:58.683 INFO:tasks.workunit.client.0.smithi103.stderr:+ ret=1

ksirivad-2025-06-26_20:28:15-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8352030/remote/smithi103/unit_test_xml_report/api_watch_notify.xml

Here how come the XML files didn't pick this up ....

<?xml version="1.0" encoding="UTF-8"?>
<testsuites tests="11" failures="0" disabled="0" errors="0" time="24.598" timestamp="2025-06-26T21:12:08.038" name="AllTests">
  <testsuite name="LibRadosWatchNotify" tests="10" failures="0" disabled="0" skipped="0" errors="0" time="11.506" timestamp="2025-06-26T21:12:10.026">
    <testcase name="WatchNotify" file="./src/test/librados/watch_notify.cc" line="135" status="run" result="completed" time="0.366" timestamp="2025-06-26T21:12:10.026" classname="LibRadosWatchNotify" />
    <testcase name="Watch2Delete" file="./src/test/librados/watch_notify.cc" line="193" status="run" result="completed" time="1.021" timestamp="2025-06-26T21:12:10.392" classname="LibRadosWatchNotify" />
    <testcase name="AioWatchDelete" file="./src/test/librados/watch_notify.cc" line="222" status="run" result="completed" time="1.01" timestamp="2025-06-26T21:12:11.414" classname="LibRadosWatchNotify" />
    <testcase name="WatchNotify2" file="./src/test/librados/watch_notify.cc" line="261" status="run" result="completed" time="0.018" timestamp="2025-06-26T21:12:12.424" classname="LibRadosWatchNotify" />
    <testcase name="AioWatchNotify2" file="./src/test/librados/watch_notify.cc" line="308" status="run" result="completed" time="0.014" timestamp="2025-06-26T21:12:12.442" classname="LibRadosWatchNotify" />
    <testcase name="AioNotify" file="./src/test/librados/watch_notify.cc" line="362" status="run" result="completed" time="0.009" timestamp="2025-06-26T21:12:12.457" classname="LibRadosWatchNotify" />
    <testcase name="WatchNotify2Multi" file="./src/test/librados/watch_notify.cc" line="417" status="run" result="completed" time="0.016" timestamp="2025-06-26T21:12:12.466" classname="LibRadosWatchNotify" />
    <testcase name="WatchNotify2Timeout" file="./src/test/librados/watch_notify.cc" line="465" status="run" result="completed" time="3.01" timestamp="2025-06-26T21:12:12.482" classname="LibRadosWatchNotify" />
    <testcase name="Watch3Timeout" file="./src/test/librados/watch_notify.cc" line="519" status="run" result="completed" time="5.023" timestamp="2025-06-26T21:12:15.493" classname="LibRadosWatchNotify" />
    <testcase name="AioWatchDelete2" file="./src/test/librados/watch_notify.cc" line="621" status="run" result="completed" time="1.016" timestamp="2025-06-26T21:12:20.516" classname="LibRadosWatchNotify" />
  </testsuite>
  <testsuite name="LibRadosWatchNotifyEC" tests="1" failures="0" disabled="0" skipped="0" errors="0" time="1.127" timestamp="2025-06-26T21:12:28.198">
    <testcase name="WatchNotify" file="./src/test/librados/watch_notify.cc" line="163" status="run" result="completed" time="1.127" timestamp="2025-06-26T21:12:28.198" classname="LibRadosWatchNotifyEC" />
  </testsuite>
</testsuites>

ksirivad-2025-06-26_20:28:15-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8352012/teuthology.log

2025-06-26T20:49:55.694 INFO:tasks.workunit.client.0.smithi089.stdout:                   api_io: [  SKIPPED ] LibRadosIoEC.RmXattr (0 ms)
2025-06-26T20:49:55.695 INFO:tasks.workunit.client.0.smithi089.stdout:                   api_io: [ RUN      ] LibRadosIoEC.XattrIter
2025-06-26T20:49:55.695 INFO:tasks.workunit.client.0.smithi089.stdout:                   api_io: ./src/test/librados/io.cc:422: Skipped
2025-06-26T20:49:55.695 INFO:tasks.workunit.client.0.smithi089.stdout:                   api_io:
2025-06-26T20:49:55.695 INFO:tasks.workunit.client.0.smithi089.stdout:                   api_io: [  SKIPPED ] LibRadosIoEC.XattrIter (0 ms)
2025-06-26T20:49:55.724 INFO:tasks.workunit.client.0.smithi089.stderr:bash: line 1: 14730 Segmentation fault      (core dumped) ceph_test_rados_api_io --gtest_output=xml:/home/ubuntu/cephtest/archive/unit_test_xml_report/api_io.xml 2>&1
2025-06-26T20:49:55.724 INFO:tasks.workunit.client.0.smithi089.stderr:     14731 Done                    | tee ceph_test_rados_api_io.log
2025-06-26T20:49:55.724 INFO:tasks.workunit.client.0.smithi089.stderr:     14732 Done                    | sed "s/^/                   api_io: /" 
.
.
.
2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stdout:Waiting for test api_io (PID 14725)...
2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stderr:+ echo 'Waiting for test api_io (PID 14725)...'
2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stderr:+ kill -0 14725
2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stderr:+ wait 14725
2025-06-26T21:07:09.754 INFO:tasks.workunit.client.0.smithi089.stdout:Test api_io (PID 14725) failed with non-zero exit status
2025-06-26T21:07:09.755 INFO:tasks.workunit.client.0.smithi089.stderr:+ echo 'Test api_io (PID 14725) failed with non-zero exit status'
2025-06-26T21:07:09.755 INFO:tasks.workunit.client.0.smithi089.stderr:+ ret=1

ksirivad-2025-06-26_20:28:15-rados:monthrash-wip-ksirivad-fix-msg-v2-distro-default-smithi/8352036/teuthology.log

2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [       OK ] NeoRadosWatchNotify.WatchNotifyTimeout (7384 ms)
2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [----------] 2 tests from NeoRadosWatchNotify (10809 ms total)
2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify:
2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [----------] 4 tests from NeoRadosWatchNotifyPoll
2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [ RUN      ] NeoRadosWatchNotifyPoll.WatchNotify
2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: unknown file: Failure
2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: C++ exception with description "(110) Connection timed out [osd:110]" thrown in the test body.
2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify:
2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [  FAILED  ] NeoRadosWatchNotifyPoll.WatchNotify (50285 ms)
2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [ RUN      ] NeoRadosWatchNotifyPoll.WatchNotifyTimeout
2025-06-26T21:15:43.355 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [       OK ] NeoRadosWatchNotifyPoll.WatchNotifyTimeout (9241 ms)
2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [ RUN      ] NeoRadosWatchNotifyPoll.WrongWatchType
2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [       OK ] NeoRadosWatchNotifyPoll.WrongWatchType (3948 ms)
2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [ RUN      ] NeoRadosWatchNotifyPoll.WatchNotifyCancel
2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [       OK ] NeoRadosWatchNotifyPoll.WatchNotifyCancel (3013 ms)
2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [----------] 4 tests from NeoRadosWatchNotifyPoll (66488 ms total)
2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify:
2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [----------] Global test environment tear-down
2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [==========] 6 tests from 2 test suites ran. (77298 ms total)
2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [  PASSED  ] 5 tests.
2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [  FAILED  ] 1 test, listed below:
2025-06-26T21:15:43.356 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify: [  FAILED  ] NeoRadosWatchNotifyPoll.WatchNotify
2025-06-26T21:15:43.357 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify:
2025-06-26T21:15:43.357 INFO:tasks.workunit.client.0.smithi181.stdout:             watch_notify:  1 FAILED TEST

.
.

2025-06-26T21:31:36.342 INFO:tasks.workunit.client.0.smithi181.stdout:Waiting for test watch_notify (PID 16785)...
2025-06-26T21:31:36.342 INFO:tasks.workunit.client.0.smithi181.stderr:+ pid=16785
2025-06-26T21:31:36.342 INFO:tasks.workunit.client.0.smithi181.stderr:+ echo 'Waiting for test watch_notify (PID 16785)...'
2025-06-26T21:31:36.343 INFO:tasks.workunit.client.0.smithi181.stderr:+ kill -0 16785
2025-06-26T21:31:36.343 INFO:tasks.workunit.client.0.smithi181.stderr:+ wait 16785
2025-06-26T21:31:36.343 INFO:tasks.workunit.client.0.smithi181.stdout:Test watch_notify (PID 16785) failed with non-zero exit status
2025-06-26T21:31:36.343 INFO:tasks.workunit.client.0.smithi181.stderr:+ echo 'Test watch_notify (PID 16785) failed with non-zero exit status'
2025-06-26T21:31:36.344 INFO:tasks.workunit.client.0.smithi181.stderr:+ ret=1

I'm surprised teuthology XML GTEST failure scanner did not catch this, I wonder why the failure reason still says command failed with rados/test.sh timeout 3h????

Actions #19

Updated by Shraddha Agrawal 9 months ago

/a/skanta-2025-06-26_23:33:10-rados-wip-bharath6-testing-2025-06-26-1840-tentacle-distro-default-smithi/8352363

Actions #20

Updated by Laura Flores 9 months ago

I edited Junior's script a bit so it can be run locally on a vstart cluster:
https://github.com/ceph/ceph/commit/e432e35081405e1966022b513ff960ed2fe6037f

FYI @Kamoltat (Junior) Sirivadhna

Actions #21

Updated by Laura Flores 9 months ago

Still in progress..

Actions #22

Updated by Shraddha Agrawal 9 months ago

/a/skanta-2025-07-04_23:32:34-rados-wip-bharath13-testing-2025-07-04-0559-distro-default-smithi/8370615

Actions #23

Updated by Laura Flores 9 months ago

Junior's accompanying PR is under review: https://github.com/ceph/ceph/pull/64219

Actions #24

Updated by Shraddha Agrawal 9 months ago

/a/skanta-2025-07-03_10:29:59-rados-wip-bharath5-testing-2025-06-30-2106-distro-default-smithi/8368522

Actions #25

Updated by Laura Flores 8 months ago

Laura Flores wrote in #note-23:

Junior's accompanying PR is under review: https://github.com/ceph/ceph/pull/64219

This PR is in testing and will hopefully uncover more debug info.

Actions #26

Updated by Shraddha Agrawal 8 months ago

/a/skanta-2025-07-13_23:08:24-rados-wip-bharath4-testing-2025-07-13-0539-distro-default-smithi/8384537

Actions #27

Updated by Shraddha Agrawal 8 months ago

/a/skanta-2025-07-26_22:27:26-rados-wip-bharath7-testing-2025-07-26-0611-tentacle-distro-default-smithi/8409770

Actions #28

Updated by Laura Flores 8 months ago

Will re-look at this since Junior's PR has merged.

Actions #29

Updated by Laura Flores 8 months ago

Shraddha Agrawal wrote in #note-26:

/a/skanta-2025-07-13_23:08:24-rados-wip-bharath4-testing-2025-07-13-0539-distro-default-smithi/8384537

This one is actually https://tracker.ceph.com/issues/71930:

2025-07-14T00:00:44.495 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp: *** Caught signal (Segmentation fault) **
2025-07-14T00:00:44.495 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  in thread 7f8049de7640 thread_name:msgr-worker-0
2025-07-14T00:00:44.496 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  ceph version 20.3.0-1576-g5903ec9f (5903ec9fe3549e1eca582d2465a3c233125bca78) tentacle (dev - RelWithDebInfo)
2025-07-14T00:00:44.496 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  1: /lib64/libc.so.6(+0x3ea60) [0x7f804e03ea60]
2025-07-14T00:00:44.496 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  2: /lib64/librados.so.2(+0xcafd8) [0x7f804fa9bfd8]
2025-07-14T00:00:44.496 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  3: /lib64/librados.so.2(+0xa426a) [0x7f804fa7526a]
2025-07-14T00:00:44.496 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  4: /lib64/librados.so.2(+0x458fd) [0x7f804fa168fd]
2025-07-14T00:00:44.497 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  5: /usr/lib64/ceph/libceph-common.so.2(+0x7844f5) [0x7f804f5844f5]
2025-07-14T00:00:44.497 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  6: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x114c) [0x7f804f412bbc]
2025-07-14T00:00:44.497 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  7: (Objecter::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x17b) [0x7f804f3fe01b]
2025-07-14T00:00:44.497 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  8: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0xfd) [0x7f804f1bb0ad]
2025-07-14T00:00:44.497 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  9: /usr/lib64/ceph/libceph-common.so.2(+0x44e375) [0x7f804f24e375]
2025-07-14T00:00:44.498 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  10: (ProtocolV1::handle_message_footer(char*, int)+0xfdb) [0x7f804f26ddab]
2025-07-14T00:00:44.498 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  11: /usr/lib64/ceph/libceph-common.so.2(+0x471d5a) [0x7f804f271d5a]
2025-07-14T00:00:44.498 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  12: (AsyncConnection::process()+0x66b) [0x7f804f25a2ab]
2025-07-14T00:00:44.498 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  13: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1d1) [0x7f804f29eaf1]
2025-07-14T00:00:44.498 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  14: /usr/lib64/ceph/libceph-common.so.2(+0x49f656) [0x7f804f29f656]
2025-07-14T00:00:44.499 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  15: /lib64/libstdc++.so.6(+0xdbae4) [0x7f804e4dbae4]
2025-07-14T00:00:44.499 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  16: /lib64/libc.so.6(+0x8a3b2) [0x7f804e08a3b2]
2025-07-14T00:00:44.499 INFO:tasks.workunit.client.0.smithi134.stdout:              api_tier_pp:  17: /lib64/libc.so.6(+0x10f430) [0x7f804e10f430]

Actions #30

Updated by Kamoltat (Junior) Sirivadhna 7 months ago

  • Copied to Backport #72597: tentacle: rados/verify: rados/test.sh hangs indefinitely after completing all api tests successfully. added
Actions #31

Updated by Kamoltat (Junior) Sirivadhna 7 months ago

  • Copied to Backport #72598: squid: rados/verify: rados/test.sh hangs indefinitely after completing all api tests successfully. added
Actions #32

Updated by Laura Flores 7 months ago

Need to check for valid re-occurrences with the new debug PR patch included.

Actions #33

Updated by Sridhar Seshasayee 7 months ago

skanta-2025-08-24_23:24:05-rados-wip-bharath9-testing-2025-08-24-1258-tentacle-distro-default-smithi/
[8461781, 8461788]

Actions #34

Updated by Jaya Prakash 6 months ago

/a/skanta-2025-08-28_03:20:37-rados-wip-bharath1-testing-2025-08-26-1433-distro-default-smithi/8467898

Actions #35

Updated by Laura Flores 6 months ago

Laura Flores wrote in #note-32:

Need to check for valid re-occurrences with the new debug PR patch included.

TODO: Schedule tests to reproduce valid occurrences.

Actions #36

Updated by Laura Flores 6 months ago

  • Has duplicate Bug #72785: api test TIMED OUT after 5400 seconds added
Actions #37

Updated by Kamoltat (Junior) Sirivadhna 4 months ago

skanta-2025-11-01_02:37:10-rados-wip-bharath4-testing-2025-10-31-1459-distro-default-smithi/8578630/

2025-11-01T03:29:26.121 INFO:tasks.workunit.client.0.smithi188.stderr:+ pid=44312
2025-11-01T03:29:26.122 INFO:tasks.workunit.client.0.smithi188.stdout:Checking Test watch_notify (PID 44312)...
2025-11-01T03:29:26.122 INFO:tasks.workunit.client.0.smithi188.stderr:+ echo 'Checking Test watch_notify (PID 44312)...'
2025-11-01T03:29:26.122 INFO:tasks.workunit.client.0.smithi188.stderr:+ kill -0 44312
2025-11-01T03:29:26.122 INFO:tasks.workunit.client.0.smithi188.stderr:+ wait 44312
2025-11-01T03:29:26.122 INFO:tasks.workunit.client.0.smithi188.stdout:ERROR: Test watch_notify (PID 44312) failed with non-zero exit status
2025-11-01T03:29:26.123 INFO:tasks.workunit.client.0.smithi188.stderr:+ echo 'ERROR: Test watch_notify (PID 44312) failed with non-zero exit status'
2025-11-01T03:29:26.123 INFO:tasks.workunit.client.0.smithi188.stdout:Check the logs for failures in watch_notify
2025-11-01T03:29:26.123 INFO:tasks.workunit.client.0.smithi188.stderr:+ echo 'Check the logs for failures in watch_notify'
2025-11-01T03:29:26.123 INFO:tasks.workunit.client.0.smithi188.stderr:+ ret=1

Actions #38

Updated by Laura Flores 4 months ago

Checking with @Nitzan Mordechai on how best to approach this.

Actions #39

Updated by Laura Flores 4 months ago · Edited

/a/skanta-2025-11-22_04:28:18-rados-wip-lflores-testing-4-2025-11-21-1738-distro-default-smithi/8619618

2025-11-22T06:00:22.966 INFO:tasks.workunit.client.0.smithi088.stdout:             watch_notify: terminate called without an active exception
2025-11-22T06:00:23.090 INFO:tasks.workunit.client.0.smithi088.stderr:bash: line 1: 45344 Aborted                 (core dumped) ceph_test_neorados_watch_notify --gtest_output=xml:/home/ubuntu/cephtest/archive/unit_test_xml_report/neorados_watch_notify.xml 2>&1
2025-11-22T06:00:23.090 INFO:tasks.workunit.client.0.smithi088.stderr:     45345 Done                    | tee ceph_test_neorados_watch_notify.log
2025-11-22T06:00:23.090 INFO:tasks.workunit.client.0.smithi088.stderr:     45346 Done                    | sed "s/^/             watch_notify: /" 
2025-11-22T06:00:29.806 INFO:tasks.workunit.client.0.smithi088.stdout:                   api_io: Running main() from gmock_main.cc

Coredump available at: /a/skanta-2025-11-22_04:28:18-rados-wip-lflores-testing-4-2025-11-21-1738-distro-default-smithi/8619618/remote/smithi088/coredump

Actions #40

Updated by Laura Flores 4 months ago

Plan is to analyze the coredump now that we know which test it's coming from.

Actions #41

Updated by Radoslaw Zarzynski 3 months ago

Bump up.

Actions #42

Updated by Laura Flores 3 months ago

Bump up

Actions #43

Updated by Laura Flores 2 months ago

Need to talk with neorados people about the coredump generated here.

Actions #44

Updated by Laura Flores 2 months ago

Bump up

Actions #45

Updated by Laura Flores about 2 months ago

Bump up

Actions #46

Updated by Sridhar Seshasayee 20 days ago

/a/sseshasa-2026-02-26_14:56:45-rados-wip-sseshasa-testing-2026-02-26-1772100687-distro-default-trial/72333

Actions #47

Updated by Radoslaw Zarzynski 19 days ago

@Laura Flores: anything new?

Actions #48

Updated by Radoslaw Zarzynski 12 days ago

Bump up.

Actions

Also available in: Atom PDF