Bug #67029
api_tier_pp is killed after running for three hours
Status: New
Priority: Normal
Assignee: -
Category: Tests
Target version: -
% Done: 0%
Source: Development
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:
Description
It looks like the api_tier_pp test gets killed by some sort of timer/timeout.
2024-07-17T11:08:09.585 INFO:tasks.workunit.client.0.smithi177.stderr:++ cleanup
2024-07-17T11:08:09.585 INFO:tasks.workunit.client.0.smithi177.stderr:++ pkill -P 40427
2024-07-17T11:08:09.621 DEBUG:teuthology.orchestra.run:got remote process result: 124
2024-07-17T11:08:09.622 INFO:tasks.workunit.client.0.smithi177.stderr:++ true
2024-07-17T11:08:09.625 INFO:tasks.workunit.client.0.smithi177.stdout: api_tier_pp: [
2024-07-17T11:08:09.626 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
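The "remote process result: 124" above is consistent with the test being killed by a timer: GNU coreutils' timeout exits with status 124 when it has to terminate the command it is supervising. A minimal illustration:

```shell
# GNU timeout returns 124 when the wrapped command exceeds its time
# limit, matching the "got remote process result: 124" in the log.
timeout 1 sleep 5
echo "exit status: $?"   # prints "exit status: 124"
```

So the 124 tells us the harness's time limit fired, not that the test binary itself crashed.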
2024-07-17T08:08:09.780 INFO:tasks.workunit.client.0.smithi177.stderr:+ ff=api_tier_pp
2024-07-17T08:08:09.780 INFO:tasks.workunit.client.0.smithi177.stderr:+ pid=40636
2024-07-17T08:08:09.780 INFO:tasks.workunit.client.0.smithi177.stderr:+ echo 'test api_tier_pp on pid 40636'
2024-07-17T08:08:09.780 INFO:tasks.workunit.client.0.smithi177.stdout:test api_tier_pp on pid 40636
2024-07-17T08:08:09.786 INFO:tasks.workunit.client.0.smithi177.stderr:+ pids[$f]=40636
2024-07-17T08:08:09.786 INFO:tasks.workunit.client.0.smithi177.stderr:+ for f in api_aio api_aio_pp api_io api_io_pp api_asio api_list api_lock api_lock_pp api_misc api_misc_pp api_tier_pp api_pool api_snapshots api_snapshots_pp api_stat api_stat_pp api_watch_notify api_watch_notify_pp api_cmd api_cmd_pp api_service api_service_pp api_c_write_operations api_c_read_operations list_parallel open_pools_parallel delete_pools_parallel
2024-07-17T08:08:09.786 INFO:tasks.workunit.client.0.smithi177.stderr:+ '[' 1 -eq 1 ']'
2024-07-17T08:08:09.786 INFO:tasks.workunit.client.0.smithi177.stderr:+ bash -o pipefail -exc 'ceph_test_rados_api_tier_pp --gtest_output=xml:/home/ubuntu/cephtest/archive/unit_test_xml_report/api_tier_pp.xml 2>&1 | tee ceph_test_rados_api_tier_pp.log | sed "s/^/ api_tier_pp: /"'
...
2024-07-17T08:12:16.090 INFO:tasks.workunit.client.0.smithi177.stdout: api_tier_pp: [==========] Running 77 tests from 4 test suites.
...
2024-07-17T08:20:41.232 INFO:tasks.workunit.client.0.smithi177.stdout: api_tier_pp: [ RUN ] LibRadosTwoPoolsECPP.TryFlush
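Note the `bash -o pipefail` wrapper in the launch line above. It matters here because the gtest binary's output is piped through tee and sed; without pipefail the pipeline's exit status would be sed's, masking the test's own status. A minimal sketch of that behavior (the `exit 3` stands in for a failing or killed test binary):

```shell
# With pipefail, the first failing command's status (3 here) survives
# the tee/sed pipeline instead of being replaced by sed's status (0).
set -o pipefail
( exit 3 ) | tee /dev/null | sed 's/^/ api_tier_pp: /'
echo "pipeline status: $?"   # prints "pipeline status: 3"
```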
There was only one client running on smithi177 at that time, and it had been running for three hours.
$ find . -name ceph-client.admin\*log.gz -print -exec bash -c "zcat {}|tail -1" \; |grep -B1 2024-07-17T11
./ceph-client.admin.40811.log.gz
2024-07-17T11:08:08.944+0000 7fe15e862640 1 --2- 172.21.15.177:0/880473647 >> [v2:172.21.15.19:3300/0,v1:172.21.15.19:6789/0] conn(0x7fe11c02e190 0x7fe11c02ece0 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._fault waiting 12.800000
$ zcat ./ceph-client.admin.40811.log.gz|(head -1 && tail -1)
2024-07-17T08:08:09.803+0000 7fe161c11a80 1 Processor -- start
2024-07-17T11:08:08.944+0000 7fe15e862640 1 --2- 172.21.15.177:0/880473647 >> [v2:172.21.15.19:3300/0,v1:172.21.15.19:6789/0] conn(0x7fe11c02e190 0x7fe11c02ece0 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._fault waiting 12.800000
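The first and last entries of that client log span almost exactly three hours. With GNU date the elapsed time can be computed directly from the two timestamps:

```shell
# Illustrative arithmetic: convert the first and last log timestamps
# to epoch seconds and subtract. 10799 s is one second short of 3 h.
start='2024-07-17T08:08:09+0000'   # first line of ceph-client.admin.40811.log
end='2024-07-17T11:08:08+0000'     # last line, just before the kill
echo "$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) )) seconds"   # prints "10799 seconds"
```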
That log contains many entries like the following, which look like a networking issue, but why does only this one test hit it?
2024-07-17T10:53:04.808+0000 7fe147fff640 1 -- 172.21.15.177:0/880473647 --> v2:172.21.15.155:6804/2258574764 -- ping magic: 0 -- 0x7fe1140a8310 con 0x7fe150032ae0
2024-07-17T10:53:08.934+0000 7fe15d860640 1 -- 172.21.15.177:0/880473647 >> [v2:172.21.15.66:3300/0,v1:172.21.15.66:6789/0] conn(0x7fe11c02f600 msgr2=0x7fe11c07ff20 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:172.21.15.66:3300/0
2024-07-17T10:53:08.934+0000 7fe15d860640 1 --2- 172.21.15.177:0/880473647 >> [v2:172.21.15.66:3300/0,v1:172.21.15.66:6789/0] conn(0x7fe11c02f600 0x7fe11c07ff20 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._fault waiting 12.800000
2024-07-17T10:53:08.934+0000 7fe15d860640 1 -- 172.21.15.177:0/880473647 >> [v2:172.21.15.155:3300/0,v1:172.21.15.155:6789/0] conn(0x7fe11c02f220 msgr2=0x7fe11c082690 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:172.21.15.155:3300/0
2024-07-17T10:53:08.934+0000 7fe15d860640 1 --2- 172.21.15.177:0/880473647 >> [v2:172.21.15.155:3300/0,v1:172.21.15.155:6789/0] conn(0x7fe11c02f220 0x7fe11c082690 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._fault waiting 12.800000
2024-07-17T10:53:08.935+0000 7fe15e862640 1 -- 172.21.15.177:0/880473647 >> [v2:172.21.15.19:3300/0,v1:172.21.15.19:6789/0] conn(0x7fe11c082bd0 msgr2=0x7fe11c02ece0 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:172.21.15.19:3300/0
2024-07-17T10:53:08.935+0000 7fe15e862640 1 --2- 172.21.15.177:0/880473647 >> [v2:172.21.15.19:3300/0,v1:172.21.15.19:6789/0] conn(0x7fe11c082bd0 0x7fe11c02ece0 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._fault waiting 12.800000
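The `_fault waiting 12.800000` value is what the messenger's exponential reconnect backoff would produce, assuming the default `ms_initial_backoff` of 0.2 s doubling up to `ms_max_backoff` of 15 s (an assumption about this cluster's config, not confirmed from the logs). A quick sketch of that doubling:

```shell
# Doubling from an assumed 0.2 s initial backoff, capped at 15 s,
# tops out at exactly the 12.800000 s seen in the _fault lines.
awk 'BEGIN { b = 0.2; while (b <= 15) { printf "%f\n", b; b *= 2 } }'
```

Backoffs pinned at the maximum like this suggest the client had been failing to reach the mons for a long time, i.e. the connection attempts never succeeded rather than flapping.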
This issue showed up 8 times in the /a/bhubbard-2024-07-17_05:49:47-rados-wip-badone-testing-distro-default-smithi run. All of the affected jobs ran for about 3.5 hours, as opposed to the successful runs, which took around 40 minutes.