Bug #67029

open

api_tier_pp is killed after running for three hours

Added by Brad Hubbard over 1 year ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Tests
Target version:
-
% Done:

0%

Source:
Development
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

It looks like the api_tier_pp test is killed by a timeout: teuthology records remote process result 124, which is the exit status timeout(1) returns when it kills a command that exceeded its limit.

2024-07-17T11:08:09.585 INFO:tasks.workunit.client.0.smithi177.stderr:++ cleanup
2024-07-17T11:08:09.585 INFO:tasks.workunit.client.0.smithi177.stderr:++ pkill -P 40427
2024-07-17T11:08:09.621 DEBUG:teuthology.orchestra.run:got remote process result: 124
2024-07-17T11:08:09.622 INFO:tasks.workunit.client.0.smithi177.stderr:++ true
2024-07-17T11:08:09.625 INFO:tasks.workunit.client.0.smithi177.stdout:              api_tier_pp: [
2024-07-17T11:08:09.626 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2024-07-17T08:08:09.780 INFO:tasks.workunit.client.0.smithi177.stderr:+ ff=api_tier_pp
2024-07-17T08:08:09.780 INFO:tasks.workunit.client.0.smithi177.stderr:+ pid=40636
2024-07-17T08:08:09.780 INFO:tasks.workunit.client.0.smithi177.stderr:+ echo 'test api_tier_pp on pid 40636'
2024-07-17T08:08:09.780 INFO:tasks.workunit.client.0.smithi177.stdout:test api_tier_pp on pid 40636
2024-07-17T08:08:09.786 INFO:tasks.workunit.client.0.smithi177.stderr:+ pids[$f]=40636
2024-07-17T08:08:09.786 INFO:tasks.workunit.client.0.smithi177.stderr:+ for f in api_aio api_aio_pp api_io api_io_pp api_asio api_list api_lock api_lock_pp api_misc api_misc_pp api_tier_pp api_pool api_snapshots api_snapshots_pp api_stat api_stat_pp api_watch_notify api_watch_notify_pp api_cmd api_cmd_pp api_service api_service_pp api_c_write_operations api_c_read_operations list_parallel open_pools_parallel delete_pools_parallel
2024-07-17T08:08:09.786 INFO:tasks.workunit.client.0.smithi177.stderr:+ '[' 1 -eq 1 ']'
2024-07-17T08:08:09.786 INFO:tasks.workunit.client.0.smithi177.stderr:+ bash -o pipefail -exc 'ceph_test_rados_api_tier_pp --gtest_output=xml:/home/ubuntu/cephtest/archive/unit_test_xml_report/api_tier_pp.xml  2>&1 | tee ceph_test_rados_api_tier_pp.log | sed "s/^/              api_tier_pp: /"'
...
2024-07-17T08:12:16.090 INFO:tasks.workunit.client.0.smithi177.stdout:              api_tier_pp: [==========] Running 77 tests from 4 test suites.
...
2024-07-17T08:20:41.232 INFO:tasks.workunit.client.0.smithi177.stdout:              api_tier_pp: [ RUN      ] LibRadosTwoPoolsECPP.TryFlush
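For reference, the `got remote process result: 124` line above matches the exit status GNU `timeout` reports when it kills a command, which is why this looks like a timeout rather than a test failure. A quick local illustration (not part of the reproducer):

```shell
# GNU timeout kills the sleep after 1 second and reports exit status 124
timeout 1 sleep 5
echo $?
```

This prints `124`, the same status teuthology captured.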

There's only one client running on smithi177 at that time, and it had been running for three hours.

$ find . -name ceph-client.admin\*log.gz -print -exec bash -c "zcat {}|tail -1" \; |grep -B1 2024-07-17T11
./ceph-client.admin.40811.log.gz
2024-07-17T11:08:08.944+0000 7fe15e862640  1 --2- 172.21.15.177:0/880473647 >> [v2:172.21.15.19:3300/0,v1:172.21.15.19:6789/0] conn(0x7fe11c02e190 0x7fe11c02ece0 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._fault waiting 12.800000

$ zcat ./ceph-client.admin.40811.log.gz|(head -1 && tail -1)
2024-07-17T08:08:09.803+0000 7fe161c11a80  1  Processor -- start
2024-07-17T11:08:08.944+0000 7fe15e862640  1 --2- 172.21.15.177:0/880473647 >> [v2:172.21.15.19:3300/0,v1:172.21.15.19:6789/0] conn(0x7fe11c02e190 0x7fe11c02ece0 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._fault waiting 12.800000
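The first and last log lines bracket almost exactly three hours. GNU `date` can confirm the elapsed time between the two timestamps (copied from the log above, fractional seconds dropped):

```shell
# convert both log timestamps to epoch seconds and diff them
start=$(date -d '2024-07-17 08:08:09 +0000' +%s)
end=$(date -d '2024-07-17 11:08:08 +0000' +%s)
echo "$(( (end - start) / 60 )) minutes"
```

This prints `179 minutes`, i.e. one second short of three hours, consistent with a three-hour timeout firing.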

That log contains a lot of entries like the following that suggest a networking issue, but then why is only this one test affected?

2024-07-17T10:53:04.808+0000 7fe147fff640  1 -- 172.21.15.177:0/880473647 --> v2:172.21.15.155:6804/2258574764 -- ping magic: 0 -- 0x7fe1140a8310 con 0x7fe150032ae0
2024-07-17T10:53:08.934+0000 7fe15d860640  1 -- 172.21.15.177:0/880473647 >> [v2:172.21.15.66:3300/0,v1:172.21.15.66:6789/0] conn(0x7fe11c02f600 msgr2=0x7fe11c07ff20 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:172.21.15.66:3300/0
2024-07-17T10:53:08.934+0000 7fe15d860640  1 --2- 172.21.15.177:0/880473647 >> [v2:172.21.15.66:3300/0,v1:172.21.15.66:6789/0] conn(0x7fe11c02f600 0x7fe11c07ff20 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._fault waiting 12.800000
2024-07-17T10:53:08.934+0000 7fe15d860640  1 -- 172.21.15.177:0/880473647 >> [v2:172.21.15.155:3300/0,v1:172.21.15.155:6789/0] conn(0x7fe11c02f220 msgr2=0x7fe11c082690 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:172.21.15.155:3300/0
2024-07-17T10:53:08.934+0000 7fe15d860640  1 --2- 172.21.15.177:0/880473647 >> [v2:172.21.15.155:3300/0,v1:172.21.15.155:6789/0] conn(0x7fe11c02f220 0x7fe11c082690 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._fault waiting 12.800000
2024-07-17T10:53:08.935+0000 7fe15e862640  1 -- 172.21.15.177:0/880473647 >> [v2:172.21.15.19:3300/0,v1:172.21.15.19:6789/0] conn(0x7fe11c082bd0 msgr2=0x7fe11c02ece0 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:172.21.15.19:3300/0
2024-07-17T10:53:08.935+0000 7fe15e862640  1 --2- 172.21.15.177:0/880473647 >> [v2:172.21.15.19:3300/0,v1:172.21.15.19:6789/0] conn(0x7fe11c082bd0 0x7fe11c02ece0 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._fault waiting 12.800000
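The repeating `_fault waiting 12.800000` lines look like the messenger's exponential reconnect backoff having hit its plateau: assuming the default `ms_initial_backoff` of 0.2s doubling on each failed reconnect (an assumption about this cluster's config, not something confirmed in the log), the delay reaches 12.8s after six doublings:

```shell
# illustrative only: 0.2s doubled six times gives the 12.8s seen in the log
awk 'BEGIN { b = 0.2; for (i = 0; i < 6; i++) b *= 2; printf "%.1f\n", b }'
```

This prints `12.8`, suggesting the client had been failing to reconnect to the mons for long enough to max out its backoff.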

This issue showed up 8 times in the /a/bhubbard-2024-07-17_05:49:47-rados-wip-badone-testing-distro-default-smithi run. All of the failed jobs ran for about 3.5 hours, whereas successful runs take around 40 minutes.

Actions #1

Updated by Brad Hubbard over 1 year ago

  • Description updated (diff)