Bug #59127


Jobs that normally complete much sooner last almost 12 hours

Added by Laura Flores about 3 years ago. Updated over 2 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Tags (freeform):

Description

Some jobs that normally complete much sooner are lasting almost 12 hours. This occurs across branches, so it's unlikely to be a Ceph regression.

In the rados suite, mostly cephadm and thrash-old-clients jobs are affected.

Example: https://pulpito.ceph.com/yuriw-2023-03-17_23:38:21-rados-reef-distro-default-smithi/7212164/


Related issues 6 (4 open, 2 closed)

Related to teuthology - Bug #59118: teuthology.orchestra.run: timed out waiting for gevent copy_file_to (Closed)
Related to Infrastructure - Bug #59123: Timeout opening channel (New)
Related to RADOS - Bug #56393: failed to complete snap trimming before timeout (Duplicate, Matan Breizman)
Related to Infrastructure - Bug #59282: OSError: [Errno 107] Transport endpoint is not connected (New)
Related to RADOS - Bug #59285: mon/mon-last-epoch-clean.sh: TEST_mon_last_clean_epoch failure due to stuck pgs (New)
Related to RADOS - Bug #59286: mon/test_mon_osdmap_prune.sh: test times out after 5+ hours (New)
Actions #1

Updated by Laura Flores about 3 years ago

  • Related to Bug #59118: teuthology.orchestra.run:timed out waiting for gevent copy_file_to added
Actions #2

Updated by Ronen Friedman about 3 years ago

A suggestion by Mark Kogan: could it be that the networking configuration was changed such that cluster data is routed through the "external" network instead of the multi-gigabit one?
That would match the scale of the disruption.
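One way to sanity-check this hypothesis on a test node is to ask the kernel which interface it would use to reach a cluster peer (e.g. via `ip route get`) and compare it against the expected high-speed interface. This is a minimal sketch; the peer address and interface names below are hypothetical, not taken from the affected machines.

```python
import re


def route_device(route_output: str) -> str:
    """Extract the outgoing device name from `ip route get <addr>` output."""
    m = re.search(r"\bdev\s+(\S+)", route_output)
    return m.group(1) if m else ""


# On a real node (hypothetical peer address), something like:
#   out = subprocess.check_output(["ip", "route", "get", "172.21.15.1"], text=True)
#   print(route_device(out))
# would show whether traffic leaves via the expected multi-gigabit interface
# or via the slower "external" one.
sample = "172.21.15.1 via 10.0.0.1 dev eno1 src 10.0.0.5 uid 0"
print(route_device(sample))
```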

Actions #3

Updated by Laura Flores about 3 years ago

  • Related to Bug #59123: Timeout opening channel added
Actions #4

Updated by Kamoltat (Junior) Sirivadhna about 3 years ago

Analysis of slowness in `task/progress`.

Good run (0:26:53):

/a/yuriw-2023-03-21_00:35:27-rados-main-distro-default-smithi/7214888/

Description: rados/mgr/{clusters/{2-node-mgr} debug/mgr mgr_ttl_cache/disable
mon_election/classic random-objectstore$/{bluestore-comp-zlib} supported-random-distro$/{centos_8} tasks/progress}

Bad run (8:33:56):

/a/yuriw-2023-03-18_00:57:11-rados-wip-yuri3-testing-2023-03-17-1235-quincy-distro-default-smithi/7213210/

Description: rados/mgr/{clusters/{2-node-mgr} debug/mgr mgr_ttl_cache/disable
mon_election/classic random-objectstore$/{bluestore-comp-lz4} supported-random-distro$/{centos_8} tasks/progress}

Analysis: https://docs.google.com/document/d/1KgMGNAK0kSWxyxC5axd2qTsLdZJJVgc-swSKgVdTvAU/edit#heading=h.umis5id5f357

Summary:

Comparing the logs between the two runs, it is safe to say that in the bad run everything in Ceph is simply slower, from starting up the OSDs to shutting down at the end of the run. task/progress contains 5 tests; in the bad run each test takes around 1 hour to finish, while in the good run each takes only 5 minutes.

According to the logs, in the bad run Ceph is roughly 7 times slower in starting all the OSDs and 12 times slower in performing the operations in each test. Logging by itself takes 6 times longer (6 seconds to log a config, while the good run takes < 1 second).

I suspect this is a network issue.
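A rough way to quantify this kind of slowdown is to take the timestamps of matching log lines (e.g. OSD startup begin/end) from the good and bad runs and compute the duration ratio. This is a sketch only; the timestamp format and the example durations below are assumptions for illustration, not values taken from the runs above.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> float:
    """Wall-clock time between two ISO-style log timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


def slowdown(good: float, bad: float) -> float:
    """How many times longer the bad run took for the same step."""
    return bad / good


# Hypothetical example: a step taking 5 min in the good run vs 35 min in the bad run.
good = elapsed_seconds("2023-03-21T01:00:00.000", "2023-03-21T01:05:00.000")
bad = elapsed_seconds("2023-03-18T01:00:00.000", "2023-03-18T01:35:00.000")
print(f"{slowdown(good, bad):.1f}x slower")  # prints "7.0x slower"
```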

Actions #5

Updated by Laura Flores about 3 years ago

  • Related to Bug #56393: failed to complete snap trimming before timeout added
Actions #6

Updated by Laura Flores almost 3 years ago

  • Related to Bug #59282: OSError: [Errno 107] Transport endpoint is not connected added
Actions #7

Updated by Laura Flores almost 3 years ago

  • Related to Bug #59285: mon/mon-last-epoch-clean.sh: TEST_mon_last_clean_epoch failure due to stuck pgs added
Actions #8

Updated by Laura Flores almost 3 years ago

  • Related to Bug #59286: mon/test_mon_osdmap_prune.sh: test times out after 5+ hours added
Actions #9

Updated by Zack Cerza over 2 years ago

  • Status changed from New to Can't reproduce

If this pops up again, we can reopen and take advantage of Junior's investigation.
