Bug #59127


Jobs that normally complete much sooner last almost 12 hours

Added by Laura Flores about 3 years ago. Updated over 2 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Tags (freeform):

Description

Some jobs that normally complete much sooner are lasting almost 12 hours. This occurs across branches, so it's unlikely to be a Ceph regression.

In the rados suite, mostly cephadm and thrash-old-clients jobs are affected.

Example: https://pulpito.ceph.com/yuriw-2023-03-17_23:38:21-rados-reef-distro-default-smithi/7212164/


Related issues 6 (4 open, 2 closed)

Related to teuthology - Bug #59118: teuthology.orchestra.run: timed out waiting for gevent copy_file_to (Closed)
Related to Infrastructure - Bug #59123: Timeout opening channel (New)
Related to RADOS - Bug #56393: failed to complete snap trimming before timeout (Duplicate, Matan Breizman)
Related to Infrastructure - Bug #59282: OSError: [Errno 107] Transport endpoint is not connected (New)
Related to RADOS - Bug #59285: mon/mon-last-epoch-clean.sh: TEST_mon_last_clean_epoch failure due to stuck pgs (New)
Related to RADOS - Bug #59286: mon/test_mon_osdmap_prune.sh: test times out after 5+ hours (New)
Actions #1

Updated by Laura Flores about 3 years ago

  • Related to Bug #59118: teuthology.orchestra.run:timed out waiting for gevent copy_file_to added
Actions #2

Updated by Ronen Friedman about 3 years ago

A suggestion by Mark Kogan: could it be that the networking configuration was changed such that cluster data is routed through the "external" network instead of the multi-gigabit one?
That would match the scale of the disruption.
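One way to sanity-check this hypothesis on a test node is to ask the kernel which interface it would use to reach a cluster peer (e.g. via `ip route get`) and compare it against the expected high-speed interface. This is a minimal sketch; the peer address and interface names below are hypothetical, not taken from the affected machines.

```python
import re


def route_device(route_output: str) -> str:
    """Extract the outgoing device name from `ip route get <addr>` output."""
    m = re.search(r"\bdev\s+(\S+)", route_output)
    return m.group(1) if m else ""


# On a real node (hypothetical peer address), something like:
#   out = subprocess.check_output(["ip", "route", "get", "172.21.15.1"], text=True)
#   print(route_device(out))
# would show whether traffic leaves via the expected multi-gigabit interface
# or via the slower "external" one.
sample = "172.21.15.1 via 10.0.0.1 dev eno1 src 10.0.0.5 uid 0"
print(route_device(sample))
```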

Actions #3

Updated by Laura Flores about 3 years ago

  • Related to Bug #59123: Timeout opening channel added
Actions #4

Updated by Kamoltat (Junior) Sirivadhna about 3 years ago

Analysis of slowness in `task/progress`.

Good run (0:26:53):

/a/yuriw-2023-03-21_00:35:27-rados-main-distro-default-smithi/7214888/

Description: rados/mgr/{clusters/{2-node-mgr} debug/mgr mgr_ttl_cache/disable
mon_election/classic random-objectstore$/{bluestore-comp-zlib} supported-random-distro$/{centos_8} tasks/progress}

Bad run (8:33:56):

/a/yuriw-2023-03-18_00:57:11-rados-wip-yuri3-testing-2023-03-17-1235-quincy-distro-default-smithi/7213210/

Description: rados/mgr/{clusters/{2-node-mgr} debug/mgr mgr_ttl_cache/disable
mon_election/classic random-objectstore$/{bluestore-comp-lz4} supported-random-distro$/{centos_8} tasks/progress}

Analysis: https://docs.google.com/document/d/1KgMGNAK0kSWxyxC5axd2qTsLdZJJVgc-swSKgVdTvAU/edit#heading=h.umis5id5f357

Summary:

Comparing the logs between the two runs, it is safe to say that in the bad run everything in Ceph is simply slower, from starting up the OSDs to shutting down at the end of the run. task/progress contains 5 tests; in the bad run each test takes around 1 hour to finish, while in the good run each takes only 5 minutes.

According to the logs, in the bad run Ceph is roughly 7 times slower in starting all the OSDs and 12 times slower in performing the operations in each test. Logging by itself takes 6 times longer (6 seconds to log a config, while the good run takes < 1 second).

I suspect this is a network issue.
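A rough way to quantify this kind of slowdown is to take the timestamps of matching log lines (e.g. OSD startup begin/end) from the good and bad runs and compute the duration ratio. This is a sketch only; the timestamp format and the example durations below are assumptions for illustration, not values taken from the runs above.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> float:
    """Wall-clock time between two ISO-style log timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


def slowdown(good: float, bad: float) -> float:
    """How many times longer the bad run took for the same step."""
    return bad / good


# Hypothetical example: a step taking 5 min in the good run vs 35 min in the bad run.
good = elapsed_seconds("2023-03-21T01:00:00.000", "2023-03-21T01:05:00.000")
bad = elapsed_seconds("2023-03-18T01:00:00.000", "2023-03-18T01:35:00.000")
print(f"{slowdown(good, bad):.1f}x slower")  # prints "7.0x slower"
```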

Actions #5

Updated by Laura Flores about 3 years ago

  • Related to Bug #56393: failed to complete snap trimming before timeout added
Actions #6

Updated by Laura Flores almost 3 years ago

  • Related to Bug #59282: OSError: [Errno 107] Transport endpoint is not connected added
Actions #7

Updated by Laura Flores almost 3 years ago

  • Related to Bug #59285: mon/mon-last-epoch-clean.sh: TEST_mon_last_clean_epoch failure due to stuck pgs added
Actions #8

Updated by Laura Flores almost 3 years ago

  • Related to Bug #59286: mon/test_mon_osdmap_prune.sh: test times out after 5+ hours added
Actions #9

Updated by Zack Cerza over 2 years ago

  • Status changed from New to Can't reproduce

If this pops up again, we can reopen and take advantage of Junior's investigation.
