qa/suites/rados/thrash-erasure-code-big/thrashers: add `osd max backfills` setting to mapgap and pggrow by ljflores · Pull Request #46346 · ceph/ceph

ljflores · 2022-05-19T22:00:53Z

All rados/thrash-erasure-code-big tests that die due to the “wait_for_recovery” timeout have one thing in common: They contain either thrashers/pggrow or thrashers/mapgap.

What sets pggrow and mapgap apart from the non-offending thrashers (default, careful, fastread, and morepggrow) is that they lack an override for osd max backfills, the maximum number of backfill operations allowed to or from an OSD. The higher the number, the quicker the recovery. By default, this value is 1. Each of the non-offending thrashers overrides that default in its .yaml file with a value > 1; pggrow and mapgap do not.
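As a rough illustration of the kind of override being discussed, a thrasher .yaml could carry a fragment like the one below. This is a sketch, not the actual diff: the `overrides` → `ceph` → `conf` → `osd` nesting is assumed from the usual layout of QA suite files, and the value 2 is only a placeholder, since the description says no more than "a value > 1".

```yaml
# Hypothetical fragment for thrashers/pggrow.yaml or thrashers/mapgap.yaml.
# The nesting is an assumption based on the usual QA suite layout.
overrides:
  ceph:
    conf:
      osd:
        # Placeholder value: anything > 1 lifts the default limit of 1.
        osd max backfills: 2
```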

The mclock op scheduler is known to override osd max backfills with a high value, but all of the thrash-erasure-code-big thrashers have their op queue set to “debug_random”, which chooses randomly between op queues (this debug_random setting, in qa/config/rados.yaml, overrides the default mclock_scheduler). So whenever debug_random picks an op queue other than mclock, the low default osd max backfills value stays in effect, and that is causing some tests to time out in recovery.

WITHOUT osd max backfills, as they are now, “mapgap” and “pggrow” tests die due to timed-out recovery about 17/100 times, as seen here with a pggrow test: http://pulpito.front.sepia.ceph.com/lflores-2022-05-18_14:24:29-rados:thrash-erasure-code-big-master-distro-default-smithi/

WITH osd max backfills specified, as I have suggested in this PR, 99/100 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/

I also scheduled 145 tests WITH osd max backfills that are a mix of pggrow and mapgap thrashers. 144/145 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/

Fixes: https://tracker.ceph.com/issues/51076
Signed-off-by: Laura Flores <lflores@redhat.com>

jdurgin

Excellent analysis!

ljflores · 2022-05-19T23:30:34Z

Force-pushed to fix the tracker link in the commit.

sseshasa · 2022-05-23T06:44:48Z

Excellent analysis!

+1

sseshasa

LGTM.

ljflores · 2022-05-23T13:53:38Z

jenkins test make check

neha-ojha

🎉

ljflores · 2022-05-23T23:52:25Z

Validation for this PR, for documentation purposes. No dead jobs due to recovery timeout; failures unrelated:

(Tested on master with the fixed QA suite)
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/

ljflores requested review from neha-ojha and sseshasa May 19, 2022 22:00

github-actions bot added the core label May 19, 2022

jdurgin approved these changes May 19, 2022

ljflores force-pushed the wip-lflores-testing-recovery branch from 79a73e4 to 4006267 Compare May 19, 2022 23:29

sseshasa approved these changes May 23, 2022

neha-ojha approved these changes May 23, 2022

neha-ojha merged commit f0aeb2e into ceph:master May 23, 2022

ljflores deleted the wip-lflores-testing-recovery branch May 23, 2022 23:49
