
qa/tasks/ceph_manager.py: increase test_pool_min_size timeout #47138

Merged
yuriw merged 2 commits into ceph:main from
kamoltat:wip-ksirivad-fix-test-pool-min-size
Aug 3, 2022

Conversation

Member

@kamoltat kamoltat commented Jul 18, 2022

Problem

In tests that thrash OSDs on EC pools with
test_pool_min_size as one of the actions, some edge
cases make OSD recovery take much longer than normal.
One such case is killing and then reviving two OSDs
back to back when both belong to the acting set of the
same PG. The PG can then be temporarily down while it
waits for more OSDs in its acting set to come up. The
time we currently allow for recovery is too short,
which leads to failures such as https://tracker.ceph.com/issues/49777
and https://tracker.ceph.com/issues/51904. Moreover, in
some cases we also did not give the test enough time to
track the OSD recovery process properly, as we can see
from https://tracker.ceph.com/issues/54511.
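To make the edge case concrete, here is a small illustrative helper (not part of ceph_manager.py; the pg-stat dict shape loosely mirrors `ceph pg dump` output) that flags PGs whose acting set contains more than one of the OSDs just killed:

```python
def pgs_at_risk(pg_stats, killed_osds):
    """Return the pgids whose acting set contains more than one
    of the OSDs we just killed.

    An EC PG in this situation can go 'down' until enough of its
    acting-set members are revived, so its recovery takes much
    longer than the typical single-OSD case.
    """
    killed = set(killed_osds)
    return [pg['pgid'] for pg in pg_stats
            if len(killed & set(pg['acting'])) > 1]
```

Such a PG is exactly the one the longer timeouts below are meant to accommodate.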

Solution

  1. Provided buffer time before we check
    for recovery in ceph_manager.wait_for_recovery().
    This addresses https://tracker.ceph.com/issues/54511.

  2. Increased the timeout in ceph_manager.wait_for_clean().
    This addresses https://tracker.ceph.com/issues/51904.

  3. Increased the sleep time in ceph_manager.all_active_or_peered().
    This addresses https://tracker.ceph.com/issues/49777.

  4. Improved logging in ceph_manager.py for better debugging,
    which makes future investigation easier for developers.
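The first three changes follow one pattern: give the cluster a head start before the first check, then poll with a longer deadline and sleep. A minimal sketch of that pattern (the helper name `wait_until` and its parameters are illustrative, not the actual ceph_manager API):

```python
import time

def wait_until(predicate, timeout, delay=3.0, buffer=0.0):
    """Sleep `buffer` seconds up front, then poll `predicate` every
    `delay` seconds until it returns True or `timeout` elapses.

    The up-front buffer keeps a slow recovery (e.g. after two OSDs
    in the same acting set were killed) from being flagged as a
    failure before it has had a chance to start.
    """
    time.sleep(buffer)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(delay)
    return False
```

In this framing, change 1 raises `buffer`, change 2 raises `timeout`, and change 3 raises `delay`.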

Fixes:
https://tracker.ceph.com/issues/49777
https://tracker.ceph.com/issues/54511
https://tracker.ceph.com/issues/51904

Signed-off-by: Kamoltat ksirivad@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

1. When `test_pool_min_size` hits the case where
`not all PGs are active or peered`, we dump
the state of each PG that is not active
or peered.

2. Improved log messages in `inject_pause()`.

3. Added logs to `test_map_discontinuity()`.

4. Added more logs regarding `chance_down`
in `choose_action()`.

5. Added more logging to
`primary_affinity()`,
`thrash_pg_upmap_items()`,
`thrash_pg_upmap()`.

6. Made self.is_clean() dump the PGs that
are not active+clean.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
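Item 6 can be sketched as a standalone function (illustrative only; the real change lives inside `CephManager.is_clean()`, and the pg-stat dict shape here is an assumption based on `ceph pg dump` output):

```python
def dump_pgs_not_active_clean(log, pg_stats):
    """Log every PG whose state is not exactly 'active+clean' and
    return them, so a failed is_clean() check explains itself in
    the teuthology log instead of just timing out silently."""
    offenders = [pg for pg in pg_stats if pg['state'] != 'active+clean']
    for pg in offenders:
        log('PG %s is not active+clean: %s' % (pg['pgid'], pg['state']))
    return offenders
```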
@github-actions github-actions bot added the tests label Jul 18, 2022
@kamoltat kamoltat self-assigned this Jul 18, 2022
Member Author

kamoltat commented Jul 18, 2022

https://pulpito.ceph.com/ksirivad-2022-07-18_00:39:03-rados:thrash-erasure-code-main-distro-default-smithi/

300 Jobs scheduled with the commits from this PR.

Description of the runs:
rados:thrash-erasure-code/{ceph clusters/{fixed-2 openstack} fast/fast mon_election/classic msgr-failures/osd-delay objectstore/bluestore-bitmap rados recovery-overrides/{more-partial-recovery} supported-random-distro$/{centos_8} thrashers/minsize_recovery thrashosds-health workloads/ec-small-objects-balanced}

Result:
300/300 Passed!

In test_pool_min_size():

1. Provided buffer time before we check
for recovery in ceph_manager.wait_for_recovery()

2. Increased timeout in ceph_manager.wait_for_clean()

3. Increased sleep time for
ceph_manager.all_active_or_peered()

Fixes:
https://tracker.ceph.com/issues/49777
https://tracker.ceph.com/issues/54511
https://tracker.ceph.com/issues/51904

Signed-off-by: Kamoltat <ksirivad@redhat.com>
@kamoltat kamoltat force-pushed the wip-ksirivad-fix-test-pool-min-size branch from 8761dfa to ed73288 on July 18, 2022 12:30
@kamoltat kamoltat requested a review from a team July 18, 2022 13:35
@kamoltat
Member Author

jenkins test windows

@kamoltat kamoltat added the core label Jul 18, 2022
@kamoltat
Member Author

jenkins test api

Member

@ljflores ljflores left a comment


Looks really good! I just had a suggestion for the logging. As for testing, I feel like the 300 tests you have already run are sufficient.

Comment on lines +2732 to +2737

```python
def dump_pgs_not_active_peered(self, pgs):
    for pg in pgs:
        if (not pg['state'].count('active')) and (not pg['state'].count('peered')):
            self.log('PG %s is not active or peered' % pg['pgid'])
            self.log(pg)
```

Member


Can we put all the "not active or peered" pgs into a list and dump them at the end instead of logging them one at a time?

Member Author


@ljflores
For me, it is easier to look at each PG's stats individually rather than in one big list. Also, I was following the convention already used elsewhere in the file: https://github.com/ceph/ceph/blob/main/qa/tasks/ceph_manager.py#L2689-L2717

Let me know what you think!
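For reference, the batched variant the reviewer suggested could look roughly like this (illustrative only; the merged change kept the per-PG logging):

```python
def dump_pgs_not_active_peered_batched(log, pg_stats):
    """Collect the PGs that are neither active nor peered, then emit
    them as one log record instead of one line per PG."""
    offenders = [pg for pg in pg_stats
                 if 'active' not in pg['state'] and 'peered' not in pg['state']]
    if offenders:
        log('PGs not active or peered: %s'
            % ', '.join(pg['pgid'] for pg in offenders))
    return offenders
```

The trade-off is exactly the one discussed above: one compact line versus a full per-PG stat dump that is easier to read for any single PG.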

@neha-ojha neha-ojha requested a review from NitzanMordhai July 19, 2022 01:01
@kamoltat
Member Author

@ljflores @NitzanMordhai PTAL

Member

@neha-ojha neha-ojha left a comment


Let's verify from the teuthology logs that test_pool_min_size is performing the sequence of steps correctly.

Member Author

kamoltat commented Aug 3, 2022

@yuriweinstein Rados approved

Related Failures:

/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6943775

due to ceph/ceph: Pull Request 46036

Unrelated Failures already with Tracker:
Bug #52124: Invalid read of size 8 in handle_recovery_delete() - RADOS - Ceph
Bug #52321: qa/tasks/rook times out: 'check osd count' reached maximum tries (90) after waiting for 900 seconds - Orchestrator - Ceph
Bug #56652: cephadm/test_repos.sh: urllib.error.HTTPError: HTTP Error 504: Gateway Timeout - Orchestrator - Ceph
Bug #55001: rados/test.sh: Early exit right after LibRados global tests complete - RADOS - Ceph
Bug #53768: timed out waiting for admin_socket to appear after osd.2 restart in thrasher/defaults workload/small-objects - crimson - Ceph
Bug #48819: fsck error: found stray (per-pg) omap data on omap_head - bluestore - Ceph
Bug #55853: test_cls_rgw.sh: failures in 'cls_rgw.index_list' and 'cls_rgw.index_list_delimited` - rgw - Ceph
Bug #52652: ERROR: test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest) - mgr - Ceph
Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error - cephsqlite - Ceph
Bug #44587: failed to write to cgroup.procs: - Orchestrator - Ceph
Bug #56716: RuntimeError: ceph version 16.2.10-515.g3bc1d6b2 was not installed, found 16.2.10-513.gc2f1041a.el8. - teuthology - Ceph

Unrelated Failures without Tracker:
Bug #57015: bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file) - RADOS - Ceph
Infrastructure: Bug #56717: The package cache file is corrupted - teuthology - Ceph
