
qa/tasks/ceph_manager.py: increase test_pool_min_size timeout #47138

Merged
yuriw merged 2 commits into ceph:main from
kamoltat:wip-ksirivad-fix-test-pool-min-size
Aug 3, 2022

Conversation

Member

@kamoltat kamoltat commented Jul 18, 2022

Problem

In tests that thrash OSDs on EC pools with
test_pool_min_size as one of the actions, some edge
cases make OSD recovery take much longer than normal.
One such case is killing and then reviving two OSDs
back to back when both belong to the acting set of the
same PG. The PG can then be temporarily down while it
waits for more OSDs in its acting set to come up. The
time we currently allow for recovery is too short,
which leads to failures such as https://tracker.ceph.com/issues/49777
and https://tracker.ceph.com/issues/51904. Moreover, in
some cases we also did not give the test enough time to
track the OSD recovery process properly, as we can see
from https://tracker.ceph.com/issues/54511.
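To make the edge case concrete, here is a small illustrative helper (not part of ceph_manager.py; the pg-stat dict shape loosely mirrors `ceph pg dump` output) that flags PGs whose acting set contains more than one of the OSDs just killed:

```python
def pgs_at_risk(pg_stats, killed_osds):
    """Return the pgids whose acting set contains more than one
    of the OSDs we just killed.

    An EC PG in this situation can go 'down' until enough of its
    acting-set members are revived, so its recovery takes much
    longer than the typical single-OSD case.
    """
    killed = set(killed_osds)
    return [pg['pgid'] for pg in pg_stats
            if len(killed & set(pg['acting'])) > 1]
```

Such a PG is exactly the one the longer timeouts below are meant to accommodate.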

Solution

  1. Provided buffer time before we check
    for recovery in ceph_manager.wait_for_recovery().
    This addresses https://tracker.ceph.com/issues/54511.

  2. Increased the timeout in ceph_manager.wait_for_clean().
    This addresses https://tracker.ceph.com/issues/51904.

  3. Increased the sleep time in ceph_manager.all_active_or_peered().
    This addresses https://tracker.ceph.com/issues/49777.

  4. Improved logging in ceph_manager.py for better debugging,
    which makes future investigation easier for developers.
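The first three changes follow one pattern: give the cluster a head start before the first check, then poll with a longer deadline and sleep. A minimal sketch of that pattern (the helper name `wait_until` and its parameters are illustrative, not the actual ceph_manager API):

```python
import time

def wait_until(predicate, timeout, delay=3.0, buffer=0.0):
    """Sleep `buffer` seconds up front, then poll `predicate` every
    `delay` seconds until it returns True or `timeout` elapses.

    The up-front buffer keeps a slow recovery (e.g. after two OSDs
    in the same acting set were killed) from being flagged as a
    failure before it has had a chance to start.
    """
    time.sleep(buffer)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(delay)
    return False
```

In this framing, change 1 raises `buffer`, change 2 raises `timeout`, and change 3 raises `delay`.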

Fixes:
https://tracker.ceph.com/issues/49777
https://tracker.ceph.com/issues/54511
https://tracker.ceph.com/issues/51904

Signed-off-by: Kamoltat ksirivad@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

1. When `test_pool_min_size` hits the case where
`not all PGs are active or peered`, we dump
the state of each PG that is not active
or peered.

2. Improved log messages in `inject_pause()`.

3. Added logs to `test_map_discontinuity()`.

4. Added more logs regarding `chance_down`
in `choose_action()`.

5. Added more logging to
`primary_affinity()`,
`thrash_pg_upmap_items()`,
`thrash_pg_upmap()`.

6. Made self.is_clean() dump the PGs that
are not active+clean.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
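Item 6 can be sketched as a standalone function (illustrative only; the real change lives inside `CephManager.is_clean()`, and the pg-stat dict shape here is an assumption based on `ceph pg dump` output):

```python
def dump_pgs_not_active_clean(log, pg_stats):
    """Log every PG whose state is not exactly 'active+clean' and
    return them, so a failed is_clean() check explains itself in
    the teuthology log instead of just timing out silently."""
    offenders = [pg for pg in pg_stats if pg['state'] != 'active+clean']
    for pg in offenders:
        log('PG %s is not active+clean: %s' % (pg['pgid'], pg['state']))
    return offenders
```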
@github-actions github-actions bot added the tests label Jul 18, 2022
@kamoltat kamoltat self-assigned this Jul 18, 2022
Member Author

kamoltat commented Jul 18, 2022

https://pulpito.ceph.com/ksirivad-2022-07-18_00:39:03-rados:thrash-erasure-code-main-distro-default-smithi/

300 Jobs scheduled with the commits from this PR.

Description of the runs:
rados:thrash-erasure-code/{ceph clusters/{fixed-2 openstack} fast/fast mon_election/classic msgr-failures/osd-delay objectstore/bluestore-bitmap rados recovery-overrides/{more-partial-recovery} supported-random-distro$/{centos_8} thrashers/minsize_recovery thrashosds-health workloads/ec-small-objects-balanced}

Result:
300/300 Passed!

In test_pool_min_size():

1. Provided buffer time before we check
for recovery in ceph_manager.wait_for_recovery()

2. Increased timeout in ceph_manager.wait_for_clean()

3. Increased sleep time for
ceph_manager.all_active_or_peered()

Fixes:
https://tracker.ceph.com/issues/49777
https://tracker.ceph.com/issues/54511
https://tracker.ceph.com/issues/51904

Signed-off-by: Kamoltat <ksirivad@redhat.com>
@kamoltat kamoltat force-pushed the wip-ksirivad-fix-test-pool-min-size branch from 8761dfa to ed73288 on July 18, 2022 12:30
@kamoltat kamoltat requested a review from a team July 18, 2022 13:35
@kamoltat
Member Author

jenkins test windows

@kamoltat kamoltat added the core label Jul 18, 2022
@kamoltat
Member Author

jenkins test api

Member

@ljflores ljflores left a comment


Looks really good! I just had a suggestion for the logging. As for testing, I feel like the 300 tests you have already run are sufficient.

Comment on lines +2732 to +2737

```python
def dump_pgs_not_active_peered(self, pgs):
    for pg in pgs:
        if (not pg['state'].count('active')) and (not pg['state'].count('peered')):
            self.log('PG %s is not active or peered' % pg['pgid'])
            self.log(pg)
```

Member


Can we put all the "not active or peered" pgs into a list and dump them at the end instead of logging them one at a time?

Member Author


@ljflores
For me, it is easier to look at each PG's stats individually rather than in one big list. Also, I was following the convention already used elsewhere in the file: https://github.com/ceph/ceph/blob/main/qa/tasks/ceph_manager.py#L2689-L2717

Let me know what you think!
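For reference, the batched variant the reviewer suggested could look roughly like this (illustrative only; the merged change kept the per-PG logging):

```python
def dump_pgs_not_active_peered_batched(log, pg_stats):
    """Collect the PGs that are neither active nor peered, then emit
    them as one log record instead of one line per PG."""
    offenders = [pg for pg in pg_stats
                 if 'active' not in pg['state'] and 'peered' not in pg['state']]
    if offenders:
        log('PGs not active or peered: %s'
            % ', '.join(pg['pgid'] for pg in offenders))
    return offenders
```

The trade-off is exactly the one discussed above: one compact line versus a full per-PG stat dump that is easier to read for any single PG.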

@neha-ojha neha-ojha requested a review from NitzanMordhai July 19, 2022 01:01
@kamoltat
Member Author

@ljflores @NitzanMordhai PTAL

Member

@neha-ojha neha-ojha left a comment


Let's verify from the teuthology logs that test_pool_min_size is performing the sequence of steps correctly.

Member Author

kamoltat commented Aug 3, 2022

@yuriweinstein Rados approved

Related Failures:

/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6943775

due to ceph/ceph: Pull Request 46036

Unrelated Failures already with Tracker:
Bug #52124: Invalid read of size 8 in handle_recovery_delete() - RADOS - Ceph
Bug #52321: qa/tasks/rook times out: 'check osd count' reached maximum tries (90) after waiting for 900 seconds - Orchestrator - Ceph
Bug #56652: cephadm/test_repos.sh: urllib.error.HTTPError: HTTP Error 504: Gateway Timeout - Orchestrator - Ceph
Bug #55001: rados/test.sh: Early exit right after LibRados global tests complete - RADOS - Ceph
Bug #53768: timed out waiting for admin_socket to appear after osd.2 restart in thrasher/defaults workload/small-objects - crimson - Ceph
Bug #48819: fsck error: found stray (per-pg) omap data on omap_head - bluestore - Ceph
Bug #55853: test_cls_rgw.sh: failures in 'cls_rgw.index_list' and 'cls_rgw.index_list_delimited` - rgw - Ceph
Bug #52652: ERROR: test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest) - mgr - Ceph
Bug #55142: [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.gibba002.nzpbzu: disk I/O error - cephsqlite - Ceph
Bug #44587: failed to write to cgroup.procs: - Orchestrator - Ceph
Bug #56716: RuntimeError: ceph version 16.2.10-515.g3bc1d6b2 was not installed, found 16.2.10-513.gc2f1041a.el8. - teuthology - Ceph

Unrelated Failures without Tracker:
Bug #57015: bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file) - RADOS - Ceph
Infrastructure: Bug #56717: The package cache file is corrupted - teuthology - Ceph
