
mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode() #47340

Merged
yuriw merged 2 commits into ceph:main from kamoltat:wip-ksirivad-recreate-zilla-2104207 on Oct 3, 2022

Conversation

@kamoltat
Member

@kamoltat kamoltat commented Jul 28, 2022

Problem:
In certain scenarios in a degraded
stretched cluster, the monitor will try to
enter Monitor::go_recovery_stretch_mode(),
which leads to a ceph_assert failure.

Solution:
Make sure dead_mon_buckets.size() == 0
in OSDMonitor::update_from_paxos()
before going into Monitor::go_recovery_stretch_mode().
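
The guard described above can be sketched in isolation. This is a minimal illustration, not the Ceph implementation: `MonState`, `OsdMapSummary`, `should_start_recovery()` and `maybe_start_recovery()` are hypothetical stand-ins, the container type of `dead_mon_buckets` is assumed, and the stand-in `go_recovery_stretch_mode()` simply assumes the assert fires when dead monitor buckets remain.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the monitor state involved; not Ceph's real type.
struct MonState {
  // Assumed shape: bucket (site) name -> monitors in that bucket that are down.
  std::map<std::string, std::set<std::string>> dead_mon_buckets;
  bool recovering = false;
};

// Hypothetical stand-in for the two OSDMap counters used by the check.
struct OsdMapSummary {
  int num_up_osd = 0;
  int num_osd = 0;
};

// Stand-in for Monitor::go_recovery_stretch_mode(). Assumption: recovery must
// not start while any monitor bucket is still dead, so that case asserts.
inline void go_recovery_stretch_mode(MonState& mon) {
  assert(mon.dead_mon_buckets.empty());
  mon.recovering = true;
}

// Guard modelled on the patched condition: enough OSDs are up AND no
// monitor bucket is dead.
inline bool should_start_recovery(const MonState& mon,
                                  const OsdMapSummary& osdmap,
                                  double recovery_ratio) {
  if (osdmap.num_osd == 0) {
    return false;  // sketch-only safeguard against dividing by zero
  }
  const double up_ratio =
      osdmap.num_up_osd / static_cast<double>(osdmap.num_osd);
  return up_ratio > recovery_ratio && mon.dead_mon_buckets.empty();
}

// Caller-side use: only enter recovery when the guard holds.
inline void maybe_start_recovery(MonState& mon,
                                 const OsdMapSummary& osdmap,
                                 double recovery_ratio) {
  if (should_start_recovery(mon, osdmap, recovery_ratio)) {
    go_recovery_stretch_mode(mon);
  }
}
```

With the extra check, a state where 9 of 10 OSDs are up but one monitor bucket is still dead no longer reaches the asserting function; once the bucket is cleared, the same OSD ratio allows recovery to start.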

Fixes:
https://tracker.ceph.com/issues/57017

TODO: Need to separate the log commits and drop them before merging.

Signed-off-by: Kamoltat <ksirivad@redhat.com>

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@kamoltat kamoltat requested a review from a team as a code owner July 28, 2022 19:07
@kamoltat kamoltat self-assigned this Jul 28, 2022
@github-actions github-actions bot added the core label Jul 28, 2022
@kamoltat kamoltat force-pushed the wip-ksirivad-recreate-zilla-2104207 branch 3 times, most recently from 68d0c0f to 99ff33e Compare August 3, 2022 14:31
@github-actions github-actions bot added the mon label Aug 3, 2022
@kamoltat kamoltat changed the title [DNM] qa/standalone/mon: init mon-stretched-cluster.sh [WIP] mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode() Aug 3, 2022
@kamoltat
Member Author

kamoltat commented Aug 4, 2022

jenkins test make check

@kamoltat
Member Author

kamoltat commented Aug 4, 2022

jenkins test windows

@kamoltat kamoltat force-pushed the wip-ksirivad-recreate-zilla-2104207 branch from 99ff33e to 32fade0 Compare August 9, 2022 18:25
Added bug reproducer for
https://bugzilla.redhat.com/show_bug.cgi?id=2104207

Added more logs in MON.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
Problem:
In certain scenarios in a degraded
stretched cluster, the monitor will try to
enter ``Monitor::go_recovery_stretch_mode()``,
which leads to a `ceph_assert` failure.

Solution:
Make sure ``dead_mon_buckets.size() == 0``
in ``OSDMonitor::update_from_paxos()``
before going into ``Monitor::go_recovery_stretch_mode()``.

Fixes:
https://bugzilla.redhat.com/show_bug.cgi?id=2104207

Signed-off-by: Kamoltat <ksirivad@redhat.com>
@kamoltat kamoltat force-pushed the wip-ksirivad-recreate-zilla-2104207 branch from 32fade0 to d95c41a Compare August 9, 2022 18:27
@rzarzynski
Contributor

Do we still need the WIP? There was a push since then.

@kamoltat
Member Author

removed

@kamoltat kamoltat changed the title [WIP] mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode() mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode() Aug 26, 2022
@kamoltat
Member Author

jenkins test api

@kamoltat
Member Author

jenkins test make check arm64

@kamoltat
Member Author

jenkins test windows

@kamoltat
Member Author

jenkins test make check arm64

@kamoltat
Member Author

jenkins test make check arm64

@kamoltat
Member Author

jenkins test windows

@kamoltat
Member Author

jenkins test api

Member

@gregsfortytwo gregsfortytwo left a comment

I did not dig into the test case, but the assert patch looks good and the debugging output is fine.

  (osdmap.num_up_osd / (double)osdmap.num_osd) >
- cct->_conf.get_val<double>("mon_stretch_cluster_recovery_ratio")) {
+ cct->_conf.get_val<double>("mon_stretch_cluster_recovery_ratio") &&
+ mon.dead_mon_buckets.size() == 0) {
Member

Hmm. This definitely works for our current 2-site setup, but we'll need to adjust it if we start supporting 3-site (and other count) stretch clusters with the explicit stretch mechanisms. That was something I was considering (and trying to keep easy) when writing the other code.
So I'd rather do something that won't require changing when we hit that point, but I don't have a good simple solution, so looks good.

Member Author

Makes sense. Thank you.

@kamoltat
Member Author

jenkins test make check arm64

@kamoltat
Member Author

jenkins test make check arm64

@kamoltat
Member Author

jenkins test make check arm64

@kamoltat
Member Author

kamoltat commented Oct 3, 2022

Rados Approved!

The rerun produced the same failures and dead jobs, except for the failure caused by ceph/ceph Pull Request 40066, which is gone as expected after dropping that PR.

wip-yuri2-testing-2022-09-27-1455
1 related failure:

7046379 - caused by ceph/ceph: Pull Request 40066

11 unrelated failures + dead jobs
known trackers include:

Bug #57311: rook: ensure CRDs are installed first - Orchestrator - Ceph
Bug #49287: podman: setting cgroup config for procHooks process caused: Unit libpod-$hash.scope not found - Orchestrator - Ceph
Bug #57386: cephadm/test_dashboard_e2e.sh: Expected to find content: '/^foo$/' within the selector: 'cd-modal .badge' but never did - Dashboard - Ceph
Bug #52321: qa/tasks/rook times out: 'check osd count' reached maximum tries (90) after waiting for 900 seconds - Orchestrator - Ceph
Bug #53768: timed out waiting for admin_socket to appear after osd.2 restart in thrasher/defaults workload/small-objects - Ceph - Ceph
Bug #50222: osd: 5.2s0 deep-scrub : stat mismatch - RADOS - Ceph
Bug #57731: Problem: package container-selinux conflicts with udica < 0.2.6-1 provided by udica-0.2.4-1 - Infrastructure - Ceph
Bug #54029: orch:cephadm/workunits/{agent/on mon_election/connectivity task/test_orch_cli} test failing - Orchestrator - Ceph
Bug #53886: ansible: Failed to update apt cache - Infrastructure - Ceph
Bug #49888: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds - RADOS - Ceph

newly opened tracker:
Bug #57731: Problem: package container-selinux conflicts with udica < 0.2.6-1 provided by udica-0.2.4-1 - Infrastructure - Ceph (infrastructure related)
Bug #57736: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms': Cannot download repomd.xml - Infrastructure - Ceph (infrastructure related)

@kamoltat
Member Author

kamoltat commented Dec 9, 2022

This PR introduced https://tracker.ceph.com/issues/58239; we are in the process of fixing the issue.

5 participants