mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode() by kamoltat · Pull Request #47340 · ceph/ceph

kamoltat · 2022-07-28T19:07:35Z

Problem:
In certain degraded stretch-cluster
scenarios, the monitor tries to enter
Monitor::go_recovery_stretch_mode(),
which leads to a ceph_assert failure.

Solution:
Make sure dead_mon_buckets.size() == 0
in OSDMonitor::update_from_paxos()
before calling Monitor::go_recovery_stretch_mode().

Fixes:
https://tracker.ceph.com/issues/57017

TODO: Need to separate the log commits and drop them before merging.

Signed-off-by: Kamoltat <ksirivad@redhat.com>

kamoltat · 2022-08-04T13:32:42Z

jenkins test make check

kamoltat · 2022-08-04T13:32:52Z

jenkins test windows

Added bug reproducer for https://bugzilla.redhat.com/show_bug.cgi?id=2104207
Added more logs in MON.
Signed-off-by: Kamoltat <ksirivad@redhat.com>

Problem: In certain degraded stretch-cluster scenarios, the monitor tries to enter ``Monitor::go_recovery_stretch_mode()``, which leads to a `ceph_assert` failure.
Solution: Make sure ``dead_mon_buckets.size() == 0`` in ``OSDMonitor::update_from_paxos()`` before calling ``Monitor::go_recovery_stretch_mode()``.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2104207
Signed-off-by: Kamoltat <ksirivad@redhat.com>

rzarzynski · 2022-08-17T18:43:34Z

Do we still need the WIP? There was a push since then.

kamoltat · 2022-08-26T18:27:08Z

removed

kamoltat · 2022-08-26T18:58:58Z

jenkins test api

kamoltat · 2022-08-26T18:59:13Z

jenkins test make check arm64

kamoltat · 2022-08-26T18:59:20Z

jenkins test windows

kamoltat · 2022-08-26T18:59:55Z

jenkins test make check arm64

kamoltat · 2022-09-27T13:27:52Z

jenkins test make check arm64

kamoltat · 2022-09-27T13:28:20Z

jenkins test windows

kamoltat · 2022-09-27T13:28:34Z

jenkins test api

gregsfortytwo

I did not dig into the test case, but the assert patch looks good and the debugging output is fine.

gregsfortytwo · 2022-09-27T14:16:03Z

src/mon/OSDMonitor.cc

 	    (osdmap.num_up_osd / (double)osdmap.num_osd) >
-	    cct->_conf.get_val<double>("mon_stretch_cluster_recovery_ratio")) {
+	    cct->_conf.get_val<double>("mon_stretch_cluster_recovery_ratio") &&
+      mon.dead_mon_buckets.size() == 0) {


Hmm. This definitely works for our current 2-site setup, but we'll need to adjust it if we start supporting 3-site (and other count) stretch clusters with the explicit stretch mechanisms. That was something I was considering (and trying to keep easy) when writing the other code.
So I'd rather do something that won't require changing when we hit that point, but I don't have a good simple solution, so looks good.

Makes sense. Thank you.
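
For readers following the review thread, the guard added in the hunk above can be sketched in isolation. This is only an illustration, not the Ceph code: `StretchState` and `should_go_recovery` are hypothetical names, the shape of `dead_mon_buckets` (bucket name mapped to a set of monitor names) is an assumption, and only the two clauses visible in the diff are modeled. The real condition in `OSDMonitor.cc` has additional clauses that the hunk does not show.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the state the check reads; not a Ceph type.
struct StretchState {
  int num_up_osd = 0;
  int num_osd = 0;
  // Assumed shape: bucket name -> monitors in that bucket considered dead.
  std::map<std::string, std::set<std::string>> dead_mon_buckets;
};

// Returns true only when the up-OSD ratio exceeds the recovery ratio AND
// no monitor bucket is recorded as dead (the clause this PR adds).
bool should_go_recovery(const StretchState& s, double recovery_ratio) {
  if (s.num_osd == 0) {
    return false;  // guard added for this sketch only: avoid dividing by zero
  }
  const double up_ratio = s.num_up_osd / static_cast<double>(s.num_osd);
  return up_ratio > recovery_ratio && s.dead_mon_buckets.empty();
}
```

With 9 of 10 OSDs up and a ratio of 0.6, the sketch returns true while `dead_mon_buckets` is empty and false as soon as any bucket is recorded as dead. As the review comment notes, requiring the map to be empty suits the current 2-site setup and would need revisiting for other site counts.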

kamoltat · 2022-09-27T14:39:20Z

jenkins test make check arm64

kamoltat · 2022-09-28T14:36:06Z

jenkins test make check arm64

kamoltat · 2022-09-28T14:44:24Z

jenkins test make check arm64

kamoltat · 2022-10-03T19:43:16Z

Rados Approved!

The rerun produced the same failures and dead jobs, except for the failures caused by ceph/ceph: Pull Request 40066, which are gone as expected after dropping that PR.

wip-yuri2-testing-2022-09-27-1455
1 related failure:

7046379 - caused by ceph/ceph: Pull Request 40066

11 unrelated failures + dead jobs
known trackers include:

Bug #57311: rook: ensure CRDs are installed first - Orchestrator - Ceph
Bug #49287: podman: setting cgroup config for procHooks process caused: Unit libpod-$hash.scope not found - Orchestrator - Ceph
Bug #57386: cephadm/test_dashboard_e2e.sh: Expected to find content: '/^foo$/' within the selector: 'cd-modal .badge' but never did - Dashboard - Ceph
Bug #52321: qa/tasks/rook times out: 'check osd count' reached maximum tries (90) after waiting for 900 seconds - Orchestrator - Ceph
Bug #53768: timed out waiting for admin_socket to appear after osd.2 restart in thrasher/defaults workload/small-objects - Ceph - Ceph
Bug #50222: osd: 5.2s0 deep-scrub : stat mismatch - RADOS - Ceph
Bug #57731: Problem: package container-selinux conflicts with udica < 0.2.6-1 provided by udica-0.2.4-1 - Infrastructure - Ceph
Bug #54029: orch:cephadm/workunits/{agent/on mon_election/connectivity task/test_orch_cli} test failing - Orchestrator - Ceph
Bug #53886: ansible: Failed to update apt cache - Infrastructure - Ceph
Bug #49888: rados/singleton: radosbench.py: teuthology.exceptions.MaxWhileTries: reached maximum tries (3650) after waiting for 21900 seconds - RADOS - Ceph

newly opened tracker:
Bug #57731: Problem: package container-selinux conflicts with udica < 0.2.6-1 provided by udica-0.2.4-1 - Infrastructure - Ceph (infrastructure related)
Bug #57736: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms': Cannot download repomd.xml - Infrastructure - Ceph (infrastructure related)

kamoltat · 2022-12-09T22:56:04Z

This PR introduced https://tracker.ceph.com/issues/58239. We are in the process of fixing the issue.

kamoltat requested a review from a team as a code owner July 28, 2022 19:07

kamoltat self-assigned this Jul 28, 2022

github-actions bot added the core label Jul 28, 2022

kamoltat force-pushed the wip-ksirivad-recreate-zilla-2104207 branch 3 times, most recently from 68d0c0f to 99ff33e Compare August 3, 2022 14:31

github-actions bot added the mon label Aug 3, 2022

kamoltat changed the title ~~[DNM] qa/standalone/mon: init mon-stretched-cluster.sh~~ [WIP] mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode() Aug 3, 2022

kamoltat force-pushed the wip-ksirivad-recreate-zilla-2104207 branch from 99ff33e to 32fade0 Compare August 9, 2022 18:25

kamoltat added 2 commits August 9, 2022 18:27

qa/standalone/mon: init mon-stretched-cluster.sh

62fe3cb

Added bug reproducer for https://bugzilla.redhat.com/show_bug.cgi?id=2104207 Added more logs in MON. Signed-off-by: Kamoltat <ksirivad@redhat.com>

kamoltat force-pushed the wip-ksirivad-recreate-zilla-2104207 branch from 32fade0 to d95c41a Compare August 9, 2022 18:27

kamoltat changed the title ~~[WIP] mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode()~~ mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode() Aug 26, 2022

kamoltat requested a review from gregsfortytwo September 27, 2022 13:27

gregsfortytwo approved these changes Sep 27, 2022

kamoltat added the needs-qa label Sep 27, 2022

ljflores added the wip-yuri2-testing label Sep 27, 2022

kamoltat mentioned this pull request Sep 28, 2022

osd/OSDMap: Check for uneven weights & != 2 buckets post stretch mode #48209

Merged

14 tasks

yuriw merged commit 0d5e2e5 into ceph:main Oct 3, 2022

kamoltat added the needs-quincy-backport (backport required for quincy) and needs-pacific-backport (PR needs a pacific backport) labels Nov 8, 2022

This was referenced Nov 8, 2022

[DNM]quincy:mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode() #48802

Closed

pacific:mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode() #48803

Merged

kamoltat mentioned this pull request Apr 19, 2023

mon/Monitor.cc: exit function if !osdmon()->is_writeable() #50857

Merged

14 tasks
