mon/Monitor.cc: exit function if !osdmon()->is_writeable() by kamoltat · Pull Request #50857 · ceph/ceph

kamoltat · 2023-04-04T03:52:24Z

Problem:

In the function maybe_go_degraded_stretch_mode()
when osdmon is not writeable we shouldn't go into
trigger_degraded_stretch_mode because we will
crash at ceph_assert(osdmon()->is_writeable()).
The current code does not exit maybe_go_degraded_stretch_mode()
when we are waiting for osdmon to be writeable, therefore,
we crash.

Solution:

Exit the function by returning nothing after going into
wait_for_writeable_ctx, since at that point we would have
queued the context and all we have to do is wait for finish
context to execute maybe_go_degraded_stretch_mode again.

Also, added a bit of logging so that user is aware
when osdmon and monmon are not writeable.

We fix other parts of the monitor code that are missing
the return after wait_for_writeable_ctx and wait_for_readable_ctx
as well.

Fixes: https://tracker.ceph.com/issues/59271

Signed-off-by: Kamoltat ksirivad@redhat.com

Contribution Guidelines

To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows

github-actions · 2023-04-04T03:52:46Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

kamoltat · 2023-04-07T00:25:21Z

jenkins test api

kamoltat · 2023-04-07T00:35:48Z

https://pulpito.ceph.com/ksirivad-2023-04-06_21:19:15-rados:standalone-wip-ksirivad-iswriteable-distro-default-smithi/

Test Passed 120/120 with fix.

Without fix usually 3/100 will die with the issue recreated.

kamoltat · 2023-04-07T13:58:28Z

jenkins test api

kamoltat · 2023-04-10T15:39:05Z

jenkins test make check arm64

gregsfortytwo

Well I sure messed that up nicely. sigh

Reviewed-by: Greg Farnum gfarnum@redhat.com

gregsfortytwo

So, while preparing for my Cephalocon talk, I discovered another missing return statement following an is_writeable/is_readable() check that isn't in this PR.

@kamoltat, I think we've found enough of these now to warrant a full audit — have you done one of those? If not, can you? I don't think it'll take that long but we definitely need to go over them all and squash this category of issue.

batrick

Good find!

Problem: In the function `maybe_go_degraded_stretch_mode()` when `osdmon` is not writeable we shouldn't go into `trigger_degraded_stretch_mode` because we will crash at `ceph_assert(osdmon()->is_writeable())`. The current code does not exit `maybe_go_degraded_stretch_mode()` when we are waiting for `osdmon` to be writeable, therefore, we crash. Solution: Exit the function by returning nothing after going into `wait_for_writeable_ctx`, since at that point we would have queued the context and all we have to do is wait for finish context to execute `maybe_go_degraded_stretch_mode` again. Also, added a bit of logging so that user is aware when `osdmon` and `monmon` are not writeable. We fix other parts of the monitor code that are missing the return after `wait_for_writeable_ctx` and `wait_for_readable_ctx` as well. Fixes: https://tracker.ceph.com/issues/59271 Signed-off-by: Kamoltat <ksirivad@redhat.com>

Separate `mon-stretch` from `mon`. Renamed `mon-stretched-cluster.sh` to `mon-stretch-fail-recovery.sh`. This isolation of stretch cluster test will enable developers to get results faster for stretch-cluster related stuff. Signed-off-by: Kamoltat <ksirivad@redhat.com>

kamoltat · 2023-04-17T16:10:03Z

Hi @gregsfortytwo just did the audit like you asked for the Monitor part and yep just added all the missing return statement to other places that needed them. Let me know if I'm missing anything!

kamoltat · 2023-04-19T03:17:18Z

Tested the latest change with 120/120 passing jobs.

https://pulpito.ceph.com/ksirivad-2023-04-18_20:07:58-rados:standalone-wip-ksirivad-iswriteable-distro-default-smithi/

gregsfortytwo

Excellent, thanks!

kamoltat · 2023-04-19T13:24:06Z

jenkins test windows

kamoltat · 2023-04-19T13:42:42Z

NOTE: since this PR fixes an issue that was discovered through a standalone test that was added to main: #47340

When back-porting it is best to combine this PR with #47340 into 1 backport.

kamoltat · 2023-05-09T01:03:18Z

No related failures.

Analysed both:

http://pulpito.front.sepia.ceph.com/lflores-2023-05-03_18:20:26-rados-wip-yuri-testing-2023-04-19-1343-distro-default-smithi/

Pulpito :: Results Dashboard

Known Failures:

Bug #59196: ceph_test_lazy_omap_stats segfault while waiting for active+clean - RADOS - Ceph ceph_test_lazy_omap_stats segfault while waiting for active+clean

Bug #57386: cephadm/test_dashboard_e2e.sh: Expected to find content: '/^foo$/' within the selector: 'cd-modal .badge' but never did - Dashboard - Ceph cephadm/test_dashboard_e2e.sh

Bug #59057: rados/test_envlibrados_for_rocksdb.sh: No rule to make target 'rocksdb_env_librados_test' on centos 8 - RADOS - Ceph rados/test_envlibrados_for_rocksdb.sh: No rule to make target 'rocksdb_env_librados_test' on centos 8

Bug #58585: rook: failed to pull kubelet image - Orchestrator - Ceph rook: failed to pull kubelet image

Bug #58224: cephadm/test_repos.sh: urllib.error.HTTPError: HTTP Error 504: Gateway Timeout - Orchestrator - Ceph cephadm/test_repos.sh: urllib.error.HTTPError: HTTP Error 504: Gateway Timeout

Packaging Issues - Command failed on smithi114 with status 1: 'sudo yum install -y kernel'

Packing Issues - Command failed on smithi184 with status 1: 'sudo yum -y install cephfs-java'

Import Module Issues - No module named 'tasks '

Known Deadjobs:

Bug #59380: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" - rgw - Ceph - rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)"

Error reimaging machines: reached maximum tries (60) after waiting for 900 seconds

centos8 ansible error: Failed to download metadata for repo 'CentOS-PowerTools': Yum repo downloading error

kamoltat requested a review from a team as a code owner April 4, 2023 03:52

github-actions bot added core mon needs-rebase labels Apr 4, 2023

kamoltat force-pushed the wip-ksirivad-iswriteable branch 2 times, most recently from dd834af to eee6cee Compare April 5, 2023 20:17

github-actions bot removed the needs-rebase label Apr 5, 2023

kamoltat force-pushed the wip-ksirivad-iswriteable branch 3 times, most recently from ea43296 to 5883989 Compare April 6, 2023 18:58

kamoltat self-assigned this Apr 6, 2023

kamoltat changed the title ~~[DNM] mon-stretch: recreating FAILED ceph_assert(osdmon()->is_writeable())~~ mon/Monitor.cc: exit function if !osdmon()->is_writeable() Apr 6, 2023

kamoltat added the needs-review label Apr 6, 2023

kamoltat requested a review from gregsfortytwo April 10, 2023 15:40

gregsfortytwo approved these changes Apr 11, 2023

View reviewed changes

kamoltat added needs-qa and removed needs-review labels Apr 11, 2023

gregsfortytwo requested changes Apr 16, 2023

View reviewed changes

batrick approved these changes Apr 16, 2023

View reviewed changes

kamoltat added 2 commits April 17, 2023 16:03

kamoltat force-pushed the wip-ksirivad-iswriteable branch from 5883989 to 431c455 Compare April 17, 2023 16:06

kamoltat added needs-quincy-backport backport required for quincy needs-pacific-backport PR needs a pacific backport labels Apr 17, 2023

kamoltat added the needs-reef-backport label Apr 17, 2023

gregsfortytwo approved these changes Apr 19, 2023

View reviewed changes

yuriw added the wip-yuri-testing label Apr 19, 2023

kamoltat merged commit 78a4330 into ceph:main May 9, 2023

Conversation

kamoltat commented Apr 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Contribution Guidelines

Checklist

Uh oh!

github-actions bot commented Apr 4, 2023

Uh oh!

kamoltat commented Apr 7, 2023

Uh oh!

kamoltat commented Apr 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kamoltat commented Apr 7, 2023

Uh oh!

kamoltat commented Apr 10, 2023

Uh oh!

gregsfortytwo left a comment

Choose a reason for hiding this comment

Uh oh!

gregsfortytwo left a comment

Choose a reason for hiding this comment

Uh oh!

batrick left a comment

Choose a reason for hiding this comment

Uh oh!

kamoltat commented Apr 17, 2023

Uh oh!

kamoltat commented Apr 19, 2023

Uh oh!

gregsfortytwo left a comment

Choose a reason for hiding this comment

Uh oh!

kamoltat commented Apr 19, 2023

Uh oh!

kamoltat commented Apr 19, 2023

Uh oh!

kamoltat commented May 9, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

kamoltat commented Apr 4, 2023 •

edited

Loading

kamoltat commented Apr 7, 2023 •

edited

Loading