mon/Monitor.cc: exit function if !osdmon()->is_writeable() #50857
jenkins test api
Test passed 120/120 with the fix. Without the fix, usually about 3 out of 100 runs die with the issue recreated.
jenkins test api
jenkins test make check arm64
gregsfortytwo
left a comment
Well I sure messed that up nicely. sigh
Reviewed-by: Greg Farnum <gfarnum@redhat.com>
gregsfortytwo
left a comment
So, while preparing for my Cephalocon talk, I discovered another missing return statement following an is_writeable/is_readable() check that isn't in this PR.
@kamoltat, I think we've found enough of these now to warrant a full audit — have you done one of those? If not, can you? I don't think it'll take that long but we definitely need to go over them all and squash this category of issue.
Problem:
In the function `maybe_go_degraded_stretch_mode()`, when `osdmon` is not writeable we shouldn't go into `trigger_degraded_stretch_mode` because we will crash at `ceph_assert(osdmon()->is_writeable())`. The current code does not exit `maybe_go_degraded_stretch_mode()` when we are waiting for `osdmon` to be writeable, therefore we crash.
Solution:
Exit the function by returning nothing after going into `wait_for_writeable_ctx`, since at that point we have queued the context and all we have to do is wait for the finish context to execute `maybe_go_degraded_stretch_mode` again. Also added a bit of logging so that the user is aware when `osdmon` and `monmon` are not writeable. We fix other parts of the monitor code that are missing the return after `wait_for_writeable_ctx` and `wait_for_readable_ctx` as well.
Fixes: https://tracker.ceph.com/issues/59271
Signed-off-by: Kamoltat <ksirivad@redhat.com>
Separate `mon-stretch` from `mon`. Renamed `mon-stretched-cluster.sh` to `mon-stretch-fail-recovery.sh`. This isolation of stretch cluster test will enable developers to get results faster for stretch-cluster related stuff. Signed-off-by: Kamoltat <ksirivad@redhat.com>
Hi @gregsfortytwo just did the audit like you asked for the Monitor part and yep just added all the missing
Tested the latest change with 120/120 passing jobs.
jenkins test windows
No related failures. Analysed both:
Known Failures:
- Bug #59196: ceph_test_lazy_omap_stats segfault while waiting for active+clean - RADOS - Ceph
- Bug #57386: cephadm/test_dashboard_e2e.sh: Expected to find content: '/^foo$/' within the selector: 'cd-modal .badge' but never did - Dashboard - Ceph
- Bug #59057: rados/test_envlibrados_for_rocksdb.sh: No rule to make target 'rocksdb_env_librados_test' on centos 8 - RADOS - Ceph
- Bug #58585: rook: failed to pull kubelet image - Orchestrator - Ceph
- Bug #58224: cephadm/test_repos.sh: urllib.error.HTTPError: HTTP Error 504: Gateway Timeout - Orchestrator - Ceph
- Packaging Issues - Command failed on smithi114 with status 1: 'sudo yum install -y kernel'
- Packaging Issues - Command failed on smithi184 with status 1: 'sudo yum -y install cephfs-java'
- Import Module Issues - No module named 'tasks '
Known Deadjobs:
- Bug #59380: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" - rgw - Ceph
- Error reimaging machines: reached maximum tries (60) after waiting for 900 seconds
- centos8 ansible error: Failed to download metadata for repo 'CentOS-PowerTools': Yum repo downloading error
Problem:
In the function `maybe_go_degraded_stretch_mode()`, when `osdmon` is not writeable we shouldn't go into `trigger_degraded_stretch_mode` because we will crash at `ceph_assert(osdmon()->is_writeable())`. The current code does not exit `maybe_go_degraded_stretch_mode()` when we are waiting for `osdmon` to be writeable, therefore we crash.
Solution:
Exit the function by returning nothing after going into `wait_for_writeable_ctx`, since at that point we have queued the context and all we have to do is wait for the finish context to execute `maybe_go_degraded_stretch_mode` again. Also added a bit of logging so that the user is aware when `osdmon` and `monmon` are not writeable. We fix other parts of the monitor code that are missing the return after `wait_for_writeable_ctx` and `wait_for_readable_ctx` as well.
Fixes: https://tracker.ceph.com/issues/59271
Signed-off-by: Kamoltat <ksirivad@redhat.com>
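The control-flow problem described above can be illustrated with a minimal, self-contained sketch. It is not the actual `Monitor.cc` code: `FakeService`, `FakeMonitor`, `make_writeable()`, `trigger()` and `maybe_go_degraded()` are hypothetical stand-ins, the callback type is simplified to `std::function`, and a plain `assert` plays the role of `ceph_assert`. Only the shape of the check (queue a retry, then return) is taken from the description.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a service such as osdmon; not the real Ceph class.
struct FakeService {
  bool writeable = false;
  std::vector<std::function<void()>> waiters;

  bool is_writeable() const { return writeable; }

  // Queue a callback to run once the service becomes writeable.
  void wait_for_writeable_ctx(std::function<void()> cb) {
    waiters.push_back(std::move(cb));
  }

  // Mark the service writeable and run every queued callback once.
  void make_writeable() {
    writeable = true;
    auto pending = std::move(waiters);
    waiters.clear();
    for (auto& cb : pending) cb();
  }
};

// Hypothetical stand-in for the monitor.
struct FakeMonitor {
  FakeService osdmon;
  int triggered = 0;

  // Plays the role of trigger_degraded_stretch_mode(), which asserts writeability.
  void trigger() {
    assert(osdmon.is_writeable());
    ++triggered;
  }

  // Plays the role of maybe_go_degraded_stretch_mode().
  void maybe_go_degraded() {
    if (!osdmon.is_writeable()) {
      // Queue a retry for when the service becomes writeable...
      osdmon.wait_for_writeable_ctx([this] { maybe_go_degraded(); });
      // ...and leave the function. Without this return, execution would
      // fall through to trigger() and hit the assertion.
      return;
    }
    trigger();
  }
};
```

In this sketch, calling `maybe_go_degraded()` on a fresh `FakeMonitor` only queues one waiter, and a later `make_writeable()` re-runs the function so that `trigger()` executes with the assertion satisfied.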