Skip to content

mon/Monitor.cc: exit function if !osdmon()->is_writeable() #50857

Merged
kamoltat merged 2 commits intoceph:mainfrom
kamoltat:wip-ksirivad-iswriteable
May 9, 2023
Merged

mon/Monitor.cc: exit function if !osdmon()->is_writeable() #50857
kamoltat merged 2 commits intoceph:mainfrom
kamoltat:wip-ksirivad-iswriteable

Conversation

@kamoltat
Copy link
Member

@kamoltat kamoltat commented Apr 4, 2023

Problem:

In the function maybe_go_degraded_stretch_mode()
when osdmon is not writeable we shouldn't go into
trigger_degraded_stretch_mode because we will
crash at ceph_assert(osdmon()->is_writeable()).
The current code does not exit maybe_go_degraded_stretch_mode()
when we are waiting for osdmon to be writeable, therefore,
we crash.

Solution:

Exit the function by returning nothing after going into
wait_for_writeable_ctx, since at that point we would have
queued the context and all we have to do is wait for finish
context to execute maybe_go_degraded_stretch_mode again.

Also, added a bit of logging so that user is aware
when osdmon and monmon are not writeable.

We fix other parts of the monitor code that are missing
the return after wait_for_writeable_ctx and wait_for_readable_ctx
as well.

Fixes: https://tracker.ceph.com/issues/59271

Signed-off-by: Kamoltat ksirivad@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@github-actions
Copy link

github-actions bot commented Apr 4, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@kamoltat kamoltat force-pushed the wip-ksirivad-iswriteable branch 2 times, most recently from dd834af to eee6cee Compare April 5, 2023 20:17
@kamoltat kamoltat force-pushed the wip-ksirivad-iswriteable branch 3 times, most recently from ea43296 to 5883989 Compare April 6, 2023 18:58
@kamoltat kamoltat self-assigned this Apr 6, 2023
@kamoltat kamoltat changed the title [DNM] mon-stretch: recreating FAILED ceph_assert(osdmon()->is_writeable()) mon/Monitor.cc: exit function if !osdmon()->is_writeable() Apr 6, 2023
@kamoltat
Copy link
Member Author

kamoltat commented Apr 7, 2023

jenkins test api

@kamoltat
Copy link
Member Author

kamoltat commented Apr 7, 2023

https://pulpito.ceph.com/ksirivad-2023-04-06_21:19:15-rados:standalone-wip-ksirivad-iswriteable-distro-default-smithi/

Test Passed 120/120 with fix.

Without fix usually 3/100 will die with the issue recreated.

@kamoltat
Copy link
Member Author

kamoltat commented Apr 7, 2023

jenkins test api

@kamoltat
Copy link
Member Author

jenkins test make check arm64

@kamoltat kamoltat requested a review from gregsfortytwo April 10, 2023 15:40
Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I sure messed that up nicely. sigh

Reviewed-by: Greg Farnum gfarnum@redhat.com

Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, while preparing for my Cephalocon talk, I discovered another missing return statement following an is_writeable/is_readable() check that isn't in this PR.

@kamoltat, I think we've found enough of these now to warrant a full audit — have you done one of those? If not, can you? I don't think it'll take that long but we definitely need to go over them all and squash this category of issue.

Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find!

Problem:

In the function `maybe_go_degraded_stretch_mode()`
when `osdmon` is not writeable we shouldn't go into
`trigger_degraded_stretch_mode` because we will
crash at `ceph_assert(osdmon()->is_writeable())`.
The current code does not exit `maybe_go_degraded_stretch_mode()`
when we are waiting for `osdmon` to be writeable, therefore,
we crash.

Solution:

Exit the function by returning nothing after going into
`wait_for_writeable_ctx`, since at that point we would have
queued the context and all we have to do is wait for finish
context to execute `maybe_go_degraded_stretch_mode` again.

Also, added a bit of logging so that user is aware
when `osdmon` and `monmon` are not writeable.

We fix other parts of the monitor code that are missing
the return after `wait_for_writeable_ctx` and `wait_for_readable_ctx`
as well.

Fixes: https://tracker.ceph.com/issues/59271

Signed-off-by: Kamoltat <ksirivad@redhat.com>
Separate `mon-stretch` from `mon`.

Renamed `mon-stretched-cluster.sh` to
`mon-stretch-fail-recovery.sh`.

This isolation of stretch cluster test will enable
developers to get results faster for stretch-cluster
related stuff.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
@kamoltat kamoltat force-pushed the wip-ksirivad-iswriteable branch from 5883989 to 431c455 Compare April 17, 2023 16:06
@kamoltat
Copy link
Member Author

Hi @gregsfortytwo just did the audit like you asked for the Monitor part and yep just added all the missing return statement to other places that needed them. Let me know if I'm missing anything!

@kamoltat kamoltat added needs-quincy-backport backport required for quincy needs-pacific-backport PR needs a pacific backport labels Apr 17, 2023
@kamoltat
Copy link
Member Author

Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, thanks!

@kamoltat
Copy link
Member Author

jenkins test windows

@kamoltat
Copy link
Member Author

NOTE: since this PR fixes an issue that was discovered through a standalone test that was added to main: #47340

When back-porting it is best to combine this PR with #47340 into 1 backport.

@kamoltat
Copy link
Member Author

kamoltat commented May 9, 2023

No related failures.

Analysed both:

http://pulpito.front.sepia.ceph.com/lflores-2023-05-03_18:20:26-rados-wip-yuri-testing-2023-04-19-1343-distro-default-smithi/

Pulpito :: Results Dashboard

Known Failures:

Bug #59196: ceph_test_lazy_omap_stats segfault while waiting for active+clean - RADOS - Ceph ceph_test_lazy_omap_stats segfault while waiting for active+clean

Bug #57386: cephadm/test_dashboard_e2e.sh: Expected to find content: '/^foo$/' within the selector: 'cd-modal .badge' but never did - Dashboard - Ceph cephadm/test_dashboard_e2e.sh

Bug #59057: rados/test_envlibrados_for_rocksdb.sh: No rule to make target 'rocksdb_env_librados_test' on centos 8 - RADOS - Ceph rados/test_envlibrados_for_rocksdb.sh: No rule to make target 'rocksdb_env_librados_test' on centos 8

Bug #58585: rook: failed to pull kubelet image - Orchestrator - Ceph rook: failed to pull kubelet image

Bug #58224: cephadm/test_repos.sh: urllib.error.HTTPError: HTTP Error 504: Gateway Timeout - Orchestrator - Ceph cephadm/test_repos.sh: urllib.error.HTTPError: HTTP Error 504: Gateway Timeout

Packaging Issues - Command failed on smithi114 with status 1: 'sudo yum install -y kernel'

Packing Issues - Command failed on smithi184 with status 1: 'sudo yum -y install cephfs-java'

Import Module Issues - No module named 'tasks '

Known Deadjobs:

Bug #59380: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" - rgw - Ceph - rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)"

Error reimaging machines: reached maximum tries (60) after waiting for 900 seconds

centos8 ansible error: Failed to download metadata for repo 'CentOS-PowerTools': Yum repo downloading error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants