
pacific:mon/Monitor.cc: exit function if !osdmon()->is_writeable() && mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode()#51414

Merged: yuriw merged 4 commits into ceph:pacific from kamoltat:wip-ksirivad-backport-pacific-47340-and-50857 on Dec 28, 2023

@kamoltat kamoltat commented May 9, 2023

Problem:

  1. In certain scenarios in a degraded
    stretch cluster, we will try to
    go into the
    function Monitor::go_recovery_stretch_mode(),
    which leads to a ceph_assert failure.

  2. In the function maybe_go_degraded_stretch_mode(),
    when osdmon is not writeable we shouldn't go into
    trigger_degraded_stretch_mode() because we will
    crash at ceph_assert(osdmon()->is_writeable()).
    The current code does not exit maybe_go_degraded_stretch_mode()
    while we are waiting for osdmon to become writeable, so
    we crash.

Solution:

  1. Make sure dead_mon_buckets.size() == 0
    in OSDMonitor::update_from_paxos()
    before going into Monitor::go_recovery_stretch_mode().
  2. Return from the function right after going into
    wait_for_writeable_ctx, since at that point we have already
    queued the context and all we have to do is wait for the finish
    context to execute maybe_go_degraded_stretch_mode() again.
    Also added a bit of logging so that the user is aware
    when osdmon and monmon are not writeable.
    Other parts of the monitor code that were missing
    the return after wait_for_writeable_ctx and wait_for_readable_ctx
    are fixed as well.
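
The early-return pattern from item 2 can be illustrated with a minimal, self-contained sketch. This is not the Ceph code: `FakeService` and `FakeMonitor` are hypothetical stand-ins, and `wait_for_writeable_ctx` here takes a `std::function` purely for illustration. It only shows why returning after queuing the callback avoids falling through to the assert.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a monitor service such as osdmon.
struct FakeService {
  bool writeable = false;
  std::vector<std::function<void()>> waiters;

  bool is_writeable() const { return writeable; }

  // Illustrative simplification: queue a callback to run once writeable.
  void wait_for_writeable_ctx(std::function<void()> cb) {
    waiters.push_back(std::move(cb));
  }

  // Mark the service writeable and run every queued callback once.
  void become_writeable() {
    writeable = true;
    auto pending = std::move(waiters);
    waiters.clear();
    for (auto& cb : pending) cb();
  }
};

// Hypothetical stand-in for the monitor.
struct FakeMonitor {
  FakeService osdmon;
  int degraded_triggers = 0;

  void trigger_degraded_stretch_mode() {
    // Mirrors the precondition named in the description.
    assert(osdmon.is_writeable());
    ++degraded_triggers;
  }

  void maybe_go_degraded_stretch_mode() {
    if (!osdmon.is_writeable()) {
      // Retry later, when the service becomes writeable.
      osdmon.wait_for_writeable_ctx(
          [this] { maybe_go_degraded_stretch_mode(); });
      return;  // without this return we would fall through and hit the assert
    }
    trigger_degraded_stretch_mode();
  }
};
```

In this sketch, calling `maybe_go_degraded_stretch_mode()` while the service is not writeable only queues a retry; the trigger runs when `become_writeable()` later fires the queued callback.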

Fixes:
https://tracker.ceph.com/issues/57017
https://tracker.ceph.com/issues/59271

Backporting relevant commits from main PRs:

#47340
#50857

kamoltat added 4 commits May 9, 2023 17:59
Added bug reproducer for
https://bugzilla.redhat.com/show_bug.cgi?id=2104207

Added more logs in MON.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit 62fe3cb)
Problem:
In certain scenarios in a degraded
stretch cluster, we will try to
go into the
function ``Monitor::go_recovery_stretch_mode()``,
which leads to a `ceph_assert` failure.

Solution:
Make sure ``dead_mon_buckets.size() == 0``
in ``OSDMonitor::update_from_paxos()``
before going into ``Monitor::go_recovery_stretch_mode()``.

Fixes:
https://bugzilla.redhat.com/show_bug.cgi?id=2104207

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit d95c41a)
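
The guard described in the commit above can be sketched as follows. This is a simplified illustration, not the actual `OSDMonitor` code: `StretchState`, `maybe_go_recovery`, and the `other_conditions_met` flag are hypothetical, and the container type chosen for `dead_mon_buckets` is an assumption. Only the `dead_mon_buckets.size() == 0` check comes from the commit message.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the monitor state relevant to stretch mode.
struct StretchState {
  // Assumed shape: bucket name -> monitors considered dead in that bucket.
  std::map<std::string, std::set<std::string>> dead_mon_buckets;
  bool recovering = false;

  void go_recovery_stretch_mode() { recovering = true; }
};

// Hypothetical helper: other_conditions_met stands in for whatever other
// checks the caller already performs before entering recovery.
bool maybe_go_recovery(StretchState& mon, bool other_conditions_met) {
  // Extra check: only enter recovery when no monitor bucket is dead.
  if (other_conditions_met && mon.dead_mon_buckets.size() == 0) {
    mon.go_recovery_stretch_mode();
    return true;
  }
  return false;
}
```

With a non-empty `dead_mon_buckets` the helper leaves the state untouched; once the map is empty, the same call enters recovery.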
Problem:

In the function `maybe_go_degraded_stretch_mode()`
when `osdmon` is not writeable we shouldn't go into
`trigger_degraded_stretch_mode` because we will
crash at `ceph_assert(osdmon()->is_writeable())`.
The current code does not exit `maybe_go_degraded_stretch_mode()`
when we are waiting for `osdmon` to be writeable, therefore,
we crash.

Solution:

Return from the function right after going into
`wait_for_writeable_ctx`, since at that point we have already
queued the context and all we have to do is wait for the finish
context to execute `maybe_go_degraded_stretch_mode` again.

Also added a bit of logging so that the user is aware
when `osdmon` and `monmon` are not writeable.

Other parts of the monitor code that were missing
the return after `wait_for_writeable_ctx` and `wait_for_readable_ctx`
are fixed as well.

Fixes: https://tracker.ceph.com/issues/59271

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit 7f7c2d5)
Separate `mon-stretch` from `mon`.

Renamed `mon-stretched-cluster.sh` to
`mon-stretch-fail-recovery.sh`.

Isolating the stretch cluster tests lets
developers get results faster for
stretch-cluster-related changes.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit 431c455)
@kamoltat kamoltat self-assigned this May 9, 2023
@kamoltat kamoltat requested a review from a team as a code owner May 9, 2023 18:02
@github-actions github-actions bot added this to the pacific milestone May 9, 2023
@kamoltat kamoltat added the backport: no-conflicts Backport without conflicts label May 9, 2023
@kamoltat kamoltat requested a review from gregsfortytwo May 15, 2023 19:53
@kamoltat kamoltat modified the milestones: pacific, v16.2.15 Dec 19, 2023
@NitzanMordhai NitzanMordhai left a comment

LGTM

@NitzanMordhai

Rados approved
PACIFIC - RADOS - Ceph

@yuriw yuriw merged commit e3484e8 into ceph:pacific Dec 28, 2023