Skip to content

osd/OSD: osd_fast_shutdown_notify_mon not quite right#44807

Merged
yuriw merged 2 commits intoceph:masterfrom
NitzanMordhai:wip-nitzan-fast-shutdown-notify-mon
Mar 25, 2022
Merged

osd/OSD: osd_fast_shutdown_notify_mon not quite right#44807
yuriw merged 2 commits intoceph:masterfrom
NitzanMordhai:wip-nitzan-fast-shutdown-notify-mon

Conversation

@NitzanMordhai
Copy link
Contributor

@NitzanMordhai NitzanMordhai commented Jan 27, 2022

When osd_fast_shutdown and osd_fast_shutdown_notify_mon set as true, OSD marked as Down
it should be marked as Dead,

Fixed: https://tracker.ceph.com/issues/53327

Signed-off-by: Nitzan Mordechai nmordech@redhat.com

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@github-actions github-actions bot added the core label Jan 27, 2022
@neha-ojha neha-ojha requested a review from liewegas January 27, 2022 23:43
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-fast-shutdown-notify-mon branch from 93378e3 to 7f71ae1 Compare January 28, 2022 08:00
@NitzanMordhai
Copy link
Contributor Author

Changed to add check also for osd_fast_shutdown as well as osd_fast_shutdown_notify_mon before mark as dead

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-fast-shutdown-notify-mon branch 2 times, most recently from 4b675f6 to 2c5d957 Compare January 30, 2022 13:42
Copy link
Member

@jdurgin jdurgin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks right, however it seems set_state(STOPPING); should precede the MarkMeDead message, since that's what stops us from accepting new connections and processing new messages.

Since the is_stopping_cond is waiting with a timeout, STATE_STOPPING may not be set yet at this point.

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-fast-shutdown-notify-mon branch from 2c5d957 to c2892ca Compare February 3, 2022 07:30
@NitzanMordhai NitzanMordhai requested a review from jdurgin February 3, 2022 07:31
@mlausch
Copy link
Contributor

mlausch commented Feb 16, 2022

I tested the functionality of this patch.
The MOSDMarkMeDead is send to the mon only in about 1/3 of the cases (while stopping all OSDs of a host).
Mostly the OSD process is exited before the message could be transmitted to the mon
I think there should be waited for a ack like in the down message as well.

@ronen-fr
Copy link
Contributor

ronen-fr commented Feb 16, 2022

Note also @mlausch 's comment in the tracker. I see two options:

  • wait for the ack - but that might miss the main idea of shutting down quickly even if the Mon is not responding immediately (at least - I think that's part of the idea), or
  • a delay based on internal OSD data. Either until we know that the message has gone out (not sure how easy is it to create the required low-level mechanism), or using a short (1 sec?) delay.

@jdurgin ? @neha-ojha ?

@jdurgin
Copy link
Member

jdurgin commented Feb 16, 2022

@ronen-fr I'd suggest the 1st approach - waiting for the ack - to keep it simple.

The speed of shutting down is mainly a concern during an upgrade. It's worse to not get marked down and affect client i/o than to shutdown a little bit slower to be sure the message gets to the mon.

@yuriw
Copy link
Contributor

yuriw commented Feb 16, 2022

@NitzanMordhai pls add needs-qa when fixed

@neha-ojha
Copy link
Member

Note also @mlausch 's comment in the tracker. I see two options:

  • wait for the ack - but that might miss the main idea of shutting down quickly even if the Mon is not responding immediately (at least - I think that's part of the idea), or
  • a delay based on internal OSD data. Either until we know that the message has gone out (not sure how easy is it to create the required low-level mechanism), or using a short (1 sec?) delay.

@jdurgin ? @neha-ojha ?

I agree that waiting for an ack mechanism is probably the better way to go. Also, osd_fast_shutdown_notify_mon is set to false by default(because of #44016 (comment)), so we should be make sure we are testing this code path correctly, maybe add a test based on what @mlausch is doing.

@jdurgin
Copy link
Member

jdurgin commented Feb 16, 2022

Note also @mlausch 's comment in the tracker. I see two options:

  • wait for the ack - but that might miss the main idea of shutting down quickly even if the Mon is not responding immediately (at least - I think that's part of the idea), or
  • a delay based on internal OSD data. Either until we know that the message has gone out (not sure how easy is it to create the required low-level mechanism), or using a short (1 sec?) delay.

@jdurgin ? @neha-ojha ?

I agree that waiting for an ack mechanism is probably the better way to go. Also, osd_fast_shutdown_notify_mon is set to false by default(because of #44016 (comment)), so we should be make sure we are testing this code path correctly, maybe add a test based on what @mlausch is doing.

I think we should change that default - there's very little impact outside of an edge case when mons are unresponsive.

@neha-ojha
Copy link
Member

Note also @mlausch 's comment in the tracker. I see two options:

  • wait for the ack - but that might miss the main idea of shutting down quickly even if the Mon is not responding immediately (at least - I think that's part of the idea), or
  • a delay based on internal OSD data. Either until we know that the message has gone out (not sure how easy is it to create the required low-level mechanism), or using a short (1 sec?) delay.

@jdurgin ? @neha-ojha ?

I agree that waiting for an ack mechanism is probably the better way to go. Also, osd_fast_shutdown_notify_mon is set to false by default(because of #44016 (comment)), so we should be make sure we are testing this code path correctly, maybe add a test based on what @mlausch is doing.

I think we should change that default - there's very little impact outside of an edge case when mons are unresponsive.

Agreed, how about we cherry-pick the commit from #44016 in this PR and test them together? @satoru-takeuchi does that sound good to you?

@satoru-takeuchi
Copy link
Contributor

@neha-ojha Yes, sounds great.

@neha-ojha
Copy link
Member

@neha-ojha Yes, sounds great.

Thanks @satoru-takeuchi!

@NitzanMordhai please cherry-pick bf4d358 into this PR and keep the original Signed-off-by, thanks!

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-fast-shutdown-notify-mon branch from c2892ca to 44761f2 Compare February 21, 2022 14:26
@github-actions github-actions bot added the mon label Feb 21, 2022
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-fast-shutdown-notify-mon branch from 44761f2 to 985a90b Compare February 23, 2022 10:01
@mlausch
Copy link
Contributor

mlausch commented Feb 27, 2022

HI @NitzanMordhai. I tried out the current code. The Mon skipps the new Message type, because it is seems not be allowed send by a OSD.
I have created you a PR one your branch. I hope this is the right way to do this.

@mlausch
Copy link
Contributor

mlausch commented Mar 15, 2022

works for me as well.

src/osd/OSD.cc Outdated
whoami,
osdmap->get_addrs(whoami),
osdmap->get_epoch(),
true // request ack
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to not mark the osd dead during a graceful shutdown?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to not mark the osd dead during a graceful shutdown?

I don't think there is any reason not, i couldn't find any

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make it unconditional then

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mlausch
Copy link
Contributor

mlausch commented Mar 21, 2022

@neha-ojha can you add a backport lable for pacific as well?

@neha-ojha
Copy link
Member

@neha-ojha can you add a backport lable for pacific as well?

The associated tracker has pacific and quincy in the backport field, https://tracker.ceph.com/issues/53327#note-3. I added the additional needs-quincy-backport label, because I'd like to include this patch in the first cut of quincy.

When osd_fast_shutdown and osd_fast_shutdown_notify_mon set as true, OSD marked as Down
it should be marked as Dead,

Fixed: https://tracker.ceph.com/issues/53327

Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>

nd

nd
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-fast-shutdown-notify-mon branch from 8d65ce8 to 07302d5 Compare March 23, 2022 14:38
@neha-ojha neha-ojha requested a review from jdurgin March 23, 2022 16:50
@yuriw yuriw merged commit ea1727e into ceph:master Mar 25, 2022
sseshasa added a commit to sseshasa/ceph that referenced this pull request Mar 25, 2022
Modify test_activate_osd() to get the type of scheduler in use and then
verify the value of osd_max_backfills. This is because mclock scheduler
overrides this option to 1000 upon OSD initialization.

The test earlier used to pass because the OSD daemon was killed but not
marked down and upon being brought up, the wait for OSD up check was
passing quickly. But the OSD still didn't have the latest config values.

But now upon killing the OSD, the osd_fast_shutdown sequence notifies the
mon (see PR: ceph#44807) and is marked down
and dead. Upon bringing it up, the wait for OSD up check takes a longer
time and this is sufficient for the config values to be updated. This
results in the correct values being read from the config 'Values' map.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
ljflores pushed a commit to ljflores/ceph that referenced this pull request Mar 25, 2022
Modify test_activate_osd() to get the type of scheduler in use and then
verify the value of osd_max_backfills. This is because mclock scheduler
overrides this option to 1000 upon OSD initialization.

The test earlier used to pass because the OSD daemon was killed but not
marked down and upon being brought up, the wait for OSD up check was
passing quickly. But the OSD still didn't have the latest config values.

But now upon killing the OSD, the osd_fast_shutdown sequence notifies the
mon (see PR: ceph#44807) and is marked down
and dead. Upon bringing it up, the wait for OSD up check takes a longer
time and this is sufficient for the config values to be updated. This
results in the correct values being read from the config 'Values' map.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 3aa2df2)
ljflores pushed a commit to ljflores/ceph that referenced this pull request Mar 25, 2022
Modify test_activate_osd() to get the type of scheduler in use and then
verify the value of osd_max_backfills. This is because mclock scheduler
overrides this option to 1000 upon OSD initialization.

The test earlier used to pass because the OSD daemon was killed but not
marked down and upon being brought up, the wait for OSD up check was
passing quickly. But the OSD still didn't have the latest config values.

But now upon killing the OSD, the osd_fast_shutdown sequence notifies the
mon (see PR: ceph#44807) and is marked down
and dead. Upon bringing it up, the wait for OSD up check takes a longer
time and this is sufficient for the config values to be updated. This
results in the correct values being read from the config 'Values' map.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 3aa2df2)
@ljflores ljflores mentioned this pull request Mar 25, 2022
14 tasks
ljflores pushed a commit to ljflores/ceph that referenced this pull request Mar 25, 2022
Modify test_activate_osd() to get the type of scheduler in use and then
verify the value of osd_max_backfills. This is because mclock scheduler
overrides this option to 1000 upon OSD initialization.

The test earlier used to pass because the OSD daemon was killed but not
marked down and upon being brought up, the wait for OSD up check was
passing quickly. But the OSD still didn't have the latest config values.

But now upon killing the OSD, the osd_fast_shutdown sequence notifies the
mon (see PR: ceph#44807) and is marked down
and dead. Upon bringing it up, the wait for OSD up check takes a longer
time and this is sufficient for the config values to be updated. This
results in the correct values being read from the config 'Values' map.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 3aa2df2)
zalsader pushed a commit to zalsader/ceph that referenced this pull request Apr 11, 2022
Modify test_activate_osd() to get the type of scheduler in use and then
verify the value of osd_max_backfills. This is because mclock scheduler
overrides this option to 1000 upon OSD initialization.

The test earlier used to pass because the OSD daemon was killed but not
marked down and upon being brought up, the wait for OSD up check was
passing quickly. But the OSD still didn't have the latest config values.

But now upon killing the OSD, the osd_fast_shutdown sequence notifies the
mon (see PR: ceph#44807) and is marked down
and dead. Upon bringing it up, the wait for OSD up check takes a longer
time and this is sufficient for the config values to be updated. This
results in the correct values being read from the config 'Values' map.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
mchangir pushed a commit to mchangir/ceph that referenced this pull request Apr 13, 2022
Modify test_activate_osd() to get the type of scheduler in use and then
verify the value of osd_max_backfills. This is because mclock scheduler
overrides this option to 1000 upon OSD initialization.

The test earlier used to pass because the OSD daemon was killed but not
marked down and upon being brought up, the wait for OSD up check was
passing quickly. But the OSD still didn't have the latest config values.

But now upon killing the OSD, the osd_fast_shutdown sequence notifies the
mon (see PR: ceph#44807) and is marked down
and dead. Upon bringing it up, the wait for OSD up check takes a longer
time and this is sufficient for the config values to be updated. This
results in the correct values being read from the config 'Values' map.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 3aa2df2)
dpaganel pushed a commit to dpaganel/ceph that referenced this pull request May 17, 2022
Modify test_activate_osd() to get the type of scheduler in use and then
verify the value of osd_max_backfills. This is because mclock scheduler
overrides this option to 1000 upon OSD initialization.

The test earlier used to pass because the OSD daemon was killed but not
marked down and upon being brought up, the wait for OSD up check was
passing quickly. But the OSD still didn't have the latest config values.

But now upon killing the OSD, the osd_fast_shutdown sequence notifies the
mon (see PR: ceph#44807) and is marked down
and dead. Upon bringing it up, the wait for OSD up check takes a longer
time and this is sufficient for the config values to be updated. This
results in the correct values being read from the config 'Values' map.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
NitzanMordhai pushed a commit to NitzanMordhai/ceph that referenced this pull request Nov 19, 2023
Modify test_activate_osd() to get the type of scheduler in use and then
verify the value of osd_max_backfills. This is because mclock scheduler
overrides this option to 1000 upon OSD initialization.

The test earlier used to pass because the OSD daemon was killed but not
marked down and upon being brought up, the wait for OSD up check was
passing quickly. But the OSD still didn't have the latest config values.

But now upon killing the OSD, the osd_fast_shutdown sequence notifies the
mon (see PR: ceph#44807) and is marked down
and dead. Upon bringing it up, the wait for OSD up check takes a longer
time and this is sufficient for the config values to be updated. This
results in the correct values being read from the config 'Values' map.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 3aa2df2)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants