
mgr/cephadm: set OSD flags during upgrade #50508

Closed

adk3798 wants to merge 2 commits into ceph:main from adk3798:upgrade-osd-flags

Conversation

@adk3798 (Contributor) commented Mar 13, 2023

In this commit at least, this means the noout, noscrub,
and nodeep-scrub flags, although the set of flags is
configurable with the mgr/cephadm/osd_upgrade_flags conf
option. The idea is that during upgrade there may be
situations that cause OSDs to get marked down and out
(see the linked tracker), so it's good practice to have
these flags set during upgrade. This also gets us in line
with ceph-ansible, which sets these flags itself during
upgrade.

Using the staggered upgrade test to test this since we
need to be upgrading from a mgr with upgrade code
that contains these changes.

Fixes: https://tracker.ceph.com/issues/56670

Signed-off-by: Adam King <adking@redhat.com>
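The flow the commit message describes could be sketched roughly as below. This is a hypothetical illustration, not the PR's actual code: the helper names (`parse_upgrade_flags`, `build_flag_commands`) are invented here, and only the option name `osd_upgrade_flags` and its default come from the commit message. The payloads mirror what `ceph osd set <flag>` / `ceph osd unset <flag>` send to the monitors.

```python
from typing import Dict, List

def parse_upgrade_flags(conf_value: str) -> List[str]:
    # Split the comma-separated osd_upgrade_flags value into individual
    # flag names, ignoring stray whitespace and empty entries.
    return [f.strip() for f in conf_value.split(',') if f.strip()]

def build_flag_commands(flags: List[str], set_them: bool) -> List[Dict[str, str]]:
    # Each payload mirrors the mon command behind "ceph osd set/unset <flag>";
    # at upgrade start set_them=True, and at upgrade end set_them=False.
    prefix = 'osd set' if set_them else 'osd unset'
    return [{'prefix': prefix, 'key': flag} for flag in flags]
```

With the default option value, `parse_upgrade_flags('noout,nodeep-scrub,noscrub')` yields the three flags the commit message names, and the module would issue one set command per flag before upgrading any OSDs.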

Additionally, the staggered upgrade test is currently broken, so I've included a second commit that attempts to fix it. I'd recommend reviewing the two commits separately, especially since they both modify the staggered upgrade teuthology test.

"ceph versions" has stopped being a reliable way of checking
if daemons have been upgraded (see linked tracker). The idea
of this commit is to instead rely on the container image id
that cephadm tracks for each daemon, since we have more control
over that.

Fixes: https://tracker.ceph.com/issues/58535
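The check this commit describes could look something like the sketch below. It is a hypothetical illustration: the function name `daemons_fully_upgraded` and the input shape (a map of daemon name to the container image id cephadm has recorded for it) are assumptions, not the PR's actual code.

```python
from typing import Dict

def daemons_fully_upgraded(daemon_image_ids: Dict[str, str],
                           target_image_id: str) -> bool:
    # Instead of trusting "ceph versions", compare the container image id
    # cephadm tracks for each daemon against the upgrade target's image id;
    # the upgrade is only complete once every daemon matches.
    return all(img == target_image_id for img in daemon_image_ids.values())
```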

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

        self.log.info(f"Unset global osd flag {flag}")

    def get_set_global_osd_flags(self) -> List[str]:
A reviewer (Contributor) commented:

I find the name of this function confusing, it sort of sounds like it can both get and set flags. I'd go for something more like get_current_global_osd_flags.
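For illustration, the renamed helper the reviewer suggests might look like the sketch below. This is hypothetical code, not from the PR: it assumes the flags are read by parsing the JSON output of `ceph osd dump -f json`, which reports the cluster-wide flags as a single comma-separated string under the "flags" key.

```python
import json
from typing import List

def get_current_global_osd_flags(osd_dump_json: str) -> List[str]:
    # "ceph osd dump -f json" exposes flags like
    # "noout,sortbitwise,recovery_deletes" in one string; split it
    # into a list, dropping empty entries if no flags are set.
    flags = json.loads(osd_dump_json).get('flags', '')
    return [f for f in flags.split(',') if f]
```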

    Option(
        'upgrade_osd_flags',
        type='str',
        default='noout,nodeep-scrub,noscrub',
A reviewer (Contributor) commented:

My 2 cents: is it actually a good idea to set these? As long as the upgrade is reasonably fast, everything will be fine. I'm just worried about degraded reliability of the whole cluster if the upgrade takes longer than usual. Why do we set noout for all OSDs, even ones we might not touch for, say, a month if the upgrade is stuck somewhere?

@adk3798 (Contributor, Author) commented Apr 4, 2023

https://pulpito.ceph.com/adking-2023-04-04_02:06:27-orch:cephadm-wip-adk-testing-2023-04-03-1541-distro-default-smithi/

Reruns of failed/dead jobs: https://pulpito.ceph.com/adking-2023-04-04_05:19:12-orch:cephadm-wip-adk-testing-2023-04-03-1541-distro-default-smithi/

After reruns, 6 failures. All 6 failures were on tests being added/adjusted by PRs in the run (2 keepalive-only nfs tests, 2 mon set crush location tests, 2 staggered upgrade tests) and those failures don't seem like they should block merging for any PRs in the run outside of those adjusting the tests.

NOTE: This PR is one of the ones adjusting tests, specifically the staggered upgrade. Will need to keep working on the test changes before merging this.

@github-actions bot commented May 4, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

adk3798 added 2 commits May 26, 2023 17:12
@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.


@github-actions github-actions bot removed the stale label Sep 20, 2023

@github-actions github-actions bot added the stale label Nov 19, 2023
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Dec 19, 2023