
mgr/cephadm: set OSD flags during upgrade #50508

Closed

adk3798 wants to merge 2 commits into ceph:main from adk3798:upgrade-osd-flags

Conversation

@adk3798 (Contributor) commented Mar 13, 2023

In this commit at least, this means the noout, noscrub,
and nodeep-scrub flags, although the set of flags is
configurable with the mgr/cephadm/osd_upgrade_flags conf
option. The idea is that during upgrade there may be
situations that cause OSDs to get marked down and out
(see the linked tracker), so it's good practice to have
these flags set during upgrade. This also gets us in line
with ceph-ansible, which sets these flags itself during
upgrade.

Using the staggered upgrade test to test this since we
need to be upgrading from a mgr with upgrade code
that contains these changes.

Fixes: https://tracker.ceph.com/issues/56670

Signed-off-by: Adam King <adking@redhat.com>
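The flow the commit message describes could be sketched roughly as below. This is a hypothetical illustration, not the PR's actual code: the helper names (`parse_upgrade_flags`, `build_flag_commands`) are invented here, and only the option name `osd_upgrade_flags` and its default come from the commit message. The payloads mirror what `ceph osd set <flag>` / `ceph osd unset <flag>` send to the monitors.

```python
from typing import Dict, List

def parse_upgrade_flags(conf_value: str) -> List[str]:
    # Split the comma-separated osd_upgrade_flags value into individual
    # flag names, ignoring stray whitespace and empty entries.
    return [f.strip() for f in conf_value.split(',') if f.strip()]

def build_flag_commands(flags: List[str], set_them: bool) -> List[Dict[str, str]]:
    # Each payload mirrors the mon command behind "ceph osd set/unset <flag>";
    # at upgrade start set_them=True, and at upgrade end set_them=False.
    prefix = 'osd set' if set_them else 'osd unset'
    return [{'prefix': prefix, 'key': flag} for flag in flags]
```

With the default option value, `parse_upgrade_flags('noout,nodeep-scrub,noscrub')` yields the three flags the commit message names, and the module would issue one set command per flag before upgrading any OSDs.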

Additionally, the staggered upgrade test is currently broken, so I've included a second commit that attempts to fix it. I'd recommend reviewing the two commits separately, especially since they both modify the staggered upgrade teuthology test.

"ceph versions" has stopped being a reliable way of checking
if daemons have been upgraded (see linked tracker). The idea
of this commit is to instead rely on the container image id
that cephadm tracks for each daemon, since we have more control
over that.

Fixes: https://tracker.ceph.com/issues/58535
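The check this commit describes could look something like the sketch below. It is a hypothetical illustration: the function name `daemons_fully_upgraded` and the input shape (a map of daemon name to the container image id cephadm has recorded for it) are assumptions, not the PR's actual code.

```python
from typing import Dict

def daemons_fully_upgraded(daemon_image_ids: Dict[str, str],
                           target_image_id: str) -> bool:
    # Instead of trusting "ceph versions", compare the container image id
    # cephadm tracks for each daemon against the upgrade target's image id;
    # the upgrade is only complete once every daemon matches.
    return all(img == target_image_id for img in daemon_image_ids.values())
```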

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

        self.log.info(f"Unset global osd flag {flag}")

    def get_set_global_osd_flags(self) -> List[str]:
A reviewer (Contributor) commented:

I find the name of this function confusing, it sort of sounds like it can both get and set flags. I'd go for something more like get_current_global_osd_flags.
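For illustration, the renamed helper the reviewer suggests might look like the sketch below. This is hypothetical code, not from the PR: it assumes the flags are read by parsing the JSON output of `ceph osd dump -f json`, which reports the cluster-wide flags as a single comma-separated string under the "flags" key.

```python
import json
from typing import List

def get_current_global_osd_flags(osd_dump_json: str) -> List[str]:
    # "ceph osd dump -f json" exposes flags like
    # "noout,sortbitwise,recovery_deletes" in one string; split it
    # into a list, dropping empty entries if no flags are set.
    flags = json.loads(osd_dump_json).get('flags', '')
    return [f for f in flags.split(',') if f]
```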

    Option(
        'upgrade_osd_flags',
        type='str',
        default='noout,nodeep-scrub,noscrub',
A reviewer (Contributor) commented:

My 2 cents: is it actually a good idea to set these? As long as the upgrade is reasonably fast, everything will be fine. I'm just worried about degraded reliability of the whole cluster if the upgrade takes longer than usual. Why do we set noout for all OSDs, even ones we might not touch for, say, a month if the upgrade is stuck somewhere?

@adk3798 (Contributor, Author) commented Apr 4, 2023

https://pulpito.ceph.com/adking-2023-04-04_02:06:27-orch:cephadm-wip-adk-testing-2023-04-03-1541-distro-default-smithi/

Reruns of failed/dead jobs: https://pulpito.ceph.com/adking-2023-04-04_05:19:12-orch:cephadm-wip-adk-testing-2023-04-03-1541-distro-default-smithi/

After reruns, 6 failures. All 6 failures were on tests being added/adjusted by PRs in the run (2 keepalive-only nfs tests, 2 mon set crush location tests, 2 staggered upgrade tests) and those failures don't seem like they should block merging for any PRs in the run outside of those adjusting the tests.

NOTE: This PR is one of the ones adjusting tests, specifically the staggered upgrade. Will need to keep working on the test changes before merging this.

@github-actions bot commented May 4, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

adk3798 added 2 commits May 26, 2023 17:12
@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.


@github-actions github-actions bot removed the stale label Sep 20, 2023

@github-actions github-actions bot added the stale label Nov 19, 2023
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Dec 19, 2023