mgr/cephadm: redeploy daemons deployed using old image after upgrade by adk3798 · Pull Request #39435 · ceph/ceph

adk3798 · 2021-02-12T07:07:05Z

daemons not upgraded during the upgrade call and mgr daemons that could
have been upgraded by a mgr running the old pre-upgrade image need to be
redeployed after the rest of the upgrade is complete in case something
important has changed in the deployment between the old image and new
image

Fixes: https://tracker.ceph.com/issues/49013

Signed-off-by: Adam King adking@redhat.com

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox

sebastian-philipp

What I don't like here is: there is no persistent state to track those upgrades, this makes reasoning about this code somewhat complicated in case there is an unexpected failover while we're within _post_upgrade_redeploys

One idea would be to do something like

for d in daemons:
    if d.daemon_type == 'mgr':
        self.mgr.cache.schedule_daemon_action(d.hostname, d.name(), 'redeploy')

But this would have the downside of doing upgrade steps after the upgrade has already completed. I don't like this either.

🤷‍♂️

sebastian-philipp · 2021-02-12T09:29:39Z

src/pybind/mgr/cephadm/upgrade.py

+            assert d.daemon_type is not None
+            assert d.daemon_id is not None
+            assert d.hostname is not None


I really need to fix the types. having those members being optional is irritating.

src/pybind/mgr/cephadm/upgrade.py

adk3798 · 2021-02-12T21:42:37Z

Now overhauled to construct a list of the daemons that need to be redeployed as part of the upgrade state. As each daemon is redeployed it is removed from the list and the upgrade state is saved again so we don't have to redeploy those daemons again in case of a fail over. _post_upgrade_redeploys now also return a bool to mark its success. If it returns True the upgrade will continue and be marked as complete. If it returns False the upgrade will not be completed and the redeploys will be reattempted when the upgrade is continued (the same system as when an ok_to_stop call fails during the actual upgrades).

adk3798 · 2021-02-17T22:01:29Z

jenkins test make check

sebastian-philipp · 2021-02-18T11:08:28Z

https://pulpito.ceph.com/swagner-2021-02-17_14:01:13-rados:cephadm-wip-swagner-testing-2021-02-16-1806-distro-basic-smithi/

sebastian-philipp · 2021-02-18T11:15:34Z

if I remember correctly, @liewegas had the idea of using the meta data stored along the daemons to know the ceph version that deployed the daemons, right?

adk3798 · 2021-02-18T13:54:28Z

if I remember correctly, @liewegas had the idea of using the meta data stored along the daemons to know the ceph version that deployed the daemons, right?

Yes, I think the idea was to store what mgr version deployed the daemon in the unit.meta file from liewegas@9b9cc97 and then that could be included as part of the DaemonDesciption. Although, I don't think that change adding the unit.meta file exists in master.

liewegas · 2021-02-18T14:39:55Z

see https://github.com/ceph/ceph/pull/39550. @sebastian-philipp i wonder if we should merge this part without the autotuning bits?

liewegas · 2021-02-18T14:41:39Z

What I don't like here is: there is no persistent state to track those upgrades, this makes reasoning about this code somewhat complicated in case there is an unexpected failover while we're within _post_upgrade_redeploys

One idea would be to do something like
for d in daemons:
    if d.daemon_type == 'mgr':
        self.mgr.cache.schedule_daemon_action(d.hostname, d.name(), 'redeploy')
But this would have the downside of doing upgrade steps after the upgrade has already completed. I don't like this either.

I tried to take a "stateless" approach to _do_upgrade so that it will always reconcile the current cluster state with our desired version, avoiding any complexity from intervening mgr restarts, canceled/restarted upgrades, restarted upgrades to different versions, and so on...

sebastian-philipp · 2021-03-02T16:08:46Z

#39644 is merged

adk3798 · 2021-03-03T20:19:57Z

This now makes use of the unit.meta file. The container digest of the mgr used to deploy the daemon is stored in the unit.meta file and then, after the upgrade process, all daemons who were not deployed by a mgr using the new container are redeployed.

sebastian-philipp · 2021-03-12T11:03:27Z

https://pulpito.ceph.com/swagner-2021-03-11_16:09:20-rados:cephadm-wip-swagner-testing-2021-03-11-1320-distro-basic-smithi/

Lots of errors:

https://tracker.ceph.com/issues/49633
https://tracker.ceph.com/issues/49740
upgrade suite is totally broken
tasks.cephadm_cases.test_cli.TestCephadmCLI.test_cephfs_mirror broke yesterday or the day before.

I'm a little hesitant to merge this with the broken upgrade suite.

sebastian-philipp · 2021-03-12T11:04:13Z

hold on

sebastian-philipp · 2021-03-12T11:13:59Z

rerun:

https://pulpito.ceph.com/swagner-2021-03-12_11:13:04-rados:cephadm:upgrade-wip-swagner-testing-2021-03-11-1320-distro-basic-smithi/

with --ceph wip-swagner-testing-2021-03-11-1320 --suite-repo http://github.com/sebastian-philipp/ceph.git --suite-branch wip-swagner-testing-2021-03-11-1320-rm-iscsi

src/pybind/mgr/orchestrator/_interface.py

github-actions · 2021-03-16T17:10:17Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

liewegas · 2021-03-17T20:02:48Z

src/pybind/mgr/cephadm/serve.py

                            'service_name': daemon_spec.service_name,
                            'ports': daemon_spec.ports,
                            'ip': daemon_spec.ip,
+                            'deployed_by': self.mgr.get_active_mgr_digests(),


I wonder if instead of digests we should just the ceph version string, e.g. 16.1.0-917-g08ab4ae8. it's easier to get that the digests (we can drop get_active_mgr_digets() above) and it is also something a human can make sense of.

I like the idea 👍. I'll try it out and if there's no issues I'll swap over to that.

src/pybind/mgr/cephadm/upgrade.py

github-actions · 2021-03-18T22:12:38Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

…t.meta For use in upgrade. It's useful to see if daemon was deployed by mgr running same container digest as container we are upgrading to. Signed-off-by: Adam King <adking@redhat.com>

src/pybind/mgr/cephadm/upgrade.py

liewegas · 2021-03-18T22:20:24Z

src/pybind/mgr/cephadm/utils.py

@@ -23,7 +23,8 @@ class CephadmNoImage(Enum):
 # NOTE: order important here as these are used for upgrade order
 CEPH_TYPES = ['mgr', 'mon', 'crash', 'osd', 'mds', 'rgw', 'rbd-mirror', 'cephfs-mirror']
 GATEWAY_TYPES = ['iscsi', 'nfs']


Perhaps in a separate patch, do we always want to move iscsi and nfs back into the CEPH_TYPES? So that their ceph.confs are updated?

I think that can be done at the same time we fix the reconfigs during upgrade. That's why they got their own group to begin with.

Add extra check that daemons were deployed by mgr using new image during upgrade. Makes sure unit.run file for all daemons are updated if they changed between old and new images. Fixes: https://tracker.ceph.com/issues/49013 Signed-off-by: Adam King <adking@redhat.com>

adk3798 added needs-review cephadm labels Feb 12, 2021

adk3798 requested a review from a team as a code owner February 12, 2021 07:07

github-actions bot added the pybind label Feb 12, 2021

sebastian-philipp suggested changes Feb 12, 2021

View reviewed changes

adk3798 force-pushed the post-upgrade-redeploy branch 2 times, most recently from a5e89a2 to 4feea4e Compare February 12, 2021 21:35

adk3798 requested a review from sebastian-philipp February 12, 2021 21:42

sebastian-philipp added the wip-swagner-testing My Teuthology tests label Feb 16, 2021

sebastian-philipp removed the wip-swagner-testing My Teuthology tests label Feb 18, 2021

adk3798 force-pushed the post-upgrade-redeploy branch from 4feea4e to d89037f Compare March 3, 2021 20:10

github-actions bot added the orchestrator label Mar 3, 2021

adk3798 force-pushed the post-upgrade-redeploy branch from d89037f to 4cb3578 Compare March 4, 2021 22:13

sebastian-philipp added the wip-swagner-testing My Teuthology tests label Mar 10, 2021

sebastian-philipp removed the wip-swagner-testing My Teuthology tests label Mar 12, 2021

sebastian-philipp changed the title ~~mgr/cephadm: redeploy daemons deployed using old image after upgrade~~ [WIP] mgr/cephadm: redeploy daemons deployed using old image after upgrade Mar 16, 2021

sebastian-philipp changed the title ~~[WIP] mgr/cephadm: redeploy daemons deployed using old image after upgrade~~ mgr/cephadm: redeploy daemons deployed using old image after upgrade Mar 16, 2021

sebastian-philipp approved these changes Mar 16, 2021

View reviewed changes

sebastian-philipp reviewed Mar 16, 2021

View reviewed changes

src/pybind/mgr/orchestrator/_interface.py Outdated Show resolved Hide resolved

adk3798 force-pushed the post-upgrade-redeploy branch from 4cb3578 to 6e79690 Compare March 16, 2021 16:26

github-actions bot added the needs-rebase label Mar 16, 2021

adk3798 force-pushed the post-upgrade-redeploy branch from 6e79690 to 0fa7004 Compare March 16, 2021 18:55

github-actions bot removed the needs-rebase label Mar 16, 2021

liewegas reviewed Mar 17, 2021

View reviewed changes

src/pybind/mgr/cephadm/upgrade.py Outdated Show resolved Hide resolved

adk3798 force-pushed the post-upgrade-redeploy branch from 0fa7004 to 7ab9749 Compare March 18, 2021 22:12

github-actions bot added the needs-rebase label Mar 18, 2021

mgr/cephadm: add container digests of mgr that deployed daemon to uni…

baa95a7

…t.meta For use in upgrade. It's useful to see if daemon was deployed by mgr running same container digest as container we are upgrading to. Signed-off-by: Adam King <adking@redhat.com>

liewegas reviewed Mar 18, 2021

View reviewed changes

src/pybind/mgr/cephadm/upgrade.py Outdated Show resolved Hide resolved

liewegas reviewed Mar 18, 2021

View reviewed changes

adk3798 force-pushed the post-upgrade-redeploy branch from 7ab9749 to 83f0ca9 Compare March 18, 2021 22:24

github-actions bot removed the needs-rebase label Mar 18, 2021

liewegas added the wip-sage-testing label Mar 21, 2021

liewegas approved these changes Mar 21, 2021

View reviewed changes

liewegas merged commit 94e152f into ceph:master Mar 23, 2021

liewegas mentioned this pull request Mar 24, 2021

pacific: cephadm: Batch backport, March (4) #40355

Merged

Conversation

adk3798 commented Feb 12, 2021

Uh oh!

sebastian-philipp left a comment

Choose a reason for hiding this comment

Uh oh!

sebastian-philipp Feb 12, 2021

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

adk3798 commented Feb 12, 2021

Uh oh!

adk3798 commented Feb 17, 2021

Uh oh!

sebastian-philipp commented Feb 18, 2021

Uh oh!

sebastian-philipp commented Feb 18, 2021

Uh oh!

adk3798 commented Feb 18, 2021

Uh oh!

liewegas commented Feb 18, 2021

Uh oh!

liewegas commented Feb 18, 2021

Uh oh!

sebastian-philipp commented Mar 2, 2021

Uh oh!

adk3798 commented Mar 3, 2021

Uh oh!

sebastian-philipp commented Mar 12, 2021

Uh oh!

sebastian-philipp commented Mar 12, 2021

Uh oh!

sebastian-philipp commented Mar 12, 2021

Uh oh!

Uh oh!

github-actions bot commented Mar 16, 2021

Uh oh!

liewegas Mar 17, 2021

Choose a reason for hiding this comment

Uh oh!

adk3798 Mar 17, 2021

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot commented Mar 18, 2021

Uh oh!

Uh oh!

liewegas Mar 18, 2021

Choose a reason for hiding this comment

Uh oh!

adk3798 Mar 18, 2021

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants