mgr/cephadm: add `orch ok-to-stop` commands by mgfritch · Pull Request #36232 · ceph/ceph

mgfritch · 2020-07-21T22:01:02Z

Adds a per daemon ok-to-stop command:

$ ceph orch ok-to-stop osd 0 1 2
Error EBUSY: It is NOT safe to stop ['osd.0', 'osd.1', 'osd.2']: 65 PGs are already too degraded, would become too degraded or might become unavailable

Also adds a per host ok-to-stop command:

$ ceph orch host ok-to-stop host1
It is presumed safe to stop host host1

Signed-off-by: Michael Fritch <mfritch@suse.com>

Checklist

References tracker ticket
Updates documentation if necessary
Includes tests for new functionality or reproducer for bug

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard backend
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox

batrick · 2020-07-21T23:06:40Z

Adds a per daemon ok-to-stop command:

$ ceph orch ok-to-stop osd 0 1 2
Error ENOENT: It is NOT safe to stop ['osd.0', 'osd.1', 'osd.2']

ENOENT is an odd error number for this.

sebastian-philipp · 2020-07-22T10:21:18Z

Hm. do we really need a ceph orch ok-to-stop $daemon? I mean, we already have ceph osd ok-to-stop.

sebastian-philipp · 2020-07-22T10:22:21Z

src/pybind/mgr/cephadm/module.py

+            if not self.cephadm_services[daemon_type].ok_to_stop(daemon_ids):
+                raise orchestrator.OrchestratorError(
+                        f'It is NOT safe to stop host {hostname}')


we should probably return the error string returned by ceph * ok-to-stop here.

sebastian-philipp · 2020-07-22T10:23:23Z

src/pybind/mgr/cephadm/module.py

+    @trivial_completion
+    def ok_to_stop(self,
+                   daemon_type: str,
+                   daemon_ids: List[str]):
+        names = [f'{daemon_type}.{d_id}' for d_id in daemon_ids]
+
+        if daemon_type not in ServiceSpec.KNOWN_SERVICE_TYPES:
+            raise orchestrator.OrchestratorError(
+                    f'unknown daemon_type "{daemon_type}"')
+
+        if not self.cephadm_services[daemon_type].ok_to_stop(daemon_ids):
+            raise orchestrator.OrchestratorError(
+                    f'It is NOT safe to stop {names}')
+
+        msg = f'It is presumed safe to stop {names}'
+        self.log.info(msg)
+        return msg


OK, now I see the point: e.g. for NFS etc. Hm.

Hm. do we really need a ceph orch ok-to-stop $daemon? I mean, we already have ceph osd ok-to-stop.

it's useful if we implement logic like this for NFS, iscsi etc, but it's also useful in case we'd decide to extend the existing ok-to-stop logic with Orchestrator specific logic ...

sebastian-philipp · 2020-07-22T10:23:50Z

src/pybind/mgr/cephadm/services/cephadmservice.py

        if ret:
-            logger.info(f'It is NOT safe to stop {names}: {err}')
+            logger.error(f'It is NOT safe to stop {names}: {err}')
            return False


can we also return err here?

I've added a commit to return a HandleCommandResult from the ok-to-stop command.
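As a rough illustration of what returning a result object (rather than a bare bool) buys here, the sketch below uses a hypothetical `CommandResult` tuple with `retval`, `stdout` and `stderr` fields as a stand-in for `HandleCommandResult`. The `check` callable and the sample values are assumptions for the example, not the code from this PR.

```python
import errno
from typing import Callable, List, NamedTuple, Tuple


class CommandResult(NamedTuple):
    """Hypothetical stand-in for HandleCommandResult (retval, stdout, stderr)."""
    retval: int = 0
    stdout: str = ''
    stderr: str = ''


def ok_to_stop(daemon_type: str,
               daemon_ids: List[str],
               check: Callable[[List[str]], Tuple[int, str]]) -> CommandResult:
    """Run a per-type check and keep its return code and error text."""
    names = [f'{daemon_type}.{d_id}' for d_id in daemon_ids]
    ret, err = check(names)
    if ret:
        # Pass the underlying code and message through instead of dropping them.
        return CommandResult(ret, '', f'It is NOT safe to stop {names}: {err}')
    return CommandResult(0, f'It is presumed safe to stop {names}', '')


def fake_busy_check(names: List[str]) -> Tuple[int, str]:
    """Illustrative check that always reports the daemons as busy."""
    return -errno.EBUSY, '65 PGs are already too degraded'


result = ok_to_stop('osd', ['0', '1'], fake_busy_check)
print(result.retval, result.stderr)
```

With this shape the caller can surface both the return code and the reason, which is what the review comment above asks for.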

jschmid1 · 2020-07-22T14:12:25Z

great stuff, just adding my thoughts here:

I can imagine the following commands being useful for various tools/users:

Is this host ok-to-stop
- yields 'ok' when all services running on this host are ok-to-stop
- probably used in other management tools, e.g. for reboots

Is this service ok-to-stop
- yields 'ok' when availability is ensured (one or more gateways/daemons up and running)
- probably used by users that want to restart a service without caring where the daemons are deployed

This will probably always return false. It's probably nonsensical to implement this.

Is this daemon ok-to-stop
- yields 'ok' when the daemon is not the last of its kind or is still serving connections.
- probably used by users that want to be very granular or other cephadm functions that need to determine if a service is ok-to-stop
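The host-level case described above (a host is ok-to-stop only when every daemon on it is) can be sketched roughly as follows. The function name, the `(daemon_type, daemon_id)` pairs and the per-type `checks` mapping are all hypothetical and only illustrate the aggregation; they are not the cephadm implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical per-type check: takes daemon ids, returns (ok, reason).
Check = Callable[[List[str]], Tuple[bool, str]]


def host_ok_to_stop(hostname: str,
                    daemons_on_host: Iterable[Tuple[str, str]],
                    checks: Dict[str, Check]) -> Tuple[bool, str]:
    """Group the host's daemons by type and ask each type's check in turn."""
    by_type: Dict[str, List[str]] = defaultdict(list)
    for daemon_type, daemon_id in daemons_on_host:
        by_type[daemon_type].append(daemon_id)

    for daemon_type, daemon_ids in by_type.items():
        ok, reason = checks[daemon_type](daemon_ids)
        if not ok:
            # Stop at the first failing type and keep its reason in the message.
            return False, f'It is NOT safe to stop host {hostname}: {reason}'
    return True, f'It is presumed safe to stop host {hostname}'


checks = {
    'osd': lambda ids: (len(ids) < 2, 'too many OSDs would go down'),
    'mon': lambda ids: (True, ''),
}
print(host_ok_to_stop('host1', [('osd', '0'), ('mon', 'a')], checks))
print(host_ok_to_stop('host2', [('osd', '1'), ('osd', '2')], checks))
```

Keeping the failing check's reason in the returned message also covers the earlier request to return the error string from the underlying ok-to-stop call.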

ricardoasmarques

lgtm

mgfritch · 2020-07-23T20:38:39Z

Adds a per daemon ok-to-stop command:
$ ceph orch ok-to-stop osd 0 1 2
Error ENOENT: It is NOT safe to stop ['osd.0', 'osd.1', 'osd.2']
ENOENT is an odd error number for this.

Apparently ENOENT was the default errno used for all Orchestrator exceptions. I've extended these to take an errno param (using a default of EINVAL).

This should now use the errno (and stderr) returned by ok-to-stop:

$ ceph orch ok-to-stop osd 0 1 2
Error EBUSY: It is NOT safe to stop ['osd.0', 'osd.1', 'osd.2']: 65 PGs are already too degraded, would become too degraded or might become unavailable
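A minimal sketch of the idea, assuming an exception class that stores an errno code and defaults to EINVAL. `SketchOrchestratorError`, its `err` attribute and `format_cli_error` are hypothetical names for illustration and not the actual `OrchestratorError` signature.

```python
import errno


class SketchOrchestratorError(Exception):
    """Illustrative stand-in: an exception that carries an errno code."""

    def __init__(self, msg: str, err: int = errno.EINVAL) -> None:
        super().__init__(msg)
        self.err = err  # EINVAL unless the caller supplies something more specific


def format_cli_error(exc: SketchOrchestratorError) -> str:
    """Render the exception in the same shape as the CLI output quoted above."""
    name = errno.errorcode.get(exc.err, str(exc.err))
    return f'Error {name}: {exc}'


busy = SketchOrchestratorError("It is NOT safe to stop ['osd.0']", errno.EBUSY)
print(format_cli_error(busy))
print(format_cli_error(SketchOrchestratorError('unknown daemon_type "foo"')))
```

The point of the default is that callers which do not care about a specific code still get a sensible one, while ok-to-stop can forward the code it received.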

sebastian-philipp · 2020-07-24T08:00:03Z

jenkins test make check

sebastian-philipp · 2020-07-24T08:00:13Z

jenkins test dashboard backend

sebastian-philipp · 2020-07-24T09:33:39Z

src/pybind/mgr/orchestrator/module.py

+
+    @_cli_write_command(
+        'orch ok-to-stop',
+        "name=daemon_type,type=CephString "
+        "name=daemon_ids,type=CephString,n=N",
+        desc='Check if the specified daemons can be safely stopped without reducing availability')
+    def _ok_to_stop(self, daemon_type: str, daemon_ids: List[str]):
+        completion = self.ok_to_stop(daemon_type, daemon_ids)
+        self._orchestrator_wait([completion])
+        raise_if_exception(completion)
+        return HandleCommandResult(stdout=completion.result_str())


It still feels odd. I mean, looking at https://docs.ceph.com/docs/master/api/mon_command_api/ we have:

ceph mds ok-to-stop

ceph mon ok-to-stop

ceph osd ok-to-stop

Wouldn't it make more sense to provide commands in a similar fashion? Like

ceph nfs ok-to-stop

ceph rgw ok-to-stop

etc?

I mean, what's the point in aliasing ceph osd ok-to-stop 0 1 2 with ceph orch ok-to-stop osd 0 1?

yeah that's redundant. I've dropped the per service type orch ok-to-stop command.

if we want to add orchestrator specific logic we should probably extend the existing mds/mon/osd commands or add new nfs/rgw ones etc.

sebastian-philipp · 2020-07-28T10:48:02Z

https://tracker.ceph.com/issues/46734

sebastian-philipp · 2020-07-28T10:49:27Z

jenkins test make check

sebastian-philipp · 2020-07-28T10:49:39Z

jenkins test dashboard backend

sebastian-philipp · 2020-07-28T14:09:29Z

2020-07-28T13:54:29.751 INFO:teuthology.orchestra.run.smithi081:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph orch host add raise_no_support
2020-07-28T13:54:30.040 INFO:tasks.ceph.mgr.x.smithi081.stderr:2020-07-28T13:54:30.039+0000 7fc50fb77700 -1 mgr.server reply reply (22) Invalid argument MON count must be either 1, 3 or 5
2020-07-28T13:54:30.041 INFO:teuthology.orchestra.run.smithi081.stderr:Error EINVAL: MON count must be either 1, 3 or 5
2020-07-28T13:54:30.044 DEBUG:teuthology.orchestra.run:got remote process result: 22
2020-07-28T13:54:30.045 INFO:teuthology.orchestra.run.smithi081:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log 'Ended test tasks.mgr.test_orchestrator_cli.TestOrchestratorCli.test_error'
2020-07-28T13:54:31.183 INFO:tasks.cephfs_test_runner:test_error (tasks.mgr.test_orchestrator_cli.TestOrchestratorCli) ... FAIL
2020-07-28T13:54:31.184 INFO:tasks.cephfs_test_runner:
2020-07-28T13:54:31.184 INFO:tasks.cephfs_test_runner:======================================================================
2020-07-28T13:54:31.185 INFO:tasks.cephfs_test_runner:FAIL: test_error (tasks.mgr.test_orchestrator_cli.TestOrchestratorCli)
2020-07-28T13:54:31.185 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-07-28T13:54:31.185 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-07-28T13:54:31.186 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-swagner-testing-2020-07-28-1314/qa/tasks/mgr/test_orchestrator_cli.py", line 165, in test_error
2020-07-28T13:54:31.186 INFO:tasks.cephfs_test_runner:    self.assertEqual(ret, errno.ENOENT)
2020-07-28T13:54:31.186 INFO:tasks.cephfs_test_runner:AssertionError: 22 != 2

https://pulpito.ceph.com/swagner-2020-07-28_13:34:22-rados:cephadm-wip-swagner-testing-2020-07-28-1314-distro-basic-smithi/5263894/
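For readers decoding the `AssertionError: 22 != 2` in the log above: the two integers are errno values, which Python's standard `errno` module can map back to names. The helper below is a hypothetical convenience for reading such logs, not part of the test suite.

```python
import errno


def errno_name(ret: int) -> str:
    """Map a return code (positive or negative) to its symbolic errno name."""
    return errno.errorcode.get(abs(ret), f'unknown({ret})')


expected_by_test = errno.ENOENT  # what test_error asserted
observed = 22                    # the "remote process result" in the log
print(errno_name(expected_by_test), errno_name(observed))
```

Read this way, the test still expected ENOENT while the command returned EINVAL, which matches the change of the default errno described earlier in the thread.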

tchaikov · 2020-07-29T12:28:12Z

mgfritch added orchestrator cephadm labels Jul 21, 2020

mgfritch requested a review from ricardoasmarques July 21, 2020 22:01

mgfritch requested a review from a team as a code owner July 21, 2020 22:01

sebastian-philipp suggested changes Jul 22, 2020

ricardoasmarques approved these changes Jul 22, 2020

mgfritch force-pushed the cephadm-ok-to-stop branch from 12b7ef9 to 079dab5 Compare July 23, 2020 20:32

mgfritch force-pushed the cephadm-ok-to-stop branch from 079dab5 to d5d6bcd Compare July 23, 2020 21:10

sebastian-philipp reviewed Jul 24, 2020

ricardoasmarques mentioned this pull request Jul 24, 2020

Add 'ceph-salt update [--reboot] [minion_id]' command ceph/ceph-salt#303

Merged

mgfritch force-pushed the cephadm-ok-to-stop branch from d5d6bcd to 45bdcf6 Compare July 27, 2020 20:26

sebastian-philipp approved these changes Jul 28, 2020

sebastian-philipp added needs-qa wip-swagner-testing My Teuthology tests labels Jul 28, 2020

sebastian-philipp removed the wip-swagner-testing My Teuthology tests label Jul 28, 2020

mgfritch added 3 commits July 28, 2020 15:54

mgr/orch: add errno to OrchestratorError

60b99dc

add errno to OrchestratorError and ServiceSpecValidationError exceptions

Signed-off-by: Michael Fritch <mfritch@suse.com>

mgr/cephadm: return HandleCommandResult from ok_to_stop

2521a7c

- return output from the result of the ok_to_stop command
- log ok-to-stop result during all invocations

Signed-off-by: Michael Fritch <mfritch@suse.com>

mgr/cephadm: add orch host ok-to-stop command

d6fa2e2

$ ceph orch host ok-to-stop host1
It is presumed safe to stop host host1

Signed-off-by: Michael Fritch <mfritch@suse.com>

mgfritch force-pushed the cephadm-ok-to-stop branch from 45bdcf6 to d6fa2e2 Compare July 28, 2020 21:56

tchaikov added the wip-kefu-testing label Jul 29, 2020

tchaikov merged commit 45c3bed into ceph:master Jul 29, 2020

mgfritch deleted the cephadm-ok-to-stop branch July 29, 2020 17:41

sebastian-philipp mentioned this pull request Aug 4, 2020

octopus: cephadm batch backport August (1) #36450

Merged

mgfritch mentioned this pull request Oct 9, 2020

doc/dev/cephadm: Doc defining the design for host maintenance #37607

Merged


6 participants
