Skip to content

mgr/cephadm: add orch ok-to-stop commands#36232

Merged
tchaikov merged 3 commits intoceph:masterfrom
mgfritch:cephadm-ok-to-stop
Jul 29, 2020
Merged

mgr/cephadm: add orch ok-to-stop commands#36232
tchaikov merged 3 commits intoceph:masterfrom
mgfritch:cephadm-ok-to-stop

Conversation

@mgfritch
Copy link
Contributor

@mgfritch mgfritch commented Jul 21, 2020

Adds a per daemon ok-to-stop command:

$ ceph orch ok-to-stop osd 0 1 2
Error EBUSY: It is NOT safe to stop ['osd.0', 'osd.1', 'osd.2']: 65 PGs are already too degraded, would become too degraded or might become unavailable

Also adds a per host ok-to-stop command:

$ ceph orch host ok-to-stop host1
It is presumed safe to stop host host1

Signed-off-by: Michael Fritch mfritch@suse.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard backend
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@batrick
Copy link
Member

batrick commented Jul 21, 2020

Adds a per daemon ok-to-stop command:

$ ceph orch ok-to-stop osd 0 1 2
Error ENOENT: It is NOT safe to stop ['osd.0', 'osd.1', 'osd.2']

ENOENT is an odd error number for this.

@sebastian-philipp
Copy link
Contributor

Hm. do we really need a ceph orch ok-to-stop $daemon? I mean, we already have ceph osd ok-to-stop.

Comment on lines +1189 to +1191
if not self.cephadm_services[daemon_type].ok_to_stop(daemon_ids):
raise orchestrator.OrchestratorError(
f'It is NOT safe to stop host {hostname}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably return the error string returned by ceph * ok-to-stop here.

Comment on lines +2341 to +2357
@trivial_completion
def ok_to_stop(self,
daemon_type: str,
daemon_ids: List[str]):
names = [f'{daemon_type}.{d_id}' for d_id in daemon_ids]

if daemon_type not in ServiceSpec.KNOWN_SERVICE_TYPES:
raise orchestrator.OrchestratorError(
f'unknown daemon_type "{daemon_type}"')

if not self.cephadm_services[daemon_type].ok_to_stop(daemon_ids):
raise orchestrator.OrchestratorError(
f'It is NOT safe to stop {names}')

msg = f'It is presumed safe to stop {names}'
self.log.info(msg)
return msg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok now I see the point. for e.g. NFS etc. Hm.

Copy link
Contributor Author

@mgfritch mgfritch Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. do we really need a ceph orch ok-to-stop $daemon? I mean, we already have ceph osd ok-to-stop.

it's useful if we implement logic like this for NFS, iscsi etc, but it's also useful in case we'd decide to extend the existing ok-to-stop logic with Orchestrator specific logic ...

if ret:
logger.info(f'It is NOT safe to stop {names}: {err}')
logger.error(f'It is NOT safe to stop {names}: {err}')
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also return err here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a commit to return an HandleCommandResult from the ok-to-stop command.

@jschmid1
Copy link
Contributor

jschmid1 commented Jul 22, 2020

great stuff, just adding my thoughts here:

I can imagine the following commands being useful for various tools/users:

  • Is this host ok-to-stop
    • yields 'ok' when all services running on this host ok-to-stop
    • probably used in other management tools for i.e. reboots

* Is this service ok-to-stop
* yields 'ok' when availability is ensured. (one or more gateway/daemon up and running)
* probably used by users that want to restart a service without caring where the daemons are deployed.

This will probably always return false.. It's probably nonsensical to implement this.

  • Is this daemon ok-to-stop
    • yields 'ok' when the daemon is not the last of its kind or is still serving connections.
    • probably used by users that want to be very granular or other cephadm functions that need to determine if a service is ok-to-stop

Copy link
Contributor

@ricardoasmarques ricardoasmarques left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mgfritch mgfritch force-pushed the cephadm-ok-to-stop branch from 12b7ef9 to 079dab5 Compare July 23, 2020 20:32
@mgfritch
Copy link
Contributor Author

mgfritch commented Jul 23, 2020

Adds a per daemon ok-to-stop command:

$ ceph orch ok-to-stop osd 0 1 2
Error ENOENT: It is NOT safe to stop ['osd.0', 'osd.1', 'osd.2']

ENOENT is an odd error number for this.

apparently ENOENT was the default errno used for all Orchestrator exceptions. I've extend these to take an errno param (using a default of EINVAL).

This should now use the errno (and stderr) returned by ok-to-stop:

$ ceph orch ok-to-stop osd 0 1 2
Error EBUSY: It is NOT safe to stop ['osd.0', 'osd.1', 'osd.2']: 65 PGs are already too degraded, would become too degraded or might become unavailable

@mgfritch mgfritch force-pushed the cephadm-ok-to-stop branch from 079dab5 to d5d6bcd Compare July 23, 2020 21:10
@sebastian-philipp
Copy link
Contributor

jenkins test make check

@sebastian-philipp
Copy link
Contributor

jenkins test dashboard backend

Comment on lines +1557 to +1567

@_cli_write_command(
'orch ok-to-stop',
"name=daemon_type,type=CephString "
"name=daemon_ids,type=CephString,n=N",
desc='Check if the specified daemons can be safely stopped without reducing availability')
def _ok_to_stop(self, daemon_type: str, daemon_ids: List[str]):
completion = self.ok_to_stop(daemon_type, daemon_ids)
self._orchestrator_wait([completion])
raise_if_exception(completion)
return HandleCommandResult(stdout=completion.result_str())
Copy link
Contributor

@sebastian-philipp sebastian-philipp Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still feels odd. I mean, looking at https://docs.ceph.com/docs/master/api/mon_command_api/ we have:

Woudn't it make more sense to provide commands in a similar fashion? Like

  • ceph nfs ok-to-stop
  • ceph rgw ok-to-stop

etc?

I mean, what's the point in aliasing ceph osd ok-to-stop 0 1 2 with ceph orch ok-to-stop osd 0 1?

Copy link
Contributor Author

@mgfritch mgfritch Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's redundant. I've dropped the per service type orch ok-to-stop command.

if we want to add orchestrator specific logic we should probably extend the the existing mds/mon/osd commands or add new nfs/rgw ones etc.

@sebastian-philipp
Copy link
Contributor

@sebastian-philipp
Copy link
Contributor

jenkins test make check

@sebastian-philipp
Copy link
Contributor

jenkins test dashboard backend

@sebastian-philipp
Copy link
Contributor

2020-07-28T13:54:29.751 INFO:teuthology.orchestra.run.smithi081:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph orch host add raise_no_support
2020-07-28T13:54:30.040 INFO:tasks.ceph.mgr.x.smithi081.stderr:2020-07-28T13:54:30.039+0000 7fc50fb77700 -1 mgr.server reply reply (22) Invalid argument MON count must be either 1, 3 or 5
2020-07-28T13:54:30.041 INFO:teuthology.orchestra.run.smithi081.stderr:Error EINVAL: MON count must be either 1, 3 or 5
2020-07-28T13:54:30.044 DEBUG:teuthology.orchestra.run:got remote process result: 22
2020-07-28T13:54:30.045 INFO:teuthology.orchestra.run.smithi081:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early log 'Ended test tasks.mgr.test_orchestrator_cli.TestOrchestratorCli.test_error'
2020-07-28T13:54:31.183 INFO:tasks.cephfs_test_runner:test_error (tasks.mgr.test_orchestrator_cli.TestOrchestratorCli) ... FAIL
2020-07-28T13:54:31.184 INFO:tasks.cephfs_test_runner:
2020-07-28T13:54:31.184 INFO:tasks.cephfs_test_runner:======================================================================
2020-07-28T13:54:31.185 INFO:tasks.cephfs_test_runner:FAIL: test_error (tasks.mgr.test_orchestrator_cli.TestOrchestratorCli)
2020-07-28T13:54:31.185 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-07-28T13:54:31.185 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-07-28T13:54:31.186 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-swagner-testing-2020-07-28-1314/qa/tasks/mgr/test_orchestrator_cli.py", line 165, in test_error
2020-07-28T13:54:31.186 INFO:tasks.cephfs_test_runner:    self.assertEqual(ret, errno.ENOENT)
2020-07-28T13:54:31.186 INFO:tasks.cephfs_test_runner:AssertionError: 22 != 2

https://pulpito.ceph.com/swagner-2020-07-28_13:34:22-rados:cephadm-wip-swagner-testing-2020-07-28-1314-distro-basic-smithi/5263894/

@sebastian-philipp sebastian-philipp removed the wip-swagner-testing My Teuthology tests label Jul 28, 2020
mgfritch added 3 commits July 28, 2020 15:54
add errno to OrchestratorError and ServiceSpecValidationError exceptions

Signed-off-by: Michael Fritch <mfritch@suse.com>
- return output from the result of the ok_to_stop command
- log ok-to-stop result during all invocations

Signed-off-by: Michael Fritch <mfritch@suse.com>
$ ceph orch host ok-to-stop host1
It is presumed safe to stop host host1

Signed-off-by: Michael Fritch <mfritch@suse.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants