mgr/cephadm: Reschedule nfs daemons from offline hosts by adk3798 · Pull Request #44343 · ceph/ceph

adk3798 · 2021-12-16T22:55:40Z

To improve nfs availability, reschedule nfs daemons away from
offline hosts when either there are other hosts we can place an
nfs daemon on, or there is a host with a lower rank nfs daemon
while a higher rank one is on an offline host.

Signed-off-by: Adam King <adking@redhat.com>
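
As a rough illustration of the first case in the description (a spare online host exists), the sketch below plans moves for daemons that sit on offline hosts. It is not cephadm's scheduler: `NfsDaemon`, `plan_reschedule`, and the host names are hypothetical, and the rank-based case from the description is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class NfsDaemon:
    """Hypothetical stand-in for a ranked nfs daemon (not a cephadm class)."""
    rank: int
    host: str


def plan_reschedule(daemons: List[NfsDaemon],
                    all_hosts: List[str],
                    offline: Set[str]) -> Dict[int, str]:
    """Map rank -> new host for daemons sitting on offline hosts.

    Only online hosts that do not already run an nfs daemon are used
    as targets; daemons with no free target are left where they are.
    """
    busy = {d.host for d in daemons if d.host not in offline}
    free = [h for h in all_hosts if h not in offline and h not in busy]
    moves: Dict[int, str] = {}
    for d in sorted(daemons, key=lambda d: d.rank):
        if d.host in offline and free:
            moves[d.rank] = free.pop(0)
    return moves


daemons = [NfsDaemon(rank=0, host="host1"), NfsDaemon(rank=1, host="host2")]
moves = plan_reschedule(daemons, ["host1", "host2", "host3"], offline={"host1"})
print(moves)
```

With no spare host available, the function returns an empty plan and the daemon stays assigned to its offline host.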

github-actions · 2022-01-05T10:03:44Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

adk3798 · 2022-01-18T16:48:27Z

jenkins retest this please

github-actions · 2022-01-28T00:04:41Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

adk3798 · 2022-02-14T13:28:54Z

jenkins test make check

adk3798 · 2022-02-14T14:15:22Z

http://pulpito.front.sepia.ceph.com/adking-2022-02-11_16:35:44-orch:cephadm-wip-adk-testing-2022-02-10-1637-distro-basic-smithi/

only ingress failures tracked by https://tracker.ceph.com/issues/53807

adk3798 · 2022-02-25T13:28:01Z

http://pulpito.front.sepia.ceph.com/adking-2022-02-24_02:51:26-orch:cephadm-wip-adk-testing-2022-02-23-1713-distro-basic-smithi/

Most Failures tracked by:
https://tracker.ceph.com/issues/54389
https://tracker.ceph.com/issues/54304

There were also a couple of upgrade test failures: one where redeploying a grafana daemon (the last daemon redeployed during the upgrade) mysteriously failed, and one where the upgrade looped on the mgr needing to upgrade itself. The errors don't seem related to the PRs in the run.

adk3798 · 2022-02-28T13:19:57Z

http://pulpito.front.sepia.ceph.com/adking-2022-02-28_06:27:32-orch:cephadm-wip-adk-testing-2022-02-27-1813-distro-basic-smithi/

green

github-actions · 2022-03-16T13:27:10Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

ajarr

Thanks @adk3798 ! I was trying to follow the changes. Since I'm not very familiar with the cephadm codebase, I have a couple of questions.

ajarr · 2022-03-18T19:47:14Z

src/pybind/mgr/cephadm/schedule.py

        random.Random(seed).shuffle(final)
        return ls
+
+    def remove_non_maintenance_unreachable_candidates(self, candidates: List[DaemonPlacement]) -> List[DaemonPlacement]:


I wasn't able to figure out which hosts are considered to be in maintenance, so I also couldn't work out why we include hosts in maintenance mode as candidates.

we don't want to re-schedule daemons which are placed on the nodes in maintenance mode ...
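
To make the exchange above concrete, here is a minimal sketch of a filter that drops candidates on unreachable hosts but keeps those whose host is in maintenance mode. The behavior is inferred only from the method name and the reply above, not from the method body in the PR; `Placement` and `drop_unreachable_unless_maintenance` are hypothetical, simplified stand-ins.

```python
from typing import List, NamedTuple, Set


class Placement(NamedTuple):
    """Simplified, hypothetical stand-in for cephadm's DaemonPlacement."""
    daemon_type: str
    hostname: str


def drop_unreachable_unless_maintenance(candidates: List[Placement],
                                        unreachable: Set[str],
                                        maintenance: Set[str]) -> List[Placement]:
    """Keep a candidate if its host is reachable or in maintenance mode.

    Hosts in maintenance are assumed to be offline on purpose, so their
    daemons should stay put instead of being rescheduled elsewhere.
    """
    return [
        c for c in candidates
        if c.hostname not in unreachable or c.hostname in maintenance
    ]


candidates = [Placement("nfs", h) for h in ("host1", "host2", "host3")]
kept = drop_unreachable_unless_maintenance(
    candidates,
    unreachable={"host2", "host3"},
    maintenance={"host3"},
)
print([c.hostname for c in kept])
```

In this example `host2` is dropped because it is unreachable and not in maintenance, while `host3` stays a candidate even though it is unreachable.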

mgfritch

can we split off the agent change into a separate PR ?

mgfritch · 2022-03-22T15:51:35Z

src/pybind/mgr/cephadm/schedule.py

we don't want to re-schedule daemons which are placed on the nodes in maintenance mode ...

mgfritch · 2022-03-22T15:52:24Z

src/pybind/mgr/cephadm/utils.py

 CEPH_TYPES = ['mgr', 'mon', 'crash', 'osd', 'mds', 'rgw', 'rbd-mirror', 'cephfs-mirror']
 GATEWAY_TYPES = ['iscsi', 'nfs']
 MONITORING_STACK_TYPES = ['node-exporter', 'prometheus', 'alertmanager', 'grafana', 'loki', 'promtail']
+RESCHEDULE_FROM_OFFLINE_HOSTS_TYPES = ['nfs']


at some point I think we'll want to add the other stateless services here (e.g. mds, etc)

Yeah, I think so. I just want to keep this PR to nfs only, as that's the priority.
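
A small sketch of how a type list like this can gate the behavior: only daemon types named in the list are considered for rescheduling when their host is offline. The constant mirrors the diff above; the helper `should_reschedule` is hypothetical and not part of the PR.

```python
from typing import List

# Mirrors the constant added in the diff above.
RESCHEDULE_FROM_OFFLINE_HOSTS_TYPES: List[str] = ['nfs']


def should_reschedule(daemon_type: str, host_is_offline: bool) -> bool:
    """Hypothetical helper: reschedule only listed types on offline hosts."""
    return host_is_offline and daemon_type in RESCHEDULE_FROM_OFFLINE_HOSTS_TYPES


print(should_reschedule('nfs', True))
print(should_reschedule('mds', True))
```

Under this sketch, covering other stateless services later would only mean appending their type names to the list.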

adk3798 · 2022-03-22T23:00:00Z

Changelog: Removed all agent-related work, so this PR now covers only the rescheduling of nfs daemons

adk3798 · 2022-03-24T19:03:54Z

jenkins test api

adk3798 · 2022-03-24T19:05:12Z

http://pulpito.front.sepia.ceph.com/adking-2022-03-23_02:54:35-orch:cephadm-wip-adk-testing-2022-03-22-2000-distro-basic-smithi/

2 failures, both caused by a wrong error code from the host add command due to another PR included in the run

adk3798 · 2022-03-25T15:56:17Z

jenkins test api

adk3798 requested a review from a team as a code owner December 16, 2021 22:55

github-actions bot added cephadm pybind labels Dec 16, 2021

sebastian-philipp reviewed Dec 17, 2021

adk3798 force-pushed the nfs-offline branch from 73c3dff to 7cb0bac Compare December 17, 2021 23:27

adk3798 force-pushed the nfs-offline branch from 7cb0bac to 3104b1a Compare January 4, 2022 13:49

github-actions bot added the needs-rebase label Jan 5, 2022

adk3798 force-pushed the nfs-offline branch from 3104b1a to 070f1d5 Compare January 5, 2022 23:40

github-actions bot removed the needs-rebase label Jan 5, 2022

sebastian-philipp reviewed Jan 18, 2022

adk3798 force-pushed the nfs-offline branch from 070f1d5 to 106e7fe Compare January 18, 2022 20:39

adk3798 force-pushed the nfs-offline branch from 106e7fe to f469d37 Compare January 25, 2022 22:05

adk3798 added wip-adk-testing and removed wip-adk-testing labels Jan 26, 2022

github-actions bot added the needs-rebase label Jan 28, 2022

adk3798 force-pushed the nfs-offline branch from f469d37 to 9a777c9 Compare January 31, 2022 15:27

github-actions bot removed the needs-rebase label Jan 31, 2022

adk3798 added the wip-adk-testing label Jan 31, 2022

adk3798 added the needs-review label Feb 14, 2022

ajarr self-requested a review March 8, 2022 18:23

github-actions bot added the needs-rebase label Mar 16, 2022

mgfritch added the wip-mgfritch-testing label Mar 17, 2022

adk3798 force-pushed the nfs-offline branch from 9a777c9 to d56fda9 Compare March 17, 2022 17:06

github-actions bot removed the needs-rebase label Mar 17, 2022

ajarr reviewed Mar 18, 2022

mgfritch reviewed Mar 22, 2022

adk3798 force-pushed the nfs-offline branch from d56fda9 to 9febc21 Compare March 22, 2022 22:58

mgfritch approved these changes Mar 25, 2022

adk3798 merged commit 56f20fe into ceph:master Mar 25, 2022

adk3798 mentioned this pull request Mar 30, 2022

Cephadm Pacific Batch Backport March #45716

Merged

14 tasks

adk3798 mentioned this pull request Apr 27, 2022

quincy: Cephadm Batch Backport April #46055

Merged

14 tasks
