mgr/cephadm: Reschedule nfs daemons from offline hosts #44343
adk3798 merged 1 commit into ceph:master
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

jenkins retest this please

jenkins test make check

Most Failures tracked by: and a couple upgrade test failures. One where it mysteriously failed redeploying a grafana daemon (the last daemon redeployed during upgrade) and one where it looped on the mgr needing to upgrade itself. Errors don't seem related to the PRs in the run.

random.Random(seed).shuffle(final)
return ls

def remove_non_maintenance_unreachable_candidates(self, candidates: List[DaemonPlacement]) -> List[DaemonPlacement]:
I wasn't able to figure out which hosts are considered to be in maintenance, so I also couldn't figure out why we include hosts in maintenance mode as candidates.
we don't want to re-schedule daemons which are placed on the nodes in maintenance mode ...
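
The filtering rule discussed in this thread can be sketched as follows. This is only an illustration under stated assumptions, not the PR's actual implementation: `Placement` is a hypothetical, simplified stand-in for `DaemonPlacement`, and the two host sets are assumed to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Placement:
    """Hypothetical, simplified stand-in for DaemonPlacement."""
    daemon_type: str
    hostname: str


def remove_non_maintenance_unreachable(
    candidates: List[Placement],
    unreachable_hosts: Set[str],
    maintenance_hosts: Set[str],
) -> List[Placement]:
    """Drop candidates on hosts that are offline and NOT in maintenance.

    Hosts in maintenance mode stay in the candidate list, so daemons
    placed on them are not rescheduled elsewhere.
    """
    kept = []
    for c in candidates:
        offline = c.hostname in unreachable_hosts
        in_maintenance = c.hostname in maintenance_hosts
        if offline and not in_maintenance:
            continue  # truly offline host: no longer a valid candidate
        kept.append(c)
    return kept


candidates = [Placement('nfs', h) for h in ('host1', 'host2', 'host3')]
result = remove_non_maintenance_unreachable(
    candidates,
    unreachable_hosts={'host2', 'host3'},
    maintenance_hosts={'host3'},
)
print([c.hostname for c in result])  # ['host1', 'host3']
```

Here `host3` is unreachable but in maintenance, so it remains a candidate; only `host2` is dropped.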
mgfritch left a comment
can we split off the agent change into a separate PR ?
CEPH_TYPES = ['mgr', 'mon', 'crash', 'osd', 'mds', 'rgw', 'rbd-mirror', 'cephfs-mirror']
GATEWAY_TYPES = ['iscsi', 'nfs']
MONITORING_STACK_TYPES = ['node-exporter', 'prometheus', 'alertmanager', 'grafana', 'loki', 'promtail']
RESCHEDULE_FROM_OFFLINE_HOSTS_TYPES = ['nfs']
at some point I think we'll want to add the other stateless services here (e.g. mds, etc)
Yeah, I think so. I just want this PR to cover only nfs, as that's the priority.
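
As a rough sketch of how a type list like this could gate rescheduling: only the constant `RESCHEDULE_FROM_OFFLINE_HOSTS_TYPES` comes from the PR. The helper `eligible_for_reschedule` and the `(daemon_type, hostname)` tuples are hypothetical and used for illustration only.

```python
from typing import List, Set, Tuple

# Constant from the PR: only nfs daemons are rescheduled for now.
RESCHEDULE_FROM_OFFLINE_HOSTS_TYPES = ['nfs']


def eligible_for_reschedule(
    daemons: List[Tuple[str, str]],
    offline_hosts: Set[str],
) -> List[Tuple[str, str]]:
    """Return the (daemon_type, hostname) pairs that sit on an offline
    host and whose type is allowed to be rescheduled (hypothetical helper)."""
    return [
        (daemon_type, hostname)
        for daemon_type, hostname in daemons
        if daemon_type in RESCHEDULE_FROM_OFFLINE_HOSTS_TYPES
        and hostname in offline_hosts
    ]


daemons = [('nfs', 'host1'), ('nfs', 'host2'), ('mds', 'host2'), ('grafana', 'host2')]
eligible = eligible_for_reschedule(daemons, {'host2'})
print(eligible)  # [('nfs', 'host2')]
```

With this shape, extending the behavior to other stateless services would amount to adding their types (e.g. `'mds'`) to the list.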
In order to improve nfs availability, if there are other hosts we can place an nfs daemon on or if there is a host with a lower rank nfs daemon when a higher rank one is on an offline host, we should reschedule the nfs daemons

Signed-off-by: Adam King <adking@redhat.com>

Changelog: Removed all agent related work here so it's just the rescheduling of nfs daemons

jenkins test api

2 Failures caused by wrong error code from host add command due to another PR included in the run

jenkins test api