mgr/cephadm: Reschedule nfs daemons from offline hosts #44343

Merged
adk3798 merged 1 commit into ceph:master from adk3798:nfs-offline
Mar 25, 2022

Conversation

@adk3798
Contributor

@adk3798 adk3798 commented Dec 16, 2021

To improve NFS availability, we should reschedule NFS daemons
when there are other hosts we can place an NFS daemon on, or
when a host has a lower-rank NFS daemon while a higher-rank
one is on an offline host.

Signed-off-by: Adam King <adking@redhat.com>

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@adk3798 adk3798 requested a review from a team as a code owner December 16, 2021 22:55
@github-actions

github-actions bot commented Jan 5, 2022

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@adk3798
Contributor Author

adk3798 commented Jan 18, 2022

jenkins retest this please

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@adk3798
Contributor Author

adk3798 commented Feb 14, 2022

jenkins test make check

@adk3798
Contributor Author

adk3798 commented Feb 14, 2022

@adk3798
Contributor Author

adk3798 commented Feb 25, 2022

http://pulpito.front.sepia.ceph.com/adking-2022-02-24_02:51:26-orch:cephadm-wip-adk-testing-2022-02-23-1713-distro-basic-smithi/

Most failures are tracked by:
https://tracker.ceph.com/issues/54389
https://tracker.ceph.com/issues/54304

There were also a couple of upgrade test failures: one where redeploying a grafana daemon (the last daemon redeployed during upgrade) mysteriously failed, and one where the upgrade looped on the mgr needing to upgrade itself. The errors don't seem related to the PRs in the run.

@adk3798
Contributor Author

adk3798 commented Feb 28, 2022

@ajarr ajarr self-requested a review March 8, 2022 18:23
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Contributor

@ajarr ajarr left a comment

Thanks @adk3798! I was trying to follow the changes. Since I'm not very familiar with the cephadm codebase, I have a couple of questions.

random.Random(seed).shuffle(final)
return ls

def remove_non_maintenance_unreachable_candidates(self, candidates: List[DaemonPlacement]) -> List[DaemonPlacement]:
Contributor

I couldn't figure out which hosts are considered to be in maintenance, so I also couldn't figure out why we include hosts in maintenance mode as candidates.

Contributor

We don't want to reschedule daemons that are placed on nodes in maintenance mode ...
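The filtering idea discussed in this thread can be sketched roughly as follows. This is a minimal standalone illustration, not the code from the PR: the `DaemonPlacement` dataclass here is a simplified stand-in, and the `offline_hosts` and `maintenance_hosts` parameters are assumptions made for the example (the real method takes only `candidates` and gets host state from elsewhere in cephadm).

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class DaemonPlacement:
    # Hypothetical stand-in for cephadm's DaemonPlacement; only the
    # fields needed for this sketch are modelled.
    daemon_type: str
    hostname: str


def remove_non_maintenance_unreachable_candidates(
    candidates: List[DaemonPlacement],
    offline_hosts: Set[str],       # assumed input: hosts currently unreachable
    maintenance_hosts: Set[str],   # assumed input: hosts in maintenance mode
) -> List[DaemonPlacement]:
    """Drop candidates on hosts that are offline but NOT in maintenance.

    Hosts in maintenance stay in the list, so daemons already placed
    there are not rescheduled elsewhere.
    """
    return [
        c for c in candidates
        if c.hostname not in offline_hosts or c.hostname in maintenance_hosts
    ]


candidates = [
    DaemonPlacement('nfs', 'host1'),  # online
    DaemonPlacement('nfs', 'host2'),  # offline, not in maintenance
    DaemonPlacement('nfs', 'host3'),  # offline, in maintenance
]
kept = remove_non_maintenance_unreachable_candidates(
    candidates,
    offline_hosts={'host2', 'host3'},
    maintenance_hosts={'host3'},
)
print([c.hostname for c in kept])  # ['host1', 'host3']
```

In this sketch only `host2` is dropped, because it is unreachable without being in maintenance; `host3` remains a candidate, so its daemon stays where it is.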

Contributor

@mgfritch mgfritch left a comment

Can we split off the agent change into a separate PR?

CEPH_TYPES = ['mgr', 'mon', 'crash', 'osd', 'mds', 'rgw', 'rbd-mirror', 'cephfs-mirror']
GATEWAY_TYPES = ['iscsi', 'nfs']
MONITORING_STACK_TYPES = ['node-exporter', 'prometheus', 'alertmanager', 'grafana', 'loki', 'promtail']
RESCHEDULE_FROM_OFFLINE_HOSTS_TYPES = ['nfs']
Contributor

At some point I think we'll want to add the other stateless services here (e.g. mds).

Contributor Author

Yeah, I think so. I just want to keep this PR to nfs only, as that's the priority.
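As a rough illustration of how a type list like this could gate rescheduling, consider the sketch below. The `should_reschedule` helper and its parameters are hypothetical and exist only for this example; they are not taken from the PR. Only the constant name and its value come from the diff above.

```python
from typing import Set

# From the diff: only nfs daemons are rescheduled from offline hosts for now.
RESCHEDULE_FROM_OFFLINE_HOSTS_TYPES = ['nfs']


def should_reschedule(daemon_type: str, hostname: str,
                      offline_hosts: Set[str],
                      maintenance_hosts: Set[str]) -> bool:
    # Hypothetical helper: a daemon is moved only if its type is opted in
    # and its host is offline without being in maintenance mode.
    if daemon_type not in RESCHEDULE_FROM_OFFLINE_HOSTS_TYPES:
        return False
    return hostname in offline_hosts and hostname not in maintenance_hosts


offline = {'host2', 'host3'}
maintenance = {'host3'}
print(should_reschedule('nfs', 'host2', offline, maintenance))  # True
print(should_reschedule('nfs', 'host3', offline, maintenance))  # False
print(should_reschedule('mds', 'host2', offline, maintenance))  # False
```

Under this assumption, extending the behaviour to other stateless services later would amount to appending their types to the list.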

@adk3798
Contributor Author

adk3798 commented Mar 22, 2022

Changelog: Removed all agent-related work, so this PR now covers only the rescheduling of nfs daemons.

@adk3798
Contributor Author

adk3798 commented Mar 24, 2022

jenkins test api

@adk3798
Contributor Author

adk3798 commented Mar 24, 2022

http://pulpito.front.sepia.ceph.com/adking-2022-03-23_02:54:35-orch:cephadm-wip-adk-testing-2022-03-22-2000-distro-basic-smithi/

2 failures, both caused by a wrong error code from the host add command, due to another PR included in the run.

@adk3798
Contributor Author

adk3798 commented Mar 25, 2022

jenkins test api

@adk3798 adk3798 merged commit 56f20fe into ceph:master Mar 25, 2022
@adk3798 adk3798 mentioned this pull request Mar 30, 2022
14 tasks
@adk3798 adk3798 mentioned this pull request Apr 27, 2022
14 tasks

4 participants