
mgr/cephadm: make scheduling make sense#34633

Merged
sebastian-philipp merged 9 commits into ceph:master from sebastian-philipp:cephadm-total-scheduler
Jun 9, 2020

Conversation

@sebastian-philipp
Contributor

@sebastian-philipp sebastian-philipp commented Apr 19, 2020

Depends on


Consider this placement:

placement:
  hosts:
  - myhost
  count: 2

Previously, if there was already a daemon running on myhost, cephadm would schedule another daemon on a random other host.

Now, cephadm will only ever schedule one daemon on myhost.
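To make the new semantics concrete, here is a minimal, hypothetical sketch (the names are illustrative, not the actual cephadm API): the explicit host list acts as a hard bound on where daemons may run.

```python
# Hypothetical sketch of "placement as bound" (illustrative names,
# not the real cephadm scheduler).

def place(explicit_hosts, count, all_hosts):
    # Only hosts named in the placement are candidates; count can
    # never push daemons onto hosts outside the explicit list.
    candidates = [h for h in all_hosts if h in explicit_hosts]
    return candidates[:count]

# hosts: [myhost], count: 2 -> only one daemon is scheduled
print(place(['myhost'], 2, ['myhost', 'otherhost']))  # ['myhost']
```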

Open questions:

  • How can we merge this without breaking existing clusters?

TODO:

  • introduce a migration from the old to the new scheduler that changes the placement specs to reflect reality.

Also fixed:

HostAssignment(
    spec=ServiceSpec('mon', placement=PlacementSpec(
        hosts=['host1'],
        count=3,
    )),
    get_hosts_func=lambda _: ['host1', 'host2'],
    get_daemons_func=lambda _: []).place()
# returns
['host1', 'host1', 'host2']

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@sebastian-philipp sebastian-philipp requested a review from a team as a code owner April 19, 2020 14:04
@sebastian-philipp sebastian-philipp changed the title from "mgr/cephadm: verify explicit placement actually works." to "mgr/cephadm: make scheduling make sense" Apr 20, 2020
@sebastian-philipp sebastian-philipp changed the title from "mgr/cephadm: make scheduling make sense" to "[WIP]: mgr/cephadm: make scheduling make sense" Apr 20, 2020
@jschmid1
Contributor

discussion here: https://pad.ceph.com/p/orchestration-weekly

@sebastian-philipp sebastian-philipp force-pushed the cephadm-total-scheduler branch 2 times, most recently from f4d6669 to 29a8399 Compare April 21, 2020 16:43
@sebastian-philipp sebastian-philipp force-pushed the cephadm-total-scheduler branch 2 times, most recently from 457ce7a to 30785bb Compare May 6, 2020 13:52
@sebastian-philipp sebastian-philipp force-pushed the cephadm-total-scheduler branch from 30785bb to 2e13612 Compare May 6, 2020 16:27
@sebastian-philipp sebastian-philipp force-pushed the cephadm-total-scheduler branch from 2e13612 to cf1e104 Compare May 6, 2020 21:27
@sebastian-philipp sebastian-philipp changed the title [WIP]: mgr/cephadm: make scheduling make sense mgr/cephadm: make scheduling make sense May 6, 2020
@sebastian-philipp
Contributor Author

ping @jschmid1

Contributor

@jschmid1 jschmid1 left a comment

this needs a closer review; consider this a first round of reviews

)

new_spec = ServiceSpec.from_json(spec.to_json())
new_spec.placement = new_placement
Contributor

I can also see an issue where a user applies a cluster.yaml only when changing a single service. Since we converted the placement_glob (or pattern) to an explicit placement spec, they would probably have to export and overwrite their existing cluster.yaml.

I think this should be recommended somehow.

Contributor Author

Right, that's a big problem, but I didn't have a better answer.

Contributor

Having previews for all services would be a first step, I think. If placements changed completely, we could at least raise a warning before applying it.

Contributor Author

Both schedulers behave very similarly when it comes to assigning hosts. It's just that the old one assigned more hosts than the new one.

Contributor

I'm still struggling with this concept, but I suppose it's similar to k8s nodeName ...
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename

Contributor Author

yes. But what's the alternative? What if the user mistakenly created a placement like

service_type: mon
placement:
    label: mymons
    count: 5

but only a single host was labeled as mymons? Should we simply remove all the mons? I'm also against being more elaborate here, as this increases the likelihood of bugs. I'd really be in favor of keeping this as simple as possible to stay safe.
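The label case can be sketched the same way (again with hypothetical names, not the real cephadm code): count is only an upper bound on the hosts that actually carry the label, so a mislabeled cluster degrades gracefully instead of triggering removals.

```python
# Illustrative sketch: label-based placement with count as an upper bound.

def candidates_for(label, count, host_labels):
    # host_labels maps hostname -> set of labels on that host.
    matching = [h for h, labels in host_labels.items() if label in labels]
    return matching[:count]

hosts = {'host1': {'mymons'}, 'host2': set(), 'host3': set()}
# count: 5, but only one host is labeled -> one candidate, nothing removed
print(candidates_for('mymons', 5, hosts))  # ['host1']
```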

@sebastian-philipp sebastian-philipp force-pushed the cephadm-total-scheduler branch from cf1e104 to c1d9dc9 Compare May 8, 2020 07:53
@sebastian-philipp sebastian-philipp added the wip-swagner-testing My Teuthology tests label May 13, 2020
@sebastian-philipp
Contributor Author

spoiler: it doesn't.
Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
Like,

```yaml
placement:
  hosts:
  - myhost
  count: 3
```
will always only schedule one daemon on `myhost`.

Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
@sebastian-philipp sebastian-philipp force-pushed the cephadm-total-scheduler branch from 3a04af6 to a3dfaa0 Compare June 3, 2020 10:18
New scheduler that takes PlacementSpec as the bound and not as a recommendation.

This means we have to make sure we're not removing any daemons directly after
upgrading to the new scheduler.

There is a potential race here:

1. user updates the spec to remove daemons
2. mgr gets upgraded to the new scheduler before the old scheduler removed the daemon
3. now, we're converting the spec to explicit placement, thus reverting (1.)

I think this is ok.

Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
@sebastian-philipp sebastian-philipp force-pushed the cephadm-total-scheduler branch from a3dfaa0 to b307021 Compare June 3, 2020 12:22
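The upgrade conversion described in the commit message above could look roughly like this (a hypothetical sketch, not the actual migration code): the spec is pinned to the hosts that currently run daemons, so the bound-style scheduler removes nothing right after the upgrade.

```python
# Hypothetical sketch of the old-spec -> explicit-placement migration.

def migrate_spec(old_placement, running_daemon_hosts):
    # Pin the placement to the hosts that already carry a daemon;
    # count stays as an upper bound.
    return {
        'hosts': sorted(set(running_daemon_hosts)),
        'count': old_placement.get('count'),
    }

old = {'count': 2}  # old "recommendation"-style spec
print(migrate_spec(old, ['host2', 'host1']))
# {'hosts': ['host1', 'host2'], 'count': 2}
```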
@sebastian-philipp
Contributor Author

ping @jschmid1 + @mgfritch wdyt?

@sebastian-philipp sebastian-philipp added wip-swagner-testing My Teuthology tests and removed needs-doc labels Jun 5, 2020
@sebastian-philipp
Contributor Author

@sebastian-philipp sebastian-philipp merged commit 3a757a4 into ceph:master Jun 9, 2020
@LenzGr
Contributor

LenzGr commented Jun 10, 2020

Unfortunately this change broke Ceph Dashboard: https://tracker.ceph.com/issues/45963
(the failing ceph dashboard backend API tests on this PR should have been a warning sign)

{
    'name': 'migration_current',
    'type': 'int',
    'default': None,
Member

None is not an integer, so it will break code that relies on the configured type. It would be better to choose an int value here, e.g. -1 or 0.
The regression introduced by this PR is https://tracker.ceph.com/issues/45963.
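A small sketch of the failure mode (simplified, not the actual mgr/dashboard code; the -1 sentinel is a hypothetical choice): a typed 'int' option whose default is None breaks any consumer that trusts the declared type, while an int sentinel keeps comparisons well-defined.

```python
# Illustrative: why default=None on an 'int' option is fragile.

OPTION = {'name': 'migration_current', 'type': 'int', 'default': -1}  # -1 = "never migrated" (hypothetical sentinel)

def needs_migration(value, latest=2):
    # With an int sentinel this comparison always works; with
    # default=None, `None < latest` raises TypeError on Python 3.
    return value < latest

print(needs_migration(OPTION['default']))  # True
```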

Contributor Author

hm.

When looking at the code:

def get_module_option(self, key, default=None):
    """
    Retrieve the value of a persistent configuration setting

    :param str key:
    :param default: the default value of the config if it is not found
    :return: str
    """
    r = self._ceph_get_module_option(key)
    if r is None:
        return self.MODULE_OPTION_DEFAULTS.get(key, default)
    else:
        return r

having an arbitrary default does seem to work fine. Having the dashboard treat the values differently than mgr_module.py seems odd to me.
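The fallback order in get_module_option can be demonstrated with a tiny self-contained mock (not the real MgrModule class): a stored value wins, then MODULE_OPTION_DEFAULTS, then the call-site default.

```python
# Minimal mock of the fallback chain shown above (not the real class).

class FakeModule:
    MODULE_OPTION_DEFAULTS = {'migration_current': None}

    def __init__(self, stored):
        self._stored = stored  # simulates _ceph_get_module_option()

    def get_module_option(self, key, default=None):
        r = self._stored.get(key)
        if r is None:
            return self.MODULE_OPTION_DEFAULTS.get(key, default)
        return r

print(FakeModule({'migration_current': 2}).get_module_option('migration_current'))  # 2
print(FakeModule({}).get_module_option('migration_current'))  # None (what the dashboard chokes on)
```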


``myfs`` across the cluster.

Then, in case there are less than three daemons deployed on the candidate
hosts, cephadm will then then randomly choose hosts for deploying new daemons.
Member

s/will then then randomly/will then randomly/
