mgr/cephadm: make scheduling make sense #34633
Conversation
discussion here: https://pad.ceph.com/p/orchestration-weekly
force-pushed from f4d6669 to 29a8399
force-pushed from 457ce7a to 30785bb
force-pushed from 30785bb to 2e13612
force-pushed from 2e13612 to cf1e104
ping @jschmid1
jschmid1 left a comment
this needs a closer review. consider this as a first round of reviews
```python
new_spec = ServiceSpec.from_json(spec.to_json())
new_spec.placement = new_placement
```
I can also see an issue where a user applies a cluster.yaml only when changing a single service. Since we converted the placement_glob (or pattern) to an explicit placement spec, they would probably have to export and overwrite their existing cluster.yaml.
I think we should recommend this somehow.
Right, that's a big problem but I didn't have a better answer.
Having previews for all services would be a first step, I think. If a placement would change completely, we could at least raise a warning before applying it.
Both schedulers behave very similarly when it comes to assigning hosts; it's just that the old one assigned more hosts than the new one does.
I'm still struggling with this concept, but I suppose it's similar to k8s nodeName ...
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename
yes. But what's the alternative? What if the user wrongly created a placement like

```yaml
service_type: mon
placement:
  label: mymons
  count: 5
```

but only a single host was labeled as `mymons`? Should we simply remove all the mons? I'm also against being more elaborate here, as this increases the likelihood of bugs. I'd really be in favor of keeping this as simple as possible to stay safe.
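As a toy sketch of the simple behaviour argued for here (a hypothetical `resolve_placement` helper, not the actual cephadm code): when fewer hosts carry the label than `count` requests, schedule on the hosts that exist rather than removing daemons.

```python
def resolve_placement(labeled_hosts, count):
    """Pick the hosts to schedule on, never exceeding what is available.

    If fewer hosts carry the label than ``count`` asks for, we keep the
    daemons we can place instead of removing existing ones.
    """
    return labeled_hosts[:count]

# The spec asks for count: 5, but only one host is labeled 'mymons'.
print(resolve_placement(['host1'], 5))  # -> ['host1']
```

The point of the sketch is that the result is clamped by the available hosts, so a mislabeled cluster degrades gracefully instead of tearing down mons.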
force-pushed from cf1e104 to c1d9dc9
force-pushed from 22c5e58 to 3a04af6
spoiler: it doesn't. Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
Like,

```yaml
placement:
  hosts:
  - myhost
  count: 3
```

will always only schedule one daemon on `myhost`. Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
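This "placement is a bound, not a recommendation" rule can be sketched with a hypothetical `num_daemons` helper (an illustration under that assumption, not the real scheduler code): with an explicit host list, `count` can never produce more daemons than there are listed hosts.

```python
def num_daemons(hosts, count=None):
    """How many daemons a placement yields under the new scheduler.

    The explicit host list is an upper bound; count can only lower it,
    never raise it beyond the listed hosts.
    """
    if count is None:
        return len(hosts)
    return min(count, len(hosts))

# placement: hosts: [myhost], count: 3  ->  only one daemon
print(num_daemons(['myhost'], count=3))  # -> 1
```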
force-pushed from 3a04af6 to a3dfaa0
New scheduler that takes PlacementSpec as the bound and not as a recommendation. This means we have to make sure we're not removing any daemons directly after upgrading to the new scheduler. There is a potential race here:

1. the user updates their spec to remove daemons
2. the mgr gets upgraded to the new scheduler before the old scheduler removed the daemon
3. now we're converting the spec to an explicit placement, thus reverting (1.)

I think this is ok. Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
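The upgrade-time conversion described above could look roughly like this sketch (the `to_explicit_placement` helper and dict-shaped specs are hypothetical, not the actual migration code): pin the hosts where daemons currently run as an explicit placement, so the new scheduler removes nothing right after the mgr upgrade.

```python
def to_explicit_placement(spec_placement, running_daemon_hosts):
    """Convert a label/pattern/count placement to an explicit host list.

    Explicit host lists are kept as-is; anything else is pinned to the
    hosts where daemons are currently running, which is what makes the
    race in the description above resolve in favor of keeping daemons.
    """
    if spec_placement.get('hosts'):
        return spec_placement
    return {'hosts': sorted(running_daemon_hosts)}

print(to_explicit_placement({'label': 'mymons', 'count': 5},
                            {'host2', 'host1'}))
# -> {'hosts': ['host1', 'host2']}
```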
force-pushed from a3dfaa0 to b307021
Unfortunately this change broke the Ceph Dashboard: https://tracker.ceph.com/issues/45963
```python
{
    'name': 'migration_current',
    'type': 'int',
    'default': None,
```
`None` is not an integer, thus it will break code that relies on the configured type. It would be better to choose an int value here, e.g. -1 or 0.
The regression introduced by this PR is https://tracker.ceph.com/issues/45963.
hm.
When looking at the code:
ceph/src/pybind/mgr/mgr_module.py
Lines 643 to 655 in 11ddc9c
Having an arbitrary default does seem to work fine. Having the dashboard treat the values differently than mgr_module.py does seems odd to me.
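A sketch of the fix suggested above, assuming an int sentinel such as `-1` is acceptable to mean "no migration yet" (the `MIGRATION_NOT_STARTED` name and `migration_in_progress` helper are hypothetical, not the code that was merged):

```python
# Sentinel: declare the option as an int with an int default instead of
# None, so consumers that rely on the declared type (e.g. the dashboard)
# always see an integer.
MIGRATION_NOT_STARTED = -1

MODULE_OPTIONS = [
    {
        'name': 'migration_current',
        'type': 'int',
        'default': MIGRATION_NOT_STARTED,  # int sentinel instead of None
    },
]

def migration_in_progress(migration_current: int) -> bool:
    """True once a migration step has been recorded."""
    return migration_current != MIGRATION_NOT_STARTED

print(migration_in_progress(MIGRATION_NOT_STARTED))  # -> False
```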
> ``myfs`` across the cluster.
>
> Then, in case there are less than three daemons deployed on the candidate
> hosts, cephadm will then then randomly choose hosts for deploying new daemons.
s/will then then randomly/will then randomly/
Depends on
Consider this placement:
Previously, if there was already a daemon running on `myhost`, cephadm would schedule another daemon on a random other host. Now, cephadm will always only schedule one daemon on `myhost`.

Open questions:
TODO:
Also fixed: