mgr/cephadm: mgr or mds scale-down should prefer non-active daemons #36485

sebastian-philipp merged 1 commit into ceph:master
Conversation
sebastian-philipp left a comment:
I don't see any tests here. Mind creating a test in test_scheduling? I really want to avoid this breaking at some point in the future.
Swapped from passing the get_active_daemon function to the scheduler to having a new field in the DaemonDescription object that marks whether a daemon is the active one (this only applies to mgr and mds). Right now, I'm setting it in the … I still need to write tests.
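The change described above can be sketched roughly as follows. This is a minimal illustration, not the actual cephadm code: the field names mirror the real DaemonDescription, but `mark_active` is a hypothetical helper standing in for wherever the refresh logic sets the flag.

```python
# Sketch (assumed names): a DaemonDescription carrying an is_active flag,
# set when daemon state is refreshed rather than passing get_active_daemon
# into the scheduler.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DaemonDescription:
    daemon_type: str          # e.g. 'mgr', 'mds'
    daemon_id: str            # e.g. 'host1.abcdef'
    hostname: str
    is_active: bool = False   # True only for the active mgr/mds

def mark_active(daemons: List[DaemonDescription],
                active_id: Optional[str]) -> None:
    """Flag the daemon whose id matches the cluster's active daemon."""
    for dd in daemons:
        dd.is_active = (dd.daemon_id == active_id)
```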
@adk3798 I hope this makes sense.
The intention with that part of the message was to mirror the line from the Ceph tracker: "2. make the scheduler prefer active daemons when placing them." I can see how that is confusing, though, especially when the first line of the commit says it will prefer non-active daemons. I'll change it to speak only about preferring non-active daemons when removing them and leave out the language about placing.
Force-pushed from 86e4ee7 to 7c0fad6
Added some Python tests for placing daemons on hosts when one of the daemons has the is_active flag set.
jenkins test make check
Force-pushed from 9bf3424 to 949bf30
src/pybind/mgr/cephadm/module.py (outdated):

```python
if dd.daemon_type in ['grafana', 'iscsi', 'prometheus', 'alertmanager', 'nfs']:
    daemons_post[dd.daemon_type].append(dd)

if dd.daemon_type in ['mgr', 'mds']:
```
sebastian-philipp: Why only for mgr and mds? We also have get_active_daemon for grafana.
@sebastian-philipp The reason I was originally careful to allow only mds or mgr daemons here was that I had left get_active_daemon undefined for many of the daemon types. I've refactored it a bit: if it's called for a service that hasn't defined the function, it just gets back an empty DaemonDescription, and it then checks whether the daemon_id on the DaemonDescription it gets back matches the daemon being checked. Thoughts?
Also, the reasoning here is the same as my reason for limiting it to mgr and mds elsewhere. I've marked those conversations as resolved but can go back to them if this approach is insufficient.
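The fallback just described can be sketched as follows. Class and method names are simplified stand-ins for the real cephadm service classes; the key idea is that a base-class get_active_daemon returns an empty DaemonDescription, so the daemon_id comparison is simply false for services with no notion of "active".

```python
# Sketch: services without get_active_daemon defined fall back to an
# empty DaemonDescription, making the is_active check safely False.
from typing import List, Optional

class DaemonDescription:
    def __init__(self, daemon_id: Optional[str] = None) -> None:
        self.daemon_id = daemon_id

class CephService:
    def get_active_daemon(self, daemons: List[DaemonDescription]) -> DaemonDescription:
        # Base class: no notion of "active"; return an empty description.
        return DaemonDescription()

class MgrService(CephService):
    def __init__(self, active_id: str) -> None:
        self.active_id = active_id

    def get_active_daemon(self, daemons: List[DaemonDescription]) -> DaemonDescription:
        for dd in daemons:
            if dd.daemon_id == self.active_id:
                return dd
        return DaemonDescription()

def is_active(service: CephService, dd: DaemonDescription,
              daemons: List[DaemonDescription]) -> bool:
    """True only if the service's active daemon matches dd."""
    active = service.get_active_daemon(daemons)
    return active.daemon_id is not None and active.daemon_id == dd.daemon_id
```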
LGTM. I'll try to get it through QA ASAP.
jenkins test make check
When removing daemons during a mgr/mds scale-down, prefer to remove
standby daemons so the active daemon is not killed

Fixes: https://tracker.ceph.com/issues/44252
Signed-off-by: Adam King <adking@redhat.com>
I'm facing a problem that `ceph orch daemon redeploy <our own mgr>` fails really badly:

1. client sends `redeploy` to the mon
2. mon sends `redeploy` to the mgr
3. we synchronously call _create_daemon() with our own mgr; this never completes
4. the mon re-sends the command to the mgr as soon as it starts
5. goto 2, and we're in an evil endless loop

I'm now trying to:

1. make ok-to-stop always return False in this case, and
2. call ok-to-stop from `orch daemon redeploy`

but this causes some problems, as we then no longer fail over the mgr and thus never undeploy the active mgr. Turns out this is really closely related to AdamK's ceph#36485 (preferring to remove standby daemons instead of active ones), but that doesn't solve `orch daemon redeploy`. So for now, I think making ok-to-stop always False for our own mgr is the wrong idea; I should rely on ceph#36485 instead. But how do I solve `daemon redeploy`? I think we _have_ to fail over before redeploying our own mgr. I mean, self._create_daemon is the last call we ever execute, because then we're gone and another mgr takes over.

Signed-off-by: Sebastian Wagner <sebastian.wagner@suse.com>
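The "fail over before redeploying ourselves" idea at the end of that comment can be sketched like this. All names here (`get_mgr_id`, `fail_over`, `create_daemon`) are hypothetical stand-ins, not the real cephadm module API; the sketch only shows the control flow that breaks the endless loop.

```python
# Sketch: if the daemon being redeployed is the mgr we are running in,
# trigger a failover instead of calling create_daemon on ourselves,
# which would never complete (see the loop described above).
def redeploy_daemon(mgr, daemon_type: str, daemon_id: str) -> str:
    if daemon_type == 'mgr' and daemon_id == mgr.get_mgr_id():
        # Redeploying ourselves would loop forever: fail over first,
        # and let the new active mgr handle the actual redeploy.
        mgr.fail_over()
        return 'failed over; redeploy handled by the new active mgr'
    mgr.create_daemon(daemon_type, daemon_id)
    return 'redeployed {}.{}'.format(daemon_type, daemon_id)
```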
When placing daemons during a mgr/mds scale-down, give preference to
the host with the active daemon so the active daemon is not
picked for removal

When removing daemons during a mgr/mds scale-down, prefer to remove
standby daemons so the active daemon is not killed

Fixes: https://tracker.ceph.com/issues/44252
Signed-off-by: Adam King <adking@redhat.com>