mgr/mds_autoscaler: add autoscaling for MDS management (#32731)
Force-pushed 5883416 to de06bd6
sebastian-philipp left a comment:
Please add this module to the ../tox.ini under mypy
```python
def is_fewer_than_max_mds(self, fs):
    if len(fs['in']) < fs['max_mds']:
        return True
    return False
```
It's not that simple. You need to review:
https://github.com/ceph/ceph/blob/master/src/mds/MDSMap.cc#L1025-L1033
and line 1996 at commit 0573918.
In general, you only really need to ensure that sufficient standbys exist for a file system. The monitors will use standbys to stabilize the file system if possible. You will see an "INSUFFICIENT_STANDBY" warning generated by the monitors if there are not enough standbys for a file system. See also:
Lines 242 to 250 in 0573918
This module should be able to monitor for that via the below notify method for the "clog" events. However, this module should just replicate the logic that detects insufficient standbys and not try to parse the clog event message.
Force-pushed 5dc4000 to 6b2812b
```cpp
case CEPH_MSG_FS_MAP:
  py_module_registry->notify_all("fs_map", "");
  handle_fs_map(ref_cast<MFSMap>(m));
  py_module_registry->notify_all("fs_map", "");
```
This should be in a separate commit with explanation.
(Looks like you figured out the problem with point 1. in the mail you sent.)
I would also send this fix as a separate PR so it can get in sooner.
@batrick I thought I had figured out the problem, but it doesn't look like it. The fs_map object is still undefined when the clog notification type is handled by the plugin.
Or maybe I turned too many knobs at once while changing the handling of notifications from fs_map to clog. The plugin probably needs to handle the fs_map notification to save the fs_map and then act on it when the clog notification is delivered.
@batrick It was the missing scope specifier (self) prefix on the fs_map member that was biting me.
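The approach discussed above (cache the map on the fs_map notification, act on the clog notification, and remember the self prefix) can be sketched roughly like this. This is a toy sketch, not the real MgrModule API: fetch_fs_map stands in for however the module actually retrieves the map, and evaluations stands in for the real scaling logic.

```python
class AutoscalerSketch:
    """Cache the fsmap on 'fs_map' notifications; act on 'clog' ones."""

    def __init__(self, fetch_fs_map):
        self.fetch_fs_map = fetch_fs_map  # callable returning the fsmap dict
        self.fs_map = None                # instance attribute, hence 'self.'
        self.evaluations = 0              # stand-in for real scaling work

    def notify(self, notify_type, notify_id=None):
        if notify_type == 'fs_map':
            # just remember the latest map; don't act yet
            self.fs_map = self.fetch_fs_map()
        elif notify_type == 'clog':
            # act on the cached map instead of parsing the clog message
            if self.fs_map is not None:
                self.evaluations += 1
```

A clog event that arrives before any fs_map notification is simply a no-op, which avoids the "fs_map is still undefined" trap described above.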
Force-pushed 6b2812b to a23701a
```python
time.sleep(1)

def is_insufficient_standby(self, message):
    return (message == 'Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)')
```
is such a comparison the only way to figure out if there are insufficient standbys? wouldn't get_current_standby_count() < get_total_standby_count_required() be good enough?
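The count comparison the reviewer suggests could look something like the following. The helper names come from the comment above; the fsmap layout (a top-level 'standbys' list plus per-filesystem 'standby_count_wanted' in the mdsmap) is a simplified assumption for illustration.

```python
def get_current_standby_count(fs_map):
    # standbys are a cluster-wide pool in the fsmap
    return len(fs_map.get('standbys', []))

def get_total_standby_count_required(fs_map):
    # each filesystem asks for 'standby_count_wanted' standbys
    return sum(fs['mdsmap'].get('standby_count_wanted', 0)
               for fs in fs_map.get('filesystems', []))

def is_insufficient_standby(fs_map):
    # compare counts instead of string-matching a health message
    return get_current_standby_count(fs_map) < get_total_standby_count_required(fs_map)
```

This sidesteps the fragile comparison against the exact MDS_INSUFFICIENT_STANDBY message text, though the real monitor-side check (MDSMap.cc) is more involved than this sketch.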
```python
if isinstance(notify_id, dict):
    message = notify_id.get('message')
    if self.is_insufficient_standby(message) and self.fs_map:
        self.filesystems = self.fs_map['filesystems']
```
is it really required to maintain a copy of fsmap, filesystems and standbys?
@vshankar did away with the copies now ... will update PR later
src/mgr/Mgr.cc (outdated)
```cpp
      {}, &c->outbl, &c->outs, c);
  }
}
fs_map_cond.notify_all();
```
this has been moved from the start of the routine to the end -- what does it fix?
@vshankar I had thought of moving the notification to the end since my patch wasn't working, but my Python code wasn't doing what it was supposed to. I'll revert these changes.
Force-pushed 9ed834d to 5bccc44
Force-pushed 5f6de2f to d90f87e
```python
if sb['state'] == 'up:active' and sb['rank'] == -1:
    name = sb['name']
    self.remove_mds(sb['name'])
    break
```
I think this logic should be moved to cephadm so that it, in general, kills standby (mds,mgr) daemons when scaling down.
@liewegas I presumed that cephadm was only responsible for implementing the mechanism of the operations, and that policy decisions were supposed to be made by cephadm clients.
cephadm has a kubernetes-like spec (that says, e.g., "3 daemons") and a controller that ensures that the right number of daemons are running. Removing or adding individual daemons will tend to fight with that: if you remove one, a new one will get created (somewhere), and if you add one, an existing one will get removed. Alternatively, if you adjust the count down, cephadm will pick which one to remove.
I think the best path forward is to make cephadm (and rook) just a bit smarter so that they prefer to remove standby daemons and not active ones. Then this tool just needs to adjust the total daemon count (max_mds + num standbys)...
I have a half-written patch to make cephadm prefer to remove standbys... so I suggest simplifying this PR to just adjust the count.
@liewegas "Then this tool just needs to adjust the total daemon count (max_mds + num standbys)" ... how do I do that ?
In the most common case, it's as simple as `ceph orch apply mds $fsname $count` (apply_mds(...)). But the user might have a customized placement (specifying a label, hosts, etc.). So I think the right thing to do is: fetch the current placement (you can do this with `ceph orch ls --service-name mds.$fsname`, i.e. describe_service(service_name='mds.whatever')), modify just the count field in the spec.placement PlacementSpec (make a copy first -- don't modify in place), and then send that back to apply_mds().
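The fetch-copy-modify-reapply flow described above can be sketched as plain Python. This is not the real orchestrator API: the dict-shaped spec and the commented-out describe_service/apply_mds calls are simplified stand-ins for illustration.

```python
import copy

def adjusted_spec(current_spec, new_count):
    """Return a copy of a service spec with only placement.count changed.

    'current_spec' stands in for what describe_service() would return
    for mds.<fsname>; deep-copying first avoids mutating it in place,
    which is the "make a copy first" point above.
    """
    spec = copy.deepcopy(current_spec)
    spec['placement']['count'] = new_count
    return spec

# usage sketch (hypothetical call shapes):
#   spec = describe_service(service_name=f'mds.{fs_name}')
#   apply_mds(adjusted_spec(spec, max_mds + standby_count_wanted))
```

Because only the count is touched, any user-customized placement (labels, host lists, etc.) in the spec is carried over unchanged.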
@liewegas all that describe_service() returns is a single record, about the crash service.
I've posted a WIP PR; please take a look; I must be missing something that I can't fathom yet.
@sebastian-philipp I'm trying to test my plugin on a vstart cluster on my laptop. I can't find any way to make cephadm act on the default cephfs.a cluster. It looks like cephadm needs to start the daemons itself to gain ownership of the cluster and effectively act on them.
Is there a way to instruct cephadm to start a file system and take ownership of all relevant daemons?
The MDS Autoscaler plugin is supposed to act on changes to the FSMap and spawn or kill MDS standby daemons as required. Could you suggest a workflow to achieve this? I've used Sage's suggestions to update my code, but the describe_service() API isn't listing anything other than the crash service. The same goes for the `ceph orch ls` command at the command line.
@mchangir so, cephadm won't touch any daemons that are not created by cephadm, e.g. daemons created by vstart.sh. You'll have to create them using `ceph orch apply mds ...` and not via vstart.

> is there a way to instruct cephadm to start a file-system and take ownership of all relevant daemons?

yes: `ceph fs volume create --placement=...`

> the describe_service() API isn't listing anything other than the crash service

exactly. you'll have to create the daemons using cephadm first.
Force-pushed 24217a1 to f7a33a8
Force-pushed f7a33a8 to 670e6c3
Force-pushed 098205b to 80f3e44
batrick left a comment:
I'm trying this out with vstart.sh: `env MDS=5 ../src/vstart.sh -n -d --cephadm`.
I'm seeing the module do some work (with the patches below suggested) but cephadm isn't spawning any new daemons. Have you tried this with vstart.sh?
```python
assert fs_map is not None
for fsys in fs_map['filesystems']:
    if fsys.get('mdsmap').get('fs_name') == fs_name:
        return len(fsys.get('mdsmap').get('up'))
```
Since you're looking at the daemon names in get_current_standby_count to determine which standbys belong to a file system, you need to do the same here. If the daemon was not spawned by the orchestrator, it should not be counted.
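A sketch of the kind of daemon-name filter meant here. The '<fsname>.<host>.<id>' naming pattern is an assumption for illustration; the thread doesn't show the actual scheme orchestrator-spawned MDS daemons use.

```python
def count_orch_daemons(names, fs_name):
    """Count daemons whose name looks orchestrator-generated for fs_name.

    Assumes (hypothetically) that cephadm names MDS daemons like
    '<fsname>.<host>.<random>', so a simple prefix match suffices.
    Daemons spawned outside the orchestrator won't match and are skipped.
    """
    prefix = fs_name + '.'
    return sum(1 for n in names if n.startswith(prefix))
```

Applying the same filter when counting 'up' daemons keeps the two counts comparable, as the review comment asks.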
```python
for fsys in fs_map['filesystems']:
    if fsys.get('mdsmap').get('fs_name') == fs_name:
        return len(fsys.get('mdsmap').get('up'))
```

Suggested change:

```python
for fs in fs_map['filesystems']:
    if fs['mdsmap']['fs_name'] == fs_name:
        return len(fs['mdsmap']['up'])
```
```python
assert fs_map is not None
for fs in fs_map['filesystems']:
    if fs['mdsmap']['fs_name'] == fs_name:
        return fs['mdsmap'].get('standby_count_wanted')
```

Suggested change:

```python
return fs['mdsmap']['standby_count_wanted']
```
```python
for fsys in fs_map['filesystems']:
    if fsys.get('mdsmap').get('fs_name') == fs_name:
        return fsys.get('mdsmap', {}).get('max_mds')
```

Suggested change:

```python
for fs in fs_map['filesystems']:
    if fs['mdsmap']['fs_name'] == fs_name:
        return fs['mdsmap']['max_mds']
```
```python
except (orchestrator.OrchestratorError, AssertionError) as e:
    self.log.debug(f"fs:{fs_name} exception while verifying mds status: {e!r}")
```

Suggested change:

```python
except orchestrator.OrchestratorError as e:
    self.log.exception(f"fs:{fs_name} exception while verifying mds status: {e}")
```

Don't catch AssertionError; we want the module to blow up if that hits.
```python
MDS autoscaler.
"""
def __init__(self, *args, **kwargs):
    super(MDSAutoscaler, self).__init__(*args, **kwargs)
```

Suggested change:

```python
MgrModule.__init__(self, *args, **kwargs)
```
With multiple inheritance, super() is ambiguous, and OrchestratorClientMixin has no __init__ method either.
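A stripped-down illustration of the point above, using toy classes rather than the real MgrModule/OrchestratorClientMixin: when one base in the MRO defines no __init__, calling the one real __init__ explicitly is unambiguous.

```python
class Base:
    """Stands in for MgrModule: the only base with a real __init__."""
    def __init__(self, name):
        self.name = name

class Mixin:
    """Stands in for OrchestratorClientMixin: no __init__ of its own."""
    def helper(self):
        return f"helper for {self.name}"

class Module(Mixin, Base):
    def __init__(self, name):
        # call the single base that actually defines __init__ explicitly,
        # instead of relying on super() resolution through the MRO
        Base.__init__(self, name)
```

The mixin's methods still work because they only assume the attributes that Base.__init__ set up.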
According to @sebastian-philipp, cephadm will not spawn more than one MDS per host for a filesystem.
That's not going to work: MDSs deployed by vstart itself aren't managed by cephadm. You're going to need a few VMs and add them as hosts to cephadm. Then you can deploy MDSs on them. So, something like this should work:
Thanks for reminding me. However, I didn't see cephadm spawn even one daemon on the host. Perhaps it knows that there are MDSs it does not manage already on that host (localhost)?
Ok. Let's get the patches I've suggested above in, then we can merge if @sebastian-philipp is satisfied.
Force-pushed 80f3e44 to b93a95c
@sebastian-philipp please review
Good first milestone. Just waiting on @sebastian-philipp for his final take.
do we have a QA run validating this PR?
nope; it's been all manual testing so far
mgr plugin to deploy and configure MDSs in response to degraded file system

MDS instance management as per changes to:
* 'max_mds' option
* 'standby_count_wanted' option
* mds liveness and transitions from standby to active

mds_autoscaler plugin test credit goes to Sebastian Wagner.

Fixes: https://tracker.ceph.com/issues/40929
Signed-off-by: Milind Changire <mchangir@redhat.com>
Force-pushed b93a95c to f69abe6
I've added the
MDS instance management as per changes to 'max_mds' option and mds process liveness.
Fixes: https://tracker.ceph.com/issues/40929
Signed-off-by: Milind Changire <mchangir@redhat.com>