[ML] Distribute trained model allocations across availability zones#89822
Conversation
When a model deployment is started with 2 or more allocations and availability zones are present we should distribute the allocations across availability zones so that there is resilience. This commit adds a `ZoneAwareAssignmentPlanner` that attempts to evenly distribute the allocations of a deployment across the available zones.
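To make the intent concrete, here is a minimal sketch (hypothetical helper names, not the actual `ZoneAwareAssignmentPlanner` code) of what "evenly distribute the allocations of a deployment across the available zones" means: the first `allocations % zones` zones receive one extra allocation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the planner itself: split a deployment's
// allocation count as evenly as possible across the available zones.
public class EvenZoneSplit {
    // Returns per-zone allocation counts; the first (allocations % zones)
    // zones each get one extra allocation.
    static List<Integer> split(int allocations, int zones) {
        List<Integer> perZone = new ArrayList<>();
        int base = allocations / zones;
        int remainder = allocations % zones;
        for (int z = 0; z < zones; z++) {
            perZone.add(base + (z < remainder ? 1 : 0));
        }
        return perZone;
    }

    public static void main(String[] args) {
        System.out.println(split(5, 3)); // [2, 2, 1]
        System.out.println(split(3, 3)); // [1, 1, 1]
    }
}
```

With 3 allocations and 3 zones this yields one allocation per zone, which is the resilient outcome the PR aims for.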
Pinging @elastic/ml-core (Team:ML)

Hi @dimitris-athanasiou, I've created a changelog YAML for you.
run elasticsearch-ci/part-1 |
droberts195 left a comment
LGTM
A couple of minor nits and one much more far-reaching observation. But I am happy to merge the PR for 8.5 if you could just adjust the release note to make clear it's referring to trained models.
...ava/org/elasticsearch/xpack/ml/inference/assignment/planning/ZoneAwareAssignmentPlanner.java
```java
) {
    Map<String, Integer> modelIdToTargetAllocations = modelIdToRemainingAllocations.entrySet()
        .stream()
        .collect(Collectors.toMap(e -> e.getKey(), e -> (e.getValue() - 1) / remainingZones + 1));
```
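The `(e.getValue() - 1) / remainingZones + 1` expression in the quoted snippet is the standard integer idiom for ceiling division: for positive `n` and `z` it equals ceil(n / z), so each zone is asked to host at least an even share of the remaining allocations. A quick standalone check (hypothetical method name, not from the PR):

```java
public class CeilDiv {
    // Integer ceiling division for positive operands:
    // equivalent to (int) Math.ceil((double) n / z) without floating point.
    static int ceilDiv(int n, int z) {
        return (n - 1) / z + 1;
    }

    public static void main(String[] args) {
        System.out.println(ceilDiv(3, 3)); // 1: a 3-allocation model targets 1 allocation per zone
        System.out.println(ceilDiv(1, 3)); // 1: a 1-allocation model also targets 1 in the first zone
        System.out.println(ceilDiv(5, 2)); // 3
    }
}
```

Note that both a 3-allocation and a 1-allocation model produce a target of 1 in the first of three zones, which is the root of the interaction discussed below.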
There's an interesting effect here.
Suppose we have 3 AZs, some models that have been given 3 allocations and some models that have been given 1 allocation.
We'll try to give all the models 1 allocation in the first AZ. This likely won't fit, but it could be that we end up choosing the models that are going to have 1 allocation in total. In the worst case we could end up with all the 1-allocation models in the first AZ, and the 3-allocation models having to take 1 and 2 allocations in the other two AZs. This would be highly unintuitive to a user, who would expect the models they've asked for 3 allocations of to go one per AZ, and the ones they've asked for 1 allocation of to be fitted in around that.
One possibility might be to exclude models with only one allocation requested from the per-zone plans, so they get fitted in at the very end by the overall plan. If you think this is too hard for the time available for 8.5 then I am happy to leave this as-is for 8.5. I think what's here at the moment is better than what we have now, so the PR in its current form is a step forwards. But we might need to revisit this again in the future.
Right, yes, that is possible. However, if we excluded 1-allocation models from the per-zone planning, then it is possible that they do not get any allocations if the multi-allocation models took all resources. It would be like tweaking the objective function to prefer multi-allocation models. Not sure if that is better.
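The interaction being debated can be seen by iterating the quoted formula zone by zone: each zone's target is the ceiling of the remaining allocations over the remaining zone count, and the remainder carries forward. This is an illustrative walk-through inferred from the quoted snippet, not the planner's actual code (the `remaining <= 0` guard is added here so the sketch is self-contained):

```java
import java.util.Arrays;

// Illustrative sketch inferred from the quoted snippet, not the actual planner:
// per-zone targets when remaining allocations are repeatedly divided
// (with ceiling) by the number of zones still to be planned.
public class ZoneTargets {
    static int[] targetsPerZone(int totalAllocations, int zones) {
        int[] targets = new int[zones];
        int remaining = totalAllocations;
        for (int z = 0; z < zones; z++) {
            int remainingZones = zones - z;
            // ceil(remaining / remainingZones); guard added for this sketch
            // so fully-assigned models target 0 in later zones.
            targets[z] = remaining <= 0 ? 0 : (remaining - 1) / remainingZones + 1;
            remaining -= targets[z];
        }
        return targets;
    }

    public static void main(String[] args) {
        // A 3-allocation model over 3 zones targets one allocation in every zone...
        System.out.println(Arrays.toString(targetsPerZone(3, 3))); // [1, 1, 1]
        // ...while a 1-allocation model targets its single allocation in the
        // first zone, so both kinds of model compete for space there.
        System.out.println(Arrays.toString(targetsPerZone(1, 3))); // [1, 0, 0]
    }
}
```

This shows why the first zone is where 1-allocation models can crowd out the per-zone share of multi-allocation models, as described above.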
…rence/assignment/planning/ZoneAwareAssignmentPlanner.java Co-authored-by: David Roberts <dave.roberts@elastic.co>
@elasticmachine update branch
* main: (175 commits)
  - Fix integration test on Windows (elastic#89894)
  - Avoiding the use of dynamic map keys in the cluster_formation results of the stable master health indicator (elastic#89842)
  - Mute org.elasticsearch.tracing.apm.ApmIT.testCapturesTracesForHttpTraffic (elastic#89891)
  - Fix typos in audit event types (elastic#89886)
  - Synthetic _source: support histogram field (elastic#89833)
  - [TSDB] Rename rollup public API to downsample (elastic#89809)
  - Format script values access (elastic#89780)
  - [DOCS] Simplifies composite aggregation recommendation (elastic#89878)
  - [DOCS] Update CCS compatibility matrix for 8.3 (elastic#88906)
  - Fix memory leak when double invoking RestChannel.sendResponse (elastic#89873)
  - [ML] Add processor autoscaling decider (elastic#89645)
  - Update disk-usage.asciidoc (elastic#89709) (elastic#89874)
  - Add allocation deciders in createComponents (elastic#89836)
  - Mute flaky H3LatLonGeometryTest.testIndexPoints (elastic#89870)
  - Fix typo in get-snapshot-status-api doc (elastic#89865)
  - Picking master eligible node at random in the master stability health indicator (elastic#89841)
  - Do not reuse the client after a disruption elastic#89815 (elastic#89866)
  - [ML] Distribute trained model allocations across availability zones (elastic#89822)
  - Increment clientCalledCount before onResponse (elastic#89858)
  - AwaitsFix for elastic#89867
  - ...
Conflicts:
  - x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/downsample/RollupShardIndexer.java