
[ML] Fail model deployment if all allocations cannot be provided#88656

Merged
dimitris-athanasiou merged 3 commits into elastic:master from
dimitris-athanasiou:fail-starting-deployment-if-model-cannot-be-satisfied
on Jul 21, 2022

Conversation

@dimitris-athanasiou
Contributor

When we cannot scale up (autoscaling is disabled) we should fail
requests to start a trained model deployment whose allocations
cannot be provided.

@elasticsearchmachine added the Team:ML (Meta label for the ML team) label on Jul 20, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent benwtrent self-requested a review July 20, 2022 16:07
Comment on lines +513 to +519
if (maxLazyMLNodes <= nodes.size()
&& trainedModelAssignment.isSatisfied(nodes.stream().map(DiscoveryNode::getId).collect(Collectors.toSet())) == false) {
String msg = "Could not start deployment because there are not enough resources to provide all requested allocations";
logger.debug(() -> format("[%s] %s", modelId, msg));
exception = new ElasticsearchStatusException(msg, RestStatus.TOO_MANY_REQUESTS);
return true;
}
Member
We should also check to make sure that the current node size is less than the max node size (vertical scaling vs horizontal).

Contributor Author

Well spotted. I've added a commit that factors out an isScalingPossible method and uses it in both places where we make that check.

return false;
}

private boolean isScalingPossible(List<DiscoveryNode> nodes) {

I think you should add a comment that this only considers memory, and in the future it would be nice to consider CPU too.

It means there's a discrepancy in how we handle different situations:

  • If autoscaling is enabled, a cluster is scaled to maximum size, there are 2 free CPUs, and you ask to start a deployment that needs 3 CPUs then you get told you cannot.
  • If autoscaling is enabled, a cluster is scaled to one step below its maximum size, there are no free CPUs, and you ask to start a deployment that needs 100000 CPUs then that's fine - we start it, the cluster scales to its maximum size and the deployment goes ahead but with far fewer than 100000 CPUs allocated to it.

While autoscaling doesn't understand CPU there's not a lot we can do about this, but it's worth at least adding comments to acknowledge where we are today.

Contributor Author

I have added a TODO comment.
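For illustration, here is a minimal, self-contained sketch of what a memory-only scaling check like the one discussed above could look like. This is not the actual Elasticsearch implementation: the class name, the `Node` record, and the field names (`maxLazyMlNodes`, `maxMlNodeSizeBytes`) are assumptions made for the sketch; only the idea (horizontal scaling via more lazy ML nodes, vertical scaling up to a maximum node size, memory-only for now) comes from the conversation.

```java
import java.util.List;

// Hypothetical sketch of a memory-only isScalingPossible check.
// All names here are illustrative, not the PR's actual code.
public class ScalingCheckSketch {

    // Simplified stand-in for DiscoveryNode: only ML memory matters here.
    record Node(long mlMemoryBytes) {}

    private final int maxLazyMlNodes;
    private final long maxMlNodeSizeBytes;

    ScalingCheckSketch(int maxLazyMlNodes, long maxMlNodeSizeBytes) {
        this.maxLazyMlNodes = maxLazyMlNodes;
        this.maxMlNodeSizeBytes = maxMlNodeSizeBytes;
    }

    // Scaling is possible if we can add another ML node (horizontal)
    // or grow an existing node up to the maximum node size (vertical).
    // TODO: this only considers memory; CPU is not accounted for yet.
    boolean isScalingPossible(List<Node> nodes) {
        boolean canAddNode = nodes.size() < maxLazyMlNodes;
        boolean canGrowNode = nodes.stream()
            .anyMatch(n -> n.mlMemoryBytes() < maxMlNodeSizeBytes);
        return canAddNode || canGrowNode;
    }

    public static void main(String[] args) {
        ScalingCheckSketch check = new ScalingCheckSketch(2, 1024);
        // Max node count reached and every node at max size: no scaling.
        if (check.isScalingPossible(List.of(new Node(1024), new Node(1024)))) {
            throw new AssertionError("expected no scaling possible");
        }
        // A node below the max size can still grow vertically.
        if (!check.isScalingPossible(List.of(new Node(512), new Node(1024)))) {
            throw new AssertionError("expected vertical scaling possible");
        }
        System.out.println("ok");
    }
}
```

With a check shaped like this, the deployment-start path can fail fast with `TOO_MANY_REQUESTS` when the assignment is unsatisfied and no further scaling (horizontal or vertical) is possible, as the merged diff does.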

@dimitris-athanasiou dimitris-athanasiou merged commit 154b924 into elastic:master Jul 21, 2022
@dimitris-athanasiou dimitris-athanasiou deleted the fail-starting-deployment-if-model-cannot-be-satisfied branch July 21, 2022 12:27
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jul 22, 2022
* upstream/master: (40 commits)
  Fix CI job naming
  [ML] disallow autoscaling downscaling in two trained model assignment scenarios (elastic#88623)
  Add "Vector Search" area to changelog schema
  [DOCS] Update API key API (elastic#88499)
  Enable the pipeline on the feature branch (elastic#88672)
  Adding the ability to register a PeerFinderListener to Coordinator (elastic#88626)
  [DOCS] Fix transform painless example syntax (elastic#88364)
  [ML] Muting InternalCategorizationAggregationTests testReduceRandom (elastic#88685)
  Fix double rounding errors for disk usage (elastic#88683)
  Replace health request with a state observer. (elastic#88641)
  [ML] Fail model deployment if all allocations cannot be provided (elastic#88656)
  Upgrade to OpenJDK 18.0.2+9 (elastic#88675)
  [ML] make bucket_correlation aggregation generally available (elastic#88655)
  Adding cardinality support for random_sampler agg (elastic#86838)
  Use custom task instead of generic AckedClusterStateUpdateTask (elastic#88643)
  Reinstate test cluster throttling behavior (elastic#88664)
  Mute testReadBlobWithPrematureConnectionClose
  Simplify plugin descriptor tests (elastic#88659)
  Add CI job for testing more job parallelism
  [ML] make deployment infer requests fully cancellable (elastic#88649)
  ...

Labels

:ml Machine learning >non-issue Team:ML Meta label for the ML team v8.4.0


4 participants