[autoscaler] Allowing users to provide extra configs for AWS#7844
[autoscaler] Allowing users to provide extra configs for AWS#7844richardliaw merged 16 commits intoray-project:masterfrom
Conversation
|
Can one of the admins verify this patch? |
|
Test FAILed. |
|
Test PASSed. |
|
Test PASSed. |
|
Test PASSed. |
|
Test FAILed. |
|
Test PASSed. |
python/ray/autoscaler/aws/config.py
Outdated
| try: | ||
| aws_credentials = config["provider"]["extra_config"]["aws_credentials"] | ||
| return boto3.client( | ||
| name, | ||
| config["provider"]["region"], | ||
| config=boto_config, | ||
| **aws_credentials) | ||
| except KeyError: | ||
| return boto3.client( | ||
| name, config["provider"]["region"], config=boto_config) |
There was a problem hiding this comment.
We already assume that everything within the provider keyspace is custom to AWS (see the Azure, GCP provider fields), so the extra_config field is redundant.
Consider:
aws_credentials = config["provider"].get("aws_credentials", {})
return boto3.client(
name,
config["provider"]["region"],
config=boto_config,
**aws_credentials)There was a problem hiding this comment.
You are correct that everything within the provider keyspace is custom to AWS.
Essentially, I want to pass in some configuration that is not specified in ray-schema.json (i.e. aws_credentials, security_group and etc), and my understanding is that we only allow properties that are listed in ray-schema.json to be present in the config object. In other words, config["provider"]["aws_credentials"] is not allowed with the current schema. I am using extra-config to pass them in because it's described as "provider-specific config" (see here) , which I thought was appropriate for the use case.
There was a problem hiding this comment.
As for how to get the value from the config object, I debated between what I currently have and what you purposed. If am correct about using extra_config, using the approach you proposed we would get some like
config["provider"].get("extra_config", {}).get("security_group", {}).get("IpPermissions", {}) for getting IpPermissions.
I personally think the KeyError way is easier to understand in this case, but if you feel strongly about avoiding it I'm happy to change that.
There was a problem hiding this comment.
Ah - let's change that. The schema was added last week, and for provider configurations, there's no particular reason why it needs to be stiff.
diff --git a/python/ray/autoscaler/ray-schema.json b/python/ray/autoscaler/ray-schema.json
index 81bd1663a..c0b64a955 100644
--- a/python/ray/autoscaler/ray-schema.json
+++ b/python/ray/autoscaler/ray-schema.json
@@ -58,7 +58,7 @@
"type": "object",
"description": "Cloud-provider specific configuration.",
"required": [ "type" ],
- "additionalProperties": false,
+ "additionalProperties": true,
"properties": {
"type": {
"type": "string",
Stage this hunk [y,n,q,a,d,j,J,g,/,e,?]? y
@@ -128,10 +128,6 @@
"type": "object",
"description": "k8s autoscaler permissions, if using k8s"
},
- "extra_config": {
- "type": "object",
- "description": "provider-specific config"
- },
"cache_stopped_nodes": {
"type": "boolean",
"description": " Whether to try to reuse previously stopped nodes instead of launching nodes. This will also cause the autoscaler to stop nodes instead of terminating them. Only implemented for AWS."There was a problem hiding this comment.
I agree that chaining 3 gets is not practical. If we can remove the extra_config layer, that should make your life easier.
There was a problem hiding this comment.
^ feel free to make this above change.
There was a problem hiding this comment.
Made the changes as you suggested.
|
Test PASSed. |
python/ray/autoscaler/aws/config.py
Outdated
| IpPermissions.extend(config["provider"]["extra_config"][ | ||
| "security_group"]["IpPermissions"]) |
There was a problem hiding this comment.
While reusing the existing Boto3 IpPermission model here is quite powerful, having users write it directly into their config unfortunately forces them to pick through the (pretty gnarly) Boto3 IpPermissions docs and cross-reference them with other AWS docs to understand the significance of things like EC2-VPC vs. EC2-Classic parameters. Given the comparatively simple modern EC2-VPC IpPermissions model, the difficulty of trying to support/test Ray cluster configuration via deprecated technologies like EC2-Classic, and the difficulty involved in backing out of this decision in the future, I would recommend exposing our own simplified EC2-VPC abstraction of this model to users. For example:
https://github.com/ray-project/ray/blob/207a0aa2fe477edb261de05cefaf7b3b4eed4bc5/python/ray/autoscaler/ray-schema.json#L15-L78 https://github.com/ray-project/ray/blob/207a0aa2fe477edb261de05cefaf7b3b4eed4bc5/python/ray/autoscaler/aws/example-security-groups.yaml#L18-L75.
There was a problem hiding this comment.
I agree this is a better approach than using the existing Boto3 IpPermission model. Since you have written much of what's needed to support this, do you want to make a separate PR for this?
There was a problem hiding this comment.
Also do you have any purposed solutions to the issue you mentioned about the failure when unequal but equivalent rule is provided by the user? Maybe bringing the server-side coercion logic here could be a solution. Not sure how feasible that is though.
There was a problem hiding this comment.
Definitely! I actually opened up a PR for this yesterday: #7873. It would be great to get some reviewer eyes on it, since the high degree of manual effort currently involved in configuring AWS account Subnets and/or Security Group rules without these changes effectively presents a blocker to us standing up Ray Clusters at scale across separate internal AWS accounts.
There was a problem hiding this comment.
My changes currently only do naive IpPermission equality checks, and thus suffer from the same coercion corner case problems. However, I've been taking some time today to look through the server-side rule target coercion logic to see how easy it is to replicate within the AWS autoscaler code (and how fragile the resulting solution may be). If I can land on a promising solution, then I'll let you know and update the PR as well.
|
Test PASSed. |
|
@allenyin55 ping me when I can merge this. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
Test PASSed. |
|
This looks great now, test failures are unrelated! Do you want to merge it @richardliaw? |
Why are these changes needed?
These changes allow users to use autoscaler to create security groups with customize inbound rules and key pairs with custom key names with specific credentials.
Only AWS configuration is implemented in this PR.
Related issue number
Checks
scripts/format.shto lint the changes in this PR.