Skip to content

[autoscaler] AWS Autoscaler support for EC2 security group rule configuration#7873

Closed
pdames wants to merge 4 commits intoray-project:masterfrom
pdames:aws-autoscaler-security-groups
Closed

[autoscaler] AWS Autoscaler support for EC2 security group rule configuration#7873
pdames wants to merge 4 commits intoray-project:masterfrom
pdames:aws-autoscaler-security-groups

Conversation

@pdames
Copy link
Copy Markdown
Member

@pdames pdames commented Apr 2, 2020

Why are these changes needed?

  1. These changes support common enterprise network security compliance use cases to ensure that inbound AWS EC2 instance connections are restricted to a known set of internal subnets. Although custom security groups or network ACLs may currently be configured to restrict SSH access to internal subnets, their creation requires:
  • Learning the current inbound connection rule set required for a Ray cluster and associated CLI tools to function properly (which may change over time).
  • Creating a custom security group and/or network ACL prior to running Ray cluster setup (via either a manual or automated out-of-band process).
  • Specifying the custom security group ID and/or subnet ID associated with the network ACL in the Ray autoscaler config file (which are only known after creating a custom security group or Network ACL)
    In general, I believe that we should decouple AWS autoscaler users from these responsibilities by allowing them to specify common security group configuration overrides to apply during default security group bootstrapping. To that end, these changes expose new AWS autoscaler configuration settings that allow users to specify custom inbound head/worker node security group rules to apply (via authorization of new rules or revocation of old rules) during config bootstrapping. Please refer to the included "example-security-groups.yaml" file for additional details. Also note that, if no custom security group rule config is specified, then prior default security group bootstrap behavior is maintained, namely:
  • Inbound rules permitting external connections to previously bootstrapped security groups are not modified.
  • Newly bootstrapped security groups allow all network traffic between head and worker nodes as well as external SSH connections from 0.0.0.0/0.
  1. These changes also fix a bug which allowed users to specify separate subnet ID configurations for AWS head and worker nodes, but then ignored the configured head node subnet ID during cluster security group bootstrapping. Previously, we would always bootstrap one security group for both head and worker nodes and always choose the VPC associated with a worker's subnet ID (thus failing to create a separate head node security group whenever its subnet ID configuration resolved to a different VPC). Now, separate security groups will be created for head and worker nodes whenever their respective subnet ID configurations resolve to separate VPCs.

It also may be worth noting that additional tests and documentation updates are expected to follow this pull request, but I wanted to collect early reviewer opinions now instead of opening a more complete (but monolithic) pull request later.

Related issue number

#8420

Checks

@AmplabJenkins
Copy link
Copy Markdown

Can one of the admins verify this patch?

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24160/
Test FAILed.

@pdames pdames force-pushed the aws-autoscaler-security-groups branch from 207a0aa to d0172ee Compare April 3, 2020 07:48
@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24185/
Test PASSed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, it's for boto tests later. These tests don't actually require creds right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct - the tests are stubbed/mocked out and don't interact with any real AWS account.

@richardliaw
Copy link
Copy Markdown
Contributor

hey @allenyin55, do you mind reviewing this PR?

@pdames pdames force-pushed the aws-autoscaler-security-groups branch from d0172ee to 57688b0 Compare April 7, 2020 01:18
@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24332/
Test FAILed.

@allenyin55
Copy link
Copy Markdown
Contributor

hey @allenyin55, do you mind reviewing this PR?

Might be tough this week, but I'll try to make some time for this.

pdames added 4 commits April 9, 2020 00:20
… inbound rule configuration and separate head/worker subnets.
…ult SSH inbound rules, and correct misleading AWS autoscaler example and test comments.
…ules, revoke unspecified rules by default, and revised security group config example.
@pdames pdames force-pushed the aws-autoscaler-security-groups branch from 57688b0 to acc44d1 Compare April 9, 2020 16:50
@pdames
Copy link
Copy Markdown
Member Author

pdames commented Apr 9, 2020

FYI, the latest update of this pull request includes changes to deduplicate equivalent, but unequal, rules as mentioned in #7844. Essentially, the rules boil down to simply resolving case-sensitivity differences between prefix and group list IDs, as well as canonicalizing IPv4 and IPv6 addresses. The canonicalization rules rely on fairly standard Java IP address libraries together with a few custom extensions that ultimately produce the same canonical string representations that we can get using https://docs.python.org/3/library/ipaddress.html:

canonical_cidr_ip_string = str(ipaddress.ip_network(cidr_ip_string, False))

@richardliaw
Copy link
Copy Markdown
Contributor

This PR is just exploding; can you split some of this up into separate PRs?

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24465/
Test PASSed.

@pdames
Copy link
Copy Markdown
Member Author

pdames commented Apr 9, 2020

This PR is just exploding; can you split some of this up into separate PRs?

I'm open to whatever makes your job easiest. There are a few ways that I could see us split it up going forward:

  1. Break the most recent changes for deduping equivalent but unequal rules and automatic revocation of all unlisted rules out into a separate PR (although it should be noted that represents a backwards-incompatible change, and that most of the new code volume here is from tests).
  2. Include only non-test, non-example changes in this PR. Include tests and examples in 1 or 2 subsequent PRs (helps break up focus by area, although I worry a bit about pushing code without accompanying tests to protect it from regresions).
  3. Leave it as it is for now, and simply ensure that any subsequent revisions to the logic, tests, or examples here go into new PRs.

Let me know if you have a preference (or if the preference perhaps isn't listed here).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants