feat(eks): support INF2 instance types #27373
aws-cdk-automation
left a comment
The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.
A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed add Clarification Request to a comment.
kaizencc
left a comment
are the sagemaker changes and the eks changes independent of each other? if so, I'd prefer them being included in two separate PRs. Otherwise, this looks largely ok. The integ test will need to be run, however.
   * ml.inf2.48xlarge
   */
  public static readonly INF2_48XLARGE = InstanceType.of('ml.inf2.48xlarge');
do we have a place to unit test at least one of these in sagemaker, just for sanity?
cluster.addAutoScalingGroupCapacity('InferenceInstances', {
  instanceType: new ec2.InstanceType('inf2.xlarge'),
  minCapacity: 1,
});
you will have to run the integ test to update the snapshots. do you have capacity to do that?
I made further changes: I duplicated the integ test, one with an inf1 ASG and the other with inf2.
It is currently in a failed state, and I'm not sure whether I need to do something manually somewhere:
aws-cdk-eks-cluster-inf1-test: destroy failed Error: The stack named aws-cdk-eks-cluster-inf1-test is in a failed state. You may need to delete it from the AWS console : DELETE_FAILED (The following resource(s) failed to delete: [ClusterNodegroupDefaultCapacityNodeGroupRole55953B04, ClusterInf1InstancesInstanceRole67C931E4]. )
does it say why it failed? the integ test should be able to be successfully deployed and deleted.
actually i can try to run this for you. our eks integ tests take forever and are wonky :(
✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.
Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).
fix: add INF2 support to (1) isGpuInstanceType, so the correct AMI is selected, and (2) neuron-device-plugin-daemonset
INF2 is currently (wrongly) missing from the list of instance types that map to GPU AMIs; this change adds it.
inf2 was also missing from neuron-device-plugin-daemonset; this change adds it there as well.
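For illustration only, here is a minimal sketch of the kind of prefix check the first change touches. It is not the actual `isGpuInstanceType` code: the names `ACCELERATED_PREFIXES` and `needsAcceleratedAmi` and the prefix list itself are hypothetical, and the real list in the aws-eks module may differ.

```typescript
// Hypothetical prefix list (an assumption for this sketch, not the CDK source).
// Before a change like this one, 'inf2' would be absent and inf2 nodes would
// fall through to the standard AMI.
const ACCELERATED_PREFIXES: readonly string[] = ['p2', 'p3', 'g4', 'inf1', 'inf2'];

// Extracts the family part of an instance type, e.g. 'inf2' from 'inf2.xlarge'.
function instanceFamily(instanceType: string): string {
  return instanceType.split('.')[0];
}

// Returns true when the instance family starts with one of the listed prefixes.
function needsAcceleratedAmi(instanceType: string): boolean {
  const family = instanceFamily(instanceType);
  return ACCELERATED_PREFIXES.some((prefix) => family.startsWith(prefix));
}

console.log(needsAcceleratedAmi('inf2.xlarge')); // true
console.log(needsAcceleratedAmi('m5.large'));    // false
```

With a list-driven check like this, supporting a new family is a one-entry change, which is why a missing entry silently selects the wrong AMI rather than failing loudly.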
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license