
feat(eks): enable ipv6 for eks cluster #25819

Merged
mergify[bot] merged 10 commits into aws:main from kishiel:main
Jun 8, 2023

Conversation

Contributor

@kishiel kishiel commented Jun 1, 2023

Description

This change enables IPv6 for EKS clusters.

Reasoning

  • IPv6-based EKS clusters will enable service owners to minimize or even eliminate the perils of IPv4 CIDR micromanagement
  • IPv6 will enable very-large-scale EKS clusters
  • My working group (Amazon SDO/ECST) recently attempted to enable IPv6 using the L1 Cfn EKS constructs, but failed after discovering a CDKv2 issue that results in a master-less EKS cluster. Rather than investing in fixing that interaction, we agreed to contribute IPv6 support to aws-eks (this PR)

Design

  • This change treats IPv4 as the default networking configuration
  • A new IpFamily enum is introduced so that users can specify IP_V4 or IP_V6 (see the sketch after this list)
  • This change originally added a new SAM layer dependency; the dependency was removed after validating that it was no longer necessary
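
To make the design concrete, here is a minimal sketch of how the new property is intended to be used. It is illustrative only: the stack and VPC names are placeholders, and an IPv6-capable VPC and subnet configuration (as in the full sample further below) is still required.

import { App, Stack, aws_ec2 as ec2, aws_eks as eks } from 'aws-cdk-lib';

const app = new App();
const stack = new Stack(app, 'ipv6-sketch-stack');

// the VPC must also carry an IPv6 CIDR block and IPv6-enabled subnets (see the full sample below)
const vpc = new ec2.Vpc(stack, 'Vpc');

new eks.Cluster(stack, 'Cluster', {
  version: eks.KubernetesVersion.V1_25, // any supported version; pinned here only for illustration
  vpc,
  ipFamily: eks.IpFamily.IP_V6, // new property; defaults to eks.IpFamily.IP_V4 when omitted
});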

Testing

I consulted with some team members about how best to approach testing this change, and I concluded that I should duplicate the eks-cluster test definition. I decided this was a better approach than redefining the existing cluster test to use IPv6, for a few reasons:

  1. EKS still requires IPv4 under the hood.
  2. IPv6 CIDR and subnet association isn't exactly straightforward. My example in eks-cluster-ipv6 is the simplest one I could come up with.
  3. There are additional permissions and routing configurations necessary to get the cluster tests to succeed. In my opinion, the differences were sufficient to motivate splitting out the test.

I ran into several issues running the test suite, primarily out-of-memory conditions that no amount of RAM appeared to help; setting NODE_OPTIONS=--max-old-space-size=8192 did not improve the issue, nor did increasing it to 12 GB. Edit: this ended up being a simple fix, but annoying to dig out. The fix is export NODE_OPTIONS=--max-old-space-size=8192; setting it in my .rc file did not stick, either. macOS Ventura, for those keeping score at home.

The bulk of my testing was performed using a sample stack definition (below), but I was unable to run the manual testing described in aws-eks/test/MANUAL_TEST.md because I had no access to the underlying node instances. Edit: I can run the MANUAL_TESTS now if that's deemed necessary.

Updated: this sample stack creates an IPv6-enabled cluster with an example nginx service running.

Sample:

import {
  App, Duration, Fn, Stack,
  aws_ec2 as ec2,
  aws_eks as eks,
  aws_iam as iam,
} from 'aws-cdk-lib';
import { getClusterVersionConfig } from './integ-tests-kubernetes-version';

const app = new App();
const env = { region: 'us-east-1', account: '' };
const stack = new Stack(app, 'my-v6-test-stack-1', { env });

const vpc = new ec2.Vpc(stack, 'Vpc', { maxAzs: 3, natGateways: 1, restrictDefaultSecurityGroup: false });
const ipv6cidr = new ec2.CfnVPCCidrBlock(stack, 'CIDR6', {
  vpcId: vpc.vpcId,
  amazonProvidedIpv6CidrBlock: true,
});

let subnetcount = 0;
let subnets = [...vpc.publicSubnets, ...vpc.privateSubnets];
for ( let subnet of subnets) {
  // Wait for the ipv6 cidr to complete
  subnet.node.addDependency(ipv6cidr);
  _associate_subnet_with_v6_cidr(subnetcount, subnet);
  subnetcount++;
}

const roles = _create_roles();

const cluster = new eks.Cluster(stack, 'Cluster', {
  ...getClusterVersionConfig(stack),
  vpc: vpc,
  clusterName: 'some-eks-cluster',
  defaultCapacity: 0,
  endpointAccess: eks.EndpointAccess.PUBLIC_AND_PRIVATE,
  ipFamily: eks.IpFamily.IP_V6,
  mastersRole: roles.masters,
  securityGroup: _create_eks_security_group(),
  vpcSubnets: [{ subnets: subnets }],
});

// add an extra nodegroup
cluster.addNodegroupCapacity('some-node-group', {
  instanceTypes: [new ec2.InstanceType('m5.large')],
  minSize: 1,
  nodeRole: roles.nodes,
});

// allow all IPv6 egress from the kubectl handler's security group
cluster.kubectlSecurityGroup?.addEgressRule(
  ec2.Peer.anyIpv6(), ec2.Port.allTraffic(),
);

// deploy an nginx ingress in a namespace
const nginxNamespace = cluster.addManifest('nginx-namespace', {
  apiVersion: 'v1',
  kind: 'Namespace',
  metadata: {
    name: 'nginx',
  },
});

const nginxIngress = cluster.addHelmChart('nginx-ingress', {
  chart: 'nginx-ingress',
  repository: 'https://helm.nginx.com/stable',
  namespace: 'nginx',
  wait: true,
  createNamespace: false,
  timeout: Duration.minutes(5),
});

// make sure namespace is deployed before the chart
nginxIngress.node.addDependency(nginxNamespace);

function _associate_subnet_with_v6_cidr(count: number, subnet: ec2.ISubnet) {
  const cfnSubnet = subnet.node.defaultChild as ec2.CfnSubnet;
  // carve the VPC's IPv6 block into 256 /64 subnets (128 - 64 subnet bits) and take the count-th one
  cfnSubnet.ipv6CidrBlock = Fn.select(count, Fn.cidr(Fn.select(0, vpc.vpcIpv6CidrBlocks), 256, (128 - 64).toString()));
  cfnSubnet.assignIpv6AddressOnCreation = true;
}

export function _create_eks_security_group(): ec2.SecurityGroup {
  let sg = new ec2.SecurityGroup(stack, 'eks-sg', {
    allowAllIpv6Outbound: true,
    allowAllOutbound: true,
    vpc,
  });
  sg.addIngressRule(
    ec2.Peer.ipv4('10.0.0.0/8'), ec2.Port.allTraffic(),
  );
  sg.addIngressRule(
    ec2.Peer.ipv6(Fn.select(0, vpc.vpcIpv6CidrBlocks)), ec2.Port.allTraffic(),
  );
  return sg;
}

export namespace Kubernetes {
  export interface RoleDescriptors {
    masters: iam.Role,
    nodes: iam.Role,
  }
}

function _create_roles(): Kubernetes.RoleDescriptors {
  const clusterAdminStatement = new iam.PolicyDocument({
    statements: [new iam.PolicyStatement({
      actions: [
        'eks:*',
        'iam:ListRoles',
      ],
      resources: ['*'],
    })],
  });

  const eksClusterAdminRole = new iam.Role(stack, 'AdminRole', {
    roleName: 'some-eks-master-admin',
    assumedBy: new iam.AccountRootPrincipal(),
    inlinePolicies: { clusterAdminStatement },
  });

  const assumeAnyRolePolicy = new iam.PolicyDocument({
    statements: [new iam.PolicyStatement({
      actions: [
        'sts:AssumeRole',
      ],
      resources: ['*'],
    })],
  });

  // allow nodes to assign and unassign IPv6 addresses on their network interfaces
  const ipv6Management = new iam.PolicyDocument({
    statements: [new iam.PolicyStatement({
      resources: ['arn:aws:ec2:*:*:network-interface/*'],
      actions: [
        'ec2:AssignIpv6Addresses',
        'ec2:UnassignIpv6Addresses',
      ],
    })],
  });

  const eksClusterNodeGroupRole = new iam.Role(stack, 'NodeGroupRole', {
    roleName: 'some-node-group-role',
    assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'),
    managedPolicies: [
      iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKSWorkerNodePolicy'),
      iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryReadOnly'),
      iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKS_CNI_Policy'),
      iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore'),
      iam.ManagedPolicy.fromAwsManagedPolicyName('CloudWatchAgentServerPolicy'),
    ],
    inlinePolicies: {
      assumeAnyRolePolicy,
      ipv6Management,
    },
  });

  return { masters: eksClusterAdminRole, nodes: eksClusterNodeGroupRole };
}

Issues

Edit: Fixed

Integration tests, specifically the new one I contributed, failed with an issue in describing a Fargate profile:

2023-06-01T16:24:30.127Z    6f9b8583-8440-4f13-a48f-28e09a261d40    INFO    {
    "describeFargateProfile": {
        "clusterName": "Cluster9EE0221C-f458e6dc5f544e9b9db928f6686c14d5",
        "fargateProfileName": "ClusterfargateprofiledefaultEF-1628f1c3e6ea41ebb3b0c224de5698b4"
    }
}
---------------------------
2023-06-01T16:24:30.138Z    6f9b8583-8440-4f13-a48f-28e09a261d40    INFO    {
    "describeFargateProfileError": {}
}
---------------------------
2023-06-01T16:24:30.139Z    6f9b8583-8440-4f13-a48f-28e09a261d40    ERROR    Invoke Error     {
    "errorType": "TypeError",
    "errorMessage": "getEksClient(...).describeFargateProfile is not a function",
    "stack": [
        "TypeError: getEksClient(...).describeFargateProfile is not a function",
        "    at Object.describeFargateProfile (/var/task/index.js:27:51)",
        "    at FargateProfileResourceHandler.queryStatus (/var/task/fargate.js:83:67)",
        "    at FargateProfileResourceHandler.isUpdateComplete (/var/task/fargate.js:49:35)",
        "    at FargateProfileResourceHandler.isCreateComplete (/var/task/fargate.js:46:21)",
        "    at FargateProfileResourceHandler.isComplete (/var/task/common.js:31:40)",
        "    at Runtime.isComplete [as handler] (/var/task/index.js:50:21)",
        "    at Runtime.handleOnceNonStreaming (/var/runtime/Runtime.js:74:25)"
    ]
}

I am uncertain whether this is an existing issue, one introduced by this change, or something related to my local build. Again, I had abundant issues building aws-cdk and running the test suites, seemingly depending on Jupiter's position in the sky.

Collaborators

Most of the work in this change was performed by @wlami and @jagu-sayan (thank you!)

Fixes #18423


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license


Labels

  • beginning-contributor: [Pilot] contributed between 0-2 PRs to the CDK
  • effort/small: Small work item, less than a day of effort
  • feature-request: A feature should be added or improved
  • p2
  • pr/needs-community-review: This PR needs a review from a Trusted Community Member or Core Team Member
