What happened: We tried to update our cluster to use version 1.26.8-20230919 for the nodes, and the newly launched EC2 instances repeatedly failed to join the cluster.
What you expected to happen: The new nodes should successfully join the cluster.
How to reproduce it (as minimally and precisely as possible): Start a cluster with an older AMI and then update to v1.26.8-20230919.
Anything else we need to know?:
I have looked into it and here is what I found:
- The node does not start because it fails to pull the pause image
- In the containerd logs we see messages like:
The image 602401143452.dkr.ecr.ap-south-2\nap-south-1\neu-south-1\neu-south-2\nme-central-1\nil-central-1\nca-central-1\neu-central-1\neu-central-2\nus-west-1\nus-west-2\naf-south[...]amazonaws.com/eks/pause:3.5 is not unpacked.
- That image name does not look correct to me. Instead of a single region, it contains all the regions, separated by newlines.
- Looking into the git commits of this repo, we find: be7bc10
- In the script (from the commit mentioned above), the variable REGIONS contains a newline-separated list of all regions
- The loop in line 469 does not work. Instead of iterating over the regions, it runs only once, placing the whole (newline-separated) list of regions into the region variable
- This broken region variable is then used in the next few lines to build the image names, resulting in broken, un-pullable image references.
I think these broken image names prevent the node from booting up properly.
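The looping bug can be illustrated with a minimal shell sketch (the variable name REGIONS matches the script, but the region list, loop bodies, and counters here are illustrative, not copied from be7bc10): when a newline-separated list is expanded inside double quotes, the for loop receives it as a single word and its body runs exactly once.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the loop bug described above, not the actual script.
REGIONS="ap-south-2
ap-south-1
eu-south-1"

# Broken pattern: the quoted expansion hands the whole newline-separated
# list to the loop as ONE word, so the body runs a single time and any
# image reference built from $region contains embedded newlines:
broken_count=0
for region in "$REGIONS"; do
  broken_count=$((broken_count + 1))
done
echo "broken loop iterations: $broken_count"   # runs once, not three times

# Fixed pattern: read the list line by line so each iteration sees
# exactly one region and builds one valid image reference:
fixed_count=0
while IFS= read -r region; do
  fixed_count=$((fixed_count + 1))
  echo "602401143452.dkr.ecr.${region}.amazonaws.com/eks/pause:3.5"
done <<< "$REGIONS"
echo "fixed loop iterations: $fixed_count"
```

An unquoted expansion (`for region in $REGIONS`) would also split on newlines under the default IFS; the sketch just shows how a single mis-handled expansion yields the newline-filled image name seen in the containerd log above.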
Environment:
- AWS Region: eu-west-1
- Instance Type(s): t3a.large
- EKS Platform version (use `aws eks describe-cluster --name <name> --query cluster.platformVersion`): eks.6
- Kubernetes version (use `aws eks describe-cluster --name <name> --query cluster.version`): 1.26
- AMI Version: v1.26.8-20230919
- Kernel (e.g. `uname -a`): 5.10.192-183.736.amzn2.x86_64
- Release information (run `cat /etc/eks/release` on a node):

```
BASE_AMI_ID="ami-0bac1825e471f5042"
BUILD_TIME="Mon Oct 2 20:39:23 UTC 2023"
BUILD_KERNEL="5.10.192-183.736.amzn2.x86_64"
ARCH="x86_64"
```