[Bug] With eksctl 0.123.0 neuron device plugin fails to come up in a cluster #6063
Closed
Description
What were you trying to accomplish?
Use eksctl to create a cluster with Inferentia (inf1) nodes.
What happened?
The Neuron device plugin does not come up. Its pods stay in CreateContainerConfigError.
How to reproduce it?
eksctl create cluster \
--name inferentia \
--region us-east-1 \
--nodegroup-name ng-inf1 \
--node-type inf1.2xlarge \
--nodes 1 \
--nodes-min 1 \
--nodes-max 2 \
--ssh-access \
--ssh-public-key /home/ubuntu/.ssh/id_rsa.pub \
--with-oidc
Logs
ubuntu@ip-10-0-1-135:~$ kubectl get daemonset -A
NAMESPACE     NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   aws-node                         2         2         2       2            2           <none>          12m
kube-system   kube-proxy                       2         2         2       2            2           <none>          12m
kube-system   neuron-device-plugin-daemonset   2         2         0       2            0           <none>          2m17s

ubuntu@ip-10-0-1-135:~$ kubectl get pods -A
NAMESPACE     NAME                                   READY   STATUS                       RESTARTS   AGE
kube-system   aws-node-kbw6h                         1/1     Running                      0          4m16s
kube-system   aws-node-rzvs5                         1/1     Running                      0          4m16s
kube-system   coredns-57ff979f67-h8htx               1/1     Running                      0          13m
kube-system   coredns-57ff979f67-sqgpc               1/1     Running                      0          13m
kube-system   kube-proxy-9kgxf                       1/1     Running                      0          4m16s
kube-system   kube-proxy-s9vcn                       1/1     Running                      0          4m16s
kube-system   neuron-device-plugin-daemonset-4c8k5   0/1     CreateContainerConfigError   0          2m29s
kube-system   neuron-device-plugin-daemonset-5jhg6   0/1     CreateContainerConfigError   0          2m29s

ubuntu@ip-10-0-1-135:~$ kubectl describe pods -n kube-system neuron-device-plugin-daemonset-4c8k5
Name:                 neuron-device-plugin-daemonset-4c8k5
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-192-168-13-18.us-west-2.compute.internal/192.168.13.18
Start Time:           Thu, 15 Dec 2022 19:44:15 +0000
Labels:               controller-revision-hash=75b496489c
                      name=neuron-device-plugin-ds
                      pod-template-generation=1
Annotations:          kubernetes.io/psp: eks.privileged
                      scheduler.alpha.kubernetes.io/critical-pod:
Status:               Pending
IP:                   192.168.5.60
IPs:
  IP:           192.168.5.60
Controlled By:  DaemonSet/neuron-device-plugin-daemonset
Containers:
  k8s-neuron-device-plugin-ctr:
    Container ID:
    Image:          public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Environment:
      KUBECONFIG:  /etc/kubernetes/kubelet.conf
      NODE_NAME:    (v1:spec.nodeName)
    Mounts:
      /run from infa-map (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2d896 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  infa-map:
    Type:          HostPath (bare host directory volume)
    Path:          /run
    HostPathType:
  kube-api-access-2d896:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             aws.amazon.com/neuron:NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  2m46s                default-scheduler  Successfully assigned kube-system/neuron-device-plugin-daemonset-4c8k5 to ip-192-168-13-18.us-west-2.compute.internal
  Normal   Pulled     2m36s                kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 9.531102559s
  Normal   Pulled     2m35s                kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 178.860731ms
  Normal   Pulled     2m20s                kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 243.025135ms
  Normal   Pulled     2m6s                 kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 179.651576ms
  Normal   Pulled     113s                 kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 184.916333ms
  Normal   Pulled     99s                  kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 183.782961ms
  Normal   Pulled     86s                  kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 179.288398ms
  Warning  Failed     74s (x8 over 2m36s)  kubelet            Error: container has runAsNonRoot and image will run as root (pod: "neuron-device-plugin-daemonset-4c8k5_kube-system(61309d75-6ad9-4aa2-90b4-059a5d8d5c29)", container: k8s-neuron-device-plugin-ctr)
  Normal   Pulled     74s                  kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 173.571044ms
  Normal   Pulling    62s (x9 over 2m45s)  kubelet            Pulling image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0"
Anything else we need to know?
- It seems that with commit, runAsNonRoot is set to true for the Neuron device plugin in the YAML manifest. The image runs as root, so the kubelet refuses to create the container. If we remove that setting, the device plugin comes up fine.
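
The workaround described above could be checked and applied by hand roughly as follows. This is only a sketch under stated assumptions: the helper names has_run_as_non_root and remove_run_as_non_root are hypothetical, and the JSON patch path assumes the setting lives in the securityContext of the first container in the DaemonSet. If it is set at the pod level instead, the patch fails with an error and the path needs adjusting.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds if the manifest read from stdin contains
# a line of the form "runAsNonRoot: true".
has_run_as_non_root() {
  grep -Eq '^[[:space:]]*runAsNonRoot:[[:space:]]*true[[:space:]]*$'
}

# Hypothetical helper: removes runAsNonRoot from the DaemonSet in place.
# ASSUMPTION: the field sits in the first container's securityContext.
remove_run_as_non_root() {
  kubectl -n kube-system patch daemonset neuron-device-plugin-daemonset \
    --type=json \
    -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/securityContext/runAsNonRoot"}]'
}

# Example usage against a live cluster (not run automatically):
#   kubectl -n kube-system get daemonset neuron-device-plugin-daemonset -o yaml \
#     | has_run_as_non_root && remove_run_as_non_root
```

Patching the live DaemonSet only unblocks the existing cluster. Clusters created later with the same eksctl version would presumably hit the same error until the bundled manifest itself is changed.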