[Bug] With eksctl 0.123.0 neuron device plugin fails to come up in a cluster #6063

@aws-vrnatham

What were you trying to accomplish?

Use eksctl to create a cluster with an Inferentia (inf1) nodegroup.

What happened?

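The Neuron device plugin does not come up. The neuron-device-plugin-daemonset pods stay at 0/1 ready with status CreateContainerConfigError.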

How to reproduce it?

eksctl create cluster \
    --name inferentia \
    --region us-east-1 \
    --nodegroup-name ng-inf1 \
    --node-type inf1.2xlarge \
    --nodes 1 \
    --nodes-min 1 \
    --nodes-max 2 \
    --ssh-access \
    --ssh-public-key /home/ubuntu/.ssh/id_rsa.pub \
    --with-oidc

Logs

ubuntu@ip-10-0-1-135:~$ kubectl get daemonset -A
NAMESPACE     NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   aws-node                         2         2         2       2            2           <none>          12m
kube-system   kube-proxy                       2         2         2       2            2           <none>          12m
kube-system   neuron-device-plugin-daemonset   2         2         0       2            0           <none>          2m17s

ubuntu@ip-10-0-1-135:~$ kubectl get pods -A
NAMESPACE     NAME                                   READY   STATUS                       RESTARTS   AGE
kube-system   aws-node-kbw6h                         1/1     Running                      0          4m16s
kube-system   aws-node-rzvs5                         1/1     Running                      0          4m16s
kube-system   coredns-57ff979f67-h8htx               1/1     Running                      0          13m
kube-system   coredns-57ff979f67-sqgpc               1/1     Running                      0          13m
kube-system   kube-proxy-9kgxf                       1/1     Running                      0          4m16s
kube-system   kube-proxy-s9vcn                       1/1     Running                      0          4m16s
kube-system   neuron-device-plugin-daemonset-4c8k5   0/1     CreateContainerConfigError   0          2m29s
kube-system   neuron-device-plugin-daemonset-5jhg6   0/1     CreateContainerConfigError   0          2m29s

ubuntu@ip-10-0-1-135:~$ kubectl describe pods -n kube-system neuron-device-plugin-daemonset-4c8k5
Name:                 neuron-device-plugin-daemonset-4c8k5
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-192-168-13-18.us-west-2.compute.internal/192.168.13.18
Start Time:           Thu, 15 Dec 2022 19:44:15 +0000
Labels:               controller-revision-hash=75b496489c
                      name=neuron-device-plugin-ds
                      pod-template-generation=1
Annotations:          kubernetes.io/psp: eks.privileged
                      scheduler.alpha.kubernetes.io/critical-pod:
Status:               Pending
IP:                   192.168.5.60
IPs:
  IP:           192.168.5.60
Controlled By:  DaemonSet/neuron-device-plugin-daemonset
Containers:
  k8s-neuron-device-plugin-ctr:
    Container ID:
    Image:          public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Environment:
      KUBECONFIG:  /etc/kubernetes/kubelet.conf
      NODE_NAME:    (v1:spec.nodeName)
    Mounts:
      /run from infa-map (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2d896 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  infa-map:
    Type:          HostPath (bare host directory volume)
    Path:          /run
    HostPathType:
  kube-api-access-2d896:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             aws.amazon.com/neuron:NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  2m46s                default-scheduler  Successfully assigned kube-system/neuron-device-plugin-daemonset-4c8k5 to ip-192-168-13-18.us-west-2.compute.internal
  Normal   Pulled     2m36s                kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 9.531102559s
  Normal   Pulled     2m35s                kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 178.860731ms
  Normal   Pulled     2m20s                kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 243.025135ms
  Normal   Pulled     2m6s                 kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 179.651576ms
  Normal   Pulled     113s                 kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 184.916333ms
  Normal   Pulled     99s                  kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 183.782961ms
  Normal   Pulled     86s                  kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 179.288398ms
  Warning  Failed     74s (x8 over 2m36s)  kubelet            Error: container has runAsNonRoot and image will run as root (pod: "neuron-device-plugin-daemonset-4c8k5_kube-system(61309d75-6ad9-4aa2-90b4-059a5d8d5c29)", container: k8s-neuron-device-plugin-ctr)
  Normal   Pulled     74s                  kubelet            Successfully pulled image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0" in 173.571044ms
  Normal   Pulling    62s (x9 over 2m45s)  kubelet            Pulling image "public.ecr.aws/neuron/neuron-device-plugin:2.1.2.0"

Anything else we need to know?

  • It seems that a commit set runAsNonRoot to true for the Neuron device plugin in the YAML, while the plugin image runs as root. If we remove that setting, the device plugin comes up fine.
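
For anyone hitting the same thing, here is a rough sketch of how the check and the workaround described above might be scripted. The function names are hypothetical helpers, not part of eksctl or kubectl. The JSON patch path is an assumption: it presumes runAsNonRoot sits in the securityContext of the first container in the DaemonSet pod template, which should be confirmed against the live manifest before patching.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and not part of eksctl or kubectl.

# Reads `kubectl describe pod` output on stdin and succeeds if the kubelet
# reported the runAsNonRoot failure quoted in the Events section of the logs.
has_run_as_non_root_error() {
    grep -q 'container has runAsNonRoot and image will run as root'
}

# ASSUMPTION: runAsNonRoot is set in the securityContext of the first
# container in the DaemonSet pod template. Check where it actually lives first:
#   kubectl -n kube-system get daemonset neuron-device-plugin-daemonset -o yaml
remove_run_as_non_root() {
    kubectl -n kube-system patch daemonset neuron-device-plugin-daemonset \
        --type=json \
        -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/securityContext/runAsNonRoot"}]'
}
```

One possible way to use it, selecting the pods by the name=neuron-device-plugin-ds label shown in the describe output:

    kubectl describe pods -n kube-system -l name=neuron-device-plugin-ds | has_run_as_non_root_error && remove_run_as_non_root

This only edits the running DaemonSet, so it is a stopgap rather than a fix for the manifest that eksctl applies.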
