1. Issue or feature description:
The CUDA validator pods in my cluster are stuck in Init:CrashLoopBackOff, while other CUDA vector-addition workloads run fine.
$ kubectl get po
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-cqjqx 1/1 Running 0 28m
gpu-feature-discovery-x4qh2 1/1 Running 0 28m
gpu-operator-77787587cf-cxnzl 1/1 Running 0 28m
nvidia-container-toolkit-daemonset-bk28j 1/1 Running 0 28m
nvidia-container-toolkit-daemonset-qvftc 1/1 Running 0 28m
nvidia-cuda-validator-ccn65 0/1 Init:CrashLoopBackOff 5 (23s ago) 3m29s
nvidia-cuda-validator-p7sgd 0/1 Init:CrashLoopBackOff 5 (16s ago) 3m18s
nvidia-dcgm-exporter-p5wrc 1/1 Running 0 28m
nvidia-dcgm-exporter-rvnz6 1/1 Running 0 28m
nvidia-device-plugin-daemonset-2bfmt 1/1 Running 0 28m
nvidia-device-plugin-daemonset-pvphw 1/1 Running 0 28m
nvidia-operator-validator-5qfx2 0/1 Init:2/4 5 (4m55s ago) 28m
nvidia-operator-validator-nc9n6 0/1 Init:2/4 5 (4m40s ago) 28m
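To pick the unhealthy pods out of a listing like the one above, a small filter can help. This is only a sketch: the function name `failing_pods` is hypothetical, and it assumes the default `kubectl get po` table layout, where STATUS is the third column.

```shell
# Print NAME and STATUS for pods that are neither Running nor Completed.
# Reads `kubectl get po` table output on stdin; skips the header row.
# Assumption: STATUS is the third whitespace-separated column.
failing_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl get po | failing_pods
```

Run against the listing above, this would report only the nvidia-cuda-validator and nvidia-operator-validator pods.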
On closer inspection, it is the vector addition that fails. The logs of the cuda-validation init container show:
$ kubectl logs nvidia-cuda-validator-ccn65 -c cuda-validation --previous
Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!
[Vector addition of 50000 elements]
However, I am able to deploy other CUDA vector-addition workloads, for example:
$ kubectl get po -n e2e-gpu-workload
NAME READY STATUS RESTARTS AGE
cuda-vector-add 0/1 Completed 0 29m
cuda-vector-add-2 0/1 CrashLoopBackOff 10 (2m19s ago) 14m
cuda-vector-add-3 0/1 CrashLoopBackOff 6 (4m19s ago) 11m
cuda-vector-add-4 0/1 Completed 0 9m34s
Printing the container images for the running and non-running pods gives:
$ kubectl get po -n e2e-gpu-workload -o custom-columns=CONTAINER:.spec.containers[0].name,IMAGE:.spec.containers[0].image
CONTAINER IMAGE
cuda-vectoradd nvidia/samples:vectoradd-cuda11.2.1
cuda-vectoradd nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.11.1
cuda-vectoradd docker.io/anjia0532/cuda-vector-add:v0.1
cuda-vectoradd docker.io/anjia0532/cuda-vector-add:v0.1
The logs of the cuda-vector-add pod show that everything works fine:
$ kubectl logs -n e2e-gpu-workload cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Expanding that pod's definition gives:
$ kubectl get po -n e2e-gpu-workload cuda-vector-add -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 9cbccd5849cd9f8a0b1670b84392deacfac4a55eaa05541ef83ab66df24de089
    cni.projectcalico.org/podIP: ""
    cni.projectcalico.org/podIPs: ""
  creationTimestamp: "2022-08-10T22:59:38Z"
  name: cuda-vector-add
  namespace: e2e-gpu-workload
  resourceVersion: "16028"
  uid: 17064311-b8a5-4ff3-bd4c-c3a9665d2ec4
spec:
  containers:
  - image: nvidia/samples:vectoradd-cuda11.2.1
    imagePullPolicy: IfNotPresent
    name: cuda-vectoradd
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-d7zbx
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeSelector:
    nvidia.com/gpu.count: "1"
2. Steps to reproduce the issue
I installed the operator with the following command:
helm install --wait --generate-name --set nfd.enabled=false --set driver.enabled=false --set toolkit.version=1.6.0-centos7 nvidia/gpu-operator
In my setup, I installed the NVIDIA device drivers and CUDA drivers on the host using the runfile, and I am using the operator to install the container runtime.
3. Information to attach (optional if deemed irrelevant)
When I run nvidia-smi on the host I get the following, which indicates that CUDA is installed correctly:
[root@ip-10-0-101-7 bin]# nvidia-smi
Wed Aug 10 23:40:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:1E.0 Off | 0 |
| N/A 35C P8 30W / 149W | 0MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Running a CUDA container directly on the host with ctr also works:
[root@ip-10-0-101-7 bin]# ctr run --rm --gpus 0 -t docker.io/nvidia/samples:vectoradd-cuda11.2.1 add2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
[root@ip-10-0-101-7 bin]# ls -la /usr/local/nvidia/toolkit
total 8548
drwxr-xr-x. 3 root root 4096 Aug 10 23:04 .
drwxr-xr-x. 3 root root 21 Aug 10 23:04 ..
drwxr-xr-x. 3 root root 38 Aug 10 23:04 .config
lrwxrwxrwx. 1 root root 28 Aug 10 23:04 libnvidia-container.so.1 -> libnvidia-container.so.1.4.0
-rwxr-xr-x. 1 root root 179192 Aug 10 23:04 libnvidia-container.so.1.4.0
-rwxr-xr-x. 1 root root 154 Aug 10 23:04 nvidia-container-cli
-rwxr-xr-x. 1 root root 43024 Aug 10 23:04 nvidia-container-cli.real
-rwxr-xr-x. 1 root root 342 Aug 10 23:04 nvidia-container-runtime
-rwxr-xr-x. 1 root root 350 Aug 10 23:04 nvidia-container-runtime-experimental
-rwxr-xr-x. 1 root root 3991000 Aug 10 23:04 nvidia-container-runtime.experimental
lrwxrwxrwx. 1 root root 24 Aug 10 23:04 nvidia-container-runtime-hook -> nvidia-container-toolkit
-rwxr-xr-x. 1 root root 2359384 Aug 10 23:04 nvidia-container-runtime.real
-rwxr-xr-x. 1 root root 198 Aug 10 23:04 nvidia-container-toolkit
-rwxr-xr-x. 1 root root 2147896 Aug 10 23:04 nvidia-container-toolkit.real
[root@ip-10-0-101-7 bin]#
Any guidance would be greatly appreciated, and please let me know how I can help.
Thank you.