CUDA validators crashlooping while other CUDA containers run fine #389

_The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense._

### 1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [x] Are you running Kubernetes v1.13+?
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [ ] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)

### 1. Issue or feature description:

The CUDA validators in my cluster are in `Init:CrashLoopBackOff`, while other CUDA vector addition workloads run completely fine.

```sh 
$ kubectl get po 
NAME                                       READY   STATUS                  RESTARTS        AGE
gpu-feature-discovery-cqjqx                1/1     Running                 0               28m
gpu-feature-discovery-x4qh2                1/1     Running                 0               28m
gpu-operator-77787587cf-cxnzl              1/1     Running                 0               28m
nvidia-container-toolkit-daemonset-bk28j   1/1     Running                 0               28m
nvidia-container-toolkit-daemonset-qvftc   1/1     Running                 0               28m
nvidia-cuda-validator-ccn65                0/1     Init:CrashLoopBackOff   5 (23s ago)     3m29s
nvidia-cuda-validator-p7sgd                0/1     Init:CrashLoopBackOff   5 (16s ago)     3m18s
nvidia-dcgm-exporter-p5wrc                 1/1     Running                 0               28m
nvidia-dcgm-exporter-rvnz6                 1/1     Running                 0               28m
nvidia-device-plugin-daemonset-2bfmt       1/1     Running                 0               28m
nvidia-device-plugin-daemonset-pvphw       1/1     Running                 0               28m
nvidia-operator-validator-5qfx2            0/1     Init:2/4                5 (4m55s ago)   28m
nvidia-operator-validator-nc9n6            0/1     Init:2/4                5 (4m40s ago)   28m
```
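
For anyone trying to reproduce this, a small filter along these lines can pull only the unhealthy pods out of a listing like the one above. It is just a sketch: the function name `unhealthy_pods` is made up for illustration, and it assumes the default `kubectl get po` column layout (NAME, READY, STATUS, ...).

```sh
# Print NAME and STATUS for every pod whose STATUS is neither
# Running nor Completed (so Init:* and CrashLoopBackOff rows are kept).
# Assumes the default `kubectl get po` columns: NAME READY STATUS RESTARTS AGE.
# `unhealthy_pods` is a hypothetical helper name, not part of kubectl.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (requires cluster access):
#   kubectl get po | unhealthy_pods
```

The filter reads the listing from standard input, so it also works on saved output pasted from an issue like this one.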

On closer inspection, it is the vector add that fails. The logs of the `cuda-validation` init container show:

```
$ kubectl logs nvidia-cuda-validator-ccn65 -c cuda-validation --previous 
Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!
[Vector addition of 50000 elements]
``` 
However, I am able to deploy other CUDA vector-add workloads, for example:

```
$ kubectl get po -n e2e-gpu-workload 
NAME                READY   STATUS             RESTARTS         AGE
cuda-vector-add     0/1     Completed          0                29m
cuda-vector-add-2   0/1     CrashLoopBackOff   10 (2m19s ago)   14m
cuda-vector-add-3   0/1     CrashLoopBackOff   6 (4m19s ago)    11m
cuda-vector-add-4   0/1     Completed          0                9m34s
```

Taking another look, this prints the container images for the running and non-running pods:

```
$ kubectl get po -n e2e-gpu-workload -o custom-columns=CONTAINER:.spec.containers[0].name,IMAGE:.spec.containers[0].image
CONTAINER        IMAGE
cuda-vectoradd   nvidia/samples:vectoradd-cuda11.2.1
cuda-vectoradd   nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.11.1
cuda-vectoradd   docker.io/anjia0532/cuda-vector-add:v0.1
cuda-vectoradd   docker.io/anjia0532/cuda-vector-add:v0.1
```
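
The listing above omits pod names, so rows cannot be matched to pods directly. A variant like the following adds a `POD` column; it is a sketch only, the wrapper name `pod_images` is made up for illustration, and I have not included its output here.

```sh
# List pod name, first container name, and first container image
# for every pod in the given namespace.
# `pod_images` is a hypothetical wrapper; the columns use kubectl's
# standard custom-columns syntax (HEADER:json-path, comma separated).
pod_images() {
  kubectl get po -n "$1" \
    -o custom-columns='POD:.metadata.name,CONTAINER:.spec.containers[0].name,IMAGE:.spec.containers[0].image'
}

# Usage (requires cluster access):
#   pod_images e2e-gpu-workload
```

Quoting the column spec keeps the shell from treating `[0]` as a glob pattern.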
The logs of the `cuda-vector-add` pod show that it runs fine:

```
$ kubectl logs -n e2e-gpu-workload cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
``` 
Expanding that pod's definition gives:

```yaml 
$ kubectl get po -n e2e-gpu-workload cuda-vector-add -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 9cbccd5849cd9f8a0b1670b84392deacfac4a55eaa05541ef83ab66df24de089
    cni.projectcalico.org/podIP: ""
    cni.projectcalico.org/podIPs: ""
  creationTimestamp: "2022-08-10T22:59:38Z"
  name: cuda-vector-add
  namespace: e2e-gpu-workload
  resourceVersion: "16028"
  uid: 17064311-b8a5-4ff3-bd4c-c3a9665d2ec4
spec:
  containers:
  - image: nvidia/samples:vectoradd-cuda11.2.1
    imagePullPolicy: IfNotPresent
    name: cuda-vectoradd
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-d7zbx
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeSelector:
    nvidia.com/gpu.count: "1"
```


### 2. Steps to reproduce the issue

I installed the operator with the following command:

```
helm install --wait --generate-name  --set nfd.enabled=false --set driver.enabled=false --set toolkit.version=1.6.0-centos7  nvidia/gpu-operator
``` 

In my setup, I installed the NVIDIA device drivers and CUDA drivers on the host using the runfile, and I am using the operator to install the container runtime.


### 3. Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/) (optional if deemed irrelevant)

 - [ ] kubernetes pods status: `kubectl get pods --all-namespaces`
 - [ ] kubernetes daemonset status: `kubectl get ds --all-namespaces`
 - [ ] If a pod/ds is in an error state or pending state `kubectl describe pod -n NAMESPACE POD_NAME`
 - [ ] If a pod/ds is in an error state or pending state `kubectl logs -n NAMESPACE POD_NAME`

 - [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
 - [ ] Docker configuration file: `cat /etc/docker/daemon.json`
 - [ ] Docker runtime configuration: `docker info | grep runtime`

 - [ ] NVIDIA shared directory: `ls -la /run/nvidia`
 - [x] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
 - [ ] NVIDIA driver directory: `ls -la /run/nvidia/driver`
 - [ ] kubelet logs `journalctl -u kubelet > kubelet.logs`

When I run `nvidia-smi` on the host, I get the following, indicating that CUDA is installed correctly.

```
[root@ip-10-0-101-7 bin]# nvidia-smi
Wed Aug 10 23:40:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

Running a CUDA container directly on the host with `ctr` also works:

```
[root@ip-10-0-101-7 bin]# ctr run --rm --gpus 0 -t docker.io/nvidia/samples:vectoradd-cuda11.2.1 add2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

NVIDIA packages directory:

```
[root@ip-10-0-101-7 bin]# ls -la /usr/local/nvidia/toolkit
total 8548
drwxr-xr-x. 3 root root    4096 Aug 10 23:04 .
drwxr-xr-x. 3 root root      21 Aug 10 23:04 ..
drwxr-xr-x. 3 root root      38 Aug 10 23:04 .config
lrwxrwxrwx. 1 root root      28 Aug 10 23:04 libnvidia-container.so.1 -> libnvidia-container.so.1.4.0
-rwxr-xr-x. 1 root root  179192 Aug 10 23:04 libnvidia-container.so.1.4.0
-rwxr-xr-x. 1 root root     154 Aug 10 23:04 nvidia-container-cli
-rwxr-xr-x. 1 root root   43024 Aug 10 23:04 nvidia-container-cli.real
-rwxr-xr-x. 1 root root     342 Aug 10 23:04 nvidia-container-runtime
-rwxr-xr-x. 1 root root     350 Aug 10 23:04 nvidia-container-runtime-experimental
-rwxr-xr-x. 1 root root 3991000 Aug 10 23:04 nvidia-container-runtime.experimental
lrwxrwxrwx. 1 root root      24 Aug 10 23:04 nvidia-container-runtime-hook -> nvidia-container-toolkit
-rwxr-xr-x. 1 root root 2359384 Aug 10 23:04 nvidia-container-runtime.real
-rwxr-xr-x. 1 root root     198 Aug 10 23:04 nvidia-container-toolkit
-rwxr-xr-x. 1 root root 2147896 Aug 10 23:04 nvidia-container-toolkit.real
[root@ip-10-0-101-7 bin]#
```

Any guidance would be greatly appreciated, and please let me know how I can help. 

Thank you. 
