CUDA version label mismatch #391

@jacobtomlinson


1. Issue or feature description

When the operator is installed on nodes that already have a driver installed, the CUDA version labels attached to the node do not match the installed CUDA version.

I have driver 495.46 and CUDA 11.5 installed on my workstation, but the node labels show CUDA 11.6.

2. Steps to reproduce the issue

Create a cluster with driver 495.46 installed on the nodes (I use kind on an Ubuntu workstation for k8s development).

Install the operator with driver.enabled=false.

$ helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false

Check nvidia-smi on node (shows driver 495.46 and CUDA 11.5):

$ nvidia-smi
Fri Aug 12 11:25:23 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 495.46       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+

Check nvidia-smi in a pod (shows driver 495.46 and CUDA 11.5):

$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-version-check
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-version-check
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF
pod/nvidia-version-check created

$ kubectl logs nvidia-version-check
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 495.46       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+

$ kubectl delete pod nvidia-version-check
pod "nvidia-version-check" deleted

Check node labels created by the operator (shows driver 495.46 and CUDA 11.6):

$ kubectl describe node -A | grep nvidia.com/cuda
                    nvidia.com/cuda.driver.major=495
                    nvidia.com/cuda.driver.minor=46
                    nvidia.com/cuda.driver.rev=
                    nvidia.com/cuda.runtime.major=11
                    nvidia.com/cuda.runtime.minor=6
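
The manual comparison above could be scripted roughly as follows. This is only a sketch: the function names `cuda_from_smi`, `cuda_from_labels`, and `compare_cuda` are made up for illustration and are not part of the operator or any NVIDIA tooling. It assumes the `nvidia-smi` header and the `nvidia.com/cuda.runtime.*` label lines look like the output shown in this issue.

```shell
#!/bin/sh
# Hypothetical helpers for comparing the CUDA version reported by
# nvidia-smi with the version encoded in the node labels.

# Read nvidia-smi output on stdin and print the "CUDA Version" value
# from the header line, e.g. "11.5".
cuda_from_smi() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Read `kubectl describe node` output on stdin and print "major.minor"
# built from the nvidia.com/cuda.runtime.* labels. If several nodes are
# described, the last node's values win.
cuda_from_labels() {
  awk -F= '
    /nvidia\.com\/cuda\.runtime\.major=/ { major = $2 }
    /nvidia\.com\/cuda\.runtime\.minor=/ { minor = $2 }
    END { if (major != "" && minor != "") print major "." minor }
  '
}

# Print a one-line verdict and return non-zero when the versions differ.
compare_cuda() {
  if [ "$1" = "$2" ]; then
    echo "match: $1"
  else
    echo "mismatch: nvidia-smi=$1 labels=$2"
    return 1
  fi
}

# Example usage (run on a machine with both nvidia-smi and kubectl):
#   smi=$(nvidia-smi | cuda_from_smi)
#   labels=$(kubectl describe node | cuda_from_labels)
#   compare_cuda "$smi" "$labels"
```

With the outputs shown in this issue, the two helpers would yield 11.5 and 11.6 respectively, so `compare_cuda` would report a mismatch.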
