Adding a new node with a different os-image and kernel-version ends up mixing up the nvidia-driver-daemonset(s) #1622

@ScottWatsonWork

Describe the bug
I was running a node with os-image 1592.11 and kernel-version 6.6.95-cloud-amd64, using driver version 570.172.08.

I have now added a new node to the cluster that runs os-image 1877.2 and kernel-version 6.12.40-cloud-amd64, with the same driver version 570.172.08.

The cluster policy has not changed, since we are running the same driver version.

The GPU operator creates a new daemonset whose name appears correct: it contains the right os-image and kernel-version.

k get ds -l app=nvidia-driver-daemonset
NAME                                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                          AGE
nvidia-driver-daemonset-6.12.40-cloud-amd64-gardenlinux1877.2   1         1         0       0            0           feature.node.kubernetes.io/kernel-version.full=6.12.40-cloud-amd64,nvidia.com/gpu.deploy.driver=true   31m
nvidia-driver-daemonset-6.6.95-cloud-amd64-gardenlinux1592.11   1         1         1       0            1           feature.node.kubernetes.io/kernel-version.full=6.6.95-cloud-amd64,nvidia.com/gpu.deploy.driver=true    3h2m

The problem is that the image associated with the daemonset is wrong. In the example below, the ds is for os-image 1592.11, but the image it is trying to pull is for 1877.2:

kubectl  get ds nvidia-driver-daemonset-6.6.95-cloud-amd64-gardenlinux1592.11 -o yaml | yq '.spec.template.spec.containers[0].image'
icx-data-ai-devops-docker.common.repositories.cloud.sap/nvidia/driver:570.172.08-6.6.95-cloud-amd64-gardenlinux1877.2
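The same name-versus-image comparison can be run over every driver daemonset at once. The following is only a sketch: `check_driver_ds` is a hypothetical helper (not part of the GPU operator), and it assumes that the daemonset name is `nvidia-driver-daemonset-<kernel>-<os>` and that the image tag ends in `-<kernel>-<os>`, as in the output above. Images referenced by digest are not handled.

```shell
# check_driver_ds: hypothetical helper, not part of the GPU operator.
# Reads "NAME IMAGE" pairs on stdin and reports whether the image tag
# ends with the kernel/os suffix that appears in the daemonset name.
check_driver_ds() {
  local prefix="nvidia-driver-daemonset-"
  local name image suffix tag
  while read -r name image; do
    [ -n "$name" ] || continue
    suffix="${name#"$prefix"}"   # e.g. 6.6.95-cloud-amd64-gardenlinux1592.11
    tag="${image##*:}"           # text after the last ':' in the image reference
    case "$tag" in
      *-"$suffix") echo "OK $name" ;;
      *)           echo "MISMATCH $name -> $tag" ;;
    esac
  done
}

# Possible usage against a live cluster (add -n <namespace> if needed):
# kubectl get ds -l app=nvidia-driver-daemonset \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.template.spec.containers[0].image}{"\n"}{end}' \
#   | check_driver_ds
```

Running this a few times while the operator reconciles should show whether the mismatch moves from one daemonset to the other.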

The operator then redeploys the daemonset, and the image value can flip again.

Checking just now shows that it has gone the other way:

kubectl get ds nvidia-driver-daemonset-6.6.95-cloud-amd64-gardenlinux1592.11 -o yaml | yq '.spec.template.spec.containers[0].image'
icx-data-ai-devops-docker.common.repositories.cloud.sap/nvidia/driver:570.172.08-6.6.95-cloud-amd64-gardenlinux1592.11

kubectl get ds nvidia-driver-daemonset-6.12.40-cloud-amd64-gardenlinux1877.2 -o yaml | yq '.spec.template.spec.containers[0].image'
icx-data-ai-devops-docker.common.repositories.cloud.sap/nvidia/driver:570.172.08-6.12.40-cloud-amd64-gardenlinux1592.11

One thing I did notice is that selecting pods by the daemonsets' label returns all of the driver pods, not just the ones tied to a specific os-image and kernel-version.

For example:

k get pod -l app=nvidia-driver-daemonset
NAME                                                              READY   STATUS             RESTARTS   AGE
nvidia-driver-daemonset-6.12.40-cloud-amd64-gardenlinux187lmfxk   0/1     Terminating        0          5s
nvidia-driver-daemonset-6.6.95-cloud-amd64-gardenlinux1592dnprr   0/1     PodInitializing    0          13s
nvidia-driver-daemonset-6.6.95-cloud-amd64-gardenlinux1592xr5l6   0/1     ImagePullBackOff   0          10m

k get ds nvidia-driver-daemonset-6.12.40-cloud-amd64-gardenlinux1877.2 -o json | jq '.metadata.name, .metadata.labels'
"nvidia-driver-daemonset-6.12.40-cloud-amd64-gardenlinux1877.2"
{
  "app": "nvidia-driver-daemonset",
  "app.kubernetes.io/component": "nvidia-driver",
  "app.kubernetes.io/managed-by": "gpu-operator",
  "helm.sh/chart": "gpu-operator-v25.3.2",
  "nvidia.com/precompiled": "true"
}

k get ds nvidia-driver-daemonset-6.6.95-cloud-amd64-gardenlinux1592.11 -o json | jq '.metadata.name, .metadata.labels'
"nvidia-driver-daemonset-6.6.95-cloud-amd64-gardenlinux1592.11"
{
  "app": "nvidia-driver-daemonset",
  "app.kubernetes.io/component": "nvidia-driver",
  "app.kubernetes.io/managed-by": "gpu-operator",
  "helm.sh/chart": "gpu-operator-v25.3.2",
  "nvidia.com/precompiled": "true"
}
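
The output above shows `.metadata.labels`, but a daemonset claims its pods through `.spec.selector`, which is not shown in this issue. Whether the two daemonsets share a selector is therefore an assumption to check, not a finding. A minimal sketch for that check, where `shared_selectors` is a hypothetical helper:

```shell
# shared_selectors: hypothetical helper, not part of the GPU operator.
# Reads "NAME SELECTOR" lines on stdin (the selector is everything after
# the first field) and prints each selector used by more than one daemonset.
shared_selectors() {
  awk '
    NF >= 2 {
      name = $1
      sel = $0
      sub(/^[^ \t]+[ \t]+/, "", sel)       # drop the name, keep the selector text
      if (count[sel]++ == 0) names[sel] = name
      else names[sel] = names[sel] "," name
    }
    END {
      for (sel in count)
        if (count[sel] > 1)
          print "SHARED " sel " by " names[sel]
    }'
}

# Possible usage against a live cluster (add -n <namespace> if needed):
# kubectl get ds -l app=nvidia-driver-daemonset \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.selector.matchLabels}{"\n"}{end}' \
#   | shared_selectors
```

To see which daemonset actually owns each pod, `kubectl get pod -l app=nvidia-driver-daemonset -o custom-columns='POD:.metadata.name,OWNER:.metadata.ownerReferences[0].name,NODE:.spec.nodeName'` lists the owner reference next to the pod name.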

To Reproduce
Add a new node to your cluster that uses a different os-image and kernel-version and wait for it to be picked up.
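
Before looking at the daemonsets, it may help to confirm that the cluster really contains nodes with more than one kernel version. This is a sketch only; `nodes_per_kernel` is a hypothetical helper that counts nodes per kernel from "NODE KERNEL" lines.

```shell
# nodes_per_kernel: hypothetical helper.
# Reads "NODE KERNEL" lines on stdin and prints "KERNEL COUNT" per kernel.
nodes_per_kernel() {
  awk 'NF >= 2 { n[$2]++ } END { for (k in n) print k, n[k] }' | sort
}

# Possible usage against a live cluster:
# kubectl get nodes \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kernelVersion}{"\n"}{end}' \
#   | nodes_per_kernel
```

More than one line of output means the operator is expected to manage more than one precompiled driver daemonset.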

Expected behavior
I expect each driver daemonset to target the correct node with the correct image. Note that the daemonset's node selector is correct and targets the proper kernel-version.full.
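
To make the expected image concrete: going by the tags seen in this issue, the tag looks like `<driver-version>-<kernel-version>-<os>`. The sketch below builds that tag for a given node; `expected_driver_tag` is a hypothetical helper and the tag layout is inferred from the output above, not taken from the operator's source.

```shell
# expected_driver_tag: hypothetical helper.
# Builds the image tag expected for a node, assuming the layout
# <driver-version>-<kernel-version>-<os> observed in this issue.
expected_driver_tag() {
  if [ "$#" -ne 3 ]; then
    echo "usage: expected_driver_tag DRIVER KERNEL OS" >&2
    return 1
  fi
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Example for the new node described in this issue:
# expected_driver_tag 570.172.08 6.12.40-cloud-amd64 gardenlinux1877.2
```

Comparing this value with the image actually set on the 6.12.40 daemonset shows whether the os part of the tag has been swapped.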

Environment (please provide the following information):

  • GPU Operator Version: v25.3.2
  • OS: gardenlinux
  • Kernel Version: [6.6.95-cloud-amd64, 6.12.40-cloud-amd64]
  • Container Runtime Version: [containerd://1.7.23, containerd://2.1.4]
  • Kubernetes Distro and Version: [SAP Gardener 1.32.2]

Labels

bug (Issue/PR to expose/discuss/fix a bug)
