Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
After upgrading to version v25.3.3 of gpu-operator, the CONTAINERD_SOCKET environment variable defined in the .toolkit.env configuration is being ignored.
The release notes for v25.3.3 seem to mention something related to environment variables, so it might be related.
The same configuration works correctly when reverting to v25.3.2. Below is the describe output for v25.3.2; for v25.3.3, please see the "Information to attach" section.
v25.3.2 describe
Name: nvidia-container-toolkit-daemonset-hnxph
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-container-toolkit
Node: host-1/192.168.1.1
Start Time: Sat, 13 Sep 2025 00:11:05 +0000
Labels: app=nvidia-container-toolkit-daemonset
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=7c4599bb9f
helm.sh/chart=gpu-operator-v25.3.2
pod-template-generation=15
Annotations: <none>
Status: Running
IP: 10.42.0.229
IPs:
IP: 10.42.0.229
Controlled By: DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
driver-validation:
Container ID: containerd://35f13b7e15f078a0791efd1c1b54b8695e4bef172bc6ea074c1a45fe9cc0e9dc
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.2
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:e183dc07e5889bd9e269c320ffad7f61df655f57ecc3aa158c4929e74528420a
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 13 Sep 2025 00:11:06 +0000
Finished: Sat, 13 Sep 2025 00:13:06 +0000
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-dir (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-62fmq (ro)
Containers:
nvidia-container-toolkit-ctr:
Container ID: containerd://e6cf1a6c89bd653cc60eef4b5b2d17c73774fd2c80852575373a92c29e1e98ac
Image: nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04
Image ID: nvcr.io/nvidia/k8s/container-toolkit@sha256:d90dd628828082d61ea2334dc5dbfe7104a160ddea5ff4e0d44e12dee24c10f6
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
Args:
/bin/entrypoint.sh
State: Running
Started: Sat, 13 Sep 2025 00:13:07 +0000
Ready: True
Restart Count: 0
Environment:
ROOT: /usr/local/nvidia
NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND: management.nvidia.com/gpu
NVIDIA_VISIBLE_DEVICES: void
TOOLKIT_PID_FILE: /run/nvidia/toolkit/toolkit.pid
CONTAINERD_CONFIG: /runtime/config-dir/config.toml
CONTAINERD_SOCKET: /runtime/sock-dir/containerd.sock
CONTAINERD_RUNTIME_CLASS: nvidia
CONTAINERD_SET_AS_DEFAULT: true
CDI_ENABLED: true
NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES: nvidia.cdi.k8s.io/
CRIO_CONFIG_MODE: config
NVIDIA_CONTAINER_RUNTIME_MODE: cdi
RUNTIME: containerd
RUNTIME_CONFIG: /runtime/config-dir/config.toml
RUNTIME_SOCKET: /runtime/sock-dir/containerd.sock
Mounts:
/bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
/driver-root from driver-install-dir (rw)
/host from host-root (ro)
/run/nvidia/toolkit from toolkit-root (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/runtime/config-dir/ from containerd-config (rw)
/runtime/sock-dir/ from containerd-socket (rw)
/usr/local/nvidia from toolkit-install-dir (rw)
/usr/share/containers/oci/hooks.d from crio-hooks (rw)
/var/run/cdi from cdi-root (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-62fmq (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
nvidia-container-toolkit-entrypoint:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nvidia-container-toolkit-entrypoint
Optional: false
toolkit-root:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/toolkit
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-dir:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
toolkit-install-dir:
Type: HostPath (bare host directory volume)
Path: /usr/local/nvidia
HostPathType:
crio-hooks:
Type: HostPath (bare host directory volume)
Path: /run/containers/oci/hooks.d
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
cdi-root:
Type: HostPath (bare host directory volume)
Path: /var/run/cdi
HostPathType: DirectoryOrCreate
containerd-config:
Type: HostPath (bare host directory volume)
Path: /var/lib/rancher/k3s/agent/etc/containerd
HostPathType: DirectoryOrCreate
containerd-socket:
Type: HostPath (bare host directory volume)
Path: /run/k3s/containerd
HostPathType:
kube-api-access-62fmq:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.container-toolkit=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m31s default-scheduler Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-hnxph to host-1
Normal Pulled 3m31s kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.2" already present on machine
Normal Created 3m31s kubelet Created container: driver-validation
Normal Started 3m30s kubelet Started container driver-validation
Normal Pulled 89s kubelet Container image "nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04" already present on machine
Normal Created 89s kubelet Created container: nvidia-container-toolkit-ctr
Normal Started 89s kubelet Started container nvidia-container-toolkit-ctr
To Reproduce
Configure CONTAINERD_SOCKET for the toolkit in the Helm values:
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
Expected behavior
The toolkit is able to communicate with containerd, as it does on v25.3.2.
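For context, the v25.3.2 describe output above suggests that the host socket path from toolkit.env was split into a hostPath volume (/run/k3s/containerd) and an in-container path under /runtime/sock-dir/. A minimal Python sketch of that mapping follows. It only mirrors what the describe output shows and is not taken from the gpu-operator source; the helper name `translate_socket` is hypothetical.

```python
import posixpath
from typing import Tuple

# Mount point for the containerd-socket volume, as shown in the pod spec.
CONTAINER_SOCK_DIR = "/runtime/sock-dir"


def translate_socket(host_socket: str) -> Tuple[str, str]:
    """Hypothetical helper: split a host socket path into the hostPath
    directory to mount and the socket path as seen inside the container."""
    host_dir, sock_name = posixpath.split(host_socket)
    return host_dir, posixpath.join(CONTAINER_SOCK_DIR, sock_name)


# Value taken from the toolkit.env configuration in this report.
host_dir, container_socket = translate_socket("/run/k3s/containerd/containerd.sock")
print(host_dir)          # hostPath for the containerd-socket volume
print(container_socket)  # CONTAINERD_SOCKET inside the container
```

With the k3s socket path from this report, the sketch yields the same volume path and CONTAINERD_SOCKET value that appear in the v25.3.2 pod.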
Environment (please provide the following information):
- GPU Operator Version: v25.3.3
- OS: Ubuntu 24.04
- Kernel Version: 6.14.0-29-generic
- Container Runtime Version: 2.0.4-k3s2
- Kubernetes Distro and Version: k3s 1.32
Information to attach (optional if deemed irrelevant)
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-mssmq 1/1 Running 0 33m
gpu-operator-5b98787478-6bj47 1/1 Running 0 35m
nvidia-container-toolkit-daemonset-5k96v 0/1 CrashLoopBackOff 9 (4m23s ago) 33m
nvidia-cuda-validator-mnw9z 0/1 Completed 0 30m
nvidia-device-plugin-daemonset-kbxd4 1/1 Running 0 33m
nvidia-driver-daemonset-hfv5p 1/1 Running 0 34m
nvidia-operator-validator-lqc7f 1/1 Running 0 33m
Name: nvidia-container-toolkit-daemonset-5k96v
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-container-toolkit
Node: host-1/192.168.1.1
Start Time: Fri, 12 Sep 2025 23:32:48 +0000
Labels: app=nvidia-container-toolkit-daemonset
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=b799c8b98
helm.sh/chart=gpu-operator-v25.3.3
pod-template-generation=13
Annotations: <none>
Status: Running
IP: 10.42.0.213
IPs:
IP: 10.42.0.213
Controlled By: DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
driver-validation:
Container ID: containerd://c7b9aba6a473a9848eab9db9fa85d06c61f0dd900b61af72fd54638ea581afcd
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.3
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:8ca4b8f222887d42e09ab2f517914e51f374a8f887bce5d75391794b47ccb0e7
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 12 Sep 2025 23:32:48 +0000
Finished: Fri, 12 Sep 2025 23:35:39 +0000
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-dir (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qk27c (ro)
Containers:
nvidia-container-toolkit-ctr:
Container ID: containerd://cf64bebf4b6f4eccaba3319947a6a90a83eb4e563ed2929b94f21a889ff1a90a
Image: nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04
Image ID: nvcr.io/nvidia/k8s/container-toolkit@sha256:d90dd628828082d61ea2334dc5dbfe7104a160ddea5ff4e0d44e12dee24c10f6
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
Args:
/bin/entrypoint.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 12 Sep 2025 23:37:52 +0700
Finished: Fri, 12 Sep 2025 23:38:23 +0700
Ready: False
Restart Count: 3
Environment:
ROOT: /usr/local/nvidia
NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND: management.nvidia.com/gpu
NVIDIA_VISIBLE_DEVICES: void
TOOLKIT_PID_FILE: /run/nvidia/toolkit/toolkit.pid
CDI_ENABLED: true
NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES: nvidia.cdi.k8s.io/
CRIO_CONFIG_MODE: config
NVIDIA_CONTAINER_RUNTIME_MODE: cdi
RUNTIME: containerd
CONTAINERD_RUNTIME_CLASS: nvidia
RUNTIME_CONFIG: /runtime/config-dir/config.toml
CONTAINERD_CONFIG: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
RUNTIME_SOCKET: /runtime/sock-dir/containerd.sock
CONTAINERD_SOCKET: /run/k3s/containerd/containerd.sock
CONTAINERD_SET_AS_DEFAULT: true
Mounts:
/bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
/driver-root from driver-install-dir (rw)
/host from host-root (ro)
/run/nvidia/toolkit from toolkit-root (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/runtime/config-dir/ from containerd-config (rw)
/runtime/sock-dir/ from containerd-socket (rw)
/usr/local/nvidia from toolkit-install-dir (rw)
/usr/share/containers/oci/hooks.d from crio-hooks (rw)
/var/run/cdi from cdi-root (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qk27c (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nvidia-container-toolkit-entrypoint:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nvidia-container-toolkit-entrypoint
Optional: false
toolkit-root:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/toolkit
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-dir:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
toolkit-install-dir:
Type: HostPath (bare host directory volume)
Path: /usr/local/nvidia
HostPathType:
crio-hooks:
Type: HostPath (bare host directory volume)
Path: /run/containers/oci/hooks.d
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
cdi-root:
Type: HostPath (bare host directory volume)
Path: /var/run/cdi
HostPathType: DirectoryOrCreate
containerd-config:
Type: HostPath (bare host directory volume)
Path: /etc/containerd
HostPathType: DirectoryOrCreate
containerd-socket:
Type: HostPath (bare host directory volume)
Path: /run/containerd
HostPathType:
kube-api-access-qk27c:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.container-toolkit=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m20s default-scheduler Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-5k96v to host-1
Normal Pulled 6m20s kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.3" already present on machine
Normal Created 6m20s kubelet Created container: driver-validation
Normal Started 6m20s kubelet Started container driver-validation
Normal Pulled 76s (x4 over 3m27s) kubelet Container image "nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04" already present on machine
Normal Created 76s (x4 over 3m27s) kubelet Created container: nvidia-container-toolkit-ctr
Normal Started 76s (x4 over 3m27s) kubelet Started container nvidia-container-toolkit-ctr
Warning BackOff 6s (x7 over 2m25s) kubelet Back-off restarting failed container nvidia-container-toolkit-ctr in pod nvidia-container-toolkit-daemonset-5k96v_gpu-operator(136b9094-d96d-48fd-8ee1-6e930a3eeb50)
time="2025-09-12T16:35:46Z" level=warning msg="Error signaling containerd, attempt 1/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-09-12T16:35:51Z" level=warning msg="Error signaling containerd, attempt 2/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-09-12T16:35:56Z" level=warning msg="Error signaling containerd, attempt 3/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-09-12T16:36:01Z" level=warning msg="Error signaling containerd, attempt 4/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-09-12T16:36:06Z" level=warning msg="Error signaling containerd, attempt 5/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-09-12T16:36:11Z" level=warning msg="Max retries reached 6/6, aborting"
time="2025-09-12T16:36:11Z" level=info msg="Shutting Down"
time="2025-09-12T16:36:11Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: unable to restart containerd: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
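One way to read the error is to map the dialed in-container path back to the host through the containerd-socket hostPath volume of each pod. The Python sketch below does that using only values copied from the two describe outputs in this report. The helper `to_host_path` is hypothetical and is not part of the toolkit or the operator.

```python
import posixpath
from typing import Dict


def to_host_path(container_path: str, mounts: Dict[str, str]) -> str:
    """Hypothetical helper: map an in-container path back to the host.

    `mounts` maps a container mount point to its hostPath directory.
    """
    for mount_point, host_dir in mounts.items():
        prefix = mount_point.rstrip("/") + "/"
        if container_path.startswith(prefix):
            return posixpath.join(host_dir, container_path[len(prefix):])
    raise ValueError(f"{container_path} is not under any known mount")


# containerd-socket volume as reported by `kubectl describe` for each version.
mounts_v25_3_2 = {"/runtime/sock-dir/": "/run/k3s/containerd"}
mounts_v25_3_3 = {"/runtime/sock-dir/": "/run/containerd"}

# Path the toolkit tries to dial, according to the log above.
dialed = "/runtime/sock-dir/containerd.sock"

print(to_host_path(dialed, mounts_v25_3_2))
print(to_host_path(dialed, mounts_v25_3_3))
```

Under v25.3.2 the dialed path resolves to the k3s socket at /run/k3s/containerd/containerd.sock. Under v25.3.3 it resolves to /run/containerd/containerd.sock, which would be consistent with the "no such file or directory" error if k3s only exposes its socket under /run/k3s/containerd.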
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com