operator ignoring CONTAINERD_SOCKET environment variable for toolkit in version v25.3.3 #1694

@ilmannafian04

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug
After upgrading to version v25.3.3 of gpu-operator, the CONTAINERD_SOCKET environment variable defined in the .toolkit.env configuration is being ignored.

The release notes for v25.3.3 seem to mention a change related to environment variables, so this might be related.

The same configuration works correctly after reverting to v25.3.2. Below is the describe output for v25.3.2. For v25.3.3, please see the "Information to attach" section.

v25.3.2 describe
Name:                 nvidia-container-toolkit-daemonset-hnxph
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 host-1/192.168.1.1
Start Time:           Sat, 13 Sep 2025 00:11:05 +0000
Labels:               app=nvidia-container-toolkit-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=7c4599bb9f
                      helm.sh/chart=gpu-operator-v25.3.2
                      pod-template-generation=15
Annotations:          <none>
Status:               Running
IP:                   10.42.0.229
IPs:
  IP:           10.42.0.229
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://35f13b7e15f078a0791efd1c1b54b8695e4bef172bc6ea074c1a45fe9cc0e9dc
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.2
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:e183dc07e5889bd9e269c320ffad7f61df655f57ecc3aa158c4929e74528420a
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 13 Sep 2025 00:11:06 +0000
      Finished:     Sat, 13 Sep 2025 00:13:06 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:           true
      COMPONENT:           driver
      OPERATOR_NAMESPACE:  gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-dir (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-62fmq (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  containerd://e6cf1a6c89bd653cc60eef4b5b2d17c73774fd2c80852575373a92c29e1e98ac
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04
    Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:d90dd628828082d61ea2334dc5dbfe7104a160ddea5ff4e0d44e12dee24c10f6
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      /bin/entrypoint.sh
    State:          Running
      Started:      Sat, 13 Sep 2025 00:13:07 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      ROOT:                                                    /usr/local/nvidia
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND:         management.nvidia.com/gpu
      NVIDIA_VISIBLE_DEVICES:                                  void
      TOOLKIT_PID_FILE:                                        /run/nvidia/toolkit/toolkit.pid
      CONTAINERD_CONFIG:                                       /runtime/config-dir/config.toml
      CONTAINERD_SOCKET:                                       /runtime/sock-dir/containerd.sock
      CONTAINERD_RUNTIME_CLASS:                                nvidia
      CONTAINERD_SET_AS_DEFAULT:                               true
      CDI_ENABLED:                                             true
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES:  nvidia.cdi.k8s.io/
      CRIO_CONFIG_MODE:                                        config
      NVIDIA_CONTAINER_RUNTIME_MODE:                           cdi
      RUNTIME:                                                 containerd
      RUNTIME_CONFIG:                                          /runtime/config-dir/config.toml
      RUNTIME_SOCKET:                                          /runtime/sock-dir/containerd.sock
    Mounts:
      /bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
      /driver-root from driver-install-dir (rw)
      /host from host-root (ro)
      /run/nvidia/toolkit from toolkit-root (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/local/nvidia from toolkit-install-dir (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-62fmq (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  nvidia-container-toolkit-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-container-toolkit-entrypoint
    Optional:  false
  toolkit-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/toolkit
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rancher/k3s/agent/etc/containerd
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/k3s/containerd
    HostPathType:  
  kube-api-access-62fmq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  3m31s  default-scheduler  Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-hnxph to host-1
  Normal  Pulled     3m31s  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.2" already present on machine
  Normal  Created    3m31s  kubelet            Created container: driver-validation
  Normal  Started    3m30s  kubelet            Started container driver-validation
  Normal  Pulled     89s    kubelet            Container image "nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04" already present on machine
  Normal  Created    89s    kubelet            Created container: nvidia-container-toolkit-ctr
  Normal  Started    89s    kubelet            Started container nvidia-container-toolkit-ctr

To Reproduce
Configure CONTAINERD_SOCKET (and the related containerd variables) in the toolkit env:

toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
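
For reference, the v25.3.2 describe output suggests how this setting used to be handled: the directory of the configured socket became the containerd-socket hostPath volume, and CONTAINERD_SOCKET inside the container was rewritten to point under the /runtime/sock-dir/ mount. The sketch below only reproduces that observed mapping; it is not the operator's code, and the function name `map_host_socket` is made up for illustration.

```python
import posixpath

# Assumption: mount point taken from the "Mounts" section of the describe output.
SOCK_MOUNT_DIR = "/runtime/sock-dir"


def map_host_socket(host_socket: str) -> tuple[str, str]:
    """Return (hostPath directory, in-container socket path) for a host socket.

    Hypothetical helper that mirrors what the v25.3.2 pod spec shows.
    """
    host_dir = posixpath.dirname(host_socket)
    in_container = posixpath.join(SOCK_MOUNT_DIR, posixpath.basename(host_socket))
    return host_dir, in_container


host_dir, in_container = map_host_socket("/run/k3s/containerd/containerd.sock")
print(host_dir)      # /run/k3s/containerd
print(in_container)  # /runtime/sock-dir/containerd.sock
```

With v25.3.2 both values match the pod spec. With v25.3.3 the hostPath is /run/containerd instead, so the in-container socket path does not exist on a k3s node.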

Expected behavior
The toolkit should be able to communicate with containerd through the configured socket.

Environment (please provide the following information):

  • GPU Operator Version: v25.3.3
  • OS: Ubuntu 24.04
  • Kernel Version: 6.14.0-29-generic
  • Container Runtime Version: 2.0.4-k3s2
  • Kubernetes Distro and Version: k3s 1.32

Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
NAME                                       READY   STATUS             RESTARTS        AGE
gpu-feature-discovery-mssmq                1/1     Running            0               33m
gpu-operator-5b98787478-6bj47              1/1     Running            0               35m
nvidia-container-toolkit-daemonset-5k96v   0/1     CrashLoopBackOff   9 (4m23s ago)   33m
nvidia-cuda-validator-mnw9z                0/1     Completed          0               30m
nvidia-device-plugin-daemonset-kbxd4       1/1     Running            0               33m
nvidia-driver-daemonset-hfv5p              1/1     Running            0               34m
nvidia-operator-validator-lqc7f            1/1     Running            0               33m
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
Name:                 nvidia-container-toolkit-daemonset-5k96v
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 host-1/192.168.1.1
Start Time:           Fri, 12 Sep 2025 23:32:48 +0000
Labels:               app=nvidia-container-toolkit-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=b799c8b98
                      helm.sh/chart=gpu-operator-v25.3.3
                      pod-template-generation=13
Annotations:          <none>
Status:               Running
IP:                   10.42.0.213
IPs:
  IP:           10.42.0.213
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://c7b9aba6a473a9848eab9db9fa85d06c61f0dd900b61af72fd54638ea581afcd
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.3
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:8ca4b8f222887d42e09ab2f517914e51f374a8f887bce5d75391794b47ccb0e7
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 12 Sep 2025 23:32:48 +0000
      Finished:     Fri, 12 Sep 2025 23:35:39 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:           true
      COMPONENT:           driver
      OPERATOR_NAMESPACE:  gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-dir (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qk27c (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  containerd://cf64bebf4b6f4eccaba3319947a6a90a83eb4e563ed2929b94f21a889ff1a90a
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04
    Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:d90dd628828082d61ea2334dc5dbfe7104a160ddea5ff4e0d44e12dee24c10f6
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      /bin/entrypoint.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 12 Sep 2025 23:37:52 +0700
      Finished:     Fri, 12 Sep 2025 23:38:23 +0700
    Ready:          False
    Restart Count:  3
    Environment:
      ROOT:                                                    /usr/local/nvidia
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND:         management.nvidia.com/gpu
      NVIDIA_VISIBLE_DEVICES:                                  void
      TOOLKIT_PID_FILE:                                        /run/nvidia/toolkit/toolkit.pid
      CDI_ENABLED:                                             true
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES:  nvidia.cdi.k8s.io/
      CRIO_CONFIG_MODE:                                        config
      NVIDIA_CONTAINER_RUNTIME_MODE:                           cdi
      RUNTIME:                                                 containerd
      CONTAINERD_RUNTIME_CLASS:                                nvidia
      RUNTIME_CONFIG:                                          /runtime/config-dir/config.toml
      CONTAINERD_CONFIG:                                       /var/lib/rancher/k3s/agent/etc/containerd/config.toml
      RUNTIME_SOCKET:                                          /runtime/sock-dir/containerd.sock
      CONTAINERD_SOCKET:                                       /run/k3s/containerd/containerd.sock
      CONTAINERD_SET_AS_DEFAULT:                               true
    Mounts:
      /bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
      /driver-root from driver-install-dir (rw)
      /host from host-root (ro)
      /run/nvidia/toolkit from toolkit-root (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/local/nvidia from toolkit-install-dir (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qk27c (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  nvidia-container-toolkit-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-container-toolkit-entrypoint
    Optional:  false
  toolkit-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/toolkit
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/containerd
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containerd
    HostPathType:  
  kube-api-access-qk27c:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  6m20s                default-scheduler  Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-5k96v to host-1
  Normal   Pulled     6m20s                kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.3" already present on machine
  Normal   Created    6m20s                kubelet            Created container: driver-validation
  Normal   Started    6m20s                kubelet            Started container driver-validation
  Normal   Pulled     76s (x4 over 3m27s)  kubelet            Container image "nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04" already present on machine
  Normal   Created    76s (x4 over 3m27s)  kubelet            Created container: nvidia-container-toolkit-ctr
  Normal   Started    76s (x4 over 3m27s)  kubelet            Started container nvidia-container-toolkit-ctr
  Warning  BackOff    6s (x7 over 2m25s)   kubelet            Back-off restarting failed container nvidia-container-toolkit-ctr in pod nvidia-container-toolkit-daemonset-5k96v_gpu-operator(136b9094-d96d-48fd-8ee1-6e930a3eeb50)
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
time="2025-09-12T16:35:46Z" level=warning msg="Error signaling containerd, attempt 1/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-09-12T16:35:51Z" level=warning msg="Error signaling containerd, attempt 2/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-09-12T16:35:56Z" level=warning msg="Error signaling containerd, attempt 3/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-09-12T16:36:01Z" level=warning msg="Error signaling containerd, attempt 4/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-09-12T16:36:06Z" level=warning msg="Error signaling containerd, attempt 5/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
time="2025-09-12T16:36:11Z" level=warning msg="Max retries reached 6/6, aborting"
time="2025-09-12T16:36:11Z" level=info msg="Shutting Down"
time="2025-09-12T16:36:11Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: unable to restart containerd: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log
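
To make the difference between the two pods easier to see, the settings relevant to the socket error can be compared side by side. This is a small sketch: the dictionaries hold values copied by hand from the two describe outputs in this report, and the key names ending in "hostPath" are labels chosen here, not Kubernetes fields.

```python
# Values copied from the two `kubectl describe pod` outputs in this report.
V25_3_2 = {
    "CONTAINERD_CONFIG": "/runtime/config-dir/config.toml",
    "CONTAINERD_SOCKET": "/runtime/sock-dir/containerd.sock",
    "RUNTIME_CONFIG": "/runtime/config-dir/config.toml",
    "RUNTIME_SOCKET": "/runtime/sock-dir/containerd.sock",
    "containerd-config hostPath": "/var/lib/rancher/k3s/agent/etc/containerd",
    "containerd-socket hostPath": "/run/k3s/containerd",
}
V25_3_3 = {
    "CONTAINERD_CONFIG": "/var/lib/rancher/k3s/agent/etc/containerd/config.toml",
    "CONTAINERD_SOCKET": "/run/k3s/containerd/containerd.sock",
    "RUNTIME_CONFIG": "/runtime/config-dir/config.toml",
    "RUNTIME_SOCKET": "/runtime/sock-dir/containerd.sock",
    "containerd-config hostPath": "/etc/containerd",
    "containerd-socket hostPath": "/run/containerd",
}


def diff_settings(old: dict, new: dict) -> dict:
    """Return {key: (old value, new value)} for keys whose values differ."""
    return {k: (old[k], new.get(k)) for k in sorted(old) if old[k] != new.get(k)}


for key, (old, new) in diff_settings(V25_3_2, V25_3_3).items():
    print(f"{key}: {old} -> {new}")
```

In v25.3.3 the CONTAINERD_* variables are passed through unchanged, while the hostPath volumes fall back to /etc/containerd and /run/containerd. RUNTIME_SOCKET still points at /runtime/sock-dir/containerd.sock, which is the path the toolkit fails to dial in the logs.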

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com
