operator ignoring `CONTAINERD_SOCKET` environment variable for toolkit in version v25.3.3

_**Important Note:  NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case [here](https://enterprise-support.nvidia.com/s/create-case)**._

**Describe the bug**
After upgrading to version `v25.3.3` of gpu-operator, the `CONTAINERD_SOCKET` environment variable defined in the `.toolkit.env` configuration is being ignored.

The [release notes for `v25.3.3`](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/25.3.3/release-notes.html#fixed-issues) seem to mention a change involving environment variables, so this might be related.

The same configuration works correctly after reverting to `v25.3.2`. Below is the `describe` output for `v25.3.2`. For `v25.3.3`, please see the "Information to attach" section.

<details>
<summary>v25.3.2 describe</summary>

```
Name:                 nvidia-container-toolkit-daemonset-hnxph
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 host-1/192.168.1.1
Start Time:           Sat, 13 Sep 2025 00:11:05 +0000
Labels:               app=nvidia-container-toolkit-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=7c4599bb9f
                      helm.sh/chart=gpu-operator-v25.3.2
                      pod-template-generation=15
Annotations:          <none>
Status:               Running
IP:                   10.42.0.229
IPs:
  IP:           10.42.0.229
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://35f13b7e15f078a0791efd1c1b54b8695e4bef172bc6ea074c1a45fe9cc0e9dc
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.2
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:e183dc07e5889bd9e269c320ffad7f61df655f57ecc3aa158c4929e74528420a
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 13 Sep 2025 00:11:06 +0000
      Finished:     Sat, 13 Sep 2025 00:13:06 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:           true
      COMPONENT:           driver
      OPERATOR_NAMESPACE:  gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-dir (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-62fmq (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  containerd://e6cf1a6c89bd653cc60eef4b5b2d17c73774fd2c80852575373a92c29e1e98ac
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04
    Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:d90dd628828082d61ea2334dc5dbfe7104a160ddea5ff4e0d44e12dee24c10f6
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      /bin/entrypoint.sh
    State:          Running
      Started:      Sat, 13 Sep 2025 00:13:07 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      ROOT:                                                    /usr/local/nvidia
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND:         management.nvidia.com/gpu
      NVIDIA_VISIBLE_DEVICES:                                  void
      TOOLKIT_PID_FILE:                                        /run/nvidia/toolkit/toolkit.pid
      CONTAINERD_CONFIG:                                       /runtime/config-dir/config.toml
      CONTAINERD_SOCKET:                                       /runtime/sock-dir/containerd.sock
      CONTAINERD_RUNTIME_CLASS:                                nvidia
      CONTAINERD_SET_AS_DEFAULT:                               true
      CDI_ENABLED:                                             true
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES:  nvidia.cdi.k8s.io/
      CRIO_CONFIG_MODE:                                        config
      NVIDIA_CONTAINER_RUNTIME_MODE:                           cdi
      RUNTIME:                                                 containerd
      RUNTIME_CONFIG:                                          /runtime/config-dir/config.toml
      RUNTIME_SOCKET:                                          /runtime/sock-dir/containerd.sock
    Mounts:
      /bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
      /driver-root from driver-install-dir (rw)
      /host from host-root (ro)
      /run/nvidia/toolkit from toolkit-root (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/local/nvidia from toolkit-install-dir (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-62fmq (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  nvidia-container-toolkit-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-container-toolkit-entrypoint
    Optional:  false
  toolkit-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/toolkit
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rancher/k3s/agent/etc/containerd
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/k3s/containerd
    HostPathType:  
  kube-api-access-62fmq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  3m31s  default-scheduler  Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-hnxph to host-1
  Normal  Pulled     3m31s  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.2" already present on machine
  Normal  Created    3m31s  kubelet            Created container: driver-validation
  Normal  Started    3m30s  kubelet            Started container driver-validation
  Normal  Pulled     89s    kubelet            Container image "nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04" already present on machine
  Normal  Created    89s    kubelet            Created container: nvidia-container-toolkit-ctr
  Normal  Started    89s    kubelet            Started container nvidia-container-toolkit-ctr
```

</details>

**To Reproduce**
Configure `CONTAINERD_SOCKET` for the toolkit in the chart values:
```yaml
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
```

**Expected behavior**
The toolkit is able to communicate with containerd through the configured socket.

**Environment (please provide the following information):**
 - GPU Operator Version: v25.3.3
 - OS: Ubuntu 24.04
 - Kernel Version: 6.14.0-29-generic
 - Container Runtime Version: 2.0.4-k3s2
 - Kubernetes Distro and Version: k3s 1.32



**Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/)** (optional if deemed irrelevant)

 - [x] kubernetes pods status: `kubectl get pods -n OPERATOR_NAMESPACE`
```
NAME                                       READY   STATUS             RESTARTS        AGE
gpu-feature-discovery-mssmq                1/1     Running            0               33m
gpu-operator-5b98787478-6bj47              1/1     Running            0               35m
nvidia-container-toolkit-daemonset-5k96v   0/1     CrashLoopBackOff   9 (4m23s ago)   33m
nvidia-cuda-validator-mnw9z                0/1     Completed          0               30m
nvidia-device-plugin-daemonset-kbxd4       1/1     Running            0               33m
nvidia-driver-daemonset-hfv5p              1/1     Running            0               34m
nvidia-operator-validator-lqc7f            1/1     Running            0               33m
```
 - [ ] kubernetes daemonset status: `kubectl get ds -n OPERATOR_NAMESPACE`
 - [x] If a pod/ds is in an error state or pending state `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME`
```
Name:                 nvidia-container-toolkit-daemonset-5k96v
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 host-1/192.168.1.1
Start Time:           Fri, 12 Sep 2025 23:32:48 +0000
Labels:               app=nvidia-container-toolkit-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=b799c8b98
                      helm.sh/chart=gpu-operator-v25.3.3
                      pod-template-generation=13
Annotations:          <none>
Status:               Running
IP:                   10.42.0.213
IPs:
  IP:           10.42.0.213
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://c7b9aba6a473a9848eab9db9fa85d06c61f0dd900b61af72fd54638ea581afcd
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.3
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:8ca4b8f222887d42e09ab2f517914e51f374a8f887bce5d75391794b47ccb0e7
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 12 Sep 2025 23:32:48 +0000
      Finished:     Fri, 12 Sep 2025 23:35:39 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:           true
      COMPONENT:           driver
      OPERATOR_NAMESPACE:  gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-dir (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qk27c (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  containerd://cf64bebf4b6f4eccaba3319947a6a90a83eb4e563ed2929b94f21a889ff1a90a
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04
    Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:d90dd628828082d61ea2334dc5dbfe7104a160ddea5ff4e0d44e12dee24c10f6
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      /bin/entrypoint.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 12 Sep 2025 23:37:52 +0700
      Finished:     Fri, 12 Sep 2025 23:38:23 +0700
    Ready:          False
    Restart Count:  3
    Environment:
      ROOT:                                                    /usr/local/nvidia
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND:         management.nvidia.com/gpu
      NVIDIA_VISIBLE_DEVICES:                                  void
      TOOLKIT_PID_FILE:                                        /run/nvidia/toolkit/toolkit.pid
      CDI_ENABLED:                                             true
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES:  nvidia.cdi.k8s.io/
      CRIO_CONFIG_MODE:                                        config
      NVIDIA_CONTAINER_RUNTIME_MODE:                           cdi
      RUNTIME:                                                 containerd
      CONTAINERD_RUNTIME_CLASS:                                nvidia
      RUNTIME_CONFIG:                                          /runtime/config-dir/config.toml
      CONTAINERD_CONFIG:                                       /var/lib/rancher/k3s/agent/etc/containerd/config.toml
      RUNTIME_SOCKET:                                          /runtime/sock-dir/containerd.sock
      CONTAINERD_SOCKET:                                       /run/k3s/containerd/containerd.sock
      CONTAINERD_SET_AS_DEFAULT:                               true
    Mounts:
      /bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
      /driver-root from driver-install-dir (rw)
      /host from host-root (ro)
      /run/nvidia/toolkit from toolkit-root (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/local/nvidia from toolkit-install-dir (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qk27c (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  nvidia-container-toolkit-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-container-toolkit-entrypoint
    Optional:  false
  toolkit-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/toolkit
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/containerd
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containerd
    HostPathType:  
  kube-api-access-qk27c:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  6m20s                default-scheduler  Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-5k96v to host-1
  Normal   Pulled     6m20s                kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.3" already present on machine
  Normal   Created    6m20s                kubelet            Created container: driver-validation
  Normal   Started    6m20s                kubelet            Started container driver-validation
  Normal   Pulled     76s (x4 over 3m27s)  kubelet            Container image "nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04" already present on machine
  Normal   Created    76s (x4 over 3m27s)  kubelet            Created container: nvidia-container-toolkit-ctr
  Normal   Started    76s (x4 over 3m27s)  kubelet            Started container nvidia-container-toolkit-ctr
  Warning  BackOff    6s (x7 over 2m25s)   kubelet            Back-off restarting failed container nvidia-container-toolkit-ctr in pod nvidia-container-toolkit-daemonset-5k96v_gpu-operator(136b9094-d96d-48fd-8ee1-6e930a3eeb50)
```
 - [x] If a pod/ds is in an error state or pending state `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`
  ```
  time="2025-09-12T16:35:46Z" level=warning msg="Error signaling containerd, attempt 1/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
  time="2025-09-12T16:35:51Z" level=warning msg="Error signaling containerd, attempt 2/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
  time="2025-09-12T16:35:56Z" level=warning msg="Error signaling containerd, attempt 3/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
  time="2025-09-12T16:36:01Z" level=warning msg="Error signaling containerd, attempt 4/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
  time="2025-09-12T16:36:06Z" level=warning msg="Error signaling containerd, attempt 5/6: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
  time="2025-09-12T16:36:11Z" level=warning msg="Max retries reached 6/6, aborting"
  time="2025-09-12T16:36:11Z" level=info msg="Shutting Down"
  time="2025-09-12T16:36:11Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: unable to restart containerd: unable to dial: dial unix /runtime/sock-dir/containerd.sock: connect: no such file or directory"
  ```
 - [ ] Output from running `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`
 - [ ] containerd logs `journalctl -u containerd > containerd.log`
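
One difference visible in the two `describe` outputs above: in `v25.3.2` the `containerd-socket` volume is a HostPath at `/run/k3s/containerd`, while in `v25.3.3` it is `/run/containerd`, even though `CONTAINERD_SOCKET` points at `/run/k3s/containerd/containerd.sock` in both cases. That may explain why `/runtime/sock-dir/containerd.sock` does not exist inside the container. The following is a minimal Python sketch of that check, not part of the operator; the helper names `volume_host_path` and `socket_is_mounted` are hypothetical, and the embedded excerpts are trimmed copies of the `Volumes:` sections shown above.

```python
import posixpath
from typing import Optional


def volume_host_path(describe_text: str, volume: str) -> Optional[str]:
    """Return the 'Path:' of a named volume in `kubectl describe pod` text."""
    in_volume = False
    for line in describe_text.splitlines():
        stripped = line.strip()
        if stripped == f"{volume}:":
            in_volume = True
        elif in_volume and stripped.startswith("Path:"):
            parts = stripped.split(None, 1)
            return parts[1] if len(parts) > 1 else None
        elif in_volume and stripped.endswith(":") and " " not in stripped:
            # Reached the next bare "name:" line without seeing a Path.
            return None
    return None


def socket_is_mounted(configured_socket: str, host_path: Optional[str]) -> bool:
    """True if the volume's host path is the directory holding the socket."""
    if host_path is None:
        return False
    socket_dir = posixpath.dirname(configured_socket)
    return posixpath.normpath(socket_dir) == posixpath.normpath(host_path)


# Trimmed excerpts of the two describe outputs in this issue.
V25_3_2 = """\
Volumes:
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/k3s/containerd
    HostPathType:
"""
V25_3_3 = """\
Volumes:
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containerd
    HostPathType:
"""

# Value of CONTAINERD_SOCKET from the toolkit.env configuration.
CONFIGURED = "/run/k3s/containerd/containerd.sock"

for name, text in (("v25.3.2", V25_3_2), ("v25.3.3", V25_3_3)):
    path = volume_host_path(text, "containerd-socket")
    print(name, path, socket_is_mounted(CONFIGURED, path))
```

With the values from this issue, the `v25.3.2` line ends in `True` and the `v25.3.3` line ends in `False`. The same comparison could be applied to `CONTAINERD_CONFIG` and the `containerd-config` volume, which shows the same pattern (`/var/lib/rancher/k3s/agent/etc/containerd` versus `/etc/containerd`).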


Collecting full debug bundle (optional):

```
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
```
**NOTE**: please refer to the [must-gather](https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh) script for debug data collected.

This bundle can be submitted to us via email: **operator_feedback@nvidia.com**

