This is effectively a continuation of #1099, but I cannot re-open that issue, so I am opening a new one.
I am experiencing the same problem while attempting to upgrade from v24.6.0 to v24.9.0 on a k3s cluster. This may be a bad interaction between this recent commit and the non-standard CONTAINERD paths required for gpu-operator+k3s, which are specified in my cluster's values as:
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "false"
The pod log:
nvidia-container-toolkit-ctr IS_HOST_DRIVER=false
nvidia-container-toolkit-ctr NVIDIA_DRIVER_ROOT=/run/nvidia/driver
nvidia-container-toolkit-ctr DRIVER_ROOT_CTR_PATH=/driver-root
nvidia-container-toolkit-ctr NVIDIA_DEV_ROOT=/run/nvidia/driver
nvidia-container-toolkit-ctr DEV_ROOT_CTR_PATH=/driver-root
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Parsing arguments"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Starting nvidia-toolkit"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="disabling device node creation since --cdi-enabled=false"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Verifying Flags"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg=Initializing
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Shutting Down"
nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=error msg="error running nvidia-toolkit: unable to determine runtime options: unable to load containerd config: failed to load config: failed to run command chroot [/host containerd config dump]: exit status 127"
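A note on the `exit status 127` in the last log line: by shell and coreutils `chroot` convention, 127 means the command could not be found, which would be consistent with no `containerd` binary being on the PATH inside `/host` (k3s ships its own bundled containerd rather than a system-wide one). The sketch below only illustrates that convention; the helper `explain_status` and the made-up binary name are hypothetical, and it does not reproduce the toolkit's actual code path.

```shell
#!/bin/sh
# Hypothetical helper: translate an exit status into the meaning used by
# POSIX shells and coreutils chroot.
explain_status() {
  case "$1" in
    0)   echo "success" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    *)   echo "command ran and failed with status $1" ;;
  esac
}

# Stand-in for `chroot /host containerd config dump`: run a binary name that
# is assumed not to exist on PATH and capture the resulting status.
sh -c 'definitely-not-a-real-binary-xyz config dump' 2>/dev/null
status=$?

echo "status=$status: $(explain_status "$status")"
```

On an affected node, running `command -v containerd` as root on the host would show whether the binary is reachable through the default PATH.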
I confirmed that gpu-operator is setting the correct CONTAINERD_* paths according to my values:
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rancher/k3s/agent/etc/containerd
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/k3s/containerd
    HostPathType: