container-toolkit fails to start after upgrading to v24.9.0 on k3s cluster #1109

@logan2211

This is effectively a continuation of #1099, but I cannot re-open that issue, so opening a new one.

I am experiencing the same problem while attempting to upgrade from v24.6.0 to v24.9.0 on a k3s cluster. This may be a bad interaction between this recent commit and the non-standard CONTAINERD paths required for gpu-operator + k3s, which are specified in my cluster's values as:

    toolkit:
      env:
      - name: CONTAINERD_CONFIG
        value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
      - name: CONTAINERD_SOCKET
        value: /run/k3s/containerd/containerd.sock
      - name: CONTAINERD_SET_AS_DEFAULT
        value: "false"

The pod log:

    nvidia-container-toolkit-ctr IS_HOST_DRIVER=false
    nvidia-container-toolkit-ctr NVIDIA_DRIVER_ROOT=/run/nvidia/driver
    nvidia-container-toolkit-ctr DRIVER_ROOT_CTR_PATH=/driver-root
    nvidia-container-toolkit-ctr NVIDIA_DEV_ROOT=/run/nvidia/driver
    nvidia-container-toolkit-ctr DEV_ROOT_CTR_PATH=/driver-root
    nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Parsing arguments"
    nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Starting nvidia-toolkit"
    nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="disabling device node creation since --cdi-enabled=false"
    nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Verifying Flags"
    nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg=Initializing
    nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=info msg="Shutting Down"
    nvidia-container-toolkit-ctr time="2024-11-07T22:08:57Z" level=error msg="error running nvidia-toolkit: unable to determine runtime options: unable to load containerd config: failed to load config: failed to run command chroot [/host containerd config dump]: exit status 127"
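
For anyone triaging the error line above, here is a minimal sketch that decodes the `chroot` failure. It assumes the `chroot` in the toolkit container follows the GNU coreutils exit-status convention (125, 126, 127), which this issue has not verified. The helper name `explain_chroot_failure` and the table `CHROOT_EXIT_MEANINGS` are hypothetical and are not part of the toolkit.

```python
import re

# Exit statuses documented for GNU coreutils `chroot` (assumed to apply here).
# Any other status is the exit status of the command that was run.
CHROOT_EXIT_MEANINGS = {
    125: "chroot itself failed",
    126: "command was found but could not be invoked",
    127: "command could not be found",
}

# Matches the fragment "failed to run command chroot [<root> <cmd...>]: exit status <n>".
CHROOT_FAILURE = re.compile(
    r"failed to run command chroot "
    r"\[(?P<root>\S+) (?P<cmd>[^\]]+)\]: exit status (?P<status>\d+)"
)


def explain_chroot_failure(log_line: str) -> str:
    """Summarize a chroot failure found in a toolkit log line (hypothetical helper)."""
    match = CHROOT_FAILURE.search(log_line)
    if match is None:
        return "no chroot failure found in log line"
    status = int(match["status"])
    meaning = CHROOT_EXIT_MEANINGS.get(status, "status returned by the command itself")
    binary = match["cmd"].split()[0]  # first word of the command is the binary name
    return f"{binary!r} under {match['root']}: exit status {status} ({meaning})"


if __name__ == "__main__":
    line = (
        "failed to load config: failed to run command chroot "
        "[/host containerd config dump]: exit status 127"
    )
    print(explain_chroot_failure(line))
```

If that convention holds, status 127 would mean that no `containerd` binary was found on the `PATH` inside the `/host` chroot. That would be plausible on k3s, which ships its own embedded containerd, but it is an unconfirmed reading of the log and not a diagnosis.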

I confirmed that gpu-operator is setting the correct CONTAINERD_* paths according to my values:

  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rancher/k3s/agent/etc/containerd
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/k3s/containerd
    HostPathType:
