Skip to content

Failed to get sandbox runtime: no runtime for nvidia is configured #432

@Bec-k

Description

@Bec-k

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

nov 02 18:00:58 beck containerd[10237]: time="2022-11-02T18:00:58.738797825+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:gpu-feature-discovery-qfjgk,Uid:02c7d4ad-db02-4145-846b-616a94416008,Namespace:gpu-operator,Attempt:2,} failed, error" error="failed to get sandbox runtime: no runtime for \"nvidia\" is configured"

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces

  • kubernetes daemonset status: kubectl get ds --all-namespaces

  • If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

  • If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

  • Output of running a container on the GPU machine: docker run -it alpine echo foo

  • Docker configuration file: cat /etc/docker/daemon.json

  • Docker runtime configuration: docker info | grep runtime

  • NVIDIA shared directory: ls -la /run/nvidia

  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • NVIDIA driver directory: ls -la /run/nvidia/driver

  • kubelet logs journalctl -u kubelet > kubelet.logs

(base) beck@beck:/$ ls -la /run/nvidia/
total 4
drwxr-xr-x  4 root root  100 nov  2 18:48 .
drwxr-xr-x 39 root root 1140 nov  2 18:47 ..
drwxr-xr-x  2 root root   40 nov  2 17:59 driver
-rw-r--r--  1 root root    7 nov  2 18:48 toolkit.pid
drwxr-xr-x  2 root root   80 nov  2 18:48 validations

Driver folder is empty:

(base) beck@beck:/$ ls -la /run/nvidia/driver/
total 0
drwxr-xr-x 2 root root 40 nov  2 17:59 .
drwxr-xr-x 4 root root 80 nov  2 18:48 ..

Metadata

Metadata

Assignees

No one assigned

    Labels

    lifecycle/staleDenotes an issue or PR has remained open with no activity and has become stale.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions