Skip to content

Failed to initialize NVML: Unknown Error #430

@hoangtnm

Description

@hoangtnm

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

Hi, I'm deploying Kubeflow v1.6.1 along with nvidia/gpu-operator for training DL models. It works great, but after a random of time (maybe 1-2 days I guess), I cannot use nvidia-smi to check GPU status anymore. When this happens, it raises:

(base) jovyan@agm-0:~/vol-1$ nvidia-smi
Failed to initialize NVML: Unknown Error

I'm not so sure why this happens because it runs training without any problem for several epochs, and when I come back the next day, this error happens. Do you have any idea?

2. Steps to reproduce the issue

This is how I deploy nvidia/gpu-operator:

sudo snap install helm --classic
helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update \
  && helm install \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu-operator-resources \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="volume-mounts" \
  --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set-string toolkit.env[0].value=false \
  --set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
  --set-string toolkit.env[1].value=true

Metadata

Metadata

Assignees

No one assigned

    Labels

    lifecycle/staleDenotes an issue or PR has remained open with no activity and has become stale.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions