Failed to initialize NVML: Unknown Error #430

_The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense._

### 1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [x] Are you running Kubernetes v1.13+?
- [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [ ] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?

### 1. Issue or feature description

Hi, I'm deploying Kubeflow v1.6.1 along with `nvidia/gpu-operator` to train DL models. It works well at first, but after a seemingly random amount of time (roughly 1-2 days), I can no longer use `nvidia-smi` to check the GPU status. When this happens, it fails with:

```bash
(base) jovyan@agm-0:~/vol-1$ nvidia-smi
Failed to initialize NVML: Unknown Error
```

I'm not sure why this happens: training runs without any problem for several epochs, but when I come back the next day, `nvidia-smi` fails with this error. Do you have any idea what could cause it?
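To pin down when the failure first appears, a small polling helper could be left running in the notebook container. This is only a sketch: `nvml_failed` and `watch_nvml` are hypothetical helper names, not part of any NVIDIA tooling, and the 60-second default interval is an arbitrary assumption.

```bash
# Return 0 if the given nvidia-smi output contains the NVML init failure.
nvml_failed() {
  case "$1" in
    *"Failed to initialize NVML"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll nvidia-smi every N seconds (default 60, an assumed value) and print a
# UTC timestamp the first time the NVML failure shows up, then stop.
watch_nvml() {
  local interval="${1:-60}"
  local out
  while true; do
    out="$(nvidia-smi 2>&1)"
    if nvml_failed "$out"; then
      echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) NVML failure detected"
      return 1
    fi
    sleep "$interval"
  done
}
```

Running `watch_nvml 300 >> nvml-watch.log &` in the pod would record roughly when the GPU becomes unreachable, which can then be compared against node or container events from the same time.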

### 2. Steps to reproduce the issue

This is how I deploy `nvidia/gpu-operator`:

```bash
sudo snap install helm --classic
helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update \
  && helm install \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu-operator-resources \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="volume-mounts" \
  --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set-string toolkit.env[0].value=false \
  --set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
  --set-string toolkit.env[1].value=true
```
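For easier reproduction, the same settings can also be kept in a values file and passed to Helm with `-f`, which avoids relying on the shell leaving the `[0]` and `[1]` brackets untouched. The sketch below is an assumed equivalent of the `--set` flags above; `write_gpu_operator_values` and the `values.yaml` path are hypothetical names used only for illustration.

```bash
# Write Helm values mirroring the --set / --set-string flags to the given path.
write_gpu_operator_values() {
  cat > "$1" <<'EOF'
driver:
  enabled: false
devicePlugin:
  env:
    - name: DEVICE_LIST_STRATEGY
      value: volume-mounts
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
EOF
}

# Example usage (not executed here):
#   write_gpu_operator_values values.yaml
#   helm install --generate-name --create-namespace \
#     --namespace=gpu-operator-resources --version=v22.9.0 \
#     -f values.yaml nvidia/gpu-operator
```

The two toolkit values are quoted so they stay strings, matching the `--set-string` flags in the original command.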
