The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
1. Issue or feature description
Hi, I'm deploying Kubeflow v1.6.1 along with nvidia/gpu-operator for training DL models. It works great, but after a random amount of time (maybe 1-2 days), I can no longer use nvidia-smi to check the GPU status. When this happens, it fails with:
(base) jovyan@agm-0:~/vol-1$ nvidia-smi
Failed to initialize NVML: Unknown Error
I'm not sure why this happens: training runs without any problem for several epochs, but when I come back the next day, the error appears. Do you have any idea?
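To pin down when the error first appears (rather than "some time in 1-2 days"), a small probe along these lines could be left running inside the notebook pod. This is only a sketch: `probe_gpu`, `PROBE_CMD`, and `LOG_FILE` are hypothetical names chosen for illustration and are not part of gpu-operator or Kubeflow.

```shell
# Hypothetical helper: run a GPU probe command and append a timestamped
# line to a log file whenever the probe exits with a non-zero status.
# PROBE_CMD and LOG_FILE are assumed names; adjust them to your setup.
PROBE_CMD="${PROBE_CMD:-nvidia-smi}"
LOG_FILE="${LOG_FILE:-nvml-failures.log}"

probe_gpu() {
  # Capture stdout and stderr so the error text ends up in the log.
  if probe_output="$($PROBE_CMD 2>&1)"; then
    return 0
  fi
  # Log the UTC time plus the first line of the probe's output.
  printf '%s probe failed: %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$(printf '%s\n' "$probe_output" | head -n 1)" >> "$LOG_FILE"
  return 1
}

# Example usage (not run here): poll once a minute until the first failure.
#   while probe_gpu; do sleep 60; done
```

The timestamp of the first logged failure can then be compared against node and pod events from around the same time.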
2. Steps to reproduce the issue
This is how I deploy nvidia/gpu-operator:
sudo snap install helm --classic
helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update \
  && helm install \
    --version=v22.9.0 \
    --generate-name \
    --create-namespace \
    --namespace=gpu-operator-resources \
    nvidia/gpu-operator \
    --set driver.enabled=false \
    --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
    --set devicePlugin.env[0].value="volume-mounts" \
    --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
    --set-string toolkit.env[0].value=false \
    --set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
    --set-string toolkit.env[1].value=true