I've uncovered a strange interaction between the validator and the device plugin that can cause an error state when the nvidia driver is installed on the host.
Background
As far as I can tell, the validator pod uses init containers to create these files, in order:
/run/nvidia/validations/host-driver-ready
/run/nvidia/validations/toolkit-ready
/run/nvidia/validations/cuda-ready
/run/nvidia/validations/plugin-ready
When the pod exits, a preStop lifecycle hook on the main (non-init) container deletes these files:
lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - rm -f /run/nvidia/validations/*-ready
The device plugin pod has an init container and a main container. The init container runs
until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
which prevents the pod from proceeding to the main container until the toolkit-ready file exists. Then the main container runs
[[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-device-plugin;
which sets $NVIDIA_DRIVER_ROOT to different values depending on whether or not the host-driver-ready file exists. For a host driver installation, we need driver_root=/.
Problem
When the validator pod exits due to an error in one of the init containers, the files in /run/nvidia/validations are not removed, because the pod never reached the main container that carries the lifecycle hook. This leaves various -ready files in the host path. This is normally not a big problem, since each init container check deletes its already-present file before checking readiness and possibly re-creating it. For example, the Driver.validate function deletes /run/nvidia/validations/host-driver-ready, checks status, then recreates it if successful.
This delete-then-recreate sequence can lead to a brief period where the toolkit-ready file exists (because it was not cleaned up by the previous validator pod), but host-driver-ready does not. If the device plugin pod starts during this time, its init container sees toolkit-ready and lets the pod proceed, and the main container then incorrectly sets $NVIDIA_DRIVER_ROOT=/run/nvidia/driver. As a result, other GPU pods fail to initialize with the error:
Error: failed to generate container "..." spec: failed to generate spec: lstat /run/nvidia/driver/dev/nvidiactl: no such file or directory
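The window described above can be illustrated outside the cluster. This is only a sketch under my own assumptions: pick_driver_root is a hypothetical helper that mirrors the device plugin's selection logic but takes the validations directory as an argument, and a scratch directory stands in for /run/nvidia/validations.

```shell
#!/bin/sh
# Hypothetical helper (not part of the operator): same decision as the
# device plugin's startup command, but with the directory as an argument.
pick_driver_root() {
    if [ -f "$1/host-driver-ready" ]; then
        echo /
    else
        echo /run/nvidia/driver
    fi
}

# Scratch directory standing in for /run/nvidia/validations.
validations=$(mktemp -d)

# Stale toolkit-ready left by a failed validator pod, while the new
# driver validation has deleted host-driver-ready and not yet recreated it.
touch "$validations/toolkit-ready"
stale_root=$(pick_driver_root "$validations")

# Driver validation succeeds and recreates host-driver-ready.
touch "$validations/host-driver-ready"
ready_root=$(pick_driver_root "$validations")

echo "during the window: $stale_root"
echo "after validation:  $ready_root"
rm -rf "$validations"
```

A device plugin that starts in the first state keeps the wrong driver root for the lifetime of its main container, since the value is only computed once before exec.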
One would hope that restarting various components would resolve the issue, but I've observed that once this problem happens, it tends to keep happening until I manually remove the operator and delete the contents of /run/nvidia/validations. Furthermore, this problem seems to happen fairly consistently during our node provisioning process.
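For reference, the file cleanup part of that manual workaround looks roughly like the sketch below. clear_stale_ready_files is a hypothetical helper name of mine, the demonstration runs against a scratch directory rather than the real host path, and it does not cover removing the operator itself.

```shell
#!/bin/sh
# Hypothetical helper: remove leftover *-ready files from a validations
# directory and print how many were removed.
clear_stale_ready_files() {
    dir=$1
    count=0
    for f in "$dir"/*-ready; do
        [ -f "$f" ] || continue   # the glob matched nothing
        rm -f "$f"
        count=$((count + 1))
    done
    echo "$count"
}

# Demonstration against a scratch directory, not /run/nvidia/validations.
scratch=$(mktemp -d)
touch "$scratch/toolkit-ready" "$scratch/cuda-ready"
removed=$(clear_stale_ready_files "$scratch")
remaining=$(ls "$scratch" | wc -l | tr -d ' ')
rm -rf "$scratch"
echo "removed $removed file(s), $remaining remaining"
```

On an affected node the argument would be /run/nvidia/validations, run as a user with write access to that path.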
Other info
- GPU operator v22.9.1
- Host driver version 525.60.13
- container toolkit 1.11.0 installed on the host
- Kubernetes 1.22.10
- Host OS is CentOS 7.9