Interaction between operator-validator and device-plugin causes error state.

I've uncovered a strange interaction between the validator and the device plugin that can cause an error state when the NVIDIA driver is installed on the host.

### Background

As far as I can tell, the validator pod uses init containers to create these files, in order:
* `/run/nvidia/validations/host-driver-ready`
* `/run/nvidia/validations/toolkit-ready`
* `/run/nvidia/validations/cuda-ready`
* `/run/nvidia/validations/plugin-ready`

These files are deleted when the pod exits, by a `preStop` lifecycle hook on the main (non-init) container:

```
lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - rm -f /run/nvidia/validations/*-ready
```

The device plugin pod has an init container and a main container. The init container runs

```
until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
```
which prevents the pod from proceeding to the main container until the `toolkit-ready` file exists. Then the main container runs

```
[[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-device-plugin;
```
which sets `$NVIDIA_DRIVER_ROOT` to different values depending on whether or not the `host-driver-ready` file exists. For a host driver installation, we need `driver_root=/`.

### Problem
When the validator pod exits due to an error in one of its init containers, the files in `/run/nvidia/validations` are not removed, because the pod never reached the main container that carries the lifecycle hook. This leaves various `-ready` files behind in the host path. Normally this is not a big problem, since each init container check deletes its already-present file before checking readiness and possibly re-creating it. [For example](https://github.com/NVIDIA/gpu-operator/blob/master/validator/main.go#L629), the `Driver.validate` function deletes `/run/nvidia/validations/host-driver-ready`, checks status, then recreates it if successful.

This delete-check-recreate sequence can open a brief window in which the `toolkit-ready` file exists (because it was not cleaned up by the previous validator pod) but `host-driver-ready` does not. If the device plugin pod starts during this window, it incorrectly sets `$NVIDIA_DRIVER_ROOT=/run/nvidia/driver`. Any other GPU pods then fail to initialize with the error:

```
Error: failed to generate container "..." spec: failed to generate spec: lstat /run/nvidia/driver/dev/nvidiactl: no such file or directory
```
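
The window described above can be reproduced outside the cluster. The following is only a sketch of the sequence as I understand it: it works in a temporary directory rather than `/run/nvidia/validations`, the helper names `toolkit_ready` and `pick_driver_root` are made up for illustration, and it assumes the previous validator pod failed after its toolkit check, so that both markers were left behind.

```shell
#!/bin/sh
# Simulation only: a temporary directory stands in for /run/nvidia/validations.

# Hypothetical helper mirroring the device plugin init container's wait condition.
toolkit_ready() {
    [ -f "$1/toolkit-ready" ]
}

# Hypothetical helper mirroring the main container's driver root selection.
pick_driver_root() {
    if [ -f "$1/host-driver-ready" ]; then
        echo /
    else
        echo /run/nvidia/driver
    fi
}

dir=$(mktemp -d)

# Assumed starting state: markers left by a validator pod that failed
# before its preStop hook could run.
touch "$dir/host-driver-ready" "$dir/toolkit-ready"

# A new validator pod starts; the driver check deletes its marker before re-validating.
rm -f "$dir/host-driver-ready"

# A device plugin pod starting now passes its wait and picks the container driver root.
if toolkit_ready "$dir"; then
    root_during_window=$(pick_driver_root "$dir")
fi

# Driver validation succeeds and the marker returns, but the plugin has already started.
touch "$dir/host-driver-ready"
root_after_window=$(pick_driver_root "$dir")

echo "during window: $root_during_window"
echo "after window:  $root_after_window"

rm -rf "$dir"
```

The only difference between the two results is when the `host-driver-ready` check runs relative to the validator's delete and recreate steps, which is why the failure depends on pod start-up timing.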

One would hope that restarting various components would resolve the issue, but I've observed that once this problem happens, it tends to keep happening until I manually remove the operator and delete the contents of `/run/nvidia/validations`. Furthermore, this problem seems to happen fairly consistently during our node provisioning process.
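
For reference, the manual cleanup step can be sketched roughly as below. The function name `clear_validation_markers` is hypothetical, and it removes only the `*-ready` markers (the same glob the validator's `preStop` hook uses) rather than everything in the directory. It assumes it is run as root on the affected node after the operator has been removed.

```shell
#!/bin/sh
# Hypothetical helper: remove leftover validator markers from a directory.
# Defaults to the host path used by the validator; pass another path to test it safely.
clear_validation_markers() {
    dir=${1:-/run/nvidia/validations}
    # Same glob as the validator's preStop hook; -f keeps this quiet if nothing matches.
    rm -f "$dir"/*-ready
}
```

Calling `clear_validation_markers` with no argument targets `/run/nvidia/validations`. This only clears stale state; it does not close the window in which the device plugin can start with the wrong driver root.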

### Other info
* GPU operator v22.9.1
* Host driver version 525.60.13
* container toolkit 1.11.0 installed on the host
* Kubernetes 1.22.10
* Host OS is CentOS 7.9

