Interaction between operator-validator and device-plugin causes error state. #508

@neggert

I've uncovered a strange interaction between the validator and the device plugin that can cause an error state when the nvidia driver is installed on the host.

Background

As far as I can tell, the validator pod uses init containers to create these files, in order:

  • /run/nvidia/validations/host-driver-ready
  • /run/nvidia/validations/toolkit-ready
  • /run/nvidia/validations/cuda-ready
  • /run/nvidia/validations/plugin-ready

When the pod exits, a preStop lifecycle hook on the main (non-init) container deletes these files:

    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - rm -f /run/nvidia/validations/*-ready

The device plugin pod has an init container and a main container. The init container runs

    until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done

which prevents the pod from proceeding to the main container until the toolkit-ready file exists. Then the main container runs

    [[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-device-plugin;

which sets $NVIDIA_DRIVER_ROOT to different values depending on whether or not the host-driver-ready file exists. For a host driver installation, we need driver_root=/.
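To make the two outcomes concrete, here is a minimal sketch of that decision. The function name pick_driver_root and the scratch directory are my own, for illustration only; the real check is the one-liner above and reads /run/nvidia/validations directly.

```shell
#!/bin/sh
# Hypothetical helper mirroring the device plugin's startup check.
# The validations directory is a parameter so the logic can be tried
# against a scratch directory instead of the real host path.
pick_driver_root() {
    validations_dir=$1
    if [ -f "$validations_dir/host-driver-ready" ]; then
        echo /
    else
        echo /run/nvidia/driver
    fi
}

scratch=$(mktemp -d)

# Only toolkit-ready present: the plugin would assume a containerized driver.
touch "$scratch/toolkit-ready"
root_without=$(pick_driver_root "$scratch")

# host-driver-ready present: the plugin would use the host root.
touch "$scratch/host-driver-ready"
root_with=$(pick_driver_root "$scratch")

rm -rf "$scratch"
echo "without host-driver-ready: $root_without"
echo "with host-driver-ready:    $root_with"
```

The value is picked once at container start and then exported, so whichever state the directory is in at that moment sticks for the lifetime of the plugin process.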

Problem

When the validator pod exits due to an error in one of the init containers, the files in /run/nvidia/validations are not removed, because the pod never reached the main container that carries the lifecycle hook. This leaves various -ready files in the host path. This is normally not a big problem, since each init container deletes its already-present file before checking readiness and possibly re-creating it. For example, the Driver.validate function deletes /run/nvidia/validations/host-driver-ready, checks status, then recreates it if successful.
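The delete / check / recreate pattern can be sketched in shell as follows. This is not the operator's actual code (that lives in the validator itself); validate_component is a hypothetical name, and the commands true and false stand in for the real readiness check.

```shell
#!/bin/sh
# Sketch of the delete / check / recreate pattern, under the assumptions above.
validate_component() {
    status_file=$1
    shift
    rm -f "$status_file"        # the status file is absent from here ...
    if "$@"; then
        touch "$status_file"    # ... until the check succeeds
    else
        return 1                # on failure the file stays absent
    fi
}

scratch=$(mktemp -d)
touch "$scratch/host-driver-ready"   # stale file left by an earlier run

# A failing check removes the stale file and does not recreate it.
failed_rc=0
validate_component "$scratch/host-driver-ready" false || failed_rc=$?
if [ -f "$scratch/host-driver-ready" ]; then after_fail=present; else after_fail=absent; fi

# A passing check recreates it.
validate_component "$scratch/host-driver-ready" true
if [ -f "$scratch/host-driver-ready" ]; then after_ok=present; else after_ok=absent; fi

rm -rf "$scratch"
echo "after failed check: $after_fail (rc=$failed_rc)"
echo "after passing check: $after_ok"
```

Note that each call only touches its own status file. Files belonging to later steps, such as toolkit-ready, are left alone until their own init container runs.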

This sequence can leave a brief window in which the toolkit-ready file exists (because the previous validator pod did not clean it up) but host-driver-ready does not. If the device plugin pod starts during this window, it incorrectly sets $NVIDIA_DRIVER_ROOT=/run/nvidia/driver. Any other GPU pods then fail to initialize with the error

    Error: failed to generate container "..." spec: failed to generate spec: lstat /run/nvidia/driver/dev/nvidiactl: no such file or directory

One would hope that restarting various components would resolve the issue, but I've observed that once this problem happens, it tends to keep happening until I manually remove the operator and delete the contents of /run/nvidia/validations. Furthermore, this problem seems to happen fairly consistently during our node provisioning process.
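For reference, the manual cleanup step I run on an affected node looks roughly like the sketch below. The function name clear_validation_files is made up for illustration, the glob is the same one the validator's preStop hook uses, and the demonstration runs against a scratch directory rather than the real host path.

```shell
#!/bin/sh
# Hypothetical cleanup helper; on a real node it would be run as root
# after removing the operator, with no argument so the default path is used.
clear_validation_files() {
    validations_dir=${1:-/run/nvidia/validations}
    # Same glob as the validator's preStop hook.
    rm -f "$validations_dir"/*-ready
}

# Demonstration against a scratch directory.
scratch=$(mktemp -d)
touch "$scratch/toolkit-ready" "$scratch/cuda-ready"
clear_validation_files "$scratch"
remaining=$(ls -A "$scratch" | wc -l | tr -d ' ')
rm -rf "$scratch"
echo "files left: $remaining"
```

This only clears the stale state. It does not close the window described above, so the problem can come back the next time a validator init container fails.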

Other info

  • GPU operator v22.9.1
  • Host driver version 525.60.13
  • container toolkit 1.11.0 installed on the host
  • Kubernetes 1.22.10
  • Host OS is CentOS 7.9
