Alert GPUOperatorNodeDeploymentDriverFailed constantly fires on OpenShift, even when driver deployment appears successful in 24.6.0 #892

@benhwebster

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL 9.2
  • Kernel Version: 5.14.0-284.71.1.el9_2.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): OCP 4.15.20
  • GPU Operator Version: 24.6.0

2. Issue or feature description

The following alert is constantly firing:

        - alert: GPUOperatorNodeDeploymentDriverFailed
          annotations:
            description: |
              GPU Operator could not expose GPUs for more than 30min and
              nvidia driver could not be properly deployed in the node
              {{ $labels.node }}
            summary: GPU Operator could not expose GPUs (Driver)
          expr: |
            gpu_operator_node_driver_validation == 0
          for: 30m
          labels:
            severity: warning
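
To check what the rule above is evaluating, one option is to read the exporter's metrics output directly and look at the value of `gpu_operator_node_driver_validation`. The following is a minimal sketch, not the operator's own tooling: the function names are hypothetical, and it assumes the endpoint returns Prometheus text exposition format without trailing timestamps on sample lines.

```python
METRIC = "gpu_operator_node_driver_validation"  # name taken from the alert expr


def driver_validation_samples(metrics_text: str) -> dict[str, float]:
    """Map each series of METRIC (name plus any labels) to its value.

    Assumes one sample per line in the form `<series> <value>` with no
    trailing timestamp (an assumption, not confirmed for this exporter).
    """
    samples: dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        # Match only the exact metric, bare or with a label set.
        if series == METRIC or series.startswith(METRIC + "{"):
            samples[series] = float(value)
    return samples


def series_matching_alert(metrics_text: str) -> list[str]:
    """Return the series whose value is 0, i.e. those matching `== 0`."""
    return [s for s, v in driver_validation_samples(metrics_text).items() if v == 0]
```

The text to pass in could be fetched from the exporter pod, for example with `urllib.request.urlopen` against port 8000 and path `/metrics` (the port and path come from the exporter log; the host to use depends on how the pod is reached). A series returned by `series_matching_alert` for 30 minutes or longer would be consistent with the alert firing.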

Every status check appears to pass, but the exporter then seems to assume a pre-installed driver is present and tries to validate it once a minute. There should be no pre-installed driver on this host.

Logs from nvidia-node-status-exporter:

        time="2024-08-02T18:36:59Z" level=info msg="version: 4d9a1887, commit: 4d9a188"
        time="2024-08-02T18:36:59Z" level=info msg="Running the metrics server, listening on :8000/metrics"
        time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: watching toolkit-ready"
        time="2024-08-02T18:36:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
        time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: 'toolkit-ready' is ready"
        time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: watching plugin-ready"
        time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: 'plugin-ready' is ready"
        time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: watching driver-ready"
        time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: 'driver-ready' is ready"
        time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: watching cuda-ready"
        time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: 'cuda-ready' is ready"
        time="2024-08-02T18:36:59Z" level=info msg="metrics: DevicePlugin validation: node name is example.com"
        time="2024-08-02T18:36:59Z" level=info msg="metrics: DevicePlugin validation: found 2 GPUs exposed by the DevicePlugin"
        time="2024-08-02T18:36:59Z" level=info msg="metrics: PCI devices: found 1 NVIDIA device"
        time="2024-08-02T18:37:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
        time="2024-08-02T18:38:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
        time="2024-08-02T18:39:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
        time="2024-08-02T18:40:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
        time="2024-08-02T18:41:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
        time="2024-08-02T18:42:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
        time="2024-08-02T18:43:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
        time="2024-08-02T18:44:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
        time="2024-08-02T18:45:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
        time="2024-08-02T18:46:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
        time="2024-08-02T18:47:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
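
For anyone comparing their own exporter logs against the ones above, the spacing of the repeated message can be measured with a small script. This is a rough sketch under stated assumptions: the function name and regular expression are hypothetical, and the line layout (`time="..." level=... msg="..."` with UTC timestamps ending in `Z`) is inferred only from the log excerpt in this report.

```python
import re
from datetime import datetime

# Line layout inferred from the excerpt above; other log formats will not match.
LINE_RE = re.compile(r'time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>.*)"$')


def message_intervals(log_text: str, needle: str) -> list[float]:
    """Return the seconds between consecutive log lines whose msg contains needle."""
    stamps = []
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match and needle in match.group("msg"):
            # The trailing Z is matched as a literal character.
            stamps.append(datetime.strptime(match.group("time"), "%Y-%m-%dT%H:%M:%SZ"))
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]
```

Calling `message_intervals(log_text, "pre-installed driver")` on the saved pod logs gives the gaps between validation attempts, which makes it easy to see whether the attempts continue at a fixed interval after all status files are reported ready.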

3. Steps to reproduce the issue

Upgrade the GPU Operator to 24.6.0.

Let me know if you'd like any other information.

Labels: bug