1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL 9.2
- Kernel Version: 5.14.0-284.71.1.el9_2.x86_64
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): OCP 4.15.20
- GPU Operator Version: 24.6.0
2. Issue or feature description
The following alert is constantly firing:
- alert: GPUOperatorNodeDeploymentDriverFailed
  annotations:
    description: |
      GPU Operator could not expose GPUs for more than 30min and
      nvidia driver could not be properly deployed in the node
      {{ $labels.node }}
    summary: GPU Operator could not expose GPUs (Driver)
  expr: |
    gpu_operator_node_driver_validation == 0
  for: 30m
  labels:
    severity: warning
All of the validations appear to pass, but the exporter then seems to assume that a pre-installed driver is present and repeatedly tries to validate it. There should not be a pre-installed driver on this host.
Logs from nvidia-node-status-exporter:
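To confirm what the exporter actually reports, the gauge behind the alert can be read from the exporter's metrics output (it serves :8000/metrics). Below is a minimal sketch, assuming the endpoint returns the standard Prometheus text format. The metric name comes from the alert's `expr`; the `SAMPLE` text is hypothetical and was not captured from this node, and the helper names are made up for illustration.

```python
import re

# Metric name taken from the alert expression in this report.
METRIC = "gpu_operator_node_driver_validation"

# Hypothetical sample of the endpoint's output; NOT captured from a real node.
SAMPLE = """\
# HELP gpu_operator_node_driver_validation hypothetical help text
# TYPE gpu_operator_node_driver_validation gauge
gpu_operator_node_driver_validation 0
gpu_operator_node_other_metric 1
"""


def read_metric(text, name):
    """Return the float value of every sample of `name` in Prometheus text format."""
    # Match the exact metric name, an optional {label="..."} set, then the value.
    pattern = re.compile(r"^" + re.escape(name) + r"(\{[^}]*\})?\s+(\S+)")
    values = []
    for line in text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = pattern.match(line)
        if match:
            values.append(float(match.group(2)))
    return values


def alert_condition(values):
    """True when any sample equals 0, mirroring the alert's `== 0` expression."""
    return any(v == 0 for v in values)


if __name__ == "__main__":
    samples = read_metric(SAMPLE, METRIC)
    print(samples, alert_condition(samples))
```

If the gauge reads 0 while the driver-ready status file is reported as ready, that would point at the driver validation path in the exporter rather than at the driver itself.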
time="2024-08-02T18:36:59Z" level=info msg="version: 4d9a1887, commit: 4d9a188"
time="2024-08-02T18:36:59Z" level=info msg="Running the metrics server, listening on :8000/metrics"
time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: watching toolkit-ready"
time="2024-08-02T18:36:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: 'toolkit-ready' is ready"
time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: watching plugin-ready"
time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: 'plugin-ready' is ready"
time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: watching driver-ready"
time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: 'driver-ready' is ready"
time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: watching cuda-ready"
time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: 'cuda-ready' is ready"
time="2024-08-02T18:36:59Z" level=info msg="metrics: DevicePlugin validation: node name is example.com"
time="2024-08-02T18:36:59Z" level=info msg="metrics: DevicePlugin validation: found 2 GPUs exposed by the DevicePlugin"
time="2024-08-02T18:36:59Z" level=info msg="metrics: PCI devices: found 1 NVIDIA device"
time="2024-08-02T18:37:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:38:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:39:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:40:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:41:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:42:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:43:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:44:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:45:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:46:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:47:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
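The repetition in the log above can be summarized with a short script. This is a rough sketch, assuming every line follows the `time="..." level=... msg="..."` layout shown; `EXCERPT` holds four lines copied from the log, and the function names are illustrative only.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the log layout shown above: time="..." level=... msg="..."
LINE_RE = re.compile(r'time="([^"]+)" level=(\w+) msg="(.*)"$')

# Four lines copied from the exporter log in this report.
EXCERPT = '''\
time="2024-08-02T18:36:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:36:59Z" level=info msg="metrics: StatusFile: 'toolkit-ready' is ready"
time="2024-08-02T18:37:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2024-08-02T18:38:59Z" level=info msg="Attempting to validate a pre-installed driver on the host"
'''


def parse(lines):
    """Turn log lines into (timestamp, level, message) tuples, skipping non-matching lines."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
            entries.append((ts, match.group(2), match.group(3)))
    return entries


def repeats(entries):
    """Return messages that occur more than once, with their counts."""
    counts = Counter(msg for _, _, msg in entries)
    return {msg: n for msg, n in counts.items() if n > 1}


def intervals(entries, msg):
    """Seconds between consecutive occurrences of `msg`."""
    times = [ts for ts, _, m in entries if m == msg]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    parsed = parse(EXCERPT.splitlines())
    for message, count in repeats(parsed).items():
        print(count, intervals(parsed, message), message)
```

Run against the full log, this shows the pre-installed driver message recurring at one-minute intervals while no other message repeats.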
3. Steps to reproduce the issue
Upgrade the GPU Operator to 24.6.0.
Let me know if you'd like any other information.