Description
Hello Team,
Recently, while we were upgrading our OpenShift cluster from 4.17.18 to 4.17.20 (Kubernetes v1.30.10), the NVIDIA GPU Operator was upgraded to v25.3. After the upgrade, the GPU Operator went down, which impacted the applications using the GPUs. All of the GPU Operator pods (dcgm, validator, dcgm-exporter) were stuck in the "Init" state, and the GPU cluster policy was in a not-ready state. We later rolled back to the previous version, v24.9.2, which resolved the issue.
Is the new version, v25.3, unsupported on these OpenShift and Kubernetes versions, or does it contain a bug that caused the issue? Could someone investigate and let us know the root cause? We need to document the reason and provide it to our business department as part of the post-mortem analysis.
Let me know if you require any further details.
Thanks,