NVIDIA GPU Operator issue on OpenShift (4.17.20)

Hello Team,


Recently, during an upgrade of the OpenShift cluster from 4.17.18 to 4.17.20 (Kubernetes v1.30.10), the NVIDIA GPU Operator was upgraded to v25.3. After the upgrade, the NVIDIA GPU Operator was down, which impacted the applications using the GPUs. All the NVIDIA GPU Operator pods (dcgm, validator, dcgm-exporter) were stuck in the "Init" state, and the GPU ClusterPolicy was in a not-ready state. We later rolled back to the previous version, v24.9.2, which resolved the issue.


Could the new version, v25.3, be unsupported on the latest OpenShift and Kubernetes versions, or does it contain bugs that caused the issue? Can someone investigate this and let us know the root cause of the problem? We need to document the reason and provide it to the business department as part of the post-mortem analysis.


Let me know if you require any further details.

Thanks,

Nvidia GPU operator issue on openshift(4.17.20) #1372
