
NVIDIA GPU Operator issue on OpenShift (4.17.20) #1372

@Nikhil-VW


Hello Team,

Recently, while we were upgrading our OpenShift cluster from 4.17.18 to 4.17.20 (Kubernetes v1.30.10), the NVIDIA GPU Operator was upgraded to v25.3. After the upgrade, the GPU Operator went down, which impacted the applications using the GPUs. All of the operator's pods (dcgm, validator, dcgm-exporter) were stuck in the "Init" stage, and the GPU ClusterPolicy was not ready. We later rolled back to the previous version, v24.9.2, which eventually resolved the issue.

Is v25.3 unsupported on this OpenShift and Kubernetes version, or does it contain a bug that caused the issue? Could someone investigate and let us know the root cause? We need to document the reason and share it with the business department as part of the post-mortem analysis.

Let me know if you need any further details.



Thanks,


Labels: question (categorizes issue or PR as a support question)
