Unable to turn on advanced upgrade controller #688

@age9990

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu20.04
  • Kernel Version: 5.15.0-69
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): crio
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: v23.9.2 with the NVIDIADriver CRD enabled

2. Issue or feature description

In our cluster, one GPU node has a disk issue, so its status is NotReady. When I try to turn on the advanced upgrade controller by setting driver.upgradePolicy.autoUpgrade to true, the controller is not enabled and the operator logs the error messages below.
I also tried setting the label nvidia.com/gpu-driver-upgrade.skip=true on the broken GPU node, but the same error occurred.
In another k8s cluster where every node is Ready, the advanced upgrade controller works as expected. However, since a node may be down temporarily, would it be reasonable to bypass broken nodes rather than fail straight away?
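
To make the requested behaviour concrete, here is a minimal sketch in Python of the node selection being asked for: skip any node that carries the nvidia.com/gpu-driver-upgrade.skip=true label or is not Ready, and continue with the rest. It illustrates the request only and is not the GPU Operator's implementation. The function names and the sample node data are hypothetical, and the node dictionaries are assumed to have the shape returned by `kubectl get nodes -o json`.

```python
# Hypothetical illustration of the requested behaviour, not GPU Operator code.
SKIP_LABEL = "nvidia.com/gpu-driver-upgrade.skip"


def is_ready(node):
    """Return True if the node reports a Ready condition with status "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def upgrade_candidates(nodes):
    """Return names of nodes that are Ready and not labelled to be skipped."""
    candidates = []
    for node in nodes:
        labels = node.get("metadata", {}).get("labels", {})
        if labels.get(SKIP_LABEL) == "true":
            continue  # explicitly excluded by the user
        if not is_ready(node):
            continue  # broken or temporarily down: bypass instead of failing
        candidates.append(node["metadata"]["name"])
    return candidates


# Made-up sample data: one healthy node, one NotReady node, one skipped node.
nodes = [
    {"metadata": {"name": "gpu-node-a", "labels": {}},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "gpu-node-b", "labels": {}},
     "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    {"metadata": {"name": "gpu-node-c", "labels": {SKIP_LABEL: "true"}},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
]

print(upgrade_candidates(nodes))  # ['gpu-node-a']
```

With this kind of filtering, the NotReady node would be left alone and the upgrade could still proceed on the healthy nodes.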

GPU Operator error logs:
{"level":"error","ts":"2024-03-27T06:00:03.292Z","logger":"controllers.Upgrade","msg":"Failed to build node upgrade state for pod","pod":{"namespace":"gpu-operator","name":"nvidia-gpu-driver-ubuntu20.04-797bd4457c-x4czx"},"error":"unable to get node : resource name may not be empty"}
{"level":"error","ts":"2024-03-27T06:00:03.292Z","logger":"controllers.Upgrade","msg":"Failed to build cluster upgrade state","error":"unable to get node : resource name may not be empty"}
{"level":"error","ts":"2024-03-27T06:00:03.292Z","msg":"Reconciler error","controller":"upgrade-controller","object":{"name":"cluster-policy"},"namespace":"","name":"cluster-policy","reconcileID":"474846e5-07f9-445a-9107-a452581f1a69","error":"unable to get node : resource name may not be empty"}

Labels: bug
