1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04.3 LTS
- Kernel Version: 5.15.0-87-generic x86_64
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K3S
- GPU Operator Version: v23.9.0
2. Issue or feature description
After upgrading the gpu-operator to v23.9.0, the gpu-operator pod is stuck in a crash loop, which should not happen. The error "failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource" repeats numerous times in the container logs before the container stops with the error "failed to wait for nvidia-driver-controller caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.NVIDIADriver". This looks like a regression from the new GPU Driver Custom Resource Definition: when that CRD is not deployed, the operator does not function properly.
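To check whether another log shows the same failure pattern, a small script along these lines may help. It is only a sketch: the function name `summarize_operator_log` and the result keys are hypothetical, the two substrings are copied from the errors quoted above, and it assumes the log has one message per line.

```python
import sys

# Substrings copied from the two errors quoted in this report.
DISCOVERY_ERROR = "nvidia.com/v1alpha1: the server could not find the requested resource"
CACHE_SYNC_ERROR = "timed out waiting for cache to be synced for Kind *v1alpha1.NVIDIADriver"


def summarize_operator_log(text: str) -> dict:
    """Count both error signatures in a gpu-operator container log (hypothetical helper)."""
    lines = text.splitlines()
    discovery_hits = sum(1 for line in lines if DISCOVERY_ERROR in line)
    cache_hits = sum(1 for line in lines if CACHE_SYNC_ERROR in line)
    return {
        "discovery_errors": discovery_hits,
        "cache_sync_timeouts": cache_hits,
        # Assumption: the pattern in this report is "both errors present".
        "matches_this_issue": discovery_hits > 0 and cache_hits > 0,
    }


if __name__ == "__main__":
    # Usage: python check_log.py <path-to-operator-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        print(summarize_operator_log(handle.read()))
```

Running it against the attached log file should report whether both errors appear. A match only indicates the same symptoms, not necessarily the same root cause.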
3. Steps to reproduce the issue
Install gpu-operator v23.6.1, upgrade the Helm chart to v23.9.0, and observe that the gpu-operator pod enters a crash loop.
gpu-operator-6fdbc66bd4-k82lb_gpu-operator.log