GPU Operator crash loop due to missing CRDs #602

@6ixfalls

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04.3 LTS
  • Kernel Version: 5.15.0-87-generic x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K3s
  • GPU Operator Version: v23.9.0

2. Issue or feature description

Upgrading the gpu-operator to v23.9.0 should not leave the gpu-operator pod stuck in a crash loop, but it does. The error "failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource" repeats many times in the container logs before the container stops with the error "failed to wait for nvidia-driver-controller caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.NVIDIADriver". This looks like a regression introduced with the new GPU Driver Custom Resource Definition: when that CRD is not deployed, the operator does not function properly.
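
To check whether a given operator log shows the same failure, a small script can count the two error messages quoted above. This is only a sketch: the function name `summarize_operator_log` and the result keys are made up for illustration, and the only thing it assumes about the log format is that each message appears somewhere on a single line.

```python
import re

# Error signatures copied from the messages quoted in this issue.
MISSING_API_GROUP = re.compile(
    r"unable to retrieve the complete list of server APIs: "
    r"(?P<group_version>\S+): the server could not find the requested resource"
)
CACHE_SYNC_TIMEOUT = re.compile(
    r"timed out waiting for cache to be synced for Kind \*(?P<kind>[\w.]+)"
)


def summarize_operator_log(log_text: str) -> dict:
    """Count missing API group errors and list kinds whose cache sync timed out.

    Hypothetical helper, not part of the GPU Operator or any library.
    """
    missing = {}    # group/version -> number of occurrences
    timed_out = []  # kinds, in order of first appearance
    for line in log_text.splitlines():
        match = MISSING_API_GROUP.search(line)
        if match:
            group_version = match.group("group_version")
            missing[group_version] = missing.get(group_version, 0) + 1
        match = CACHE_SYNC_TIMEOUT.search(line)
        if match and match.group("kind") not in timed_out:
            timed_out.append(match.group("kind"))
    return {"missing_api_groups": missing, "cache_sync_timeouts": timed_out}
```

If `missing_api_groups` contains `nvidia.com/v1alpha1` and `cache_sync_timeouts` contains `v1alpha1.NVIDIADriver`, the log matches the behaviour described here.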

3. Steps to reproduce the issue

Install gpu-operator v23.6.1, upgrade the Helm chart to v23.9.0 and observe the gpu-operator pod in a crash loop.
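
The steps above can be written out as concrete commands. The sketch below only builds and renders the command lines and does not run anything. The release name `gpu-operator`, the chart reference `nvidia/gpu-operator`, and the namespace `gpu-operator` are assumptions and may differ in other setups; only the two versions come from this report.

```python
import shlex


def repro_commands(release="gpu-operator",
                   chart="nvidia/gpu-operator",
                   namespace="gpu-operator"):
    """Return the argv lists for the reproduction steps.

    All three defaults are assumed values for illustration only.
    """
    install = ["helm", "install", release, chart,
               "--namespace", namespace, "--create-namespace",
               "--version", "v23.6.1"]
    upgrade = ["helm", "upgrade", release, chart,
               "--namespace", namespace,
               "--version", "v23.9.0"]
    # Watch the pods to observe the gpu-operator pod restarting.
    watch = ["kubectl", "get", "pods", "--namespace", namespace, "--watch"]
    return [install, upgrade, watch]


def render(commands):
    """Render argv lists as shell-safe command strings."""
    return [shlex.join(argv) for argv in commands]
```

Printing `render(repro_commands())` line by line gives commands that can be pasted into a shell against a test cluster.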

gpu-operator-6fdbc66bd4-k82lb_gpu-operator.log
