gpu-operator fails to start due to deletion of nonexistent resources #484

@xknight

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description

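Cleanup of PodSecurityPolicy (PSP) resources fails on Kubernetes 1.25.4 / OKD 4.12 (tested with gpu-operator 22.9.2). The PodSecurityPolicy API (policy/v1beta1) was removed in Kubernetes 1.25, so the delete call fails with a "no matches for kind" error: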

{"level":"info","ts":1675677539.5370145,"logger":"controllers.ClusterPolicy","msg":"Couldn't delete","PodSecurityPolicies":"gpu-operator-privileged","Error":"no matches for kind \"PodSecurityPolicy\" in version \"policy/v1beta1\""}
{"level":"error","ts":1675677539.5370853,"msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"cluster-policy"},"namespace":"","name":"cluster-policy","reconcileID":"2a1a3aa8-0a42-4425-95fd-0bd83f2d6ad7","error":"no matches for kind \"PodSecurityPolicy\" in version \"policy/v1beta1\""}

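It seems the code here should treat a resource kind that the API server no longer serves (the "no matches for kind" error) the same way as "not found" and let the check pass. Instead it returns the error, so reconciliation fails and the operator never starts.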

2. Steps to reproduce the issue

  1. Install the gpu-operator in an OKD 4.12 cluster
  2. Observe the logs for the gpu-operator pod

3. Other information

Commenting out the code block above allows the operator to start normally.
