
PodClique cleanup fails when resource.k8s.io/v1 ResourceClaim API is unavailable #609

@julienmancuso

What happened?

A PodClique delete reconcile can fail when the target cluster does not serve resource.k8s.io/v1 ResourceClaim, even if the workload is only trying to clean up.

Observed operator log:

{"level":"error","ts":"2026-05-12T22:13:19.622Z","msg":"Reconciler error","controller":"podclique-controller","controllerGroup":"grove.io","controllerKind":"PodClique","PodClique":{"name":"myllm-0-frontend","namespace":"sr-48d5ee24-b975-4a09-b46e-e7f8834f210b"},"namespace":"sr-48d5ee24-b975-4a09-b46e-e7f8834f210b","name":"myllm-0-frontend","reconcileID":"e6f82fe8-bf91-4c3d-8730-81dfbcbee889","error":"[Operation: Delete, Code: ERR_DELETE_PCLQ_RESOURCE_CLAIM] message: Error deleting PCLQ-level ResourceClaims for sr-48d5ee24-b975-4a09-b46e-e7f8834f210b/myllm-0-frontend, cause: no matches for kind \"ResourceClaim\" in version \"resource.k8s.io/v1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/controller/controller.go:474\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/controller/controller.go:421\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/controller/controller.go:296"}

The failing operation is:

Operation: Delete
Code: ERR_DELETE_PCLQ_RESOURCE_CLAIM
cause: no matches for kind "ResourceClaim" in version "resource.k8s.io/v1"

This is related to #543, but that issue was closed by upgrading the local development kind cluster. The underlying operator behavior still exists: Grove attempts to reconcile and delete ResourceClaim objects even when the apiserver does not serve the resource.k8s.io/v1 API.

Expected behavior

PodClique cleanup should not get stuck solely because the cluster does not serve resource.k8s.io/v1 ResourceClaim.

If DRA support is unavailable, Grove should either:

  • detect that the API is absent and skip ResourceClaim cleanup as already gone/not applicable, or
  • surface a clear prerequisite error before accepting/enabling DRA-backed resource sharing.

At minimum, delete cleanup should probably ignore NoKindMatchError/resource-not-found style errors for ResourceClaim cleanup, since there cannot be ResourceClaim objects to delete if the API is absent.

Notes

Grove currently uses the stable Kubernetes DRA API (k8s.io/api/resource/v1). That implies Kubernetes 1.34+ for the ResourceClaim sharing path, or equivalent clusters serving resource.k8s.io/v1.
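
The prerequisite-error option from "Expected behavior" could be sketched roughly as below. This is an illustration under assumptions, not Grove's implementation: `checkDRAPrerequisite` and the `served` map (group/version to served kinds) are hypothetical. In a real operator that map would probably be filled from the apiserver, for example via client-go's discovery `ServerResourcesForGroupVersion` call.

```go
package main

import "fmt"

// resourceClaimGV is the group/version that the ResourceClaim sharing path needs.
const resourceClaimGV = "resource.k8s.io/v1"

// checkDRAPrerequisite returns an error when DRA-backed resource sharing is
// requested but the cluster does not serve ResourceClaim in resource.k8s.io/v1.
// served is a hypothetical map from group/version to the kinds it serves.
func checkDRAPrerequisite(served map[string][]string, draRequested bool) error {
	if !draRequested {
		return nil // nothing to validate when the workload does not use DRA
	}
	for _, kind := range served[resourceClaimGV] {
		if kind == "ResourceClaim" {
			return nil
		}
	}
	return fmt.Errorf("DRA-backed resource sharing requires %s ResourceClaim, which this cluster does not serve",
		resourceClaimGV)
}

func main() {
	// A cluster that only serves a beta version of the DRA API.
	oldCluster := map[string][]string{"resource.k8s.io/v1beta1": {"ResourceClaim"}}
	// A cluster that serves the stable DRA API.
	newCluster := map[string][]string{resourceClaimGV: {"ResourceClaim", "DeviceClass"}}

	fmt.Println("old cluster:", checkDRAPrerequisite(oldCluster, true))
	fmt.Println("new cluster:", checkDRAPrerequisite(newCluster, true))
}
```

Running a check like this at admission or startup would turn the late delete-time failure into an early, explicit message about the missing API.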
