fix: cascade-delete PCS/PCSG/PCLQ via Kubernetes GC by danbar2 · Pull Request #556 · ai-dynamo/grove

danbar2 · 2026-04-28T06:16:34Z

Closes #423.

What this changes

Replaces the manual delete-children → poll-until-clean → remove-finalizer flow in the PCS, PCSG, and PodClique reconcilers with immediate finalizer removal. The Kubernetes garbage collector then cascades the entire owner-reference tree (PCS → PCSG → PC → Pod, plus the PCS-owned siblings: Service, ServiceAccount, Role, RoleBinding, HPA, PodGang, ResourceClaim).

The PodClique reconciler keeps a one-step finalizer to clear its in-memory expectationsStore entry — that state lives outside the API server and GC can't see it.

The pattern: GC cascade + thin finalizer for non-API-server state

This is the dominant pattern across modern cloud-native CRDs that own a tree of K8s resources. Owner refs do the heavy lifting; the finalizer exists only as a hook for state GC can't reach.

Reference implementations using this pattern:

Kubernetes core controllers — ReplicaSet, Deployment, StatefulSet, DaemonSet, Job all rely on owner-reference cascade for child cleanup. The expectations cache pattern Grove now mirrors lives in k8s.io/kubernetes/pkg/controller/controller_utils.go and is consumed by replica_set.go and statefulset/stateful_set.go. They never manually Delete their pods on parent deletion.
JobSet (kubernetes-sigs) — JobSet → Job → Pod via owner refs; the controller does no manual child deletion.
LeaderWorkerSet (LWS) (kubernetes-sigs) — LWS → StatefulSet → Pod, same pattern.
Kueue — Workload carries a finalizer whose only job is to release queue/quota accounting before letting GC cascade.
KubeRay — RayCluster/RayService use finalizers for service-mesh deregistration and graceful dashboard shutdown, but children clean up via owner refs.
Volcano — PodGroup finalizer releases gang-scheduling state; pods cascade via GC.
KServe — InferenceService finalizer releases external model-server registration; the K8s subtree is cascaded.

What Grove was doing before is the older Gardener-style controller-utils template: a serialized delete → verify → finalize loop. Defensive when owner refs are unreliable, but it serializes through concurrentSyncs — which is exactly what made #423 a 20-minute deletion at 5k pods.

Tradeoffs we're accepting

Parent disappears from the API as soon as its finalizer is removed, even while children are still terminating. Anyone scripting around "PCS object gone = workload gone" should switch to watching pods.
No per-step deletion progress in status. Live cluster state is the source of truth.
Correctness now depends on every component having a controller owner reference. Audited the existing components (HPA, Service, ServiceAccount, Role, RoleBinding, SAtokenSecret, PodGang, ResourceClaim, PCSG, PC, Pod) — all set controllerutil.SetControllerReference on owned objects. A unit test asserting owner refs on every newly-built object would be a reasonable follow-up to keep this property from regressing.

copy-pr-bot · 2026-04-28T06:16:38Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

renormalize

I think the original intention behind blocking the delete call until all child resources are deleted is to signal to the user that the PCS is still undergoing deletion due to some reason - primary being that Pods could be stuck in Terminating, which is commonly observed in pods running disaggregated inference.

Deletion of the PCS before the pods have relinquished GPU claims might confuse the user behind why a new PCS is not being scheduled. However, I do not think this reason is strong enough to cause severe performance degradation with a large PCS.

Thanks for the fix!

The reconcilers used to manually walk every child resource, poll until each was gone, then drop the finalizer. With concurrentSyncs: 3 and ~5k PodCliques, that serialized into ~20 minutes per delete (ai-dynamo#423). Owner references already form the full PCS -> PCSG -> PC -> Pod chain plus the PCS-owned siblings (Service, ServiceAccount, Role/Binding, HPA, PodGang, ResourceClaim), so the kube-controller-manager's GC can cascade the whole subtree on its own. Each reconciler now just removes its finalizer; the PodClique reconciler additionally clears its in- memory expectations entry, since that state lives outside the API server and would otherwise leak. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds Test_ScaleTest_5000_Deletion as a regression guard for ai-dynamo#423: a single PodCliqueSet built from a standalone PodClique (frontend, 8 pods) plus two PodCliqueScalingGroups (prefill and decode, 624 replicas × 4 pods each), totalling exactly 5000 pods. The test deploys the workload, waits until all pods exist, deletes the PCS, and measures the time to full teardown via the existing TimelineTracker. The scale_test.go file is restructured to share scaffolding (cluster prep, tracker setup, pprof, output dir, result export) between this test and Test_ScaleTest_1000 so future scale tests can be added without duplicating ~50 lines of setup. The Makefile run-scale-test timeout is bumped to 60m to fit both tests in one invocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the delete-phase milestone in both scale tests with a new PCSFullyDeletedCondition that lists PCS, PodCliqueScalingGroups, PodCliques, and Pods by the workload's part-of label and only fires once every count is zero. The previous PCSDeletedCondition only checked that the parent PCS was NotFound — which the new cascade flow satisfies in <1s while child resources continue draining via K8s GC for several minutes at scale. The new condition reports actual end-to-end teardown latency (PCS=0 PCSG=0 PCLQ=0 Pod=N). Measured locally with the same operator build: - ScaleTest_1000: full cascade in 189.7s (~3.2 min) - ScaleTest_5000_Deletion: full cascade in 561.7s (~9.4 min) vs. the >20 min serial-delete behavior reported in ai-dynamo#423. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Scale tests had 15m + 30m internal budgets and a 60m Makefile timeout that were chosen pre-cascade-fix. Measured worst-case wall clock with the full-cascade-cleanup condition is ~5 min for ScaleTest_1000 and ~12 min for ScaleTest_5000_Deletion, so ~2x headroom is plenty: - ScaleTest_1000: 15m -> 10m - ScaleTest_5000_Deletion: 30m -> 20m - run-scale-test Makefile: 60m -> 40m Tighter budgets fail fast on real regressions instead of letting a hung run consume an hour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…edCondition Reads more obviously at the call site: it's a delete-cascade check, not a vague "fully deleted" status. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ComputeDomains owned by a PodCliqueSet carry a Grove-managed finalizer (grove.io/computedomain-finalizer) that the K8s garbage collector cannot strip on its own. The previous serial-delete flow happened to clear it because deletePodCliqueSetResources walked every component operator and called Delete; the cascade refactor dropped that loop and left CDs stuck in `deleting` state once their owner PCS was removed. Re-introduces a single targeted step in the PCS deletion flow that invokes the ComputeDomain operator's Delete (which removes the finalizer and issues Delete) before the PCS finalizer is dropped. Every other child still cascades via owner references; this is the narrow exception for the only resource Grove decorates with its own finalizer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

danbar2 added the run-e2e label Apr 28, 2026

danbar2 marked this pull request as ready for review April 28, 2026 07:55

danbar2 requested review from Ronkahn21, gflarity, sanjaychatterjee, shayasoolin and unmarshall as code owners April 28, 2026 07:55

danbar2 force-pushed the cascade-delete branch from cf67cc3 to 0564399 Compare April 28, 2026 09:21

shayasoolin reviewed Apr 28, 2026

View reviewed changes

Comment thread operator/e2e/measurement/condition/pcs.go

danbar2 force-pushed the cascade-delete branch from 0564399 to aab968b Compare April 28, 2026 12:03

shayasoolin approved these changes Apr 28, 2026

View reviewed changes

danbar2 changed the title ~~fix: cascade-delete PCS/PCSG/PC via Kubernetes GC~~ fix: cascade-delete PCS/PCSG/PCLQ via Kubernetes GC Apr 28, 2026

danbar2 force-pushed the cascade-delete branch 2 times, most recently from ab9838a to 63f8b39 Compare April 30, 2026 13:58

gflarity approved these changes Apr 30, 2026

View reviewed changes

danbar2 force-pushed the cascade-delete branch from 63f8b39 to 279b229 Compare May 3, 2026 05:48

renormalize approved these changes May 5, 2026

View reviewed changes

danbar2 and others added 7 commits May 6, 2026 14:21

chore: gitignore scale-test per-run output dirs

99c37a8

test(e2e): rename PCSFullyDeletedCondition -> PCSAndSubresourcesDelet…

08e97ac

…edCondition Reads more obviously at the call site: it's a delete-cascade check, not a vague "fully deleted" status. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

danbar2 force-pushed the cascade-delete branch from 279b229 to f4a79d4 Compare May 6, 2026 11:21

danbar2 merged commit b594559 into ai-dynamo:main May 6, 2026
15 checks passed

shayasoolin mentioned this pull request May 18, 2026

Reduce noisy PodClique NotFound logs during normal cascade-delete flow #622

Open

yankay mentioned this pull request May 21, 2026

fix: suppress cascade-delete reconcile noise across PCS/PCSG/PCLQ #627

Open

5 tasks

AsadShahid04 mentioned this pull request Jun 1, 2026

fix: reduce noisy PodClique NotFound logs during cascade-delete #641

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: cascade-delete PCS/PCSG/PCLQ via Kubernetes GC#556

fix: cascade-delete PCS/PCSG/PCLQ via Kubernetes GC#556
danbar2 merged 7 commits into
ai-dynamo:mainfrom
danbar2:cascade-delete

danbar2 commented Apr 28, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Apr 28, 2026

Uh oh!

Uh oh!

renormalize left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

danbar2 commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this changes

The pattern: GC cascade + thin finalizer for non-API-server state

Tradeoffs we're accepting

Uh oh!

copy-pr-bot Bot commented Apr 28, 2026

Uh oh!

Uh oh!

renormalize left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

danbar2 commented Apr 28, 2026 •

edited

Loading