fix: cascade-delete PCS/PCSG/PCLQ via Kubernetes GC#556
Merged
Conversation
shayasoolin
reviewed
Apr 28, 2026
shayasoolin
approved these changes
Apr 28, 2026
ab9838a to
63f8b39
Compare
gflarity
approved these changes
Apr 30, 2026
renormalize
approved these changes
May 5, 2026
renormalize
left a comment
Contributor
There was a problem hiding this comment.
I think the original intention behind blocking the delete call until all child resources are deleted is to signal to the user that the PCS is still undergoing deletion due to some reason - primary being that Pods could be stuck in Terminating, which is commonly observed in pods running disaggregated inference.
Deletion of the PCS before the pods have relinquished GPU claims might confuse the user behind why a new PCS is not being scheduled. However, I do not think this reason is strong enough to cause severe performance degradation with a large PCS.
Thanks for the fix!
The reconcilers used to manually walk every child resource, poll until each was gone, then drop the finalizer. With concurrentSyncs: 3 and ~5k PodCliques, that serialized into ~20 minutes per delete (ai-dynamo#423). Owner references already form the full PCS -> PCSG -> PC -> Pod chain plus the PCS-owned siblings (Service, ServiceAccount, Role/Binding, HPA, PodGang, ResourceClaim), so the kube-controller-manager's GC can cascade the whole subtree on its own. Each reconciler now just removes its finalizer; the PodClique reconciler additionally clears its in- memory expectations entry, since that state lives outside the API server and would otherwise leak. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Test_ScaleTest_5000_Deletion as a regression guard for ai-dynamo#423: a single PodCliqueSet built from a standalone PodClique (frontend, 8 pods) plus two PodCliqueScalingGroups (prefill and decode, 624 replicas × 4 pods each), totalling exactly 5000 pods. The test deploys the workload, waits until all pods exist, deletes the PCS, and measures the time to full teardown via the existing TimelineTracker. The scale_test.go file is restructured to share scaffolding (cluster prep, tracker setup, pprof, output dir, result export) between this test and Test_ScaleTest_1000 so future scale tests can be added without duplicating ~50 lines of setup. The Makefile run-scale-test timeout is bumped to 60m to fit both tests in one invocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the delete-phase milestone in both scale tests with a new PCSFullyDeletedCondition that lists PCS, PodCliqueScalingGroups, PodCliques, and Pods by the workload's part-of label and only fires once every count is zero. The previous PCSDeletedCondition only checked that the parent PCS was NotFound — which the new cascade flow satisfies in <1s while child resources continue draining via K8s GC for several minutes at scale. The new condition reports actual end-to-end teardown latency (PCS=0 PCSG=0 PCLQ=0 Pod=N). Measured locally with the same operator build: - ScaleTest_1000: full cascade in 189.7s (~3.2 min) - ScaleTest_5000_Deletion: full cascade in 561.7s (~9.4 min) vs. the >20 min serial-delete behavior reported in ai-dynamo#423. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scale tests had 15m + 30m internal budgets and a 60m Makefile timeout that were chosen pre-cascade-fix. Measured worst-case wall clock with the full-cascade-cleanup condition is ~5 min for ScaleTest_1000 and ~12 min for ScaleTest_5000_Deletion, so ~2x headroom is plenty: - ScaleTest_1000: 15m -> 10m - ScaleTest_5000_Deletion: 30m -> 20m - run-scale-test Makefile: 60m -> 40m Tighter budgets fail fast on real regressions instead of letting a hung run consume an hour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…edCondition Reads more obviously at the call site: it's a delete-cascade check, not a vague "fully deleted" status. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ComputeDomains owned by a PodCliqueSet carry a Grove-managed finalizer (grove.io/computedomain-finalizer) that the K8s garbage collector cannot strip on its own. The previous serial-delete flow happened to clear it because deletePodCliqueSetResources walked every component operator and called Delete; the cascade refactor dropped that loop and left CDs stuck in `deleting` state once their owner PCS was removed. Re-introduces a single targeted step in the PCS deletion flow that invokes the ComputeDomain operator's Delete (which removes the finalizer and issues Delete) before the PCS finalizer is dropped. Every other child still cascades via owner references; this is the narrow exception for the only resource Grove decorates with its own finalizer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #423.
What this changes
Replaces the manual delete-children → poll-until-clean → remove-finalizer flow in the PCS, PCSG, and PodClique reconcilers with immediate finalizer removal. The Kubernetes garbage collector then cascades the entire owner-reference tree (PCS → PCSG → PC → Pod, plus the PCS-owned siblings: Service, ServiceAccount, Role, RoleBinding, HPA, PodGang, ResourceClaim).
The PodClique reconciler keeps a one-step finalizer to clear its in-memory
expectationsStoreentry — that state lives outside the API server and GC can't see it.The pattern: GC cascade + thin finalizer for non-API-server state
This is the dominant pattern across modern cloud-native CRDs that own a tree of K8s resources. Owner refs do the heavy lifting; the finalizer exists only as a hook for state GC can't reach.
Reference implementations using this pattern:
ReplicaSet,Deployment,StatefulSet,DaemonSet,Joball rely on owner-reference cascade for child cleanup. The expectations cache pattern Grove now mirrors lives ink8s.io/kubernetes/pkg/controller/controller_utils.goand is consumed byreplica_set.goandstatefulset/stateful_set.go. They never manuallyDeletetheir pods on parent deletion.Workloadcarries a finalizer whose only job is to release queue/quota accounting before letting GC cascade.RayCluster/RayServiceuse finalizers for service-mesh deregistration and graceful dashboard shutdown, but children clean up via owner refs.PodGroupfinalizer releases gang-scheduling state; pods cascade via GC.InferenceServicefinalizer releases external model-server registration; the K8s subtree is cascaded.What Grove was doing before is the older Gardener-style controller-utils template: a serialized
delete → verify → finalizeloop. Defensive when owner refs are unreliable, but it serializes throughconcurrentSyncs— which is exactly what made #423 a 20-minute deletion at 5k pods.Tradeoffs we're accepting
controllerutil.SetControllerReferenceon owned objects. A unit test asserting owner refs on every newly-built object would be a reasonable follow-up to keep this property from regressing.