Skip to content

fix: cascade-delete PCS/PCSG/PCLQ via Kubernetes GC#556

Merged
danbar2 merged 7 commits into
ai-dynamo:mainfrom
danbar2:cascade-delete
May 6, 2026
Merged

fix: cascade-delete PCS/PCSG/PCLQ via Kubernetes GC#556
danbar2 merged 7 commits into
ai-dynamo:mainfrom
danbar2:cascade-delete

Conversation

@danbar2

@danbar2 danbar2 commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Closes #423.

What this changes

Replaces the manual delete-children → poll-until-clean → remove-finalizer flow in the PCS, PCSG, and PodClique reconcilers with immediate finalizer removal. The Kubernetes garbage collector then cascades the entire owner-reference tree (PCS → PCSG → PC → Pod, plus the PCS-owned siblings: Service, ServiceAccount, Role, RoleBinding, HPA, PodGang, ResourceClaim).

The PodClique reconciler keeps a one-step finalizer to clear its in-memory expectationsStore entry — that state lives outside the API server and GC can't see it.

The pattern: GC cascade + thin finalizer for non-API-server state

This is the dominant pattern across modern cloud-native CRDs that own a tree of K8s resources. Owner refs do the heavy lifting; the finalizer exists only as a hook for state GC can't reach.

Reference implementations using this pattern:

  • Kubernetes core controllersReplicaSet, Deployment, StatefulSet, DaemonSet, Job all rely on owner-reference cascade for child cleanup. The expectations cache pattern Grove now mirrors lives in k8s.io/kubernetes/pkg/controller/controller_utils.go and is consumed by replica_set.go and statefulset/stateful_set.go. They never manually Delete their pods on parent deletion.
  • JobSet (kubernetes-sigs) — JobSet → Job → Pod via owner refs; the controller does no manual child deletion.
  • LeaderWorkerSet (LWS) (kubernetes-sigs) — LWS → StatefulSet → Pod, same pattern.
  • KueueWorkload carries a finalizer whose only job is to release queue/quota accounting before letting GC cascade.
  • KubeRayRayCluster/RayService use finalizers for service-mesh deregistration and graceful dashboard shutdown, but children clean up via owner refs.
  • VolcanoPodGroup finalizer releases gang-scheduling state; pods cascade via GC.
  • KServeInferenceService finalizer releases external model-server registration; the K8s subtree is cascaded.

What Grove was doing before is the older Gardener-style controller-utils template: a serialized delete → verify → finalize loop. Defensive when owner refs are unreliable, but it serializes through concurrentSyncs — which is exactly what made #423 a 20-minute deletion at 5k pods.

Tradeoffs we're accepting

  • Parent disappears from the API as soon as its finalizer is removed, even while children are still terminating. Anyone scripting around "PCS object gone = workload gone" should switch to watching pods.
  • No per-step deletion progress in status. Live cluster state is the source of truth.
  • Correctness now depends on every component having a controller owner reference. Audited the existing components (HPA, Service, ServiceAccount, Role, RoleBinding, SAtokenSecret, PodGang, ResourceClaim, PCSG, PC, Pod) — all set controllerutil.SetControllerReference on owned objects. A unit test asserting owner refs on every newly-built object would be a reasonable follow-up to keep this property from regressing.

@copy-pr-bot

copy-pr-bot Bot commented Apr 28, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@danbar2 danbar2 marked this pull request as ready for review April 28, 2026 07:55
Comment thread operator/e2e/measurement/condition/pcs.go
@danbar2 danbar2 changed the title fix: cascade-delete PCS/PCSG/PC via Kubernetes GC fix: cascade-delete PCS/PCSG/PCLQ via Kubernetes GC Apr 28, 2026
@danbar2 danbar2 force-pushed the cascade-delete branch 2 times, most recently from ab9838a to 63f8b39 Compare April 30, 2026 13:58

@renormalize renormalize left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the original intention behind blocking the delete call until all child resources are deleted is to signal to the user that the PCS is still undergoing deletion due to some reason - primary being that Pods could be stuck in Terminating, which is commonly observed in pods running disaggregated inference.

Deletion of the PCS before the pods have relinquished GPU claims might confuse the user behind why a new PCS is not being scheduled. However, I do not think this reason is strong enough to cause severe performance degradation with a large PCS.

Thanks for the fix!

danbar2 and others added 7 commits May 6, 2026 14:21
The reconcilers used to manually walk every child resource, poll until
each was gone, then drop the finalizer. With concurrentSyncs: 3 and
~5k PodCliques, that serialized into ~20 minutes per delete (ai-dynamo#423).

Owner references already form the full PCS -> PCSG -> PC -> Pod chain
plus the PCS-owned siblings (Service, ServiceAccount, Role/Binding,
HPA, PodGang, ResourceClaim), so the kube-controller-manager's GC can
cascade the whole subtree on its own. Each reconciler now just removes
its finalizer; the PodClique reconciler additionally clears its in-
memory expectations entry, since that state lives outside the API
server and would otherwise leak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Test_ScaleTest_5000_Deletion as a regression guard for ai-dynamo#423: a
single PodCliqueSet built from a standalone PodClique (frontend, 8
pods) plus two PodCliqueScalingGroups (prefill and decode, 624 replicas
× 4 pods each), totalling exactly 5000 pods. The test deploys the
workload, waits until all pods exist, deletes the PCS, and measures
the time to full teardown via the existing TimelineTracker.

The scale_test.go file is restructured to share scaffolding (cluster
prep, tracker setup, pprof, output dir, result export) between this
test and Test_ScaleTest_1000 so future scale tests can be added
without duplicating ~50 lines of setup. The Makefile run-scale-test
timeout is bumped to 60m to fit both tests in one invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the delete-phase milestone in both scale tests with a new
PCSFullyDeletedCondition that lists PCS, PodCliqueScalingGroups,
PodCliques, and Pods by the workload's part-of label and only fires
once every count is zero. The previous PCSDeletedCondition only
checked that the parent PCS was NotFound — which the new cascade flow
satisfies in <1s while child resources continue draining via K8s GC
for several minutes at scale. The new condition reports actual
end-to-end teardown latency (PCS=0 PCSG=0 PCLQ=0 Pod=N).

Measured locally with the same operator build:
- ScaleTest_1000:           full cascade in 189.7s (~3.2 min)
- ScaleTest_5000_Deletion:  full cascade in 561.7s (~9.4 min)

vs. the >20 min serial-delete behavior reported in ai-dynamo#423.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scale tests had 15m + 30m internal budgets and a 60m Makefile timeout
that were chosen pre-cascade-fix. Measured worst-case wall clock with
the full-cascade-cleanup condition is ~5 min for ScaleTest_1000 and
~12 min for ScaleTest_5000_Deletion, so ~2x headroom is plenty:

- ScaleTest_1000:           15m -> 10m
- ScaleTest_5000_Deletion:  30m -> 20m
- run-scale-test Makefile:  60m -> 40m

Tighter budgets fail fast on real regressions instead of letting a
hung run consume an hour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…edCondition

Reads more obviously at the call site: it's a delete-cascade check, not
a vague "fully deleted" status.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ComputeDomains owned by a PodCliqueSet carry a Grove-managed finalizer
(grove.io/computedomain-finalizer) that the K8s garbage collector
cannot strip on its own. The previous serial-delete flow happened to
clear it because deletePodCliqueSetResources walked every component
operator and called Delete; the cascade refactor dropped that loop and
left CDs stuck in `deleting` state once their owner PCS was removed.

Re-introduces a single targeted step in the PCS deletion flow that
invokes the ComputeDomain operator's Delete (which removes the
finalizer and issues Delete) before the PCS finalizer is dropped.
Every other child still cascades via owner references; this is the
narrow exception for the only resource Grove decorates with its own
finalizer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@danbar2 danbar2 merged commit b594559 into ai-dynamo:main May 6, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Slow Deletion at scale

4 participants