Fix gang termination when scheduled replicas regress below MinAvailable by danbar2 · Pull Request #561 · ai-dynamo/grove

danbar2 · 2026-04-29T08:35:11Z

Summary

Closes #277.

The MinAvailableBreached condition was suppressed whenever scheduledReplicas < MinAvailable. That suppression conflated two situations:

Initial startup (correct to suppress) — no pods have been admitted yet, gang termination would be premature.
Regression after a healthy state (incorrectly suppressed) — e.g. a node hosting running pods is cordoned or fails and the replacement pod cannot be scheduled. scheduledReplicas falls below MinAvailable, but the breach never fires, so gangterminate.go never deletes the PCS replica's PodCliques, the PodGang is never recreated, and the workload is stuck. This is exactly the gap the new E2E gang-termination tests were exposing.

Idea behind the fix

Split the under-scheduled branch on scheduledReplicas == 0 vs > 0:

if scheduledReplicas < minAvailable {
    if scheduledReplicas == 0 {
        // initial startup OR full regression — suppress (see below)
        return ConditionFalse, InsufficientScheduledPods
    }
    // 0 < scheduled < MinAvailable — must be regression, breach
    return ConditionTrue, ScheduledReplicasRegressed
}

Why 0 < scheduled < MinAvailable is unambiguously a regression. Under gang scheduling the gang scheduler admits MinAvailable pods atomically — a partial-scheduled state is structurally unreachable from a fresh start. The only way to land there is to have been healthy and then lost pods. So we can breach immediately on observing the partial state.

Why we keep scheduled == 0 suppressed. With nothing scheduled, gang termination has no useful work: recreating the PodGang produces the same Pending pods against the same cluster state. Suppressing avoids a churn loop. This matches KAI's behaviour: in general, if reaching MinAvailable legitimately takes longer than TerminationDelay, gang-terminating-and-recreating is the right move, and KAI's gang scheduler arrives at the same conclusion.

Pod scheduling continues throughout. The breach condition is consumed only by gangterminate.go, which still waits the full TerminationDelay (default 4h) before deleting. The pod-creation/replacement path in podclique/components/pod/syncflow.go and the PCLQ-creation path in podcliquescalinggroup/components/podclique/sync.go do not gate on the breach — replacement pods keep being created and re-scheduled while the timer runs. If they succeed, scheduledReplicas recovers, the condition flips back to False, and gang termination is never triggered.

Behaviour on non-gang schedulers

On schedulers without gang admission (e.g. vanilla kube-scheduler), pods are admitted one at a time, so a brief 0 < scheduled < MinAvailable flicker can occur during staged startup. The flicker is debounced by TerminationDelay (default 4h) — only pathological multi-hour rampups would see spurious gang termination, and in that case the existing Spec.Template.TerminationDelay knob is the remedy.

Test plan

go test ./... passes (operator module)
go vet ./... clean, go build ./... clean
Reproducer tests in reconcilestatus_test.go for both PodClique and PodCliqueScalingGroup capture:
- 0 < scheduled < MinAvailable ⟹ breach (the bug fix)
- scheduled == 0 after regression ⟹ no breach (sanity-pin against over-correction / churn loop)
- Fresh PCLQ/PCSG never scheduled ⟹ no breach (initial-startup sanity-pin)
Updated existing TestComputeMinAvailableBreachedCondition / insufficient_scheduled PCSG test, which had previously asserted the buggy behaviour, to assert the post-fix behaviour and added a paired scheduled == 0 case.
E2E gang termination tests (referenced in the issue) should now pass.

Open question for reviewers

The scheduled == 0 suppression is a deliberate judgement call: it avoids a churn loop, but it does mean a fully-down PCS replica won't be gang-terminated (the controller keeps trying to schedule replacements forever). If there is a scenario where the PodGang resource itself gets into a bad shape that recreation would fix, we may want to revisit. Flagging for @renormalize / @unmarshall / @sanjaychatterjee.

🤖 Generated with Claude Code

copy-pr-bot · 2026-04-29T08:35:16Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

shayasoolin

LGTM

Ronkahn21 · 2026-05-03T14:49:31Z

Hi @danbar2 — I ran this PR through a multi-agent review (code, tests, comments, silent failures + CodeRabbit). Findings below; happy to discuss.

1. "Structurally unreachable" claim and Reason name — constants.go:109-114, podclique/reconcilestatus.go:212-226, podcliquescalinggroup/reconcilestatus.go:168-176

The new comments call 0 < scheduled < MinAvailable "structurally unreachable from initial startup" then concede non-gang schedulers can flicker into it. Grove's kube backend (scheduler/kube/, default fallback in manager.go:81-85) IS non-gang — so the invariant is false there, and on PCS-replica recreation. The same false invariant drives the user-visible Reason=ScheduledReplicasRegressed and the "Scheduled replicas regressed below MinAvailable" message — both wrong when nothing has actually regressed. Suggested fixes (cheapest while the constant is unmerged):

Drop the "structurally unreachable" framing; rationale is "gang scheduler ⇒ implies regression; non-gang ⇒ TerminationDelay absorbs the flicker."
Rename the constant to a neutral term, e.g. ConditionReasonScheduledReplicasBelowMinAvailable — and tighten the 5-line doc comment to match its 1-line peers.
Match Message: "Scheduled replicas (%d) below MinAvailable (%d)".

2. scheduled==0 regression is invisible — emit an event — podclique/reconcilestatus.go:228-235, podcliquescalinggroup/reconcilestatus.go:193-199

A previously-healthy PCLQ that loses every scheduled pod gets the same Reason and Message as a fresh one — no event, no log differentiates them. Per the PR's open question the controller intentionally won't gang-terminate this state, so an event is the only "something's wrong" signal. Suggest emitting EventTypeWarning on the PodCliqueScheduled=True → scheduled==0 transition. mutateMinAvailableBreachedCondition doesn't carry an EventRecorder today, but the surrounding controllers do (gangterminate.go uses r.eventRecorder.Eventf).

3. New tests don't assert Reason — podclique/reconcilestatus_test.go:299-308, PCSG partial schedule regression row

Both new tests check only Type and Status. The existing PCSG test pins wantReason at line 206 — the same pattern would catch a refactor returning Status: True with the wrong reason. The PCSG case is extra fragile because its populated pclqsMap is genuinely unread (function short-circuits at line 192), so a broken short-circuit also passes today.

4. Missing test: PCSG update-in-progress × regression band — podcliquescalinggroup/reconcilestatus_test.go

No test sets Status.UpdateProgress non-nil with UpdateEndedAt: nil and 0 < scheduled < MinAvailable. The IsPCSGUpdateInProgress short-circuit (line 181) must win over the regression branch (line 192) — branch ordering is unpinned, future reordering produces silent churn during rolling updates.

renormalize

Thanks for raising the fix and closing this long existing gap @danbar2!

Feel free to ignore the nit comments below.

renormalize

thank you for fixing the nit!

gflarity

This is similar to the approach I was working on previously, so works for me. However I want to make sure CC: @nvrohanv and @sanjaychatterjee weigh because there was a lot of discussions around how to treat pods starting up.

We should also add comprehensive E2E tests for gang termination as a feature. I can't recall the specifics, but IIRC there might have been other issues with the tests. Here's the old E2E test branch I put together. Here's the scenario document it was based on.

The MinAvailableBreached condition was suppressed for any scheduledReplicas < MinAvailable, which conflated initial startup (no pods admitted yet) with regression after a healthy state. The latter case — e.g. a node hosting running pods is cordoned or fails and the replacement pods cannot be scheduled — left gang termination unable to fire and the workload stuck. Splitting the check on scheduledReplicas == 0 vs > 0 distinguishes the two: under gang scheduling, partial-scheduled is structurally unreachable from startup so it must be a regression and breach. The full-zero case stays suppressed to avoid a churn loop where recreating the PodGang would just produce the same Pending pods. Closes ai-dynamo#277.

Renames the reproducer tests to describe the partial-schedule-regression behaviour they cover instead of referencing the GitHub issue number, and applies gofmt struct field alignment.

…r tests Renames the new condition reason from ScheduledReplicasRegressed to the neutral ScheduledReplicasBelowMinAvailable and shortens the doc comment to match its peers; the message follows suit. The "structurally unreachable" framing was overstated for non-gang schedulers (Grove's kube backend admits one pod at a time), so the comment now states the guarantee where it actually holds and explicitly notes that on non-gang schedulers a brief flicker is absorbed by TerminationDelay. Adds a Warning event AllScheduledReplicasLost emitted from the PodClique and PodCliqueScalingGroup status reconcilers when scheduledReplicas drops from non-zero to zero. Gang termination is suppressed in that state to avoid a churn loop, so without an event there is no signal a previously-running workload is fully down. The PCSG reconciler now holds an EventRecorder field for this. Tightens unit-test assertions: the new tests now assert the Reason in addition to Type and Status, the PCSG cases populate pclqsMap with breached PodCliques so any refactor that drops the short-circuit and starts consulting it must still satisfy the assertions, and a new case pins the precedence of IsPCSGUpdateInProgress over the regression breach branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both branches under scheduledReplicas < MinAvailable produced an identical Message, leaving operators no signal in the Condition explaining why one branch is False and the other True. Reword the False branch to call out the suppression and its reason; the True branch remains terse since "below MinAvailable" already implies the breach. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extracts the inline event-emit block in both PodClique and PodCliqueScalingGroup status reconcilers into a small Reconciler method (maybeEmitAllScheduledReplicasLost) so the only explicit signal for the fully-lost scheduled-replica case can be unit-tested in isolation against a record.FakeRecorder. Each test exercises the four scheduled-replica transitions plus a stable-non-zero case to pin both that the event fires only on the non-zero → zero transition and that no other transition produces a spurious event. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

danbar2 requested review from Ronkahn21, gflarity, sanjaychatterjee, shayasoolin and unmarshall as code owners April 29, 2026 08:35

danbar2 force-pushed the partial-schedule-breach branch 3 times, most recently from 436cae2 to af82cce Compare April 30, 2026 13:58

shayasoolin previously approved these changes May 3, 2026

View reviewed changes

danbar2 dismissed shayasoolin’s stale review via 6569c6d May 4, 2026 08:16

danbar2 force-pushed the partial-schedule-breach branch from af82cce to 6569c6d Compare May 4, 2026 08:16

Ronkahn21 previously approved these changes May 5, 2026

View reviewed changes

danbar2 requested a review from shayasoolin May 5, 2026 12:51

shayasoolin previously approved these changes May 5, 2026

View reviewed changes

renormalize previously approved these changes May 6, 2026

View reviewed changes

Comment thread operator/internal/controller/podclique/reconcilestatus.go Outdated

Comment thread operator/internal/controller/podcliquescalinggroup/reconcilestatus.go Outdated

danbar2 dismissed stale reviews from renormalize, shayasoolin, and Ronkahn21 via 1207ced May 6, 2026 11:14

renormalize previously approved these changes May 6, 2026

View reviewed changes

Ronkahn21 previously approved these changes May 7, 2026

View reviewed changes

shayasoolin previously approved these changes May 7, 2026

View reviewed changes

danbar2 force-pushed the partial-schedule-breach branch from 1207ced to e2302bd Compare May 7, 2026 07:27

gflarity reviewed May 9, 2026

View reviewed changes

Comment thread operator/internal/controller/podclique/reconcilestatus.go Outdated

Comment thread operator/internal/controller/podcliquescalinggroup/reconcilestatus.go Outdated

danbar2 dismissed stale reviews from renormalize, Ronkahn21, and shayasoolin via 0dc26a1 May 10, 2026 07:12

danbar2 force-pushed the partial-schedule-breach branch from e2302bd to 0dc26a1 Compare May 10, 2026 07:12

danbar2 requested a review from gflarity May 11, 2026 07:17

danbar2 requested review from Ronkahn21, renormalize and shayasoolin May 11, 2026 07:17

danbar2 force-pushed the partial-schedule-breach branch 2 times, most recently from 1d42ad9 to de98a8a Compare May 11, 2026 10:24

gflarity approved these changes May 12, 2026

View reviewed changes

shayasoolin approved these changes May 13, 2026

View reviewed changes

Ronkahn21 approved these changes May 13, 2026

View reviewed changes

danbar2 force-pushed the partial-schedule-breach branch from de98a8a to ae639cb Compare May 13, 2026 07:31

danbar2 and others added 5 commits May 13, 2026 11:40

Rename Issue277 reproducer tests and gofmt struct fields

4459032

Renames the reproducer tests to describe the partial-schedule-regression behaviour they cover instead of referencing the GitHub issue number, and applies gofmt struct field alignment.

danbar2 force-pushed the partial-schedule-breach branch from ae639cb to 2a867bb Compare May 13, 2026 08:40

danbar2 merged commit d5afd16 into ai-dynamo:main May 13, 2026
15 checks passed

danbar2 mentioned this pull request May 14, 2026

Gang-termination breach formula rework + E2E scenarios (GT-1 through GT-6) #614

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix gang termination when scheduled replicas regress below MinAvailable#561

Fix gang termination when scheduled replicas regress below MinAvailable#561
danbar2 merged 5 commits into
ai-dynamo:mainfrom
danbar2:partial-schedule-breach

danbar2 commented Apr 29, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Apr 29, 2026

Uh oh!

shayasoolin left a comment

Uh oh!

Ronkahn21 commented May 3, 2026 •

edited

Loading

Uh oh!

renormalize left a comment

Uh oh!

Uh oh!

Uh oh!

renormalize left a comment

Uh oh!

gflarity left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

danbar2 commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Idea behind the fix

Behaviour on non-gang schedulers

Test plan

Open question for reviewers

Uh oh!

copy-pr-bot Bot commented Apr 29, 2026

Uh oh!

shayasoolin left a comment

Choose a reason for hiding this comment

Uh oh!

Ronkahn21 commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

renormalize left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

renormalize left a comment

Choose a reason for hiding this comment

Uh oh!

gflarity left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

danbar2 commented Apr 29, 2026 •

edited

Loading

Ronkahn21 commented May 3, 2026 •

edited

Loading