Skip to content

Fix gang termination when scheduled replicas regress below MinAvailable#561

Merged
danbar2 merged 5 commits into
ai-dynamo:mainfrom
danbar2:partial-schedule-breach
May 13, 2026
Merged

Fix gang termination when scheduled replicas regress below MinAvailable#561
danbar2 merged 5 commits into
ai-dynamo:mainfrom
danbar2:partial-schedule-breach

Conversation

@danbar2

@danbar2 danbar2 commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #277.

The MinAvailableBreached condition was suppressed whenever scheduledReplicas < MinAvailable. That suppression conflated two situations:

  1. Initial startup (correct to suppress) — no pods have been admitted yet, gang termination would be premature.
  2. Regression after a healthy state (incorrectly suppressed) — e.g. a node hosting running pods is cordoned or fails and the replacement pod cannot be scheduled. scheduledReplicas falls below MinAvailable, but the breach never fires, so gangterminate.go never deletes the PCS replica's PodCliques, the PodGang is never recreated, and the workload is stuck. This is exactly the gap the new E2E gang-termination tests were exposing.

Idea behind the fix

Split the under-scheduled branch on scheduledReplicas == 0 vs > 0:

if scheduledReplicas < minAvailable {
    if scheduledReplicas == 0 {
        // initial startup OR full regression — suppress (see below)
        return ConditionFalse, InsufficientScheduledPods
    }
    // 0 < scheduled < MinAvailable — must be regression, breach
    return ConditionTrue, ScheduledReplicasRegressed
}

Why 0 < scheduled < MinAvailable is unambiguously a regression. Under gang scheduling the gang scheduler admits MinAvailable pods atomically — a partial-scheduled state is structurally unreachable from a fresh start. The only way to land there is to have been healthy and then lost pods. So we can breach immediately on observing the partial state.

Why we keep scheduled == 0 suppressed. With nothing scheduled, gang termination has no useful work: recreating the PodGang produces the same Pending pods against the same cluster state. Suppressing avoids a churn loop. This matches KAI's behaviour: in general, if reaching MinAvailable legitimately takes longer than TerminationDelay, gang-terminating-and-recreating is the right move, and KAI's gang scheduler arrives at the same conclusion.

Pod scheduling continues throughout. The breach condition is consumed only by gangterminate.go, which still waits the full TerminationDelay (default 4h) before deleting. The pod-creation/replacement path in podclique/components/pod/syncflow.go and the PCLQ-creation path in podcliquescalinggroup/components/podclique/sync.go do not gate on the breach — replacement pods keep being created and re-scheduled while the timer runs. If they succeed, scheduledReplicas recovers, the condition flips back to False, and gang termination is never triggered.

Behaviour on non-gang schedulers

On schedulers without gang admission (e.g. vanilla kube-scheduler), pods are admitted one at a time, so a brief 0 < scheduled < MinAvailable flicker can occur during staged startup. The flicker is debounced by TerminationDelay (default 4h) — only pathological multi-hour rampups would see spurious gang termination, and in that case the existing Spec.Template.TerminationDelay knob is the remedy.

Test plan

  • go test ./... passes (operator module)
  • go vet ./... clean, go build ./... clean
  • Reproducer tests in reconcilestatus_test.go for both PodClique and PodCliqueScalingGroup capture:
    • 0 < scheduled < MinAvailable ⟹ breach (the bug fix)
    • scheduled == 0 after regression ⟹ no breach (sanity-pin against over-correction / churn loop)
    • Fresh PCLQ/PCSG never scheduled ⟹ no breach (initial-startup sanity-pin)
  • Updated existing TestComputeMinAvailableBreachedCondition / insufficient_scheduled PCSG test, which had previously asserted the buggy behaviour, to assert the post-fix behaviour and added a paired scheduled == 0 case.
  • E2E gang termination tests (referenced in the issue) should now pass.

Open question for reviewers

The scheduled == 0 suppression is a deliberate judgement call: it avoids a churn loop, but it does mean a fully-down PCS replica won't be gang-terminated (the controller keeps trying to schedule replacements forever). If there is a scenario where the PodGang resource itself gets into a bad shape that recreation would fix, we may want to revisit. Flagging for @renormalize / @unmarshall / @sanjaychatterjee.

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented Apr 29, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@danbar2 danbar2 force-pushed the partial-schedule-breach branch 3 times, most recently from 436cae2 to af82cce Compare April 30, 2026 13:58
shayasoolin
shayasoolin previously approved these changes May 3, 2026

@shayasoolin shayasoolin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Ronkahn21

Ronkahn21 commented May 3, 2026

Copy link
Copy Markdown
Contributor

Hi @danbar2 — I ran this PR through a multi-agent review (code, tests, comments, silent failures + CodeRabbit). Findings below; happy to discuss.

1. "Structurally unreachable" claim and Reason nameconstants.go:109-114, podclique/reconcilestatus.go:212-226, podcliquescalinggroup/reconcilestatus.go:168-176

The new comments call 0 < scheduled < MinAvailable "structurally unreachable from initial startup" then concede non-gang schedulers can flicker into it. Grove's kube backend (scheduler/kube/, default fallback in manager.go:81-85) IS non-gang — so the invariant is false there, and on PCS-replica recreation. The same false invariant drives the user-visible Reason=ScheduledReplicasRegressed and the "Scheduled replicas regressed below MinAvailable" message — both wrong when nothing has actually regressed. Suggested fixes (cheapest while the constant is unmerged):

  1. Drop the "structurally unreachable" framing; rationale is "gang scheduler ⇒ implies regression; non-gang ⇒ TerminationDelay absorbs the flicker."
  2. Rename the constant to a neutral term, e.g. ConditionReasonScheduledReplicasBelowMinAvailable — and tighten the 5-line doc comment to match its 1-line peers.
  3. Match Message: "Scheduled replicas (%d) below MinAvailable (%d)".

2. scheduled==0 regression is invisible — emit an eventpodclique/reconcilestatus.go:228-235, podcliquescalinggroup/reconcilestatus.go:193-199

A previously-healthy PCLQ that loses every scheduled pod gets the same Reason and Message as a fresh one — no event, no log differentiates them. Per the PR's open question the controller intentionally won't gang-terminate this state, so an event is the only "something's wrong" signal. Suggest emitting EventTypeWarning on the PodCliqueScheduled=Truescheduled==0 transition. mutateMinAvailableBreachedCondition doesn't carry an EventRecorder today, but the surrounding controllers do (gangterminate.go uses r.eventRecorder.Eventf).

3. New tests don't assert Reasonpodclique/reconcilestatus_test.go:299-308, PCSG partial schedule regression row

Both new tests check only Type and Status. The existing PCSG test pins wantReason at line 206 — the same pattern would catch a refactor returning Status: True with the wrong reason. The PCSG case is extra fragile because its populated pclqsMap is genuinely unread (function short-circuits at line 192), so a broken short-circuit also passes today.

4. Missing test: PCSG update-in-progress × regression bandpodcliquescalinggroup/reconcilestatus_test.go

No test sets Status.UpdateProgress non-nil with UpdateEndedAt: nil and 0 < scheduled < MinAvailable. The IsPCSGUpdateInProgress short-circuit (line 181) must win over the regression branch (line 192) — branch ordering is unpinned, future reordering produces silent churn during rolling updates.

Ronkahn21
Ronkahn21 previously approved these changes May 5, 2026
@danbar2 danbar2 requested a review from shayasoolin May 5, 2026 12:51
shayasoolin
shayasoolin previously approved these changes May 5, 2026
renormalize
renormalize previously approved these changes May 6, 2026

@renormalize renormalize left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising the fix and closing this long existing gap @danbar2!

Feel free to ignore the nit comments below.

Comment thread operator/internal/controller/podclique/reconcilestatus.go Outdated
Comment thread operator/internal/controller/podcliquescalinggroup/reconcilestatus.go Outdated
@danbar2 danbar2 dismissed stale reviews from renormalize, shayasoolin, and Ronkahn21 via 1207ced May 6, 2026 11:14
renormalize
renormalize previously approved these changes May 6, 2026

@renormalize renormalize left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for fixing the nit!

Ronkahn21
Ronkahn21 previously approved these changes May 7, 2026
shayasoolin
shayasoolin previously approved these changes May 7, 2026
@danbar2 danbar2 force-pushed the partial-schedule-breach branch from 1207ced to e2302bd Compare May 7, 2026 07:27

@gflarity gflarity left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to the approach I was working on previously, so works for me. However I want to make sure CC: @nvrohanv and @sanjaychatterjee weigh because there was a lot of discussions around how to treat pods starting up.

We should also add comprehensive E2E tests for gang termination as a feature. I can't recall the specifics, but IIRC there might have been other issues with the tests. Here's the old E2E test branch I put together. Here's the scenario document it was based on.

Comment thread operator/internal/controller/podclique/reconcilestatus.go Outdated
Comment thread operator/internal/controller/podcliquescalinggroup/reconcilestatus.go Outdated
@danbar2 danbar2 dismissed stale reviews from renormalize, Ronkahn21, and shayasoolin via 0dc26a1 May 10, 2026 07:12
@danbar2 danbar2 force-pushed the partial-schedule-breach branch from e2302bd to 0dc26a1 Compare May 10, 2026 07:12
@danbar2 danbar2 requested a review from gflarity May 11, 2026 07:17
@danbar2 danbar2 force-pushed the partial-schedule-breach branch 2 times, most recently from 1d42ad9 to de98a8a Compare May 11, 2026 10:24
@danbar2 danbar2 force-pushed the partial-schedule-breach branch from de98a8a to ae639cb Compare May 13, 2026 07:31
danbar2 and others added 5 commits May 13, 2026 11:40
The MinAvailableBreached condition was suppressed for any
scheduledReplicas < MinAvailable, which conflated initial startup
(no pods admitted yet) with regression after a healthy state. The
latter case — e.g. a node hosting running pods is cordoned or fails
and the replacement pods cannot be scheduled — left gang termination
unable to fire and the workload stuck. Splitting the check on
scheduledReplicas == 0 vs > 0 distinguishes the two: under gang
scheduling, partial-scheduled is structurally unreachable from
startup so it must be a regression and breach. The full-zero case
stays suppressed to avoid a churn loop where recreating the PodGang
would just produce the same Pending pods. Closes ai-dynamo#277.
Renames the reproducer tests to describe the partial-schedule-regression
behaviour they cover instead of referencing the GitHub issue number, and
applies gofmt struct field alignment.
…r tests

Renames the new condition reason from ScheduledReplicasRegressed to the
neutral ScheduledReplicasBelowMinAvailable and shortens the doc comment
to match its peers; the message follows suit. The "structurally
unreachable" framing was overstated for non-gang schedulers (Grove's
kube backend admits one pod at a time), so the comment now states the
guarantee where it actually holds and explicitly notes that on non-gang
schedulers a brief flicker is absorbed by TerminationDelay.

Adds a Warning event AllScheduledReplicasLost emitted from the PodClique
and PodCliqueScalingGroup status reconcilers when scheduledReplicas
drops from non-zero to zero. Gang termination is suppressed in that
state to avoid a churn loop, so without an event there is no signal a
previously-running workload is fully down. The PCSG reconciler now
holds an EventRecorder field for this.

Tightens unit-test assertions: the new tests now assert the Reason in
addition to Type and Status, the PCSG cases populate pclqsMap with
breached PodCliques so any refactor that drops the short-circuit and
starts consulting it must still satisfy the assertions, and a new case
pins the precedence of IsPCSGUpdateInProgress over the regression
breach branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both branches under scheduledReplicas < MinAvailable produced an
identical Message, leaving operators no signal in the Condition
explaining why one branch is False and the other True. Reword the
False branch to call out the suppression and its reason; the True
branch remains terse since "below MinAvailable" already implies the
breach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extracts the inline event-emit block in both PodClique and
PodCliqueScalingGroup status reconcilers into a small Reconciler
method (maybeEmitAllScheduledReplicasLost) so the only explicit signal
for the fully-lost scheduled-replica case can be unit-tested in
isolation against a record.FakeRecorder. Each test exercises the four
scheduled-replica transitions plus a stable-non-zero case to pin both
that the event fires only on the non-zero → zero transition and that
no other transition produces a spurious event.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@danbar2 danbar2 force-pushed the partial-schedule-breach branch from ae639cb to 2a867bb Compare May 13, 2026 08:40
@danbar2 danbar2 merged commit d5afd16 into ai-dynamo:main May 13, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gang Termination Doesn't Work

5 participants