Fix gang termination when scheduled replicas regress below MinAvailable#561
Conversation
436cae2 to
af82cce
Compare
|
Hi @danbar2 — I ran this PR through a multi-agent review (code, tests, comments, silent failures + CodeRabbit). Findings below; happy to discuss. 1. "Structurally unreachable" claim and Reason name — The new comments call
2. A previously-healthy PCLQ that loses every scheduled pod gets the same 3. New tests don't assert Both new tests check only 4. Missing test: PCSG update-in-progress × regression band — No test sets |
af82cce to
6569c6d
Compare
renormalize
left a comment
There was a problem hiding this comment.
Thanks for raising the fix and closing this long existing gap @danbar2!
Feel free to ignore the nit comments below.
1207ced
renormalize
left a comment
There was a problem hiding this comment.
thank you for fixing the nit!
1207ced to
e2302bd
Compare
gflarity
left a comment
There was a problem hiding this comment.
This is similar to the approach I was working on previously, so works for me. However I want to make sure CC: @nvrohanv and @sanjaychatterjee weigh because there was a lot of discussions around how to treat pods starting up.
We should also add comprehensive E2E tests for gang termination as a feature. I can't recall the specifics, but IIRC there might have been other issues with the tests. Here's the old E2E test branch I put together. Here's the scenario document it was based on.
0dc26a1
e2302bd to
0dc26a1
Compare
1d42ad9 to
de98a8a
Compare
de98a8a to
ae639cb
Compare
The MinAvailableBreached condition was suppressed for any scheduledReplicas < MinAvailable, which conflated initial startup (no pods admitted yet) with regression after a healthy state. The latter case — e.g. a node hosting running pods is cordoned or fails and the replacement pods cannot be scheduled — left gang termination unable to fire and the workload stuck. Splitting the check on scheduledReplicas == 0 vs > 0 distinguishes the two: under gang scheduling, partial-scheduled is structurally unreachable from startup so it must be a regression and breach. The full-zero case stays suppressed to avoid a churn loop where recreating the PodGang would just produce the same Pending pods. Closes ai-dynamo#277.
Renames the reproducer tests to describe the partial-schedule-regression behaviour they cover instead of referencing the GitHub issue number, and applies gofmt struct field alignment.
…r tests Renames the new condition reason from ScheduledReplicasRegressed to the neutral ScheduledReplicasBelowMinAvailable and shortens the doc comment to match its peers; the message follows suit. The "structurally unreachable" framing was overstated for non-gang schedulers (Grove's kube backend admits one pod at a time), so the comment now states the guarantee where it actually holds and explicitly notes that on non-gang schedulers a brief flicker is absorbed by TerminationDelay. Adds a Warning event AllScheduledReplicasLost emitted from the PodClique and PodCliqueScalingGroup status reconcilers when scheduledReplicas drops from non-zero to zero. Gang termination is suppressed in that state to avoid a churn loop, so without an event there is no signal a previously-running workload is fully down. The PCSG reconciler now holds an EventRecorder field for this. Tightens unit-test assertions: the new tests now assert the Reason in addition to Type and Status, the PCSG cases populate pclqsMap with breached PodCliques so any refactor that drops the short-circuit and starts consulting it must still satisfy the assertions, and a new case pins the precedence of IsPCSGUpdateInProgress over the regression breach branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both branches under scheduledReplicas < MinAvailable produced an identical Message, leaving operators no signal in the Condition explaining why one branch is False and the other True. Reword the False branch to call out the suppression and its reason; the True branch remains terse since "below MinAvailable" already implies the breach. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extracts the inline event-emit block in both PodClique and PodCliqueScalingGroup status reconcilers into a small Reconciler method (maybeEmitAllScheduledReplicasLost) so the only explicit signal for the fully-lost scheduled-replica case can be unit-tested in isolation against a record.FakeRecorder. Each test exercises the four scheduled-replica transitions plus a stable-non-zero case to pin both that the event fires only on the non-zero → zero transition and that no other transition produces a spurious event. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ae639cb to
2a867bb
Compare
Summary
Closes #277.
The
MinAvailableBreachedcondition was suppressed wheneverscheduledReplicas < MinAvailable. That suppression conflated two situations:scheduledReplicasfalls belowMinAvailable, but the breach never fires, sogangterminate.gonever deletes the PCS replica's PodCliques, the PodGang is never recreated, and the workload is stuck. This is exactly the gap the new E2E gang-termination tests were exposing.Idea behind the fix
Split the under-scheduled branch on
scheduledReplicas == 0vs> 0:Why
0 < scheduled < MinAvailableis unambiguously a regression. Under gang scheduling the gang scheduler admitsMinAvailablepods atomically — a partial-scheduled state is structurally unreachable from a fresh start. The only way to land there is to have been healthy and then lost pods. So we can breach immediately on observing the partial state.Why we keep
scheduled == 0suppressed. With nothing scheduled, gang termination has no useful work: recreating the PodGang produces the same Pending pods against the same cluster state. Suppressing avoids a churn loop. This matches KAI's behaviour: in general, if reachingMinAvailablelegitimately takes longer thanTerminationDelay, gang-terminating-and-recreating is the right move, and KAI's gang scheduler arrives at the same conclusion.Pod scheduling continues throughout. The breach condition is consumed only by
gangterminate.go, which still waits the fullTerminationDelay(default 4h) before deleting. The pod-creation/replacement path inpodclique/components/pod/syncflow.goand the PCLQ-creation path inpodcliquescalinggroup/components/podclique/sync.godo not gate on the breach — replacement pods keep being created and re-scheduled while the timer runs. If they succeed,scheduledReplicasrecovers, the condition flips back to False, and gang termination is never triggered.Behaviour on non-gang schedulers
On schedulers without gang admission (e.g. vanilla kube-scheduler), pods are admitted one at a time, so a brief
0 < scheduled < MinAvailableflicker can occur during staged startup. The flicker is debounced byTerminationDelay(default 4h) — only pathological multi-hour rampups would see spurious gang termination, and in that case the existingSpec.Template.TerminationDelayknob is the remedy.Test plan
go test ./...passes (operator module)go vet ./...clean,go build ./...cleanreconcilestatus_test.gofor both PodClique and PodCliqueScalingGroup capture:0 < scheduled < MinAvailable⟹ breach (the bug fix)scheduled == 0after regression ⟹ no breach (sanity-pin against over-correction / churn loop)TestComputeMinAvailableBreachedCondition / insufficient_scheduledPCSG test, which had previously asserted the buggy behaviour, to assert the post-fix behaviour and added a pairedscheduled == 0case.Open question for reviewers
The
scheduled == 0suppression is a deliberate judgement call: it avoids a churn loop, but it does mean a fully-down PCS replica won't be gang-terminated (the controller keeps trying to schedule replacements forever). If there is a scenario where the PodGang resource itself gets into a bad shape that recreation would fix, we may want to revisit. Flagging for @renormalize / @unmarshall / @sanjaychatterjee.🤖 Generated with Claude Code