Gang Termination#114

Merged
unmarshall merged 22 commits into
ai-dynamo:mainfrom
unmarshall:gangterminate
Jul 22, 2025
@unmarshall unmarshall commented Jul 21, 2025

Collaborator

This PR introduces the multi-level Gang termination functionality. Gang termination can happen at the PodCliqueScalingGroup level or at the PodGangSet level.

Two configuration fields control when gang termination is triggered:

  • PodGangSet.spec.template.cliques[<index>].spec.minAvailable - the minimum number of ready pods that must exist at any given time. If the number of ready pods falls below this value, the PodClique becomes a candidate for termination.
    • For PodCliqueScalingGroup, minAvailable is currently hard coded to 1 and is not configurable. We will consider making it configurable via the API in the future.
  • PodGangSet.Spec.Template.TerminationDelay - a higher-level grouping resource (PodCliqueScalingGroup or PodGangSet) observes the MinAvailableBreached conditions on its constituents and waits for the TerminationDelay duration. If MinAvailable is still breached once that duration has elapsed, the responsible reconciler triggers termination of the gang, which is represented by a replica of either PodCliqueScalingGroup or PodGangSet.
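
The interplay between minAvailable and TerminationDelay can be sketched as two small checks. This is a minimal illustration under stated assumptions, not the code in this PR: the function names `isMinAvailableBreached` and `terminationDue` are hypothetical, a breach is assumed to mean "ready pods < minAvailable", and the durations in `main` are arbitrary example values.

```go
package main

import (
	"fmt"
	"time"
)

// isMinAvailableBreached reports whether the number of ready pods has dropped
// below minAvailable. (Hypothetical helper; the comparison is an assumption
// based on the description of minAvailable as a minimum number of ready pods.)
func isMinAvailableBreached(readyPods, minAvailable int32) bool {
	return readyPods < minAvailable
}

// terminationDue reports whether a breach that has lasted breachedFor has
// outlived terminationDelay. When it has not, the second return value is the
// remaining wait, which a reconciler could use as a requeue interval.
func terminationDue(breachedFor, terminationDelay time.Duration) (bool, time.Duration) {
	if breachedFor >= terminationDelay {
		return true, 0
	}
	return false, terminationDelay - breachedFor
}

func main() {
	// Illustrative values only; they are not defaults taken from the PR.
	terminationDelay := 4 * time.Hour
	breachedFor := 90 * time.Minute

	if isMinAvailableBreached(2, 3) {
		due, wait := terminationDue(breachedFor, terminationDelay)
		fmt.Printf("terminate=%v requeueAfter=%v\n", due, wait)
	}
}
```

Keeping the breach check separate from the delay check mirrors the split described above: the constituent reports the breach, and the higher-level resource decides when to act on it.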

This PR introduces a new condition (MinAvailableBreached) that will be set on both PodClique and PodCliqueScalingGroup.

  • PodCliqueScalingGroup, for every replica (except the 1st replica), monitors the MinAvailableBreached condition of its constituent PodCliques. If any of them reports true, then after TerminationDelay it deletes + recreates all the PodCliques for that PodCliqueScalingGroup replica. This in turn recreates all the pods across all constituent PodCliques.
  • PodGangSet, for every replica, monitors the following:
    • The MinAvailableBreached condition at the PodCliqueScalingGroup level. If any PodCliqueScalingGroup reports a breach of its MinAvailable (hard coded to 1), then after waiting for the TerminationDelay duration it deletes + recreates all PodCliques for the PodGangSet replica.
    • MinAvailableBreached for PodCliques that do not belong to any PodCliqueScalingGroup. If any one of these PodCliques reports that its MinAvailable has been breached, then after waiting for the TerminationDelay duration it likewise deletes + recreates all PodCliques for the PodGangSet replica.
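
The PodGangSet-level monitoring described above amounts to scanning the constituents of each replica and picking the replicas whose breach has outlasted TerminationDelay. The following is a rough sketch only: `breachState` and `replicasToTerminate` are hypothetical names, and the struct is a simplified stand-in for the real condition objects rather than the types used in this PR.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// breachState is a simplified, hypothetical view of the MinAvailableBreached
// condition observed on one constituent of a PodGangSet replica, i.e. either a
// PodCliqueScalingGroup or a PodClique that belongs to no scaling group.
type breachState struct {
	replicaIndex int           // PodGangSet replica the constituent belongs to
	breached     bool          // true when the condition status is True
	breachedFor  time.Duration // time elapsed since the condition turned True
}

// replicasToTerminate returns the sorted PodGangSet replica indexes for which
// at least one constituent has been breached for terminationDelay or longer.
func replicasToTerminate(states []breachState, terminationDelay time.Duration) []int {
	due := map[int]struct{}{}
	for _, s := range states {
		if s.breached && s.breachedFor >= terminationDelay {
			due[s.replicaIndex] = struct{}{}
		}
	}
	indexes := make([]int, 0, len(due))
	for idx := range due {
		indexes = append(indexes, idx)
	}
	sort.Ints(indexes)
	return indexes
}

func main() {
	// Illustrative values only.
	terminationDelay := 30 * time.Second
	states := []breachState{
		{replicaIndex: 0, breached: true, breachedFor: 45 * time.Second},
		{replicaIndex: 1, breached: true, breachedFor: 10 * time.Second},
		{replicaIndex: 2, breached: false},
	}
	fmt.Println(replicasToTerminate(states, terminationDelay))
}
```

In this sketch a single breached constituent is enough to select the whole replica, which is what makes the termination a gang termination.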

NOTE: While the recreation is going on, the respective PodGang resources are also updated with new Pod names across PodGroups.

Current limitation:
controller-runtime uses cached clients. The issue with these clients is that a burst of events enqueued for a reconciler can be processed against a stale cache, which results in an incorrect computation of the number of resources to create or delete. We see this in the PodClique reconciler. As a consequence, for a very short duration after gang termination some additional Pods are created, which are then removed within a few seconds. K8s controllers such as the replica-set controller also suffer from this problem and solve it via controller expectations. However, that approach has issues of its own. We plan to improve on that design and introduce our own version of controller expectations in a later PR.
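
For readers unfamiliar with the idea, controller expectations can be illustrated with a very small tracker: the reconciler records how many creations it has issued and refuses to compute a new diff until the cache has shown it that many objects. This is a hypothetical sketch for illustration; it is neither the Kubernetes implementation nor the design planned for the later PR, and it omits expiry and deletion tracking. All names (`expectations`, `ExpectCreations`, `CreationObserved`, `Satisfied`) are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// expectations is a hypothetical, minimal tracker of creations that were
// issued by a reconciler but not yet observed through the informer cache.
type expectations struct {
	mu      sync.Mutex
	pending map[string]int // object key -> creations still unobserved
}

func newExpectations() *expectations {
	return &expectations{pending: map[string]int{}}
}

// ExpectCreations records that n create calls were issued for key.
func (e *expectations) ExpectCreations(key string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.pending[key] += n
}

// CreationObserved is meant to be called from a watch event handler when a
// created object becomes visible. It never drops the counter below zero.
func (e *expectations) CreationObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.pending[key] > 0 {
		e.pending[key]--
	}
}

// Satisfied reports whether every creation issued for key has been observed,
// i.e. whether it is safe to recompute how many resources to create or delete.
func (e *expectations) Satisfied(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.pending[key] == 0
}

func main() {
	exp := newExpectations()
	key := "default/example-podclique" // hypothetical namespace/name key

	exp.ExpectCreations(key, 2)
	fmt.Println("after issuing 2 creates, satisfied:", exp.Satisfied(key))

	exp.CreationObserved(key)
	exp.CreationObserved(key)
	fmt.Println("after observing 2 creates, satisfied:", exp.Satisfied(key))
}
```

A reconciler using such a tracker would skip its create/delete computation while `Satisfied` returns false, which is the mechanism that avoids the short-lived extra Pods described above.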

unmarshall and others added 19 commits July 15, 2025 14:28
* Introduced `PodCliqueConditionType` and
  ConditionTypeMinAvailableBreached condition type.
* Refactored PodClique reconcile status and added code to add PodClique
  condition.
* PCSG reconciler now watches for PCLQ delete and update events.
* Corrected the validation for TerminationDelay.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
* Removed PodCliqueConditionType and instead just defined constants in
  constants.go
* Removed unused constants from pod component.
* Removed unnecessary PGS lookup in pod component.
* Added code in PCLQ component in PCSG to handle gang termination.
* Removed utils/pcsg.go as the function defined was never used.
* Refactored PCLQ status reconciliation, conditionally updating
  MinAvailableBreached condition.
* Introduced a helper function `HasConditionChanged`

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
…odGangSetName`.

Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
This commit starts to change the deletion of PCLQs by making atomic
delete calls for all PCLQs belonging to a PodGang.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
…orks.

Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
* Refactored PodCliqueScalingGroup reconcileStatus adding functions to
  mutate selector and minAvailableBreached condition.
* Added a bunch of helper functions.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
* Added WIP code to handle PGS replica pod gang termination.
* Added utility functions.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
…a` in

  podgangset/podclique component.
* Extracted PCSG replica deletion due to MinAvailable breached into
  `triggerDeletionOfMinAvailableBreachedPodGangs` function.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
  from this list can be deleted.
* Changed the order of method invocations in Sync. All deletes are
  called first and then the createOrUpdate is called.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
  minAvailable breached.
* Every PCLQ that is part of a PCSG now also has LabelPodCliqueScalingGroupReplicaIndex label.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
* It might be the case that `PodClique`s might be in the `Unknown` status
  for the `MinAvailableBreached` condition, in which case the
  `PodCliqueScalingGroup` must also inherit this status for this condition.
  It was also observed that `PodClique`s might have empty conditions,
  which indicate that they have not been reconciled by the `PodClique`
  controller. In these cases, the `Unknown` status is set.

* `triggerDeletionOfMinAvailableBreachedPCSGReplicas` returns a `bool`
  to indicate if a re-queue must occur to handle `PodClique`s that
  have breached `minAvailable`, but have not crossed termination delay.
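
The `Unknown` inheritance described in this commit message can be sketched as a small aggregation over the PodClique statuses. This is an illustrative assumption, not the function from the commit: the name `aggregateMinAvailableBreached` and the local `conditionStatus` type are hypothetical, and the choice to let `Unknown` take precedence over `True` when both are present is an assumption the commit message does not confirm.

```go
package main

import "fmt"

// conditionStatus is a local stand-in for a Kubernetes condition status.
type conditionStatus string

const (
	statusTrue    conditionStatus = "True"
	statusFalse   conditionStatus = "False"
	statusUnknown conditionStatus = "Unknown"
)

// aggregateMinAvailableBreached derives a PodCliqueScalingGroup-level status
// from the MinAvailableBreached statuses of its PodCliques. An empty string
// models a PodClique that has no conditions yet, i.e. one that has not been
// reconciled by the PodClique controller.
// Assumption: Unknown (or a missing condition) wins over True.
func aggregateMinAvailableBreached(podCliqueStatuses []conditionStatus) conditionStatus {
	breached := false
	for _, s := range podCliqueStatuses {
		switch s {
		case statusUnknown, "":
			return statusUnknown
		case statusTrue:
			breached = true
		}
	}
	if breached {
		return statusTrue
	}
	return statusFalse
}

func main() {
	fmt.Println(aggregateMinAvailableBreached([]conditionStatus{statusFalse, statusTrue}))
	fmt.Println(aggregateMinAvailableBreached([]conditionStatus{statusTrue, ""}))
	fmt.Println(aggregateMinAvailableBreached([]conditionStatus{statusFalse, statusFalse}))
}
```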

Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
  comparing it with MinAvailableBreached.
* For PCLQs that are marked for termination, status update is now
  skipped.
* Refactored PCSG computeMinAvailableBreachedCondition method.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
  true for all PCSG replicas. This responsibility is now delegated to
  PGS reconciler.
* For PCLQs that are marked for termination their MinAvailableBreached
  condition is set to Unknown.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
  create.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
* Added a log

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
@unmarshall unmarshall requested review from renormalize and removed request for dmitsh July 21, 2025 05:57
unmarshall and others added 3 commits July 21, 2025 11:57
Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
…ermination.

Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
@unmarshall unmarshall merged commit cd06869 into ai-dynamo:main Jul 22, 2025
4 checks passed

2 participants