Enhance PGS and PCLQ reconcilers to support PodGang lifecycle management#95

Merged: renormalize merged 33 commits into ai-dynamo:main from unmarshall:reconcilers on Jul 1, 2025

unmarshall commented Jun 27, 2025

Collaborator

This PR introduces the following changes:

  • Added a PodGang component that manages the create/update/delete of scheduler API PodGang resources.
  • Added a custom error code to signal a requeue-after in reconciliation flows.
  • Introduced PodClique.Status.ScheduleGatedReplicas to capture the number of schedule-gated replicas for a PCLQ.
  • Enhanced the Grove operator ClusterRole to give it permissions for PodGang resources.
  • PodClique, PodCliqueScalingGroup, and Pod resources now get additional labels.
  • Refactored the Pod component and added the capability to add/remove Pod.Spec.SchedulingGates.
  • Implemented the interplay between the PGS and PCLQ reconcilers with respect to PodGang and Pod resources.
  • Fixed scale-in issues in the HPA component.
  • Optimized reconciler predicates to enqueue only the events that are required.

unmarshall and others added 28 commits June 25, 2025 22:03
* Partially implements Pod component. With this commit pods without
  scheduling gates can be created.
* Enables Pod component in the PodClique reconciler.
* Fixed HPA component which now correctly sets the target resource ref.
* Fixed service component, which now creates a headless service.
* Introduced some convenience functions.
* Changed the pgs-replica-index label key as it did not follow the
  allowed conventions.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
* Added missing license headers to new files
* Fixed linting issues
* Fixed formatting issues

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
* Fixed scaling in HPA component.
* Fixed the Role and RoleBinding components, which now only create their resources.
* Adapted the alias for grove core api in component files.
* Initial code for PodGang component.
* Added scheduler API as a dependency in go.mod

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
…PodCliques`.

Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
* Fixed internal types used by PodGang syncFlow.
* PGS reconciler now listens for PCLQ update events.
* PodGang CRDs are now copied when deploying grove operator.
* Removed syncer.go as this is now replaced with syncflow.go.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
* Added code in Pod component to add scheduling gate when creating pods.
* Fixed the operator/hack/prepare-local-deploy.sh to reflect the changes
  in PodGang CRD.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
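
As a rough illustration of adding a scheduling gate at pod creation and removing it later, the sketch below uses simplified local stand-ins for the Kubernetes `PodSpec` and `PodSchedulingGate` types. The gate name `example.io/podgang-pending` and the helpers `addGate` and `removeGate` are assumptions made for this example, not the names used in the PR.

```go
package main

import "fmt"

// PodSchedulingGate and PodSpec are simplified local stand-ins for the
// Kubernetes core/v1 types of the same name.
type PodSchedulingGate struct {
	Name string
}

type PodSpec struct {
	SchedulingGates []PodSchedulingGate
}

// podGangGate is a hypothetical gate name chosen for this example.
const podGangGate = "example.io/podgang-pending"

// addGate appends the gate if it is not present yet and reports whether
// the spec changed. It is meant to be called while building a new pod.
func addGate(spec *PodSpec, name string) bool {
	for _, g := range spec.SchedulingGates {
		if g.Name == name {
			return false
		}
	}
	spec.SchedulingGates = append(spec.SchedulingGates, PodSchedulingGate{Name: name})
	return true
}

// removeGate drops every gate with the given name and reports whether the
// spec changed, so a caller can skip the update when nothing was removed.
func removeGate(spec *PodSpec, name string) bool {
	kept := make([]PodSchedulingGate, 0, len(spec.SchedulingGates))
	for _, g := range spec.SchedulingGates {
		if g.Name != name {
			kept = append(kept, g)
		}
	}
	changed := len(kept) != len(spec.SchedulingGates)
	spec.SchedulingGates = kept
	return changed
}

func main() {
	spec := &PodSpec{}
	fmt.Println(addGate(spec, podGangGate), len(spec.SchedulingGates))
	fmt.Println(removeGate(spec, podGangGate), len(spec.SchedulingGates))
}
```

Returning a "changed" flag makes both helpers idempotent from the caller's point of view, which suits a reconciler that may run many times for the same pod.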
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
* Refactored pclq reconciler.
* PCLQ reconciler now watches PodGang create/delete events.
* Fixed pcsg.Status.Selector.
* Refactored podgang syncflow.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
* WIP commit for Pod component
* Minor rearrangement of hpaInfo in HPA component

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
* Changed the order of components in PGS reconcile spec flow.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
* Moving `syncExistingPodGangs` to run every reconciliation
  enables the `schedulingGate`s to be removed.

Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
…to them.

* Scaling in a `PodClique` caused the `PodClique` controller to delete
  more pods than expected.
  Multiple events are raised during the flow, which causes multiple
  requeues for the same `PodClique`.
  The `List` call made in the controller returns `Pod`s in a
  non-deterministic order, so each requeue, handled by a different
  worker, chose a different `Pod` for deletion. To avoid this, `Pod`s
  are currently selected for deletion based on `creationTimestamp`.

Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
… the `PodGang` name.

Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
* Now filtering terminating pods when fetching existing pods in PodGang
  component.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
* Removed usage of Owns and now use Watches to watch for PodClique
  events.
* The PodClique reconciler registration now listens for PodGang
  Create/Update/Delete events.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
…` changes.

Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
* Fixed formatting issues.
* Refactored the Pod component to fix the PCSG scaling issue. The
  previous implementation had issues.
* In this commit, code to delete excess pods is introduced but commented
  out, as it has not been tested yet.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
* Added PCLQ labels to Pods created for the PCLQ.
* Introduced pod deletion for excess pods.
* Moved the requeue interval to a constant and fixed its usage across
  the PCLQ and PGS reconcilers.
* Fixed the issue where too many pods were created when PCSG is scaled.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
* Fixed PodGang component where it was eagerly creating PodGangs.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
@unmarshall unmarshall requested review from renormalize and removed request for dmitsh and sanjaychatterjee July 1, 2025 09:27
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
@renormalize renormalize merged commit 2d34ac9 into ai-dynamo:main Jul 1, 2025
1 check passed
2 participants