Modify PodCliqueScalingGroup behavior to create new PodCliques for each replica#103

Merged
unmarshall merged 9 commits into ai-dynamo:main from unmarshall:refactor-apis on Jul 9, 2025
Conversation

@unmarshall

Collaborator

This PR introduces the following changes:

  • Segregate management of PodCliques between the PodGangSet and PodCliqueScalingGroup reconcilers.
    • The PodGangSet reconciler manages only the PodCliques that it owns, i.e. those not associated with any PodCliqueScalingGroup.
    • The PodCliqueScalingGroup reconciler manages only the PodCliques that it owns.
  • Each PodClique can now belong to only a single PodGang. This results in the following changes:
    • Each PodClique now gets a PodGang name label at creation time.
    • The PodGang component is no longer responsible for patching Pods with the PodGang label.
    • When creating Pods for a PodClique, the Pod component inherits the PodGang label from the parent PodClique.
  • Introduced RoleName in PodCliqueSpec and updated the generated code and api-docs to reflect that.
  • Added a finalizer for PCSG, which is removed only after the PCSG has removed all of its associated PodCliques.
  • The PCSG reconciler now listens for PGS update events.
  • Changed the naming convention for PodCliques created by the PodCliqueScalingGroup reconciler.
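
To illustrate the label inheritance described above, here is a minimal sketch, not the code from this PR. It uses simplified stand-in types instead of the real API types, and the label key `example.io/podgang` and the helper `newPodForClique` are hypothetical names chosen for the example.

```go
package main

import "fmt"

// podGangLabelKey is a hypothetical label key used only in this sketch.
const podGangLabelKey = "example.io/podgang"

// PodClique and Pod are simplified stand-ins for the real API types.
type PodClique struct {
	Name   string
	Labels map[string]string
}

type Pod struct {
	Name   string
	Labels map[string]string
}

// newPodForClique builds a Pod and copies the PodGang label from the parent
// PodClique at creation time, so the Pod never needs to be patched later.
func newPodForClique(pclq PodClique, podName string) (Pod, error) {
	gang, ok := pclq.Labels[podGangLabelKey]
	if !ok || gang == "" {
		return Pod{}, fmt.Errorf("PodClique %q has no %s label", pclq.Name, podGangLabelKey)
	}
	return Pod{
		Name:   podName,
		Labels: map[string]string{podGangLabelKey: gang},
	}, nil
}

func main() {
	pclq := PodClique{
		Name:   "frontend",
		Labels: map[string]string{podGangLabelKey: "gang-0"},
	}
	pod, err := newPodForClique(pclq, "frontend-abc12")
	if err != nil {
		panic(err)
	}
	fmt.Println(pod.Name, pod.Labels[podGangLabelKey])
}
```

Because a PodClique belongs to exactly one PodGang, the label on the PodClique is enough to determine the label on every Pod it creates.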

unmarshall and others added 9 commits July 9, 2025 16:48
reconcilers. This commit includes:
* Introduces a finalizer for PCSG resources. This will be used for
  ensuring that all PCLQs that it manages are cleaned up before this
  resource gets deleted.
* Renamed the component directories to use consistent naming.
* Modified Operator.Delete to take metav1.ObjectMeta instead of the
  object.
* PCSG reconciler now listens for PGS update events.
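
The finalizer behaviour listed above could look roughly like the following sketch. It is an illustration under assumptions, not the reconciler from this PR: `ScalingGroup`, `reconcileDelete`, the `deleteClique` callback and the finalizer name `example.io/pcsg-cleanup` are all hypothetical, and the real code works against the Kubernetes API rather than in-memory values.

```go
package main

import "fmt"

// pcsgFinalizer is a hypothetical finalizer name used only in this sketch.
const pcsgFinalizer = "example.io/pcsg-cleanup"

// ScalingGroup is a simplified stand-in for a PodCliqueScalingGroup.
type ScalingGroup struct {
	Name       string
	Deleting   bool // stands in for a non-zero deletion timestamp
	Finalizers []string
}

// removeString returns items without any occurrence of s.
func removeString(items []string, s string) []string {
	out := make([]string, 0, len(items))
	for _, it := range items {
		if it != s {
			out = append(out, it)
		}
	}
	return out
}

// reconcileDelete issues deletes for the remaining PodCliques and removes the
// finalizer only once none are left. It reports whether the finalizer was removed.
func reconcileDelete(pcsg *ScalingGroup, remaining []string, deleteClique func(string) error) (bool, error) {
	if !pcsg.Deleting {
		return false, nil
	}
	if len(remaining) > 0 {
		for _, name := range remaining {
			if err := deleteClique(name); err != nil {
				return false, fmt.Errorf("deleting PodClique %q: %w", name, err)
			}
		}
		// Keep the finalizer; the next reconcile checks again.
		return false, nil
	}
	pcsg.Finalizers = removeString(pcsg.Finalizers, pcsgFinalizer)
	return true, nil
}

func main() {
	pcsg := &ScalingGroup{Name: "workers", Deleting: true, Finalizers: []string{pcsgFinalizer}}
	noop := func(string) error { return nil }

	removed, _ := reconcileDelete(pcsg, []string{"workers-0-decode"}, noop)
	fmt.Println(removed, len(pcsg.Finalizers))

	removed, _ = reconcileDelete(pcsg, nil, noop)
	fmt.Println(removed, len(pcsg.Finalizers))
}
```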

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
* Fixed PCLQ component in PCSG. It is still a WIP and needs additional
  corrections.
* The PGS name is now read from the label rather than from the OwnerRef,
  because a PCLQ can have either a PGS or a PCSG as its owner.
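
A small sketch of why the label lookup is the more uniform choice. The types, the label key `example.io/podgangset` and the helper `pgsNameFor` are hypothetical and exist only for this illustration; the actual label key used by the project is not shown in this PR description.

```go
package main

import "fmt"

// pgsNameLabelKey is a hypothetical label key used only in this sketch.
const pgsNameLabelKey = "example.io/podgangset"

// OwnerRef and PodClique are simplified stand-ins for the real API types.
type OwnerRef struct {
	Kind string
	Name string
}

type PodClique struct {
	Name   string
	Labels map[string]string
	Owners []OwnerRef
}

// pgsNameFor reads the PodGangSet name from a label. It deliberately ignores
// the owner references, because the owner may be a PGS or a PCSG.
func pgsNameFor(pclq PodClique) (string, error) {
	name, ok := pclq.Labels[pgsNameLabelKey]
	if !ok || name == "" {
		return "", fmt.Errorf("PodClique %q has no %s label", pclq.Name, pgsNameLabelKey)
	}
	return name, nil
}

func main() {
	labels := map[string]string{pgsNameLabelKey: "simple1"}
	cliques := []PodClique{
		{Name: "standalone", Labels: labels, Owners: []OwnerRef{{Kind: "PodGangSet", Name: "simple1"}}},
		{Name: "scaled", Labels: labels, Owners: []OwnerRef{{Kind: "PodCliqueScalingGroup", Name: "simple1-sg"}}},
	}
	for _, c := range cliques {
		name, err := pgsNameFor(c)
		if err != nil {
			panic(err)
		}
		fmt.Println(c.Name, c.Owners[0].Kind, name)
	}
}
```

Both PodCliques resolve to the same PGS name even though their owners are of different kinds, which an OwnerRef-based lookup would not give directly for the PCSG-owned one.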

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
PCSG
* Introduced RoleName in PodCliqueTemplateSpec

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
…gGroup.

Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
reconciler.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
* Updated simple1.yaml to now have role names for each PodClique.

Signed-off-by: madhav bhargava <madhav.bhargava@sap.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
@unmarshall unmarshall merged commit a09e370 into ai-dynamo:main Jul 9, 2025
1 check passed
renormalize added a commit that referenced this pull request Jul 15, 2025
* New `PodClique`s are created for each replica of the `PodCliqueScalingGroup` as a consequence of #103.
  This is corrected in the example now.
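
As a rough illustration of "new PodCliques for each replica", the sketch below enumerates the PodClique names a scaling group with a given replica count would expect to exist. The name format `<pcsg-name>-<replica-index>-<clique-name>` and the function `expectedCliqueNames` are assumptions made for this example; the naming convention actually introduced by #103 is not spelled out in this conversation.

```go
package main

import "fmt"

// expectedCliqueNames lists one PodClique name per (replica, clique) pair.
// The "<pcsg>-<replica>-<clique>" format is hypothetical.
func expectedCliqueNames(pcsgName string, replicas int, cliqueNames []string) []string {
	if replicas < 0 {
		replicas = 0
	}
	names := make([]string, 0, replicas*len(cliqueNames))
	for r := 0; r < replicas; r++ {
		for _, c := range cliqueNames {
			names = append(names, fmt.Sprintf("%s-%d-%s", pcsgName, r, c))
		}
	}
	return names
}

func main() {
	for _, n := range expectedCliqueNames("workers", 2, []string{"prefill", "decode"}) {
		fmt.Println(n)
	}
}
```

Scaling the group from 2 to 3 replicas would then add one new PodClique per clique in the group, rather than growing the existing ones.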

Signed-off-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
2 participants