Skip to content

[GREP-291] OnDelete implementation for standalone PodCliques and PodCliqueScalingGroups#438

Merged
renormalize merged 20 commits into
ai-dynamo:mainfrom
renormalize:ondelete
Mar 12, 2026
Merged

[GREP-291] OnDelete implementation for standalone PodCliques and PodCliqueScalingGroups#438
renormalize merged 20 commits into
ai-dynamo:mainfrom
renormalize:ondelete

Conversation

@renormalize

Copy link
Copy Markdown
Contributor

What type of PR is this?

What this PR does / why we need it:

/kind api feature

This PR introduces the implementation for the OnDelete update strategy as proposed in the GREP #403.

Which issue(s) this PR fixes:

Fixes #291

Special notes for your reviewer:

I have not yet finished unit tests and E2E tests. I'm planning to bring along the unit tests in this PR itself, and the E2E tests in a subsequent PR.

Does this PR introduce a API change?

New field `.spec.updateStrategy` of the `PodCliqueSet` supports specifying different update strategies for the PodCliqueSet; with `RollingRecreate` and `OnDelete` as the two options.

Additional documentation e.g., enhancement proposals, usage docs, etc.:

N/A

@renormalize renormalize self-assigned this Feb 17, 2026
@renormalize renormalize added enhancement New feature or request API Updates to API labels Feb 17, 2026
@copy-pr-bot

copy-pr-bot Bot commented Feb 17, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@renormalize

Copy link
Copy Markdown
Contributor Author

All E2E tests are failing currently due to:

✅ Cluster 'shared-e2e-test-cluster' is ready for E2E testing!
> Pushing alpine image to local registry...
Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit
Error response from daemon: No such image: alpine:latest

Tests passed pre-rebase. Will wait and re-trigger.

Comment thread operator/api/core/v1alpha1/podcliqueset.go Outdated
Comment thread operator/api/core/v1alpha1/podcliqueset.go Outdated
@Ronkahn21

Copy link
Copy Markdown
Contributor

Code Review — Combined CodeRabbit + Manual Analysis

Critical

Nil pointer dereference on *pcs.Status.CurrentGenerationHash in OnDelete paths

The OnDelete code paths bypass the nil guard that protects the RollingRecreate path, causing a panic if CurrentGenerationHash is nil (e.g., first reconcile before generation hash is computed):

  1. operator/internal/controller/podcliquescalinggroup/reconcilespec.go:118shouldResetOrTriggerUpdate dereferences *pcs.Status.CurrentGenerationHash without nil check
  2. operator/internal/controller/podclique/reconcilespec.go:137 — Same issue in PCLQ's shouldResetOrTriggerUpdate
  3. operator/internal/controller/podclique/components/pod/syncflow.go:238 — Same pattern in selectExcessPodsToDelete

Suggested fix: Add a nil guard on pcs.Status.CurrentGenerationHash before dereferencing in all three locations.

Suggestions

  1. Typo "diabled" → "disabled" in 3 locations:

    • operator/internal/controller/podclique/reconcilespec.go:~172
    • operator/internal/controller/podcliquescalinggroup/reconcilespec.go:~131
    • operator/internal/controller/podcliqueset/reconcilespec.go:~138
  2. Grammar: "a update" → "an update"operator/internal/controller/podclique/reconcilespec.go:~133

  3. Comment references wrong fieldoperator/internal/controller/podcliqueset/components/podcliquesetreplica/rollingupdate.go:~149 says "UpdatedPodCliques" but should say "UpdatedPodCliqueScalingGroups"

  4. shouldEnqueueOnPCSUpdate returns true on every PCS status update for OnDelete (register.go:157-160) — causes excessive reconciliation when OnDelete is configured. The TODO in code acknowledges this.

@renormalize

Copy link
Copy Markdown
Contributor Author

@Ronkahn21 I've fixed all suggestions posted in your comment. However, I don't know how to go about fixing the following suggestion:

shouldEnqueueOnPCSUpdate returns true on every PCS status update for OnDelete (register.go:157-160) — causes excessive reconciliation when OnDelete is configured. The TODO in code acknowledges this.

The event object only contains the PCS, not the PCSG. We don't really have access to that resource in the watcher on PCS.

@Ronkahn21

Copy link
Copy Markdown
Contributor

@renormalize I think for now we could ignore the comment, I will try to finish the review the fix today

@unmarshall unmarshall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1/n reviews

Comment thread operator/api/core/v1alpha1/podclique.go Outdated
Comment thread operator/api/core/v1alpha1/podclique.go Outdated
Comment thread operator/api/core/v1alpha1/podclique.go Outdated
Comment thread operator/api/core/v1alpha1/podclique.go
Comment thread operator/api/core/v1alpha1/podcliqueset.go Outdated
Comment thread operator/api/core/v1alpha1/scalinggroup.go
Comment thread operator/api/core/v1alpha1/scalinggroup.go
Comment thread operator/internal/controller/common/component/utils/podclique.go Outdated
Comment thread operator/internal/controller/podclique/components/pod/syncflow.go Outdated
Comment thread operator/internal/controller/podclique/components/pod/syncflow.go
Comment thread operator/internal/controller/podcliquescalinggroup/reconcilespec.go Outdated
Comment thread operator/internal/controller/podclique/components/pod/syncflow.go
Comment thread operator/internal/controller/podcliquescalinggroup/register.go
Comment thread operator/internal/controller/podcliquescalinggroup/register.go Outdated
Comment thread operator/internal/controller/podcliqueset/reconciler.go Outdated
Comment thread operator/internal/utils/miscellaneous.go
@renormalize renormalize requested a review from unmarshall March 9, 2026 20:30
@renormalize

Copy link
Copy Markdown
Contributor Author

@unmarshall I've not yet rebased on the latest changes in master since that will make it harder to review the changes I made to address the PR feedback. Will rebase at the end of the PR.

renormalize and others added 20 commits March 12, 2026 09:09
Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
… events

* PodCliqueSet update events when the update strategy is `OnDelete`
  will enqueue corresponding events for PodCliqueScalingGroup resources.

Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
* Support `OnDelete` in update logic

* Fix unsatisfied expectation on PodCliqueSet delete

Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
* `RollingUpdateProgress` structs are reintroduced fully and not made aliases
  of `UpdateProgress` since we intend to discourage usage of `RollingUpdateProgress`,
  and any new field introduced to `UpdateProgress` should not be introduced
  to `RollingUpdateProgress` as well, as there might be cases where
  consumers continue to use the old struct with newly introduced fields.

Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
…or backwards compatibility

Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
…ues in-place for `OnDelete`

* PodCliques that belong to PodCliqueScalingGroups can be updated through
  the OnDelete strategy by deleting member pods of the PodCliques.
  This is achieved by updating the specifications of the PodCliques
  in-place, similar to how it is done already for standalone PodCliques.

Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
* `initOrResetUpdate()` is introduced for PodCliqueScalingGroup making
  it consistent with the other controllers.

* Correct comments in PodClique reconcilespec.

Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
…h the PodClique's

Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
…tions, add unit tests

Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
…iqueSet

Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
@renormalize renormalize merged commit d1e159d into ai-dynamo:main Mar 12, 2026
11 checks passed
enoodle pushed a commit to enoodle/grove that referenced this pull request Mar 24, 2026
…`PodCliqueScalingGroup`s (ai-dynamo#438)

* Introduce `PodCliqueSetUpdateStrategy`

* Introduce `UpdateProgress` which eventually will replace `RollingUpdateProgress`

* Enhance update logic to support `OnDelete`

* Modify `PodCliqueScalingGroup` controller to update all replica `PodClique`s in-place for `OnDelete`

* `PodClique`s that belong to `PodCliqueScalingGroup`s can be updated through
  the `OnDelete` strategy by deleting member pods of the `PodClique`s.
  This is achieved by updating the specifications of the `PodClique`s
  in-place, similar to how it is done already for standalone `PodClique`s

* Introduce unit tests for changes made for OnDelete

* Modify the deletion sorter to prefer pods with specs that do not match the `PodClique`'s

* Make `CurrentlyUpdating` a slice of `PodCliqueSetReplicaUpdateProgress`

* Fix possible nil panics, and typos in comments

---------

Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Co-authored-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API Updates to API enhancement New feature or request run-e2e

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhance the rolling update approach for PodClique

4 participants