Skip to content

Addition of a deletionTimestamp on a Pod should trigger a PodClique reconciliation#433

Merged
renormalize merged 2 commits into
ai-dynamo:mainfrom
xulinfei1996:xlf/fix-pclq
Mar 5, 2026
Merged

Addition of a deletionTimestamp on a Pod should trigger a PodClique reconciliation#433
renormalize merged 2 commits into
ai-dynamo:mainfrom
xulinfei1996:xlf/fix-pclq

Conversation

@xulinfei1996

@xulinfei1996 xulinfei1996 commented Feb 11, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

In our scenario, when a Node becomes NotReady, the following sequence of events occurs:

  1. The Pods on that Node first transition to the NotReady state;
  2. After a period of time, the node controller deletes these Pods, at which point the Pods are marked with a deletionTimestamp.
    In the PCLQ controller, step 1 triggers a reconcile, but since the Pods are not yet marked with a deletionTimestamp, the controller does not recognize them as being in a terminating state. By step 2, although the Pods now have a deletionTimestamp, the PCLQ controller does not trigger another reconcile.

This results in: Pods that are in a terminating state, but the PCLQ controller does not create new Pods to replace them, thereby affecting service availability and the expected replica count.
This PR aims to fix this issue by ensuring that when Pods enter a terminating state, the PCLQ controller can promptly detect this and create new Pods to maintain the desired replica count.

Which issue(s) this PR fixes:

Fixes #434

Special notes for your reviewer:

Does this PR introduce a API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

@copy-pr-bot

copy-pr-bot Bot commented Feb 11, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread operator/internal/controller/podclique/register.go Outdated
@kangclzjc

Copy link
Copy Markdown
Contributor

Thanks @xulinfei1996 , very great finding.

@sanjaychatterjee sanjaychatterjee left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the fix! Looks good to me.
@unmarshall can you please review as well?

@sanjaychatterjee

Copy link
Copy Markdown
Collaborator

Also, can we include a unit test as well?

@xulinfei1996 xulinfei1996 force-pushed the xlf/fix-pclq branch 2 times, most recently from 26dbce6 to c860e56 Compare February 12, 2026 02:36
Comment thread operator/internal/controller/podclique/register_test.go Outdated
@renormalize

Copy link
Copy Markdown
Contributor

Reviewing this, please do not merge this yet.

@renormalize renormalize left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Just a few nit comments.

The pods would eventually reach .spec.replicas in number after the DELETE event is sent, but for pods that are stuck in Terminating for long, the replacement pods might take a long time to be created if no event is generated for that specific PodClique.

Also, please fix the formatting through GOWORK=off make check, as the check CI job is failing.

Comment thread operator/internal/controller/podclique/register.go Outdated
Comment thread operator/internal/controller/podclique/register.go
@renormalize

Copy link
Copy Markdown
Contributor

@xulinfei1996 I'm not sure if you have bandwidth to work on this anymore. Can the maintainers take over so we can push these changes in?

cc @kangclzjc

xulinfei and others added 2 commits March 5, 2026 16:09
Signed-off-by: xulinfei <xulinfei@bytedance.com>
Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>

@renormalize renormalize left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for spotting the bug and raising the fix!

@renormalize renormalize changed the title pod deletionTimestamp change should trigger pclq reconcile Addition of a deletionTimestamp on a Pod should trigger a PodClique reconciliation Mar 5, 2026
@renormalize renormalize requested a review from Ronkahn21 March 5, 2026 11:07
@renormalize renormalize merged commit daca4c2 into ai-dynamo:main Mar 5, 2026
21 of 22 checks passed
Ronkahn21 pushed a commit to Ronkahn21/grove that referenced this pull request Mar 10, 2026
…que` reconciliation (ai-dynamo#433)

---------

Signed-off-by: xulinfei <xulinfei@bytedance.com>
Co-authored-by: Saketh Kalaga <saketh.kalaga@sap.com>
enoodle pushed a commit to enoodle/grove that referenced this pull request Mar 24, 2026
…que` reconciliation (ai-dynamo#433)

---------

Signed-off-by: xulinfei <xulinfei@bytedance.com>
Co-authored-by: Saketh Kalaga <saketh.kalaga@sap.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PCLQ controller fails to replace terminating Pods on NotReady nodes due to missed reconcile

5 participants