Skip to content

fix: handle startup-probe-phase pods in rolling update categorization#435

Merged
gflarity merged 4 commits into
ai-dynamo:mainfrom
gflarity:rolling_update_stuck_400
Feb 23, 2026
Merged

fix: handle startup-probe-phase pods in rolling update categorization#435
gflarity merged 4 commits into
ai-dynamo:mainfrom
gflarity:rolling_update_stuck_400

Conversation

@gflarity

Copy link
Copy Markdown
Contributor

/kind bug

What this PR does / why we need it:

Fixes a bug where rolling updates get permanently stuck or prematurely complete when an old-hash pod is in the startup probe phase. computeUpdateWork had no category for pods with Started=false (startup probe not yet passed), causing them to silently fall through all classification branches.

This PR:

  • Adds a HasAnyContainerNotStarted utility to detect pods still in the startup probe phase
  • Adds explicit oldTemplateHashStartingPods and oldTemplateHashUncategorizedPods buckets to updateWork
  • Deletes all non-ready old-hash pods (pending, unhealthy, starting, uncategorized) immediately rather than only pending/unhealthy
  • Adds unit tests for computeUpdateWork categorization and HasAnyContainerNotStarted

Which issue(s) this PR fixes:

Fixes #400

Special notes for your reviewer:

The root cause: when a second spec change arrives while a replacement pod from a previous change is still in its startup probe phase, the replacement pod (now old-hash) has Phase=Running, Started=false, Ready=false. It fails all existing predicates (not Pending, not Started-but-not-Ready, not Ready) and is dropped from the update work entirely. This causes either an infinite requeue loop or premature updateEndedAt marking.

Does this PR introduce a API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

@copy-pr-bot

copy-pr-bot Bot commented Feb 12, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

julienmancuso
julienmancuso previously approved these changes Feb 12, 2026
Comment thread operator/internal/controller/podclique/components/pod/rollingupdate.go Outdated
Comment thread operator/internal/controller/podclique/components/pod/rollingupdate.go Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes rolling update categorization so pods in the startup-probe phase (Running + Started=false/nil + Ready=false) are no longer silently dropped from computeUpdateWork, preventing updates from getting stuck or ending prematurely (issue #400).

Changes:

  • Add HasAnyContainerNotStarted to detect pods whose containers have not passed startup probe.
  • Expand computeUpdateWork with oldTemplateHashStartingPods and oldTemplateHashUncategorizedPods buckets, and ensure old-hash pods always land in a bucket.
  • Delete all non-ready old-hash pods (pending/unhealthy/starting/uncategorized) immediately; add unit tests for the new utility and categorization.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
operator/internal/utils/kubernetes/pod.go Adds HasAnyContainerNotStarted pod utility.
operator/internal/utils/kubernetes/pod_test.go Adds unit tests for HasAnyContainerNotStarted.
operator/internal/controller/podclique/components/pod/rollingupdate.go Extends rolling update bucketing and expands non-ready old-hash deletion behavior.
operator/internal/controller/podclique/components/pod/rollingupdate_test.go Adds unit tests covering computeUpdateWork bucketing, including startup-probe-phase pods.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@gflarity gflarity force-pushed the rolling_update_stuck_400 branch from 2f5130d to 00f79c4 Compare February 17, 2026 15:35
@gflarity

Copy link
Copy Markdown
Contributor Author

/ok to test 00f79c4

Ronkahn21
Ronkahn21 previously approved these changes Feb 17, 2026
Comment thread operator/internal/controller/podclique/components/pod/rollingupdate.go Outdated
Comment thread operator/internal/controller/podclique/components/pod/rollingupdate.go Outdated
Comment thread operator/internal/controller/podclique/components/pod/rollingupdate_test.go Outdated
Comment thread operator/internal/controller/podclique/components/pod/rollingupdate_test.go Outdated
julienmancuso
julienmancuso previously approved these changes Feb 19, 2026
@gflarity gflarity dismissed stale reviews from julienmancuso and Ronkahn21 via fbccbdf February 19, 2026 21:40
@gflarity gflarity force-pushed the rolling_update_stuck_400 branch from fbccbdf to d97f86b Compare February 19, 2026 21:41
@gflarity

Copy link
Copy Markdown
Contributor Author

/ok to test d97f86b

@shmuel-runai shmuel-runai left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gflarity gflarity force-pushed the rolling_update_stuck_400 branch from d97f86b to 30c7ebe Compare February 23, 2026 15:12
@gflarity gflarity merged commit 956c528 into ai-dynamo:main Feb 23, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rolling Update Gets Stuck When New Update Initiated During In-Progress Update

5 participants