[e2e] [mnnvl] resolve flaky e2e test in Config2_UnsupportedButEnabled by shmuel-runai · Pull Request #451 · ai-dynamo/grove

shmuel-runai · 2026-02-25T11:08:55Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

During a rolling deployment (maxUnavailable=0, maxSurge=1), both the old healthy operator pod and the new crashing pod coexist. The old pod may have RestartCount > 0 from cert-refresh restarts, causing waitForOperatorPod to non-deterministically select it instead of the actually-crashing pod.

Rename waitForOperatorPod to waitForFailedOperatorPod and filter out Ready pods so only the crashing pod (which never becomes Ready) is selected.
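The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Pod` struct is a hypothetical stand-in for the Kubernetes pod object, `selectFailedOperatorPod` is an invented helper name, and the assumption that the original selection keyed on `RestartCount > 0` is inferred from the description.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes pod.
// Ready mirrors the PodReady condition being True; RestartCount mirrors
// the container restart count reported in the pod status.
type Pod struct {
	Name         string
	Ready        bool
	RestartCount int32
}

// selectFailedOperatorPod returns the first pod that is not Ready and has
// restarted at least once. Skipping Ready pods is the point of the fix:
// a healthy old pod with RestartCount > 0 (for example from cert-refresh
// restarts) can no longer be picked by mistake.
// Assumption: the pre-fix selection looked only at RestartCount > 0.
func selectFailedOperatorPod(pods []Pod) (Pod, bool) {
	for _, p := range pods {
		if p.Ready {
			continue // healthy pod from the previous rollout, ignore it
		}
		if p.RestartCount > 0 {
			return p, true
		}
	}
	return Pod{}, false
}

func main() {
	// During a rolling deployment with maxUnavailable=0 and maxSurge=1,
	// the old and new pods exist side by side.
	pods := []Pod{
		{Name: "operator-old", Ready: true, RestartCount: 2},
		{Name: "operator-new", Ready: false, RestartCount: 3},
	}
	if p, ok := selectFailedOperatorPod(pods); ok {
		fmt.Println("selected:", p.Name)
	}
}
```

With the Ready filter in place the result no longer depends on the order in which the pods are listed, which is what made the original test flaky.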

Which issue(s) this PR fixes:

Fixes #449

copy-pr-bot · 2026-02-25T11:08:59Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…ai-dynamo#451)

* [e2e] [mnnvl] resolve flaky e2e test in Config2_UnsupportedButEnabled

During a rolling deployment (maxUnavailable=0, maxSurge=1), both the old healthy operator pod and the new crashing pod coexist. The old pod may have RestartCount > 0 from cert-refresh restarts, causing waitForOperatorPod to non-deterministically select it instead of the actually-crashing pod. Rename waitForOperatorPod to waitForFailedOperatorPod and filter out Ready pods so only the crashing pod (which never becomes Ready) is selected.

* add computedomain to groveManagedResourceTypes

Signed-off-by: Erez Freiberger <enoodle@gmail.com>

shmuel-runai self-assigned this Feb 25, 2026

shmuel-runai requested review from Ronkahn21, gflarity, sanjaychatterjee, shayasoolin and unmarshall as code owners February 25, 2026 11:08

shayasoolin previously approved these changes Feb 25, 2026

add computedomain to groveManagedResourceTypes

d88d9ae

shmuel-runai dismissed shayasoolin’s stale review via d88d9ae February 26, 2026 08:17

shayasoolin approved these changes Mar 2, 2026

sanjaychatterjee approved these changes Mar 3, 2026

sanjaychatterjee merged commit 67aaf20 into ai-dynamo:main Mar 3, 2026
29 of 31 checks passed

sanjaychatterjee merged 2 commits into ai-dynamo:main from shmuel-runai:grove-449/main-mnnvl-flaky-e2e

3 participants
