[e2e] [mnnvl] resolve flaky e2e test in Config2_UnsupportedButEnabled #451

Merged: sanjaychatterjee merged 2 commits into ai-dynamo:main from shmuel-runai:grove-449/main-mnnvl-flaky-e2e on Mar 3, 2026

@shmuel-runai

Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

During a rolling deployment (maxUnavailable=0, maxSurge=1), the old healthy operator pod and the new crashing pod coexist. The old pod may have RestartCount > 0 from cert-refresh restarts, so waitForOperatorPod can non-deterministically select it instead of the pod that is actually crashing.

Rename waitForOperatorPod to waitForFailedOperatorPod and filter out Ready pods so only the crashing pod (which never becomes Ready) is selected.
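
The selection rule described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code from this PR: the `pod` struct is a hypothetical stand-in for the Kubernetes pod type, `selectFailedOperatorPod` is an invented name, and the `RestartCount > 0` check is assumed from the description of the old behavior.

```go
package main

import "fmt"

// pod is a hypothetical, simplified stand-in for a Kubernetes pod.
// The real e2e helper presumably works with client-go types instead.
type pod struct {
	Name         string
	Ready        bool  // stand-in for the PodReady condition being True
	RestartCount int32 // stand-in for a container status restart count
}

// selectFailedOperatorPod returns the first pod that is not Ready and has
// restarted at least once. Skipping Ready pods excludes the old healthy pod
// even when cert-refresh restarts gave it RestartCount > 0.
func selectFailedOperatorPod(pods []pod) (pod, bool) {
	for _, p := range pods {
		if p.Ready {
			continue // old healthy pod: never the one we are waiting for
		}
		if p.RestartCount > 0 {
			return p, true
		}
	}
	return pod{}, false
}

func main() {
	// Both pods coexist during the rolling deployment.
	pods := []pod{
		{Name: "operator-old", Ready: true, RestartCount: 2},
		{Name: "operator-new", Ready: false, RestartCount: 3},
	}
	if p, ok := selectFailedOperatorPod(pods); ok {
		fmt.Println(p.Name) // prints "operator-new"
	}
}
```

Without the `Ready` check, either pod in this list would satisfy `RestartCount > 0`, so the result would depend on list order, which matches the flakiness described above.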

Which issue(s) this PR fixes:

Fixes #449

@copy-pr-bot

copy-pr-bot Bot commented Feb 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

shayasoolin previously approved these changes Feb 25, 2026
sanjaychatterjee merged commit 67aaf20 into ai-dynamo:main on Mar 3, 2026
29 of 31 checks passed
Ronkahn21 pushed a commit to Ronkahn21/grove that referenced this pull request Mar 10, 2026
…ai-dynamo#451)

* [e2e] [mnnvl] resolve flaky e2e test in Config2_UnsupportedButEnabled

During a rolling deployment (maxUnavailable=0, maxSurge=1), both the old
healthy operator pod and the new crashing pod coexist. The old pod may
have RestartCount > 0 from cert-refresh restarts, causing
waitForOperatorPod to non-deterministically select it instead of the
actually-crashing pod.

Rename waitForOperatorPod to waitForFailedOperatorPod and filter out
Ready pods so only the crashing pod (which never becomes Ready) is
selected.

* add computedomain to groveManagedResourceTypes
enoodle pushed a commit to enoodle/grove that referenced this pull request Mar 24, 2026
…ai-dynamo#451)

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Development

Successfully merging this pull request may close these issues.

Flaky e2e: Test_AutoMNNVL_UnsupportedButEnabled/operator_exits_when_CD_CRD_is_missing
