What happened?
The Config2_UnsupportedButEnabled e2e test (Test_AutoMNNVL_UnsupportedButEnabled/operator_exits_when_CD_CRD_is_missing) is flaky — it intermittently fails with a 2-minute timeout waiting for the expected preflight-failure log message, but passes on re-run.
=== RUN   Test_AutoMNNVL_UnsupportedButEnabled/operator_exits_when_CD_CRD_is_missing
    unsupported_but_enabled_test.go:101:
        Error Trace:    unsupported_but_enabled_test.go:101
                        unsupported_but_enabled_test.go:61
        Error:          Received unexpected error:
                        condition not met within timeout of 2m0s
        Test:           Test_AutoMNNVL_UnsupportedButEnabled/operator_exits_when_CD_CRD_is_missing
        Messages:       Operator logs should show preflight failure due to missing CRD
--- FAIL: Test_AutoMNNVL_UnsupportedButEnabled (120.39s)
    --- FAIL: Test_AutoMNNVL_UnsupportedButEnabled/operator_exits_when_CD_CRD_is_missing (120.15s)
What did you expect to happen?
The test should reliably find the crashing operator pod and verify that its logs contain the preflight failure message ("MNNVL preflight check failed" + "ComputeDomain CRD").
RCA
The flakiness is a race condition in pod selection during a rolling deployment.
Before Config2_UnsupportedButEnabled runs, config-cluster.py performs a helm upgrade that changes the operator ConfigMap (enabling autoMNNVLEnabled).
Because the Helm chart uses a content-hashed ConfigMap name (grove-operator-cm-), the deployment spec changes, triggering a rolling update.
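With the default Kubernetes rolling update strategy (maxUnavailable=25%, maxSurge=25%), a single-replica Deployment effectively runs with maxUnavailable=0 and maxSurge=1, because maxUnavailable rounds down and maxSurge rounds up. As a result: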
A new pod is created with the updated config — it crashes immediately on the MNNVL preflight check (CRD is missing).
The old pod (from Config1) stays Running because the new pod never becomes Ready.
The old pod may have RestartCount > 0 from a cert-refresh restart during Config1.
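The effective values quoted above follow from how Kubernetes resolves percentage-based settings: maxSurge is rounded up and maxUnavailable is rounded down. The following is a minimal sketch of that arithmetic, assuming the chart leaves the strategy at the 25%/25% defaults; `effectiveRollingUpdate` is a hypothetical helper for illustration, not a Kubernetes API.

```go
package main

import "fmt"

// effectiveRollingUpdate is a hypothetical helper that resolves percentage-based
// maxSurge and maxUnavailable for a given replica count.
// maxSurge rounds up and maxUnavailable rounds down, as Kubernetes does.
func effectiveRollingUpdate(replicas, surgePct, unavailablePct int) (surge, unavailable int) {
	surge = (replicas*surgePct + 99) / 100      // integer ceiling
	unavailable = replicas * unavailablePct / 100 // integer floor
	return surge, unavailable
}

func main() {
	// Single-replica Deployment with the default 25%/25% strategy.
	surge, unavailable := effectiveRollingUpdate(1, 25, 25)
	fmt.Printf("maxSurge=%d maxUnavailable=%d\n", surge, unavailable)
}
```

With one replica this yields maxSurge=1 and maxUnavailable=0, so the old pod is kept until a new pod becomes Ready, which never happens in this scenario.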
The waitForOperatorPod() function iterates over all operator pods and returns the first one with RestartCount > 0 or a terminated container. Both the old and the new pod can satisfy that condition, and the order of a List response is not guaranteed, so the selection is non-deterministic:
When it picks the new (crashing) pod → test passes (logs contain the preflight failure).
When it picks the old (healthy) pod → test fails (logs don't contain the preflight failure, 2-minute poll times out).
Environment
Test environment: e2e (k3d cluster, 1 server + 2 agents)
Kubernetes version: k3s v1.34.2
Grove version: branch RUN-36489/mnnvl-demo (commit 5c076fc)
Test suite: operator/e2e/tests/auto-mnnvl — Config2_UnsupportedButEnabled
Cluster setup: run_autoMNNVL_e2e_all.py orchestrator with config-cluster.py (fake-gpu-operator not installed, autoMNNVL enabled, --skip-operator-wait)
Reproduction: Intermittent — fails ~50% of the time depending on pod list ordering during rolling update