Flaky e2e: Test_AutoMNNVL_UnsupportedButEnabled/operator_exits_when_CD_CRD_is_missing #449

### What happened?

The `Config2_UnsupportedButEnabled` e2e test (`Test_AutoMNNVL_UnsupportedButEnabled/operator_exits_when_CD_CRD_is_missing`) is flaky: it intermittently fails with a 2-minute timeout while waiting for the expected preflight-failure log message, but passes on re-run.

```
=== RUN   Test_AutoMNNVL_UnsupportedButEnabled/operator_exits_when_CD_CRD_is_missing
    unsupported_but_enabled_test.go:101:
        Error Trace:  unsupported_but_enabled_test.go:101
                      unsupported_but_enabled_test.go:61
        Error:        Received unexpected error:
                      condition not met within timeout of 2m0s
        Test:         Test_AutoMNNVL_UnsupportedButEnabled/operator_exits_when_CD_CRD_is_missing
        Messages:     Operator logs should show preflight failure due to missing CRD
--- FAIL: Test_AutoMNNVL_UnsupportedButEnabled (120.39s)
    --- FAIL: Test_AutoMNNVL_UnsupportedButEnabled/operator_exits_when_CD_CRD_is_missing (120.15s)
```

### What did you expect to happen?

The test should reliably find the crashing operator pod and verify that its logs contain the preflight failure message (both `MNNVL preflight check failed` and `ComputeDomain CRD`).
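
As a rough illustration of that expectation, the log check can be thought of as "both substrings must be present". This is a minimal sketch, not the test's actual code: `preflightFailureMarkers` and `logsShowPreflightFailure` are hypothetical names, and only the two quoted substrings come from this issue.

```go
package main

import "strings"

// preflightFailureMarkers holds the two substrings quoted in this issue.
// The variable name is hypothetical; the real test may match differently.
var preflightFailureMarkers = []string{
	"MNNVL preflight check failed",
	"ComputeDomain CRD",
}

// logsShowPreflightFailure reports whether every marker appears somewhere
// in the given operator logs. A pod whose logs lack either marker (for
// example, a healthy pod) is reported as false.
func logsShowPreflightFailure(logs string) bool {
	for _, marker := range preflightFailureMarkers {
		if !strings.Contains(logs, marker) {
			return false
		}
	}
	return true
}
```

Under this check, logs from a healthy pod never match, so polling them can only end in a timeout.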

### RCA

The flakiness is a race condition in pod selection during a rolling deployment.
Before `Config2_UnsupportedButEnabled` runs, `config-cluster.py` performs a helm upgrade that changes the operator ConfigMap (it enables `autoMNNVLEnabled`).

Because the Helm chart uses a content-hashed ConfigMap name (`grove-operator-cm-<hash>`), the deployment spec changes, triggering a rolling update.
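
To illustrate why a config change alone rolls the Deployment, here is a minimal sketch of content-hashed naming. It is an assumption about the general technique, not the chart's actual template: the hash algorithm (SHA-256), the 8-character suffix, the flat key/value layout, and the function name `hashedConfigMapName` are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// hashedConfigMapName derives a ConfigMap name from its content, so any
// change to the data yields a different name. Keys are sorted first so the
// result does not depend on Go's random map iteration order.
func hashedConfigMapName(prefix string, data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// A zero byte separates fields so ("ab","c") and ("a","bc") differ.
		h.Write([]byte(k))
		h.Write([]byte{0})
		h.Write([]byte(data[k]))
		h.Write([]byte{0})
	}
	return prefix + "-" + hex.EncodeToString(h.Sum(nil))[:8]
}
```

Because the pod template references the ConfigMap by name, a new name means a new pod template, which is what starts the rollout.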

With the default Kubernetes rolling update strategy (`maxSurge` and `maxUnavailable` both 25%, which resolve to `maxSurge=1` and `maxUnavailable=0` for a single-replica Deployment):

1. A new pod is created with the updated config. It crashes immediately on the MNNVL preflight check because the CRD is missing.
2. The old pod (from Config1) stays Running because the new pod never becomes Ready.
3. The old pod may have `RestartCount > 0` from a cert-refresh restart during Config1.

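
The `maxUnavailable=0, maxSurge=1` values follow from how Kubernetes resolves the default 25% settings: `maxSurge` is rounded up and `maxUnavailable` is rounded down. A small sketch of that arithmetic (the function name `resolveRollingUpdate` is hypothetical, not a Kubernetes API):

```go
package main

// resolveRollingUpdate converts percentage-based rolling update settings
// into absolute pod counts for a given replica count. It mirrors the
// documented rounding rules: surge rounds up, unavailable rounds down.
func resolveRollingUpdate(replicas, surgePct, unavailablePct int) (maxSurge, maxUnavailable int) {
	maxSurge = (replicas*surgePct + 99) / 100        // integer ceiling
	maxUnavailable = replicas * unavailablePct / 100 // integer floor
	return maxSurge, maxUnavailable
}
```

For one replica at 25%/25% this gives one surge pod and zero unavailable pods, so the old pod cannot be removed until the new one is Ready.
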
The `waitForOperatorPod()` function iterates over all operator pods and returns the first one with `RestartCount > 0` or a terminated container. The result is non-deterministic because List ordering is not guaranteed:

- When it picks the new (crashing) pod, the test passes (logs contain the preflight failure).
- When it picks the old (healthy) pod, the test fails (logs don't contain the preflight failure, so the 2-minute poll times out).


### Environment

- Test environment: e2e (k3d cluster, 1 server + 2 agents)
- Kubernetes version: k3s v1.34.2
- Grove version: branch `RUN-36489/mnnvl-demo` (commit `5c076fc7`)
- Test suite: `operator/e2e/tests/auto-mnnvl`, `Config2_UnsupportedButEnabled`
- Cluster setup: `run_autoMNNVL_e2e_all.py` orchestrator with `config-cluster.py` (fake-gpu-operator not installed, autoMNNVL enabled, `--skip-operator-wait`)
- Reproduction: intermittent; fails ~50% of the time, depending on pod list ordering during the rolling update

