Summary
Tests Test_SO2_InorderStartupOrderWithMinReplicas and Test_SO4_ExplicitStartupOrderWithMinReplicas fail intermittently because they expect all pods in a parent PodClique to be ready before dependent pods start, but the implementation only guarantees that minAvailable parent pods are ready before dependents proceed.
Affected Tests
| Test | Workload | Status |
| --- | --- | --- |
| Test_SO1_InorderStartupOrderWithFullReplicas | workload3 | ✅ Safe (minAvailable == replicas) |
| Test_SO2_InorderStartupOrderWithMinReplicas | workload4 | ⚠️ Vulnerable |
| Test_SO3_ExplicitStartupOrderWithFullReplicas | workload5 | ✅ Safe (minAvailable == replicas) |
| Test_SO4_ExplicitStartupOrderWithMinReplicas | workload6 | ⚠️ Vulnerable |
Failure Message
startup_ordering_test.go:291: Startup order violation: group scaling-groups (earliest at 2026-01-14 16:50:22 +0000 UTC) started before group pc-a (latest at 2026-01-14 16:50:23 +0000 UTC)
Configuration Comparison
Vulnerable Workloads (SO2/SO4)
workload4.yaml (SO2) and workload6.yaml (SO4):
cliques:
  - name: pc-a
    spec:
      replicas: 2
      minAvailable: 1  # <-- MISMATCH: only 1 pod required for dependency check
  - name: pc-c
    spec:
      replicas: 3
      minAvailable: 1  # <-- MISMATCH: only 1 pod required
podCliqueScalingGroups:
  - name: sg-x
    replicas: 2
    minAvailable: 1  # <-- MISMATCH
Safe Workloads (SO1/SO3)
workload3.yaml (SO1) and workload5.yaml (SO3):
cliques:
  - name: pc-a
    spec:
      replicas: 2
      minAvailable: 2  # <-- MATCH: all pods required
  - name: pc-c
    spec:
      replicas: 3
      minAvailable: 3  # <-- MATCH: all pods required
podCliqueScalingGroups:
  - name: sg-x
    replicas: 2
    minAvailable: 2  # <-- MATCH
The "FullReplicas" tests (SO1/SO3) pass because minAvailable == replicas, so the implementation behavior happens to match the test expectation. The "MinReplicas" tests (SO2/SO4) are vulnerable because minAvailable < replicas.
References:
operator/e2e/yaml/workload3.yaml through workload6.yaml
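The split between the two groups comes down to one comparison per parent. The following is a minimal sketch in Go, assuming a hypothetical `cliqueConfig` type (not a Grove API type) filled in with the values from the excerpts above:

```go
package main

import "fmt"

// cliqueConfig is a hypothetical stand-in for the replicas/minAvailable
// pair of a PodClique or scaling group; it is not a Grove API type.
type cliqueConfig struct {
	Name         string
	Replicas     int
	MinAvailable int
}

// vulnerable reports whether a strict "all parent pods ready first"
// assertion can fail for this parent: the dependency gate may open
// while some replicas are still not ready.
func vulnerable(c cliqueConfig) bool {
	return c.MinAvailable < c.Replicas
}

func main() {
	// Values copied from the workload excerpts above.
	workload4 := []cliqueConfig{{"pc-a", 2, 1}, {"pc-c", 3, 1}, {"sg-x", 2, 1}}
	workload3 := []cliqueConfig{{"pc-a", 2, 2}, {"pc-c", 3, 3}, {"sg-x", 2, 2}}

	for _, c := range append(workload4, workload3...) {
		fmt.Printf("%s replicas=%d minAvailable=%d vulnerable=%t\n",
			c.Name, c.Replicas, c.MinAvailable, vulnerable(c))
	}
}
```

Every entry taken from workload4 is flagged, and none from workload3, which matches the Safe/Vulnerable classification above.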
Implementation Behavior
The startsAfter dependency is enforced by the grove-initc init container. The init container only waits for minAvailable pods to be ready, not all replicas:
// operator/internal/controller/podclique/components/pod/initcontainer.go:144-145
for _, parentCliqueFQN := range pclq.Spec.StartsAfter {
	// ...
	args = append(args, fmt.Sprintf("--podcliques=%s:%d", parentCliqueFQN, *parentCliqueTemplateSpec.Spec.MinAvailable))
}
This means when pc-a has replicas=2 and minAvailable=1, the init container for pc-b will pass as soon as 1 pc-a pod is ready.
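To make the threshold concrete, here is a simplified, standalone sketch of the argument construction. It reuses only the format string from the excerpt; the `parentClique` type and `buildPodCliqueArgs` function are hypothetical, and the short clique name stands in for the real fully qualified name.

```go
package main

import "fmt"

// parentClique is a hypothetical, simplified view of a startsAfter parent.
// FQN stands in for the fully qualified PodClique name used by the operator.
type parentClique struct {
	FQN          string
	Replicas     int
	MinAvailable int
}

// buildPodCliqueArgs formats one --podcliques argument per parent, using the
// same "%s:%d" layout as the excerpt above. Replicas is never consulted.
func buildPodCliqueArgs(parents []parentClique) []string {
	args := make([]string, 0, len(parents))
	for _, p := range parents {
		args = append(args, fmt.Sprintf("--podcliques=%s:%d", p.FQN, p.MinAvailable))
	}
	return args
}

func main() {
	args := buildPodCliqueArgs([]parentClique{{FQN: "pc-a", Replicas: 2, MinAvailable: 1}})
	fmt.Println(args) // [--podcliques=pc-a:1]
}
```

Because the replica count never reaches the argument list, the init container has no way to wait for the second pc-a pod.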
Test Expectation
The test (verifyScalingGroupStartupOrder) checks that:
- The latest Ready timestamp of all pc-a pods is before
- The earliest Ready timestamp of any scaling group pod
This effectively requires all pc-a pods to be ready before any dependent pod starts.
Reference: operator/e2e/tests/startup_ordering_test.go:528-530
// Verify the ordering: all pods in groupBefore should start before any pod in groupAfter
if earliestAfter.Before(latestBefore) {
	t.Fatalf("Startup order violation: group %s (earliest at %v) started before group %s (latest at %v)",
		afterName, earliestAfter, beforeName, latestBefore)
}
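The comparison can be reproduced outside the test suite. The sketch below is a standalone approximation, not the test's actual helper: `strictOrderViolated` and `times` are hypothetical names. The latest pc-a timestamp (16:50:23) and the earliest scaling-group timestamp (16:50:22) come from the failure message, while the first pc-a timestamp is an illustrative assumption.

```go
package main

import (
	"fmt"
	"time"
)

// times builds timestamps on 2026-01-14 at 16:50:<sec> UTC.
func times(secs ...int) []time.Time {
	out := make([]time.Time, 0, len(secs))
	for _, s := range secs {
		out = append(out, time.Date(2026, 1, 14, 16, 50, s, 0, time.UTC))
	}
	return out
}

// strictOrderViolated mirrors the comparison in the excerpt above: it reports
// a violation when the earliest "after" pod is ready before the latest
// "before" pod. Both slices must be non-empty.
func strictOrderViolated(before, after []time.Time) bool {
	latestBefore := before[0]
	for _, t := range before[1:] {
		if t.After(latestBefore) {
			latestBefore = t
		}
	}
	earliestAfter := after[0]
	for _, t := range after[1:] {
		if t.Before(earliestAfter) {
			earliestAfter = t
		}
	}
	return earliestAfter.Before(latestBefore)
}

func main() {
	pcA := times(21, 23)        // 23 is from the failure message; 21 is assumed
	scalingGroups := times(22)  // from the failure message

	fmt.Println(strictOrderViolated(pcA, scalingGroups)) // true
}
```

A single late parent pod is enough to trip the check, regardless of how many parent pods were ready when the dependent started.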
Timeline Evidence
From the diagnostic logs:
| Time | Event |
| --- | --- |
| 16:50:20 | pc-a-whs98 starts (1st pc-a pod) |
| 16:50:20 | sg-x-0-pc-b init container starts |
| 16:50:21 | sg-x-0-pc-b init container passes (1 pc-a ready ≥ minAvailable=1) |
| 16:50:21 | sg-x-0-pc-b main container starts |
| 16:50:22 | pc-a-w977w starts (2nd pc-a pod) - after pc-b already started |
The init container correctly allowed pc-b to proceed when 1 pc-a pod was ready (minAvailable=1), but the test expected both pc-a pods to be ready first.
The Mismatch
| Aspect | Implementation | Test Expectation |
| --- | --- | --- |
| Dependency check | Ready(parent) >= minAvailable | Ready(parent) == replicas (all pods) |
| For workload6 pc-a | 1 pod ready is sufficient | Both pods must be ready |
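The two conditions in the table can be written as predicates and evaluated side by side. This is an illustrative sketch only; `implementationGate` and `testGate` are hypothetical names and do not appear in the operator or the test code.

```go
package main

import "fmt"

// implementationGate is the condition the init container is described as
// enforcing: enough parent pods are ready to meet minAvailable.
func implementationGate(readyParents, minAvailable int) bool {
	return readyParents >= minAvailable
}

// testGate is the condition the test implicitly assumes: every parent
// replica is ready.
func testGate(readyParents, replicas int) bool {
	return readyParents == replicas
}

func main() {
	const replicas, minAvailable = 2, 1 // pc-a in workload4/workload6

	for ready := 0; ready <= replicas; ready++ {
		fmt.Printf("ready=%d implementation=%t test=%t\n",
			ready, implementationGate(ready, minAvailable), testGate(ready, replicas))
	}
}
```

The predicates disagree only at ready=1, which is the window in which the intermittent failure can occur.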
Possible Resolutions
Option 1: Fix the Test
Modify the test to only verify that minAvailable pods from the parent group are ready before dependent pods start, matching the actual implementation behavior.
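One way such a relaxed check could look is sketched below, assuming the test can obtain the parent's minAvailable value. The function compares the earliest dependent timestamp against the minAvailable-th earliest parent timestamp instead of the latest one. `minAvailableOrderViolated` and `times` are hypothetical names, and the timestamps are illustrative.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// times builds timestamps on 2026-01-14 at 16:50:<sec> UTC.
func times(secs ...int) []time.Time {
	out := make([]time.Time, 0, len(secs))
	for _, s := range secs {
		out = append(out, time.Date(2026, 1, 14, 16, 50, s, 0, time.UTC))
	}
	return out
}

// minAvailableOrderViolated reports a violation only when the earliest
// dependent pod became ready before the minAvailable-th parent pod did.
// It assumes 1 <= minAvailable <= len(parentReady) and a non-empty
// dependentReady slice.
func minAvailableOrderViolated(parentReady, dependentReady []time.Time, minAvailable int) bool {
	sorted := append([]time.Time(nil), parentReady...) // copy; leave input untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Before(sorted[j]) })
	threshold := sorted[minAvailable-1] // earliest moment the gate could open

	earliest := dependentReady[0]
	for _, t := range dependentReady[1:] {
		if t.Before(earliest) {
			earliest = t
		}
	}
	return earliest.Before(threshold)
}

func main() {
	parent := times(21, 23)
	dependent := times(22)

	fmt.Println(minAvailableOrderViolated(parent, dependent, 1)) // false
	fmt.Println(minAvailableOrderViolated(parent, dependent, 2)) // true
}
```

With minAvailable equal to the replica count, this reduces to the current strict comparison, so the FullReplicas tests would be checked exactly as before.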
Option 2: Fix the Implementation
Add an option or change the behavior so that startsAfter waits for all parent replicas (or a configurable threshold) rather than just minAvailable.
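As a rough illustration of what a configurable threshold might look like, the sketch below selects the required ready count from a policy value. The `startupGate` type and its constants are purely hypothetical; nothing in the source indicates that such an option exists or how it would be exposed.

```go
package main

import "fmt"

// startupGate is a hypothetical policy knob; no such option is described
// in the operator today.
type startupGate int

const (
	gateMinAvailable startupGate = iota // current behavior
	gateAllReplicas                     // wait for every parent replica
)

// requiredReady returns how many parent pods must be ready before
// dependents may start under the given policy.
func requiredReady(policy startupGate, replicas, minAvailable int) int {
	if policy == gateAllReplicas {
		return replicas
	}
	return minAvailable
}

func main() {
	fmt.Println(requiredReady(gateMinAvailable, 2, 1)) // 1
	fmt.Println(requiredReady(gateAllReplicas, 2, 1))  // 2
}
```

Under this sketch, the value returned by `requiredReady` would replace MinAvailable in the --podcliques argument, so the init container itself would not need to change.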
Option 3: Adjust the Test Configuration
Change workload6.yaml (and likewise workload4.yaml, which has the same mismatch) so that pc-a.minAvailable equals pc-a.replicas (set both to 2), which would make the implementation behavior match the test expectation.
References
- Test file: operator/e2e/tests/startup_ordering_test.go
  - SO2: lines 102-171
  - SO4: lines 229-303
  - verifyScalingGroupStartupOrder: lines 376-459
- Init container args generation: operator/internal/controller/podclique/components/pod/initcontainer.go:141-155
- Workload configurations:
  - operator/e2e/yaml/workload4.yaml (SO2 - vulnerable)
  - operator/e2e/yaml/workload6.yaml (SO4 - vulnerable)
  - operator/e2e/yaml/workload3.yaml (SO1 - safe)
  - operator/e2e/yaml/workload5.yaml (SO3 - safe)
- Scheduling gate removal logic: operator/internal/controller/podclique/components/pod/syncflow.go:241-301
- Test documentation (acknowledges minAvailable behavior): operator/e2e/tests/startup_ordering_test.go:22-24
Test Documentation vs Implementation
Notably, the test file itself documents the minAvailable-based behavior:
// operator/e2e/tests/startup_ordering_test.go:22-24
// The init container watches for parent PodCliques to reach their minAvailable count
// in the Ready state, blocking the pod from becoming ready until dependencies are satisfied.
This confirms the test's verification logic is stricter than what the documented and implemented behavior guarantees.