SO2/SO4 Test Expectations Do Not Match Startup Ordering Implementation #319

@gflarity

Summary

Tests Test_SO2_InorderStartupOrderWithMinReplicas and Test_SO4_ExplicitStartupOrderWithMinReplicas fail intermittently because they expect all pods in a parent PodClique to be ready before dependent pods start, but the implementation only guarantees that minAvailable pods are ready before dependencies proceed.

Affected Tests

| Test | Workload | Status |
| --- | --- | --- |
| Test_SO1_InorderStartupOrderWithFullReplicas | workload3 | ✅ Safe (minAvailable == replicas) |
| Test_SO2_InorderStartupOrderWithMinReplicas | workload4 | ⚠️ Vulnerable |
| Test_SO3_ExplicitStartupOrderWithFullReplicas | workload5 | ✅ Safe (minAvailable == replicas) |
| Test_SO4_ExplicitStartupOrderWithMinReplicas | workload6 | ⚠️ Vulnerable |

Failure Message

startup_ordering_test.go:291: Startup order violation: group scaling-groups (earliest at 2026-01-14 16:50:22 +0000 UTC) started before group pc-a (latest at 2026-01-14 16:50:23 +0000 UTC)

Configuration Comparison

Vulnerable Workloads (SO2/SO4)

workload4.yaml (SO2) and workload6.yaml (SO4):

cliques:
  - name: pc-a
    spec:
      replicas: 2
      minAvailable: 1    # <-- MISMATCH: only 1 pod required for dependency check
  - name: pc-c
    spec:
      replicas: 3
      minAvailable: 1    # <-- MISMATCH: only 1 pod required
podCliqueScalingGroups:
  - name: sg-x
    replicas: 2
    minAvailable: 1      # <-- MISMATCH

Safe Workloads (SO1/SO3)

workload3.yaml (SO1) and workload5.yaml (SO3):

cliques:
  - name: pc-a
    spec:
      replicas: 2
      minAvailable: 2    # <-- MATCH: all pods required
  - name: pc-c
    spec:
      replicas: 3
      minAvailable: 3    # <-- MATCH: all pods required
podCliqueScalingGroups:
  - name: sg-x
    replicas: 2
    minAvailable: 2      # <-- MATCH

The "FullReplicas" tests (SO1/SO3) pass because minAvailable == replicas, so the implementation behavior happens to match the test expectation. The "MinReplicas" tests (SO2/SO4) are vulnerable because minAvailable < replicas.

References:

  • operator/e2e/yaml/workload3.yaml through workload6.yaml

Implementation Behavior

The startsAfter dependency is enforced by the grove-initc init container. The init container only waits for minAvailable pods to be ready, not all replicas:

// operator/internal/controller/podclique/components/pod/initcontainer.go:144-145
for _, parentCliqueFQN := range pclq.Spec.StartsAfter {
    // ...
    args = append(args, fmt.Sprintf("--podcliques=%s:%d", parentCliqueFQN, *parentCliqueTemplateSpec.Spec.MinAvailable))
}

This means when pc-a has replicas=2 and minAvailable=1, the init container for pc-b will pass as soon as 1 pc-a pod is ready.
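To make the effect concrete, here is a minimal sketch of how the threshold in the --podcliques argument differs between the vulnerable and safe workloads. The cliqueSpec type, the podcliquesArg helper, and the workload-pc-a name are hypothetical stand-ins for illustration, not the operator's actual types; only the --podcliques=%s:%d format is taken from the snippet above.

```go
package main

import "fmt"

// cliqueSpec is a hypothetical, simplified stand-in for the parent clique
// template spec; the real operator type has more fields.
type cliqueSpec struct {
	Replicas     int32
	MinAvailable *int32
}

// podcliquesArg uses the same format string as the quoted initcontainer.go
// snippet: the threshold handed to the init container is MinAvailable,
// and Replicas is never consulted.
func podcliquesArg(parentFQN string, spec cliqueSpec) string {
	return fmt.Sprintf("--podcliques=%s:%d", parentFQN, *spec.MinAvailable)
}

func main() {
	one, two := int32(1), int32(2)

	// SO2/SO4 style: replicas=2, minAvailable=1.
	fmt.Println(podcliquesArg("workload-pc-a", cliqueSpec{Replicas: 2, MinAvailable: &one}))
	// prints --podcliques=workload-pc-a:1

	// SO1/SO3 style: replicas=2, minAvailable=2.
	fmt.Println(podcliquesArg("workload-pc-a", cliqueSpec{Replicas: 2, MinAvailable: &two}))
	// prints --podcliques=workload-pc-a:2
}
```

With the SO2/SO4 configuration the dependent pod is told to wait for one parent pod, so the second pc-a pod is free to become ready at any later time.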

Test Expectation

The test (verifyScalingGroupStartupOrder) checks that the latest Ready timestamp across all pc-a pods is before the earliest Ready timestamp of any scaling group pod.

This effectively requires all pc-a pods to be ready before any dependent pod starts.

Reference: operator/e2e/tests/startup_ordering_test.go:528-530

// Verify the ordering: all pods in groupBefore should start before any pod in groupAfter
if earliestAfter.Before(latestBefore) {
    t.Fatalf("Startup order violation: group %s (earliest at %v) started before group %s (latest at %v)",
        afterName, earliestAfter, beforeName, latestBefore)
}

Timeline Evidence

From the diagnostic logs:

| Time | Event |
| --- | --- |
| 16:50:20 | pc-a-whs98 starts (1st pc-a pod) |
| 16:50:20 | sg-x-0-pc-b init container starts |
| 16:50:21 | sg-x-0-pc-b init container passes (1 pc-a ready ≥ minAvailable=1) |
| 16:50:21 | sg-x-0-pc-b main container starts |
| 16:50:22 | pc-a-w977w starts (2nd pc-a pod), after pc-b already started |

The init container correctly allowed pc-b to proceed when 1 pc-a pod was ready (minAvailable=1), but the test expected both pc-a pods to be ready first.

The Mismatch

| Aspect | Implementation | Test Expectation |
| --- | --- | --- |
| Dependency check | Ready(parent) >= minAvailable | Ready(parent) == replicas (all pods) |
| pc-a in workload4 and workload6 (replicas=2, minAvailable=1) | 1 pod ready is sufficient | Both pods must be ready |

Possible Resolutions

Option 1: Fix the Test

Modify the test to only verify that minAvailable pods from the parent group are ready before dependent pods start, matching the actual implementation behavior.

Option 2: Fix the Implementation

Add an option or change the behavior so that startsAfter waits for all parent replicas (or a configurable threshold) rather than just minAvailable.
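As a rough illustration of what a configurable threshold might look like, the sketch below selects the number of parent pods to wait for based on a policy value. The waitPolicy type, its constants, and requiredReady are entirely hypothetical; no such option exists in the API described in this issue, and the real change would also need to reach the init container arguments.

```go
package main

import "fmt"

// waitPolicy is a hypothetical option for illustration only.
type waitPolicy string

const (
	waitForMinAvailable waitPolicy = "MinAvailable" // matches the current behavior
	waitForAllReplicas  waitPolicy = "AllReplicas"  // hypothetical stricter mode
)

// requiredReady returns how many parent pods must be Ready before a
// dependent pod may proceed under the given policy.
func requiredReady(policy waitPolicy, replicas, minAvailable int32) int32 {
	if policy == waitForAllReplicas {
		return replicas
	}
	return minAvailable
}

func main() {
	// pc-a in the SO2/SO4 workloads: replicas=2, minAvailable=1.
	fmt.Println(requiredReady(waitForMinAvailable, 2, 1)) // prints 1
	fmt.Println(requiredReady(waitForAllReplicas, 2, 1))  // prints 2
}
```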

Option 3: Adjust the Test Configuration

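Change workload4.yaml and workload6.yaml so that minAvailable equals replicas for each parent (for example, pc-a with both set to 2), which would make the implementation behavior match the test expectation. This would leave their minAvailable settings identical to the SO1/SO3 workloads, so the MinReplicas tests would no longer exercise the minAvailable < replicas case.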

References

  • Test file: operator/e2e/tests/startup_ordering_test.go
    • SO2: lines 102-171
    • SO4: lines 229-303
    • verifyScalingGroupStartupOrder: lines 376-459
  • Init container args generation: operator/internal/controller/podclique/components/pod/initcontainer.go:141-155
  • Workload configurations:
    • operator/e2e/yaml/workload4.yaml (SO2 - vulnerable)
    • operator/e2e/yaml/workload6.yaml (SO4 - vulnerable)
    • operator/e2e/yaml/workload3.yaml (SO1 - safe)
    • operator/e2e/yaml/workload5.yaml (SO3 - safe)
  • Scheduling gate removal logic: operator/internal/controller/podclique/components/pod/syncflow.go:241-301
  • Test documentation (acknowledges minAvailable behavior): operator/e2e/tests/startup_ordering_test.go:22-24

Test Documentation vs Implementation

Notably, the test file itself documents the minAvailable-based behavior:

// operator/e2e/tests/startup_ordering_test.go:22-24
// The init container watches for parent PodCliques to reach their minAvailable count
// in the Ready state, blocking the pod from becoming ready until dependencies are satisfied.

This confirms the test's verification logic is stricter than what the documented and implemented behavior guarantees.

Labels: kind/bug
