[BUG] Flaky E2E rolling update tests due to PCS update conflict race #503

@enoodle

Description

What happened?

Test_RU12_RollingUpdateWithPCSScaleInDuringUpdate intermittently fails in CI with:

Failed to trigger rolling update on pc-c: Operation cannot be fulfilled on
podcliquesets.grove.io "workload1": the object has been modified; please apply
your changes to the latest version and try again

triggerPodCliqueRollingUpdate runs a GET → modify → UPDATE sequence for pc-a, pc-b, and pc-c in turn. After the first update, the grove controller reconciles and writes back to the PCS (PodCliqueSet), which changes its resourceVersion. The next update then fails with a conflict, and RetryOnConflict (5 retries × 10ms) cannot recover.

Failures confirmed across multiple branches:

  • erez/chore/update-kai-version: runs 23736130190, 23480636572
  • crd-upgrader-impl: runs 23756654965, 23747858360, 23445539025

Not reproducible locally. The CI Docker-in-Docker (DinD) runners are CPU-throttled, which makes controller reconciliation bursty and widens the conflict window.

What did you expect to happen?

Rolling update tests should pass reliably in CI regardless of controller reconciliation timing.

Environment

  • CI runners: prod-grove-e2e-v1 (Docker-in-Docker with CPU limits)
  • Cluster: k3d with 30 KWOK nodes (default e2e preset)
  • Kubernetes: k3s v1.34.2
