Skip to content

fix: use server-side apply in RU e2e tests#504

Merged
enoodle merged 1 commit into
ai-dynamo:mainfrom
enoodle:erez/fix/rolling-update-e2e-use-ssa
Apr 9, 2026
Merged

fix: use server-side apply in RU e2e tests#504
enoodle merged 1 commit into
ai-dynamo:mainfrom
enoodle:erez/fix/rolling-update-e2e-use-ssa

Conversation

@enoodle

@enoodle enoodle commented Mar 31, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug
/kind api

What this PR does / why we need it:

Use server-side apply in rolling updates e2e to fix flaky tests

Which issue(s) this PR fixes:

Fixes #503

Special notes for your reviewer:

Does this PR introduce a API change?

This updates the API of PCS to the field PodCliqueSetTemplateSpec.Cliques a map interpretation. It had no list-type annotation (defaults to atomic — the whole array is replaced on merge). Added +listType=map
+listMapKey=name to match how containers and env are already annotated. Non-breaking for existing clients.

@copy-pr-bot

copy-pr-bot Bot commented Mar 31, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@enoodle enoodle force-pushed the erez/fix/rolling-update-e2e-use-ssa branch from 059273a to e0dba56 Compare March 31, 2026 15:25
danbar2
danbar2 previously approved these changes Mar 31, 2026
@enoodle enoodle changed the title fix: use server-side apply RU e2e tests fix: use server-side apply in RU e2e tests Mar 31, 2026
@enoodle enoodle force-pushed the erez/fix/rolling-update-e2e-use-ssa branch from e0dba56 to b83968c Compare April 3, 2026 21:26

@Ronkahn21 Ronkahn21 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion using SSA, Had a few questions and some personal style

Comment thread operator/e2e/tests/update/utils.go
Comment thread operator/e2e/tests/update/utils.go

@Ronkahn21 Ronkahn21 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure not to merge from the repo only to rebase

@enoodle enoodle force-pushed the erez/fix/rolling-update-e2e-use-ssa branch from 223d791 to 0ef4a13 Compare April 5, 2026 19:55
@enoodle

enoodle commented Apr 5, 2026

Copy link
Copy Markdown
Contributor Author

@Ronkahn21 Why is rebase preferred to merge commits? from code perspective the result is similar - The tests run on the same code.

…ndition

triggerPodCliqueUpdate and patchPCSWithSIGTERMIgnoringCommand use
RetryOnConflict + DynamicClient.Update to modify the PCS. Under CI DinD
CPU throttling, the grove controller reconciles concurrently and changes
the resourceVersion between GET and UPDATE, exhausting the retry window.

Switch to server-side apply (CRClient.Patch with client.Apply) which
does not check resourceVersion. Also add +listType=map +listMapKey=name
to PodCliqueSetTemplateSpec.Cliques so SSA merges by clique name rather
than treating the array as atomic.

Closes ai-dynamo#503

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@enoodle enoodle force-pushed the erez/fix/rolling-update-e2e-use-ssa branch from 0ef4a13 to 5ff5ddd Compare April 5, 2026 20:59
@enoodle enoodle merged commit ebbfcac into ai-dynamo:main Apr 9, 2026
26 of 27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Flaky E2E rolling update tests due to PCS update conflict race

3 participants