fix: use patch for topology label replay on replaced nodes #384

Merged
Ronkahn21 merged 1 commit into ai-dynamo:main from Ronkahn21:fix/node-label-replay-optimistic-lock
Feb 2, 2026
Conversation

@Ronkahn21

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes optimistic lock conflicts (409 Conflict errors) when replaying topology labels to nodes after they become unavailable and are replaced during e2e testing.

Problem:

  • reapplyNodeLabels() in operator/e2e/setup/k8s_clusters.go used a full Update() call
  • Between fetching the node and updating its labels, the node's ResourceVersion can change (for example, through writes by the kubelet or the node controller)
  • The Update() then fails with a 409 Conflict error because it carries a stale ResourceVersion
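
The race described above can be sketched with a small in-memory model. This is an illustration only, not the code from k8s_clusters.go: the Store and Node types, the StaleUpdate helper, and the label keys are all hypothetical stand-ins, and the "other writer" changes a label purely for simplicity (real components usually touch status or other fields).

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for the API server's 409 Conflict response.
var ErrConflict = errors.New("409 Conflict: object has been modified")

// Node is a simplified, hypothetical stand-in for a Kubernetes Node object.
type Node struct {
	ResourceVersion int
	Labels          map[string]string
}

// Store mimics optimistic locking for a single node.
type Store struct {
	node Node
}

// copyLabels returns an independent copy so callers cannot mutate stored state.
func copyLabels(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// Get returns a copy of the stored node, including its current ResourceVersion.
func (s *Store) Get() Node {
	return Node{ResourceVersion: s.node.ResourceVersion, Labels: copyLabels(s.node.Labels)}
}

// Update replaces the whole object, but only if the caller's ResourceVersion
// still matches the stored one. On success the version is bumped.
func (s *Store) Update(n Node) error {
	if n.ResourceVersion != s.node.ResourceVersion {
		return ErrConflict
	}
	s.node = Node{ResourceVersion: n.ResourceVersion + 1, Labels: copyLabels(n.Labels)}
	return nil
}

// StaleUpdate reproduces the race: fetch, let another writer update the node,
// then try to write the originally fetched copy back.
func StaleUpdate(s *Store) error {
	fetched := s.Get()

	// Another writer modifies the node between our fetch and our update.
	other := s.Get()
	other.Labels["example/heartbeat"] = "seen" // hypothetical key
	if err := s.Update(other); err != nil {
		return err
	}

	// Our copy now carries a stale ResourceVersion, so this update is rejected.
	fetched.Labels["topology.example/zone"] = "zone-a" // hypothetical key
	return s.Update(fetched)
}

func main() {
	s := &Store{node: Node{ResourceVersion: 1, Labels: map[string]string{}}}
	fmt.Println(StaleUpdate(s))
}
```

In this model the second Update() is rejected purely because the version moved, even though the two writers touched different keys.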

Solution:

  • Switch from Update() to Patch() with StrategicMergePatchType
  • The patch does not need a matching ResourceVersion, so it applies even if the node changed after it was fetched
  • Only modifies the topology labels; other node fields are left untouched
  • Follows the existing pattern from applyTopologyLabels()
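
As a rough sketch of what such a label-only patch looks like, the snippet below builds the JSON body and models how it merges into existing labels. It uses only the standard library; BuildLabelPatch, ApplyLabelPatch, and the label keys are hypothetical names for illustration, not the helpers in this PR. With client-go, a body like this would typically be passed to the Nodes().Patch() call together with types.StrategicMergePatchType; how reapplyNodeLabels() builds its body is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BuildLabelPatch returns a patch body that touches only metadata.labels.
// It carries no resourceVersion, so no optimistic lock check is requested.
func BuildLabelPatch(labels map[string]string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": labels,
		},
	}
	return json.Marshal(patch)
}

// ApplyLabelPatch is a simplified model of the merge semantics for labels:
// keys in the patch are added or overwritten, all other keys are kept.
// Deleting a label via a null value is deliberately not modelled.
func ApplyLabelPatch(existing map[string]string, patchBody []byte) (map[string]string, error) {
	var patch struct {
		Metadata struct {
			Labels map[string]string `json:"labels"`
		} `json:"metadata"`
	}
	if err := json.Unmarshal(patchBody, &patch); err != nil {
		return nil, err
	}
	merged := make(map[string]string, len(existing)+len(patch.Metadata.Labels))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range patch.Metadata.Labels {
		merged[k] = v
	}
	return merged, nil
}

func main() {
	body, err := BuildLabelPatch(map[string]string{"topology.example/zone": "zone-a"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))

	merged, err := ApplyLabelPatch(map[string]string{"kubernetes.io/hostname": "node-1"}, body)
	if err != nil {
		panic(err)
	}
	fmt.Println(merged)
}
```

Because the body names only the labels to set, labels written by other components in the meantime survive the merge.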

Impact:

  • Eliminates optimistic lock conflicts during topology label replay
  • More efficient (only sends label changes)
  • Safer (doesn't risk overwriting concurrent changes)

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

  • Changes only affect e2e test infrastructure
  • Follows existing pattern from applyTopologyLabels() in setup/topology.go
  • No functional behavior change - same topology labels applied, different mechanism
  • Strategic merge patch automatically handles label merging

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

Replace full Update() with Patch() in reapplyNodeLabels() to avoid
optimistic lock conflicts when replaying topology labels to replaced
nodes. Patch operations don't require matching ResourceVersion.

This fixes 409 Conflict errors that occur when nodes are replaced
during unavailability recovery and system components modify the node
between fetch and update operations.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 changed the title fix: use patch for topology label replay on replaced nodes fix: use patch for topology label rep]lay on replaced nodes Feb 2, 2026
@Ronkahn21 Ronkahn21 changed the title fix: use patch for topology label rep]lay on replaced nodes fix: use patch for topology label repplay on replaced nodes Feb 2, 2026
@Ronkahn21 Ronkahn21 merged commit 5fd20af into ai-dynamo:main Feb 2, 2026
15 of 16 checks passed