
fix: use Recreate strategy for GPU workloads to prevent rolling update deadlock#196

Merged
Defilan merged 1 commit into main from fix/gpu-rolling-update-deadlock on Mar 2, 2026

Conversation


@Defilan Defilan commented Mar 2, 2026

Summary

  • Sets deployment strategy to Recreate for GPU workloads (gpuCount > 0) to prevent scheduling deadlock during rolling updates
  • Non-GPU workloads continue using the Kubernetes default (RollingUpdate)
  • Adds test assertions for both GPU and CPU-only strategy behavior

Problem

When all GPUs on a node are occupied, RollingUpdate creates a deadlock: the new pod cannot schedule without a GPU, and the old pod won't terminate until the new pod is Ready. This is especially common in homelab/small clusters with no spare GPUs.

How it works

The fix adds a Recreate strategy assignment inside the existing if gpuCount > 0 block in constructDeployment(), alongside the GPU toleration logic. This means the old pod terminates first, freeing the GPU for the replacement pod. Brief downtime during updates is acceptable — the alternative is a permanent deadlock.
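The conditional can be sketched as below. This is a minimal illustration, not the repository's actual code: the real `constructDeployment()` works with the `k8s.io/api/apps/v1` types (`appsv1.DeploymentStrategy`, `appsv1.RecreateDeploymentStrategyType`), and the local struct here is a simplified stand-in.

```go
package main

import "fmt"

// DeploymentStrategy is a simplified stand-in for appsv1.DeploymentStrategy.
// An empty Type means the field is left unset, so Kubernetes applies its
// default, RollingUpdate.
type DeploymentStrategy struct {
	Type string
}

// strategyFor mirrors the fix: GPU workloads (gpuCount > 0) get Recreate,
// so the old pod terminates and releases its GPU before the replacement
// pod is scheduled. CPU-only workloads keep the RollingUpdate default.
func strategyFor(gpuCount int) DeploymentStrategy {
	if gpuCount > 0 {
		return DeploymentStrategy{Type: "Recreate"}
	}
	return DeploymentStrategy{} // unset: Kubernetes defaults to RollingUpdate
}

func main() {
	fmt.Println(strategyFor(1).Type) // GPU workload
	fmt.Println(strategyFor(0).Type) // CPU-only workload (empty: default)
}
```

Leaving the strategy unset for CPU-only workloads (rather than explicitly writing RollingUpdate) keeps the rendered Deployment identical to what the operator produced before the fix.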

Existing GPU InferenceService deployments will pick up the fix on the next reconcile cycle.

Fixes #192

Test plan

  • make test passes — all existing + new tests
  • Verify GPU deployment has strategy: Recreate via kubectl get deploy <name> -o yaml
  • Verify CPU-only deployment retains default RollingUpdate
  • Trigger a GPU InferenceService update and confirm old pod terminates before new pod creates

fix: use Recreate strategy for GPU workloads to prevent rolling update deadlock

When all GPUs on a node are occupied, RollingUpdate creates a deadlock:
the new pod cannot schedule without a GPU, and the old pod won't terminate
until the new pod is Ready. This sets the deployment strategy to Recreate
for GPU workloads (gpuCount > 0) so the old pod terminates first, freeing
the GPU for the replacement.

Fixes #192

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan Defilan merged commit 2e45181 into main Mar 2, 2026
15 checks passed
@Defilan Defilan deleted the fix/gpu-rolling-update-deadlock branch March 2, 2026 16:53


Development

Successfully merging this pull request may close these issues.

Rolling updates fail for GPU workloads when no spare GPUs are available
