
Rolling Update Gets Stuck When New Update Initiated During In-Progress Update #400

@julienmancuso

What happened?

The Dynamo QA team tried to trigger a rolling update while an existing rolling update was still in progress.

When a new rolling update is triggered while a previous rolling update is still in progress, Grove's PodClique controller loses track of pods created during the first update attempt. This causes the rolling update to mark itself as "completed" (updateEndedAt is set) even though pods with incorrect template hashes are still running, leaving updatedReplicas: 0 permanently.

What Happened (Actual Behavior)

Scenario

  1. User initiates first rolling update by changing PodCliqueSet template (e.g., annotation nvidia.com/restartAt: "test2")
  2. Grove PodClique controller starts rolling update:
    • Sets rollingUpdateProgress.podTemplateHash to hash for test2
    • Begins deleting old pods
    • Creates new pod with test2 annotation and corresponding hash
  3. Before first update completes, user initiates second rolling update (e.g., annotation changes to nvidia.com/restartAt: "test3")
  4. Grove PodClique controller detects spec change and resets rolling update:
    • Calls initOrResetRollingUpdate()
    • Sets new rollingUpdateProgress.podTemplateHash to hash for test3
    • Sets updateStartedAt to current time
    • Resets updatedReplicas: 0
  5. Controller calls processPendingUpdates() to continue the rolling update
  6. Problem occurs: computeUpdateWork() categorizes existing pods:
    • Pods with test2 hash don't match currentPodTemplateHash (old, pre-test2)
    • Pods with test2 hash also don't match rollingUpdateProgress.podTemplateHash (test3)
    • These "orphaned" pods are not categorized as needing update
  7. getPodNamesPendingUpdate() returns empty list (no pods to update)
  8. Controller reaches lines 131-132 in rollingupdate.go and calls markRollingUpdateEnd()
  9. Result: Rolling update is marked as completed (updateEndedAt is set) but:
    • Pod still running with wrong hash (test2 instead of test3)
    • updatedReplicas: 0
    • readyReplicas: 1
    • Rolling update never actually completes

Evidence from Cluster

PodClique Status:

status:
  currentPodTemplateHash: c9cbdcf75cbb9d7df89  # Old hash (pre-test2)
  rollingUpdateProgress:
    podTemplateHash: d554cc997cfd9f64fd8       # Target hash (test3)
    updateStartedAt: "2026-02-04T18:10:34Z"
    updateEndedAt: "2026-02-04T18:10:34Z"      # ❌ Marked as ended immediately
  readyReplicas: 1                             # Pod is ready
  updatedReplicas: 0                           # ❌ But not "updated"
  replicas: 1

Pod Labels:

metadata:
  labels:
    grove.io/pod-template-hash: b76488bc9dd56799677  # Hash from test2 (middle state)
  annotations:
    nvidia.com/restartAt: test2                      # From first update attempt

PodClique Annotations (expected):

metadata:
  annotations:
    nvidia.com/restartAt: test3  # Second update attempt

Hash Mismatch:

  • Pod actual hash: b76488bc9dd56799677 (from test2)
  • PodClique current hash: c9cbdcf75cbb9d7df89 (old)
  • PodClique target hash: d554cc997cfd9f64fd8 (test3)
  • Pod doesn't match any of them → orphaned
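
The three-way comparison above can be sketched as a small standalone Go program. The helper name classifyPodHash and its category strings are illustrative assumptions, not Grove identifiers; the hash values are the ones reported in the status and pod labels above.

```go
package main

import "fmt"

// classifyPodHash is a hypothetical helper (not part of Grove) that reports how
// a pod's template hash relates to the PodClique's current and target hashes.
func classifyPodHash(podHash, currentHash, targetHash string) string {
	switch podHash {
	case targetHash:
		return "updated"
	case currentHash:
		return "old"
	default:
		// Matches neither hash: the pod was created by an update that was superseded.
		return "intermediate"
	}
}

func main() {
	const (
		currentHash = "c9cbdcf75cbb9d7df89" // status.currentPodTemplateHash (pre-test2)
		targetHash  = "d554cc997cfd9f64fd8" // rollingUpdateProgress.podTemplateHash (test3)
		podHash     = "b76488bc9dd56799677" // pod label grove.io/pod-template-hash (test2)
	)
	fmt.Println(classifyPodHash(podHash, currentHash, targetHash)) // prints "intermediate"
}
```

Any pod that lands in the "intermediate" bucket still differs from the target hash, so it should be replaced like an "old" pod.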

Grove Operator Logs:

"msg":"components has registered a request to requeue post completion of all components syncs",
"kind":"PodCliqueSetReplica",
"message":"[Operation: Sync, Code: ERR_CONTINUE_RECONCILE_AND_REQUEUE] message: rolling update of PodCliqueSet replica index 0 is not completed"

The PodCliqueSet controller keeps requeueing because it sees the rolling update is "not completed" but the PodClique controller thinks it's done.
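
The disagreement between the two controllers shows up directly in the PodClique status. Below is a minimal diagnostic sketch, assuming a simplified status struct that carries only the fields quoted above; pclqStatus and looksStuck are hypothetical names, not Grove API types.

```go
package main

import "fmt"

// pclqStatus mirrors only the status fields quoted in this issue.
// It is a simplified stand-in, not the real Grove API type.
type pclqStatus struct {
	Replicas        int32
	ReadyReplicas   int32
	UpdatedReplicas int32
	UpdateEnded     bool // true when rollingUpdateProgress.updateEndedAt is set
}

// looksStuck is a hypothetical check: the update claims to have ended although
// fewer pods than desired carry the target template hash.
func looksStuck(s pclqStatus) bool {
	return s.UpdateEnded && s.UpdatedReplicas < s.Replicas
}

func main() {
	// Values taken from the PodClique status shown in "Evidence from Cluster".
	observed := pclqStatus{Replicas: 1, ReadyReplicas: 1, UpdatedReplicas: 0, UpdateEnded: true}
	fmt.Println(looksStuck(observed)) // prints "true"
}
```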

What Should Happen (Expected Behavior)

When a new rolling update is initiated while one is in progress:

  1. Grove should detect pods from the previous update attempt that don't match the new target hash
  2. These "orphaned" pods should be identified as needing replacement
  3. The rolling update should:
    • Delete the orphaned pods (with test2 hash)
    • Create new pods with the correct hash (test3)
    • Wait for new pods to become ready
    • Mark rolling update as completed only when all pods have the correct hash
  4. Final state should be:
    • updatedReplicas: 1
    • readyReplicas: 1
    • Pod hash matches rollingUpdateProgress.podTemplateHash
    • updateEndedAt set only when actual update is complete
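
The completion condition described above can be written as a single predicate. This is a sketch under simplifying assumptions: podInfo and updateComplete are hypothetical names, and a pod is reduced to its template hash and a ready flag.

```go
package main

import "fmt"

// podInfo is a simplified stand-in for a Pod: its template hash label and readiness.
type podInfo struct {
	TemplateHash string
	Ready        bool
}

// updateComplete is a hypothetical predicate: the rolling update may be marked
// as ended only when no pod carries a non-target hash and enough pods with the
// target hash are ready.
func updateComplete(pods []podInfo, targetHash string, desiredReplicas int) bool {
	updatedAndReady := 0
	for _, p := range pods {
		if p.TemplateHash != targetHash {
			return false // a pod from an older or superseded template is still present
		}
		if p.Ready {
			updatedAndReady++
		}
	}
	return updatedAndReady >= desiredReplicas
}

func main() {
	// The state from this issue: one ready pod still on the test2 hash.
	fmt.Println(updateComplete([]podInfo{{TemplateHash: "hash-test2", Ready: true}}, "hash-test3", 1)) // false
	// The expected final state: one ready pod on the test3 hash.
	fmt.Println(updateComplete([]podInfo{{TemplateHash: "hash-test3", Ready: true}}, "hash-test3", 1)) // true
}
```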

Root Cause Analysis

Location

File: grove/operator/internal/controller/podclique/components/pod/rollingupdate.go

The Bug

In computeUpdateWork() (lines 136-159):

func (r _resource) computeUpdateWork(logger logr.Logger, sc *syncContext) *updateWork {
    work := &updateWork{}
    for _, pod := range sc.existingPCLQPods {
        if pod.Labels[common.LabelPodTemplateHash] != sc.expectedPodTemplateHash {
            // Pod has OLD template hash - should be updated
            if r.hasPodDeletionBeenTriggered(sc, pod) {
                logger.Info("skipping old Pod since its deletion has already been triggered", "pod", client.ObjectKeyFromObject(pod))
                continue
            }
            // Categorize by health status...
            if k8sutils.IsPodPending(pod) {
                work.oldTemplateHashPendingPods = append(work.oldTemplateHashPendingPods, pod)
            } else if k8sutils.HasAnyStartedButNotReadyContainer(pod) || k8sutils.HasAnyContainerExitedErroneously(logger, pod) {
                work.oldTemplateHashUnhealthyPods = append(work.oldTemplateHashUnhealthyPods, pod)
            } else if k8sutils.IsPodReady(pod) {
                work.oldTemplateHashReadyPods = append(work.oldTemplateHashReadyPods, pod)
            }
        } else {
            // Pod has NEW template hash - already updated
            if k8sutils.IsPodReady(pod) {
                work.newTemplateHashReadyPods = append(work.newTemplateHashReadyPods, pod)
            }
        }
    }
    return work
}

The Problem:

The code compares each pod against sc.expectedPodTemplateHash (the NEW target hash from rollingUpdateProgress.podTemplateHash). It assumes every pod either:

  • Has the NEW hash (already updated)
  • Has some OTHER hash (needs updating)

But it doesn't account for the case where:

  • Pod has an INTERMEDIATE hash from a previous update attempt
  • Pod is READY (passes all health checks)
  • Pod's hash != expectedPodTemplateHash
  • Pod's hash != currentPodTemplateHash (because currentPodTemplateHash is from pre-first-update)

When this happens:

  1. The check at line 138 evaluates to TRUE (pod hash != expectedPodTemplateHash)
  2. Pod is categorized into oldTemplateHashReadyPods
  3. BUT if the pod was created by a previous rolling update attempt, it may have already passed through the deletion check and is in a "healthy running" state
  4. The key issue: if this pod is the ONLY pod, and it's ready, the rolling update logic thinks "all old pods have been replaced" because it doesn't see any pods from the "true old" generation

A closer look at the logs suggests the issue may instead lie in the deletion expectation tracking. When initOrResetRollingUpdate() is called, it resets the rolling update progress but does not clear deletion expectations left over from the previous update attempt.

Alternative Root Cause

When initOrResetRollingUpdate() is called (lines 149-167 in reconcilespec.go):

func (r *Reconciler) initOrResetRollingUpdate(ctx context.Context, pcs *grovecorev1alpha1.PodCliqueSet, pclq *grovecorev1alpha1.PodClique) error {
    podTemplateHash, err := componentutils.GetExpectedPCLQPodTemplateHash(pcs, pclq.ObjectMeta)
    if err != nil {
        return fmt.Errorf("could not update PodClique %s status with rolling update progress: %w", client.ObjectKeyFromObject(pclq), err)
    }
    // reset and start the rolling update
    patch := client.MergeFrom(pclq.DeepCopy())
    pclq.Status.RollingUpdateProgress = &grovecorev1alpha1.PodCliqueRollingUpdateProgress{
        UpdateStartedAt:            metav1.Now(),
        PodCliqueSetGenerationHash: *pcs.Status.CurrentGenerationHash,
        PodTemplateHash:            podTemplateHash,
    }
    // reset the updated replicas count to 0 so that the rolling update can start afresh.
    pclq.Status.UpdatedReplicas = 0
    if err = r.client.Status().Patch(ctx, pclq, patch); err != nil {
        return fmt.Errorf("failed to update PodClique %s status with rolling update progress: %w", client.ObjectKeyFromObject(pclq), err)
    }
    return nil
}

This resets RollingUpdateProgress but doesn't:

  1. Clear the expectations store for pending deletions from the previous update
  2. Clear ReadyPodsSelectedToUpdate (though it should be nil here)
  3. Force deletion of pods with intermediate hashes

So when the next reconciliation happens:

  • Pod with test2 hash exists and is ready
  • hasPodDeletionBeenTriggered() might return TRUE if a deletion was already expected from test2 update
  • Pod gets skipped in the categorization (lines 141-143)
  • getPodNamesPendingUpdate() returns empty
  • Rolling update marked as complete
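
If this alternative root cause holds, the sequence above can be reproduced with a toy model of the categorization loop. Everything here is an assumption for illustration: the pod struct, podsPendingUpdate, and a delete-expectation set keyed by pod name stand in for the real syncContext and expectations store.

```go
package main

import "fmt"

// pod is a simplified stand-in carrying only a name and a template hash.
type pod struct {
	Name         string
	TemplateHash string
}

// podsPendingUpdate models the loop in simplified form: pods whose hash differs
// from the target need updating, unless a delete expectation is recorded for them.
func podsPendingUpdate(pods []pod, targetHash string, deleteExpected map[string]bool) []string {
	var pending []string
	for _, p := range pods {
		if p.TemplateHash == targetHash {
			continue // already on the target template
		}
		if deleteExpected[p.Name] {
			continue // skipped: a deletion is believed to be in flight
		}
		pending = append(pending, p.Name)
	}
	return pending
}

func main() {
	pods := []pod{{Name: "pod-0", TemplateHash: "hash-test2"}}

	// A stale expectation from the test2 update hides the pod from the test3 update.
	stale := map[string]bool{"pod-0": true}
	fmt.Println(len(podsPendingUpdate(pods, "hash-test3", stale))) // prints 0

	// Without the stale expectation, the pod is picked up for replacement.
	fmt.Println(podsPendingUpdate(pods, "hash-test3", map[string]bool{})) // prints [pod-0]
}
```

With an empty pending list and no pods on the target hash, a controller that treats "nothing pending" as "done" would mark the update as ended, which matches the observed status.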

Reproduction Steps

  1. Create a PodCliqueSet with a PodClique (any spec)
  2. Wait for pods to become ready
  3. Update PodCliqueSet template (e.g., change annotation nvidia.com/restartAt: "test1")
  4. Grove starts rolling update, begins deleting/recreating pods
  5. Before all pods are recreated, update PodCliqueSet template again (e.g., change annotation to nvidia.com/restartAt: "test2")
  6. Observe:
    • PodClique rollingUpdateProgress.updateEndedAt is set
    • PodClique updatedReplicas: 0
    • Pods running with hash from first update (test1) instead of second (test2)
    • Rolling update never completes

Impact

  • High: Rolling updates can get permanently stuck when rapid changes are made
  • Users cannot complete deployments after rapid consecutive restarts
  • Requires manual intervention (deleting the PodClique or pods) to recover
  • Affects any workload using PodClique/PodCliqueSet for orchestration
  • Common in CI/CD scenarios or when users issue multiple restart commands

Suggested Fix

The fix should be in computeUpdateWork() or when calling initOrResetRollingUpdate():

Option 1: Clear Expectations on Reset

When initOrResetRollingUpdate() is called, also clear the expectations store for this PodClique:

func (r *Reconciler) initOrResetRollingUpdate(ctx context.Context, pcs *grovecorev1alpha1.PodCliqueSet, pclq *grovecorev1alpha1.PodClique) error {
    // ... existing code ...
    
    // Clear deletion expectations from previous rolling update attempt
    pclqExpectationsStoreKey := expectationstore.GetKey(pclq)
    r.expectationsStore.ClearDeleteExpectations(pclqExpectationsStoreKey)
    
    // ... rest of existing code ...
}
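
To make the intent of Option 1 concrete, here is a toy in-memory store with a per-PodClique clear operation. It is a sketch only: the type deleteExpectations and its methods are hypothetical and do not describe the API of Grove's actual expectations store.

```go
package main

import (
	"fmt"
	"sync"
)

// deleteExpectations is a toy stand-in for an expectations store.
// It maps a PodClique key to the set of pod names expected to be deleted.
type deleteExpectations struct {
	mu    sync.Mutex
	byKey map[string]map[string]struct{}
}

func newDeleteExpectations() *deleteExpectations {
	return &deleteExpectations{byKey: make(map[string]map[string]struct{})}
}

// Expect records that podName under key is expected to be deleted.
func (s *deleteExpectations) Expect(key, podName string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.byKey[key] == nil {
		s.byKey[key] = make(map[string]struct{})
	}
	s.byKey[key][podName] = struct{}{}
}

// Has reports whether a delete expectation exists for podName under key.
func (s *deleteExpectations) Has(key, podName string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.byKey[key][podName]
	return ok
}

// Clear drops every delete expectation recorded under key.
func (s *deleteExpectations) Clear(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.byKey, key)
}

func main() {
	store := newDeleteExpectations()
	store.Expect("default/pclq-a", "pod-0") // left over from the first update
	store.Expect("default/pclq-b", "pod-9") // belongs to an unrelated PodClique

	store.Clear("default/pclq-a") // what a reset would do for this PodClique

	fmt.Println(store.Has("default/pclq-a", "pod-0")) // false
	fmt.Println(store.Has("default/pclq-b", "pod-9")) // true
}
```

Clearing by key keeps the reset scoped to the PodClique being restarted, so expectations for other PodCliques are untouched.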

Option 2: Treat All Non-Matching Pods as Needing Update

Modify computeUpdateWork() to not skip pods even if they have deletion expectations, when starting a fresh rolling update:

func (r _resource) computeUpdateWork(logger logr.Logger, sc *syncContext) *updateWork {
    work := &updateWork{}
    for _, pod := range sc.existingPCLQPods {
        if pod.Labels[common.LabelPodTemplateHash] != sc.expectedPodTemplateHash {
            // Check if this pod has a deletion triggered, but only skip if we're in the middle
            // of an ongoing update (not a fresh reset)
            if r.hasPodDeletionBeenTriggered(sc, pod) && !isRollingUpdateJustReset(sc.pclq) {
                logger.Info("skipping old Pod since its deletion has already been triggered", "pod", client.ObjectKeyFromObject(pod))
                continue
            }
            // ... rest of categorization ...
        }
    }
    return work
}

Option 3: Force Delete Orphaned Pods on Reset

When initOrResetRollingUpdate() is called, immediately delete all pods that don't match the new target hash:

func (r *Reconciler) initOrResetRollingUpdate(ctx context.Context, pcs *grovecorev1alpha1.PodCliqueSet, pclq *grovecorev1alpha1.PodClique) error {
    // ... compute new pod template hash ...
    
    // Delete all existing pods that don't match the new target hash
    existingPods, err := getExistingPods(ctx, r.client, pclq)
    if err != nil {
        return err
    }
    for _, pod := range existingPods {
        if pod.Labels[common.LabelPodTemplateHash] != podTemplateHash {
            logger.Info("Deleting orphaned pod from previous rolling update", "pod", pod.Name, "oldHash", pod.Labels[common.LabelPodTemplateHash], "newHash", podTemplateHash)
            if err := r.client.Delete(ctx, pod); err != nil {
                return err
            }
        }
    }
    
    // ... rest of existing code ...
}

What did you expect to happen?

See "What Should Happen (Expected Behavior)" above.

Environment

  • Kubernetes version
  • Grove version: v0.1.0-alpha.3
  • Scheduler details
  • Cloud provider or hardware configuration
  • Tools that you are using Grove together with
  • Anything else that is relevant
