What happened?
The Dynamo QA team tried to trigger a rolling update while an existing rolling update was still in progress.
When a new rolling update is triggered while a previous rolling update is still in progress, Grove's PodClique controller loses track of pods created during the first update attempt. This causes the rolling update to mark itself as "completed" (updateEndedAt is set) even though pods with incorrect template hashes are still running, leaving updatedReplicas: 0 permanently.
What Happened (Actual Behavior)
Scenario
- User initiates first rolling update by changing PodCliqueSet template (e.g., annotation nvidia.com/restartAt: "test2")
- Grove PodClique controller starts rolling update:
  - Sets rollingUpdateProgress.podTemplateHash to hash for test2
  - Begins deleting old pods
  - Creates new pod with test2 annotation and corresponding hash
- Before first update completes, user initiates second rolling update (e.g., annotation changes to nvidia.com/restartAt: "test3")
- Grove PodClique controller detects spec change and resets rolling update:
  - Calls initOrResetRollingUpdate()
  - Sets new rollingUpdateProgress.podTemplateHash to hash for test3
  - Sets updateStartedAt to current time
  - Resets updatedReplicas: 0
- Controller calls processPendingUpdates() to continue the rolling update
- Problem occurs:
  - computeUpdateWork() categorizes existing pods:
    - Pods with test2 hash don't match currentPodTemplateHash (old, pre-test2)
    - Pods with test2 hash also don't match rollingUpdateProgress.podTemplateHash (test3)
    - These "orphaned" pods are not categorized as needing update
  - getPodNamesPendingUpdate() returns empty list (no pods to update)
  - Controller reaches line 131-132 in rollingupdate.go and calls markRollingUpdateEnd()
- Result: Rolling update is marked as completed (updateEndedAt is set) but:
  - Pod still running with wrong hash (test2 instead of test3)
  - updatedReplicas: 0
  - readyReplicas: 1
  - Rolling update never actually completes
Evidence from Cluster
PodClique Status:
status:
  currentPodTemplateHash: c9cbdcf75cbb9d7df89 # Old hash (pre-test2)
  rollingUpdateProgress:
    podTemplateHash: d554cc997cfd9f64fd8 # Target hash (test3)
    updateStartedAt: "2026-02-04T18:10:34Z"
    updateEndedAt: "2026-02-04T18:10:34Z" # ❌ Marked as ended immediately
  readyReplicas: 1 # Pod is ready
  updatedReplicas: 0 # ❌ But not "updated"
  replicas: 1
Pod Labels:
metadata:
  labels:
    grove.io/pod-template-hash: b76488bc9dd56799677 # Hash from test2 (middle state)
  annotations:
    nvidia.com/restartAt: test2 # From first update attempt
PodClique Annotations (expected):
metadata:
  annotations:
    nvidia.com/restartAt: test3 # Second update attempt
Hash Mismatch:
- Pod actual hash: b76488bc9dd56799677 (from test2)
- PodClique current hash: c9cbdcf75cbb9d7df89 (old)
- PodClique target hash: d554cc997cfd9f64fd8 (test3)
- Pod doesn't match any of them → orphaned
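The three-way mismatch above can be checked mechanically. The following is a minimal sketch, not Grove code: the `hashRole` type and `classifyHash` function are hypothetical names introduced only for illustration, and the hash values are the ones reported in this issue.

```go
package main

import "fmt"

// hashRole is a hypothetical label describing how a pod's template hash
// relates to the hashes recorded in the PodClique status. It is not a Grove type.
type hashRole string

const (
	roleCurrent      hashRole = "current"
	roleTarget       hashRole = "target"
	roleIntermediate hashRole = "intermediate"
)

// classifyHash reports which recorded hash, if any, the pod's hash matches.
// A pod that matches neither hash comes from an interrupted update attempt.
func classifyHash(podHash, currentHash, targetHash string) hashRole {
	switch podHash {
	case targetHash:
		return roleTarget
	case currentHash:
		return roleCurrent
	default:
		return roleIntermediate
	}
}

func main() {
	const (
		currentHash = "c9cbdcf75cbb9d7df89" // status.currentPodTemplateHash (pre-test2)
		targetHash  = "d554cc997cfd9f64fd8" // rollingUpdateProgress.podTemplateHash (test3)
		podHash     = "b76488bc9dd56799677" // grove.io/pod-template-hash on the running pod (test2)
	)
	fmt.Println(classifyHash(podHash, currentHash, targetHash)) // prints "intermediate"
}
```

Any pod classified as intermediate is one that neither the old nor the new generation accounts for, which is the state observed on the cluster.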
Grove Operator Logs:
"msg":"components has registered a request to requeue post completion of all components syncs",
"kind":"PodCliqueSetReplica",
"message":"[Operation: Sync, Code: ERR_CONTINUE_RECONCILE_AND_REQUEUE] message: rolling update of PodCliqueSet replica index 0 is not completed"
The PodCliqueSet controller keeps requeueing because it sees the rolling update is "not completed" but the PodClique controller thinks it's done.
What Should Happen (Expected Behavior)
When a new rolling update is initiated while one is in progress:
- Grove should detect pods from the previous update attempt that don't match the new target hash
- These "orphaned" pods should be identified as needing replacement
- The rolling update should:
  - Delete the orphaned pods (with test2 hash)
  - Create new pods with the correct hash (test3)
  - Wait for new pods to become ready
  - Mark rolling update as completed only when all pods have the correct hash
- Final state should be:
  - updatedReplicas: 1
  - readyReplicas: 1
  - Pod hash matches rollingUpdateProgress.podTemplateHash
  - updateEndedAt set only when actual update is complete
Root Cause Analysis
Location
File: grove/operator/internal/controller/podclique/components/pod/rollingupdate.go
The Bug
In computeUpdateWork() (line 136-159):
func (r _resource) computeUpdateWork(logger logr.Logger, sc *syncContext) *updateWork {
	work := &updateWork{}
	for _, pod := range sc.existingPCLQPods {
		if pod.Labels[common.LabelPodTemplateHash] != sc.expectedPodTemplateHash {
			// Pod has OLD template hash - should be updated
			if r.hasPodDeletionBeenTriggered(sc, pod) {
				logger.Info("skipping old Pod since its deletion has already been triggered", "pod", client.ObjectKeyFromObject(pod))
				continue
			}
			// Categorize by health status...
			if k8sutils.IsPodPending(pod) {
				work.oldTemplateHashPendingPods = append(work.oldTemplateHashPendingPods, pod)
			} else if k8sutils.HasAnyStartedButNotReadyContainer(pod) || k8sutils.HasAnyContainerExitedErroneously(logger, pod) {
				work.oldTemplateHashUnhealthyPods = append(work.oldTemplateHashUnhealthyPods, pod)
			} else if k8sutils.IsPodReady(pod) {
				work.oldTemplateHashReadyPods = append(work.oldTemplateHashReadyPods, pod)
			}
		} else {
			// Pod has NEW template hash - already updated
			if k8sutils.IsPodReady(pod) {
				work.newTemplateHashReadyPods = append(work.newTemplateHashReadyPods, pod)
			}
		}
	}
	return work
}
The Problem:
The code compares each pod's hash against sc.expectedPodTemplateHash (the NEW target hash from rollingUpdateProgress.podTemplateHash). It assumes every pod either:
- Has the NEW hash (already updated)
- Has some OTHER hash (needs updating)
But it doesn't account for the case where:
- Pod has an INTERMEDIATE hash from a previous update attempt
- Pod is READY (passes all health checks)
- Pod's hash != expectedPodTemplateHash
- Pod's hash != currentPodTemplateHash (because currentPodTemplateHash is from pre-first-update)
When this happens:
- The check at line 138 evaluates to TRUE (pod hash != expectedPodTemplateHash)
- Pod is categorized into oldTemplateHashReadyPods
- BUT if the pod was created by a previous rolling update attempt, it may have already passed through the deletion check and is in a "healthy running" state
- The key issue: if this pod is the ONLY pod, and it's ready, the rolling update logic thinks "all old pods have been replaced" because it doesn't see any pods from the "true old" generation
Looking at the logs more carefully, the issue might be in the deletion expectation tracking. When initOrResetRollingUpdate() is called, it resets the rolling update progress but doesn't clear deletion expectations from the previous update attempt.
Alternative Root Cause
When initOrResetRollingUpdate() is called (line 149-167 in reconcilespec.go):
func (r *Reconciler) initOrResetRollingUpdate(ctx context.Context, pcs *grovecorev1alpha1.PodCliqueSet, pclq *grovecorev1alpha1.PodClique) error {
	podTemplateHash, err := componentutils.GetExpectedPCLQPodTemplateHash(pcs, pclq.ObjectMeta)
	if err != nil {
		return fmt.Errorf("could not update PodClique %s status with rolling update progress: %w", client.ObjectKeyFromObject(pclq), err)
	}
	// reset and start the rolling update
	patch := client.MergeFrom(pclq.DeepCopy())
	pclq.Status.RollingUpdateProgress = &grovecorev1alpha1.PodCliqueRollingUpdateProgress{
		UpdateStartedAt:            metav1.Now(),
		PodCliqueSetGenerationHash: *pcs.Status.CurrentGenerationHash,
		PodTemplateHash:            podTemplateHash,
	}
	// reset the updated replicas count to 0 so that the rolling update can start afresh.
	pclq.Status.UpdatedReplicas = 0
	if err = r.client.Status().Patch(ctx, pclq, patch); err != nil {
		return fmt.Errorf("failed to update PodClique %s status with rolling update progress: %w", client.ObjectKeyFromObject(pclq), err)
	}
	return nil
}
This resets RollingUpdateProgress but doesn't:
- Clear the expectations store for pending deletions from the previous update
- Clear ReadyPodsSelectedToUpdate (though it should be nil here)
- Force deletion of pods with intermediate hashes
So when the next reconciliation happens:
- Pod with test2 hash exists and is ready
- hasPodDeletionBeenTriggered() might return TRUE if a deletion was already expected from test2 update
- Pod gets skipped in the categorization (line 141-143)
- getPodNamesPendingUpdate() returns empty
- Rolling update marked as complete
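The sequence above can be reproduced in a small model. This sketch is an assumption-laden simplification: `pod` and `podNamesPendingUpdate` are hypothetical stand-ins for the real types and for the skip logic in computeUpdateWork(), and the expectations are keyed by pod name purely for illustration. Whether a stale expectation can actually match the test2 pod depends on how the real expectations store keys its entries, which this report does not show.

```go
package main

import "fmt"

// pod is a simplified stand-in for a Kubernetes Pod.
type pod struct {
	name         string
	templateHash string
	ready        bool
}

// podNamesPendingUpdate mimics the categorization described above: a pod whose
// hash differs from the target is pending an update, unless the controller
// believes its deletion has already been triggered, in which case it is skipped.
// deletionTriggered is a hypothetical view of the expectations store.
func podNamesPendingUpdate(pods []pod, targetHash string, deletionTriggered map[string]bool) []string {
	var pending []string
	for _, p := range pods {
		if p.templateHash == targetHash {
			continue // already updated
		}
		if deletionTriggered[p.name] {
			continue // skipped: deletion is assumed to be in flight
		}
		pending = append(pending, p.name)
	}
	return pending
}

func main() {
	pods := []pod{{name: "pclq-0", templateHash: "test2-hash", ready: true}}

	// With no recorded expectation, the hash comparison alone flags the pod.
	fmt.Println(len(podNamesPendingUpdate(pods, "test3-hash", nil))) // 1

	// With a stale expectation left over from the test2 update, the pod is
	// skipped, the pending list is empty, and the update would be marked ended.
	stale := map[string]bool{"pclq-0": true}
	fmt.Println(len(podNamesPendingUpdate(pods, "test3-hash", stale))) // 0
}
```

The first call shows that the hash check by itself does catch the intermediate pod; only the skip on a leftover expectation produces the empty list that leads to markRollingUpdateEnd().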
Reproduction Steps
- Create a PodCliqueSet with a PodClique (any spec)
- Wait for pods to become ready
- Update PodCliqueSet template (e.g., change annotation nvidia.com/restartAt: "test1")
- Grove starts rolling update, begins deleting/recreating pods
- Before all pods are recreated, update PodCliqueSet template again (e.g., change annotation to nvidia.com/restartAt: "test2")
- Observe:
  - PodClique rollingUpdateProgress.updateEndedAt is set
  - PodClique updatedReplicas: 0
  - Pods running with hash from first update (test1) instead of second (test2)
  - Rolling update never completes
Impact
- High: Rolling updates can get permanently stuck when rapid changes are made
- Users cannot complete deployments after rapid consecutive restarts
- Requires manual intervention (deleting the PodClique or pods) to recover
- Affects any workload using PodClique/PodCliqueSet for orchestration
- Common in CI/CD scenarios or when users issue multiple restart commands
Suggested Fix
The fix should be in computeUpdateWork() or when calling initOrResetRollingUpdate():
Option 1: Clear Expectations on Reset
When initOrResetRollingUpdate() is called, also clear the expectations store for this PodClique:
func (r *Reconciler) initOrResetRollingUpdate(ctx context.Context, pcs *grovecorev1alpha1.PodCliqueSet, pclq *grovecorev1alpha1.PodClique) error {
	// ... existing code ...
	// Clear deletion expectations from previous rolling update attempt
	pclqExpectationsStoreKey := expectationstore.GetKey(pclq)
	r.expectationsStore.ClearDeleteExpectations(pclqExpectationsStoreKey)
	// ... rest of existing code ...
}
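To make the intent of Option 1 concrete, here is a minimal in-memory sketch of a store that supports clearing delete expectations per PodClique. Everything in it is hypothetical: the `expectationsStore` type, its method names, and the string keys are assumptions for illustration and do not reflect the actual Grove expectationstore API.

```go
package main

import (
	"fmt"
	"sync"
)

// expectationsStore is a hypothetical in-memory store of pending pod
// deletions, grouped by PodClique key. It is not the Grove implementation.
type expectationsStore struct {
	mu      sync.Mutex
	deletes map[string]map[string]struct{} // PodClique key -> pod identifiers
}

func newExpectationsStore() *expectationsStore {
	return &expectationsStore{deletes: make(map[string]map[string]struct{})}
}

// ExpectDelete records that a deletion has been issued for the given pod.
func (s *expectationsStore) ExpectDelete(pclqKey, podID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.deletes[pclqKey] == nil {
		s.deletes[pclqKey] = make(map[string]struct{})
	}
	s.deletes[pclqKey][podID] = struct{}{}
}

// HasDeleteExpectation reports whether a deletion is still expected for the pod.
func (s *expectationsStore) HasDeleteExpectation(pclqKey, podID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.deletes[pclqKey][podID]
	return ok
}

// ClearDeleteExpectations drops every delete expectation for one PodClique,
// as would happen when a rolling update is reset.
func (s *expectationsStore) ClearDeleteExpectations(pclqKey string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.deletes, pclqKey)
}

func main() {
	store := newExpectationsStore()
	store.ExpectDelete("default/my-pclq", "pod-0")
	fmt.Println(store.HasDeleteExpectation("default/my-pclq", "pod-0")) // true

	// Simulate the reset performed when a second rolling update starts.
	store.ClearDeleteExpectations("default/my-pclq")
	fmt.Println(store.HasDeleteExpectation("default/my-pclq", "pod-0")) // false
}
```

Clearing by PodClique key keeps expectations for other PodCliques intact, so a reset of one rolling update would not disturb updates running elsewhere.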
Option 2: Treat All Non-Matching Pods as Needing Update
Modify computeUpdateWork() to not skip pods even if they have deletion expectations, when starting a fresh rolling update:
func (r _resource) computeUpdateWork(logger logr.Logger, sc *syncContext) *updateWork {
	work := &updateWork{}
	for _, pod := range sc.existingPCLQPods {
		if pod.Labels[common.LabelPodTemplateHash] != sc.expectedPodTemplateHash {
			// Check if this pod has a deletion triggered, but only skip if we're in the middle
			// of an ongoing update (not a fresh reset).
			// isRollingUpdateJustReset is a proposed helper that is not defined in this report.
			if r.hasPodDeletionBeenTriggered(sc, pod) && !isRollingUpdateJustReset(sc.pclq) {
				logger.Info("skipping old Pod since its deletion has already been triggered", "pod", client.ObjectKeyFromObject(pod))
				continue
			}
			// ... rest of categorization ...
		}
	}
	return work
}
Option 3: Force Delete Orphaned Pods on Reset
When initOrResetRollingUpdate() is called, immediately delete all pods that don't match the new target hash:
func (r *Reconciler) initOrResetRollingUpdate(ctx context.Context, pcs *grovecorev1alpha1.PodCliqueSet, pclq *grovecorev1alpha1.PodClique) error {
	// ... compute new pod template hash ...
	// Delete all existing pods that don't match the new target hash
	existingPods, err := getExistingPods(ctx, r.client, pclq)
	if err != nil {
		return err
	}
	for _, pod := range existingPods {
		if pod.Labels[common.LabelPodTemplateHash] != podTemplateHash {
			logger.Info("Deleting orphaned pod from previous rolling update", "pod", pod.Name, "oldHash", pod.Labels[common.LabelPodTemplateHash], "newHash", podTemplateHash)
			if err := r.client.Delete(ctx, pod); err != nil {
				return err
			}
		}
	}
	// ... rest of existing code ...
}
What did you expect to happen?
see above
Environment
- Kubernetes version
- Grove version: v0.1.0-alpha.3
- Scheduler details
- Cloud provider or hardware configuration
- Tools that you are using Grove together with
- Anything else that is relevant