Rolling Update Gets Stuck When New Update Initiated During In-Progress Update #400

### What happened?

The Dynamo QA team triggered a new rolling update while an existing rolling update was still in progress.

When a new rolling update is triggered while a previous rolling update is still in progress, Grove's PodClique controller loses track of pods created during the first update attempt. This causes the rolling update to mark itself as "completed" (`updateEndedAt` is set) even though pods with incorrect template hashes are still running, leaving `updatedReplicas: 0` permanently.

## What Happened (Actual Behavior)

### Scenario
1. User initiates first rolling update by changing PodCliqueSet template (e.g., annotation `nvidia.com/restartAt: "test2"`)
2. Grove PodClique controller starts rolling update:
   - Sets `rollingUpdateProgress.podTemplateHash` to hash for test2
   - Begins deleting old pods
   - Creates new pod with test2 annotation and corresponding hash
3. **Before first update completes**, user initiates second rolling update (e.g., annotation changes to `nvidia.com/restartAt: "test3"`)
4. Grove PodClique controller detects spec change and resets rolling update:
   - Calls `initOrResetRollingUpdate()` 
   - Sets new `rollingUpdateProgress.podTemplateHash` to hash for test3
   - Sets `updateStartedAt` to current time
   - Resets `updatedReplicas: 0`
5. Controller calls `processPendingUpdates()` to continue the rolling update
6. **Problem occurs**: `computeUpdateWork()` categorizes existing pods:
   - Pods with test2 hash don't match `currentPodTemplateHash` (old, pre-test2)
   - Pods with test2 hash also don't match `rollingUpdateProgress.podTemplateHash` (test3)
   - These "orphaned" pods are **not categorized** as needing update
7. `getPodNamesPendingUpdate()` returns **empty list** (no pods to update)
8. Controller reaches lines 131-132 in `rollingupdate.go` and calls `markRollingUpdateEnd()`
9. **Result**: Rolling update is marked as completed (`updateEndedAt` is set) but:
   - Pod still running with wrong hash (test2 instead of test3)
   - `updatedReplicas: 0` 
   - `readyReplicas: 1`
   - Rolling update never actually completes
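
The status transitions in steps 4 to 9 can be modeled with a small, self-contained Go sketch. The types and functions below (`progress`, `status`, `resetRollingUpdate`, `processPendingUpdates`, `simulate`) are simplified stand-ins invented for illustration and are not Grove's actual API. The only behavior taken from this report is that an empty pending list leads to the update being marked as ended.

```go
package main

import (
	"fmt"
	"time"
)

// progress is a simplified stand-in for status.rollingUpdateProgress.
type progress struct {
	podTemplateHash string
	updateStartedAt time.Time
	updateEndedAt   *time.Time
}

// status is a simplified stand-in for the PodClique status.
type status struct {
	updatedReplicas int
	readyReplicas   int
	progress        *progress
}

// resetRollingUpdate models step 4: a new target hash restarts the update
// and zeroes updatedReplicas.
func resetRollingUpdate(s *status, targetHash string, now time.Time) {
	s.progress = &progress{podTemplateHash: targetHash, updateStartedAt: now}
	s.updatedReplicas = 0
}

// processPendingUpdates models steps 7 to 9: with nothing pending, the update
// is marked as ended without checking which hash the running pods carry.
func processPendingUpdates(s *status, pendingPodNames []string, now time.Time) {
	if len(pendingPodNames) == 0 && s.progress.updateEndedAt == nil {
		s.progress.updateEndedAt = &now
	}
}

// simulate replays the second update (test3) with an empty pending list,
// which is what the report observes for the pod still carrying the test2 hash.
func simulate() status {
	now := time.Date(2026, time.February, 4, 18, 10, 34, 0, time.UTC)
	s := status{readyReplicas: 1}
	resetRollingUpdate(&s, "d554cc997cfd9f64fd8", now)
	processPendingUpdates(&s, nil, now)
	return s
}

func main() {
	s := simulate()
	fmt.Printf("ended=%t updatedReplicas=%d readyReplicas=%d\n",
		s.progress.updateEndedAt != nil, s.updatedReplicas, s.readyReplicas)
	// prints "ended=true updatedReplicas=0 readyReplicas=1"
}
```

In this model `updateStartedAt` and `updateEndedAt` end up identical, which mirrors the timestamps seen in the cluster status.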

### Evidence from Cluster

**PodClique Status:**
```yaml
status:
  currentPodTemplateHash: c9cbdcf75cbb9d7df89  # Old hash (pre-test2)
  rollingUpdateProgress:
    podTemplateHash: d554cc997cfd9f64fd8       # Target hash (test3)
    updateStartedAt: "2026-02-04T18:10:34Z"
    updateEndedAt: "2026-02-04T18:10:34Z"      # ❌ Marked as ended immediately
  readyReplicas: 1                             # Pod is ready
  updatedReplicas: 0                           # ❌ But not "updated"
  replicas: 1
```

**Pod Labels:**
```yaml
metadata:
  labels:
    grove.io/pod-template-hash: b76488bc9dd56799677  # Hash from test2 (middle state)
  annotations:
    nvidia.com/restartAt: test2                      # From first update attempt
```

**PodClique Annotations (expected):**
```yaml
metadata:
  annotations:
    nvidia.com/restartAt: test3  # Second update attempt
```

**Hash Mismatch:**
- Pod actual hash: `b76488bc9dd56799677` (from test2)
- PodClique current hash: `c9cbdcf75cbb9d7df89` (old)
- PodClique target hash: `d554cc997cfd9f64fd8` (test3)
- Pod doesn't match any of them → **orphaned**
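
To make the three-way mismatch concrete, here is a minimal Go sketch that compares the pod's hash against the two hashes tracked in the PodClique status. `classifyPodHash` and its return labels are hypothetical names used only for this illustration; the hash values are the ones observed above.

```go
package main

import "fmt"

// classifyPodHash reports how a pod's template hash relates to the current
// and target hashes in the PodClique status. The labels are illustrative.
func classifyPodHash(podHash, currentHash, targetHash string) string {
	switch podHash {
	case targetHash:
		return "updated"
	case currentHash:
		return "old"
	default:
		// Neither the pre-update hash nor the target hash: the pod was
		// created by an update attempt that has since been superseded.
		return "orphaned"
	}
}

func main() {
	const (
		currentHash = "c9cbdcf75cbb9d7df89" // status.currentPodTemplateHash
		targetHash  = "d554cc997cfd9f64fd8" // rollingUpdateProgress.podTemplateHash (test3)
		podHash     = "b76488bc9dd56799677" // label on the running pod (test2)
	)
	fmt.Println(classifyPodHash(podHash, currentHash, targetHash)) // prints "orphaned"
}
```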

**Grove Operator Logs:**
```
"msg":"components has registered a request to requeue post completion of all components syncs",
"kind":"PodCliqueSetReplica",
"message":"[Operation: Sync, Code: ERR_CONTINUE_RECONCILE_AND_REQUEUE] message: rolling update of PodCliqueSet replica index 0 is not completed"
```

The PodCliqueSet controller keeps requeueing because it sees the rolling update is "not completed" but the PodClique controller thinks it's done.

## What Should Happen (Expected Behavior)

When a new rolling update is initiated while one is in progress:

1. Grove should detect pods from the previous update attempt that don't match the new target hash
2. These "orphaned" pods should be identified as needing replacement
3. The rolling update should:
   - Delete the orphaned pods (with test2 hash)
   - Create new pods with the correct hash (test3)
   - Wait for new pods to become ready
   - Mark rolling update as completed only when all pods have the correct hash
4. Final state should be:
   - `updatedReplicas: 1`
   - `readyReplicas: 1`
   - Pod hash matches `rollingUpdateProgress.podTemplateHash`
   - `updateEndedAt` set only when actual update is complete
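
The completion condition described above could be expressed as a predicate along the following lines. This is a sketch under assumptions: `podState`, `countUpdatedReady`, and `rollingUpdateComplete` are hypothetical names, and the real controller may need further conditions (for example, pods that are terminating).

```go
package main

import "fmt"

// podState is a simplified view of a pod: its template hash label and readiness.
type podState struct {
	templateHash string
	ready        bool
}

// countUpdatedReady returns the number of pods that carry the target hash and are ready.
func countUpdatedReady(pods []podState, targetHash string) int {
	n := 0
	for _, p := range pods {
		if p.templateHash == targetHash && p.ready {
			n++
		}
	}
	return n
}

// rollingUpdateComplete reports whether the update may be marked as ended:
// no pod with a non-target hash remains, and enough updated pods are ready.
func rollingUpdateComplete(pods []podState, targetHash string, desiredReplicas int) bool {
	for _, p := range pods {
		if p.templateHash != targetHash {
			return false // a pod from an older or intermediate template still exists
		}
	}
	return countUpdatedReady(pods, targetHash) >= desiredReplicas
}

func main() {
	const target = "d554cc997cfd9f64fd8"
	stuck := []podState{{templateHash: "b76488bc9dd56799677", ready: true}}
	done := []podState{{templateHash: target, ready: true}}
	fmt.Println(rollingUpdateComplete(stuck, target, 1)) // prints "false"
	fmt.Println(rollingUpdateComplete(done, target, 1))  // prints "true"
}
```

With a check of this shape, the state observed in the cluster (one ready pod carrying the test2 hash) would not count as a completed update.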

## Root Cause Analysis

### Location
File: `grove/operator/internal/controller/podclique/components/pod/rollingupdate.go`

### The Bug

**In `computeUpdateWork()` (lines 136-159):**

```go
func (r _resource) computeUpdateWork(logger logr.Logger, sc *syncContext) *updateWork {
    work := &updateWork{}
    for _, pod := range sc.existingPCLQPods {
        if pod.Labels[common.LabelPodTemplateHash] != sc.expectedPodTemplateHash {
            // Pod has OLD template hash - should be updated
            if r.hasPodDeletionBeenTriggered(sc, pod) {
                logger.Info("skipping old Pod since its deletion has already been triggered", "pod", client.ObjectKeyFromObject(pod))
                continue
            }
            // Categorize by health status...
            if k8sutils.IsPodPending(pod) {
                work.oldTemplateHashPendingPods = append(work.oldTemplateHashPendingPods, pod)
            } else if k8sutils.HasAnyStartedButNotReadyContainer(pod) || k8sutils.HasAnyContainerExitedErroneously(logger, pod) {
                work.oldTemplateHashUnhealthyPods = append(work.oldTemplateHashUnhealthyPods, pod)
            } else if k8sutils.IsPodReady(pod) {
                work.oldTemplateHashReadyPods = append(work.oldTemplateHashReadyPods, pod)
            }
        } else {
            // Pod has NEW template hash - already updated
            if k8sutils.IsPodReady(pod) {
                work.newTemplateHashReadyPods = append(work.newTemplateHashReadyPods, pod)
            }
        }
    }
    return work
}
```

**The Problem:**

The code compares pods against `sc.expectedPodTemplateHash` (which is the NEW target hash from `rollingUpdateProgress.podTemplateHash`). It assumes every pod falls into one of two groups:
- Pods with the NEW hash (already updated)
- Pods with some OTHER hash (need updating)

**But it doesn't account for the case where:**
- Pod has an INTERMEDIATE hash from a previous update attempt
- Pod is READY (passes all health checks)
- Pod's hash != expectedPodTemplateHash
- Pod's hash != currentPodTemplateHash (because currentPodTemplateHash is from pre-first-update)

When this happens:
1. The check at line 138 evaluates to TRUE (pod hash != expectedPodTemplateHash)
2. Pod is categorized into `oldTemplateHashReadyPods` 
3. **BUT** if the pod was created by a previous rolling update attempt, it may have already passed through the deletion check and is in a "healthy running" state
4. The key issue: if this pod is the ONLY pod, and it's ready, the rolling update logic thinks "all old pods have been replaced" because it doesn't see any pods from the "true old" generation

Looking at the logs more carefully, the issue may instead lie in the deletion expectation tracking: when `initOrResetRollingUpdate()` is called, it resets the rolling update progress but does not clear deletion expectations left over from the previous update attempt.

### Alternative Root Cause

When `initOrResetRollingUpdate()` is called (line 149-167 in `reconcilespec.go`):

```go
func (r *Reconciler) initOrResetRollingUpdate(ctx context.Context, pcs *grovecorev1alpha1.PodCliqueSet, pclq *grovecorev1alpha1.PodClique) error {
    podTemplateHash, err := componentutils.GetExpectedPCLQPodTemplateHash(pcs, pclq.ObjectMeta)
    if err != nil {
        return fmt.Errorf("could not update PodClique %s status with rolling update progress: %w", client.ObjectKeyFromObject(pclq), err)
    }
    // reset and start the rolling update
    patch := client.MergeFrom(pclq.DeepCopy())
    pclq.Status.RollingUpdateProgress = &grovecorev1alpha1.PodCliqueRollingUpdateProgress{
        UpdateStartedAt:            metav1.Now(),
        PodCliqueSetGenerationHash: *pcs.Status.CurrentGenerationHash,
        PodTemplateHash:            podTemplateHash,
    }
    // reset the updated replicas count to 0 so that the rolling update can start afresh.
    pclq.Status.UpdatedReplicas = 0
    if err = r.client.Status().Patch(ctx, pclq, patch); err != nil {
        return fmt.Errorf("failed to update PodClique %s status with rolling update progress: %w", client.ObjectKeyFromObject(pclq), err)
    }
    return nil
}
```

This resets `RollingUpdateProgress` but **doesn't**:
1. Clear the expectations store for pending deletions from the previous update
2. Clear `ReadyPodsSelectedToUpdate` (though it should be nil here)
3. Force deletion of pods with intermediate hashes

So when the next reconciliation happens:
- Pod with test2 hash exists and is ready
- `hasPodDeletionBeenTriggered()` might return TRUE if a deletion expectation for this pod is still recorded from the test2 update
- Pod gets skipped in the categorization (line 141-143)
- `getPodNamesPendingUpdate()` returns empty
- Rolling update marked as complete
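
The suspected mechanism can be illustrated with a toy expectations store. Everything in this sketch is hypothetical (`deleteExpectations`, `pendingUpdate`, the `default/pclq` key, the pod name `pod-0`) and does not reflect Grove's real expectations store; it only shows how a stale delete expectation would hide a non-matching pod, and how clearing it makes the pod visible again.

```go
package main

import "fmt"

// deleteExpectations is a hypothetical, simplified expectations store:
// PodClique key -> set of pod names whose deletion is believed to be in flight.
type deleteExpectations map[string]map[string]struct{}

// expect records that a deletion of podName has been triggered.
func (e deleteExpectations) expect(pclqKey, podName string) {
	if e[pclqKey] == nil {
		e[pclqKey] = map[string]struct{}{}
	}
	e[pclqKey][podName] = struct{}{}
}

// has reports whether a delete expectation is recorded for podName.
func (e deleteExpectations) has(pclqKey, podName string) bool {
	_, ok := e[pclqKey][podName]
	return ok
}

// clear drops all delete expectations for one PodClique.
func (e deleteExpectations) clear(pclqKey string) { delete(e, pclqKey) }

// pendingUpdate lists pods whose hash differs from the target, skipping any
// pod that still has a recorded delete expectation.
func pendingUpdate(e deleteExpectations, pclqKey string, podHashes map[string]string, targetHash string) []string {
	var pending []string
	for name, hash := range podHashes {
		if hash != targetHash && !e.has(pclqKey, name) {
			pending = append(pending, name)
		}
	}
	return pending
}

func main() {
	const key, target = "default/pclq", "d554cc997cfd9f64fd8"
	store := deleteExpectations{}
	pods := map[string]string{"pod-0": "b76488bc9dd56799677"} // pod with the test2 hash

	store.expect(key, "pod-0") // assumed leftover from the test2 update
	fmt.Println(len(pendingUpdate(store, key, pods, target))) // prints "0"

	store.clear(key) // dropping stale expectations when the update is reset
	fmt.Println(len(pendingUpdate(store, key, pods, target))) // prints "1"
}
```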

## Reproduction Steps

1. Create a PodCliqueSet with a PodClique (any spec)
2. Wait for pods to become ready
3. Update PodCliqueSet template (e.g., change annotation `nvidia.com/restartAt: "test1"`)
4. Grove starts rolling update, begins deleting/recreating pods
5. **Before all pods are recreated**, update PodCliqueSet template again (e.g., change annotation to `nvidia.com/restartAt: "test2"`)
6. Observe:
   - PodClique `rollingUpdateProgress.updateEndedAt` is set
   - PodClique `updatedReplicas: 0`
   - Pods running with hash from first update (test1) instead of second (test2)
   - Rolling update never completes

## Impact

- **High**: Rolling updates can get permanently stuck when rapid changes are made
- Users cannot complete deployments after rapid consecutive restarts
- Requires manual intervention (deleting the PodClique or pods) to recover
- Affects any workload using PodClique/PodCliqueSet for orchestration
- Common in CI/CD scenarios or when users issue multiple restart commands

## Suggested Fix

The fix belongs either in `computeUpdateWork()` or in `initOrResetRollingUpdate()`:

### Option 1: Clear Expectations on Reset
When `initOrResetRollingUpdate()` is called, also clear the expectations store for this PodClique:

```go
func (r *Reconciler) initOrResetRollingUpdate(ctx context.Context, pcs *grovecorev1alpha1.PodCliqueSet, pclq *grovecorev1alpha1.PodClique) error {
    // ... existing code ...
    
    // Clear deletion expectations from previous rolling update attempt
    pclqExpectationsStoreKey := expectationstore.GetKey(pclq)
    r.expectationsStore.ClearDeleteExpectations(pclqExpectationsStoreKey)
    
    // ... rest of existing code ...
}
```

### Option 2: Treat All Non-Matching Pods as Needing Update
Modify `computeUpdateWork()` to not skip pods even if they have deletion expectations, when starting a fresh rolling update:

```go
func (r _resource) computeUpdateWork(logger logr.Logger, sc *syncContext) *updateWork {
    work := &updateWork{}
    for _, pod := range sc.existingPCLQPods {
        if pod.Labels[common.LabelPodTemplateHash] != sc.expectedPodTemplateHash {
            // Check if this pod has a deletion triggered, but only skip if we're in the middle
            // of an ongoing update (not a fresh reset).
            // NOTE: isRollingUpdateJustReset is a hypothetical helper that does not exist yet.
            if r.hasPodDeletionBeenTriggered(sc, pod) && !isRollingUpdateJustReset(sc.pclq) {
                logger.Info("skipping old Pod since its deletion has already been triggered", "pod", client.ObjectKeyFromObject(pod))
                continue
            }
            // ... rest of categorization ...
        }
    }
    return work
}
```

### Option 3: Force Delete Orphaned Pods on Reset
When `initOrResetRollingUpdate()` is called, immediately delete all pods that don't match the new target hash:

```go
func (r *Reconciler) initOrResetRollingUpdate(ctx context.Context, pcs *grovecorev1alpha1.PodCliqueSet, pclq *grovecorev1alpha1.PodClique) error {
    // ... compute new pod template hash ...
    
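    // Delete all existing pods that don't match the new target hash.
    // NOTE: getExistingPods is a hypothetical helper that lists the pods owned by pclq.
    // The logger is assumed to come from controller-runtime's log package; use the
    // reconciler's own logger if one is available.
    logger := log.FromContext(ctx)
    existingPods, err := getExistingPods(ctx, r.client, pclq)
    if err != nil {
        return fmt.Errorf("failed to list pods of PodClique %s: %w", client.ObjectKeyFromObject(pclq), err)
    }
    for _, pod := range existingPods {
        if pod.Labels[common.LabelPodTemplateHash] != podTemplateHash {
            logger.Info("Deleting orphaned pod from previous rolling update", "pod", pod.Name, "oldHash", pod.Labels[common.LabelPodTemplateHash], "newHash", podTemplateHash)
            // A pod that is already gone is not an error.
            if err := client.IgnoreNotFound(r.client.Delete(ctx, pod)); err != nil {
                return err
            }
        }
    }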
    
    // ... rest of existing code ...
}
```

### What did you expect to happen?

See "What Should Happen (Expected Behavior)" above.

### Environment

- Kubernetes version
- Grove version: v0.1.0-alpha.3
- Scheduler details
- Cloud provider or hardware configuration
- Tools that you are using Grove together with
- Anything else that is relevant

