Summary
Rolling updates can get permanently stuck due to a race condition between when UpdateEndedAt is set and when pods are counted. The test Test_RU9_RollingUpdateAllPodCliques fails intermittently with a 4-minute timeout waiting for the rolling update to complete.
What Should Happen
- Update in progress: Old pods are deleted, new pods are created.
  - CurrentPodTemplateHash = OLD hash
  - RollingUpdateProgress.PodTemplateHash = NEW hash
  - UpdateEndedAt = nil → mutateUpdatedReplica() counts against NEW hash ✓
  - UpdatedReplicas increases as new pods become Ready
- All new pods become Ready:
  - UpdatedReplicas = 2, Replicas = 2
  - All old pods have been deleted
- markRollingUpdateEnd() sets UpdateEndedAt:
  - UpdateEndedAt is now set → IsPCLQUpdateInProgress() returns false
  - At this moment: UpdatedReplicas = 2, Replicas = 2 ✓
- Next status reconciliation, where the hash switch happens safely:
  - mutateCurrentHashes() runs first:
    - IsPCLQUpdateInProgress()? → NO (because UpdateEndedAt is set)
    - UpdatedReplicas == Replicas? → YES (2 == 2) ✓
    - Updates CurrentPodTemplateHash to the NEW hash
  - mutateUpdatedReplica() runs next:
    - IsPCLQUpdateInProgress()? → NO (because UpdateEndedAt is set)
    - Uses CurrentPodTemplateHash (now the NEW hash!)
    - Counts pods matching NEW hash → 2 pods match ✓
    - UpdatedReplicas = 2
- Update completes successfully.
What Happened (The Bug)
- Update in progress: Old pods are deleted, new pods are created.
  - CurrentPodTemplateHash = OLD hash
  - RollingUpdateProgress.PodTemplateHash = NEW hash
  - UpdateEndedAt = nil → mutateUpdatedReplica() counts against NEW hash ✓
  - UpdatedReplicas = 1 (one new pod is Ready)
- Last old pod deleted, but one new pod is still becoming Ready:
  - markRollingUpdateEnd() is called (no old pods remain)
  - Sets UpdateEndedAt → IsPCLQUpdateInProgress() now returns false
  - At this moment: UpdatedReplicas = 1, Replicas = 2
- Next status reconciliation (the breaking moment):
  - mutateCurrentHashes() runs first:
    - IsPCLQUpdateInProgress()? → NO (because UpdateEndedAt is set)
    - UpdatedReplicas == Replicas? → NO (1 != 2)
    - Refuses to update CurrentPodTemplateHash, so it stays as the OLD hash
  - mutateUpdatedReplica() runs next:
    - IsPCLQUpdateInProgress()? → NO (because UpdateEndedAt is set)
    - Uses CurrentPodTemplateHash (the OLD hash!)
    - Counts pods matching OLD hash → 0 pods match (all pods have NEW hash)
    - Sets UpdatedReplicas = 0
- Deadlock on every subsequent reconciliation:
  - mutateCurrentHashes(): UpdatedReplicas (0) != Replicas (2) → refuses to update hash
  - mutateUpdatedReplica(): uses OLD hash → counts 0 pods → UpdatedReplicas = 0
  - UpdatedReplicas stays at 0 forever
The system is permanently stuck. CurrentPodTemplateHash will never be updated because UpdatedReplicas != Replicas, but UpdatedReplicas will never be correct because it's counting against the wrong hash.
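The two sequences above can be replayed with a small model. The following is only a sketch: `pclqStatus`, `simulate`, and the string hashes `"OLD"`/`"NEW"` are hypothetical stand-ins, and the two mutator functions reproduce just the decision logic described in this issue, not the controller's actual code. It assumes `IsPCLQUpdateInProgress()` is equivalent to `UpdateEndedAt` being unset.

```go
package main

import "fmt"

// pclqStatus is a simplified, hypothetical model of the status fields discussed above.
type pclqStatus struct {
	Replicas               int
	UpdatedReplicas        int
	CurrentPodTemplateHash string
	TargetPodTemplateHash  string // models RollingUpdateProgress.PodTemplateHash
	UpdateEnded            bool   // models UpdateEndedAt != nil
}

// mutateCurrentHashes (modeled) promotes the target hash only after the update
// has ended and every replica is counted as updated.
func mutateCurrentHashes(s *pclqStatus) {
	if s.UpdateEnded && s.UpdatedReplicas == s.Replicas {
		s.CurrentPodTemplateHash = s.TargetPodTemplateHash
	}
}

// mutateUpdatedReplica (modeled) counts Ready pods against the target hash while
// the update is in progress, and against the current hash once it has ended.
func mutateUpdatedReplica(s *pclqStatus, readyPodHashes []string) {
	want := s.TargetPodTemplateHash
	if s.UpdateEnded {
		want = s.CurrentPodTemplateHash
	}
	count := 0
	for _, h := range readyPodHashes {
		if h == want {
			count++
		}
	}
	s.UpdatedReplicas = count
}

// simulate marks the update as ended with the given UpdatedReplicas value, then
// runs several status reconciliations in which both NEW pods are Ready.
func simulate(updatedAtEnd, rounds int) pclqStatus {
	s := pclqStatus{
		Replicas:               2,
		UpdatedReplicas:        updatedAtEnd,
		CurrentPodTemplateHash: "OLD",
		TargetPodTemplateHash:  "NEW",
		UpdateEnded:            true,
	}
	ready := []string{"NEW", "NEW"}
	for i := 0; i < rounds; i++ {
		mutateCurrentHashes(&s) // runs first, as described above
		mutateUpdatedReplica(&s, ready)
	}
	return s
}

func main() {
	fmt.Printf("ended at 2/2: %+v\n", simulate(2, 5))
	fmt.Printf("ended at 1/2: %+v\n", simulate(1, 5))
}
```

In this model, ending the update at 2/2 converges to the NEW hash with `UpdatedReplicas` of 2. Ending it at 1/2 leaves the status at the OLD hash with `UpdatedReplicas` of 0 after every round, even though both NEW pods are Ready.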
Evidence from Diagnostics
Attachment: e2eRU9-v2.diag.txt
1. PodClique Shows Contradictory State
# From diagnostic dump
status:
  replicas: 2
  updatedReplicas: 0                                       # ← Should be 2!
  currentPodCliqueSetGenerationHash: f6c5fd949cb444c6558   # ← OLD hash
  currentPodTemplateHash: 9f945b7b659c5f4cb8f              # ← OLD hash (never updated)
  rollingUpdateProgress:
    podCliqueSetGenerationHash: c947687c4d79f9cfb59        # ← NEW target hash
    podTemplateHash: 574f6fb86d49bfdfbd9d                  # ← NEW target hash
    updateStartedAt: "2026-01-13T00:38:19Z"
    updateEndedAt: "2026-01-13T00:38:19Z"                  # ← Update marked COMPLETE!
The contradiction:
- updateEndedAt is set → the rolling update component thinks it's done
- currentPodTemplateHash has the OLD value → the hash was never updated
- updatedReplicas: 0 → pods are being counted against the wrong hash
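The three conditions above can be combined into one predicate for spotting this state in a status dump. This is a sketch only: `statusSnapshot`, `dumpSnapshot`, and `looksStuck` are hypothetical names introduced for illustration, and the field values are copied from the diagnostic dump above.

```go
package main

import "fmt"

// statusSnapshot is a hypothetical, flattened view of the status fields
// that matter for the contradiction described above.
type statusSnapshot struct {
	Replicas               int
	UpdatedReplicas        int
	CurrentPodTemplateHash string
	TargetPodTemplateHash  string // rollingUpdateProgress.podTemplateHash
	UpdateEndedAt          string // empty when unset
}

// dumpSnapshot returns the values taken from the diagnostic dump in this issue.
func dumpSnapshot() statusSnapshot {
	return statusSnapshot{
		Replicas:               2,
		UpdatedReplicas:        0,
		CurrentPodTemplateHash: "9f945b7b659c5f4cb8f",
		TargetPodTemplateHash:  "574f6fb86d49bfdfbd9d",
		UpdateEndedAt:          "2026-01-13T00:38:19Z",
	}
}

// looksStuck reports whether the update is marked as ended while the current
// hash was never promoted and the updated replica count disagrees with replicas.
func looksStuck(s statusSnapshot) bool {
	ended := s.UpdateEndedAt != ""
	hashNotPromoted := s.CurrentPodTemplateHash != s.TargetPodTemplateHash
	countMismatch := s.UpdatedReplicas != s.Replicas
	return ended && hashNotPromoted && countMismatch
}

func main() {
	fmt.Println("looks stuck:", looksStuck(dumpSnapshot()))
}
```

A check like this could serve as an assertion in the e2e test's failure diagnostics. Whether it has false positives in other controller states is not established here.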
2. Controller Logs Show the Loop
{
"level": "info",
"ts": "2026-01-13T00:38:29.260Z",
"logger": "podclique-controller",
"msg": "PodClique is currently updating, cannot set PodCliqueSet CurrentGenerationHash yet",
"PodClique": {"name": "workload1-0-pc-a", "namespace": "default"}
}
This was logged 10 seconds after updateEndedAt was set. The controller is stuck refusing to update the hash because UpdatedReplicas != Replicas.
3. Timeline of Events
| Time | Event | State |
|---|---|---|
| 00:38:18 | Rolling update triggered | updateStartedAt set |
| 00:38:19 | updateEndedAt set | Old pods gone, but UpdatedReplicas may not equal Replicas |
| 00:38:19+ | First broken reconciliation | mutateUpdatedReplica() switches to OLD hash, counts 0 |
| 00:38:29 | Controller logs "cannot set hash" | Deadlock confirmed |
| 00:42:18 | Test timeout | 4 minutes with no progress |