Summary
E2E test cleanup fails intermittently with a 60-second timeout because pods belonging to PCSG-owned PodCliques are being orphaned. The root cause is a bug in the label selector logic used during PodClique deletion.
cleanup-failure.diag.txt
Error Message
setup.go:184: Failed to cleanup workloads: failed to delete all resources and pods: condition not met within timeout of 1m0s
Affected Test
Test_GS10_GangSchedulingWithPCSScalingMinReplicasAdvanced
Timeline of Events
| Timestamp | Event |
|---|---|
| 01:22:52.306Z | Test completed successfully |
| 01:22:52.378Z | PodClique workload2-1-pc-a deletion triggered |
| 01:22:52.527Z | PodCliqueSet workload2 waiting for children: [workload2-1-sg-x-0, workload2-0-pc-a, workload2-1-pc-a, workload2-0-sg-x, workload2-1-sg-x] |
| 01:22:52.958Z | PodClique workload2-1-sg-x-0-pc-c (PCSG-owned): "Triggering delete of all pods" |
| 01:22:53.033Z | PodClique workload2-1-sg-x-0-pc-c: "Successfully deleted all pods" |
| 01:22:53.034Z | PodClique workload2-1-sg-x-0-pc-c: "No resources are awaiting cleanup" (BUG!) |
| 01:22:53.034Z | PodClique workload2-1-sg-x-0-pc-c: Removing finalizer |
| 01:22:53.681Z | PodClique workload2-1-sg-x-0-pc-c deleted |
| 01:23:53.671Z | Cleanup check: Remaining PodCliqueSets: [default/workload2] |
| 01:23:53.682Z | Cleanup check: Remaining PodCliques: [default/workload2-0-pc-a, default/workload2-1-pc-a] |
| 01:23:56.711Z | Cleanup check: Remaining pod: workload2-1-sg-x-0-pc-c-9dhzl (Running) |
| 01:23:56.711Z | CLEANUP FAILURE - 60s timeout reached |
| 01:24:02.791Z | Diagnostic: Only 1 pod remains in namespace (the orphaned one) |
Root Cause Analysis
The Bug Location
File: operator/internal/controller/podclique/components/pod/pod.go
// getSelectorLabelsForPods creates label selector map for identifying pods belonging to a PodClique
func getSelectorLabelsForPods(pclqObjectMeta metav1.ObjectMeta) map[string]string {
	pcsName := k8sutils.GetFirstOwnerName(pclqObjectMeta) // BUG: Returns PCSG name for PCSG-owned PodCliques!
	return lo.Assign(
		apicommon.GetDefaultLabelsForPodCliqueSetManagedResources(pcsName),
		map[string]string{
			apicommon.LabelPodClique: pclqObjectMeta.Name,
		},
	)
}
Why This Is Wrong
For a PodClique like workload2-1-sg-x-0-pc-c that is owned by a PodCliqueScalingGroup:
- GetFirstOwnerName(pclqObjectMeta) returns the first owner reference name
  - For PCSG-owned PodCliques, this returns the PCSG name (e.g., workload2-1-sg-x)
  - NOT the PCS name (e.g., workload2)
- The selector becomes:
  {
    "app.kubernetes.io/managed-by": "grove-operator",
    "app.kubernetes.io/part-of": "workload2-1-sg-x", // WRONG - should be "workload2"
    "grove.io/podclique": "workload2-1-sg-x-0-pc-c"
  }
- But pods are created with:
  {
    "app.kubernetes.io/managed-by": "grove-operator",
    "app.kubernetes.io/part-of": "workload2", // Correct PCS name
    "grove.io/podclique": "workload2-1-sg-x-0-pc-c"
  }
- Result: Label selector mismatch - DeleteAllOf finds 0 matching pods!
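The mismatch above can be reproduced without a cluster. The following is a minimal sketch, assuming plain equality-based selector semantics; `selectorMatches`, `selectorFor`, and `podLabels` are hypothetical helpers written for this illustration and are not part of the operator code. The label values are copied from the selectors shown above.

```go
package main

import "fmt"

// selectorMatches reports whether every key/value pair in selector is also
// present in labels (equality-based label selector semantics).
func selectorMatches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// podLabels returns the labels the pods are created with, as shown in the issue.
func podLabels() map[string]string {
	return map[string]string{
		"app.kubernetes.io/managed-by": "grove-operator",
		"app.kubernetes.io/part-of":    "workload2",
		"grove.io/podclique":           "workload2-1-sg-x-0-pc-c",
	}
}

// selectorFor builds the pod selector for the given part-of value.
// Hypothetical helper: it only mirrors the shape of the selector in the issue.
func selectorFor(partOf string) map[string]string {
	return map[string]string{
		"app.kubernetes.io/managed-by": "grove-operator",
		"app.kubernetes.io/part-of":    partOf,
		"grove.io/podclique":           "workload2-1-sg-x-0-pc-c",
	}
}

func main() {
	// Selector built from the first owner reference (the PCSG name).
	fmt.Println("PCSG-name selector matches pod:", selectorMatches(selectorFor("workload2-1-sg-x"), podLabels()))
	// Selector built from the PCS name.
	fmt.Println("PCS-name selector matches pod: ", selectorMatches(selectorFor("workload2"), podLabels()))
}
```

With the PCSG name in the part-of label the selector does not match the pod; with the PCS name it does.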
Evidence from Logs
The delete flow logs show the problem clearly:
01:22:52.958Z "Triggering delete of all pods for the PodClique" [workload2-1-sg-x-0-pc-c]
01:22:53.033Z "Successfully deleted all pods for the PodClique" [workload2-1-sg-x-0-pc-c]
01:22:53.034Z "No resources are awaiting cleanup" [workload2-1-sg-x-0-pc-c]
- Time between "Triggering delete" and "Successfully deleted": 75ms (impossibly fast for actual pod termination)
- "No resources awaiting cleanup" logged immediately after
This happens because:
- DeleteAllOf returns success even when 0 pods match the selector
- GetExistingResourceNames() uses the same wrong selector AND filters by IsControlledBy()
  - Since no pods match the wrong selector, the list is empty
  - Controller concludes cleanup is complete, removes finalizer
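The sequence above can be modelled in memory. This is a simplified sketch under stated assumptions, not the controller's implementation: `deleteAllOf`, `listNames`, and `simulateCleanup` are hypothetical stand-ins, deletion is modelled as marking a pod terminating, and the IsControlledBy() filter is left out.

```go
package main

import "fmt"

// pod is a simplified stand-in for a Kubernetes Pod.
type pod struct {
	name        string
	labels      map[string]string
	terminating bool
}

// matches implements equality-based label selection.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// deleteAllOf marks every matching pod as terminating and returns how many
// were marked. As described above, zero matches is not treated as an error.
func deleteAllOf(pods []pod, selector map[string]string) (int, error) {
	marked := 0
	for i := range pods {
		if matches(selector, pods[i].labels) {
			pods[i].terminating = true
			marked++
		}
	}
	return marked, nil
}

// listNames returns the names of pods that still exist and match the selector.
func listNames(pods []pod, selector map[string]string) []string {
	var names []string
	for _, p := range pods {
		if matches(selector, p.labels) {
			names = append(names, p.name)
		}
	}
	return names
}

// simulateCleanup runs delete-then-check with the given part-of value and
// reports whether the finalizer would be removed and which pods keep running.
func simulateCleanup(partOf string) (finalizerRemoved bool, stillRunning []string) {
	pods := []pod{{
		name: "workload2-1-sg-x-0-pc-c-9dhzl",
		labels: map[string]string{
			"app.kubernetes.io/part-of": "workload2",
			"grove.io/podclique":        "workload2-1-sg-x-0-pc-c",
		},
	}}
	selector := map[string]string{
		"app.kubernetes.io/part-of": partOf,
		"grove.io/podclique":        "workload2-1-sg-x-0-pc-c",
	}
	if _, err := deleteAllOf(pods, selector); err != nil {
		return false, nil
	}
	// An empty list is read as "No resources are awaiting cleanup".
	finalizerRemoved = len(listNames(pods, selector)) == 0
	for _, p := range pods {
		if !p.terminating {
			stillRunning = append(stillRunning, p.name)
		}
	}
	return finalizerRemoved, stillRunning
}

func main() {
	removed, running := simulateCleanup("workload2-1-sg-x")
	fmt.Println("PCSG-name selector: finalizer removed =", removed, "still running =", running)
	removed, running = simulateCleanup("workload2")
	fmt.Println("PCS-name selector:  finalizer removed =", removed, "still running =", running)
}
```

In this model the wrong selector leads to the finalizer being removed while the pod keeps running, whereas the correct selector leaves the terminating pod in the list so the controller would requeue.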
Contrast with PCS-owned PodCliques
For PCS-owned PodCliques like workload2-0-pc-a:
- GetFirstOwnerName() returns workload2 (correct PCS name)
- Label selector is correct
- Pods ARE found and deleted
- Controller correctly waits for pod termination
Evidence:
01:22:52.378Z "Triggering delete of all pods for the PodClique" [workload2-1-pc-a]
01:22:52.506Z "Successfully deleted all pods for the PodClique" [workload2-1-pc-a]
01:22:52.506Z "Resources are still awaiting cleanup" resources=["workload2-1-pc-a-c2sbc","workload2-1-pc-a-5rr4w"]
These PodCliques correctly detect their pods still exist and requeue.
No "Killing" Events for Orphaned Pod
The Kubernetes events show "Killing" events for many pods during cleanup, including other pods of PCSG-owned PodCliques, but none for the orphaned pod:
01:22:54Z Normal Killing Pod/workload2-1-sg-x-1-pc-c-t9vsj Stopping container pc-c
01:22:55Z Normal Killing Pod/workload2-1-sg-x-0-pc-c-s4kpd Stopping container pc-c
But workload2-1-sg-x-0-pc-c-9dhzl never receives a Killing event - it was never sent a delete request because the label selector didn't match it.
Impact
- Orphaned pods remain running after their owning PodClique is deleted
- Cleanup timeout because pods never terminate
- Cascade deletion blocked - parent resources (PodCliqueSet, PodCliqueScalingGroup) can't complete deletion
- E2E test failures due to cleanup timeout
Proposed Fix
Change getSelectorLabelsForPods to get the PCS name from labels instead of owner references:
func getSelectorLabelsForPods(pclqObjectMeta metav1.ObjectMeta) map[string]string {
	// Get PCS name from labels, not owner reference
	// Owner might be PCSG (not PCS) for scaling group PodCliques
	pcsName := pclqObjectMeta.Labels[apicommon.LabelPartOfKey]
	return lo.Assign(
		apicommon.GetDefaultLabelsForPodCliqueSetManagedResources(pcsName),
		map[string]string{
			apicommon.LabelPodClique: pclqObjectMeta.Name,
		},
	)
}
This is consistent with how GetPodCliqueSetName works elsewhere in the codebase:
// From operator/internal/controller/common/component/utils/podcliqueset.go
func GetPodCliqueSetName(objectMeta metav1.ObjectMeta) string {
	pcsName := objectMeta.GetLabels()[common.LabelPartOfKey]
	return pcsName
}
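To compare the owner-based and label-based lookups side by side, here is a self-contained sketch. It uses simplified stand-ins (`objectMeta`, `ownerRef`) instead of the real metav1 types, and `selectorFromOwner` / `selectorFromLabels` are hypothetical names for the current and proposed behavior. It also assumes that the PodClique itself carries the part-of label set to the PCS name, which is what the proposed fix relies on.

```go
package main

import "fmt"

// Simplified stand-ins for metav1.OwnerReference and metav1.ObjectMeta.
type ownerRef struct{ Name string }

type objectMeta struct {
	Name            string
	Labels          map[string]string
	OwnerReferences []ownerRef
}

const (
	labelManagedBy = "app.kubernetes.io/managed-by"
	labelPartOf    = "app.kubernetes.io/part-of"
	labelPodClique = "grove.io/podclique"
)

// buildSelector assembles the pod selector for a PodClique and PCS name.
func buildSelector(pcsName, pclqName string) map[string]string {
	return map[string]string{
		labelManagedBy: "grove-operator",
		labelPartOf:    pcsName,
		labelPodClique: pclqName,
	}
}

// selectorFromOwner mirrors the current behavior: the PCS name is taken from
// the first owner reference, which is the PCSG for PCSG-owned PodCliques.
func selectorFromOwner(m objectMeta) map[string]string {
	pcsName := ""
	if len(m.OwnerReferences) > 0 {
		pcsName = m.OwnerReferences[0].Name
	}
	return buildSelector(pcsName, m.Name)
}

// selectorFromLabels mirrors the proposed fix: the PCS name is read from the
// part-of label on the PodClique.
func selectorFromLabels(m objectMeta) map[string]string {
	return buildSelector(m.Labels[labelPartOf], m.Name)
}

// pcsgOwnedPodClique models workload2-1-sg-x-0-pc-c, owned by a PCSG.
func pcsgOwnedPodClique() objectMeta {
	return objectMeta{
		Name:            "workload2-1-sg-x-0-pc-c",
		Labels:          map[string]string{labelPartOf: "workload2"},
		OwnerReferences: []ownerRef{{Name: "workload2-1-sg-x"}},
	}
}

// pcsOwnedPodClique models workload2-0-pc-a, owned directly by the PCS.
func pcsOwnedPodClique() objectMeta {
	return objectMeta{
		Name:            "workload2-0-pc-a",
		Labels:          map[string]string{labelPartOf: "workload2"},
		OwnerReferences: []ownerRef{{Name: "workload2"}},
	}
}

func main() {
	for _, m := range []objectMeta{pcsgOwnedPodClique(), pcsOwnedPodClique()} {
		fmt.Printf("%s: owner-based part-of=%q, label-based part-of=%q\n",
			m.Name, selectorFromOwner(m)[labelPartOf], selectorFromLabels(m)[labelPartOf])
	}
}
```

Under these assumptions both lookups agree for the PCS-owned PodClique, and only the label-based lookup yields the PCS name for the PCSG-owned one.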
Files to Modify
operator/internal/controller/podclique/components/pod/pod.go - Fix getSelectorLabelsForPods()
Testing
After the fix:
- PCSG-owned PodClique deletion should correctly delete owned pods
- No orphaned pods after PodClique deletion
- Cleanup should complete within timeout
- Test_GS10_GangSchedulingWithPCSScalingMinReplicasAdvanced should pass cleanup