PCSG-owned PodCliques are being orphaned during deletion #320

@gflarity

Summary

E2E test cleanup fails intermittently with a 60-second timeout because pods belonging to PCSG-owned PodCliques are being orphaned. The root cause is a bug in the label selector logic used during PodClique deletion.

cleanup-failure.diag.txt

Error Message

setup.go:184: Failed to cleanup workloads: failed to delete all resources and pods: condition not met within timeout of 1m0s

Affected Test

Test_GS10_GangSchedulingWithPCSScalingMinReplicasAdvanced

Timeline of Events

Timestamp      Event
01:22:52.306Z  Test completed successfully
01:22:52.378Z  PodClique workload2-1-pc-a deletion triggered
01:22:52.527Z  PodCliqueSet workload2 waiting for children: [workload2-1-sg-x-0, workload2-0-pc-a, workload2-1-pc-a, workload2-0-sg-x, workload2-1-sg-x]
01:22:52.958Z  PodClique workload2-1-sg-x-0-pc-c (PCSG-owned): "Triggering delete of all pods"
01:22:53.033Z  PodClique workload2-1-sg-x-0-pc-c: "Successfully deleted all pods"
01:22:53.034Z  PodClique workload2-1-sg-x-0-pc-c: "No resources are awaiting cleanup" (BUG!)
01:22:53.034Z  PodClique workload2-1-sg-x-0-pc-c: Removing finalizer
01:22:53.681Z  PodClique workload2-1-sg-x-0-pc-c deleted
01:23:53.671Z  Cleanup check: Remaining PodCliqueSets: [default/workload2]
01:23:53.682Z  Cleanup check: Remaining PodCliques: [default/workload2-0-pc-a, default/workload2-1-pc-a]
01:23:56.711Z  Cleanup check: Remaining pod: workload2-1-sg-x-0-pc-c-9dhzl (Running)
01:23:56.711Z  CLEANUP FAILURE - 60s timeout reached
01:24:02.791Z  Diagnostic: Only 1 pod remains in namespace (the orphaned one)

Root Cause Analysis

The Bug Location

File: operator/internal/controller/podclique/components/pod/pod.go

// getSelectorLabelsForPods creates label selector map for identifying pods belonging to a PodClique
func getSelectorLabelsForPods(pclqObjectMeta metav1.ObjectMeta) map[string]string {
    pcsName := k8sutils.GetFirstOwnerName(pclqObjectMeta)  // BUG: Returns PCSG name for PCSG-owned PodCliques!
    return lo.Assign(
        apicommon.GetDefaultLabelsForPodCliqueSetManagedResources(pcsName),
        map[string]string{
            apicommon.LabelPodClique: pclqObjectMeta.Name,
        },
    )
}

Why This Is Wrong

For a PodClique like workload2-1-sg-x-0-pc-c that is owned by a PodCliqueScalingGroup:

  1. GetFirstOwnerName(pclqObjectMeta) returns the first owner reference name

    • For PCSG-owned PodCliques, this returns the PCSG name (e.g., workload2-1-sg-x)
    • NOT the PCS name (e.g., workload2)
  2. The selector becomes:

    {
      "app.kubernetes.io/managed-by": "grove-operator",
      "app.kubernetes.io/part-of": "workload2-1-sg-x",  // WRONG - should be "workload2"
      "grove.io/podclique": "workload2-1-sg-x-0-pc-c"
    }
    
  3. But pods are created with:

    {
      "app.kubernetes.io/managed-by": "grove-operator", 
      "app.kubernetes.io/part-of": "workload2",  // Correct PCS name
      "grove.io/podclique": "workload2-1-sg-x-0-pc-c"
    }
    
  4. Result: Label selector mismatch - DeleteAllOf finds 0 matching pods!
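
The mismatch in steps 2 and 3 can be reproduced without a cluster. The following is a minimal sketch using plain maps; selectorMatches and buildSelector are hypothetical helpers written for this illustration, not functions from the operator or from client-go. It only assumes that selection is equality-based on every key in the selector.

```go
package main

import "fmt"

// selectorMatches reports whether every key/value pair in selector is present
// in labels. It is a local stand-in for equality-based label selection.
func selectorMatches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// buildSelector is a hypothetical stand-in for getSelectorLabelsForPods:
// it returns the three labels shown in the issue for a given part-of value.
func buildSelector(partOf, pclqName string) map[string]string {
	return map[string]string{
		"app.kubernetes.io/managed-by": "grove-operator",
		"app.kubernetes.io/part-of":    partOf,
		"grove.io/podclique":           pclqName,
	}
}

func main() {
	const pclq = "workload2-1-sg-x-0-pc-c"

	// Labels the pod was created with (part-of is the PCS name).
	podLabels := buildSelector("workload2", pclq)

	// Selector derived from the first owner reference (the PCSG name).
	fromOwner := buildSelector("workload2-1-sg-x", pclq)
	// Selector derived from the PCS name.
	fromPCS := buildSelector("workload2", pclq)

	fmt.Println("owner-derived selector matches pod:", selectorMatches(fromOwner, podLabels))
	fmt.Println("PCS-derived selector matches pod:  ", selectorMatches(fromPCS, podLabels))
}
```

With the owner-derived selector the pod is not matched; with the PCS-derived selector it is.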

Evidence from Logs

The delete flow logs show the problem clearly:

01:22:52.958Z  "Triggering delete of all pods for the PodClique" [workload2-1-sg-x-0-pc-c]
01:22:53.033Z  "Successfully deleted all pods for the PodClique" [workload2-1-sg-x-0-pc-c]
01:22:53.034Z  "No resources are awaiting cleanup" [workload2-1-sg-x-0-pc-c]
  • Time between "Triggering delete" and "Successfully deleted": 75ms, far too short for a pod to actually terminate
  • "No resources are awaiting cleanup" logged 1ms later, although pod workload2-1-sg-x-0-pc-c-9dhzl was still Running

This happens because:

  1. DeleteAllOf returns success even when 0 pods match the selector
  2. GetExistingResourceNames() uses the same wrong selector AND filters by IsControlledBy()
  3. Since no pods match the wrong selector, the list is empty
  4. Controller concludes cleanup is complete, removes finalizer
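
The four steps above can be sketched as an in-memory simulation. This is an illustration only: the pod type, deleteAllOf, and existingNames below are hypothetical stand-ins that mimic the behaviour described in the issue (a bulk delete that succeeds on zero matches, followed by a list using the same selector). They are not the controller-runtime or operator implementations.

```go
package main

import "fmt"

// pod is a minimal stand-in for a Kubernetes Pod: a name plus labels.
type pod struct {
	name   string
	labels map[string]string
}

// matches reports whether labels contain every key/value pair in selector.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// deleteAllOf removes every pod matching selector. As described in the issue,
// matching zero pods is not an error.
func deleteAllOf(pods []pod, selector map[string]string) (remaining []pod, deleted int, err error) {
	for _, p := range pods {
		if matches(selector, p.labels) {
			deleted++
			continue
		}
		remaining = append(remaining, p)
	}
	return remaining, deleted, nil
}

// existingNames lists pods that still match selector (the "awaiting cleanup" check).
func existingNames(pods []pod, selector map[string]string) []string {
	var names []string
	for _, p := range pods {
		if matches(selector, p.labels) {
			names = append(names, p.name)
		}
	}
	return names
}

func main() {
	pods := []pod{{
		name: "workload2-1-sg-x-0-pc-c-9dhzl",
		labels: map[string]string{
			"app.kubernetes.io/part-of": "workload2",
			"grove.io/podclique":        "workload2-1-sg-x-0-pc-c",
		},
	}}
	// Selector built from the PCSG name instead of the PCS name.
	wrong := map[string]string{
		"app.kubernetes.io/part-of": "workload2-1-sg-x",
		"grove.io/podclique":        "workload2-1-sg-x-0-pc-c",
	}

	remaining, deleted, err := deleteAllOf(pods, wrong)
	awaiting := existingNames(remaining, wrong)

	fmt.Println("deleted:", deleted, "err:", err)      // deleted: 0 err: <nil>
	fmt.Println("awaiting cleanup:", len(awaiting))    // awaiting cleanup: 0
	fmt.Println("pods still present:", len(remaining)) // pods still present: 1
}
```

Because the delete and the follow-up list share the same selector, neither can notice the pod that is still present, so nothing stops the finalizer from being removed.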

Contrast with PCS-owned PodCliques

For PCS-owned PodCliques like workload2-0-pc-a:

  • GetFirstOwnerName() returns workload2 (correct PCS name)
  • Label selector is correct
  • Pods ARE found and deleted
  • Controller correctly waits for pod termination

Evidence:

01:22:52.378Z  "Triggering delete of all pods for the PodClique" [workload2-1-pc-a]
01:22:52.506Z  "Successfully deleted all pods for the PodClique" [workload2-1-pc-a]  
01:22:52.506Z  "Resources are still awaiting cleanup" resources=["workload2-1-pc-a-c2sbc","workload2-1-pc-a-5rr4w"]

These PodCliques correctly detect their pods still exist and requeue.

No "Killing" Events for Orphaned Pod

The Kubernetes events show "Killing" events for many pods during cleanup, but none for the orphaned pod:

01:22:54Z  Normal  Killing  Pod/workload2-1-sg-x-1-pc-c-t9vsj   Stopping container pc-c
01:22:55Z  Normal  Killing  Pod/workload2-1-sg-x-0-pc-c-s4kpd   Stopping container pc-c

But workload2-1-sg-x-0-pc-c-9dhzl never receives a Killing event - it was never sent a delete request because the label selector didn't match it.

Impact

  1. Orphaned pods remain running after their owning PodClique is deleted
  2. Cleanup timeout because pods never terminate
  3. Cascade deletion blocked - parent resources (PodCliqueSet, PodCliqueScalingGroup) can't complete deletion
  4. E2E test failures due to cleanup timeout

Proposed Fix

Change getSelectorLabelsForPods to get the PCS name from labels instead of owner references:

func getSelectorLabelsForPods(pclqObjectMeta metav1.ObjectMeta) map[string]string {
    // Get PCS name from labels, not owner reference
    // Owner might be PCSG (not PCS) for scaling group PodCliques
    pcsName := pclqObjectMeta.Labels[apicommon.LabelPartOfKey]
    return lo.Assign(
        apicommon.GetDefaultLabelsForPodCliqueSetManagedResources(pcsName),
        map[string]string{
            apicommon.LabelPodClique: pclqObjectMeta.Name,
        },
    )
}

This is consistent with how GetPodCliqueSetName works elsewhere in the codebase:

// From operator/internal/controller/common/component/utils/podcliqueset.go
func GetPodCliqueSetName(objectMeta metav1.ObjectMeta) string {
    pcsName := objectMeta.GetLabels()[common.LabelPartOfKey]
    return pcsName
}

Files to Modify

  • operator/internal/controller/podclique/components/pod/pod.go - Fix getSelectorLabelsForPods()

Testing

After the fix:

  1. PCSG-owned PodClique deletion should correctly delete owned pods
  2. No orphaned pods after PodClique deletion
  3. Cleanup should complete within timeout
  4. Test_GS10_GangSchedulingWithPCSScalingMinReplicasAdvanced should pass cleanup
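
As a starting point for a unit-level check of item 1, the sketch below compares the owner-derived and label-derived selectors for both ownership cases. Everything here is an assumption made for illustration: objectMeta is a local stand-in for metav1.ObjectMeta, selectorFromOwner and selectorFromLabels are hypothetical names mirroring the current and proposed getSelectorLabelsForPods, and the PodClique is assumed to carry the app.kubernetes.io/part-of label itself, as the proposed fix requires.

```go
package main

import "fmt"

const (
	labelManagedBy = "app.kubernetes.io/managed-by"
	labelPartOf    = "app.kubernetes.io/part-of"
	labelPodClique = "grove.io/podclique"
)

// objectMeta is a minimal local stand-in for metav1.ObjectMeta.
type objectMeta struct {
	Name       string
	Labels     map[string]string
	OwnerNames []string // names from owner references, in order
}

// selectorFromOwner mirrors the current behaviour: part-of comes from the
// first owner reference, which is the PCSG for PCSG-owned PodCliques.
func selectorFromOwner(m objectMeta) map[string]string {
	partOf := ""
	if len(m.OwnerNames) > 0 {
		partOf = m.OwnerNames[0]
	}
	return map[string]string{
		labelManagedBy: "grove-operator",
		labelPartOf:    partOf,
		labelPodClique: m.Name,
	}
}

// selectorFromLabels mirrors the proposed fix: part-of comes from the
// PodClique's own label, regardless of which resource owns it.
func selectorFromLabels(m objectMeta) map[string]string {
	return map[string]string{
		labelManagedBy: "grove-operator",
		labelPartOf:    m.Labels[labelPartOf],
		labelPodClique: m.Name,
	}
}

func main() {
	cases := []objectMeta{
		{ // PCS-owned PodClique
			Name:       "workload2-0-pc-a",
			Labels:     map[string]string{labelPartOf: "workload2"},
			OwnerNames: []string{"workload2"},
		},
		{ // PCSG-owned PodClique
			Name:       "workload2-1-sg-x-0-pc-c",
			Labels:     map[string]string{labelPartOf: "workload2"},
			OwnerNames: []string{"workload2-1-sg-x"},
		},
	}
	for _, c := range cases {
		fmt.Printf("%s: owner-derived part-of=%q, label-derived part-of=%q\n",
			c.Name, selectorFromOwner(c)[labelPartOf], selectorFromLabels(c)[labelPartOf])
	}
}
```

Only the PCSG-owned case differs between the two functions, which is why the bug does not show up for PCS-owned PodCliques. A real test would also want a case where the label is missing, since the label lookup then yields an empty part-of value.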

Labels: kind/bug