Pods of PCSG-owned PodCliques are orphaned during deletion

## Summary

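E2E test cleanup fails intermittently with a 60-second timeout because pods belonging to PodCliques owned by a PodCliqueScalingGroup (PCSG) are orphaned. The root cause is a bug in the label selector logic used during PodClique deletion: the selector is built from the PodClique's first owner reference, which for these PodCliques is the PCSG rather than the PodCliqueSet (PCS).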

[cleanup-failure.diag.txt](https://github.com/user-attachments/files/24621971/cleanup-failure.diag.txt)

## Error Message

```
setup.go:184: Failed to cleanup workloads: failed to delete all resources and pods: condition not met within timeout of 1m0s
```

## Affected Test

`Test_GS10_GangSchedulingWithPCSScalingMinReplicasAdvanced`

## Timeline of Events

| Timestamp | Event |
|-----------|-------|
| `01:22:52.306Z` | Test completed successfully |
| `01:22:52.378Z` | PodClique `workload2-1-pc-a` deletion triggered |
| `01:22:52.527Z` | PodCliqueSet `workload2` waiting for children: `[workload2-1-sg-x-0, workload2-0-pc-a, workload2-1-pc-a, workload2-0-sg-x, workload2-1-sg-x]` |
| `01:22:52.958Z` | PodClique `workload2-1-sg-x-0-pc-c` (PCSG-owned): "Triggering delete of all pods" |
| `01:22:53.033Z` | PodClique `workload2-1-sg-x-0-pc-c`: "Successfully deleted all pods" |
| `01:22:53.034Z` | PodClique `workload2-1-sg-x-0-pc-c`: "No resources are awaiting cleanup" **(BUG!)** |
| `01:22:53.034Z` | PodClique `workload2-1-sg-x-0-pc-c`: Removing finalizer |
| `01:22:53.681Z` | PodClique `workload2-1-sg-x-0-pc-c` deleted |
| `01:23:53.671Z` | Cleanup check: Remaining PodCliqueSets: `[default/workload2]` |
| `01:23:53.682Z` | Cleanup check: Remaining PodCliques: `[default/workload2-0-pc-a, default/workload2-1-pc-a]` |
| `01:23:56.711Z` | Cleanup check: Remaining pod: `workload2-1-sg-x-0-pc-c-9dhzl` (Running) |
| `01:23:56.711Z` | **CLEANUP FAILURE** - 60s timeout reached |
| `01:24:02.791Z` | Diagnostic: Only 1 pod remains in namespace (the orphaned one) |

## Root Cause Analysis

### The Bug Location

File: `operator/internal/controller/podclique/components/pod/pod.go`

```go
// getSelectorLabelsForPods creates label selector map for identifying pods belonging to a PodClique
func getSelectorLabelsForPods(pclqObjectMeta metav1.ObjectMeta) map[string]string {
    pcsName := k8sutils.GetFirstOwnerName(pclqObjectMeta)  // BUG: Returns PCSG name for PCSG-owned PodCliques!
    return lo.Assign(
        apicommon.GetDefaultLabelsForPodCliqueSetManagedResources(pcsName),
        map[string]string{
            apicommon.LabelPodClique: pclqObjectMeta.Name,
        },
    )
}
```

### Why This Is Wrong

For a PodClique like `workload2-1-sg-x-0-pc-c` that is owned by a PodCliqueScalingGroup:

1. **`GetFirstOwnerName(pclqObjectMeta)`** returns the first owner reference name
   - For PCSG-owned PodCliques, this returns the **PCSG name** (e.g., `workload2-1-sg-x`)
   - NOT the PCS name (e.g., `workload2`)

2. **The selector becomes:**
   ```
   {
     "app.kubernetes.io/managed-by": "grove-operator",
     "app.kubernetes.io/part-of": "workload2-1-sg-x",  // WRONG - should be "workload2"
     "grove.io/podclique": "workload2-1-sg-x-0-pc-c"
   }
   ```

3. **But pods are created with:**
   ```
   {
     "app.kubernetes.io/managed-by": "grove-operator", 
     "app.kubernetes.io/part-of": "workload2",  // Correct PCS name
     "grove.io/podclique": "workload2-1-sg-x-0-pc-c"
   }
   ```

4. **Result:** Label selector mismatch - `DeleteAllOf` finds 0 matching pods!
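The mismatch in steps 2 to 4 can be reproduced without a cluster. The following is a minimal sketch using plain maps; `matches` and `selectorFor` are hypothetical helpers, and `matches` is a simplified stand-in for equality-based label selection, not the operator's or the API server's actual code.

```go
package main

import "fmt"

// matches reports whether every key/value pair in selector is present in
// podLabels. Simplified stand-in for equality-based label selection.
func matches(selector, podLabels map[string]string) bool {
	for k, v := range selector {
		if got, ok := podLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// selectorFor builds a selector with the three labels shown in this issue.
// partOf is whatever name the caller believes is the PCS name.
func selectorFor(partOf, pclqName string) map[string]string {
	return map[string]string{
		"app.kubernetes.io/managed-by": "grove-operator",
		"app.kubernetes.io/part-of":    partOf,
		"grove.io/podclique":           pclqName,
	}
}

func main() {
	// Labels the pod is actually created with (from step 3).
	podLabels := map[string]string{
		"app.kubernetes.io/managed-by": "grove-operator",
		"app.kubernetes.io/part-of":    "workload2",
		"grove.io/podclique":           "workload2-1-sg-x-0-pc-c",
	}

	buggy := selectorFor("workload2-1-sg-x", "workload2-1-sg-x-0-pc-c") // PCSG name
	fixed := selectorFor("workload2", "workload2-1-sg-x-0-pc-c")        // PCS name

	fmt.Println("buggy selector matches pod:", matches(buggy, podLabels)) // false
	fmt.Println("fixed selector matches pod:", matches(fixed, podLabels)) // true
}
```

Only the `part-of` value differs between the two selectors, which is enough for the pod to be excluded.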

### Evidence from Logs

The delete flow logs show the problem clearly:

```
01:22:52.958Z  "Triggering delete of all pods for the PodClique" [workload2-1-sg-x-0-pc-c]
01:22:53.033Z  "Successfully deleted all pods for the PodClique" [workload2-1-sg-x-0-pc-c]
01:22:53.034Z  "No resources are awaiting cleanup" [workload2-1-sg-x-0-pc-c]
```

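- Time between "Triggering delete" and "Successfully deleted": **75ms**. This is not conclusive on its own, since the PCS-owned case in the next section also reports success within 128ms while its pods are still terminating.
- The decisive signal is "No resources are awaiting cleanup" logged 1ms later, even though pod `workload2-1-sg-x-0-pc-c-9dhzl` is still Running more than a minute afterwards.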

This happens because:
1. `DeleteAllOf` returns success even when 0 pods match the selector
2. `GetExistingResourceNames()` uses the same wrong selector AND filters by `IsControlledBy()` 
3. Since no pods match the wrong selector, the list is empty
4. Controller concludes cleanup is complete, removes finalizer
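The four steps above can be sketched against an in-memory fake. Everything here (`fakeCluster`, `deleteAllOf`, `existingNames`, `cleanupLooksDone`) is hypothetical and only imitates the sequence described in this issue; it omits the `IsControlledBy()` filter, and unlike a real cluster the fake deletes matching pods instantly.

```go
package main

import "fmt"

// pod is a pared-down stand-in for a Kubernetes Pod: a name and labels.
type pod struct {
	name   string
	labels map[string]string
}

// fakeCluster is a hypothetical in-memory replacement for the API server.
type fakeCluster struct{ pods []pod }

// selects reports whether labels satisfy every key/value pair in selector.
func selects(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// deleteAllOf removes every pod matching selector. Matching zero pods is
// not an error, mirroring step 1.
func (c *fakeCluster) deleteAllOf(selector map[string]string) error {
	kept := c.pods[:0]
	for _, p := range c.pods {
		if !selects(selector, p.labels) {
			kept = append(kept, p)
		}
	}
	c.pods = kept
	return nil
}

// existingNames lists pods matching selector, like the post-delete check.
func (c *fakeCluster) existingNames(selector map[string]string) []string {
	var names []string
	for _, p := range c.pods {
		if selects(selector, p.labels) {
			names = append(names, p.name)
		}
	}
	return names
}

// cleanupLooksDone deletes, lists with the same selector, and treats an
// empty list as "no resources are awaiting cleanup".
func cleanupLooksDone(c *fakeCluster, selector map[string]string) bool {
	if err := c.deleteAllOf(selector); err != nil {
		return false
	}
	return len(c.existingNames(selector)) == 0
}

func main() {
	c := &fakeCluster{pods: []pod{{
		name: "workload2-1-sg-x-0-pc-c-9dhzl",
		labels: map[string]string{
			"app.kubernetes.io/part-of": "workload2",
			"grove.io/podclique":        "workload2-1-sg-x-0-pc-c",
		},
	}}}
	wrong := map[string]string{
		"app.kubernetes.io/part-of": "workload2-1-sg-x", // PCSG name, not PCS name
		"grove.io/podclique":        "workload2-1-sg-x-0-pc-c",
	}
	fmt.Println("cleanup looks done:", cleanupLooksDone(c, wrong)) // true
	fmt.Println("pods still present:", len(c.pods))                // 1
}
```

Because the delete and the follow-up list share one selector, a wrong selector makes both agree that nothing is left, and no error surfaces.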

### Contrast with PCS-owned PodCliques

For PCS-owned PodCliques like `workload2-0-pc-a`:
- `GetFirstOwnerName()` returns `workload2` (correct PCS name)
- Label selector is correct
- Pods ARE found and deleted
- Controller correctly waits for pod termination

Evidence:
```
01:22:52.378Z  "Triggering delete of all pods for the PodClique" [workload2-1-pc-a]
01:22:52.506Z  "Successfully deleted all pods for the PodClique" [workload2-1-pc-a]  
01:22:52.506Z  "Resources are still awaiting cleanup" resources=["workload2-1-pc-a-c2sbc","workload2-1-pc-a-5rr4w"]
```

These PodCliques correctly detect their pods still exist and requeue.

### No "Killing" Events for Orphaned Pod

The Kubernetes events show "Killing" events for many pods during cleanup, for example:

```
01:22:54Z  Normal  Killing  Pod/workload2-1-sg-x-1-pc-c-t9vsj   Stopping container pc-c
01:22:55Z  Normal  Killing  Pod/workload2-1-sg-x-0-pc-c-s4kpd   Stopping container pc-c
```

But `workload2-1-sg-x-0-pc-c-9dhzl` never receives a Killing event - it was never sent a delete request because the label selector didn't match it.

## Impact

1. **Orphaned pods** remain running after their owning PodClique is deleted
2. **Cleanup timeout** because pods never terminate
3. **Cascade deletion blocked** - parent resources (PodCliqueSet, PodCliqueScalingGroup) can't complete deletion
4. **E2E test failures** due to cleanup timeout

## Proposed Fix

Change `getSelectorLabelsForPods` to get the PCS name from labels instead of owner references:

```go
func getSelectorLabelsForPods(pclqObjectMeta metav1.ObjectMeta) map[string]string {
    // Get PCS name from labels, not owner reference
    // Owner might be PCSG (not PCS) for scaling group PodCliques
    pcsName := pclqObjectMeta.Labels[apicommon.LabelPartOfKey]
    return lo.Assign(
        apicommon.GetDefaultLabelsForPodCliqueSetManagedResources(pcsName),
        map[string]string{
            apicommon.LabelPodClique: pclqObjectMeta.Name,
        },
    )
}
```

This is consistent with how `GetPodCliqueSetName` works elsewhere in the codebase:

```go
// From operator/internal/controller/common/component/utils/podcliqueset.go
func GetPodCliqueSetName(objectMeta metav1.ObjectMeta) string {
    pcsName := objectMeta.GetLabels()[common.LabelPartOfKey]
    return pcsName
}
```
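To compare the two lookups side by side, here is a small sketch with pared-down stand-ins for the `metav1` types. It assumes that `LabelPartOfKey` resolves to `app.kubernetes.io/part-of` (inferred from the selector shown earlier) and that the PodClique itself carries that label with the PCS name, which the proposed fix relies on. `nameFromFirstOwner` and `nameFromPartOfLabel` are hypothetical names, not functions from the codebase.

```go
package main

import "fmt"

// Assumed value of the part-of label key, inferred from the selectors above.
const labelPartOfKey = "app.kubernetes.io/part-of"

// ownerReference and objectMeta are simplified stand-ins for the metav1 types.
type ownerReference struct{ Kind, Name string }

type objectMeta struct {
	Name            string
	Labels          map[string]string
	OwnerReferences []ownerReference
}

// nameFromFirstOwner imitates the current behaviour: use the first owner's name.
func nameFromFirstOwner(m objectMeta) string {
	if len(m.OwnerReferences) == 0 {
		return ""
	}
	return m.OwnerReferences[0].Name
}

// nameFromPartOfLabel imitates the proposed behaviour: read the part-of label.
func nameFromPartOfLabel(m objectMeta) string {
	return m.Labels[labelPartOfKey]
}

func main() {
	pclqs := []objectMeta{
		{
			Name:            "workload2-0-pc-a",
			Labels:          map[string]string{labelPartOfKey: "workload2"},
			OwnerReferences: []ownerReference{{Kind: "PodCliqueSet", Name: "workload2"}},
		},
		{
			Name:            "workload2-1-sg-x-0-pc-c",
			Labels:          map[string]string{labelPartOfKey: "workload2"},
			OwnerReferences: []ownerReference{{Kind: "PodCliqueScalingGroup", Name: "workload2-1-sg-x"}},
		},
	}
	for _, m := range pclqs {
		fmt.Printf("%s: owner=%q label=%q\n", m.Name, nameFromFirstOwner(m), nameFromPartOfLabel(m))
	}
}
```

Under these assumptions the two lookups agree for the PCS-owned PodClique and diverge only for the PCSG-owned one, which matches the intermittent, scaling-group-specific symptom.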

## Files to Modify

- `operator/internal/controller/podclique/components/pod/pod.go` - Fix `getSelectorLabelsForPods()`

## Testing

After the fix:
1. PCSG-owned PodClique deletion should correctly delete owned pods
2. No orphaned pods after PodClique deletion
3. Cleanup should complete within timeout
4. `Test_GS10_GangSchedulingWithPCSScalingMinReplicasAdvanced` should pass cleanup
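Point 2 could be checked with a helper along these lines. This is a sketch only: `orphanedPods` is a hypothetical function, not part of the existing E2E utilities, and it assumes every managed pod carries the `grove.io/podclique` label naming its PodClique, as the selectors in this issue suggest. The sample data combines names from the logs for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Label that names the PodClique a pod belongs to (as shown in the selectors above).
const labelPodClique = "grove.io/podclique"

// pod is a pared-down stand-in for a Kubernetes Pod.
type pod struct {
	Name   string
	Labels map[string]string
}

// orphanedPods returns the sorted names of pods whose PodClique label refers
// to a PodClique that is not in existingPodCliques. Pods without the label
// are ignored.
func orphanedPods(pods []pod, existingPodCliques []string) []string {
	existing := make(map[string]struct{}, len(existingPodCliques))
	for _, name := range existingPodCliques {
		existing[name] = struct{}{}
	}
	var orphans []string
	for _, p := range pods {
		pclq, ok := p.Labels[labelPodClique]
		if !ok {
			continue
		}
		if _, found := existing[pclq]; !found {
			orphans = append(orphans, p.Name)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	pods := []pod{
		{Name: "workload2-1-sg-x-0-pc-c-9dhzl", Labels: map[string]string{labelPodClique: "workload2-1-sg-x-0-pc-c"}},
		{Name: "workload2-1-pc-a-c2sbc", Labels: map[string]string{labelPodClique: "workload2-1-pc-a"}},
	}
	// PodCliques still present; workload2-1-sg-x-0-pc-c has already been deleted.
	remaining := []string{"workload2-0-pc-a", "workload2-1-pc-a"}

	fmt.Println(orphanedPods(pods, remaining)) // [workload2-1-sg-x-0-pc-c-9dhzl]
}
```

A non-empty result after a PodClique has been removed would indicate that its finalizer was released before its pods were deleted.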

